Genome Informatics 2008
GENOME INFORMATICS SERIES (GIS) ISSN: 0919-9454 The Genome Informatics Series publishes peer-reviewed papers presented at the International Conference on Genome Informatics (GIW) and some conferences on bioinformatics. The Genome Informatics Series is indexed in MEDLINE.
No.  Title                                    Year  ISBN
1    Genome Informatics Workshop I            1990  (in Japanese)
2    Genome Informatics Workshop II           1991  (in Japanese)
3    Genome Informatics Workshop III          1992  (in Japanese)
4    Genome Informatics Workshop IV           1993  4-946443-20-7
5    Genome Informatics Workshop 1994         1994  4-946443-24-X
6    Genome Informatics Workshop 1995         1995  4-946443-33-9
7    Genome Informatics 1996                  1996  4-946443-37-1
8    Genome Informatics 1997                  1997  4-946443-47-9
9    Genome Informatics 1998                  1998  4-946443-52-5
10   Genome Informatics 1999                  1999  4-946443-59-2
11   Genome Informatics 2000                  2000  4-946443-65-7
12   Genome Informatics 2001                  2001  4-946443-72-X
13   Genome Informatics 2002                  2002  4-946443-79-7
14   Genome Informatics 2003                  2003  4-946443-82-7
15   Genome Informatics 2004 Vol. 15, No. 1   2004  4-946443-88-6
16   Genome Informatics 2004 Vol. 15, No. 2   2004  4-946443-91-6
17   Genome Informatics 2005 Vol. 16, No. 1   2005  4-946443-93-2
18   Genome Informatics 2005 Vol. 16, No. 2   2005  4-946443-96-7
19   Genome Informatics 2006 Vol. 17, No. 1   2006  4-946443-97-5
20   Genome Informatics 2006 Vol. 17, No. 2   2006  4-946443-99-1
21   Genome Informatics 2007 Vol. 18          2007  978-1-86094-991-3
22   Genome Informatics 2007 Vol. 19          2007  978-1-86094-984-5
23   Genome Informatics 2008 Vol. 20          2008  978-1-84816-299-0
Genome Informatics Series Vol. 20
ISSN: 0919-9454
Genome Informatics 2008 Proceedings of the 8th Annual International Workshop on Bioinformatics and Systems Biology (IBSB 2008) Zeuthen Lake, Berlin, Germany
9-11 June 2008
Ernst-Walter Knapp Free University Berlin, Germany
Gary Benson Boston University, USA
Hermann-Georg Holzhütter Charité-University Medicine Berlin, Germany
Minoru Kanehisa Kyoto University, Japan
Satoru Miyano University of Tokyo, Japan
Imperial College Press
Published by Imperial College Press 57 Shelton Street Covent Garden London WC2H 9HE Distributed by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
GENOME INFORMATICS 2008 Proceedings of the 8th Annual International Workshop on Bioinformatics and Systems Biology (IBSB 2008) Copyright © 2008 by the Japanese Society for Bioinformatics (http://www.jsbi.org) All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the JSBi.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN-13 978-1-84816-299-0 ISBN-10 1-84816-299-5
Printed by Fulsland Offset Printing (S) Pte Ltd, Singapore
CONTENTS

Preface

Program Committee

Exploring the Effect of Variable Enzyme Concentrations in a Kinetic Model of Yeast Glycolysis
J. Bruck, W. Liebermeister & E. Klipp

The Role of IP3R Clustering in Ca2+ Signaling
A. Skupin & M. Falcke

Rule-Based Reasoning for System Dynamics in Cell Systems
E. Jeong, M. Nagasaki & S. Miyano

Estimation of Nonlinear Gene Regulatory Networks via L1 Regularized NVAR from Time Series Gene Expression Data
K. Kojima, A. Fujita, T. Shimamura, S. Imoto & S. Miyano

ModelMage: A Tool for Automatic Model Generation, Selection and Management
M. Flottmann, J. Schaber, S. Hoops, E. Klipp & P. Mendes

A Framework for Determining Outlying Microarray Experiments
R. Wan, A. M. Wheelock & H. Mamitsuka

Exploring the Impact of Osmoadaptation on Glycolysis Using Time-Varying Response-Coefficients
C. Kuhn, E. Petelenz, B. Nordlander, J. Schaber, S. Hohmann & E. Klipp

Comparing Flux Balance Analysis to Network Expansion: Producibility, Sustainability and the Scope of Compounds
K. Kruse & O. Ebenhöh

Semi-Supervised Graph Partitioning with Decision Trees
T. Hancock & H. Mamitsuka

Measuring Correlations in Metabolomic Networks with Mutual Information
J. Numata, O. Ebenhöh & E.-W. Knapp

Optimality Criteria for the Prediction of Metabolic Fluxes in Yeast Mutants
E. S. Snitkin & D. Segre

Biosynthetic Potentials from Species-Specific Metabolic Networks
G. Basler, Z. Nikoloski, O. Ebenhöh & T. Handorf

Generalized Reaction Patterns for Prediction of Unknown Enzymatic Reactions
Y. Shimizu, M. Hattori, S. Goto & M. Kanehisa

Optimal Metabolic Regulation Using a Constraint-Based Model
W. J. Riehl & D. Segre

Comparative Determination of Biomass Composition in Differentially Active Metabolic States
H.-C. Chiu & D. Segre

Suffix Techniques as a Rapid Method for RNA Substructure Search
R. A. Bauer, K. Rother, J. M. Bujnicki & R. Preissner

The Relationship between Fine Scale DNA Structure, GC Content, and Functional Elements in 1% of the Human Genome
S. C. J. Parker, E. H. Margulies & T. D. Tullius

A Novel Strategy to Search Conserved Transcription Factor Binding Sites Among Coexpressing Genes in Human
Y. Hatanaka, M. Nagasaki, R. Yamaguchi, T. Obayashi, K. Numata, A. Fujita, T. Shimamura, Y. Tamada, S. Imoto, K. Kinoshita, K. Nakai & S. Miyano

Modeling IL-2 Gene Expression in Human Regulatory T Cells
M. Benary, H. Bendfeldt, R. Baumgrass & H. Herzel

Toxicity versus Potency: Elucidation of Toxicity Properties Discriminating between Toxins, Drugs, and Natural Compounds
S. Struck, U. Schmidt, B. Gruening, I. S. Jaeger, J. Hossbach & R. Preissner

Comparative VEGF Receptor Tyrosine Kinase Modeling for the Development of Highly Specific Inhibitors of Tumor Angiogenesis
U. Schmidt, J. Ahmed, E. Michalsky, M. Hoepfner & R. Preissner

Network Analysis of Adverse Drug Interactions
M. Takarabe, S. Okuda, M. Itoh, T. Tokimatsu, S. Goto & M. Kanehisa

Sampling Geometries of Protein-Protein Complexes
A. Guerler, S. Lorenzen, F. Krull & E.-W. Knapp

Computer Aided Optimization of Carbon Atom Labeling for Tracer Experiments
B. S. Menküc, C. Gille & H.-G. Holzhütter

Web-Links as a Means to Document Annotated Sequence and 3D-Structure Alignments in Systems Biology
C. Gille, A. Hoppe & H.-G. Holzhütter

Author Index
PREFACE
Genome Informatics Vol. 20 contains a selection of peer-reviewed papers presented at the Eighth Annual International Workshop on Bioinformatics and Systems Biology, held on 9-11 June 2008. This time the workshop took place in the Teikyo Hotel at Zeuthen Lake near Berlin. It was jointly organized by the German members of the International Research Training Group (IRTG) 'Genomics and Systems Biology of Molecular Networks' and supported by the German Science Foundation (DFG). These workshops were created to give doctoral students and young researchers the opportunity to present and discuss their research work in Bioinformatics and Systems Biology within the framework of an international scientific meeting. The first workshop was held in Berlin in 2001. It was organized by Prof. Dr. Reinhart Heinrich, a co-founder of this series of workshops. Since 2001, the workshop has been held in Boston (2002), Berlin (2003), Kyoto (2004), Berlin (2005), Boston (2006) and Tokyo (2007). The present workshop was held in Zeuthen near Berlin as part of a collaborative educational program involving the following programs and partner institutions in the US, Japan and Germany:

• Boston - Graduate Program in Bioinformatics, Boston University
• Berlin - The International Research Training Group (IRTG) "Genomics and Systems Biology of Molecular Networks"
• Kyoto/Tokyo - Joint Bioinformatics Education Program of Kyoto University and University of Tokyo

Partner Institutions

• Boston University
• Charité Berlin
• Free University Berlin
• Humboldt University Berlin
• Kyoto University, Bioinformatics Center, Institute for Chemical Research
• Kyoto University, Department of Bioinformatics and Chemical Genomics, Graduate School of Pharmaceutical Sciences
• Max Delbrück Centre for Molecular Medicine, Berlin
• Max Planck Institute for Molecular Plant Physiology, Potsdam
• Max Planck Institute of Molecular Genetics, Berlin
• University of Tokyo, Human Genome Center, Institute of Medical Science
This time we decided to hold the workshop first and to collect and review the manuscripts three weeks later, so that the authors could take the discussions and criticisms at the workshop into account. However, there was also a pre-selection of the oral and poster contributions to be accepted at the workshop. The contributors were then allowed to submit manuscripts for the Genome Informatics volume. These contributions were reviewed by the members of the workshop event. We have selected 25 papers after revision. These papers will be indexed in Medline, and their electronic versions are freely available from the website of the Japanese Society for Bioinformatics as Genome Informatics Online (http://www.jsbi.org/modules/journal/index.php/index.html). Former publications are also electronically available as Genome Informatics Vol. 15, No. 1 (2004), Vol. 16, No. 1 (2005), Vol. 17, No. 1 (2006), and Vol. 18 (2007). We wish to thank all of those who submitted papers and helped with the reviewing process. We also wish to thank all those who helped in organizing this workshop for their efforts in local arrangement, especially the Local Committee Members: Martin Falcke, Alexander Skupin, Bianca Sprincenatu, Oliver Ebenhöh, Moritz Schütte, and Johannes Bausch.
Program Committee Chair
Ernst-Walter Knapp

Organizers
Gary Benson
Hermann-Georg Holzhütter
Minoru Kanehisa
Satoru Miyano
PROGRAM COMMITTEE

Ernst-Walter Knapp         Free University Berlin, PC Chair
Tatsuya Akutsu             Kyoto University
Gary Benson                Boston University
Oliver Ebenhöh             Humboldt University Berlin
Martin Falcke              Max-Delbrück-Center for Molecular Medicine
Hermann-Georg Holzhütter   Charité-University Medicine Berlin
Minoru Kanehisa            Kyoto University
Hiroshi Mamitsuka          Kyoto University
Satoru Miyano              University of Tokyo
Robert Preissner           Charité-University Medicine Berlin
Daniel Segre               Boston University
Brandon Xia                Boston University
EXPLORING THE EFFECT OF VARIABLE ENZYME CONCENTRATIONS IN A KINETIC MODEL OF YEAST GLYCOLYSIS

JOZSEF BRUCK 1,2
[email protected]
WOLFRAM LIEBERMEISTER 1
[email protected]
EDDA KLIPP 1
[email protected]

1 Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany
2 Humboldt University Berlin, Department of Biology, Chair of Theoretical Biophysics, Invalidenstr. 42, 10115 Berlin, Germany
Metabolism is one of the best studied fields of biochemistry, but its regulation involves processes on many different levels, some of which are still not understood well enough to allow for quantitative modeling and prediction. Glycolysis in yeast is a good example: although high-quality quantitative data are available, well-established mathematical models typically only cover direct regulation of the involved enzymes by metabolite binding. The effect of various metabolites on the enzyme kinetics is summarized in carefully developed mathematical formulae. However, this approach implicitly assumes that the enzyme concentrations themselves are constant, thus neglecting other regulatory levels - e.g. transcriptional and translational regulation - involved in the regulation of enzyme activities. It is believed, however, that different experimental conditions result in different enzyme activities regulated by the above mechanisms. Detailed modeling of all regulatory levels is still out of reach since some of the necessary data - e.g. quantitative large-scale enzyme concentration data sets - are lacking or rare. Nevertheless, a viable approach is to include the regulation of enzyme concentrations in an established model and to investigate whether this improves its predictive capabilities. Proteome data are usually hard to obtain, but levels of mRNA transcripts may be used instead as clues for changes in enzyme concentrations. Here we investigate whether including mRNA data in an established model of yeast glycolysis allows the steady-state metabolite concentrations to be predicted for different experimental conditions. To this end, we modified an established ODE model for the glycolytic pathway of yeast to include changes of enzyme concentrations. The presumed changes were inferred from mRNA transcript level measurements. We investigate how this approach can be used to predict metabolite concentrations for steady-state yeast cultures at five different oxygen levels ranging from anaerobic to fully aerobic conditions. We were partly able to reproduce the experimental data and present a number of changes that were necessary to improve the modeling result.

Keywords: yeast; glycolysis; fermentation; respiration; kinetic modeling; metabolic regulation
1. Introduction
Cellular metabolism is one of the key components of living systems. Its most basic functions are to generate the energy and the building blocks necessary to sustain the cell's life. Elucidation of central carbon metabolism, the source of energy for all heterotrophic life, is one of the success stories of biochemistry: the function and mechanism of most of its components are known in considerable detail. A large class of the regulatory mechanisms of metabolism is well understood: the catalytic function of many enzymes is influenced by metabolites present in the cell. Interactions of this kind have been successfully
quantified in enzyme kinetic laws, which has led to ODE-based models of metabolic pathways with considerable predictive power, as described in [4, 2, 7] and applied, among others, in [9, 5, 11].

However, metabolism is also regulated by other functional units of the cell, most importantly the transcriptional-regulatory system. It acts by changing the concentrations of various enzymes via regulated production and degradation. This kind of regulation is necessary for the cell to steer its metabolism to meet its needs under various conditions. However, changes in protein levels are usually not implemented in kinetic models: these typically adopt kinetic expressions for the included reactions with fixed maximal velocities, which amounts to the implicit assumption of constant enzyme concentrations. One of the possible reasons is that quantitative data on concentrations of single proteins in different experimental conditions are still lacking or rare. A fundamental determinant of the concentration of an enzyme's active form, and hence of its activity, is the amount of mRNA transcripts present in the cell. However, many other layers of regulation exist, e.g. at the level of translation and allosteric regulation of the final protein. It is controversial to what extent the final enzyme activity is determined by or correlated with the concentrations of its mRNA transcripts. While genome-wide comparisons between mRNA and enzyme concentrations exist [1, 3], the abundance of a given set of proteins and their corresponding transcription rates should be systematically compared in different cell states to obtain a clearer picture. To the authors' knowledge such studies are not yet available.

Based on an established ODE-based model of yeast glycolysis, we present an approach for modeling how metabolism is regulated by the transcriptional-regulatory system. In the model we include the change in enzyme concentrations in various experimental conditions.
We used experimental data [12] from steady-state yeast cultures with five different oxygen levels ranging from anaerobic to fully aerobic conditions. We implemented the change in enzyme concentrations by changing the maximal rates of the enzymatic reactions. For the above-mentioned reasons, we determined these changes from mRNA concentration measurements, using them as inputs for the model. The model allows for computing metabolite concentrations and fluxes, which we compared to the corresponding experimental values. We performed parameter estimation to determine a set of parameters that best fits the experimental data. The main question posed is the following: to what extent can experimental data for different cell states be explained by including expression data in the model under the assumption that biochemical reaction rates obey rate laws known from enzyme kinetics?
2. Methods
2.1. Experimental data
We used metabolite concentration and flux data from Wiebe et al. [12] obtained from cultures of Saccharomyces cerevisiae CEN.PK113-1A grown in glucose-limited chemostat cultures (dilution rate D = 0.10/h). External conditions in these cultures could be controlled to a high extent. Steady-state cultures were obtained under one anaerobic (0% oxygen) and four aerobic conditions (0.5%, 1%, 2.8%, 20.9% oxygen in the inlet gas) with all other external conditions being kept constant. Measured quantities included biomass, concentrations of external metabolites^a (Glucose, Ethanol, Glycerol) and of intermediate metabolites (G6P, F6P, F16P, PEP, PYR, ATP, ADP, AMP, and the sum of 3PG and 2PG concentrations), net fluxes (consumption rates of oxygen and glucose and exhaust rates of ethanol, glycerol and CO2) per unit of biomass, and relative fold changes of the mRNA concentrations compared to the anaerobic cultures for 69 genes with functions in carbon metabolism.

2.2. Mathematical model
We constructed a mathematical model of central carbon metabolism in S. cerevisiae based on the glycolytic pathway model by Teusink et al. [11]. The original model was based on measurements on steady-state cell cultures under anaerobic conditions by comparison of experimental data of concentrations and fluxes of intermediate and external metabolites. The sum of the concentrations [NAD+] and [NADH] is a conserved moiety of the model. The adenosine species [ATP], [ADP] and [AMP] are not dynamical variables of the original model; instead, they were written as analytic expressions in terms of the sum of high-energy phosphates. These were obtained under the assumptions that a) the sum of their concentrations is conserved, and b) the reaction catalyzed by adenosine kinase is fast in comparison to the other reactions, and hence in equilibrium. The metabolites GAP and DHAP are lumped into a single chemical species called "triose", reflecting the assumption that the reaction interconverting them (catalyzed by TPI) is also in equilibrium. The kinetic constants were largely obtained from experiments and fitted only to a minimal extent. The side branches of glycolysis contained in the model were
^a Abbreviations: G6P: Glucose-6-phosphate; F6P: Fructose-6-phosphate; F16P: Fructose-1,6-bisphosphate; Triose-P: sum of GAP: Glyceraldehyde-3-phosphate and DHAP: Dihydroxyacetone phosphate; BPG: 1,3-bisphosphoglycerate; 3PG and 2PG: 3- and 2-phosphoglycerate, respectively; PG: sum of 3PG and 2PG; PEP: Phosphoenolpyruvate; ACA: Acetaldehyde; AMP, ADP, ATP: Adenosine-mono-, di-, and triphosphate, respectively; NAD+, NADH: oxidation states of Nicotinamide adenine dinucleotide. Enzymes: ENO: Enolase; GAPDH: D-glyceraldehyde-3-phosphate dehydrogenase; ADH1, ADH2: Alcohol dehydrogenase 1 and 2, respectively; HK: Hexokinase; PGI: Phosphogluco isomerase; PFK: Phosphofructokinase; ALD: Aldolase; G3PDH: Glycerol-3-phosphate dehydrogenase; PGK: Phosphoglycerate kinase; PGM: Phosphoglycerate mutase; PYK: Pyruvate kinase; PDC: Pyruvate decarboxylase; FBP1: Fructose-1,6-bisphosphatase.
found to be crucial to reproduce the data. The glycerol-producing branch was simplified to the reaction catalyzed by the enzyme G3PDH. The products ethanol and CO2 were assumed to diffuse out of the cell quickly, so their concentrations inside and outside the cell were taken as equal in the steady state. We obtained the original model in SBML format from the JWS online database [14] (downloaded on 26 May 2008). It is worth noting that the kinetic expression for PFK in the published SBML file differs from the one described in the article [11]; we adopted the latter version.

Table 1. List of the reactions which were added to the Teusink model. Numbers in brackets refer to reactions in Fig. 1. Square brackets denote concentrations described by dynamic variables of the mathematical model. All other quantities are parameters of the model: their values are either adopted from [11], set to the measured values of external metabolites, or estimated.

Name                     Reaction                                      Reaction rate expression
Adenosine kinase (19)    ATP + AMP ⇌ 2 ADP                             V_mAK ([ATP][AMP] - [ADP][ADP]/K_eqAK)
G6P consumption (3)      G6P + ATP → ADP                               V_mG6P [G6P][ATP]
glycerol transport (9)                                                 V_mGLY ([GLY] - GLY_out)
TCA (16)                 4 NAD + ADP + ACE ⇌ 4 NADH + ATP + 2 CO2      V_mTCA ([ACE][NAD][ADP] - [NADH][ATP]/K_eqTCA)
respiration (18)         0.5 O2 + NADH + 2.5 ADP ⇌ NAD + 2.5 ATP       V_mRESP (O2 [NADH][ADP] - [NAD][ATP]/K_eqRESP)
ATP consumption (20)     ATP → ADP                                     v = V_mATPase [ATP]
PDC (15)                 PYR ⇌ ACE + CO2
FBP1 (6)                 F16P → F6P
We modified the original Teusink model in several details to fit our purposes. Reaction numbers refer to Fig. 1; for details of the stoichiometry and the kinetic expressions see Table 1.

1. We explicitly modeled the concentrations of AMP, ADP, and ATP as dynamic variables. The adenosine kinase reaction (reaction 19), modeled with reversible mass-action kinetics, was introduced to maintain the moiety conservation of the pool of these species.
Fig. 1. Reaction scheme of the kinetic model of glycolysis. The numbers refer to the following reactions: 1: glucose transport; 2: HK; 3: G6P consumption; 4: PGI; 5: PFK; 6: FBP1; 7: ALD; 8: G3PDH; 9: glycerol diffusion; 10: GAPDH; 11: PGK; 12: PGM; 13: ENO; 14: PYK; 15: PDC; 16: TCA; 17: ADH; 18: respiration; 19: adenosine kinase; 20: ATP consumption. Reaction 7 produces two Triose-P per F16P, as indicated. Subscript "out" refers to species outside the cell. Reactions which were added to the original model by Teusink et al. [11] are listed in Table 1.
2. Instead of considering two side chains with constant fluxes at G6P (leading to glycogen and trehalose), we replaced them by a single G6P-consuming process (reaction 3) with irreversible mass-action kinetics. We did not distinguish between them since we do not have measurements for metabolites or fluxes of these branches that would allow for distinguishing one from the other.
3. At the end of the glycerol-producing branch, we included a diffusive transport reaction for glycerol out of the cell (reaction 9).
4. The original model contains the TCA cycle in the form of a succinate production branch. In this reaction, two molecules of acetaldehyde are consumed to produce one molecule of succinate. Since our model is aimed at describing respiration, we replaced this reaction by a simplified description of a running TCA cycle (reaction 16) and the respiratory chain (reaction 18): we consider two reactions which consume acetaldehyde and oxygen to produce energy in the form of ATP and NADH as well as the by-product CO2 [8]. We assumed that the CO2 concentration in the cell remains low due to rapid diffusion; therefore we did not include it in the backward rate expression of reaction 16.
5. The ATP-consuming reactions are summarized in one effective ATPase reaction (20). In the original model, this reaction had constant flux, which we replaced by irreversible mass-action kinetics.
6. Reversibility of the main glycolytic chain is crucial to obtain qualitative agreement with the measured fluxes. Therefore, we changed the irreversible Hill kinetics of the PDC reaction (reaction 15) to reversible kinetics by including an additional term with a parameter $K_{eq}^{PDC}$ in the original rate expression, as shown in the table.
7. The reaction catalyzed by PFK is also irreversible and modeled without product inhibition. To allow for a slowing down of the glycolytic flux at higher product concentrations, we included the reaction catalyzed by FBP1 in the model (reaction 6). In gluconeogenesis, this reverses the effect of PFK, but without involvement of ATP.

All other parts of the model, including the values of the parameters which are not explicitly mentioned in this article, were adopted from [11]. In contrast to glucose and glycerol, it was assumed that ethanol diffusion through the cell membrane is fast enough to keep the outer and inner concentrations close; therefore no distinction was made between extra- and intracellular ethanol. The resulting model has 20 reactions and 17 dynamic variables representing metabolite concentrations. It is available in SBML and text formats as supplementary material.
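The mass-action rate laws that Table 1 lists for the added reactions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and argument names are hypothetical, the numeric values in the usage note are placeholders rather than fitted parameters, and the rate laws for PDC and FBP1 are omitted.

```python
# Illustrative sketch of the mass-action rate laws listed in Table 1.
# All names are hypothetical; no value here is a fitted parameter.

def reversible_mass_action(vm, keq, substrates, products):
    """Return vm * (product of substrates - product of products / keq)."""
    forward = 1.0
    for s in substrates:
        forward *= s
    backward = 1.0
    for p in products:
        backward *= p
    return vm * (forward - backward / keq)


def adenosine_kinase_rate(vm, keq, atp, amp, adp):
    # Reaction 19: ATP + AMP <-> 2 ADP, reversible mass action.
    return reversible_mass_action(vm, keq, [atp, amp], [adp, adp])


def g6p_consumption_rate(vm, g6p, atp):
    # Reaction 3: G6P + ATP -> ADP, irreversible mass action.
    return vm * g6p * atp


def glycerol_transport_rate(vm, gly_in, gly_out):
    # Reaction 9: diffusive transport, driven by the concentration difference.
    return vm * (gly_in - gly_out)


def atp_consumption_rate(vm, atp):
    # Reaction 20: ATP -> ADP, irreversible mass action.
    return vm * atp
```

For example, `adenosine_kinase_rate(1.0, 1.0, atp=2.0, amp=0.5, adp=1.0)` evaluates the net rate at a point where the forward and backward terms cancel, which is the equilibrium condition assumed in the original Teusink model.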
2.3. Transcriptional regulation and external metabolites

In order to include transcriptional regulation in our model, we write the rate of reaction i in the experimental condition j as

$v_{ij} = E_{ij} \, R_i(\mathbf{c}_j)$    (1)

where $E_{ij}$ denotes the concentration of the active form of the corresponding enzyme in the steady-state cultures, $R_i$ denotes the rest of the kinetic expression, and $\mathbf{c}_j$ denotes the vector of all metabolite concentrations in condition j. We compared the four aerobic states to the anaerobic state^b. We indicate quantities belonging to this condition by the subscript j = 0. Transcriptional regulation was accounted for in the following way: for each enzymatic reaction i and each aerobic condition j, we calculated $E_{ij}/E_{i0}$, the relative change of enzyme concentration in the four aerobic states, from the transcription data by setting

$\frac{E_{ij}}{E_{i0}} = g_{ij}^{a}$    (2)

where the scaling exponent $a$ is a constant and $g_{ij}$ denotes the transcription fold change associated with reaction i in condition j. By definition, $g_{i0} = 1$ for every reaction. Assuming that the activity of an enzyme is proportional to its concentration, we describe the effect of transcriptional regulation on the reaction rate $v_{ij}$ by replacing it with

$v_{ij} = g_{ij}^{a} \, E_{i0} \, R_i(\mathbf{c}_j)$    (3)

for each reaction i and condition j. For most enzymatic reactions, we calculated $g_{ij}$ as the arithmetic mean of the measured mRNA concentration fold changes for the genes associated with reaction i. See the Appendix for the list of genes associated with each enzymatic reaction. Since the transcriptional activities corresponding to Enolase and GAPDH were not measured, for these reactions we computed the value of $g_{ij}$ by averaging the values for the next-neighbor reactions (PGM, PYK) and (ALD, PGK), respectively. The reaction ADH was also treated differently. The expression data for ADH1, together with ADH2, the isoenzyme responsible for converting ethanol to acetaldehyde, indicate that net ethanol production is shut down with growing oxygen supply, reaching virtually zero in the fully aerobic condition. The resulting ethanol flux also reflects this behavior (Fig. 2). For simplicity, instead of including ADH2, which would involve yet more unknown parameters, we only included the reaction for ADH1 and described its regulation by setting $g_{ij}$ to the values of the measured ethanol flux, normalized to the anaerobic condition. The resulting $g_{ij}$ values for all experimental conditions are shown in Fig. 2.
[Figure 2: four panels A-D. x-axis: oxygen concentration in steady-state cell culture (0%, 0.5%, 1%, 2.8%, 20.9%). Legend entries include HK (2), PGI (4), PFK (5), FBP1 (6), ALD (7), ENO* (13), PYK (14), PDC (15), TCA (16), resp. (18).]

Fig. 2. A, B, C: fold change of mRNA concentration associated with reactions in the mathematical model, normalized to the anaerobic state (denoted by $g_{ij}$ in the text). The values were calculated from the expression data of the genes associated with each reaction as given in the appendix. Numbers in brackets refer to reaction numbers in Fig. 1. For reactions marked with (*), no transcript analysis was undertaken; the values were averaged from neighbors as described in the text. D: fold change of the genes ADH1 and ADH2 and the resulting ethanol flux. At the highest oxygen concentration the flux drops to zero (not shown on the logarithmic scale).
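The scaling of maximal rates by transcript fold changes described around Eq. (2) can be sketched as follows. This is a minimal sketch, assuming that the enzyme concentration enters each rate law through its maximal velocity; the function names and the numbers in the usage note are hypothetical and are not taken from the data set.

```python
from statistics import mean

# Hypothetical helper names; values used with them are illustrative only.

def fold_change_for_reaction(gene_fold_changes):
    """g_ij: arithmetic mean of the mRNA fold changes of the genes of one reaction."""
    return mean(gene_fold_changes)


def neighbor_average(g_first, g_second):
    """g_ij for a reaction without transcript data, averaged from two neighbors."""
    return 0.5 * (g_first + g_second)


def scaled_vmax(vmax_anaerobic, g_ij, a):
    """Maximal rate in condition j: the anaerobic value times g_ij ** a (Eq. 2)."""
    return vmax_anaerobic * g_ij ** a
```

With `a = 1` the maximal rate follows the transcript level proportionally, while `a = 0` recovers the original model with constant enzyme concentrations; intermediate values damp the effect of transcription on enzyme activity.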
The external metabolites glucose, glycerol and ethanol were represented by the model species Glucose_out, Glycerol_out and Ethanol (cf. Fig. 1). Their concentrations were set to constant values according to the experimental data: Glucose_out was set to the corresponding concentration in the inlet feed solution, 55.55 mmol/l, in all conditions. The measured glycerol concentration was 8.90 mmol/l for the anaerobic condition and zero for all aerobic conditions. The measured ethanol concentrations were 75.37 mmol/l, 59.01 mmol/l, 47.56 mmol/l, 3.66 mmol/l, and 0 mmol/l for the conditions with 0%, 0.5%, 1%, 2.8%, and 20.9% oxygen, respectively.

2.4. Parameter estimation

We performed parameter estimation on a subset of the model parameters to achieve agreement with the data. Measured metabolite concentrations were compared with the concentrations in the model. The measured fluxes for glucose, oxygen, ethanol, glycerol and CO2 were compared to the rates $0.5 r_1$, $r_{18}$, $r_{17}$, $r_8$, and $r_{15} + 2 r_{16}$, respectively, where $r_i$ denotes the rate of reaction i in Fig. 1. We quantified goodness of fit for each possible set P of values for the estimated parameters by the following cost function:
$F(P) = \sum_{j} \sum_{k} \frac{\left( V_{kj}^{sim}(P) - V_{kj}^{exp} \right)^2}{\sigma_{kj}^2}$    (3)

where we denote the steady-state value of a metabolite concentration or flux k for the condition j by $V_{kj}^{sim}$ and $V_{kj}^{exp}$ for simulation results and experimental data values, respectively. The $V_{kj}^{sim}$ values were obtained by runs of 10000 seconds of simulation time. $1/\sigma_{kj}^2$ is a weight factor in which $\sigma$ is often set to the value of the experimental error. However, this choice does not reflect an appropriate weight measure in our case, since we do not expect to be able to reproduce the experimental data within the errors. At the same time, a small experimental error of a quantity does not necessarily correlate with higher importance of a good fit compared to other quantities with larger errors. To assign the same weight to all relative deviations, we set $\sigma_{kj}$ to be proportional to $V_{kj}^{exp}$ in the following way:

$\sigma_{kj} = 0.15 \cdot V_{kj}^{exp}$, in case $V_{kj}^{exp} \neq 0$,
$\sigma_{kj} = 0.15 \cdot \min_{j'} \left( V_{kj'}^{exp} \right)$, in case $V_{kj}^{exp} = 0$,

where the minimum is taken over all nonzero values of the concentration or flux k among all conditions. To avoid non-steady-state solutions, we introduced a penalty term $(\exp(K) - 1)$ in the cost function. The term K quantifies the deviation of the solution from steady state. It is defined as

$K = \sum_{k=1}^{17} \sum_{l=1}^{3} \left| X_k(t_l^{comp}) - X_k(t_{last}) \right|,$
where $X_k(t_{last})$ denotes the simulated value of the concentration $X_k$ at the last time instance $t_{last} = 10000$ s, and $X_k(t_l^{comp})$ denotes its value at an earlier time instance $t_l^{comp}$. The values $t_l^{comp}$ were chosen as $t_1^{comp} = 0.5 \cdot t_{last}$, $t_2^{comp} = 0.75 \cdot t_{last}$, and $t_3^{comp} = 0.8 \cdot t_{last}$.

We estimated a total of 31 parameters, which was an acceptable number given a total of 70 data points. The values of all other parameters were taken from [11]. We estimated the following groups of parameters:

1. Since the experiment by Wiebe et al. and the experiments underlying the Teusink model differ in the experimental conditions and the yeast strain, we could not rely on the absolute enzyme concentrations being comparable. Therefore, we fitted all Vm values and the diffusion coefficient for reaction 9 (20 parameters).
2. We also fitted the new kinetic parameters of the reactions that were added to the original model (4 parameters, cf. Table 1).
3. The sum of [NAD+] and [NADH] is a conserved quantity of the model, determined by the initial concentrations of these species. Since experimental data were not available, we estimated this quantity for each condition separately (5 parameters).
4. We fitted the scaling exponent a from Eq. (2).
5. Concentration units: reaction rate expressions in our model are based on enzyme kinetics, and hence the concentrations of the reactants need to be known. However, all metabolite concentrations and fluxes were measured in units per gram dry weight of biomass (gDW). The values were determined after collecting the cells from the culture by centrifugation, washing with distilled water, and drying to constant weight at 100 °C. To convert the measured values to cytosolic concentrations, the net cytosol volume of the cells in 1 gDW is needed. Although estimates for this number exist (1 g dry weight corresponding to 2 ml cytosol [13]), we preferred to fit this quantity along with the other parameters of the model.
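The goodness-of-fit measure of this section can be sketched in Python as a weighted sum of squared deviations plus the steady-state penalty. This is a minimal sketch under stated assumptions: the text does not say how the penalty term is combined with the sum, so it is simply added here; the data layout (one list of values per quantity, one entry per condition) and all names are hypothetical.

```python
import math

def sigma(v_exp_k, j, rel=0.15):
    """Weight scale for one quantity in condition j.

    15% of the experimental value, or 15% of the smallest nonzero
    experimental value of that quantity if the value itself is zero.
    Assumes the quantity has at least one nonzero value.
    """
    if v_exp_k[j] != 0:
        return rel * v_exp_k[j]
    return rel * min(v for v in v_exp_k if v != 0)


def steady_state_deviation(x_at, t_last=10000.0, fractions=(0.5, 0.75, 0.8)):
    """K: summed absolute differences between earlier and final concentrations.

    x_at(t) is assumed to return the list of simulated concentrations at time t.
    """
    x_end = x_at(t_last)
    return sum(abs(x_c - x_e)
               for f in fractions
               for x_c, x_e in zip(x_at(f * t_last), x_end))


def cost(v_sim, v_exp, k_dev):
    """Weighted squared deviations plus the penalty exp(K) - 1 (added: assumption).

    v_sim and v_exp map a quantity name to its values, one per condition.
    """
    total = 0.0
    for name, exp_values in v_exp.items():
        for j, v in enumerate(exp_values):
            total += ((v_sim[name][j] - v) / sigma(exp_values, j)) ** 2
    return total + (math.exp(k_dev) - 1.0)
```

Because every weight is 15% of the experimental value, a simulated value that is off by 15% contributes exactly one unit to the sum regardless of the magnitude of the quantity, which is the equal weighting of relative deviations described above.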
2.5. Genetic algorithm and semiglobal search
We adopted the genetic algorithm Differential Evolution [16] to search for the parameter set with the best fit to the experimental data. In a truly global search, parameters could assume any value between zero and infinity, with the aim of finding a global optimum of the fit. However, we found this approach impracticable, since many parameter sets, although viable in principle, are not practical to work with: some do not generate a steady state (for example due to accumulation of F16BP), others require long computation times. Therefore we developed the following semi-global approach: at any given time, only a limited region of parameter space was screened. This was achieved by limiting each parameter to a certain range. If a parameter repeatedly (4 out of the last 5 times) produced values in the upper or lower 20% of its search range, the range was relocated such that the parameter value corresponding to the hitherto best result became the center of the search range of this parameter. If this would have resulted in a negative value for the lower limit, the latter was set to zero. The width of the search range was
J. Bruck, W. Liebermeister & E. Klipp
kept constant during the process and was determined at the beginning of the parameter estimation to be $[(1-r)\,p_0,\ (1+r)\,p_0]$, where $p_0$ denotes the initial value of the parameter and $r$ was set to 0.5. Since evaluating the cost function (Eq. 3) involves numerically integrating a system of 20 differential equations, we used software tools to convert the SBML model into executable C code for faster integration [6, 10]. The integrator used in the process was CVODE from Sundials [15].
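The range relocation rule of the semi-global search can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, one "time" is taken to be one recorded best value, the upper bound is moved along with the lower bound so that the width stays constant, and the hit history is reset after each relocation.

```python
from collections import deque


class SearchRange:
    """Fixed-width search range for one parameter (illustrative sketch)."""

    def __init__(self, p0, r=0.5, edge=0.2, window=5, hits=4):
        self.lower = (1 - r) * p0
        self.upper = (1 + r) * p0
        self.width = self.upper - self.lower   # kept constant throughout
        self.edge = edge                       # fraction counted as "near a bound"
        self.hits = hits
        self.history = deque(maxlen=window)    # True if a value was near a bound

    def update(self, best_value):
        """Record the current best value; relocate the range if needed."""
        rel = (best_value - self.lower) / self.width
        self.history.append(rel <= self.edge or rel >= 1 - self.edge)
        if sum(self.history) >= self.hits:
            # Centre the range on the best value, but never go below zero.
            self.lower = max(0.0, best_value - self.width / 2)
            self.upper = self.lower + self.width   # assumption: width preserved
            self.history.clear()                   # assumption: start counting anew
            return True
        return False


rng = SearchRange(p0=10.0)             # initial range [5, 15]
for value in (14.0, 14.5, 13.5, 14.2):
    moved = rng.update(value)
print(moved, rng.lower, rng.upper)
```

In this example all four values fall into the upper 20% of the range, so the fourth update recentres the range on 14.2.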
3. Results
3.1. Parameter estimation
We ran four parameter estimation processes to find model parameter values that produce the best possible fit to the experimental data. Fig. 3 shows the evolution of the goodness of fit (as quantified by the cost function) and the values of five parameters during one parameter estimation process (data for all parameters are published as a supplementary file). Most, but not all, parameters converged to a certain value. However, a unique
[Figure 3: six panels plotted against the number of generations: change of cost during estimation; fmma (start value 1, last value 1.6); fwstst (start value 500, last value 29); nadsum_3 (start value 1.6, last value 0.3); GLYtrs_VmGLY (start value 10000); vPDC_KmPDCACE (start value 5).]
Fig. 3. Evolution of goodness of fit (cost) of the best parameter set (top left) and corresponding values of five of the 31 model parameters during a parameter estimation process of ca. 49000 generations. Shown are values corresponding to the parameter set with the best fit to data (as defined by the cost function, see text) after a certain number of generations. The momentary search range for each parameter (see text for description) is specified by upper and lower bounds (shown by lines). The parameters fmma (called a in the text), fwstst and nadsum_3 are explained in section 2.4 under points 4, 5 and 3, respectively. GLYtrs_VmGLY denotes the diffusion coefficient in reaction 9, and vPDC_KmPDCACE denotes the constant in reaction 15 (cf. Table 1).
parameter set with best fit could not be determined within the available computing time (24 hours of computing time amounting to roughly 7000 generations on an AMD 3800+ processor), since a number of parameters did not converge to similar values during these parameter estimations (data not shown).
Notably, these parameter sets produced mostly similar simulation values. As shown in Fig. 4, the largest quantitative differences between the predictions generated by the four parameter sets can be observed in the simulation results for the F16P concentration (0% oxygen) and for O2. Some of the parameters were seemingly not, or only weakly, determined, i.e. their values had little or no influence on the cost function. This was to be expected, since only about two thirds of the dynamical variables of the model were measured. Since the number of data points (70) is more than twice the number of parameters (31), we do not expect
[Figure 4: panels for G6P, F6P, F16P, PG, PEP, ATP, ADP, AMP, CO2 flux, glucose flux and O2, each plotted against the oxygen concentration (%) at 0, 0.5, 1, 2.8 and 20.9; legend: experiment, simulation.]
Fig. 4. Concentrations and fluxes of metabolites: comparison of experimental values and simulation results with parameter sets resulting from four different estimation runs (which we terminated after 49973, 49959, 12435, and 12678 generations of the Differential Evolution algorithm).
overfitting to occur.
3.2. Comparison to experimental data
Experimental data of metabolite concentrations and fluxes over the five experimental conditions and corresponding simulation results are shown in Fig. 4. The mathematical model was able to reproduce the experimental data to varying extents. In general, the concentration values of the metabolites in upper glycolysis (G6P,
F6P, F16P) and of the adenosine species (ATP, ADP, AMP) were reproduced better than those of lower glycolysis (PG, PEP, PYR) and the metabolite fluxes. In the latter group, the decrease of the pyruvate concentration and of the carbon dioxide and ethanol fluxes with higher oxygen concentrations was reproduced as a tendency, but neither the exact absolute values nor the sharp difference between anaerobic and aerobic conditions was reproduced correctly by the model. The predicted increase of the O2 flux with external oxygen concentration was also qualitatively correct, but the experimental values for cultures with higher oxygen concentrations were not reproduced correctly. Other measured quantities, however, behave distinctly differently from our simulation results: the model failed to reproduce the measured decrease of the glucose flux with increasing oxygen concentration, predicting nearly constant values instead, and it failed to reproduce the similar decrease of the glycerol flux, for which it predicted an increase.
4. Discussion
We explored whether an established mathematical model of yeast glycolysis, created to describe one anaerobic condition, could be extended to describe different cell states corresponding to experimental conditions with various oxygen concentrations. To this end, regulation through enzyme concentration changes and a simple model of the TCA cycle and respiratory chain were included in the model. As enzyme data were not available, we assumed that differential enzyme concentrations and differential mRNA concentrations are related by a power law with a single exponent. This assumption is of course questionable: enzyme concentrations are also regulated posttranscriptionally, so changes in enzyme levels can, in principle, take place irrespective of differential expression and vice versa. However, a monotonic relationship between the two quantities holds, at least on average; in a comparison of mRNA and protein levels for different genes, a scaling exponent of about 0.6 has been reported [1].
In our attempt to reproduce the experimental data, we were led to make a number of further changes to the original model. Most remarkably, we found that a number of reactions of the pathway needed a reversible description (either by altering the kinetics, as in PDC, or by including a reverse reaction, such as FBP) for the following reason: as we compared steady-state cultures with higher concentrations of oxygen, the data clearly showed that the flux through glycolysis decreases in spite of the upregulation of most enzymes in carbon metabolism (while the ethanol-producing branch is simultaneously shut down). Although at first somewhat counterintuitive, this behavior can be reproduced without introducing posttranscriptional regulation into the model. In our case, the flux of the pathway is redirected from fermentation to respiration, i.e. to a branch with typically considerably lower reaction rates. This can result in a lower flux if the concentrations of some metabolites rise enough to slow down the reactions producing them. Loosely speaking, the pipeline of the pathway becomes "jammed", which slows down the flux. In contrast, fast diffusion of ethanol during fermentation may keep the concentrations in lower glycolysis low, which speeds up the reactions. In principle, upregulating enzyme production might even be an attempt of the cell to keep the flux through the pathway as high as possible. However, in a mathematical
model, this effect is only possible if the kinetics of each reaction is chosen such that the reaction rate is sufficiently slowed down by a rising product concentration. This is true for reversible reactions, but not for the irreversible kinetic expressions we replaced in the original model. Although the introduced changes increased the agreement with the experimental data, at this stage the model still disagrees with the data in a number of points. Probably the most important one is the measured decrease of the glucose flux in spite of a general upregulation at higher oxygen concentrations. It is possible that further refinement of the model will lead to at least qualitative agreement with the experiment in this point.
There are a number of possible ways to refine the method presented here. Increasing the number of data points compared to the number of parameters to be estimated is, of course, desirable. An important special case would be the experimental determination of a remaining conserved quantity, the sum of [NAD+] and [NADH]. Lacking such data, we fitted this quantity for each condition separately. Regarding the model input, including appropriate measurement data on the transcriptional activity of GAPDH or enolase would probably also improve the model. However, this can change its behavior only to the extent to which these values behave differently from their neighbors. The parameter estimation process was found to be very demanding in terms of computational power. Increasing its speed and finding well-defined parameter sets is a necessary technical step in further development.
Kinetic models have been successful in describing metabolism in cell states in which its explicit regulation through changing enzyme concentrations is negligible. Developing a class of models describing this additional layer of regulation is a logical next step, and this might enable us to describe - or even predict - various states of the cell with one single model.
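The role of product sensitivity discussed above can be illustrated with a generic textbook comparison. The sketch below contrasts an irreversible Michaelis-Menten rate law with a reversible one; both rate laws and all parameter values are illustrative assumptions and are not the kinetic expressions or parameters of the model discussed here.

```python
def v_irreversible(s, vmax=1.0, km_s=1.0):
    """Irreversible Michaelis-Menten rate: independent of the product."""
    return vmax * s / (km_s + s)


def v_reversible(s, p, vmax=1.0, km_s=1.0, km_p=1.0, keq=10.0):
    """Reversible Michaelis-Menten rate: the net rate falls as product p rises
    and vanishes at equilibrium, p = keq * s."""
    return (vmax / km_s) * (s - p / keq) / (1.0 + s / km_s + p / km_p)


# Fixed substrate level, rising product level (arbitrary units)
for p in (0.0, 1.0, 5.0, 10.0):
    print(p, v_irreversible(1.0), v_reversible(1.0, p))
```

With the irreversible law the rate stays the same however much product accumulates, so a downstream bottleneck cannot feed back on the upstream flux; with the reversible law the rate drops to zero as the product approaches its equilibrium level.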
Acknowledgments
The authors thank Marilyn Wiebe and Merja Penttilä of VTT Technical Research Centre, Helsinki, Finland, for kindly providing experimental data and for stimulating discussions on the subject, as well as Preben G. Sørensen of the University of Copenhagen for drawing our attention to useful methods. The financial support of the Marie Curie EST project "Systems Biology" (EC contract number MEST-2-CT-2004-514169), of the Bacillus Systems Biology project (BaSysBio), and of the Yeast Systems Biology Network (YSBN) is gratefully acknowledged.
Appendix
Enzymatic model reactions and genes associated with them in brackets (for ENO, GAPDH, ADH, see text): HK(GLK1, HXK1), PGI(PGI1), PFK(PFK1, PFK2), ALD(FBA1), G3PDH(GPD1, GPD2), PGK(PGK1), PGM(PGM1), PYK(PYK1, PYK2), PDC(PDC1), TCA-Cycle(CIT3, KGD1, SDH1, SDH2, SDH3, SDH4, FUM1, LSC1, LSC2, PDA1, PDB1, CIT2), Respiratory chain(CYB2, COX5a, COX5b, CYC1, CYC7, NDE1, NDE2), FBP1(FBP1).
References
[1] Beyer, A., et al., Post-transcriptional expression regulation in the Yeast S.c. on a genomic scale, Molecular and Cellular Proteomics, 3(11):1083-1092, 2004.
[2] Fell, D., Understanding the Control of Metabolism, Portland Press, 1997.
[3] Greenbaum, D., et al., Comparing protein abundance and mRNA levels on a genomic scale, Genome Biology, 4:117, 2003.
[4] Heinrich, R., Schuster, S., The Regulation of Cellular Systems, Chapman and Hall, New York, 1996.
[5] Hynne, F., Danø, S., Sørensen, P.G., Full-scale model of glycolysis in Saccharomyces cerevisiae, Biophys Chem., 94(1-2):121-63, 2001.
[6] Keating, S.M., SBMLToolbox: an SBML toolbox for MATLAB users, Bioinformatics, 22(10):1275-7, 2006.
[7] Klipp, E., Herwig, R., Kowald, A., Wierling, C., Lehrach, H., Systems Biology in Practice: Concepts, Implementation and Application, Wiley-VCH, 2005.
[8] Michal, G., Biochemie-Atlas; Biochemical Pathways, German ed., Spektrum Akademischer Verlag, 1999.
[9] Rizzi, M., Baltes, M., Theobald, U., Reuss, M., In Vivo Analysis of Metabolic Dynamics in Saccharomyces cerevisiae: II. Mathematical Model, Biotechnology and Bioengineering, 55:592-608, 1997.
[10] Schmidt, H., Systems Biology Toolbox for MATLAB: A computational platform for research in Systems Biology, Bioinformatics, 22(4):514-515, 2006. (http://www.sbtoolbox2.org)
[11] Teusink, B., et al., Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry, Eur J Biochem., 267(17):5313-5329, 2000.
[12] Wiebe, M.G., et al., Central carbon metabolism of Saccharomyces cerevisiae in anaerobic, oxygen-limited and fully aerobic steady-state conditions and following a shift to anaerobic conditions, FEMS Yeast Research, 8(1):140-154, 2008.
[13] Wiebe, M.G., personal communication.
[14] http://www.jjj.bio.vu.nl/
[15] https://computation.llnl.gov/casc/sundials/description/description.html
[16] http://www.icsi.berkeley.edu/~storn/code.html
THE ROLE OF IP3R CLUSTERING IN Ca2+ SIGNALING
ALEXANDER SKUPIN alexander.skupin@bmi.de
MARTIN FALCKE falcke@bmi.de
Max-Delbrück-Center for Molecular Medicine, Department of Mathematical Cell Physiology, Robert-Rössle-Str. 10, 13125 Berlin, Germany
Ca2+ is the most important second messenger, controlling a variety of intracellular processes by oscillations of the cytosolic Ca2+ concentration. These oscillations occur by Ca2+ release from the endoplasmic reticulum (ER) into the cytosol through channels and the re-uptake of Ca2+ into the ER by pumps. A common channel type present in many cell types is the inositol trisphosphate receptor (IP3R), which is activated by IP3 and Ca2+ itself, leading to Ca2+-induced Ca2+ release (CICR). We have shown in an experimental study [15] that Ca2+ oscillations are sequences of random spikes that occur by wave nucleation. Here we use our recently developed model for Ca2+ dynamics in three dimensions to illuminate the role of IP3R clustering within spatially extended systems.
Keywords: cell signaling; calcium oscillations; modeling; clustering
1. Introduction
Calcium is a ubiquitous messenger used by cells to control a variety of different physiological processes such as muscle contraction, gene expression or secretion. Most importantly, Ca2+ translates external stimuli into intracellular responses by a transient increase of the cytosolic Ca2+ concentration [2, 4, 7, 18], which can act on distinct pathways or protein functions depending on its duration, strength and the expressed components. The increase of cytosolic Ca2+ is often caused by Ca2+ release from internal stores, especially from the endoplasmic reticulum (ER) and the sarcoplasmic reticulum, through release channels. The nonlinear properties of these channels, combined with other complex control mechanisms within cells, e.g. buffer reactions and pumps, lead to a rich spectrum of different Ca2+ signals including traveling waves and global oscillations [7]. Figure 1 shows an example. A widely used pathway leading to intracellular Ca2+ responses is the inositol 1,4,5-trisphosphate (IP3) pathway. If a plasma membrane receptor detects a signal molecule, e.g. serotonin, a phospholipase C (PLC) is activated by a G-protein and produces IP3 at the cell membrane. IP3 diffuses into the cytosol, where it can be bound by receptor channels (IP3R) on the membrane of the ER. If IP3 and Ca2+ are bound to an IP3R, it can open and Ca2+ will diffuse into the cytosol. The released Ca2+ is pumped back into the ER by Sarco-Endoplasmic Reticulum
[Figure 1: panels A and B; upper panels show the fluorescence signal ΔF versus t (s), lower panels the ISI (s) versus t (s).]
Fig. 1. Ca2+ oscillations in experiment and simulation. Upper panels show the fluorescence signals ΔF = F/F0, i.e. the ratio of the measured signal F to the initial signal strength F0, visualizing the cytosolic Ca2+ concentration. Lower panels exhibit the inter spike intervals (ISIs), i.e. the time between two successive fluorescence maxima. A: Experimentally measured spontaneous Ca2+ oscillations of a PLA cell (for more details see [15]). B: Simulated Ca2+ oscillation of a cell with 16 randomly distributed channel clusters, each consisting of a random number of channels between 3 and 15. The Ca2+ base level is [Ca2+]0 = 35 nM and [IP3] = 80 nM.
Calcium ATPases (SERCAs) pumps. The open probability of IP3Rs depends on the IP3 concentration and the calcium concentration in the cytosol [7, 11, 17]. It increases with increasing IP3 concentration. It is low for low calcium concentrations, increases with increasing Ca2+ and finally decreases again for even higher concentrations. This behavior leads to Calcium Induced Calcium Release (CICR), as Ca2+ released by one channel diffuses in the cytosol and increases the open probability of adjacent channels. Ca2+ terminates its own release by inhibiting the channels at high Ca2+ concentrations. The localized release, Ca2+ binding to buffers and removal of Ca2+ by pumps cause huge concentration gradients close to open clusters. IP3Rs are grouped into randomly distributed channel clusters on the ER membrane, containing 1-40 channels and separated by 1-7 µm. This spatial inhomogeneity, combined with the SERCA pumps and Ca2+ buffers, causes huge concentration differences close to open clusters. Ca2+ oscillations have been intensively studied in both experiment and theory. Most traditional models neglect concentration gradients and thus describe Ca2+ dynamics by ordinary differential equations [13]. However, we have recently shown in an experimental work [15] that intracellular Ca2+ oscillations using IP3 receptors are sequences of random spikes initiated by the local stochastic behavior of ion channels, which is transformed into a global Ca2+ signal by wave nucleation. A Ca2+ signal originates from the opening of a single channel, called a "blip", which might cause the opening of other channels within the cluster, yielding an elemental event called a "puff" [3, 5, 7-9]. A puff can activate neighboring clusters, and if a supercritical number of puffs arises, Ca2+ release spreads through the whole cell. This nucleation process carries the fluctuations of the state of individual channels
up to the cell level. The question is now why cells build distinct channel clusters and do not use a diffuse arrangement of channels or work with one huge cluster. While the influence of IP3R clustering has been studied on the level of a single cluster [10] and in two dimensions with a reduced model for the IP3R [14], an investigation of this issue in three dimensions and within the hierarchical picture described above is still lacking. In order to close this gap, we use here our recently developed method of modeling Ca2+ dynamics in 3d [16] to explore the role of IP3R clustering in a bottom-up approach.
2. Methods and Results
2.1. IP3R Model
A commonly used IP3R model is the DeYoung-Keizer (DK) model. The DK model assumes each IP3R to consist of four identical subunits having three binding sites each: one for IP3, one for Ca2+ activating the subunit, and another one for Ca2+ inhibiting the subunit. Since each binding site can be free or occupied, a single subunit has $2^3 = 8$ different states $X_{ijk}$ and 12 possible transitions, which can be visualized on a cube as shown in Fig. 2A. The first index of $X_{ijk}$ specifies IP3 binding and is 1 if IP3 is bound and 0 otherwise. Analogously, the second index indicates Ca2+ binding to the activating site and the last one corresponds to Ca2+ binding to the dominant inhibiting site. A subunit is active in the state $X_{110}$ only, and a channel will open if at least three of the four subunits are activated. The transitions between the states $X_{ijk}$ occur by stochastic binding and dissociation of signaling molecules at the corresponding binding sites. The rates for binding depend on the particular rate constants $a_i$ and on the Ca2+ concentration C or the IP3 concentration I, respectively, as shown in Fig. 2A, whereas dissociation occurs with constant rates $b_i$. The binding of Ca2+ to the activating as well as to
[Figure 2: panel A, cube of subunit states; panel B, sketch with the labels channel, ER, cytosol and cell.]
Fig. 2. A: Scheme of the DeYoung-Keizer model for a single subunit. A subunit is active if IP3 is bound and Ca2+ is bound only to the activating site, i.e. in state X110. A channel opens if at least 3 of its 4 subunits are active. See text for more details and Table 1 for values of rates bi and rate constants ai. B: Sketch of our two-compartment model. We overlay the two compartments, i.e. each point in space within our spherical cell corresponds to the ER and the cytosol simultaneously.
Table 1. Rates of the DK model used within simulations.
a1   20 (µM s)^-1      rate constant for IP3 binding with no inhibiting Ca2+ bound
b1   20 s^-1           rate of IP3 dissociation with no inhibiting Ca2+ bound
a2   0.001 (µM s)^-1   rate constant for Ca2+ binding to the inhibiting site with IP3 bound
b2   0.03 s^-1         rate of Ca2+ dissociation from the inhibiting site with IP3 bound
a3   2.6 (µM s)^-1     rate constant for IP3 binding with inhibiting Ca2+ bound
b3   20 s^-1           rate of IP3 dissociation with inhibiting Ca2+ bound
a4   0.025 (µM s)^-1   rate constant for Ca2+ binding to the inhibiting site with no IP3 bound
b4   0.1 s^-1          rate of Ca2+ dissociation from the inhibiting site with no IP3 bound
a5   10 (µM s)^-1      rate constant for Ca2+ binding to the activating site
b5   1.225 s^-1        rate of Ca2+ dissociation from the activating site
the inhibiting site leads to a bell shaped open probability in dependence on Ca2+ representing a key element of CICR.
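The state counting of the DK subunit and the opening rule of the channel can be checked with a few lines of code. The sketch below is only an illustration: it enumerates the subunit states and the connections between them, and it computes the probability that at least three of four subunits are active under the assumption that the subunits are independent and that each is active with a given probability; the function and variable names are hypothetical.

```python
from itertools import product
from math import comb

# A subunit state is (i, j, k): IP3 site, activating Ca2+ site, inhibiting Ca2+ site.
states = list(product((0, 1), repeat=3))

# Two states are connected if they differ in exactly one binding site
# (one binding or dissociation event); these are the edges of a cube.
transitions = [(s, t) for n, s in enumerate(states) for t in states[n + 1:]
               if sum(a != b for a, b in zip(s, t)) == 1]

ACTIVE = (1, 1, 0)  # IP3 bound, activating Ca2+ bound, inhibiting site free


def channel_open_probability(p_active, n_subunits=4, n_required=3):
    """Probability that at least n_required of n_subunits independent
    subunits are active, given the probability p_active for one subunit."""
    return sum(comb(n_subunits, m) * p_active**m * (1 - p_active)**(n_subunits - m)
               for m in range(n_required, n_subunits + 1))


print(len(states), len(transitions))   # 8 12
print(channel_open_probability(0.5))   # 0.3125
```

Because the opening rule requires three or four active subunits, the channel open probability is $4p^3(1-p) + p^4$ for a subunit activation probability $p$, which is much smaller than $p$ itself at low activation.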
2.2. Cell Model
We assume the cell to be a sphere. The ER is a tubular network spreading throughout the whole cell. Therefore we describe the cell by a two-compartment model as sketched in Fig. 2B. The two compartments interact through open channels, the leak flux and SERCA pumps. Opening and closing of channels, the Ca2+ pump flux into the ER and the reaction of Ca2+ with buffers determine the concentration dynamics in the cytosol and the ER, i.e. we have two reaction diffusion systems (RDSs), one for each compartment, which are coupled by the Ca2+ fluxes. However, we are only interested in the cytosolic Ca2+ dynamics and need the concentration within the ER only to determine the channel fluxes. Thus we use the single channel approximation derived in [1] for the flux J of an open channel,
J = [prefactor containing F, $D_C$, $D_E$, $a_s$, $\sigma_c$ and a tanh term, see [1]] × ([E] − [Ca2+]),   (1)
which depends on the diffusion coefficients of Ca2+ within the cytosol, $D_C$, and the ER, $D_E$, the channel radius $a$, the flux constant $\sigma_c$ and on the average concentrations within the compartments. For channel clusters with more than one open channel, we scale the radius by the cubic root of the number of open channels $N_{open}$, i.e. $a = a_s N_{open}^{1/3}$, taking the increase of the source volume due to channel opening into account. With Eq. (1) we can neglect the spatially resolved dynamics within the ER. In the following we take one mobile buffer [B] and one immobile buffer [Bi] (with $D_{B_i} = 0$) into account, yielding a system of three coupled PDEs. In order to derive an analytical solution we linearize the PDEs around the resting state, where no channels are open and all three components (Ca2+, mobile and immobile buffer) are homogeneously distributed and in equilibrium. After rescaling time $t \to t/T$ and space $r \to r/L$ with the diffusion time $T = (k^+[B]_T)^{-1}$ and length $L = \sqrt{D_{Ca}\,(k^+[B]_T)^{-1}}$, the resulting system in dimensionless units defined in Table 2 takes the form [16]
(2a) (2b) (2c)
where the first equation describes the dimensionless free Ca2+ concentration and the other two correspond to the scaled free mobile and immobile buffer concentrations, respectively. The first term in Eq. (2a) corresponds to diffusion of Ca2+, whereas the next four terms describe the reactions with buffers and the coupling with the ER by the pumps and the leak flux ($\sigma_p$ and $\sigma_l$, respectively). The last term specifies the release of Ca2+ by channels, which we assume to be delta sources. Nevertheless, we incorporate their spatial character by using Eq. (1) for the scaled flux $\sigma$. The two remaining equations in (2) describe the buffer dynamics. The dimensionless resting conditions are given by $c_0 = [\mathrm{Ca}^{2+}]_0/K$, $b_0 = (c_0 + 1)^{-1}$ and $b_{i,0} = (c_0\kappa + 1)^{-1}$, depending on the dissociation constant $K$ of the mobile buffer and the ratio $\kappa$ of the dissociation constants of the two buffer types. For the linear system of PDEs (2) we derived an analytical solution by means of coupled Green's functions for a spherical cell with no-flux boundary condition at the cell membrane [16]. The solution for the concentration dynamics can now be used as a natural environment for localized IP3R clusters to study the interplay of their nonlinear stochastic opening behavior and the feedback on Ca2+. Therefore we couple the global deterministic solution to the local stochastic channel behavior by a Gillespie algorithm described in [12].
Table 2. Definition of dimensionless parameters.
c          [Ca2+]/K                       dimensionless free Ca2+ concentration
b          [B]/[B]_T                      dimensionless free mobile buffer concentration
b_i        [Bi]/[Bi]_T                    dimensionless free immobile buffer concentration
e          [E]/K_E                        dimensionless free Ca2+ concentration within the ER
d          D_B/D_Ca                       ratio of the diffusion coefficients
ε_T        [B]_T/K                        time separation of the mobile buffer
ε_i        [Bi]_T/K                       time separation of the immobile buffer
τ          [Bi]_T k_i^- /([B]_T k^-)      ratio of buffer influence
σ_i        P_i/(k^+[B]_T)                 scaled fluxes σ_l and σ_p
σ          (involves J, k^+[B]_T, 2FK)    scaled channel flux
κ          K/K_i                          ratio of the dissociation constants of the mobile and immobile buffer
κ_E        K/K_E                          ratio of the dissociation constants of the cytosolic and lumenal buffer
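The stochastic part of such a scheme can be illustrated with a minimal Gillespie simulation of a single binding site. This sketch is not the hybrid algorithm of [12]: it assumes a constant Ca2+ concentration (in the full model the concentration changes in time and feeds back on the rates), it treats only the activating site with the rates a5 and b5 from Table 1, and the function name and the chosen concentration of 0.1 µM are hypothetical.

```python
import random


def gillespie_two_state(k_on, k_off, t_end, seed=0):
    """Simulate one binding site that switches between free (0) and bound (1).

    k_on  : binding rate in 1/s (rate constant times a fixed concentration)
    k_off : dissociation rate in 1/s
    Returns the fraction of the simulated time during which the site was bound.
    """
    rng = random.Random(seed)
    t, state, time_bound = 0.0, 0, 0.0
    while True:
        rate = k_on if state == 0 else k_off   # only one reaction is possible per state
        dt = rng.expovariate(rate)             # exponentially distributed waiting time
        if t + dt >= t_end:                    # next event lies beyond the end time
            if state == 1:
                time_bound += t_end - t
            break
        if state == 1:
            time_bound += dt
        t += dt
        state = 1 - state                      # bind or dissociate
    return time_bound / t_end


# Activating Ca2+ site: a5 = 10 (uM s)^-1, b5 = 1.225 s^-1 (Table 1),
# at an assumed constant Ca2+ concentration of 0.1 uM.
a5, b5, ca = 10.0, 1.225, 0.1
print(gillespie_two_state(a5 * ca, b5, t_end=10000.0))
print(a5 * ca / (a5 * ca + b5))   # stationary bound fraction for comparison
```

For a long simulation the simulated bound fraction approaches the stationary value $k_{on}/(k_{on} + k_{off})$, which gives a simple consistency check for the sampling of waiting times.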
[Figure 3: panel A, sketch of the cluster grid; panels B (N = 2) and C (N = 32), time courses over t (s) from 0 to 1000.]
Fig. 3. A: Sketch of the spatial arrangement for the clustering analysis. Clusters are put on a regular grid around the origin. B and C: Representative examples of the channel dynamics. Upper panels show the number of open channels and the lower panels the amount of inhibited subunits for a cell with 128 channels in total, which are distributed on N clusters.
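Spike trains such as those in Fig. 3 are characterized by their inter spike intervals (ISIs), the times between successive maxima, and by the mean period obtained by averaging over the ISIs. A minimal sketch of this bookkeeping is given below; the function names and the example spike times are hypothetical.

```python
def inter_spike_intervals(spike_times):
    """Times between successive spike maxima."""
    return [t2 - t1 for t1, t2 in zip(spike_times, spike_times[1:])]


def mean_period(spike_times):
    """Mean period as the average over all inter spike intervals."""
    isis = inter_spike_intervals(spike_times)
    if not isis:
        raise ValueError("need at least two spikes")
    return sum(isis) / len(isis)


# Hypothetical times (in s) of successive maxima of the number of open channels
spikes = [12.0, 40.0, 61.0, 95.0, 120.0]
print(inter_spike_intervals(spikes))  # [28.0, 21.0, 34.0, 25.0]
print(mean_period(spikes))            # 27.0
```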
2.3. Results
For the following investigation we use the parameters of the DK model listed in Table 1 and standard parameters for the RDS listed in Table 3, reflecting typical properties of eukaryotic cells. Our results do not depend qualitatively on this explicit choice, but can differ quantitatively for different parameters. To study the influence of IP3R clustering, we vary the number of clusters N in the cell, arranged on a regular grid with a grid constant d as depicted in Fig. 3A. The grid constant influences the spatial coupling between the clusters, as the pumps will decrease the Ca2+ signal at adjacent clusters with increasing separation d and thus decrease the probability for a global event.
Table 3. Standard values of parameters used for simulations.
cell radius R                               10 µm
channel radius                              8 nm
diffusion coefficient of cytosolic Ca2+     220 µm^2/s
diffusion coefficient of lumenal Ca2+       70 µm^2/s
diffusion coefficient of mobile buffer      95 µm^2/s
cytosolic Ca2+ base level                   50 nM
IP3 concentration                           90 nM
total mobile buffer concentration           25 µM
on rate of the mobile buffer                600 (µM s)^-1
dissociation rate of the mobile buffer      100 s^-1
total immobile buffer concentration         30 µM
on rate of the immobile buffer              600 (µM s)^-1
dissociation rate of the immobile buffer    100 s^-1
pump rate                                   86 s^-1
channel flux constant                       4.3 × 10^6 s^-1
leak flux constant                          ≈ 0.01 s^-1 (implicitly given by P_p and [Ca2+]_0)
Figures 3B and C exhibit two representative examples of the cooperative channel behavior for a cell with 128 channels distributed equally on N clusters separated by d = 1 µm. The upper panels show the number of open channels $N_{open}$ and the lower panels depict the degree of inhibition $R_{inh}$, which is zero if no subunit is inhibited and one for total inhibition. For two clusters, each consisting of 64 channels, we observe relatively regular spiking caused by the self-amplifying character of CICR. If one channel of a cluster opens, it will open other channels of the cluster, too, leading to an increase of the cytosolic Ca2+ concentration, which will activate the second cluster. The resulting high [Ca2+] leads to an almost complete inhibition of channels, terminating the spike. If we distribute the 128 channels on 32 clusters, i.e. each cluster has 4 channels, the amplitude and frequency decrease, since the spatial coupling is decreased. Thus we observe a higher uncoordinated background activity, i.e. opening events of very few channels, which rarely leads to global events, as the puffs are too small to nucleate a global wave. To characterize such oscillations, we determine in the following the mean amplitude and the mean period $T_{av}$ by averaging over the ISIs, here given by the time between two successive maxima of open channels. Cells can control the number of IP3Rs and the degree of clustering. Thus, we are interested in how cells can tune spiking with these two variables. We compare a stimulated cell with the above mentioned high [IP3] and a cell with a lower IP3 concentration. It turned out that cells with high [IP3] and a sufficiently high number of channels exhibit saturated behavior, as can be seen in Fig. 4. Here the squares show $T_{av}$ and the number of open channels for a cell with a fixed number of channels $N_{ch}$ = 320, which are distributed equally on N clusters separated by d = 1 µm. Both $T_{av}$ and the amplitude exhibit only small fluctuations, indicating the strong coupling between the clusters. This behavior changes if we switch to low IP3 concentrations, as can be seen from the dots in Fig. 4.
Here each cluster contains 100 channels, i.e. by increasing the number of clusters we increase the number of channels. The amplitudes increase by increasing the number of clusters. Thereby Tav decreases from about 50 s for 2 clusters to about 20 s for 15 clusters. That is
[Figure 4: panels A and B, both plotted against the number of clusters.]
Fig. 4. Comparison of a cell with [IP3]=50 nM and a fixed number of channels distributed equally on clusters (squares) with a cell with [IP3 ]=10 nM, where each cluster consists of 100 channels (dots). A: Dependence of the mean period Tav on the number of clusters. B: Averaged maximal amplitude of the channel oscillations. (All error bars denote SEM.)
in the range of the mean period of the saturated cell and is due to the increased nucleation probability caused by the increased number of channels. For even more clusters, T_av increases again, since inhibition obstructs the more regular behavior. That is a consequence of the increased amplitudes shown in Fig. 4B for higher numbers of clusters and channels, which lead to higher Ca2+ concentrations. We observe a steep increase of the amplitudes up to the level of the saturated cell of about 45 channels. From that point on, further expression of channels is less effective, as the amplitude increases more slowly and exhibits larger variations. Interestingly, this crossover point of the amplitudes coincides with the fastest oscillation period in Fig. 4A. To analyze the effect of channel distribution further, we use a grid with a grid constant d = 1.5 μm and fewer channels to avoid saturated behavior. Figure 5 shows T_av and the amplitude for two different cell setups. The dots correspond to N_ch = 128 and the squares mark N_ch = 256. The mean periods in Fig. 5A exhibit a pronounced change for fewer than ten channels per cluster. Another property is shown by the amplitudes. Although the squares have twice as many channels as the dots, the average maximal amplitude is only slightly increased, owing to self-inhibition. These results suggest that cells with 128 channels have a larger dynamic range for frequency coding. In addition, T_av exhibits a more pronounced change than the amplitude and could be used for a robust control mechanism. We now return to the question of diffusively arranged channels. In a third approach to the analysis of the cluster distribution, we preserve the channel density by scaling the grid constant with the cubic root of the number of channels per cluster, i.e. $d = d_1 (N_{ch}/N_{cl})^{1/3}$, where d_1 denotes the minimal grid constant for one channel per cluster. In Fig. 6 we compare two cells with the same [IP3] and Ca2+ base level concentration but with two different numbers of channels N_ch and minimal grid constants d_1. Both setups, the one with N_ch = 128 and $d_1^{(1)}$ = 1 μm denoted by the squares and the one with N_ch = 256 and $d_1^{(2)}$ = 1.5 μm shown by the dots, exhibit a minimum in T_av, as shown in Fig. 6A. That means, cells with a more
[Figure 5, panels A and B. Horizontal axes: number of clusters.]
Fig. 5. Influence of clustering with a conserved number of channels (dots denote N_ch = 128 and squares N_ch = 256, i.e. each square has twice as many channels as the corresponding dot) and a fixed grid constant d = 1.5 μm. A: Mean period T_av against the number of clusters. B: Amplitude dependence for two different total numbers of channels within the cell.
[Figure 6, panels A and B. Horizontal axes: number of clusters.]
Fig. 6. Influence of clustering with a conserved channel density. A: The comparison of T_av for a cell with N_ch = 128 channels and d_1 = 1 μm (squares) and a cell with N_ch = 256 channels and d_1 = 1.5 μm (dots) demonstrates that the minimal T_av is not a simple effect of the density. B: The amplitudes exhibit a constant region and show that diffusively arranged channels do not create global oscillations for physiological regions, as the period increases and the amplitude goes to zero for an increasing number of clusters.
diffusive arrangement of channels can decrease T_av and increase the amplitude by clustering of IP3Rs. That is due to the existence of an optimal coupling strength for systems with discrete excitable stochastic elements [14]. Once the minimal T_av is reached, further clustering again results in slower oscillations, since inhibition blocks the channel clusters. Further, we see that oscillations with a lower channel density (dots) are slower than those with a higher density (squares). The two minima of T_av for the two setups occur at distinct cluster numbers and T_av values, but in both minima each cluster has 16 channels. For both realizations, we observe a plateau of the amplitudes over a relatively large range from about 8 to 23 clusters. In this range, the cell with the larger number of channels exhibits a nearly doubled average amplitude, whereas the amplitude is only slightly higher for few clusters due to inhibition and goes to zero for a diffusive arrangement of channels at larger cluster numbers. Interestingly, the minimal periods lie in this range of constant amplitudes, which might indicate a stabilized regime.
3. Discussion
In this paper we used our recently developed method for modeling Ca2+ dynamics in three dimensions to investigate the role of IP3R clustering. We found that spike amplitudes and ISIs depend on the degree of clustering, the cluster configuration, and the number of clusters. We found optimal configurations and numbers of channels with respect to a variety of properties. Reliable fast spiking can be obtained with about 10 channels per cluster and cluster densities of about 0.01 μm^-3. That would mean numbers of channels per cell which are about one order of magnitude smaller than those estimated from IP3 binding experiments (see [9] and references therein). Remarkably, expressing more IP3 or increasing the degree of clustering does not improve
regularity or accelerate spiking. It is currently believed that Ca2+ oscillations use frequency encoding. Small channel numbers appear more suitable for that purpose than large ones. Clustering of channels consistently improved spiking with respect to regularity of ISIs and amplitudes of spikes. If we assume that the ability to spike and to use frequency coding is the purpose of the Ca2+ signaling pathway, our results indicate that it can be achieved with surprisingly small channel numbers and if channels cluster.
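The two bookkeeping quantities used in the results above, the mean period T_av (the average time between successive maxima of open channels) and the density-preserving grid constant d = d_1 (N_ch/N_cl)^(1/3), can be illustrated with a minimal Python sketch. The function names, the threshold used to pick out maxima, and the toy trace are assumptions made for illustration; they are not taken from the simulation method itself.

```python
import math
from statistics import mean, stdev


def spike_times(t, n_open, threshold):
    """Times of local maxima of n_open that reach `threshold`.

    The threshold is an assumed cutoff to skip small background openings.
    """
    return [t[i] for i in range(1, len(t) - 1)
            if n_open[i] >= threshold
            and n_open[i - 1] < n_open[i] >= n_open[i + 1]]


def mean_period(times):
    """Return (T_av, SEM) of the intervals between successive maxima.

    Needs at least two spike times.
    """
    isis = [b - a for a, b in zip(times, times[1:])]
    sem = stdev(isis) / math.sqrt(len(isis)) if len(isis) > 1 else 0.0
    return mean(isis), sem


def grid_constant(d1, n_channels, n_clusters):
    """Density-preserving grid constant d = d1 * (N_ch / N_cl)**(1/3)."""
    return d1 * (n_channels / n_clusters) ** (1.0 / 3.0)


# Toy trace (made-up numbers): three spikes at t = 1, 4 and 8 s.
t = list(range(10))
n_open = [0, 5, 0, 0, 6, 0, 0, 0, 4, 0]
t_av, sem = mean_period(spike_times(t, n_open, threshold=3))
print(t_av, sem)

# 16 channels per cluster, as at the T_av minima reported for Fig. 6A.
print(grid_constant(1.0, 128, 8), grid_constant(1.5, 256, 16))
```

With 16 channels per cluster the grid constant grows by a factor 16^(1/3), roughly 2.5, relative to d_1 in both setups.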
References
[1] Bentele, K. and Falcke, M., Quasi-Steady Approximation for Ion Channel Currents, Biophys. J., 93:2597-2608, 2007.
[2] Berridge, M., Inositol trisphosphate and calcium signalling, Nature, 361:315-325, 1993.
[3] Berridge, M., Elementary and global aspects of calcium signalling, J. Physiol., 499:291-306, 1997.
[4] Berridge, M., Lipp, P., and Bootman, M., The versatility and universality of calcium signalling, Nature Rev. Mol. Cell Biol., 1:11-22, 2000.
[5] Bootman, M., Niggli, E., Berridge, M., and Lipp, P., Imaging the hierarchical Ca2+ signalling in HeLa cells, J. Physiol., 499:307-314, 1997.
[6] Falcke, M., On the role of stochastic channel behavior in intracellular Ca2+ dynamics, Biophys. J., 84:42-56, 2003.
[7] Falcke, M., Reading the patterns in living cells - the Physics of Ca2+ signaling, Advances in Physics, 53:255-440, 2004.
[8] Marchant, J., Callamaras, N., and Parker, I., Initiation of IP3-mediated Ca2+ waves in Xenopus oocytes, The EMBO J., 18:5285-5299, 1999.
[9] Marchant, J. and Parker, I., Role of elementary Ca2+ puffs in generating repetitive Ca2+ oscillations, The EMBO J., 20:65-76, 2001.
[10] Meinhold, L. and Schimansky-Geier, L., Analytical description of stochastic calcium periodicity, PRE, 66:050901(R), 2002.
[11] Putney, J. and Bird, G., The inositolphosphate-calcium signaling system in nonexcitable cells, Endocrine Reviews, 14:610-631, 1993.
[12] Rüdiger, S. et al., Hybrid Stochastic and Deterministic Simulations of Calcium Blips, Biophys. J., 93:1847-1857, 2007.
[13] Schuster, S., Marhl, M., and Höfer, T., Modelling of simple and complex calcium oscillations, Eur. J. Biochem., 269:1333-1355, 2001.
[14] Shuai, J. and Jung, P., Optimal ion channel clustering for intracellular calcium signaling, PNAS, 100:506-510, 2003.
[15] Skupin, A. et al., How does intracellular Ca2+ oscillate: By chance or by the Clock, Biophys. J., 94:2404-2411, 2008.
[16] Skupin, A. and Falcke, M., How to model Ca2+ dynamics in 3D, submitted, 2008.
[17] Taylor, C., Inositol trisphosphate receptors: Ca2+-modulated intracellular Ca2+ channels, Biochimica et Biophysica Acta, 1436:19-33, 1998.
[18] Tsien, R. and Tsien, R., Calcium channels, stores and oscillations, Annu. Rev. Cell Biol., 6:715-760, 1990.
RULE-BASED REASONING FOR SYSTEM DYNAMICS IN CELL SYSTEMS
EUNA JEONG eajeong@ims.u-tokyo.ac.jp
MASAO NAGASAKI masao@ims.u-tokyo.ac.jp
SATORU MIYANO miyano@ims.u-tokyo.ac.jp
Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo 108-8639, Japan
A system-dynamics-centered ontology, called the Cell System Ontology (CSO), has been developed for the representation of diverse biological pathways. Much of the pathway data based on the ontology has been created from databases via data conversion or curated by expert biologists. It is essential to validate the pathway data, which may otherwise cause unexpected issues such as semantic inconsistency and incompleteness. This paper discusses three criteria for validating the pathway data based on CSO: (1) structurally correct models in terms of Petri nets, (2) biologically correct models to capture biological meaning, and (3) systematically correct models to reflect biological behaviors. In addition, we have investigated how logic-based rules can be used with the ontology to extend its expressiveness and to complement the ontology by reasoning, which aims at qualifying pathway knowledge. Finally, we show how the proposed approach helps to explore dynamic modeling and simulation tasks without prior knowledge.
Keywords: Cell System Ontology; CSO; rule-based inference; pathway knowledge base; ontology validation
1. Introduction
The Cell System Ontology (CSO) [5] has been developed as a unified framework for the representation of biological pathways, based on the notion of hybrid functional Petri net with extension [8]. CSO defines classes for modeling, visualizing, and simulating biological pathways, and the relationships between classes, in the Web Ontology Language (OWL) [12]. Furthermore, selected controlled vocabularies are defined in CSO to make biological pathways easy to represent. The pathway data based on the CSO classes are created by data integration and exchange efforts such as BioPAX2CSO [4] and Transpath2CSML [9], modeling and simulation tools such as Cell Illustrator [14, 15], or ontology editors such as Protégé [13] and SWOOP [18]. The Cell System Markup Language (CSML) [16] is fully compatible with CSO. The static pathway models in other biological knowledge resources are reconstructed into mathematical models with improved visualization in CSO via data conversion. The CSO tools [6, 7, 14, 15] allow users to explore the possible dynamic behavior of pathway components. Unfortunately, those resources contain ambiguous and missing information [4, 9], which causes semantic inconsistency
and incompleteness in the pathway data in CSO. As a huge volume of CSO data is generated, it is crucial to provide a knowledge base which enables dynamic simulation and hypothesis testing of biological models. In this paper, we first propose three criteria for validating the pathway data in CSO in terms of both Petri nets and biological meaning. Modeling and validating biological pathways with Petri nets has been demonstrated in many studies [2, 3, 10, 11], because Petri nets allow graphical representation and simulation of biological pathways. However, these studies focus on representing the dynamics of the system, such as how to set relevant logical parameters for Petri net components. In fact, the Petri net components rarely embed biological semantics, in the sense that it is not important whether a place represents a gene or a protein, or whether a transition is gene expression or protein modification. Secondly, we propose a rule-based approach to extend the expressiveness of the ontology and to complement the ontology by reasoning, which aims at qualifying pathway knowledge. In the next section, we briefly introduce how CSO describes biological pathways. In Sec. 3, we define three criteria for validating the pathway data and present how rules are used in conjunction with CSO. Finally, a small example shows how the proposed rule-based approach helps to explore dynamic modeling and simulation tasks without prior knowledge.
2. Cell System Ontology
CSO defines a model as a set of processes. The processes have entities as participants. The processes and entities are related via directed connectors. The main classes to represent biological interactions are Process, Entity, and Connector. Each process represents a biological event such as binding, translation, or activation. CSO currently supports biological entities such as genes, proteins, RNA, small molecules, and complexes. RNA is further classified into its subclasses. A connector defines the role of an entity involved in a process. Depending on this role, the connector class is further classified into InputAssociationBiological, InputInhibitorBiological, InputProcessBiological, and OutputProcessBiological, which denote an activator, an inhibitor, an input, and an output, respectively. These basic elements are defined as BiologicalElement in CSO, as shown in Figure 1A. Furthermore, with the CSO schema, one can specify simulation-related parameters for mathematical models, graphical visualization of biological elements, and available literature data. CSO also provides comprehensive controlled vocabularies for biological events, cellular compartments, organism types, and cell types, to model biological pathways with different scales and modalities in cell systems. The formal schema of the complete ontology is available at [16]. Figure 1B describes asserted facts for a simple model, where simulation- and visualization-related properties are omitted for convenience. In the figure, the property values for a biological event, a cellular compartment, and a feature type refer to the controlled vocabulary terms defined in CSO, prefixed with ME_, CC_, and FT_, respectively. For example, p3 represents a translocation event and has two connectors, c4 and c5, which are instances of InputProcessBiological and OutputProcessBiological, respectively. The connectors c4 and c5 are related to the entities e2 and e4, respectively. The two entities, e2 and e4, have location properties. The related facts are underlined in the figure.

A. Biological elements defined in CSO.
BiologicalElement
  Connector
    Input
      InputAssociation
        InputAssociationBiological
        InputAssociationNonBiol.
      InputInhibitor
        InputInhibitorBiological
        InputInhibitorNonBiol.
      InputProcess
        InputProcessBiological
        InputProcessNonBiol.
    Output
      OutputProcess
        OutputProcessBiological
        OutputProcessNonBiol.
  Entity
    EntityBiological
      EntityBiologicalCell
      EntityBiologicalCompartment
      EntityBiologicalEnvironment
      EntityBiologicalMolecule
        Complex
        Dna
        Protein
        Rna
        SmallMolecule
      EntityBiologicalOther
      EntityBiologicalUnknown
    EntityNonBiological
  Fact
  Process
    ProcessBiological
    ProcessNonBiol.
[The panel also shows the classes ObjectOther and ObjectUnknown; their position in the hierarchy is not recoverable from the extracted figure.]

B. Asserted facts in CSO for a simple model.
ProcessBiological(p1)  hasBiologicalEvent(p1, ME_phosphorylation)  hasConnector(p1, c6)  hasConnector(p1, c7)
ProcessBiological(p2)  hasBiologicalEvent(p2, ME_binding)  hasConnector(p2, c1)  hasConnector(p2, c3)  hasConnector(p2, c2)
ProcessBiological(p3)  hasBiologicalEvent(p3, ME_translocation)  hasConnector(p3, c4)  hasConnector(p3, c5)
InputProcessBiological(c1)  hasEntity(c1, e1)
InputProcessBiological(c2)  hasEntity(c2, e2)
OutputProcessBiological(c3)  hasEntity(c3, e3)
InputProcessBiological(c4)  hasEntity(c4, e2)
OutputProcessBiological(c5)  hasEntity(c5, e4)
InputProcessBiological(c6)  hasEntity(c6, e1)
OutputProcessBiological(c7)  hasEntity(c7, e5)
Protein(e1)
Protein(e2)  locatedIn(e2, CC_cytoplasm)
Complex(e3)
Protein(e4)  locatedIn(e4, CC_nucleus)
Protein(e5)  hasFeature(e5, FT_phosphorylated)

C. A simple model visualized in Cell Illustrator.

Fig. 1. The biological elements defined in CSO (A), asserted facts for a simple model (B), and its visualization in Cell Illustrator (C).
Figure 1C shows a graphical illustration of the simple model imported into Cell Illustrator. The graphical images and positions of biological elements are also stored in CSO in a machine-readable format. Because of this, visualization tools can use these data for automatic drawing of biological networks, taking into account cellular compartments [6] and the hierarchy of the CSO classes [7].
3. Rule-Based Reasoning for Ontology Validation
We define three criteria for qualifying pathway knowledge as follows:
• Structurally correct models in terms of Petri nets.
• Biologically correct models to capture biological meaning.
• Systematically correct models to reflect biological behaviors.
Although CSO defines sophisticated classes and relationships to describe the details of any given interaction unambiguously, an OWL ontology alone is sometimes not enough to provide a qualified knowledge base of biological pathways. In OWL, there is no proper way to constrain what kind of entities can participate in which types of biological processes, or what data values are valid for a particular process. For ontology validation based on the three criteria, we use a rule-based approach represented in OWL constructors and axioms [12]. The available constructors and their correspondence in Description Logic (DL) and First Order Logic (FOL) are shown in Table 1.

Table 1. OWL constructors and DL FOL equivalence.

Constructor      DL syntax                         FOL syntax
intersectionOf   $C_1 \sqcap \dots \sqcap C_n$     $C_1(x) \wedge \dots \wedge C_n(x)$
unionOf          $C_1 \sqcup \dots \sqcup C_n$     $C_1(x) \vee \dots \vee C_n(x)$
complementOf     $\neg C$                          $\neg C(x)$
oneOf            $\{a_1 \dots a_n\}$               $x = a_1 \vee \dots \vee x = a_n$
allValuesFrom    $\forall P.C$                     $\forall y.(P(x, y) \rightarrow C(y))$
someValuesFrom   $\exists P.C$                     $\exists y.(P(x, y) \wedge C(y))$
minCardinality   $\geq n\, P.C$                    $\exists^{\geq n} y.(P(x, y) \wedge C(y))$
maxCardinality   $\leq n\, P.C$                    $\exists^{\leq n} y.(P(x, y) \wedge C(y))$

In Tab. 1, $C_{(i)}$ is a class, $P$ is a property, $a_{(i)}$ is an individual, $n$ is a non-negative integer, and $x$ and $y$ are variables. In FOL, classes correspond to unary predicates, properties correspond to binary predicates, and individuals are equivalent to constants. In the following description, rules are described in FOL. A rule has the form $H \leftarrow B_1 \wedge \dots \wedge B_n$, where $H$ and the $B_i$ are OWL constructors and axioms. $H$ is called the head of the rule and the conjunction of the $B_i$ is called the body of the rule. The rule is read as "if [body], then [head]."
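To make the FOL column of Table 1 concrete, the following minimal Python sketch evaluates four of the constructors over a small, finite set of facts. It is only an illustration under stated assumptions: the function names are hypothetical, properties are modeled as sets of pairs, and distinct IDs are assumed to denote distinct individuals (OWL itself does not make this unique-name assumption).

```python
# Illustrative finite-model reading of four constructors from Table 1.
# `prop` is a set of (x, y) pairs for a property P; `cls` is the set of
# individuals that are members of class C. All names are hypothetical.

def fillers(prop, x, cls):
    """All y with P(x, y) and C(y)."""
    return {y for (s, y) in prop if s == x and y in cls}


def some_values_from(prop, x, cls):       # someValuesFrom: exists y
    return len(fillers(prop, x, cls)) >= 1


def all_values_from(prop, x, cls):        # allValuesFrom: for all y
    return all(y in cls for (s, y) in prop if s == x)


def min_cardinality(n, prop, x, cls):     # minCardinality: at least n fillers
    return len(fillers(prop, x, cls)) >= n


def max_cardinality(n, prop, x, cls):     # maxCardinality: at most n fillers
    return len(fillers(prop, x, cls)) <= n


# Process p2 of Fig. 1B: connectors c1 and c2 are inputs, c3 is an output.
HAS_CONNECTOR = {("p2", "c1"), ("p2", "c2"), ("p2", "c3")}
INPUTS = {"c1", "c2"}

print(some_values_from(HAS_CONNECTOR, "p2", INPUTS))    # p2 has some input connector
print(all_values_from(HAS_CONNECTOR, "p2", INPUTS))     # but not only input connectors
print(min_cardinality(2, HAS_CONNECTOR, "p2", INPUTS))  # at least two input connectors
```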
3.1. Structurally correct models
We define a structurally correct model as a bipartite graph with two disjoint sets of entities and processes. This means that if an entity is involved in a process, it plays only one role in that process. The relationship between a process and its participants in CSO is described in Fig. 1B. Process p2 and entity e1 are related to each other via connector c1. For this relationship, four facts are used: ProcessBiological(p2), hasConnector(p2, c1), InputProcessBiological(c1), and hasEntity(c1, e1). CSO specifies that one process can have multiple hasConnector properties and that one connector has only one hasEntity property. In OWL, however, a ternary relationship among classes cannot be represented. A rule is therefore required to constrain the relationship among the three classes:
R1: $\mathrm{VALIDCONNECTION}(x_1, x_2) \leftarrow \mathrm{Process}(x_1) \wedge \exists x_2, \exists^{\leq 1} x_3.(\mathrm{hasConnector}(x_1, x_3) \wedge \mathrm{Connector}(x_3) \wedge \mathrm{hasEntity}(x_3, x_2) \wedge \mathrm{Entity}(x_2))$
Given any pair of one entity and one process, the relationship is correct if there exists zero or one connector between them. Some biological knowledge resources allow physical entities to have multiple roles in a process. The results of data conversion from those resources into CSO may violate this rule. There are also gaps between the levels of abstraction and the structuring conventions used in different biological knowledge resources. For example, in BioPAX [1], a catalyzed inactivation process is represented as two different processes: a catalysis and an inactivation. A catalysis describes that an enzyme catalyzes the inactivation process. In this case, the enzyme may participate as an activator of the catalysis process as well as an input of the inactivation process. In CSO, a catalyzed inactivation is described as one process. After conversion from BioPAX to CSO, one input entity is connected to the process with two roles: a catalyzer and a substrate. This is not allowed in CSO, which is based on Petri nets, because a catalyzer does not change its concentration during the interaction, whereas a substrate does. A query to evaluate the relationship between process $x_1$ and entity $x_2$ can be written as follows:
Q1: If not $\mathrm{VALIDCONNECTION}(x_1, x_2)$ then alert
Q1 requires user intervention (alert) to select a correct relationship if there exist multiple connections between $x_1$ and $x_2$, because it is difficult to decide which one is correct without understanding the details of the interaction.
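As a rough illustration of R1 and Q1, the check can be sketched in Python over the asserted facts of Fig. 1B. This is not the rule engine used in the paper: the tuple layout, the function name, and the IDs p4, c8, c9 and e9 in the second example are assumptions introduced for illustration.

```python
from collections import defaultdict

# hasConnector and hasEntity facts of the simple model in Fig. 1B.
# The data layout is an illustrative assumption, not part of CSO.
HAS_CONNECTOR = [("p1", "c6"), ("p1", "c7"),
                 ("p2", "c1"), ("p2", "c2"), ("p2", "c3"),
                 ("p3", "c4"), ("p3", "c5")]
HAS_ENTITY = {"c1": "e1", "c2": "e2", "c3": "e3", "c4": "e2",
              "c5": "e4", "c6": "e1", "c7": "e5"}


def invalid_connections(has_connector, has_entity):
    """Return {(process, entity): [connectors]} for pairs that violate R1,
    i.e. pairs linked by more than one connector."""
    links = defaultdict(list)
    for process, connector in has_connector:
        links[(process, has_entity[connector])].append(connector)
    return {pair: cs for pair, cs in links.items() if len(cs) > 1}


# Q1: every violating pair would need user intervention ("alert").
print(invalid_connections(HAS_CONNECTOR, HAS_ENTITY))  # empty for Fig. 1B

# A converted "catalyzed inactivation": enzyme e9 (hypothetical ID) is both
# activator (c8) and input (c9) of process p4.
bad = invalid_connections([("p4", "c8"), ("p4", "c9")],
                          {"c8": "e9", "c9": "e9"})
print(bad)
```

Returning the offending connectors, rather than a bare yes/no, mirrors the point of Q1: the user still has to decide which of the connections to keep.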
3.2. Biologically correct models
In many cases, controlled vocabularies are used to control and limit the terms that describe biological processes; their definitions are usually given as comments for human users. In CSO, the type of a process is identified with the property of a biological
event which has cardinality 1. On the other hand, it is optional in BioPAX and has a different meaning [9] in TRANSPATH [21]. It is useful to formalize the definitions based on shared knowledge underlying biological processes. We define a biologically correct model as a model that correctly represents the biological meaning of processes in a machine-readable format. In this paper, the three processes depicted in Fig. 1C are used to illustrate the rules. Translocation is a process which has ME_Translocation as its biological event. Similarly, binding and phosphorylation are processes annotated as ME_Binding and ME_Phosphorylation, respectively. The following rules define the three processes.
R2: $\mathrm{TRANSLOCATION}(x_1) \leftarrow \mathrm{Process}(x_1) \wedge \mathrm{hasBiologicalEvent}(x_1, \mathrm{ME\_Translocation})$
R3: $\mathrm{BINDING}(x_1) \leftarrow \mathrm{Process}(x_1) \wedge \mathrm{hasBiologicalEvent}(x_1, \mathrm{ME\_Binding})$
R4: $\mathrm{PHOSPHORYLATION}(x_1) \leftarrow \mathrm{Process}(x_1) \wedge \mathrm{hasBiologicalEvent}(x_1, \mathrm{ME\_Phosphorylation})$
The following queries, Q2, Q3, and Q4, evaluate whether a given process satisfies certain conditions. In the queries, HASINPUT and HASOUTPUT are defined as follows:
$\mathrm{HASINPUT}(x_1, x_3) \leftarrow \exists x_2, x_3.(\mathrm{hasConnector}(x_1, x_2) \wedge \mathrm{Input}(x_2) \wedge \mathrm{hasEntity}(x_2, x_3))$
$\mathrm{HASOUTPUT}(x_1, x_3) \leftarrow \exists x_2, x_3.(\mathrm{hasConnector}(x_1, x_2) \wedge \mathrm{Output}(x_2) \wedge \mathrm{hasEntity}(x_2, x_3))$
If an entity is connected to a process via an Input connector, we say that the process has an input entity. Likewise, the process has an output entity if the entity is connected to the process via an Output connector. In the queries, DifferentFrom and SameAs are OWL axioms for the identification of individuals. They have the forms $\{x_1\} \sqsubseteq \neg\{x_2\}$ and $\{x_1\} \equiv \{x_2\}$ in DL, respectively.
Q2: If $\mathrm{TRANSLOCATION}(x_1)$ then
  If $\neg(\exists x_i,\ 2 \leq i \leq 7.\ \mathrm{HASINPUT}(x_1, x_2) \wedge \mathrm{Entity}(x_2) \wedge \mathrm{locatedIn}(x_2, x_4) \wedge \mathrm{hasXref}(x_2, x_5) \wedge \mathrm{HASOUTPUT}(x_1, x_3) \wedge \mathrm{Entity}(x_3) \wedge \mathrm{locatedIn}(x_3, x_6) \wedge \mathrm{hasXref}(x_3, x_7) \wedge \mathrm{DifferentFrom}(x_4, x_6) \wedge \mathrm{SameAs}(x_5, x_7))$ then
    alert
Q3: If $\mathrm{BINDING}(x_1)$ then
  If $\neg(\exists^{\geq 2} x_2, \exists^{\leq 1} x_3.\ \mathrm{HASINPUT}(x_1, x_2) \wedge \mathrm{Entity}(x_2) \wedge \mathrm{HASOUTPUT}(x_1, x_3) \wedge \mathrm{Complex}(x_3))$ then
    alert
Q4: If $\mathrm{PHOSPHORYLATION}(x_1)$ then
  If $\neg(\exists x_i,\ 2 \leq i \leq 5.\ \mathrm{HASINPUT}(x_1, x_2) \wedge \mathrm{Entity}(x_2) \wedge \mathrm{hasXref}(x_2, x_4) \wedge \mathrm{HASOUTPUT}(x_1, x_3) \wedge \mathrm{Entity}(x_3) \wedge \mathrm{hasXref}(x_3, x_5) \wedge \mathrm{hasFeature}(x_3, \mathrm{FT\_phosphorylated}) \wedge \mathrm{SameAs}(x_4, x_5))$ then
    alert
In CSO, translocation is defined as the process in which an entity located in one cellular compartment is moved to another cellular compartment. In CSO, the same molecule in different locations is recognized as two different entities. Q2 states that a translocation process has to satisfy the constraints that the input and output entities have the same external reference and different cellular locations. A binding process is an interaction of a molecule with specific sites on another molecule. In Q3, a binding process needs at least two input entities and generates one output entity of type Complex. Formally, phosphorylation is the process of introducing a phosphate group into a molecule, usually with the formation of a phosphoric ester. In Q4, the constraints state that the input and output entities have the same external reference and that the sequence of the output entity has phosphorylated features. If the constraints are not satisfied, users are prompted for intervention (alert). Users may be guided to add missing constraints to the knowledge base.
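The constraints of Q2, Q3 and Q4 can be sketched as simple Python predicates over a toy version of the model in Fig. 1B. This is a hedged illustration, not the query implementation used in the paper: the dictionary layout and function names are assumptions, the xref values X1 and X2 are hypothetical (the figure lists no external references), and Q3 is read here as "at least two inputs and exactly one Complex output", following the prose description.

```python
# Toy model in the spirit of Fig. 1B. The xref values are hypothetical:
# the figure does not list external references.
ENTITIES = {
    "e1": {"type": "Protein", "xref": "X1"},
    "e2": {"type": "Protein", "xref": "X2", "location": "CC_cytoplasm"},
    "e3": {"type": "Complex"},
    "e4": {"type": "Protein", "xref": "X2", "location": "CC_nucleus"},
    "e5": {"type": "Protein", "xref": "X1", "features": {"FT_phosphorylated"}},
}
PROCESSES = {
    "p1": {"event": "ME_phosphorylation", "inputs": ["e1"], "outputs": ["e5"]},
    "p2": {"event": "ME_binding", "inputs": ["e1", "e2"], "outputs": ["e3"]},
    "p3": {"event": "ME_translocation", "inputs": ["e2"], "outputs": ["e4"]},
}


def _pairs(proc, entities):
    """All (input entity, output entity) combinations of a process."""
    return [(entities[i], entities[o])
            for i in proc["inputs"] for o in proc["outputs"]]


def _same_xref(a, b):
    return "xref" in a and a["xref"] == b.get("xref")


def ok_translocation(proc, entities):    # Q2: same xref, different location
    return any(_same_xref(a, b)
               and "location" in a and "location" in b
               and a["location"] != b["location"]
               for a, b in _pairs(proc, entities))


def ok_binding(proc, entities):          # Q3: >= 2 inputs, one Complex output
    outputs = proc["outputs"]
    return (len(set(proc["inputs"])) >= 2 and len(outputs) == 1
            and entities[outputs[0]]["type"] == "Complex")


def ok_phosphorylation(proc, entities):  # Q4: same xref, phosphorylated output
    return any(_same_xref(a, b)
               and "FT_phosphorylated" in b.get("features", ())
               for a, b in _pairs(proc, entities))


CHECKS = {"ME_translocation": ok_translocation,
          "ME_binding": ok_binding,
          "ME_phosphorylation": ok_phosphorylation}


def alerts(processes, entities):
    """IDs of processes whose event-specific constraints are not satisfied."""
    return [pid for pid, proc in processes.items()
            if proc["event"] in CHECKS
            and not CHECKS[proc["event"]](proc, entities)]


print(alerts(PROCESSES, ENTITIES))  # no alerts for the toy model

# Break the translocation: output placed in the same compartment as the input.
broken = {**ENTITIES, "e4": {**ENTITIES["e4"], "location": "CC_cytoplasm"}}
print(alerts(PROCESSES, broken))
```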
3.3. Systematically correct models
CSO is an ontology to represent the dynamics of biological pathways and is intended to simulate complex molecular mechanisms at different levels of detail. Once a mathematical model of biological pathways has been generated, it is necessary to estimate any free parameters and unknown rate constants based on experimental data. We limit our consideration to generating a simulatable model ready for evaluation. We define a systematically correct model as a model that captures the generic behaviors governing the system dynamics. In this paper, we focus on protein turnover. Normally, proteins are synthesized within the cell and are gradually broken down over time into individual amino acids, and this cycle is repeated. To capture this behavior, we define three rules to recognize which entities are synthesized and degraded. R5 defines a starting entity as an entity other than a complex which is connected to processes only via Input connectors. This indicates that a starting entity is not a product of any process. A predicate with a superscript minus sign denotes the inverse of the predicate, e.g. $\mathrm{hasEntity}^{-}$. R6 identifies a starting entity whose type is complex. In addition, R7 is defined for the biological entities, except for genes, that are to be degraded.
R5: $\mathrm{STARTINGENTITY}(x_1) \leftarrow \mathrm{Entity}(x_1) \wedge \neg\mathrm{Complex}(x_1) \wedge \forall x_2.(\mathrm{hasEntity}^{-}(x_1, x_2) \rightarrow \mathrm{Input}(x_2))$
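R6: $\mathrm{STARTINGCOMPLEX}(x_1) \leftarrow \mathrm{Complex}(x_1) \wedge \forall x_2.(\mathrm{hasEntity}^{-}(x_1, x_2) \rightarrow \mathrm{Input}(x_2))$
R7: $\mathrm{DEGRADINGENTITY}(x_1) \leftarrow \mathrm{Protein}(x_1) \vee \mathrm{Complex}(x_1) \vee \mathrm{mRNA}(x_1)$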
The next three queries are generated from rules R5, R6, and R7, and complement the given models by adding new instances (add-instance) and properties (add-property). A variable in braces, e.g. $\{x_i\}$, denotes a new instance ID. In Q5, if a given entity is a STARTINGENTITY, i.e. its type is not complex, then a production process ($\{x_2\}$) as a pre-process of the entity, a connector ($\{x_3\}$) to relate $x_1$ and $\{x_2\}$, and any necessary properties are added. This makes the starting entity a product of the production process. In Q6, if a given entity is a STARTINGCOMPLEX, we assume that the complex is generated via a binding process whose participants are the components of the complex. Depending on the number of components of the complex, multiple connectors are added. For degrading entities, which include proteins, complexes, and mRNA, Q7 adds a degradation process with a connector between the entity and the degradation process. In the Petri net formalism, adding pre-processes for starting entities (complexes) lets those processes fire without any constraints when the simulation is started. All entities consume their initial concentrations at the starting point of the simulation. This complementation of the pathway data in CSO helps users to understand intuitively the given model and how it works.
Q5: If $\mathrm{STARTINGENTITY}(x_1)$ then
  add-instance Process($\{x_2\}$), OutputProcessBiological($\{x_3\}$)
  add-property hasBiologicalEvent($\{x_2\}$, ME_UnknownProduction), hasConnector($\{x_2\}$, $\{x_3\}$), hasEntity($\{x_3\}$, $x_1$)
Q6: If $\mathrm{STARTINGCOMPLEX}(x_1)$ then
  add-instance Process($\{x_2\}$), OutputProcessBiological($\{x_4\}$)
  add-property hasBiologicalEvent($\{x_2\}$, ME_Binding), $\mathrm{hasConnector}^{-}$($\{x_4\}$, $\{x_2\}$)
  for all hasComponents($x_1$, $x_3$) do
    add-instance InputProcessBiological($\{x_i\}$)
    add-property $\mathrm{hasConnector}^{-}$($x_3$, $\{x_i\}$)
Q7: If $\mathrm{DEGRADINGENTITY}(x_1)$ then
  add-instance Process($\{x_2\}$), InputProcessBiological($\{x_3\}$)
  add-property hasBiologicalEvent($\{x_2\}$, ME_Degradation), hasConnector($\{x_2\}$, $\{x_3\}$), hasEntity($\{x_3\}$, $x_1$)
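The complementation described by Q5, Q6 and Q7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: connections are flattened to (process, role, entity) triples, new processes receive generated IDs of the hypothetical form new_pN, and each added connector is summarized by one tuple instead of separate hasConnector and hasEntity facts.

```python
from itertools import count

# Entity types and (process, connector role, entity) triples of Fig. 1B.
ENTITY_TYPE = {"e1": "Protein", "e2": "Protein", "e3": "Complex",
               "e4": "Protein", "e5": "Protein"}
CONNECTIONS = [("p1", "Input", "e1"), ("p1", "Output", "e5"),
               ("p2", "Input", "e1"), ("p2", "Input", "e2"),
               ("p2", "Output", "e3"),
               ("p3", "Input", "e2"), ("p3", "Output", "e4")]


def complement(entity_type, connections, components=None):
    """Return added (new process, event, connector role, entity) tuples.

    `components` maps a complex to its parts (hasComponents); it is an
    assumed input, since Fig. 1B does not list complex components.
    """
    components = components or {}
    produced = {e for _, role, e in connections if role == "Output"}
    ids = count(1)
    added = []
    for entity, kind in entity_type.items():
        if entity not in produced:                      # R5 / R6
            pid = f"new_p{next(ids)}"
            if kind == "Complex":                       # Q6: pre-binding
                added.append((pid, "ME_Binding", "Output", entity))
                added += [(pid, "ME_Binding", "Input", part)
                          for part in components.get(entity, [])]
            else:                                       # Q5: production
                added.append((pid, "ME_UnknownProduction", "Output", entity))
        if kind in {"Protein", "Complex", "mRNA"}:      # R7 / Q7: degradation
            added.append((f"new_p{next(ids)}", "ME_Degradation", "Input", entity))
    return added


added = complement(ENTITY_TYPE, CONNECTIONS)
events = [event for _, event, _, _ in added]
print(events.count("ME_UnknownProduction"), events.count("ME_Degradation"))
```

For the model of Fig. 1B, only e1 and e2 are never produced by a process, so they receive production processes, while all five entities receive degradation processes.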
4. Experimental Results
To implement the rule-based system, we used AllegroGraph 2.2.5 [17] as the CSO data storage and query engine, the SPARQL query language [20] for querying, and Java applications and Perl scripts for query manipulation and knowledge base manipulation, respectively. AllegroGraph is an RDF graph database with support for SPARQL. The Signaling by FGFR pathway from Reactome (ID=190236) [19] is selected as an example. The 22 members of the fibroblast growth factor (FGF) family of growth factors mediate their cellular responses by binding to and activating the different isoforms encoded by the four receptor tyrosine kinases (RTKs) designated FGFR1, FGFR2, FGFR3, and FGFR4. These receptors are key regulators of several developmental processes in which cell fate and differentiation to various tissue lineages are determined. This leads to stimulation of intracellular signaling pathways that control cell proliferation, cell differentiation, cell migration, cell survival, and cell shape, depending on the cell type or stage of maturation [19]. The Reactome data exported in the BioPAX format are converted into the CSO format by BioPAX2CSO [4]. Figure 2 shows the result of BioPAX2CSO. In the figure, the boxes mark the places to be evaluated by the queries described in Sec. 3. Figure 3 shows the result of ontology validation for the same model as in Fig. 2. Through ontology validation, seven invalid connections are corrected and six starting complexes receive pre-binding processes. In addition, 15 unknown production and 43 degradation processes are added for starting entities and degrading entities, respectively. This validation makes the given model simulatable when loaded into Cell Illustrator without any changes. The results of the simulation are shown as charts in the lower part of Fig. 3.
5. Conclusions
We have presented a rule-based approach to provide qualified knowledge bases for biological pathways. Three criteria were proposed for ontology validation in terms of both Petri nets and biological meaning. The experimental result shows how ontology validation can be done by using rules in conjunction with CSO. The main contributions of this work are (1) a formal representation of biological events and biological behaviors and (2) new criteria for qualifying biological pathway knowledge. Our proposed method can be used for biological pathway models generated via data conversion and manual curation. In addition, it can be used as a plugin for modeling and simulation tools such as Cell Illustrator. When users create models, they are guided to generate models which are simulatable as well as biologically correct. As a result, the proposed method helps to generate qualified pathway models, which make it easy to explore the possible dynamic behavior of pathway components. In future work, we plan to define rules for as many of the biological events defined in CSO as possible. Furthermore, we will define more rules to capture generic
Fig. 2. The signaling by FGFR pathway from Reactome (ID=190236) [19] after conversion into CSO.
biological behaviors learned from modeling experts and the literature. For example, the speed of a process differs depending on the biological event: binding and dimerization may have different speeds; natural degradation is slower than other processes; and the transcription of mRNA is quicker than that of miRNA. Moreover, the time to translate a protein and the time to transcribe a gene differ depending on the species.
Fig. 3. The results of ontology validation of the pathway described in Fig. 2 and the simulation results with default values of parameters.
References
[1] Bader, G. and Cary, M., BioPAX - biological pathways exchange language level 2, version 1.0 documentation, 2005.
[2] Genrich, H.J., Küffner, R., and Voss, K., Executable Petri net models for the analysis of metabolic pathways, International Journal on Software Tools for Technology Transfer, 3(4):394-404, 2001.
[3] Hofestädt, R. and Thelen, S., Quantitative modeling of biochemical networks, In Silico Biol., 1(1):39-53, 1998.
[4] Jeong, E., Nagasaki, M., and Miyano, S., Conversion from BioPAX to CSO for system dynamics and visualization of biological pathway, Genome Informatics, 18:225-236, 2007.
[5] Jeong, E., Nagasaki, M., Saito, A., and Miyano, S., Cell system ontology: representation for modeling, visualizing, and simulating biological pathways, In Silico Biology, 7(6):623-638, 2007.
[6] Kojima, K., Nagasaki, M., Jeong, E., Kato, M., and Miyano, S., An efficient grid layout algorithm for biological networks utilizing various biological attributes, BMC Bioinformatics, 8:76, 2007.
[7] Kojima, K., Nagasaki, M., Miyano, S., Fast grid layout algorithm for biological networks with sweep calculation, Bioinformatics, 24(12):1433-1441, 2008.
[8] Nagasaki, M., Doi, A., Matsuno, H., and Miyano, S., A versatile Petri net based architecture for modeling and simulation of complex biological processes, Genome Informatics, 15(1):180-197, 2004.
[9] Nagasaki, M., Saito, A., Li, C., Jeong, E., and Miyano, S., Systematic reconstruction of TRANSPATH data into Cell System Markup Language, BMC Systems Biology, 2:53, 2008.
[10] Peleg, M., Yeh, I., and Altman, R.B., Modelling biological processes using workflow and Petri Net models, Bioinformatics, 18(6):825-837, 2002.
[11] Reddy, V.N., Liebman, M.N., and Mavrovouniotis, M.L., Qualitative analysis of biochemical reaction systems, Comput. Biol. Med., 26:9-24, 1996.
[12] Smith, M., Welty, C., and McGuinness, D., OWL Web Ontology Language Guide, 2004.
[13] http://protege.stanford.edu/, The Protege ontology editor and knowledge acquisition system.
[14] http://www.cellillustrator.com/, Cell Illustrator 3.0.
[15] http://cionline.hgc.jp/, Cell Illustrator Online.
[16] http://www.csml.org/, Cell System Markup Language (CSML).
[17] http://www.franz.com/, AllegroGraph - Web 3.0 database.
[18] http://www.mindswap.org/2004/SWOOP/, SWOOP - hypermedia-based OWL ontology browser and editor.
[19] http://www.reactome.org/, Reactome - a curated knowledge base of biological pathways.
[20] http://www.w3.org/TR/rdf-sparql-query/, SPARQL query language for RDF.
[21] http://www.biobase.de/, TRANSPATH the pathway databases.
ESTIMATION OF NONLINEAR GENE REGULATORY NETWORKS VIA L1 REGULARIZED NVAR FROM TIME SERIES GENE EXPRESSION DATA

KANAME KOJIMA (kaname@ims.u-tokyo.ac.jp)
ANDRE FUJITA (afujita@ims.u-tokyo.ac.jp)
TEPPEI SHIMAMURA (shima@ims.u-tokyo.ac.jp)
SEIYA IMOTO (imoto@ims.u-tokyo.ac.jp)
SATORU MIYANO (miyano@ims.u-tokyo.ac.jp)
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan

Recently, a nonlinear vector autoregressive (NVAR) model based on Granger causality was proposed to infer nonlinear gene regulatory networks from time series gene expression data. Since NVAR requires a large number of parameters due to the basis expansion, the length of time series microarray data is insufficient for accurate parameter estimation, and the size of the gene set must be strongly limited. To address this limitation, we employ an L1 regularization technique to estimate NVAR. Under L1 regularization, the direct parents of each gene can be selected efficiently even when the number of parameters exceeds the number of data samples. We can thus estimate larger gene regulatory networks more accurately than with existing methods. Through a simulation study, we verify the effectiveness of the proposed method by comparing its limitation on the number of genes to that of the existing NVAR. The proposed method is also applied to time series microarray data of the human HeLa cell cycle.
Keywords: time series gene expression data; gene regulatory networks; vector autoregression; B-spline; group LASSO
1. Introduction
Estimating gene regulatory networks from time series microarray data is essential for elucidating transcriptional systems. Recently, various statistical approaches have been proposed to capture gene regulations using dynamic Bayesian networks [13, 18], vector autoregressive models [7-9], and state space models [12, 25] based on statistical causality. In this study, we use the vector autoregressive model and capture gene regulations based on Granger causality. Linear vector autoregressive models are well established in statistics, and in the existing literature they have been applied to estimate gene regulatory networks. However, most regulations cannot be assumed to be linear [9], and we need to extend classical vector autoregressive models into nonlinear vector autoregressive models. Fujita et al. [9] introduced a nonparametric regression technique to the vector autoregressive model for estimating nonlinear and nonmonotonic regulations in gene
regulatory networks. In nonparametric regression, since a basis expansion technique is applied to build the nonlinear mean function, the number of parameters increases rapidly. In addition, the number of genes that can be handled is highly limited by the fact that time series microarray data are very short. Thus, we propose to use an L1 regularization technique to address the estimation of nonlinear and nonmonotonic gene regulatory networks. L1 regularized nonparametric regression reduces to a group LASSO problem [16, 21]. For the solution of group LASSO, we show a new efficient method based on the interior point method. Also, the estimates of group LASSO depend on the regularization parameter $\lambda$, which determines which variables are chosen. Therefore, an appropriate choice of $\lambda$ is essential for statistical modeling based on group LASSO. We investigate this problem from a Bayesian point of view and derive an information criterion to choose the value of $\lambda$. We apply the proposed method to an artificial network of ten genes and twenty edges [9]. Comparing the true positive rates of our proposed method with those of the methods based on ordinary least squares (OLS) and L2 penalization, i.e., the ridge estimator, under false discovery rate control verifies the effectiveness of the proposed method, especially for time series data of length less than 75. Our proposed method is also applied to time series gene expression data from the human HeLa cell cycle [24], and the obtained gene regulatory network is analyzed. This manuscript is organized as follows: Section 2.1 gives the definition of the group LASSO model and its efficient solution. The L1 regularized spline additive model and its relationship to group LASSO are described in Section 2.2. In Section 2.3, an information criterion for group LASSO is derived and the selection of the regularization parameter $\lambda$ is shown. A statistical test for Granger causality is illustrated in Section 2.4.
In Section 3, our proposed method is applied to the time series data from the artificial network and real data. Finally, we discuss our work in Section 4.
2. L1 spline additive regression
2.1. Preliminary

2.1.1. Vector autoregressive model

Given gene expression profile vectors of $p$ genes at $T$ time points, $\{x_1, \ldots, x_T\}$, the first-order vector autoregressive (VAR(1)) model at time point $t$ is given by:

$$x_t = A x_{t-1} + \varepsilon_t, \qquad (1)$$

where $A$ is a $p \times p$ autoregressive coefficient matrix, and $\varepsilon_t$ is a vector of normally distributed noise, $\varepsilon_{i,t} \sim N(0, \sigma_i^2)$, for the expression of gene $i$ at time point $t$. For simplicity of explanation, we use the following notation: $y_i = (x_{i,2}, \ldots, x_{i,T})'$, $X = (x_1, \ldots, x_{T-1})'$, $\beta_i = (a_{i,1}, \ldots, a_{i,p})'$, and $\varepsilon_i = (\varepsilon_{i,2}, \ldots, \varepsilon_{i,T})'$. With this notation, the autoregressive model for each gene $i$ in Equation (1) can be given as:

$$y_i = X\beta_i + \varepsilon_i. \qquad (2)$$
Granger [11] defined the concept of Granger causality, which rests on the principle that a cause cannot come after its effect. Thus, if a gene $x_i$ affects a gene $x_j$, the expression of gene $x_i$ should help improve the prediction of the expression of gene $x_j$. To decide whether $x_i$ has significant Granger causality to $x_j$, we test whether the autoregressive coefficient $a_{j,i}$ is 0.
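As an illustration of the VAR(1) model and of the role of the coefficient $a_{j,i}$, the following is a minimal sketch that simulates a two-gene system and estimates $A$ row by row with ordinary least squares. The function names (`simulate_var1`, `solve`, `fit_var1`), the zero initial state, the noise level, and the example matrix are assumptions made for this sketch and are not taken from the paper. The sketch only estimates coefficients; it does not carry out the significance test.

```python
import random

def simulate_var1(A, T, noise_sd=1.0, seed=0):
    """Generate T time points from x_t = A x_{t-1} + eps_t (zero initial state assumed)."""
    rng = random.Random(seed)
    p = len(A)
    x = [[0.0] * p]
    for _ in range(T - 1):
        prev = x[-1]
        x.append([sum(A[i][j] * prev[j] for j in range(p)) + rng.gauss(0.0, noise_sd)
                  for i in range(p)])
    return x

def solve(M, b):
    """Solve M z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * pc for a, pc in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_var1(x):
    """OLS estimate of A: row i solves the normal equations (X'X) beta_i = X'y_i."""
    p = len(x[0])
    X = x[:-1]                      # rows x_1, ..., x_{T-1}
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    A_hat = []
    for i in range(p):
        y = [row[i] for row in x[1:]]   # y_i = (x_{i,2}, ..., x_{i,T})'
        Xty = [sum(r[a] * yt for r, yt in zip(X, y)) for a in range(p)]
        A_hat.append(solve(XtX, Xty))
    return A_hat

# Hypothetical example: the first gene drives the second, but not the reverse.
A_true = [[0.5, 0.0], [0.8, 0.3]]
A_hat = fit_var1(simulate_var1(A_true, T=2000))
print(A_hat)
```

In this example the estimated entry for "first gene to second gene" should be clearly non-zero, while the entry for the reverse direction should be close to zero, which is the pattern the test on $a_{j,i}$ is meant to detect.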
2.1.2. Linear autoregression with grouped covariates

We consider that the $p$ covariates are partitioned into $G$ disjoint groups and rewrite the regression model in Equation (2) as:

$$y_i = \sum_{g=1}^{G} X_g \beta_{i,g} + \varepsilon_i,$$

where $\beta_{i,g}$ is the sub-vector of $\beta_i$ corresponding to the $p_g$ covariates in the $g$th group, and $X_g$ is the $(T-1) \times p_g$ matrix of columns corresponding to the covariates in the $g$th group. Like LASSO, group LASSO [26] can impose the restriction that all coefficients in some $\beta_{i,g}$'s are simultaneously and exactly zero. The estimates of group LASSO are obtained by solving the following minimization:

$$\arg\min_{\beta_i} \Big\{ \Big(y_i - \sum_g X_g \beta_{i,g}\Big)'\Big(y_i - \sum_g X_g \beta_{i,g}\Big) + \lambda \sum_g \sqrt{\beta_{i,g}' K_{i,g} \beta_{i,g}} \Big\}, \qquad (3)$$

where $K_{i,g}$ is a $p_g \times p_g$ positive semi-definite matrix. Since Equation (3) is a convex optimization problem but is not differentiable at $\beta_{i,g} = 0$, Park and Hastie [21] proposed to use an interior point method that introduces dummy variables.
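To make the structure of the objective in Equation (3) concrete, the following minimal sketch evaluates the residual sum of squares plus the grouped penalty for given coefficients. It is only an evaluator, not a solver; the function names and the list-of-lists data layout are assumptions chosen for this illustration.

```python
import math

def quad_form(v, K):
    """Return v' K v for a vector v and a square matrix K (lists of floats)."""
    n = len(v)
    return sum(v[a] * K[a][b] * v[b] for a in range(n) for b in range(n))

def group_lasso_objective(y, X_groups, beta_groups, K_groups, lam):
    """Evaluate the group LASSO objective of Equation (3).

    X_groups[g] holds the rows of X_g, beta_groups[g] is beta_{i,g},
    and K_groups[g] is the penalty matrix K_{i,g}.
    """
    fitted = [0.0] * len(y)
    for Xg, bg in zip(X_groups, beta_groups):
        for t, row in enumerate(Xg):
            fitted[t] += sum(r * b for r, b in zip(row, bg))
    rss = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))
    # Each group contributes sqrt(beta' K beta); a group of all zeros contributes 0.
    penalty = sum(math.sqrt(quad_form(bg, Kg)) for bg, Kg in zip(beta_groups, K_groups))
    return rss + lam * penalty

# Hypothetical toy data: two time points, one group of size 2 and one of size 1.
y = [1.0, 2.0]
X_groups = [[[1.0, 0.0], [0.0, 1.0]], [[1.0], [1.0]]]
K_groups = [[[1.0, 0.0], [0.0, 1.0]], [[1.0]]]
print(group_lasso_objective(y, X_groups, [[3.0, 4.0], [0.0]], K_groups, lam=2.0))
```

Because the penalty is a sum of unsquared norms, one per group, it is not differentiable where a whole group vanishes, which is what allows entire groups to be estimated as exactly zero.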
2.1.3. Bayesian information criterion

Given data $D$, we may select the model $M$ of maximum posterior probability $P(M \mid D)$ among the models of interest. If the prior probability $P(M)$ is assumed to be uniform over the models, then by Bayes' theorem the posterior probability of $M$ is proportional to the marginal likelihood $P(D \mid M)$. Suppose that a model $M$ is characterized by a parametric model $f(D \mid \theta)$ and a prior distribution $\pi(\theta \mid M)$ for the parameter $\theta$. The marginal likelihood of model $M$ with respect to data $D$ is given by:

$$P(D \mid M) = \int f(D \mid \theta)\, \pi(\theta \mid M)\, d\theta.$$

The Bayesian information criterion [1, 22] was proposed as an approximation of $-2$ times the log marginal likelihood, and hence, under a uniform model prior, of the log posterior probability of the model up to a constant, to select the optimal model based on the data:

$$\mathrm{BIC} \approx -2 \log P(D \mid M) = -2 \log \int f(D \mid \theta)\, \pi(\theta \mid M)\, d\theta.$$
2.2. L1 regularized spline additive model for gene regulatory network estimation

In nonparametric regression, spline functions are often used for constructing regressors. Let $s_{i,j}(x_{j,t})$ be the spline function for the expression of gene $j$ at time point $t$, $x_{j,t}$. In this study, third-order B-splines are used as the basis of the spline function, and the spline function $s_{i,j}(x_{j,t})$ for variable $x_{j,t}$ is represented by $\sum_k \gamma_{i,j,k}\, b_{i,j,k}(x_{j,t})$. The smoothing spline additive model is obtained by minimizing the loss function:

$$\sum_{t=2}^{T} \Big\{ x_{i,t} - \sum_{j=1}^{p} s_{i,j}(x_{j,t-1}) \Big\}^2 + \lambda \sum_{j=1}^{p} \int \Big\{ \frac{d^2}{dx^2} s_{i,j}(x) \Big\}^2 dx.$$
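Lin and Zhang [16], and Bach et al. [3] extended the above smoothing spline additive model to the L1 regularized spline additive model, in which L1 norms of the first and second derivatives of the spline functions are used as the penalty. In the L1 regularized spline additive model, the following loss function is optimized:

$$\sum_{t=2}^{T} \Big\{ x_{i,t} - \sum_{j=1}^{p} s_{i,j}(x_{j,t-1}) \Big\}^2 + \lambda \sum_{j=1}^{p} \sqrt{ \int \Big\{ \frac{d}{dx} s_{i,j}(x) \Big\}^2 dx + \int \Big\{ \frac{d^2}{dx^2} s_{i,j}(x) \Big\}^2 dx }. \qquad (4)$$

In B-splines, $\int \{ \frac{d}{dx} s_{i,j}(x) \}^2 dx$ and $\int \{ \frac{d^2}{dx^2} s_{i,j}(x) \}^2 dx$ can be given in the following forms [4]:

$$\int \Big\{ \frac{d}{dx} s_{i,j}(x) \Big\}^2 dx = \gamma_{i,j}' D_{1,i,j} \gamma_{i,j}, \qquad \int \Big\{ \frac{d^2}{dx^2} s_{i,j}(x) \Big\}^2 dx = \gamma_{i,j}' D_{2,i,j} \gamma_{i,j}.$$

Therefore, we can rewrite Equation (4) as:

$$\Big(y_i - \sum_{j=1}^{p} B_{i,j}\gamma_{i,j}\Big)'\Big(y_i - \sum_{j=1}^{p} B_{i,j}\gamma_{i,j}\Big) + \lambda \sum_{j=1}^{p} \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}}, \qquad (5)$$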
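where $y_i = (x_{i,2}, \ldots, x_{i,T})'$ as in Section 2.1.1, $m_{i,j}$ is the number of basis functions for variable $x_j$, $\gamma_{i,j} = (\gamma_{i,j,1}, \ldots, \gamma_{i,j,m_{i,j}})'$, $E_{i,j} = D_{1,i,j} + D_{2,i,j}$, and $B_{i,j}$ is a $(T-1) \times m_{i,j}$ matrix:

$$B_{i,j} = \begin{pmatrix} b_{i,j,1}(x_{j,1}) & \cdots & b_{i,j,m_{i,j}}(x_{j,1}) \\ \vdots & \ddots & \vdots \\ b_{i,j,1}(x_{j,T-1}) & \cdots & b_{i,j,m_{i,j}}(x_{j,T-1}) \end{pmatrix}.$$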
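The basis matrix $B_{i,j}$ can be illustrated with a short sketch that evaluates B-spline basis functions by the Cox-de Boor recursion and stacks them row by row over the lagged expression values. Several choices here are assumptions for illustration only: "third-order" is read as polynomial degree 3, the knot vector is clamped on $[0, 1]$ with no interior knots (which gives exactly four basis functions), and the function names are hypothetical. The paper does not state its knot placement.

```python
def bspline_basis(x, knots, degree):
    """Values at x of all B-spline basis functions of the given degree (Cox-de Boor)."""
    n_int = len(knots) - 1
    # Degree 0: indicator functions of the half-open knot intervals.
    vals = [1.0 if knots[k] <= x < knots[k + 1] else 0.0 for k in range(n_int)]
    if x == knots[-1]:
        # Close the last non-empty interval so the right end point is covered.
        for k in range(n_int - 1, -1, -1):
            if knots[k] < knots[k + 1]:
                vals[k] = 1.0
                break
    for d in range(1, degree + 1):
        new = []
        for k in range(n_int - d):
            left_den = knots[k + d] - knots[k]
            right_den = knots[k + d + 1] - knots[k + 1]
            # Terms with a zero denominator (repeated knots) are defined as zero.
            left = (x - knots[k]) / left_den * vals[k] if left_den > 0 else 0.0
            right = (knots[k + d + 1] - x) / right_den * vals[k + 1] if right_den > 0 else 0.0
            new.append(left + right)
        vals = new
    return vals

def design_matrix(xs, knots, degree):
    """Stack the basis values row by row: row t holds b_1(x_t), ..., b_m(x_t)."""
    return [bspline_basis(x, knots, degree) for x in xs]

# Assumed setup: degree 3, clamped knots on [0, 1], giving four basis functions.
knots = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
B = design_matrix([0.0, 0.25, 0.5, 1.0], knots, degree=3)
for row in B:
    print(row)
```

Each row sums to one (partition of unity), and the number of columns equals `len(knots) - degree - 1`, so the size of each coefficient group $\gamma_{i,j}$ is controlled entirely by the knot vector.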
In the L1 regularized spline additive model, we would like to evaluate whether all coefficients of some splines are simultaneously and exactly zero, so we can use the procedure based on group LASSO. However, the use of dummy variables increases the number of variables to be handled. In addition, unstable constraints caused by the dummy variables slow down convergence. Thus, we propose to convert the optimization problem in Equation (5) to:

$$\arg\min_{\gamma_i} \gamma_i' B_i' B_i \gamma_i \quad \text{subject to} \quad \frac{\lambda^2}{4} \ge (y_i - B_i\gamma_i)' B_{i,j} E_{i,j}^{-1} B_{i,j}' (y_i - B_i\gamma_i), \quad j = 1, \ldots, p, \qquad (6)$$

where $B_i = (B_{i,1}, \ldots, B_{i,p})$ and $\gamma_i = (\gamma_{i,1}', \ldots, \gamma_{i,p}')'$. The optimization problem in Equation (6) can be solved by the interior point method without using dummy variables. See Appendix A for details.
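2.3. Bayesian information criterion for nonparametric group LASSO regression

Selection of the regularization parameter $\lambda$ is important for variable selection and coefficient estimation in group LASSO. We derive a Bayesian information criterion for the L1 spline additive model and choose the $\lambda$ that minimizes the criterion. From the viewpoint of Bayesian statistics, the probabilistic model of the L1 spline additive model can be characterized by the likelihood function $f(y_i \mid B_i, \gamma_i, \sigma_i^2)$ of linear regression together with the product of Laplacian priors $\pi_{i,j}(\gamma_{i,j} \mid \sigma_i^2, \lambda)$ for $\gamma_{i,j}$, given by:

$$f(y_i \mid B_i, \gamma_i, \sigma_i^2) = \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Big\{ -\frac{1}{2\sigma_i^2} (y_i - B_i\gamma_i)'(y_i - B_i\gamma_i) \Big\}, \qquad (7)$$

$$\pi_{i,j}(\gamma_{i,j} \mid \sigma_i^2, \lambda) = L_{i,j} \exp\Big( -\frac{\lambda}{2\sigma_i^2} \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}} \Big), \qquad (8)$$

where

$$L_{i,j} = \frac{1}{2\pi^{p_{i,j}/2}} \Big( \frac{\lambda}{2\sigma_i^2} \Big)^{p_{i,j}} \det(E_{i,j})^{1/2}\, \frac{\Gamma(p_{i,j}/2)}{\Gamma(p_{i,j})}$$

and $p_{i,j}$ denotes the dimension of $\gamma_{i,j}$.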
Using Equations (7) and (8) for the model in Equation (4), the marginal likelihood of the model based on group LASSO can be given as:

$$P(D \mid M) = \int f(y_i \mid B_i, \gamma_i, \sigma_i^2) \prod_j \pi_{i,j}(\gamma_{i,j} \mid \sigma_i^2, \lambda)\, d\gamma_i. \qquad (9)$$
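Note that the variance $\sigma_i^2$ is considered to be known. For unknown $\sigma_i^2$, we use $\hat{\sigma}_i^2 = \sum_{t=2}^{T} (y_{i,t} - \hat{y}_{i,t})^2 / (T-1)$ as the estimator of $\sigma_i^2$. Hereafter, we omit $\sigma_i^2$ and $\lambda$ in $f(y_i \mid B_i, \gamma_i, \sigma_i^2)$ and $\pi_{i,j}(\gamma_{i,j} \mid \sigma_i^2, \lambda)$ if no confusion occurs. In the following, we explain how the integration in Equation (9) is solved. Let $A_i$ be the set of group vectors $\gamma_{i,j}$ estimated as non-zero in group LASSO. If $\gamma_{i,k}$ is not in $A_i$, i.e., it is estimated as exactly 0 in group LASSO, this implies that the Laplacian prior of $\gamma_{i,k}$ is much stronger than the likelihood function. Thus, we approximately calculate the integration over these $\gamma_{i,k} \notin A_i$, ignoring $\gamma_{i,k}$ in the likelihood function:

$$\begin{aligned}
\int f(y_i \mid B_i, \gamma_i)\, \pi_{i,k}(\gamma_{i,k})\, d\gamma_{i,k}
&= \int \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Big\{ -\frac{1}{2\sigma_i^2} \Big(y_i - \sum_j B_{i,j}\gamma_{i,j}\Big)'\Big(y_i - \sum_j B_{i,j}\gamma_{i,j}\Big) \Big\}\, \pi_{i,k}(\gamma_{i,k})\, d\gamma_{i,k} \\
&\approx \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Big\{ -\frac{1}{2\sigma_i^2} \Big(y_i - \sum_{j \ne k} B_{i,j}\gamma_{i,j}\Big)'\Big(y_i - \sum_{j \ne k} B_{i,j}\gamma_{i,j}\Big) \Big\}.
\end{aligned}$$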
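After integrating out all $\gamma_{i,j}$, $j \notin A_i$, we have:

$$\mathrm{BIC}_{GL} \approx -2 \log \int f_{A_i}(y_i \mid B_{A_i}, \gamma_{A_i})\, \pi_{A_i}(\gamma_{A_i})\, d\gamma_{A_i}, \qquad (10)$$

where

$$\begin{aligned}
f_{A_i}(y_i \mid B_{A_i}, \gamma_{A_i})
&= \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Big\{ -\frac{1}{2\sigma_i^2} \Big(y_i - \sum_{j \in A_i} B_{i,j}\gamma_{i,j}\Big)'\Big(y_i - \sum_{j \in A_i} B_{i,j}\gamma_{i,j}\Big) \Big\} \\
&= \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Big\{ -\frac{1}{2\sigma_i^2} (y_i - B_{A_i}\gamma_{A_i})'(y_i - B_{A_i}\gamma_{A_i}) \Big\},
\end{aligned}$$

$$\pi_{A_i}(\gamma_{A_i}) = \prod_{j \in A_i} \pi_{i,j}(\gamma_{i,j}).$$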
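Here, $B_{A_i}$ is the sub-matrix of $B_i$ for the covariates in $A_i$, and $\gamma_{A_i}$ is the sub-vector of $\gamma_i$ for the covariates in $A_i$. For the integration with respect to $\gamma_{i,j}$, $j \in A_i$, the Laplace approximation is used. By the Laplace approximation, the integral is approximated as:

$$\int \exp\{q(\theta)\}\, d\theta \approx \exp\{q(\hat{\theta})\}\, (2\pi)^{p/2} \Big/ \Big| -\frac{\partial^2 q(\hat{\theta})}{\partial\theta\, \partial\theta'} \Big|,$$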
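where $\hat{\theta} = \arg\max_\theta q(\theta)$. Applying the Laplace approximation to Equation (10), we have:

$$\log \int f_{A_i}(y_i \mid B_{A_i}, \gamma_{A_i})\, \pi_{A_i}(\gamma_{A_i})\, d\gamma_{A_i} \approx \log f_{A_i}(y_i \mid B_{A_i}, \hat{\gamma}_{A_i})\, \pi_{A_i}(\hat{\gamma}_{A_i}) + \frac{|A_i|}{2} \log 2\pi - \log\det J(\hat{\gamma}_{A_i}),$$

where $|A_i|$ is the length of $\gamma_{A_i}$, and $J(\hat{\gamma}_{A_i})$ is an $|A_i| \times |A_i|$ matrix given as:

$$\begin{aligned}
J(\hat{\gamma}_{A_i})
&= -\frac{\partial^2}{\partial\gamma_{A_i}\, \partial\gamma_{A_i}'} \log f_{A_i}(y_i \mid B_{A_i}, \gamma_{A_i})\, \pi_{A_i}(\gamma_{A_i}) \Big|_{\gamma_{A_i} = \hat{\gamma}_{A_i}} \\
&= \frac{1}{\sigma_i^2} \Big( B_{A_i}' B_{A_i} + \frac{\lambda}{2}\, \mathrm{diag}\Big[ \frac{E_{i,j}}{\sqrt{\hat{\gamma}_{i,j}' E_{i,j} \hat{\gamma}_{i,j}}} - \frac{E_{i,j}\hat{\gamma}_{i,j} \{E_{i,j}\hat{\gamma}_{i,j}\}'}{(\hat{\gamma}_{i,j}' E_{i,j} \hat{\gamma}_{i,j})^{3/2}} \Big]_{j \in A_i} \Big). \qquad (11)
\end{aligned}$$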
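Thus, $\mathrm{BIC}_{GL}$ is given as:

$$\begin{aligned}
\mathrm{BIC}_{GL} &= -2 \log f_{A_i}(y_i \mid B_{A_i}, \hat{\gamma}_{A_i})\, \pi_{A_i}(\hat{\gamma}_{A_i}) - |A_i| \log 2\pi + 2 \log\det J(\hat{\gamma}_{A_i}) \\
&= -(T - 1 + 2|A_i|) \log 2 - (T - 1) \log \pi - (T - 1 + |A_i|) \log \sigma_i^2 + 2|A_i| \log \lambda \\
&\quad + 2 \sum_{j \in A_i} \Big( \log\Gamma(p_{i,j}/2) - \log\Gamma(p_{i,j}) + \frac{1}{2} \log |E_{i,j}| \Big) \\
&\quad - \frac{1}{\sigma_i^2} \Big\{ (y_i - B_{A_i}\hat{\gamma}_{A_i})'(y_i - B_{A_i}\hat{\gamma}_{A_i}) + \lambda \sum_{j \in A_i} \sqrt{\hat{\gamma}_{i,j}' E_{i,j} \hat{\gamma}_{i,j}} \Big\} - \log |J(\hat{\gamma}_{A_i})|. \qquad (12)
\end{aligned}$$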
2.4. Wald test for Granger causality
The variables selected by group LASSO are considered as the candidate variables having Granger causality to the response. In order to control the false discovery rate of those candidates, we test the coefficients of the basis functions $\gamma_{i,j}$ corresponding to each selected variable $x_j$. Usually, to test whether all the coefficients of grouped variables in linear regression are simultaneously zero, i.e., $\gamma_{i,j} = 0$, we may use the Wald test. However, the Wald test is based on the asymptotic normality of maximum likelihood estimators. Since group LASSO is not a maximum likelihood method due to the existence of the Laplacian prior, the Wald test cannot be applied directly to the estimators of group LASSO. Konishi and Kitagawa [15] considered that a parameter $\theta$ is represented as a functional $T(G)$ of the true distribution $G(x)$ and that the estimator $\hat{\theta}$ of $\theta$ is given by $T(\hat{G})$, where $\hat{G}$ is the empirical distribution of $G$. The asymptotic normality of $T(\hat{G})$ was shown:

$$\sqrt{n}\, \big(T(\hat{G}) - T(G)\big) \to N\Big(0, \int T^{(1)}(x, G)\, \{T^{(1)}(x, G)\}'\, dG(x)\Big) \quad \text{in law},$$

where $T^{(1)}(x, G)$ is the influence function for $T(G)$ given by:

$$T^{(1)}(x, G) = \lim_{\epsilon \to 0} \frac{T((1 - \epsilon)G + \epsilon\delta_x) - T(G)}{\epsilon}.$$

Here, $\delta_x$ is a distribution function having a probability of 1 at point $x$. Since various estimators, including the maximum likelihood estimator and the maximum penalized likelihood estimator, can be represented as $T(\hat{G})$, we exploit this property for the Wald test. Let $T_{A_i}(G)$ be a functional for $\gamma_{A_i}$, the group LASSO coefficients in $A_i$. Due to the KKT conditions for group LASSO estimators [3, 21], the functional $T_{A_i}(G)$ for $\gamma_{A_i}$ satisfies

$$\int \Psi_{A_i}(y, T_{A_i}(G))\, dG(y) = 0,$$

where $\Psi_{A_i}$ denotes the estimating function determined by those KKT conditions.
In addition, it is natural to assume that the set of groups $A_i$ selected by group LASSO is invariant under a small perturbation $\epsilon\delta_x$ to the distribution [23]. Thus, for small $\epsilon$, we have:

$$\int \Psi_{A_i}\big(y, T_{A_i}((1 - \epsilon)G + \epsilon\delta_x)\big)\, d\big((1 - \epsilon)G(y) + \epsilon\delta_x(y)\big) = 0.$$
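By following the derivation of the influence function for M-estimators in [15], the influence function $T^{(1)}_{A_i}(x, G)$ for $T_{A_i}(G)$ is given as:

$$T^{(1)}_{A_i}(x, G) = \Big\{ -\int \frac{\partial}{\partial\gamma_{A_i}'} \Psi_{A_i}(y, \gamma_{A_i}) \Big|_{\gamma_{A_i} = T_{A_i}(G)} dG(y) \Big\}^{-1} \Psi_{A_i}(x, T_{A_i}(G)).$$

By using the empirical distribution $\hat{G}$ for $G$, we obtain an estimate $\hat{\Sigma}(\hat{\gamma}_{A_i})$ of the covariance matrix of the group LASSO estimator. It is built from the matrix $J(\hat{\gamma}_{A_i})$ given by Equation (11) and a matrix $I(\hat{\gamma}_{A_i})$ that depends on $B_{A_i}$, $\Lambda$, and $w_{A_i}$. Here, $\Lambda = \mathrm{diag}[x_{i,t} - \hat{\gamma}_{A_i}' b_{A_i,t}]$, and $w_{A_i}$ is a vector comprised of $E_{i,j}\hat{\gamma}_{i,j} / \sqrt{\hat{\gamma}_{i,j}' E_{i,j} \hat{\gamma}_{i,j}}$, $j \in A_i$.

For the null hypothesis $H_0 : R\gamma_{A_i} = r$, we can derive the Wald statistic $W_{GL}$ for the group LASSO coefficients as follows:

$$W_{GL} = (R\hat{\gamma}_{A_i} - r)' \{ R\, \hat{\Sigma}(\hat{\gamma}_{A_i})\, R' \}^{-1} (R\hat{\gamma}_{A_i} - r) \to \chi^2_{\mathrm{rank}(R)} \quad \text{in law}.$$

For example, suppose that $\gamma_{A_i} = (\gamma_{i,1}', \gamma_{i,2}', \gamma_{i,3}', \gamma_{i,4}')'$ and we would like to evaluate the null hypothesis $H_0 : \gamma_{i,2} = 0$. We then set $R = (O_{m_{i,2},m_{i,1}}, I_{m_{i,2}}, O_{m_{i,2},m_{i,3}}, O_{m_{i,2},m_{i,4}})$ and $r = 0$, where $O_{m,n}$ is an $m \times n$ matrix whose elements are zero and $I_m$ is the identity matrix of size $m$.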
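The block structure of $R$ in the example above can be made concrete with a small sketch. It builds the selection matrix for one group from the list of group sizes and applies it to a stacked coefficient vector. The function names and the numeric values are hypothetical and serve only to illustrate the construction; computing the Wald statistic itself would additionally require the covariance estimate.

```python
def selection_matrix(group_sizes, target):
    """Build R = (O, ..., I, ..., O) that picks out the coefficients of one group.

    group_sizes[j] is the number of basis coefficients of the j-th selected
    variable; target is the (0-based) position of the group under test.
    """
    rows = group_sizes[target]
    offset = sum(group_sizes[:target])
    total = sum(group_sizes)
    return [[1.0 if c == offset + r else 0.0 for c in range(total)]
            for r in range(rows)]

def mat_vec(R, v):
    """Return the matrix-vector product R v."""
    return [sum(rc * vc for rc, vc in zip(row, v)) for row in R]

# Four selected variables with four basis coefficients each; test the second group.
R = selection_matrix([4, 4, 4, 4], target=1)
gamma_hat = [float(k) for k in range(16)]   # placeholder coefficient values
print(mat_vec(R, gamma_hat))
```

With `r = 0`, the vector `mat_vec(R, gamma_hat)` is exactly the quantity $R\hat{\gamma}_{A_i} - r$ that enters the quadratic form of the Wald statistic.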
3. Numerical examples
3.1. Simulation data examples

We use an artificial network of ten genes having twenty linear and nonlinear relationships to show the performance of our proposed method, L1 NVAR. As competitors of L1 NVAR, the OLS-based nonlinear vector autoregressive (NVAR) model [9] and the nonlinear vector autoregressive model with L2 penalization (L2 NVAR) are employed. In L2 NVAR, the L1 penalization in Equation (5) is replaced by L2 penalization, and the regularization parameter is selected by the Bayesian information criterion. A Wald test derived in a similar manner to that for L1 NVAR is used to capture significant Granger causalities of L2 NVAR. The twenty edges in the artificial network are set as follows:

$$\begin{aligned}
x_{1,t} &= 0.5\, x_{1,t-1} + \varepsilon_{1,t} \\
x_{2,t} &= 0.6\, x_{2,t-1} + \varepsilon_{2,t} \\
x_{3,t} &= 0.7\, x_{3,t-1} + \varepsilon_{3,t} \\
x_{4,t} &= 0.8\, x_{4,t-1} + \varepsilon_{4,t} \\
x_{5,t} &= 0.9\, x_{5,t-1} + \varepsilon_{5,t} \\
x_{6,t} &= \sin(x_{1,t-1}) + 0.5\, x_{2,t-1} - 0.5\, x_{9,t-1} + 2 + \varepsilon_{6,t} \\
x_{7,t} &= 2\cos(x_{2,t-1}) - 2\sin(x_{3,t-1}) + 0.6\, x_{10,t-1} + \varepsilon_{7,t} \\
x_{8,t} &= 0.8\cos(x_{3,t-1}) + 0.6\, x_{4,t-1} + \cos(x_{6,t-1}) + 1 + \varepsilon_{8,t} \\
x_{9,t} &= \sin(x_{4,t-1}) + \cos(x_{5,t-1}) - 0.8\, x_{7,t-1} + \varepsilon_{9,t} \\
x_{10,t} &= \sin(x_{1,t-1}) - 0.8\, x_{5,t-1} + \cos(x_{8,t-1}) + \varepsilon_{10,t}
\end{aligned}$$
A graphical representation of the artificial network drawn with Cell Illustrator [19, 27] is shown in Figure 1. From the artificial network, we generate time series data of various lengths {10, 20, 30, 40, 50, 75, 100} and apply NVAR, L1 NVAR, and L2 NVAR to them. Since the time series are too short for the estimation, the number of B-splines is set to four. We repeat the experiment 100 times for each time series length. Granger causalities are estimated under a false discovery rate of 5%.
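The data generation step can be sketched as follows. The update rules follow the ten equations of the artificial network; the zero initial state, the Gaussian noise with standard deviation `noise_sd`, the seed handling, and the function name `simulate_network` are assumptions of this sketch, since the noise variance and initial conditions are not stated in the text.

```python
import math
import random

def simulate_network(T, noise_sd=1.0, seed=0):
    """Generate T time points of the ten-gene artificial network.

    Returns a list of T rows; row t holds the expression of genes 1..10.
    """
    rng = random.Random(seed)
    x = [[0.0] * 11]          # index 0 is unused so that p[k] is gene k
    for _ in range(T - 1):
        p = x[-1]
        e = [rng.gauss(0.0, noise_sd) for _ in range(11)]
        n = [0.0] * 11
        n[1] = 0.5 * p[1] + e[1]
        n[2] = 0.6 * p[2] + e[2]
        n[3] = 0.7 * p[3] + e[3]
        n[4] = 0.8 * p[4] + e[4]
        n[5] = 0.9 * p[5] + e[5]
        n[6] = math.sin(p[1]) + 0.5 * p[2] - 0.5 * p[9] + 2 + e[6]
        n[7] = 2 * math.cos(p[2]) - 2 * math.sin(p[3]) + 0.6 * p[10] + e[7]
        n[8] = 0.8 * math.cos(p[3]) + 0.6 * p[4] + math.cos(p[6]) + 1 + e[8]
        n[9] = math.sin(p[4]) + math.cos(p[5]) - 0.8 * p[7] + e[9]
        n[10] = math.sin(p[1]) - 0.8 * p[5] + math.cos(p[8]) + e[10]
        x.append(n)
    return [row[1:] for row in x]

# One replicate per time series length used in the simulation study.
for length in [10, 20, 30, 40, 50, 75, 100]:
    data = simulate_network(length, seed=length)
    print(length, len(data), len(data[0]))
```

Repeating the call with 100 different seeds per length would mirror the repeated-experiment design described above.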
Fig. 1. An artificial network of 10 genes and 20 edges used in the simulation study. Some of the edges represent nonlinear causality.
First, in order to verify the false discovery rate control, we calculate the true false discovery rates by comparing the edges in the artificial network with the significantly estimated Granger causalities in NVAR, L1 NVAR, and L2 NVAR for each time series length. Those true false discovery rates are summarized in Table 1. When the time series is short, i.e., the data are not sufficient, the false discovery rate is not controlled within 5% in L1 NVAR. In L2 NVAR, the false discovery rate far exceeds 5% for all time series lengths. This problem may be related to the convergence of the covariance matrix for the coefficients.
In the Wald tests for L1 NVAR and L2 NVAR, asymptotic normality is used for the derivation of the covariance matrix. On the other hand, the covariance matrix in NVAR coincides with its unbiased estimator, and thus the asymptotic normality of the maximum likelihood estimator is actually not used. This hypothesis is supported by the fact that the false discovery rate is controlled for all time series lengths in NVAR. For relatively long time series data, e.g., time series data of length 50, the false discovery rate is correctly controlled in L1 NVAR, while it is out of control in L2 NVAR. In L1 NVAR, some variables are dropped in estimation, and thus the convergence of the covariance matrix is faster than in the case considering all the variables. However, in L2 NVAR, no variable is dropped in the estimation. Thus, the false discovery rate approaches 5% as the time series length increases, but it has not yet converged even for the longest time series considered here.

Table 1. True false discovery rates obtained by comparing the artificial network and the estimated Granger causalities in NVAR, L1 NVAR, and L2 NVAR under a controlled false discovery rate of 5% (mean ± standard deviation, in %).

Time series length   NVAR           L1 NVAR          L2 NVAR
10                   -              40.54 ± 35.71    67.13 ± 20.25
20                   -              18.72 ± 16.23    65.11 ± 13.55
30                   -              11.64 ± 11.04    61.23 ± 8.20
40                   -              7.40 ± 8.95      55.59 ± 8.27
50                   2.23 ± 9.02    4.13 ± 5.88      49.84 ± 8.49
75                   3.83 ± 6.02    3.48 ± 4.42      39.67 ± 10.01
100                  4.37 ± 5.21    2.18 ± 4.03      31.91 ± 8.88
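The quantities reported in this evaluation can be computed from two edge sets, as in the following minimal sketch. Edges are represented as (parent, child) pairs; this representation, the function name, and the convention of returning 0 when a set is empty are assumptions for illustration.

```python
def true_fdr_and_tpr(true_edges, estimated_edges):
    """Compare estimated edges with the known network.

    Returns (true false discovery rate, true positive rate).
    """
    true_edges = set(true_edges)
    estimated_edges = set(estimated_edges)
    tp = len(true_edges & estimated_edges)       # correctly detected edges
    fp = len(estimated_edges - true_edges)       # detected edges not in the network
    fdr = fp / len(estimated_edges) if estimated_edges else 0.0
    tpr = tp / len(true_edges) if true_edges else 0.0
    return fdr, tpr

# Hypothetical example using the three parents of gene 6 in the artificial network.
truth = {(1, 6), (2, 6), (9, 6)}
found = {(1, 6), (2, 6), (3, 6), (4, 6)}
print(true_fdr_and_tpr(truth, found))
```

Averaging these two numbers over the repeated experiments for each time series length gives summaries of the kind shown in Table 1 and Figure 2.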
True positive rates of the estimated Granger causalities in NVAR, L1 NVAR, and L2 NVAR under a false discovery rate of 5% are compared in Figure 2(a). Since the false discovery rate in L2 NVAR is completely out of control, we also calculate the true positive rates obtained by controlling the true false discovery rate within 5% in Figure 2(b). According to the results in Figure 2(a), L1 NVAR clearly outperforms NVAR for time series data of length less than 75. On the other hand, NVAR gives slightly better performance than L1 NVAR for long time series data. However, L1 NVAR is designed for the estimation of Granger causality from insufficient time series data, so this point is not a concern. L2 NVAR seems to give the best performance among the three methods in Figure 2(a), but when the true false discovery rate is controlled within 5%, L1 NVAR gives the best performance among them. Therefore, we conclude that L1 NVAR is the best option among them for estimating nonlinear and nonmonotonic gene regulatory networks from short time series data.

3.2. Application to expression data of human HeLa cells
We apply the proposed method to the time series gene expression data of human HeLa cells [24]. 48 time points for 94 genes selected by [8] are used in our study. The number of B-splines is set to four, that is, the number of parameters is approximately eight times the length of the data. Figure 3 shows the estimated gene regulatory network under a false discovery rate of 5%. In the following, the significantly estimated Granger causalities and biologically reported facts are compared:

• The transcription factor NF-κB is known to work as the central mediator of the human immune response [20]. ICAM-1, cyclin D1, A20, and IAP are reported to be target genes of NF-κB [20]. In our estimated network, IAP is estimated as Granger causal of NF-κB, cyclin D1, A20, and ICAM-1. PKR is known to activate NF-κB. In the estimated network, PKR and NF-κB are connected via Granger causal relations with IAP.
• Bcl-2 is known to inhibit PERP-induced cell death [2]. This regulation coincides with the Granger causality Bcl-2 --> PERP in the estimated network.
• E2F1 is a transcription factor known to regulate the transcription of cyclin E1 [10]. The estimated Granger causality cyclin E1 --> E2F1 is opposite to the biologically known fact.
• Puma is known to bind Mcl-1 to maintain the expression level of Mcl-1 [6]. This coincides with the estimated Granger causality Puma --> Mcl-1.
• In colon cancer cells, overexpression of E2F-1 is reported to down-regulate Mcl-1, up-regulate c-myc, and induce apoptosis [5]. In HeLa cells, this mechanism may be different, but interestingly, in the estimated network there is a completely opposite Granger causality path, c-myc --> Mcl-1 --> PUMA --> E2F-1.
• Fas is a well-known target gene of P53. On the other hand, Fas is Granger causal of P53 in the estimated network.
• P21 is estimated to have a self loop. This self loop is also detected by NVAR based on OLS and has been verified [9].
[Figure 2: panels (a) and (b), each plotted against time series length.]
Fig. 2. True positive rates for estimated Granger causalities in NVAR, L1 NVAR, and L2 NVAR. (a) True positive rates under false discovery control 5 %. (b) True positive rates obtained by controlling true false discovery rate within 5 %.
Fig. 3. The resulting gene regulatory network from L1 NVAR under false discovery rate 5%.
4. Discussion
In this study, we proposed L1 regularized nonlinear vector autoregressive (L1 NVAR) models to estimate gene regulatory networks from short time series microarray data. The difficulty with short time series microarray data is that the number of parameters in the model is greater than the number of samples, which leads to overfitting and decreases the predictive power of the model. To overcome this difficulty, we apply the group LASSO technique to build nonparametric regressions for gene regulation models. For the edge selection in gene regulatory networks, we derived an information criterion from a Bayesian point of view. For the post-treatment of the obtained model to find Granger causality, we established a Wald test for the regression parameters estimated by the group LASSO procedure. The simulation study indicated that the proposed method outperforms the nonlinear vector autoregressive model estimated by ordinary least squares in terms of the accuracy of the estimated networks. We also applied the proposed method to human HeLa cell time series microarray data. Some of the estimated edges are supported by the literature, but we observed that some edges have the opposite direction. The relationship between biological regulation and Granger causality needs to be investigated in future work.
References
[1] Akaike, H., Likelihood and the Bayes procedure, Bayesian Statistics (eds. J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith), Univ. Press, Valencia, Spain, 1980.
[2] Attardi, L.D., Reczek, E.E., Cosmas, C., Demicco, E.G., McCurrach, M.E., Lowe, S.W., and Jacks, T., Perp, an apoptosis-associated target of p53, is a novel member of the PMP-22/gas3 family, Genes & Development, 14(6):704-718, 2000.
[3] Bach, F.R., Thibaux, R., and Jordan, M.I., Computing regularization paths for learning multiple kernels, In Advances in Neural Information Processing Systems 17, 2004.
[4] Eilers, P., and Marx, B., Flexible smoothing with B-splines and penalties (with discussion), Statistical Science, 11:89-121, 1996.
[5] Elliott, M.J., Dong, Y.B., Yang, H., and McMasters, K.M., E2F-1 up-regulates c-myc and p14arf and induces apoptosis in colon cancer cells, Clinical Cancer Research, 7:3590-3597, 2001.
[6] Ewings, K.E., Hadfield-Moorhouse, K., Wiggins, C.M., Wickenden, J.A., Balmanno, K., Gilley, R., Degenhardt, K., White, E., and Cook, S.J., Erk1/2-dependent phosphorylation of bim promotes its rapid dissociation from mcl-1 and bcl-xl, EMBO Journal, 26(12):2856-2867, 2007.
[7] Fujita, A., Sato, J.R., Garay-Malpartida, H.M., Morettin, P.A., Sogayar, M.C., Ferreira, C.E., Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method, Bioinformatics, 23(13):1623-1630, 2007.
[8] Fujita, A., Sato, J.R., Garay-Malpartida, H.M., Morettin, P.A., Yamaguchi, R., Miyano, S., Sogayar, M.C., Ferreira, C.E., Modeling gene expression regulatory networks with the sparse vector autoregressive model, BMC Systems Biology, 1:39, 2007.
[9] Fujita, A., Sato, J.R., Garay-Malpartida, H.M., Sogayar, M.C., Ferreira, C.E., and Miyano, S., Modeling nonlinear gene regulatory networks from time series gene expression data, Journal of Bioinformatics and Computational Biology, in press, 2008.
[10] Geng, Y., Eaton, E.N., Picon, M., Roberts, J.M., Lundberg, A.S., Gifford, A., Sardet, C., Weinberg, R.A., Regulation of cyclin E transcription by E2Fs and retinoblastoma protein, Oncogene, 12(6):1173-1180, 1996.
[11] Granger, C.W.J., Investigating causal relations by econometric models and cross-spectral methods, Econometrica, 37:424-438, 1969.
[12] Hirose, O., Yoshida, R., Imoto, S., Yamaguchi, R., Higuchi, T., Charnock-Jones, S.D., Print, C., and Miyano, S., Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models, module finder on gene expression profiles, Bioinformatics, 24(7):932-942, 2008.
[13] Kim, S., Imoto, S., and Miyano, S., Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data, Biosystems, 75(1-3):57-65, 2004.
[14] Kim, S.J., Koh, K., and Lustig, M., An interior-point method for large-scale l1 regularized least squares, IEEE Journal on Selected Topics in Signal Processing, 4(1):606-617, 2007.
[15] Konishi, S., and Kitagawa, G., Generalized information criteria in model selection, Biometrika, 83(4):875-890, 1996.
[16] Lin, Y., and Zhang, H.H., Component selection and smoothing in multivariate nonparametric regression, Annals of Statistics, 34(5):2272-2297, 2006.
[17] Lobo, M.S., Vandenberghe, L., Boyd, S., Applications of Second-order Cone Programming, Linear Algebra and its Applications, 284:193-228, 1998.
[18] Nachman, I., Regev, A., and Friedman, N., Inferring quantitative models of regulatory networks from expression data, Bioinformatics, 4(20):i248-i256, 2004.
[19] Nagasaki, M., Doi, A., Matsuno, H., Miyano, S., Genomic Object Net: I. A platform for modeling and simulating biopathways, Applied Bioinformatics, 2:181-184, 2003.
[20] Pahl, H.L., Activators and target genes of rel/NF-κB transcription factors, Oncogene, 18:6853-6866, 1999.
[21] Park, M., and Hastie, T., Regularization path algorithm for detecting gene interactions, preprint, 2006.
[22] Schwarz, G., Estimating the dimension of a model, Annals of Statistics, 6:461-464, 1978.
[23] Shimamura, T., Model selection with empirical Bayes criteria L1-regularization, PhD dissertation, Hokkaido University, 2007.
[24] Whitfield, M.L., Sherlock, G., Saldanha, A.J., Murray, J.I., Ball, C.A., Alexander, K.E., Matese, J.C., Perou, C.M., Hurt, M.M., Brown, P.O., Botstein, D., Identification of genes periodically expressed in the human cell cycle and their expression in tumors, Molecular Biology of the Cell, 13:1977-2000, 2002.
[25] Yamaguchi, R., Yoshida, R., Imoto, S., Higuchi, T., Miyano, S., Finding module-based gene networks in time-course gene expression data with state space models, IEEE Signal Processing Magazine, 24:37-46, 2007.
[26] Yuan, M., and Lin, Y., Model selection and estimation in regression with grouped variables, Journal of The Royal Statistical Society Series B, 68(1):49-67, 2006.
[27] http://www.cellillustrator.com/
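Appendix A.
We define the optimization problem for the L1 regularized spline additive model in Equation (5) as the primal problem P:

$$\min_{\gamma_i} \; (B_i\gamma_i - y_i)'(B_i\gamma_i - y_i) + \lambda \sum_j \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}} \qquad \text{(A.1)}$$

and derive its dual. Letting $z = B_i\gamma_i - y_i$, we can rewrite Equation (A.1) as:

$$\min_{\gamma_i,\, z} \; z'z + \lambda \sum_j \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}} \quad \text{subject to} \quad z = B_i\gamma_i - y_i.$$

Thus, the dual problem can be given as:

$$V(\alpha) = \max_{\alpha} \min_{\gamma_i,\, z} \Big\{ z'z + \lambda \sum_j \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}} + \alpha'(B_i\gamma_i - y_i - z) \Big\}$$

$$= \max_{\alpha} \min_{\gamma_i} \Big\{ -\frac{\alpha'\alpha}{4} - \alpha' y_i + \lambda \sum_j \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}} + \alpha' B_i\gamma_i \Big\}$$

$$= \begin{cases} \max_{\alpha} \Big\{ -\dfrac{\alpha'\alpha}{4} - \alpha' y_i \Big\} & \text{if } \lambda^2 \ge \alpha' B_{i,j} E_{i,j}^{-1} B_{i,j}' \alpha, \\ -\infty & \text{otherwise,} \end{cases}$$

where $\alpha = 2z = 2(B_i\gamma_i - y_i)$. The dual problem can be converted as:

$$V(\gamma_i) = \arg\min_{\gamma_i} \big( \gamma_i' B_i' B_i \gamma_i \big) \quad \text{subject to} \quad \lambda^2/4 \ge (y_i - B_i\gamma_i)' B_{i,j} E_{i,j}^{-1} B_{i,j}' (y_i - B_i\gamma_i).$$

By using the barrier function $\phi_j(\gamma_i)$:

$$\phi_j(\gamma_i) = \log\big\{ \lambda^2/4 - (y_i - B_i\gamma_i)' B_{i,j} E_{i,j}^{-1} B_{i,j}' (y_i - B_i\gamma_i) \big\},$$

we have the following potential function to be minimized in the interior point method:

$$\psi(\gamma_i) = \gamma_i' B_i' B_i \gamma_i + \mu \sum_j \phi_j(\gamma_i).$$

The dual gap $\eta$ of the interior point method can be given as:

$$\eta = 2\big( \gamma_i' B_i' B_i \gamma_i - y_i' B_i \gamma_i + y_i' y_i \big) + \lambda \sum_j \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}}.$$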
Here, we show a brief algorithm of the interior point method:
step 1 Set barrier strength $\mu$, acceleration rate $\nu$, and stopping criterion $\tau$ to some positive values, for instance, $\mu = 0.2$, $\nu = 6.5$, and $\tau = 0.00001$.
step 2 Calculate the Newton direction and select the step size by the Armijo rule.
step 3 Update $\gamma_i$ according to the Newton direction and step size obtained at step 2.
step 4 If the dual gap $\eta$ is less than $\tau$, finish the algorithm and output $\gamma_i$.
step 5 Update $\mu$ as $\mu \leftarrow \mu/\nu$ and go to step 2.
For details of the interior point method, see [14, 17].
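The step from the Lagrangian to the closed-form dual in this appendix can be checked term by term. The following is only a sketch in the appendix's notation; it assumes that each $E_{i,j}$ is symmetric positive definite (so that $E_{i,j}^{-1}$ and $E_{i,j}^{1/2}$ exist) and that $B_i\gamma_i = \sum_j B_{i,j}\gamma_{i,j}$, neither of which is stated explicitly here.

```latex
% Minimization over z: the z-dependent part of the Lagrangian is z'z - \alpha'z.
\nabla_z \,(z'z - \alpha'z) = 2z - \alpha = 0
\;\Longrightarrow\; z = \tfrac{1}{2}\alpha,
\qquad
\min_z \,(z'z - \alpha'z) = \tfrac{1}{4}\alpha'\alpha - \tfrac{1}{2}\alpha'\alpha
                          = -\tfrac{1}{4}\alpha'\alpha .

% Minimization over \gamma_{i,j}: by the Cauchy-Schwarz inequality,
\alpha' B_{i,j}\gamma_{i,j}
  = \big(E_{i,j}^{-1/2} B_{i,j}'\alpha\big)' \big(E_{i,j}^{1/2}\gamma_{i,j}\big)
  \;\ge\; -\sqrt{\alpha' B_{i,j} E_{i,j}^{-1} B_{i,j}' \alpha}\;
          \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}} ,

% so that
\lambda\sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}} + \alpha' B_{i,j}\gamma_{i,j}
  \;\ge\; \Big(\lambda - \sqrt{\alpha' B_{i,j} E_{i,j}^{-1} B_{i,j}' \alpha}\Big)
          \sqrt{\gamma_{i,j}' E_{i,j} \gamma_{i,j}} .
```

The right-hand side of the last line is bounded below by zero exactly when $\lambda^2 \ge \alpha' B_{i,j} E_{i,j}^{-1} B_{i,j}' \alpha$; otherwise scaling $\gamma_{i,j}$ along $-E_{i,j}^{-1} B_{i,j}'\alpha$ drives the expression to $-\infty$, which gives the two cases of the dual.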
ModelMage: A TOOL FOR AUTOMATIC MODEL GENERATION, SELECTION AND MANAGEMENT
MAX FLÖTTMANN¹ floettma@molgen.mpg.de
JÖRG SCHABER¹ schaber@molgen.mpg.de
STEPHAN HOOPS² shoops@vbi.vt.edu
EDDA KLIPP¹ klipp@molgen.mpg.de
PEDRO MENDES²,³ mendes@vt.edu
¹ Max-Planck-Institute for Molecular Genetics, Ihnestr 63-73, 14195 Berlin, Germany
² Virginia Bioinformatics Institute - 0477, Virginia Tech, Bioinformatics Facility I, Blacksburg, Va 24061, USA
³ The University of Manchester, 131 Princess Street, Manchester M1 7DN, United Kingdom
Mathematical modeling of biological systems usually involves implementing, simulating, and discriminating several candidate models that represent alternative hypotheses. Generating and managing these candidate models is a tedious and difficult task and can easily lead to errors. ModelMage is a tool that facilitates management of candidate models. It is designed for the easy and rapid development, generation, simulation, and discrimination of candidate models. The main idea of the program is to automatically create a defined set of model alternatives from a single master model. The user provides only one SBML-model and a set of directives from which the candidate models are created by leaving out species, modifiers or reactions. After generating models the software can automatically fit all these models to the data and provides a ranking for model selection, in case data is available. In contrast to other model generation programs, ModelMage aims at generating only a limited set of models that the user can precisely define. ModelMage uses COPASI as a simulation and optimization engine. Thus, all simulation and optimization features of COPASI are readily incorporated. ModelMage can be downloaded from http://sysbio.molgen.mpg.de/modelmage and is distributed as free software.
Keywords: model families; systems biology; model discrimination; candidate model; model simulation; model generation
1. Introduction
It has been recognized that despite the increasing amount and accuracy of molecular biological data, uncertainty about biochemical network structures and dynamics is still immense. The uncertainty is not only limited to different parameter sets, but also affects structure and kinetics of models [1]. This naturally prompts alternative hypotheses about biological processes, which directly translate into alternative mathematical model formulations. Generating and handling alternative model formulations poses a considerable challenge to the modeler for several reasons. First, each model has to be implemented, simulated and analyzed separately. Often, model alternatives vary only
slightly in structure and kinetics [2], which tempts the modeler to copy the original model and then introduce the modifications by hand. This is an error-prone process. Second, changes that affect the whole family of models have to be applied to each model separately, which is also error-prone and tedious, especially when the number of models is high. Biological models with many uncertainties often lead to a very high number of possible alternatives because of combinatorial complexity, which renders it impossible to implement and handle each model individually. Under these conditions, keeping track of changes in models is also a hard task. These problems have led to the development of several formalisms and tools. In addition to model building and analysis tools like Copasi [3] or CellDesigner [4], there are a number of tools that specifically deal with uncertainties in model building [5-7]. These state-of-the-art approaches have a major shortcoming. Even though most tools aim at handling combinatorial complexity, they produce only one model at the end, which includes all or a reduced number of possible molecular interactions generated from certain rules. As an alternative, we present a tool that automatically implements and manages a set of models that differ in the number of species, reactions or kinetics, because this is what modelers in systems biology are usually confronted with. MMT2 [8] is a tool that aims in that direction, but it falls short in the ability to actually control the number and structure of generated models, because it produces all combinatorially possible alternatives. In our daily work and in discussions with the community [9] we see that it is not so much the combinatorial complexity that poses problems for modelers, but rather the need to create and handle a specific set of candidate models that represent alternative formulations of biological processes.
Often modelers have a very clear idea what different versions of a model they want to test. It is just too tiresome and error-prone to do this by hand, and this way good models may be omitted because of a lack of time. ModelMage provides modelers with a tool that facilitates managing a set of candidate models. In the following we describe the main idea and the technology of ModelMage. Finally, we also provide a simple example of its usage.
1.1. The Idea
ModelMage automatically generates models based on a master model and certain directives that specify how candidate models are to be derived from the master model. Such candidate models are created by two basic processes, first, by removing species, modifiers or reactions from the master model and, second, by generating certain alternative kinetics for specified reactions. The generated candidate models are automatically documented in a way that it is always comprehensible how they were derived from the master model, thereby keeping track of model versions. The principle of generating candidate models by leaving out components of a master model implies that the master model must
comprise all components of the candidate models. The user has to provide only the structure of the master model, with or without kinetics, and directives for ModelMage. When no initial kinetics are assigned, ModelMage assumes mass action kinetics of appropriate order by default. There is no need to edit the individual models at any time. Thereby common parameters, modifications or directives are changed in one place only and are automatically updated in each candidate model. This removes the errors introduced by modifying each model individually. Finally, all models are simulated, fitted to data and compared automatically, if data is available. At the end, the user is provided with a ranking of the model fits and statistical measures that help to discriminate between model alternatives.

[Figure 1 diagram; labels: Master Model, Reduction Directives, Command line, SBML/Copasi Child Models, Model Ranking.]
Fig. 1. Workflow of ModelMage.
2. Methods
2.1. Generating Alternative Models
Generation of candidate models in ModelMage is a two-step process. The first step is to create a master model in Copasi or in any other SBML [10, 11] compliant editor like CellDesigner [4] or Semantic SBML [12]. This model must include all species and reactions that are to appear in at least one of the alternative models. In the second step, the set of candidate models is built by removing reactions, species or modifiers from the master model (Fig. 1). The removal and exchange is done by giving simple directives to the program as command line parameters. There are two parameters that can be specified to generate alternatives (Table 1). The parameter --kinetics is used to exchange the kinetic law in the specified reactions. This can be achieved by a simple syntax without typing the whole formula and with default values for parameters. The user has to choose the reaction and give ModelMage a short form of the kinetics to be assigned to it, e.g. re_2(MM) would set reaction re_2 to Michaelis-Menten kinetics in one candidate model if the number of reactants and modifiers is suitable. At the moment, only mass action (MA) and Michaelis-Menten (MM) kinetics can be set by using this parameter.

Table 1. Command line parameters of ModelMage.

| Parameter | Values | Function |
|---|---|---|
| -r, --remove | logical combinations of species and reactions, e.g. species_1 & reaction_2 | Defines which species, reactions or modifiers should be removed from the master model. |
| -k, --kinetics | X(MA) or X(MM) | Changes the kinetics of reaction X to mass action (MA) or Michaelis-Menten (MM). |
| -d, --discriminate | RSS, AIC, AICc (AICc is default) | Runs a parameter estimation in COPASI and ranks the models by the selected criterion. |
| -o, --output | path/filename | Path and basic filename for the output files. ModelMage will add identifiers to the created models and create the subdirectories ResultSBMLFiles and ResultCopasiFiles. |
| -v, --verbose | no arguments | Returns more detailed command line output about the created models. |

The SBML structure is not easy to work with internally, so the first step in the internal generation process is the conversion of the master model to a bipartite graph with one set of nodes for the reactions and the other set of nodes for the species. To create the network structures of the candidate models one can use the --remove parameter. The values of this parameter can be all possible logical combinations of species, modifiers and reactions that should be removed from the master model to generate the candidate models.
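The bipartite representation mentioned above can be illustrated with a few lines of Python. This is a minimal sketch using only the standard library, not ModelMage's internal data structure; the names `Reaction` and `build_graph` and the toy network are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Reaction:
    """One reaction node with edges to its species nodes."""
    substrates: set = field(default_factory=set)
    products: set = field(default_factory=set)
    modifiers: set = field(default_factory=set)


def build_graph(reactions):
    """Return (species_nodes, reaction_nodes) of the bipartite graph.

    `reactions` maps a reaction id to (substrates, products, modifiers).
    Edges only ever connect a species to a reaction, never two nodes of
    the same kind, which is what makes the graph bipartite.
    """
    reaction_nodes = {
        rid: Reaction(set(subs), set(prods), set(mods))
        for rid, (subs, prods, mods) in reactions.items()
    }
    species_nodes = set()
    for r in reaction_nodes.values():
        species_nodes |= r.substrates | r.products | r.modifiers
    return species_nodes, reaction_nodes


# Hypothetical toy network: A -> B -> C, with C acting as a modifier of re_1.
species, graph = build_graph({
    "re_1": (["A"], ["B"], ["C"]),
    "re_2": (["B"], ["C"], []),
})
```

Keeping modifiers as a separate edge set makes it possible to remove a modifier role without touching the species node itself.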
2.1.1. Removing Reactions and Modifiers
Executing the command -r reaction_identifier, ModelMage removes the specified reaction from the model and leaves the connected species untouched. Removing reactions is fairly simple, because these can simply be deleted from the model, and the worst effect can be two or more unconnected networks in the model. If the removal of a reaction results in unconnected subgraphs, ModelMage by default prints out a warning but nevertheless removes the reaction. The second graph entity that can be removed in ModelMage is the role of a component as a modifier in reactions. This is represented as an edge from the species to the reaction in the graph structure. Modifiers are also removed using the -r option. Removing a species as a modifier is invoked by a colon between both identifiers (e.g. -r species_identifier:reaction_identifier). In this case the modifier of the specified reaction is removed, and the kinetics are changed to mass action kinetics of appropriate order to make sure that ModelMage generates a working model.
Fig. 2. Examples of removed species. (A) Removal of an intermediate species. Species T is removed and reactions r1 and r2 are combined. (B) When an enzyme-substrate-complex is removed from a reaction ModelMage creates a new reaction with the enzyme set as a modifier. (C) Removal of species AB leads to a new reaction that is a combination of all incoming and outgoing reactions of the removed species.
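The removal of reactions and modifiers described in Section 2.1.1 can be sketched as follows. This is an illustration under stated assumptions rather than ModelMage's code: the reaction table format, the function names and the toy model are hypothetical, and the reset of the kinetics to mass action after removing a modifier is not modeled.

```python
def remove_reaction(reactions, rid):
    """Return a copy of the reaction table without reaction `rid`."""
    return {k: v for k, v in reactions.items() if k != rid}


def remove_modifier(reactions, species, rid):
    """Return a copy in which `species` no longer modifies reaction `rid`."""
    out = dict(reactions)
    subs, prods, mods = out[rid]
    out[rid] = (subs, prods, [m for m in mods if m != species])
    return out


def count_components(species, reactions):
    """Count connected components among `species` linked by reactions."""
    parent = {s: s for s in species}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for subs, prods, mods in reactions.values():
        members = [*subs, *prods, *mods]
        for m in members[1:]:
            parent[find(m)] = find(members[0])
    return len({find(s) for s in species})


# Hypothetical toy model: A -> B -> C, with A modifying re_2.
all_species = {"A", "B", "C"}
model = {"re_1": (["A"], ["B"], []), "re_2": (["B"], ["C"], ["A"])}

reduced = remove_reaction(model, "re_2")
if count_components(all_species, reduced) > 1:
    print("warning: removing re_2 leaves unconnected subgraphs")
```

The connectivity check only reports the split; as in the text, the reaction is removed regardless.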
2.1.2. Removing Species
When the user specifies a species for removal, ModelMage analyzes the neighborhood of this species and rewires the model. The rewiring heuristic follows the principle of reachability and works as follows. First, ModelMage detects the incoming and outgoing reactions of the species to be removed and analyzes the species involved in these reactions, i.e. the substrates of which the removed species was the product and the products of which the removed species was the substrate (see Fig. 2). Then, ModelMage tries to combine all pairs of incoming and outgoing reactions into one single new reaction. The new reaction inherits all substrates and products from the combined reactions and is assigned a kinetic law. The inserted kinetics are either the kinetics of the ancestors, if they were equal and are still suitable for the combined reaction, or mass action otherwise. There are three possible cases for each combination of incoming and outgoing reactions that have to be considered and treated separately.
(1) If the combination of a pair of incoming and outgoing reactions would result in a reaction that has the same species both as substrates and as products, i.e. a self loop to a list of species, the respective reactions are not combined.
(2) If only a subset of species is equal in both substrates and products, then these species are regarded as enzymes and are set as modifiers for the resulting reaction (Fig. 2 B).
(3) If the sets are disjoint, then the reactions are combined in the way described above.
The main algorithm to remove one species looks like this (pseudocode):

    def remove(s):
        for i in s.incomingReactions:
            for o in s.outgoingReactions:
                if i == o:
                    selfLoop(i, o)
                elif i partiallyEqual o:
                    substrateEnzyme(i, o)
                else:
                    combine(i, o)
                del(i, o)
        del(s)
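The three cases above can be made concrete with a small runnable sketch. It is one possible reading of the heuristic, not the ModelMage implementation: the table format (frozensets of substrates, products and modifiers), the name `remove_species`, the naming of the combined reaction and the example network are assumptions, and the choice of kinetics for the new reaction is left out.

```python
from itertools import product


def remove_species(reactions, s):
    """Remove species `s` and rewire its neighbouring reactions.

    `reactions` maps an id to (substrates, products, modifiers), each a
    frozenset of species names. Returns a new reaction table.
    """
    incoming = [r for r, (_, prods, _) in reactions.items() if s in prods]
    outgoing = [r for r, (subs, _, _) in reactions.items() if s in subs]
    touched = set(incoming) | set(outgoing)

    # Keep untouched reactions, dropping `s` where it was only a modifier.
    result = {
        r: (subs, prods, mods - {s})
        for r, (subs, prods, mods) in reactions.items()
        if r not in touched
    }

    for i, o in product(incoming, outgoing):
        subs = (reactions[i][0] | reactions[o][0]) - {s}
        prods = (reactions[i][1] | reactions[o][1]) - {s}
        mods = (reactions[i][2] | reactions[o][2]) - {s}
        shared = subs & prods
        if subs == prods:
            continue  # case 1: self loop, the pair is not combined
        # case 2: shared species become modifiers; case 3: `shared` is empty
        result[f"{i}+{o}"] = (subs - shared, prods - shared, mods | shared)
    return result


# Hypothetical enzyme-substrate example in the spirit of Fig. 2 (B).
fs = frozenset
model = {
    "re1": (fs({"E", "S"}), fs({"ES"}), fs()),   # E + S -> ES
    "re2": (fs({"ES"}), fs({"E", "P"}), fs()),   # ES -> E + P
    "re3": (fs({"ES"}), fs({"E", "S"}), fs()),   # ES -> E + S
}
rewired = remove_species(model, "ES")
```

In this example the pair (re1, re3) would map E + S onto itself and is skipped, while (re1, re2) yields a single reaction S -> P with E as modifier.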
As mentioned before, removing combinations of species and reactions can lead to large numbers of candidate models. The user has to be careful about how many components are combined for removal, because there is the danger of combinatorial explosion, especially if the components are combined by OR operators (Fig. 3). Exchanging kinetics also increases the number of models, because every structural model is generated with every possible set of kinetics. To simplify the process of finding the right logical formulations to create certain model families, the user can specify different sets of models in one single run of ModelMage. This can be achieved by concatenating different logical formulas with ',', e.g. ModelMage.py -r (species_1 ^ species_2), (reactionA & reaction_1) example.cps. This would generate three new alternative models. Two are generated from the first bracket, where species_1 is removed in one and species_2 in the other. The third model is generated from the second bracket, where both reactions are removed from the model.
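How the logical operators translate into numbers of candidate models can be sketched with a few combinators. Each model is represented by the set of components removed from the master model. The function names are hypothetical, and the treatment of nested expressions is an assumption that goes beyond the two-operand cases described in the text.

```python
def leaf(name):
    """A single component stands for one model in which it is removed."""
    return [frozenset([name])]


def and_(a, b):
    """AND: every alternative of `a` is merged with every alternative of `b`."""
    return [x | y for x in a for y in b]


def xor(a, b):
    """XOR: either side is removed, but never both."""
    return a + b


def or_(a, b):
    """OR: either side alone, or both together."""
    return a + b + and_(a, b)


def family(*directives):
    """Comma-separated directives: concatenate, dropping duplicate models."""
    seen, models = set(), []
    for directive in directives:
        for model in directive:
            if model not in seen:
                seen.add(model)
                models.append(model)
    return models


# The two bracketed directives from the example command line above.
models = family(
    xor(leaf("species_1"), leaf("species_2")),
    and_(leaf("reactionA"), leaf("reaction_1")),
)
```

With two operands, AND gives one model, XOR two and OR three, so a chain of OR operators grows much faster than the other two, which is the combinatorial explosion the text warns about.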
Fig. 3. Logical combinations of alternative models. (A) Master model from which the alternatives can be generated by ModelMage. (B) One model is generated by the logical string "B & re_2". The logical AND means that species B and reaction re_2 must both be removed in this model. (C) The XOR in the command "B ^ re_2" produces two models, because in each of them only one node is removed. (D) The OR operator "B | re_2" creates a combination of (B) and (C).

2.2. Model Discrimination
When data for certain components is available, ModelMage can select the best model from the generated model family. ModelMage can automatically fit the models to the data by estimating parameters and is able to rank the models by different statistical measures to determine the best model. For parameter estimation ModelMage utilizes Copasi's various parameter estimation routines, which makes it very fast and flexible. The parameter estimation task is most conveniently defined in Copasi's graphical user interface and can later be executed by ModelMage. The user has to set up the task only once for the master model. ModelMage automatically defines the parameter estimation task for all generated candidate models. Parameters of new or changed reactions are added to the estimation task if the parameters of the original reactions were part of the parameter estimation task of the master model. If there are parameters that do not exist in the specific alternative model, they are deleted from the estimation task for this model. The parameter estimation in Copasi creates output files for all of the models that contain details about the results of the estimation. ModelMage parses these files and extracts the objective value, which is the sum of weighted squared residuals of the fitted model (RSS). From this, e.g., the Akaike Information Criterion [13] is calculated for each model:
$$\mathrm{AIC} = 2k + n\left(\ln(\mathrm{RSS}/n) + 1\right), \qquad (1)$$
where n is the number of observations and k is the number of estimated parameters, which are also parsed from the Copasi output. From the AIC we can also compute the second order AIC (AICc):
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}. \qquad (2)$$
The AICc is used as the standard measure for ranking and model selection in ModelMage. The lower the AICc, the better the fit to the data and the higher the model is ranked. AICc corrects the AIC for small numbers of observations, which are common in systems biology. It can also be employed with larger samples, because it converges to AIC as the sample size grows [14, 15].
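Equations (1) and (2) translate directly into code. The following sketch implements them exactly as written above and ranks a list of fits; the function names and the tuple layout are hypothetical, and the guard for n ≤ k + 1 is an added assumption, since the correction term is undefined there.

```python
import math


def aic(rss, n, k):
    """Equation (1): AIC = 2k + n (ln(RSS / n) + 1)."""
    return 2 * k + n * (math.log(rss / n) + 1)


def aicc(rss, n, k):
    """Equation (2): AIC plus the small-sample correction term."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)


def rank(fits):
    """Sort (name, rss, n, k) tuples by ascending AICc, best model first."""
    return sorted(fits, key=lambda f: aicc(f[1], f[2], f[3]))
```

The correction term 2k(k+1)/(n-k-1) shrinks towards zero as n grows with k fixed, which is the convergence to AIC mentioned in the text.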
2.3. Implementation
We use SBML and libSBML because SBML is accepted as a standard for the exchange of models between systems biology tools [9, 16]. We decided to develop ModelMage in close connection with Copasi because it is widely used in the community to work with dynamical models and has a rich set of features to build upon. ModelMage is written in Python, which makes it very flexible and portable to many platforms. Currently, it is tested on Linux, Mac OS and Windows. Parameter estimation is done by the fast Copasi algorithms. The rest of ModelMage's features are not computationally intensive, which justifies the use of an interpreted language like Python for the tool. To install and run ModelMage, a system must meet the following requirements:
• Python ≥ 2.5
• Networkx package for Python
• libSBML ≥ 3.0.1 [17]
• Copasi ≥ 4.2
3. Results and Discussion
3.1. Example
To verify ModelMage's functionality we created a small master model that resembles a signaling cascade with three hypothetical feedback loops (Fig. 4). From one alternative model we generated time-series data by simulating the model for 80 time units and sampling the values of species S, X5 and X6 at 7 time points. To get a more realistic test case we introduced small normally distributed errors into the data. After that, we used ModelMage to generate a family of ten models from the master model and fitted every model to the artificial data in a blind test, i.e. the person who used ModelMage did not know beforehand which model had produced the data. The model that produced the data was correctly recovered by the discrimination procedure of ModelMage. The fits were done only for parameters of the reactions re3, re10, re11 and re12. The rest of the parameters were set to 0.1 to reduce the number of estimated parameters. For the parameter estimation we used the "Evolutionary Programming" algorithm in Copasi and ranked the models by AICc. Results are shown in Table 2.
Fig. 4. Example model for presenting the features of ModelMage. The dashed lines were our hypotheses about feedbacks that possibly regulate the production of T. ModelMage was started with the parameters -r 're11, re11:X5 & re11:X6, X4 & re11:X5 & re11:X6, re11:X6 & re3:X6, re11:X6 & re3:X6 & X4, re11:X5 & re3:X6, re11:X5 & re3:X6 & X4' -k "re11(MM)" and produced all possible candidate models with different kinetics for re11 where possible. All these alternatives were also created with a double or single phosphorylation. Additionally one model with re11 completely removed was created.
Table 2. Ranking of the candidate models.

| # | Model | Objective Value | AIC | AICc | n | k |
|---|---|---|---|---|---|---|
| 1 | re11:X5 & re3:X6 & X4 | 0.0731 | -46.9367 | -44.9367 | 21 | 4 |
| 2 | re11:X5 & re11:X6 & X4 | 0.1246 | -33.7391 | -29.7391 | 21 | 5 |
| 3 | re11:X5 & re11:X6 & X4 | 0.1621 | -30.2117 | -28.2117 | 21 | 4 |
| 4 | re3:X6 & re11:X6 & X4 | 0.1833 | -27.6309 | -25.6309 | 21 | 4 |
| 5 | master model | 1.3734 | 14.6623 | 16.6623 | 21 | 4 |
| 6 | re11:X5 & re11:X6 | 1.6720 | 18.7939 | 20.7939 | 21 | 4 |
| 7 | re11:X5 & re3:X6 | 1.6841 | 18.9455 | 20.9455 | 21 | 4 |
| 8 | re3:X6 & re11:X6 | 1.7246 | 19.4449 | 21.4449 | 21 | 4 |
| 9 | re11:X5 & re11:X6 | 1.5689 | 19.4577 | 23.4577 | 21 | 5 |
| 10 | re11 | 4.4530 | 37.3652 | 37.3652 | 21 | 3 |
The ranking of candidate models clearly divided the models into 3 separate groups. The first group consists of the models that did not include species X4. They had objective values that were about 10 times smaller than those of the second group while fitting the same number of parameters. This also leads to big differences in the AICc. Within this group, the first-ranked model is clearly separated from the following 3 models, which all have very similar AICc values. This model is the one the data was created with.
The second group includes all the models that still include species X4 and represent different ways of feedback. The master model, which was also fitted, ranked highest in this group, which is probably due to the fact that it still has all feedbacks included and can therefore regulate the concentration of X6 better than all the other models. Because of this obvious classification we did time course simulations for the best candidate model and the one that ranked seventh. These models are very similar; the only difference is that the seventh-ranked model contains X4 and the first-ranked one does not (Fig. 5). The third group consists of the model where reaction re11 is completely removed, which has the worst fit of all the models. The bad fit is mainly due to the missing degradation of S. The other curves in this model still fit quite well, because there is still one feedback that can regulate X5 and X6.
Fig. 5. Plots of simulations with the fitted models. (A) The signal is fitted similarly by models from both groups. (B) The lower ranking models do not reach the high amplitude in the X5 concentration. (C) The concentrations of X6 reach a maximum and decay after 70s in the model from the second group, whereas they reach a limit in the model from the first group. This difference can be seen in all the models.
3.2. Conclusions
The software we developed can substantially facilitate and accelerate the generation and discrimination of model alternatives. Model families can be created, analyzed and changed far more easily than was possible before. The generated models are portable to any other SBML compliant software, which gives the user the possibility to view and analyze them with an array of already existing tools. Our example model could be generated and clearly recovered from a family of slightly different models in a very short time. To use ModelMage it must be installed locally; it can be downloaded from http://sysbio.molgen.mpg.de/modelmage. It would also be possible to create a web-based version of the generator to make it easier to use. The ranking criteria worked well despite a limited set of data to fit. Whether this is also the case in real biological examples remains to be investigated. Nevertheless, the user has to be careful when selecting models, and the ranking by AIC should only be used as a hint as to which might be the best model [8]. The most complicated step in using ModelMage is the formulation of the logical combination of removals, which can become quite difficult in some cases. We hope to improve this by adding a more sophisticated user interface to ModelMage in upcoming versions. We are also planning to integrate a broader set of exchangeable kinetics to give the user more possibilities for alternatives.

Acknowledgements
JS is supported by the European Commission (CELLCOMPUT (043310)) and MF is supported by the MPI for Molecular Genetics.

References
[1] Kuepfer, L., Peter, M., Sauer, U., Stelling, J., Ensemble modeling for analysis of cell signaling dynamics, Nat Biotechnol, 25: 1001-1006, 2007.
[2] Geva-Zatorsky, N., Rosenfeld, N., Itzkovitz, S., Milo, R., Sigal, A., Dekel, E., Yarnitzky, T., Liron, Y., Polak, P., Lahav, G., Alon, U., Oscillations and variability in the p53 system, Mol Syst Biol, 2: 0033, 2006.
[3] Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., Kummer, U., COPASI - a complex pathway simulator, Bioinformatics, 22(24): 3067-3074, 2006.
[4] Funahashi, A., Morohashi, M., Kitano, H., and Tanimura, N., CellDesigner: a process diagram editor for gene-regulatory and biochemical networks, Biosilico, 1: 159-162, 2003.
[5] Shapiro, B.E., Levchenko, A., Meyerowitz, E.M., Wold, B.J., Mjolsness, E.D., Cellerator: extending a computer algebra system to include biochemical arrows for signal transduction simulations, Bioinformatics, 19(5): 677-678, 2003.
[6] Blinov, M.L., Faeder, J.R., Goldstein, B., Hlavacek, W.S., BioNetGen: software for rule-based modeling of signal transduction based on the interactions of molecular domains, Bioinformatics, 20(17): 3289-3291, 2004.
[7] Lok, L. and Brent, R., Automatic generation of cellular reaction networks with Moleculizer 1.0, Nat Biotechnol, 23(1): 131-136, 2005.
[8] Haunschild, M., Freisleben, B., Takors, R., and Wiechert, W., Investigating the dynamic behavior of biochemical networks using model families, Bioinformatics, 21(8): 1617-1625, 2005.
[9] Klipp, E., Liebermeister, W., Helbig, A., Kowald, A., and Schaber, J., Systems biology standards - the community speaks, Nat Biotechnol, 25: 390-391, 2007.
[10] Finney, A. and Hucka, M., Systems biology markup language: Level 2 and beyond, Biochem Soc Trans, 31: 1472-1473, 2003.
[11] Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., Arkin, A. P., Bornstein, B. J., Bray, D., Cornish-Bowden, A., et al., The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models, Bioinformatics, 19: 524-531, 2003.
[12] Schulz, M., Uhlendorf, J., Klipp, E., and Liebermeister, W., SBMLmerge, a system for combining biochemical network models, Genome Inform, 17(1): 62-71, 2006.
[13] Akaike, H., Information theory and an extension of the maximum likelihood principle, Selected Papers of Hirotugu Akaike, 1998.
[14] Wagenmakers, E.-J. and Farrell, S., AIC model selection using Akaike weights, Psychon Bull Rev, 11: 192-196, 2004.
[15] Burnham, K. and Anderson, D., Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer, 2002.
[16] Kell, D. B. and Mendes, P., The markup is the model: Reasoning about systems biology models in the semantic web era, J Theor Biol, 252(3): 538-543, 2008.
[17] Bornstein, B. J., Keating, S. M., Jouraku, A., and Hucka, M., LibSBML: An API library for SBML, Bioinformatics, 24(6): 880-881, 2008.
A FRAMEWORK FOR DETERMINING OUTLYING MICROARRAY EXPERIMENTS
RAYMOND WAN¹ ryan@kuicr.kyoto-u.ac.jp
ASA M. WHEELOCK² asa@para-docs.org
HIROSHI MAMITSUKA¹ mami@kuicr.kyoto-u.ac.jp
¹ Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, 611-0011, Japan
² Lung Research Lab L4:01, Respiratory Medicine Unit, Department of Medicine, Karolinska Institutet, 171 76 Stockholm, Sweden

Microarrays are high-throughput technologies whose data are known to be noisy. In this work, we propose a graph-based method which first identifies the extent to which a single microarray experiment is noisy and then applies an error function to clean individual expression levels. These two steps are unified within a framework based on a graph representation of a separate data set from some repository. We demonstrate the utility of our method by comparing our results against statistical methods by applying both techniques to simulated microarray data. Our results are encouraging and indicate one potential use of microarray data from past experiments.
Keywords: microarrays; distance-based outliers; data cleaning; simulated microarrays
1. Introduction
Microarrays are high-throughput technologies that allow researchers to determine the expression levels of thousands of genes simultaneously. Microarray experiments themselves are known to have problems with noise. The validity of a completed microarray experiment needs to be evaluated by the experimentalist while taking into account the monetary costs of producing the slide. This translates into a potential bias in their decision. While statistical techniques exist for determining the validity of a microarray slide across replicates, we present an alternative method which assesses a microarray experiment and optionally cleans individual expression levels by making use of past microarray data as a guide. We propose a framework which extends some earlier work [19] and allows experimentalists to determine the extent to which a microarray experiment t is an "outlier" by using other data R from some repository as a guide. This is in contrast to methods embodied within microarray acquisition software, which evaluate each microarray experiment in isolation. The external data could be from the same laboratory or from a public repository and is assumed to be "correct" or, at least, of sufficient quality to compare against. As a starting point to our work, we assume the repository data are replicates. This allows us to use statistical methods for replicates as baselines. The framework builds an undirected graph representation of R where each probe is a vertex and edges indicate probes with similar values across R. We apply this framework in two different ways. The first way scores a microarray experiment using techniques related to distance-based outlier detection [8]. The score of a microarray experiment is calculated from the number of probes which are similar in R, but differ in t. The second method employs an energy function E which cleans individual expression levels which were previously marked as outliers. We demonstrate our method with simulated microarray data created using the SIMAGE web service in order to give us better control over the type of data being used [1]. This paper is structured as follows. We provide some background to this problem and our method in Section 2. Then, in the next 3 sections, we discuss our framework, method for outlier detection, and method for probe cleaning. In Section 6, we describe statistical methods for one dimensional replicate data which can be used as a baseline. Experimental results using data sets compiled by SIMAGE are reported in Section 7. Section 8 summarizes our work and provides some future directions.

2. Background
2.1. Notation

The following notation for microarrays and graph theory is used in describing our methodology. A microarray platform specifies the m probes of every microarray slide based on it. A microarray slide is subjected to a specific condition to form a microarray experiment. In this work, we apply our method to two-channel cDNA microarrays. These microarrays employ two colored dyes for an experiment to distinguish between two different conditions (for example, a treatment and a control): Cy3 (green) and Cy5 (red). A researcher then forms a data set of n microarray experiments for a particular study. The n experiments in the data set typically vary with experimental conditions, consist of biological or experimental replicates, or combine both. The expression level at probe i, experiment j is denoted as $p_{ij}$, for $1 \le i \le m$ and $1 \le j \le n$. All of the expression levels for probe i are indicated as $p_i$, which is a vector of length n.

The purpose of this work is to assess a single experiment t in the context of a set of experiments R which are obtained from a private or public repository. These experiments are assumed to represent a "consensus", making them "more reliable" than the single experiment in question. In order to simplify the problem, we assume that all of the experiments are based on the same platform.

We adopt the method chosen by SIMAGE by subtracting the background of each probe spot for both of the channels to arrive at an expression level for each probe. If we denote the two channels as Cy3 and Cy5 and their backgrounds as Cy3bgr and Cy5bgr, respectively, then the background-subtracted log-ratio of the two channels for a probe $p_{ij}$ is:

$$p_{ij} = \log_2 \frac{\mathrm{Cy5}_{ij} - \mathrm{Cy5bgr}_{ij}}{\mathrm{Cy3}_{ij} - \mathrm{Cy3bgr}_{ij}}. \tag{1}$$
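As a small illustration of Equation (1), the following Python sketch computes the background-subtracted log-ratio with NumPy. The function name and the decision to mark spots at or below background as missing (NaN) are assumptions made for this example; they are not taken from SIMAGE or from our implementation.

```python
import numpy as np

def background_subtracted_log_ratio(cy5, cy5_bgr, cy3, cy3_bgr):
    """Equation (1): log2 of background-subtracted Cy5 over Cy3 intensity."""
    red = np.asarray(cy5, dtype=float) - np.asarray(cy5_bgr, dtype=float)
    green = np.asarray(cy3, dtype=float) - np.asarray(cy3_bgr, dtype=float)
    ratio = np.full(red.shape, np.nan)
    # Assumption: a spot whose signal does not exceed its background in
    # either channel has no defined log-ratio and is treated as missing.
    valid = (red > 0) & (green > 0)
    ratio[valid] = np.log2(red[valid] / green[valid])
    return ratio

# Two hypothetical spots: the second has Cy5 below its background.
p = background_subtracted_log_ratio(
    [1200.0, 300.0], [200.0, 350.0], [700.0, 400.0], [200.0, 100.0]
)
```

Here the first spot evaluates to log2(1000/500) and the second is reported as missing.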
Other forms of background correction are possible, including not subtracting at all. Further information about microarrays can be found in other sources [21]. The basis of our method is to form an undirected graph G(V, E) using R to act as a guide in assessing t. The vertex set V is formed from the set of all m probes. An edge between two vertices $v_i$ and $v_j$ indicates that the two probes have expression levels which are similar across R. Note that we refer to microarray expression values as either $v_i$ or $p_{ij}$, depending on whether the value is held in a vertex or belongs to experiment j in a data set of n experiments.
2.2. Related Work

Related topics include combining microarray data, statistical techniques for detecting outliers across replicates, distance-based outlier detection, and data cleaning.

As newer techniques for microarray analysis are made available, previously published data can be re-examined. The aggregation of multiple data sets from public repositories has been an active topic in recent years. In one study, researchers looked at pairs of co-expressed genes in 60 human data sets covering 3,924 microarray experiments across multiple platforms [11]. In their work, they looked at pairs of co-expressed genes in each individual data set and then compared these co-expressed links between data sets. Others have focussed on combining single-channel Affymetrix data [17], cancer classification using Support Vector Machines [20], and even determining the amount of variation between studies [3]. In contrast, our method assumes that the repository data R is "correct" and that a single experiment t is being evaluated. Of course, the quality of data in a repository varies since they can reside on an experimenter's computer as part of a past experiment or be in a public repository. Thus, we are assessing how similar t is to R without any claims on the reliability of R.

One method which more closely resembles ours involves the construction of a gene regulatory network [5]. Their gene regulatory network is depicted as a directed graph of probes, constructed using an algorithm dubbed "mode-of-action by network identification" (MNI). The direction of an edge indicates that one gene regulates another. This network is then applied to an experiment test set in order to determine which genes are associated with a particular drug treatment. Their method is an iterative procedure based on principal components regression. In comparison, our method asserts a weaker statement which simply says that two probes are "similar" to each other in terms of their expression profiles.
Even so, our assertion is sufficient for our needs since the two purposes differ; instead of relying on gene regulation, we are concerned with outlier detection.

Noise in microarray data is handled through various normalization techniques that range from operating at the probe-level up to the slide-level. Often, they adjust expression levels according to some distribution within the data set [12, 21]. In our work, we make a distinction between R and t since R guides the analysis and noise cleaning of t.

Distance-based outlier detection is a method of locating outliers in a database of records in order to find records of interest [8]. These records are not necessarily erroneous, but have characteristics that separate them from every other record. While the values in each record can be either continuous or discrete, a suitable distance function is required to handle whatever data types are used. For example, if all fields in the database are continuous, then the Euclidean distance between records is one option. Additional parameters are also required which dictate what is an outlier. At least three sets of parameters have been considered in the literature [2]: (1) Outliers are records for which there are fewer than p other examples within a distance d; (2) Outliers are the top n examples whose distance to the kth nearest neighbor is greatest; and (3) Outliers are the top n examples whose average distance to the k nearest neighbors is greatest. Regardless of which definition of outliers is employed, distance-based outlier detection requires that every record be compared with every other record. In our case, the repository R provides information that restricts which comparisons are performed. These restrictions are represented as the undirected graph.

The next step beyond detecting outliers or, more generally, "problematic" values, is to replace them. In the context of microarray data, data cleaning is synonymous with normalization. Outside of microarray data, more general data "polishing" has been investigated as an augmentation to the C4.5 decision tree algorithm [15, 18]. Others have constructed a probabilistic model of three components: a clean model, a noise model, and a data corruption model [9]. Our method of cleaning assumes that probes are related to each other according to the undirected graph G.
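To make the first of the three distance-based definitions in Section 2.2 concrete, the sketch below flags records that have fewer than p other records within distance d, using the Euclidean distance. It is only an illustration of that definition under the assumption of continuous fields; it is a naive all-pairs version and not the algorithm of [2] or [8]. The function name is hypothetical.

```python
import numpy as np

def distance_based_outliers(records, p, d):
    """Flag records with fewer than p other records within Euclidean distance d."""
    records = np.asarray(records, dtype=float)
    # All-pairs Euclidean distances: every record is compared with every
    # other record, which is what makes the naive approach quadratic.
    diff = records[:, None, :] - records[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    # Count neighbours within d, excluding the record itself (distance 0).
    neighbours = np.sum(dist <= d, axis=1) - 1
    return neighbours < p

flags = distance_based_outliers(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]], p=1, d=2.0
)
```

In this toy data set only the last record has no other record within distance 2, so it is the only one flagged.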
3. Framework
Our graph-based method is illustrated in Figure 1. The basis of our framework is the construction of an undirected graph G(V, E) from R. In this graph, each vertex is a probe from R. Undirected edges are added to E if two probes are similar to each other. Unlike work by others (for example, [11]), "similarity" refers to expression levels which are equivalent in value and not in expression patterns. This difference is due to our aim being outlier detection rather than knowledge discovery and our use of replicate experiments for R. If R is not composed of replicate experiments, then a different and potentially more relaxed measure would be required. As a result, distance measures are more appropriate than correlation ones and the one we have chosen is the Euclidean distance between two probes. The number of edges added to E is regulated by a distance threshold $d_t$. All edges whose weight (similarity) is less than $d_t$ are added to the graph. If $d_t = \infty$, then every node is connected to every other node. If the distance between two nodes cannot be calculated due to excessive missing values, then the distance is set to $\infty$.
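The construction of G from R can be sketched as follows. This is a minimal illustration, not our implementation: it assumes R has no missing values, it assumes that distances are scaled to [0, 100] by dividing by the largest pairwise distance (the text only states that edges are normalized to that range), and it uses a dense m x m matrix, which would be too large for the 11,664 probes used later. The name `build_graph` is hypothetical.

```python
import numpy as np

def build_graph(r, d_t):
    """Return a boolean m x m adjacency matrix for repository data r.

    r   : array of shape (m probes, n experiments)
    d_t : distance threshold on the normalized scale [0, 100]
    """
    r = np.asarray(r, dtype=float)
    # Euclidean distance between every pair of probe profiles across R.
    diff = r[:, None, :] - r[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    # Assumption: normalize by the maximum distance so values lie in [0, 100].
    max_dist = dist.max()
    if max_dist > 0:
        dist = 100.0 * dist / max_dist
    adj = dist < d_t              # an edge joins probes closer than d_t
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

# Three hypothetical probes over two experiments; only the first two are similar.
adj = build_graph([[1.0, 1.0], [1.1, 1.0], [5.0, 5.0]], d_t=10.0)
```

With a threshold of 10 on the normalized scale, only the first two probes are joined by an edge.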
Fig. 1. Illustration of our graph-based framework. At the top, we have the repository data R of 5 probes and 4 experiments. Below, we have the single experiment t to be evaluated, with the same 5 probes.
For the sake of convenience, we normalize all edges so that they lie within the range [0,100]. After G is built, we apply it to experiment t by inserting the values from t into G. The structure of G indicates which expression levels from t should be compared since each vertex has a neighborhood of vertices. In this work, we make use of the neighborhood of distance 1 from each vertex. If an edge does not exist between probes, then their values across R differ enough that they also should not be compared for t. This framework is used for outlier detection and probe cleaning.

4. Outlier Detection

Our application of distance-based outliers follows from the idea used for identifying outlying records in a database. The most significant difference is that we no longer compare every vertex (record) to every other vertex. Our approach is most similar to the first definition of distance-based outliers from Section 2.2 [8]. The combination of a distance d and a proportion p is replaced with two parameters: $d_t$, as described above, and $e_t$, which provides an explicit threshold indicating when the expression levels of two adjacent vertices differ enough so that one of them is labeled an outlier. The difference between adjacent vertices is again based on the Euclidean distance since we have numerical expression levels. That is, the distance $e_{ij}$ between two vertices i and j is:

$$e_{ij} = \sqrt{(v_i - v_j)^2}. \tag{2}$$

As with $d_t$, in order to bound $e_t$, we normalize all differences between vertices of t so that they are within the range [0,100]. An outlying probe is a probe for which the majority of the differences to its neighboring expression levels are greater than $e_t$. This cut-off may be too simplistic and in future work, an explicit proportion of expression levels less than $e_t$ ("near" neighbors) to ones greater than $e_t$ ("far" neighbors) may be required.
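The outlier rule of Section 4 can be sketched in Python as below. The sketch takes a boolean adjacency matrix for G and the expression levels of t, scales the differences along edges to [0, 100] by their maximum (an assumption, since the text does not state the normalization constant), and marks a probe as outlying when more than half of its neighbors are "far". The function name is hypothetical and the dense matrix is for illustration only.

```python
import numpy as np

def outlying_probes(adj, t, e_t):
    """Return a boolean flag per probe of experiment t.

    adj : boolean m x m adjacency matrix of G (built from R)
    t   : expression levels of the experiment under evaluation, length m
    e_t : expression threshold on the normalized scale [0, 100]
    """
    adj = np.asarray(adj, dtype=bool)
    t = np.asarray(t, dtype=float)
    diff = np.abs(t[:, None] - t[None, :])   # distance between scalar levels
    # Assumption: normalize by the largest difference found along an edge.
    edge_diffs = diff[adj]
    if edge_diffs.size and edge_diffs.max() > 0:
        diff = 100.0 * diff / edge_diffs.max()
    far = (diff > e_t) & adj                 # "far" neighbours only
    n_far = far.sum(axis=1)
    n_neighbours = adj.sum(axis=1)
    return n_far > n_neighbours / 2.0        # majority of neighbours are far

# Four mutually connected probes; the last one deviates in t.
adj = ~np.eye(4, dtype=bool)
flags = outlying_probes(adj, [1.0, 1.1, 0.9, 5.0], e_t=10.0)
percentage = 100.0 * flags.mean()            # score of the experiment
```

Only the fourth probe is flagged, giving a score of 25% outlying probes for this toy experiment.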
5. Probe Cleaning

Probes which have been marked as outliers can also be cleaned using an error function. We define an error function E based on the Euclidean distance between connected, adjacent nodes in G, as shown in Equation (3). This error function defines the energy of the graph as the sum over the difference in expression level of every pair of connected vertices. The higher the energy, the greater the difference between connected vertices. As each pair is counted twice, the energy is halved:

$$E = \frac{1}{2} \sum_{i}^{m} \sum_{j}^{m} (v_i - w_{ij} v_j)^2. \tag{3}$$

In order to minimize the error in expression levels, we first take the partial derivative of E with respect to some vertex $v_k$, where $1 \le k \le m$. Then, we set this equation to 0 and solve for $v_k$, leaving us with:

$$v_k = \frac{2 \sum_{i}^{m} w_{ki} v_i}{|N_k| + \sum_{i}^{m} w_{ki}^2}, \tag{4}$$

where $|N_k|$ is the size of the neighborhood of $v_k$. As we want the (local) minimum energy, all m equations are represented as:

$$v = A \cdot v + c. \tag{5}$$

In Equation (5), v is the solution vector and A is an m × m matrix. For the moment, c is a zero column vector. Thus, the coefficient for row i, column j in A is:

$$a_{ij} = \frac{2 w_{ij}}{|N_i| + \sum_{k}^{m} w_{ik}^2}. \tag{6}$$

Vertices which our previous method has labeled as outliers are the only values which are cleaned. If a vertex $v_i$ was not marked as an outlier, it is left unchanged and its corresponding row in matrix A can be removed. Furthermore, $v_i$ adds a constant to all remaining rows in the matrix. These constants are moved into the constant vector c. Solving these equations simultaneously gives the locally best expression levels as v.

If $w_{ij} = w_{ji}$ in Equation (3), then a local minimum of E is produced. Details are omitted here, but we use a Hessian matrix H of second order partial derivatives of E to show that the quadratic form $\frac{1}{2} v^{T} H v$ is positive for all vectors v [6]. This implies all eigenvalues must be positive and that $E'' > 0$. Our implementation makes use of LU-decomposition and back substitution routines [14] instead of Gaussian elimination since it is about three times faster and more numerically stable with respect to round-off errors [4].
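The cleaning step of Equations (4) to (6) can be sketched as a small linear solve. The sketch below is an illustration under stated assumptions rather than our implementation: the weight matrix `w` is assumed symmetric with a zero entry wherever G has no edge, the neighborhood size $|N_k|$ is taken as the number of non-zero weights in row k, and `numpy.linalg.solve` (an LU-based LAPACK routine) stands in for the Numerical Recipes routines [14]. Values of vertices that were not marked as outliers are moved into the constant vector c, as described above.

```python
import numpy as np

def clean_outliers(w, v, is_outlier):
    """Replace outlying expression levels by the solution of v = A v + c.

    w          : symmetric m x m weight matrix, zero where G has no edge
    v          : expression levels of t, length m
    is_outlier : boolean flags from the outlier detection step
    """
    w = np.asarray(w, dtype=float)
    v = np.asarray(v, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)

    n_k = np.count_nonzero(w, axis=1)            # |N_k|
    denom = n_k + np.sum(w ** 2, axis=1)         # denominator of Equation (6)
    has_nb = denom > 0
    a = np.zeros_like(w)
    a[has_nb] = 2.0 * w[has_nb] / denom[has_nb][:, None]

    out = np.flatnonzero(is_outlier & has_nb)    # rows that are solved for
    fixed = np.flatnonzero(~is_outlier)          # rows removed from A
    c = a[np.ix_(out, fixed)] @ v[fixed]         # constants from fixed vertices
    lhs = np.eye(out.size) - a[np.ix_(out, out)]

    cleaned = v.copy()
    if out.size:
        # Raises LinAlgError if a group of outliers has no fixed neighbour.
        cleaned[out] = np.linalg.solve(lhs, c)
    return cleaned

# Path graph 0 - 1 - 2 with unit weights; the middle probe is the outlier.
w = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
cleaned = clean_outliers(w, [1.0, 5.0, 3.0], [False, True, False])
```

With unit weights, Equation (4) reduces to the mean of the neighboring values, so the middle probe is replaced by the average of 1.0 and 3.0 while the other two probes are left unchanged.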
Fig. 2. A comparison of statistically-based outlier methods, panel (a), against our graph-based one, panel (b). Each of the two panels represents a microarray data set of replicates where each row is a probe and each column is an experiment. The black square represents the value being evaluated and the gray squares indicate the values used to make the evaluation.
6. Statistical Methods

As a baseline for microarray experiment scoring, statistical methods for one-dimensional data can be applied as usual for each probe. The difference is that no distinction is made between the experiments of R and t. These methods are applied to the combined data set $R \cup t$ on an expression level-by-expression level basis. Figure 2 illustrates how these statistical methods differ from our framework. In Figure 2(a), the grid represents the unified microarray data set of $R \cup t$ so that a row is a probe and a column is an experiment. The expression level being evaluated is shaded in black and the values which it is compared with are in gray. Statistical methods treat every experiment the same way and compare each expression level with the replicates within the same probe. In Figure 2(b), our method makes a distinction between R and t, as described earlier. Statistical methods perform a direct comparison while our framework constructs a graph using the shaded values of R and the evaluation is performed using the shaded values of t.

At least three types of statistical methods are at our disposal: (a) comparison against the inter-quartile range (IQR), (b) standardized scores (or Z-scores), and (c) the Q-test. The inter-quartile range is the range from the first to the third quartile. Values outside of this range are considered outliers. The Z-test calculates a standardized score or Z-score for each value $p_{ij}$ against the overall average and standard deviation for all replicates of $p_i$. The Z-score reports the number of standard deviations the expression level is from the mean $\mu_i$:

$$z_{ij} = \frac{p_{ij} - \mu_i}{\sigma_i}, \tag{7}$$

where $\sigma_i$ is the standard deviation of the replicates of $p_i$.

For both IQR and standardized scores, a cut-off is required to indicate either how many times the IQR or how many standard deviations from $\mu_i$ are accepted before labeling a value as an outlier. A larger cut-off yields a more conservative test. In the natural sciences, the Q-test compares each value to its nearest neighbor and the overall range of values according to some confidence interval (critical values according to a 90% confidence interval are shown in Table 1):
$$Q(p_{ij}) = \frac{|p_{ij} - (\text{closest value to } p_{ij})|}{\text{range}}. \tag{8}$$

Table 1. Critical values for the Q-test for a 90% confidence interval [16, pg. 35].

N     3     4     5     6     7     8     9     10
Q_c   0.94  0.76  0.64  0.56  0.51  0.47  0.44  0.41

Table 2. Simulated data sets created using SIMAGE.

Name  Probes  Experiments  Dye-swap  Random noise
D1    11,664  100          Yes       N(0, 0.219)
D2    11,664  100          No        N(0, 0.219)
D3    11,664  10           Yes       N(0, 0.500)
D4    11,664  10           No        N(0, 0.500)
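The three baseline tests of Section 6 can be sketched for the replicates of a single probe as follows. This is an illustrative sketch: the function names are hypothetical, the sample standard deviation (`ddof=1`) is an assumption, and the IQR rule is written in the common form that accepts values within `cutoff` times the IQR beyond the first and third quartiles, which is one reading of the cut-off described above. The critical values are those of Table 1.

```python
import numpy as np

# Critical values Q_c for a 90% confidence interval (Table 1).
Q_CRIT_90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56,
             7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}

def z_outliers(values, cutoff):
    """Equation (7): flag values more than `cutoff` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)  # assumption: sample std
    return np.abs(z) > cutoff

def iqr_outliers(values, cutoff):
    """Flag values further than cutoff * IQR outside the first/third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - cutoff * iqr) | (values > q3 + cutoff * iqr)

def q_outliers(values):
    """Equation (8): gap to the closest other value divided by the range."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(values.size, dtype=bool)
    spread = values.max() - values.min()
    if spread == 0:
        return flags
    q_crit = Q_CRIT_90[values.size]          # only defined for 3 to 10 values
    for j, x in enumerate(values):
        gap = np.min(np.abs(np.delete(values, j) - x))
        flags[j] = gap / spread > q_crit
    return flags

# Nine hypothetical replicates from R followed by one noisy value from t.
probe = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.11, 0.10, 0.12, 1.50]
z_flags = z_outliers(probe, 2.0)
iqr_flags = iqr_outliers(probe, 1.5)
q_flags = q_outliers(probe)
```

For this toy probe all three tests flag only the last value.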
7. Experiment Results

Both the statistical methods in the previous section and our framework were applied to simulated microarray data sets.
7.1. Simulated Microarray Data

We employed simulated microarray data to give us better control over our experiments. Several researchers have looked into creating simulated microarray data which are still "real" in the sense that they model real microarray data sets [1, 13]. The SIMAGE system is a publicly available web server^a which models various aspects of microarray data in a controlled way, including effects from spot pins, channels, and replication [1].

Four data sets were constructed using SIMAGE, as summarized in Table 2. SIMAGE has default parameters that were chosen through the modeling of a data set of 23 experiments [1]. These default values, which were left unchanged throughout our work, are not shown in this table. Every data set consists of 11,664 probes and either 100 or 10 experiments. Two data sets were dye-swapped (D1 and D3) and two were not (D2 and D4). As SIMAGE simulates real microarray data, the default parameters already introduce noise into the microarray data as a Gaussian distribution of N(0, 0.219). The first two data sets contained this level of noise; the remaining two have a larger standard deviation of 0.500.

Therefore, two sets of experiments are conducted. In the first set, we used simulated dye-swapped data and formed G using all of D1 and then applied the graph to the first 10 experiments of D1 and D3, where the ones in D3 are known to have more noise. In the second scenario, non-dye-swapped data is considered: D2 is used to form G, which is then applied to the first 10 experiments in D2 and D4.

^a URL: http://bioinformatics.biol.rug.nl/websoftware/simage/
Fig. 3. The framework for assessing our graph-based method.
7.2. Framework of Experiments

The framework of our experiments encompasses both the statistical tests and the use of our graph-based method. For the statistical tests, we combined 9 of the experiments from R with only one experiment known to have more noise to act as t since critical values for the Q-test are available for only up to 10 values (see Table 1). The aim is to determine how well statistical methods can isolate t.

As for our graph-based method, we evaluate outlier detection and probe cleaning together using the framework shown in Figure 3. The repository data R is used to construct a graph G by selecting a value for $d_t$. The graph is applied to t and the percentage of outlying probes is reported as the "initial" percentage using a fixed value for $e_t$. Afterwards, the probes are cleaned using the same graph structure. Next, the "final" percentage of outlying probes is reported using the same value for $e_t$. In addition, the first application of outlier detection is done for the first 10 experiments in R and averaged to act as a baseline.

The aim of our framework is to demonstrate the usefulness of our graph-based method in comparison to better-established statistical methods. In order to unify the comparison, the statistical methods also report a percentage indicating the number of probes which they deemed were outliers. The baseline for the statistical methods is the average percentage across the 9 experiments from R. This is compared to the single percentage obtained from evaluating the probes of the test set t.
7.3. Results

The results from our experiments are summarized in the graphs of Figure 4 for both simulated dye-swapped and non-dye-swapped data sets. Beginning with the dye-swapped data sets, Figure 4(a) and Figure 4(b) present the results for statistical methods and methods based on our framework. In both figures, the vertical axes indicate the percentage of probes that are marked as outliers. Along the horizontal axes is the parameter relevant to the method.

[Figure 4: four panels. Dye-swapped (D1 and D3): (a) statistical methods, (b) graph-based methods. Non-dye-swapped (D2 and D4): (c) statistical methods, (d) graph-based methods. Horizontal axes: "Parameter" for (a) and (c), "Expression threshold" for (b) and (d). Panels (a) and (c) plot the IQR, Z-test and Q-test for the averaged baseline and the test set; panels (b) and (d) plot the baseline, initial test set and final test set for distance thresholds of 3% and 10%.]

Fig. 4. Application of statistical methods ((a) and (c)) and our graph-based method ((b) and (d)) to simulated microarray data for dye-swapped and non-dye-swapped data sets.

Beginning with the statistical methods in Figure 4(a), it would seem that the IQR test performs better than the Z-test as there is a clear separation between the two graphs for the baseline and the test set. As expected, for both methods, the number of probes identified as outliers decreases as the parameter increases for all four graphs of the IQR test and Z-test since the test becomes more and more conservative. The Q-test is represented as horizontal lines since we have fixed our confidence interval at 90% and there is no other associated parameter. As the graphs show, the Q-test performs as well as the IQR test.

We evaluated our framework by using all of D1 to produce edges which are then normalized to within the range [0,100], as described earlier in Section 3. We applied G to D1 and D3 and averaged the percentages of outlying probes to produce the graph of Figure 4(b). The graph shows only two values of $d_t$: 3% and 10%. On the horizontal axes are 10 values for $e_t$: 1% to 10%. Each distance threshold is represented by three lines: the baseline (using D1), the initial test set (using D3), and then an outlier score again for D3 after applying the error function.

Generally, the results are as expected. The number of outlying probes is low for the baselines and there is a small increase as we move from $d_t$ = 3% to 10%. The initial percentages for the test set range from 40% to 60%, indicating that many expression levels are different from G. Cleaning the expression levels with the error function gradually lowers the number of outlying probes as $e_t$ increases, for both values of $d_t$. In Figure 4(b), note that the "final" line is slightly below the "initial" line. This indicates a possible problem with "over-cleaning" since the error function may be cleaning expression levels without bound.

Turning our attention to non-dye-swapped experiments with D2 and D4, the results mirror those of the dye-swapped experiments, indicating that no difference exists between applying our methods to dye-swapped or non-dye-swapped data.
8. Conclusion
We have introduced a graph-based framework for outlier detection and cleaning of a microarray experiment. Our aim is to first provide an overall score for the entire microarray to indicate the extent to which it is an outlying experiment. Optionally, expression levels can be cleaned using the same framework. The framework enables experimentalists to use existing repository data to indicate which expression levels in a single microarray test set should be checked or corrected. The repository data can come from the Internet, or ideally be from an earlier experiment from the same laboratory under similar experimental conditions.

We demonstrated our method on simulated microarray data from the SIMAGE web service by creating two experimental sets: one with dye-swap and one without. While the model used to create these data sets is based on real microarray experiments, in the future, we plan to examine the effect of varying the error in the expression levels and of having fewer experiments in R to build G. Our final aim is to apply our techniques to real microarray data, including using data from a more general microarray repository [7] to act as R.

While our experiments have demonstrated the utility of our method for outlier detection, cleaning with the error function of Equation (3) requires additional work as it has the potential to over-clean expression levels. We believe that an additional penalization term in our error function is required to control the amount of cleaning performed. In related work, researchers have looked into modeling multivariate data, such as microarray data, as a sum of layers for the purpose of clustering [10]. In their work, additional terms are used to avoid over-parameterization. In the future, we plan to follow a similar path of investigation to determine whether a similar solution would apply to our work as well.
A final extension to our work, which was indicated by our experimental results and Figure 2, is to combine statistical methods with our framework so that more expression levels can be used for verifying a data set.
Acknowledgements
We thank Dr. Seiya Imoto (University of Tokyo) for valuable scientific communication and Dr. Timothy Hancock (Kyoto University) for suggesting future directions for our work with the error function. RW was supported by a postdoctoral fellowship from the Japan Society for the Promotion of Science (JSPS). AMW was supported by an EU FP6 Marie Curie Fellowship. This work has been supported in part by BIRD of the Japan Science and Technology Agency (JST).

References

[1] Albers, C. J., Jansen, R. C., Kok, J., Kuipers, O. P., and van Hijum, S. A., SIMAGE: Simulation of DNA-microarray gene expression data. BMC Bioinformatics, 7(205), 2006. URL: http://bioinformatics.biol.rug.nl/websoftware/simage/.
[2] Bay, S. D. and Schwabacher, M., Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proc. 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 29-38, Washington, DC, USA, 2003.
[3] Choi, J. K., Yu, U., Kim, S., and Yoo, O. J., Combining multiple microarray studies and modeling interstudy variation. Bioinformatics, 19(Suppl. 1):i84-i90, 2003.
[4] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C., Introduction to Algorithms. The MIT Press, second edition, 2001.
[5] di Bernardo, D., et al., Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nature Biotechnology, 23(3):377-383, March 2005.
[6] Fletcher, R., Practical Methods of Optimization. John Wiley & Sons, Inc., second edition, 2000.
[7] Gardiner-Garden, M. and Littlejohn, T. G., A comparison of microarray databases. Briefings in Bioinformatics, 2(2):143-158, May 2001.
[8] Knorr, E. M., Ng, R. T., and Tucakov, V., Distance-based outliers: Algorithms and applications. Special Issue on the Best Papers of VLDB '98, VLDB Journal, 8(3-4):237-253, February 2000.
[9] Kubica, J. and Moore, A., Probabilistic noise identification and data cleaning. In X. Wu, A. Tuzhilin, and J. Shavlik, editors, Proc. 3rd IEEE International Conference on Data Mining, 131-138, Melbourne, Florida, USA, November 2003.
[10] Lazzeroni, L. and Owen, A., Plaid models for gene expression data. Statistica Sinica, 12:61-86, 2002.
[11] Lee, H. K., Hsu, A. K., Sajdak, J., Qin, J., and Pavlidis, P., Coexpression analysis of human genes across many microarray data sets. Genome Research, 14(6):1085-1094, 2004.
[12] Nadon, R. and Shoemaker, J., Statistical issues with microarrays: processing and analysis. TRENDS in Genetics, 18(5):265-271, May 2002.
[13] Nykter, M., Aho, T., Ahdesmäki, M., Ruusuvuori, P., Lehmussola, A., and Yli-Harja, O., Simulation of microarray data with realistic characteristics. BMC Bioinformatics, 7(349), 2006.
[14] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P., Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition, 1999.
[15] Quinlan, J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[16] Shoemaker, D. P., Garland, C. W., and Nibler, J. W., Experiments in Physical Chemistry. McGraw-Hill, fifth edition, 1989.
[17] Stevens, J. R. and Doerge, R. W., Combining Affymetrix microarray results. BMC Bioinformatics, 6(57), 2005.
[18] Teng, C. M., Polishing blemishes: Issues in data correction. IEEE Intelligent Systems, 19(2):34-39, March/April 2004.
[19] Wan, R., Mamitsuka, H., and Aoki, K. F., Cleaning microarray expression data using Markov Random Fields based on profile similarity. In Proc. 20th ACM Symposium on Applied Computing, 206-207, Santa Fe, New Mexico, USA, March 2005.
[20] Warnat, P., Eils, R., and Brors, B., Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6(265), 2005.
[21] Wit, E. and McClure, J., Statistics for Microarrays. John Wiley & Sons, Inc., 2004.
EXPLORING THE IMPACT OF OSMOADAPTATION ON GLYCOLYSIS USING TIME-VARYING RESPONSE-COEFFICIENTS

CLEMENS KÜHN¹ kuehn@molgen.mpg.de
ELZBIETA PETELENZ² elzbieta.petelenz@cmb.gu.se
JÖRG SCHABER¹ schaber@molgen.mpg.de
STEFAN HOHMANN² stefan.hohmann@gu.se
BODIL NORDLANDER² bodil.nordlander@cmb.gu.se
EDDA KLIPP¹ klipp@molgen.mpg.de

¹ Computational Systems Biology Group, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
² Department of Cell and Molecular Biology/Microbiology, University of Gothenburg, Box 462, SE-405 30 Goeteborg, Sweden
We present a model of osmoadaptation in S. cerevisiae based on existing experimental and theoretical work. In order to investigate the impact of osmoadaptation on glycolysis, this model focuses on the interactions between glycolysis and osmoadaptation, namely the production of glycerol and its influence on flux towards pyruvate. Evaluation of this model shows that, depending on initial relations between glycerol and pyruvate production, the increased glycerol production can have a substantial negative effect on the pyruvate production rate. Existing experimental data and a detailed analysis of the model lead to the suggestion of an interaction between activated Hog1 and activators of glycolysis such as Pfk26.
Keywords: osmoadaptation; yeast; Hog1; Pfk26; metabolic control analysis
1. Introduction
The yeast Saccharomyces cerevisiae is a unicellular eukaryote, frequently used as a model organism in biological research. Due to its relative simplicity and ease of cultivation, S. cerevisiae is a convenient system to study a wide range of physiological and biochemical features conserved among eukaryotes, for example signal transduction via MAP kinase pathways [22]. One of the processes controlled by MAPK signaling in yeast is osmoregulation, mediated by the HOG (High Osmolarity Glycerol)^a pathway. This pathway is activated in response to increased external osmolarity; it allows the cell to control its water and glycerol content, as well as other associated parameters like turgor pressure and cell volume [7].

^a Abbreviated molecule names mentioned in the text: DHAP: dihydroxyacetone phosphate, F16BP: fructose-1,6-bisphosphate, F26BP: fructose-2,6-bisphosphate, F6P: fructose-6-phosphate, Fbp26: fructose-2,6-bisphosphatase, Fps1: glycerol channel, G3P: glycerol 3-phosphate, GAP: glyceraldehyde 3-phosphate, Gpd1/2: NAD-dependent glycerol-3-phosphate dehydrogenase, Gpp1/2: DL-glycerol-3-phosphatase, Hog1: high osmolarity glycerol protein, Hog1PP: phosphorylated (active) form of Hog1, Pfk26: 6-phosphofructo-2-kinase, PFK: phosphofructokinase

The mechanisms governing adaptation after hyperosmotic shock in S. cerevisiae have been elucidated by a substantial amount of existing experimental data as well as theoretical analyses [6, 11, 14]. During hyperosmotic shock, the extracellular osmotic pressure increases rapidly, causing the cell to lose water and thus volume, which is regained during osmoadaptation. As mentioned above, the mechanisms underlying osmoadaptation include activation of the HOG signalling pathway, the closure of the Fps1 glycerol channel and increased production of Gpd1, an enzyme crucial for glycerol production. The resulting increase in glycerol concentration raises the intracellular osmotic pressure, counterbalancing the increase in external osmotic pressure. Water, and with it volume, is regained.

Since glycerol is a byproduct of glycolysis, we consider here the interactions and effects that glycolysis might have on osmoadaptation and vice versa. For our theoretical analysis, we construct a model of ordinary differential equations (ODEs) based on previous models concerning adaptation to hyperosmotic stress [11] and modeling of glycolysis in yeast [9, 18, 20]. We analyze our resulting model using an extension of metabolic control analysis (MCA) that enables computation of sensitivities of species concentrations to parameter variations in a time-dependent manner, namely time-varying response coefficients (RCs) [10]. We use these time-varying RCs to characterize which reactions have major contributions to glycerol and pyruvate production in a time-dependent manner during osmoadaptation.

Simulation results of the model predict that osmoadaptation, with its increased production of glycerol, can significantly reduce pyruvate production (used here as a proxy for ATP production). Although this observation seems straightforward from the topology of the model given in Fig. 1, there are no experimental data available suggesting ATP depletion as a side effect of osmoadaptation. Ölz et al. [15] show that S. cerevisiae grown under saline conditions requires considerably more energy than when grown on basal medium. Assuming, therefore, that pyruvate production might be maintained during osmoadaptation leads us to investigate possibilities to counteract this reduction in pyruvate production. The time-varying RCs show that the reaction F6P → F16BP is one of the most effective reactions in controlling pyruvate concentration and experimental results show that it is indeed associated with a protein known to be regulated by Hog1, namely Pfk26 [4, 5]. Incorporation of this interaction into the model results in a stabilized pyruvate concentration and an accelerated adaptation to hyperosmotic shock.

2. Methods
2.1. Details of the model

The model presented here is a simplification of the model presented in Klipp et al. [11], with additional modifications to the glycolysis module according to existing models of glycolysis. See Fig. 1 for an overview of the model topology. A list of all differential equations, initial conditions and parameter values can be found in the supplementary data.

Fig. 1. Topology of the model described. Solid arrows indicate reactions, solid lines ending in filled circles indicate activation. Densely dashed arrows show positive influences, densely dashed lines ending in diamonds indicate negative inputs to a variable. Volume and turgor are combined because they are tightly interconnected. The loosely dashed arrows indicate that glycerol contributes to the intra- or extracellular osmolytes. The two modules of the model as described in the text are indicated by dashed rectangles. The exact allocation of a reaction to either of the modules might be ambiguous. Reaction $v_{12}$, for example, is part of both modules during parameter estimation.

Here, we describe the major changes relative to the previous model of osmoadaptation in yeast. The phosphorelay module as well as the MAPK module have been removed. Instead, Hog1 is transformed to Hog1PP depending on $s(t)$, where
$$\frac{ds(t)}{dt} = k_s \cdot \left(1 - \frac{p(t)^{h_s}}{t_s^{h_s} + p(t)^{h_s}}\right) - k_s \cdot s(t) \qquad (1)$$
with turgor pressure $p(t)$, the parameter $k_s$ controlling the velocity of changes in $s(t)$, the threshold $t_s$, i.e. the value of $p$ at which the Hill function is at half its maximal value, and the Hill coefficient $h_s$. Activation and inactivation of Hog1 follow simple mass action kinetics. This significantly reduces model complexity, while the extent of Hog1 activation still resembles experimental data (Fig. 3). The computation of the turgor pressure has been modified according to Schaber and Klipp [19]:

$$p(t) = \begin{cases} \epsilon \cdot \ln \dfrac{V(t)}{V_{p=0}} & \text{if } V(t) > V_{p=0} \\ 0 & \text{else} \end{cases} \qquad (2)$$
with cell volume $V(t)$, $\epsilon$ as a measure of membrane elasticity and $V_{p=0}$ the volume at which turgor pressure becomes zero. Modeling the nucleus as an individual compartment was omitted for reasons of simplicity. Transcription and translation are considered here only for GPD1. As only relative experimental data on Hog1 activity exist, we chose to set the initial concentration of inactive Hog1 to 1 and adjusted the parameter values of the model accordingly. Fps1 is not explicitly modeled as concentrations of an open and a closed form but as the relative amount of open Fps1 channels, $\mathrm{Fps1}_o$, as a measure of conductivity, an expression again based on the Hill equation:

$$\frac{d\,\mathrm{Fps1}_o(t)}{dt} = k_{v16} \cdot \frac{p(t)^{h_{v16}}}{t_{v16}^{h_{v16}} + p(t)^{h_{v16}}} - k_{v17} \cdot \mathrm{Fps1}_o(t) \qquad (3)$$
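To make the behavior of Eq. (3) concrete, the following minimal sketch integrates the open-channel fraction with a forward Euler step at constant turgor pressure. It is an illustration only: the function names `hill`, `fps1_rhs` and `simulate_fps1` and all numerical values are assumptions made for this example, not the fitted parameters of the model (those are listed in the supplementary data).

```python
def hill(p, threshold, h):
    """Hill function p^h / (threshold^h + p^h); returns 0 for p <= 0."""
    if p <= 0.0:
        return 0.0
    return p**h / (threshold**h + p**h)


def fps1_rhs(fps1_open, p, k_v16, k_v17, t_v16, h_v16):
    """Right-hand side of Eq. (3): turgor-driven opening minus first-order closing."""
    return k_v16 * hill(p, t_v16, h_v16) - k_v17 * fps1_open


def simulate_fps1(p, fps1_0, k_v16, k_v17, t_v16, h_v16, t_end, dt=0.01):
    """Forward-Euler integration of Eq. (3) at constant turgor pressure p."""
    fps1_open = fps1_0
    for _ in range(int(round(t_end / dt))):
        fps1_open += dt * fps1_rhs(fps1_open, p, k_v16, k_v17, t_v16, h_v16)
    return fps1_open


# Hypothetical parameter values, chosen only for illustration.
params = dict(k_v16=1.0, k_v17=1.0, t_v16=0.5, h_v16=4)

# Relax to the pre-stress state at high turgor, then drop the turgor pressure.
open_before = simulate_fps1(p=1.0, fps1_0=0.0, t_end=20.0, **params)
open_after = simulate_fps1(p=0.1, fps1_0=open_before, t_end=20.0, **params)
```

Under these assumed values the open fraction relaxes towards $(k_{v16}/k_{v17})$ times the Hill term, so a drop of the turgor pressure below the threshold $t_{v16}$ closes most of the channels.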
We furthermore reduced the two reactions DHAP -> G3P -> Glycerol, catalyzed by Gpd1/2 and Gpp1/2, respectively, to one reaction DHAP -> Glycerol described with simple mass action kinetics. Although Cronwright et al. [3] describe the regulation of Gpd1 in detail, we chose to simplify the regulation of Gpd1 here since most regulators are kept constant in this model. To refine the glycolysis module of [11], we examined the individual rate laws for all reactions in three published models of yeast glycolysis [9, 18, 20] and checked each rate law for the underlying reasoning and its applicability here. For an overview of the individual rate laws, please refer to the additional material. In order to refine the model, we also consider an additional metabolite, F26BP, a glycolytic intermediate produced by the reaction F6P -> F26BP ($v_{10}$ here), catalyzed by Pfk26 [12], and degraded again by the reaction F26BP -> F6P ($v_{11}$), catalyzed by Fbp26 [16]. F26BP is an activator of PFK (catalyzing $v_5$), and Pfk26 is reportedly activated by Hog1PP [5]. The activation by Hog1PP is incorporated following Michaelis-Menten kinetics for F6P -> F26BP, modified to include two different $K_m$ values, one for the Hog1PP-activated form of Pfk26 and one for the basal activity:

$$v_{10}(t) = \frac{k_{v10vmax} \cdot \mathrm{F6P}(t)}{\mathrm{F6P}(t) + \dfrac{\mathrm{Hog1PP}(t)}{\mathrm{Hog1PP}(t) + k_{v10k}} \cdot k_{v10Km2} + \dfrac{k_{v10k}}{\mathrm{Hog1PP}(t) + k_{v10k}} \cdot k_{v10Km1}} \qquad (4)$$
where the activity of Pfk26 is indicated by the fraction $\frac{\mathrm{Hog1PP}(t)}{\mathrm{Hog1PP}(t) + k_{v10k}}$. This fraction is multiplied by $k_{v10Km2}$, a lower $K_m$ value than $k_{v10Km1}$, which corresponds to the inactive form. The backward reaction, catalyzed by Fbp26, is modeled using simple mass action kinetics. Because the concentration of F26BP is very low (0.00014 mM before stress, 0.0002 mM after activation by Hog1PP) compared to the concentrations of both F6P (0.165 mM initially) and F16BP (0.425 mM initially), activation of F26BP production does not decrease F16BP formation by redirection of reaction flux. In contrast to the underlying models, this model does not contain dynamic concentrations of ADP, ATP, AMP, NAD and NADH, for simplicity. The SBML model was created and modified using Copasi [8], version 4.4 (build 25), which enables the integration of volume changes in the formulation of the SBML model.
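The rate law of Eq. (4) can be read as ordinary Michaelis-Menten kinetics with an effective $K_m$ that moves from the basal value towards the lower, activated value as Hog1PP rises. The sketch below illustrates this reading. It assumes that the basal $K_m$ is weighted by the complement of the active fraction, and all parameter values except the initial F6P concentration (0.165 mM, from the text) are hypothetical.

```python
def effective_km(hog1pp, k_v10k, km_basal, km_active):
    """Weighted Km in the denominator of Eq. (4).

    The active fraction Hog1PP / (Hog1PP + k_v10k) weights km_active
    (k_v10Km2 in the text); the remainder weights km_basal (k_v10Km1).
    """
    active = hog1pp / (hog1pp + k_v10k)
    return active * km_active + (1.0 - active) * km_basal


def v10(f6p, hog1pp, vmax, k_v10k, km_basal, km_active):
    """Michaelis-Menten rate of F6P -> F26BP with a Hog1PP-dependent Km."""
    return vmax * f6p / (f6p + effective_km(hog1pp, k_v10k, km_basal, km_active))


# Hypothetical parameters; only F6P = 0.165 mM is taken from the text.
kinetics = dict(vmax=1.0, k_v10k=0.5, km_basal=1.0, km_active=0.1)
rate_basal = v10(f6p=0.165, hog1pp=0.0, **kinetics)
rate_active = v10(f6p=0.165, hog1pp=1.0, **kinetics)
```

Because the activated $K_m$ is the smaller of the two, the rate at a fixed F6P concentration increases with Hog1PP in this sketch, which is the qualitative effect the text attributes to Pfk26 activation.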
2.2. Experimental data and parameter estimation

The experimental data used to fit the parameters of the osmoadaptation part have mostly been obtained from [11]. Experimental data on glycolysis have been extrapolated from the models noted before [9, 18, 20]. Due to the large number of different yeast strains combined with the vast number of possible experimental settings (e.g. choice of medium and aerobicity), the differences in metabolite concentrations between experiments can be immense. We take the set of metabolite concentrations mainly from [18]. Although these concentrations were also obtained from anaerobic experiments, the cytosolic free NAD and NADH concentrations agree better with recent measurements [2] than the measurements from [9, 20]. Please refer to Supplementary Table 1 for the choice of the values and a comparison to other experimental data. After sensible initial metabolite concentrations were set, the reaction parameters of both the glycolysis module (for parameter estimation, this module contains reactions $v_1$ to $v_{12}$ and a glycerol-degrading reaction following mass action kinetics) and the HOG module (including reactions $v_{12}$ to $v_{21}$ and the dynamics of volume and turgor pressure in this context) had to be modified in order to create a steady state of the system before application of the stress and to reproduce the experimental data on osmoadaptation. As Teusink et al. [20] have pointed out, enzyme properties measured in vitro are not necessarily directly applicable to mathematical models. In order to find a set of parameters that generates the desired concentrations, we resorted to parameter estimation, although the experimental data are too sparse to allow for the identification of a unique set of parameters satisfying the experimental data.

When estimating a large number of parameters for a system, it can considerably speed up the computation to divide the model into subsystems that are joined after a lightweight parameter estimation task has been applied to each of the subsystems [13]. We accordingly divided the model into the glycolysis and HOG modules, which could be joined after the parameters were fitted. The glycolysis module does not contain a variable cell volume and could be fitted using SBML-PET [23]. The parameters of the HOG module were fitted by hand using the results of previous models.
2.3. Computation of time-varying response coefficients

Response coefficients (RCs) are a standard measure of MCA, indicating the sensitivity of steady state concentrations to infinitesimally small changes in some parameter. In order to capture the sensitivities during the course of osmoadaptation, we employ an extension of this notion proposed by Ingalls and Sauro [10]. Scaled time-varying response coefficients $R^s_q(t)$ are defined as

$$R^s_q(t) = \frac{q}{s} \left. \frac{\partial s(t, q)}{\partial q} \right|_{q = q_0} \qquad (5)$$

and measure the sensitivity of a substance concentration $s$ to an infinitesimally small variation in the set of initial conditions $s_0$ and parameters $p_0$, where $q_0 = s_0 \cup p_0$. The concentration of the species $s$ is a function of time $t$ and $q$, $s(t, q)$. These time-varying response coefficients are scaled by the factor $\frac{q}{s}$ and are computed together with the trajectory. Scaled time-varying RCs have been computed for all parameters in the reaction network using Wolfram Mathematica version 6.0.2 [17].

3. Results and Discussion
3.1. Simulation of model

The model including Pfk26 qualitatively reproduces known experimental data on the adaptation to hyperosmotic shock. Fig. 3 shows the key components of osmoadaptation after a hyperosmotic shock with 0.5 M NaCl. The conductivity of the glycerol channel Fps1 is rapidly reduced upon shock; Hog1 is phosphorylated to its active form and triggers transcription of GPD1 mRNA. Hog1 phosphorylation peaks 250 seconds (4.2 minutes) after the application of the shock and then declines again. The concentration of GPD1 mRNA transiently rises to a peak about 1800 seconds (30 minutes) after shock, while the Gpd1 concentration reaches its maximum more than 3000 seconds (50 minutes) after the shock. The glycerol concentration shows a sharp initial rise due to the decrease in volume and closure of Fps1, followed by an increase due to Gpd1-dependent production, and saturates about 2500 seconds (42 minutes) after the shock. The cell volume (not shown) rapidly decreases to about 70% of its initial value, after which it increases again as the intracellular glycerol concentration increases. Our model does not lead to perfect adaptation, but perfect adaptation can be achieved with the described model using different parameter settings. We do not enforce perfect adaptation in this model because existing experimental data do not give a clear and unambiguous answer to the question whether perfect adaptation is achieved for this amount of stress.

The metabolites involved in glycolysis remain in steady state without osmotic shock. After application of the shock and adaptation, the system switches to a new steady state, as given in Supplementary Table 1. For a model without activation of Pfk26 (NoPfk26), this evaluation of metabolic intermediates reveals a decrease in many metabolite concentrations following the osmotic shock, which is a result of the increased drain caused by increased glycerol production. The severity of this effect crucially depends on the initial balance of glycerol production and pyruvate production, expressed as the ratio of the pyruvate-producing flux to the glycerol-producing flux $v_{12}$. A comparison between the osmoadaptation of three different models is given in Fig. 2 and Supplementary Table 1. Depicted in the figure are three different models, NoPfk26, Glycerol and Pfk26. Model Glycerol is derived from NoPfk26, but the parameters of the glycolysis module have been changed to increase $v_{12}$ by a factor of 10 without changing the metabolite concentrations. Model Pfk26 is the model including activation of Pfk26, as described in Section 2 and depicted in Figures 1 and 3 to 6. While model Pfk26 results in an increase in glycerol and pyruvate after osmotic shock, the glycerol concentration in model NoPfk26 rises more slowly and the pyruvate concentration decreases. In model Glycerol, the chosen reaction parameters do not allow for a sufficient increase in glycerol concentration, and thus the volume cannot be regained. The increased flux towards glycerol leads to a strong decrease in pyruvate production, and both concentrations eventually level out.
Fig. 2. Simulation results for glycerol (left) and pyruvate (right) concentrations over time (seconds) during osmoadaptation using different models. The solid line refers to model Pfk26, the thick dotted line refers to model NoPfk26 and the thin dotted line refers to model Glycerol. The different models are discussed in the text; stress is applied at t = 100.
A decrease in metabolite concentrations during osmoadaptation has not yet been detected in experimental data. Although it is arguable whether pyruvate and other metabolite concentrations need to be constant during adaptation to osmotic shock, the experimental data underlying the activation of Pfk26 suggested its incorporation into the model. This mechanism is also indicated by the drastic effects of a low ratio of pyruvate production to glycerol production, as shown in Fig. 2. In order to support the incorporation of Pfk26, we resorted to the time-varying RCs, as described later, to detect alterations in the reaction network to which the pyruvate concentration is highly sensitive. Together with $v_2$ and $v_4$, $v_5$ could be identified as a susceptible target for regulation in order to increase glycolytic flux, because parameters of these reactions have the greatest RCs on pyruvate concentration during osmotic shock. The activation of Pfk26 as described in [5] would increase flux through reaction $v_5$ because Pfk26 is one of two isoenzymes that catalyze the reaction F6P -> F26BP [1] (here $v_{10}$). The role of F26BP has been discussed in Methods, as well as the kinetics used to incorporate F26BP into the model. As F26BP is an activator of PFK, an increase in F26BP can increase glycolytic flux upstream of the branch dividing pyruvate and glycerol production, thus increasing flux to both reactions. Simulations of the model Pfk26, which includes activation of Pfk26, result in a different steady state after adaptation to osmotic shock than simulations without this activation, as given in Supplementary Table 1 and Fig. 2, showing significantly increased instead of reduced metabolite concentrations. Furthermore, in the model including F26BP, the time osmoadaptation takes is reduced and glycerol production is increased. Although the initial intention was to stabilize the pyruvate concentration, we accept the resulting increase in pyruvate concentration for the moment, until new experimental data become available.
Fig. 3. Experimental data and simulation results for the model without Pfk26 activation, plotted over time (seconds). Lines with points show experimental data, smooth lines show simulation results. Clockwise, from top left: Hog1PP, GPD1 mRNA, glycerol, Gpd1. Experimental data from [11].
3.2. Time-varying response coefficients

In the following, we present selected time-varying response coefficients for pyruvate and glycerol for the system with integrated Pfk26 activation. Positive values of $R^x_y(t)$ indicate that parameter $y$ has a positive effect on concentration $x$ at time point $t$, i.e. an increase in $y$ would increase $x$. Negative values of $R^x_y(t)$ indicate that an increase in $y$ would lead to a decrease in $x$ at time point $t$. Using time-varying RCs, we are able to discriminate between temporary effects, where $R^x_y(t)$ deviates from 0 for a very short time, and long-lasting effects, where $R^x_y(t)$ is significantly positive or negative for a longer period of time. Below we discuss and interpret the values of the individual RCs.

The RCs for glycerol and for pyruvate are similar. Generally, we observe for both glycerol and pyruvate that the RCs for parameters involved in the HOG module (containing reactions $v_{14}$ to $v_{21}$) are smaller than those for parameters involved in the glycolysis module (reactions $v_1$ to $v_{13}$) by about one order of magnitude. This is due to the fact that glycolytic parameters control the net glycolytic flux, and thus the concentration of metabolites, even in the absence of HOG signalling. The response coefficients for the $V_{max}$ parameters of Michaelis-Menten kinetics are the largest, except for the reversible reaction $v_4$, for which the RC of the equilibrium constant is the largest.
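The sign and time course of a scaled RC as defined in Eq. (5) can be illustrated on a toy system. The sketch below is not the procedure used in this work (where the RCs are computed together with the trajectory, following [10]); it swaps in a plain central finite difference over repeated simulations, which is simpler but less efficient. The one-species model, its parameter names `k_prod` and `k_deg`, and all values are hypothetical.

```python
import math


def rk4(rhs, y0, t_end, n_steps, params):
    """Classical fourth-order Runge-Kutta for a scalar ODE dy/dt = rhs(y, params)."""
    h = t_end / n_steps
    y = y0
    for _ in range(n_steps):
        k1 = rhs(y, params)
        k2 = rhs(y + 0.5 * h * k1, params)
        k3 = rhs(y + 0.5 * h * k2, params)
        k4 = rhs(y + h * k3, params)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


def toy_rhs(s, params):
    """Toy species with constant production and first-order degradation."""
    return params["k_prod"] - params["k_deg"] * s


def scaled_rc(rhs, y0, t_end, params, name, rel_step=1e-4, n_steps=2000):
    """Central finite-difference estimate of (q / s) * ds(t, q)/dq at q = q0."""
    q0 = params[name]
    dq = rel_step * q0
    up = dict(params, **{name: q0 + dq})
    down = dict(params, **{name: q0 - dq})
    s_ref = rk4(rhs, y0, t_end, n_steps, params)
    s_up = rk4(rhs, y0, t_end, n_steps, up)
    s_down = rk4(rhs, y0, t_end, n_steps, down)
    return q0 / s_ref * (s_up - s_down) / (2.0 * dq)


def analytic_rc_deg(t, k_deg):
    """Closed-form scaled RC of the toy model with respect to k_deg, for s(0) = 0."""
    x = k_deg * t
    return -1.0 + x * math.exp(-x) / (1.0 - math.exp(-x))


params = {"k_prod": 2.0, "k_deg": 0.5}  # hypothetical values
rc_prod_early = scaled_rc(toy_rhs, 0.0, 0.5, params, "k_prod")
rc_deg_early = scaled_rc(toy_rhs, 0.0, 0.5, params, "k_deg")
rc_deg_late = scaled_rc(toy_rhs, 0.0, 40.0, params, "k_deg")
```

In this toy system the RC for the production constant stays at 1 throughout, whereas the RC for the degradation constant starts near 0 and only approaches its steady-state value of -1 over time. A steady-state analysis alone would report only the latter value, which is the kind of distinction between temporary and long-lasting effects discussed above.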
Fig. 4. RCs for glycerol concentration for selected parameters of the HOG module, plotted over time (seconds). Osmotic stress is applied at t = 0. Markers for each curve are inserted for distinction at arbitrary intervals. $t_{v16}$ and $t_s$ are the parameters giving the thresholds of turgor change that cause Fps1 and Hog1 changes. $k_s$ determines the speed of Hog1 activation, $k_{v13}$ the rate of glycerol transport, $k_{v19}$ the rate of mRNA degradation, $k_{v14}$ and $k_{v15}$ are the rate constants for Hog1 phosphorylation/dephosphorylation and $k_{v12}$ is the rate constant used in glycerol production. For detailed kinetics see the model in supplementary data.
The RCs for glycerol concentration and HOG module parameters, $R^{\mathrm{glycerol}}_{\mathrm{HOG}}(t)$, are distinguishable into early and late parameters, as can be seen in Fig. 4. $k_{v14}$ (determining the speed of Hog1 activation) and $k_{v15}$ (determining the speed of Hog1PP inactivation) have a stronger impact before 1000 seconds (16.7 minutes) after shock. Once the maximal activation is achieved, small changes in these parameters no longer have a great impact. In contrast, $t_{v16}$ (the threshold parameter in Fps1 closure), $k_{v12}$ (involved in glycerol production) and $k_{v13}$ (involved in glycerol transport) have greater impact later than 1000 seconds after the onset of stress. $t_s$, the threshold parameter controlling activation of Hog1, has a high impact before 1000 seconds after shock; $R^{\mathrm{glycerol}}_{t_s}(t)$ then declines but rises almost to its previous maximum again 2500 seconds (41.7 minutes) after shock. This delayed impact of the threshold parameters is connected with the magnitude of the volume change, since the associated turgor pressure initially declines very rapidly past the threshold parameters. As the volume is regained, turgor pressure approaches the threshold parameters again, thereby increasing their impact on the system.

As for parameters of the HOG module, parameters involved in glycolysis have virtually no influence on glycerol concentration before the onset of shock. After application of the osmotic shock, all $R^{\mathrm{glycerol}}_{\mathrm{glycolysis}}(t)$ show transient increases during the phase of Hog1 activity and volume regulation, but once the cell volume is stabilized, the $R^{\mathrm{glycerol}}_{\mathrm{glycolysis}}(t)$ decline again. The parameters involved in $v_2$, $v_3$ and $v_4$ exhibit the highest RCs during osmoadaptation.
Fig. 5. RCs of pyruvate concentration for selected parameters of the HOG module, plotted over time (seconds). Osmotic stress is applied at t = 0. $k_{v19}$ determines the rate of mRNA degradation, $k_{v20}$ that of Gpd1 translation, $k_{v14}$ and $k_{v15}$ determine the speed of Hog1 phosphorylation and dephosphorylation, respectively. $k_{v13}$ is involved in glycerol transport and $k_{v21}$ is the rate constant for Gpd1 degradation. Detailed kinetics can be found in the model in supplementary data. Markers for each curve are inserted for distinction at arbitrary intervals.
The $R^{\mathrm{pyruvate}}_{\mathrm{HOG}}(t)$ before the onset of the shock show a high positive value for $R^{\mathrm{pyruvate}}_{k_{v14}}(t)$ and a high negative value for $R^{\mathrm{pyruvate}}_{k_{v15}}(t)$, due to the increase in glycolytic flux through Hog1-dependent Pfk26 activation. This activation of Pfk26 seems to outperform the Hog1-dependent increase in Gpd1 concentration for low levels of activation. After stress is applied, these Hog1-regulating RCs vanish and $R^{\mathrm{pyruvate}}_{k_{v20}}(t)$, $R^{\mathrm{pyruvate}}_{k_{v18}}(t)$ and $R^{\mathrm{pyruvate}}_{k_{v19}}(t)$ (all three regarding production/degradation of Gpd1 or GPD1 mRNA) as well as $R^{\mathrm{pyruvate}}_{k_{v13}}(t)$ (glycerol transport) rise sharply. During the course of osmoadaptation, $R^{\mathrm{pyruvate}}_{k_{v21}}(t)$ slowly increases, as does $R^{\mathrm{pyruvate}}_{k_{v13}}(t)$. This is caused by the redirection of glycolytic flux to glycerol production caused by high levels of Gpd1. This behavior is depicted in Fig. 5.
Fig. 6. RCs of pyruvate concentration for selected parameters of glycolysis, plotted over time (seconds); osmotic stress is applied at t = 0. Markers for each curve are inserted for distinction at arbitrary intervals. The parameters shown are the most relevant parameters of the reactions with the associated number in Fig. 1. $k_{v10Km1}$ and $k_{v10Km2}$ are the two different $K_m$ values used in the activation of Pfk26. $k_{v4eq}$, the equilibrium constant of $v_4$, was chosen here since it generally has a greater time-varying RC than $k_{v4vmax}$, the corresponding rate constant. Detailed kinetics of each reaction are given in the model in supplementary data.
The strongest $R^{\mathrm{pyruvate}}_{\mathrm{glycolysis}}(t)$ are shown in Fig. 6. Since alterations in these parameters change glycolytic flux even in the absence of HOG signaling, the pyruvate concentration is always sensitive to these parameters. The sensitivity of the pyruvate concentration to glycolytic parameters generally decreases during osmoadaptation but rises again after adaptation is achieved. One exception is $R^{\mathrm{pyruvate}}_{k_{v10vmax}}(t)$: $k_{v10vmax}$ increases glycolytic flux during osmoadaptation. This RC indicates that the rate of $v_{10}$ has the greatest impact on pyruvate concentration during osmoadaptation. The RCs for the two $K_m$ values for the different states of Pfk26, $R^{\mathrm{pyruvate}}_{k_{v10Km1}}(t)$ and $R^{\mathrm{pyruvate}}_{k_{v10Km2}}(t)$, show a switch-like behavior during osmoadaptation: before the osmotic shock, $R^{\mathrm{pyruvate}}_{k_{v10Km1}}(t)$ is greater, but as Pfk26 is activated by Hog1, $R^{\mathrm{pyruvate}}_{k_{v10Km2}}(t)$ rises as the role of $k_{v10Km1}$ decreases. This switch in impact of the two parameters is not transient. The detection of this switch-like behavior exemplifies the advantage of time-varying response coefficients compared to standard response coefficients. The changes in response coefficients observed in this analysis also suggest that there need not be one exclusive rate limiting reaction in a metabolic pathway, but that this property is distributed over all reactions and the specific values might heavily depend on the state of the cell.
4. Conclusion

We have revisited existing models of osmoadaptation in order to focus on the impact of individual parameters on glycerol production and on the balance between glycerol and pyruvate production. We incorporated new experimental evidence on Pfk26 and performed a systematic exploration of the model behavior. The refinement of the model resulted in the following predictions: activation of Pfk26 by Hog1PP leads to a significantly increased glycolytic flux during osmoadaptation, and osmoadaptation is decelerated in the case of a Pfk26 knockout. Although the exact extent of the increase in glycolytic flux needs to be determined experimentally, the model shows that activation of Pfk26 by Hog1PP has a crucial role in maintaining a stable pyruvate level, or even increasing this level, during osmoadaptation. This provides the cell with additional energy during adaptation and protects it from starvation. By increasing the glycolytic flux, Pfk26 also has substantial influence on the rate of osmoadaptation because it provides the reaction DHAP -> G3P with an increased substrate concentration. The importance of this effect might vary under different conditions: if the initial glycolytic flux is high, maintenance of the pyruvate concentration might be more crucial than an increase in glycerol production due to increased glycolytic flux. If the glycolytic flux is initially low, an increase in glycerol production in addition to the Gpd1-mediated increase might be favorable. Using this rather simple mechanism, the yeast cell gains both a boost in adaptation speed and energy supply under stress conditions.

The RCs for this system do not show one rate limiting reaction for glycolytic flux. Contrary to popular opinion, parameters of three reactions ($v_2$, $v_3$ and $v_4$) have equally large absolute values for their RCs on pyruvate concentration, which indicates that all three have a similar effect on glycolytic flux.
Furthermore, they decrease during the simulation and get close to the RC of even a fourth reaction. There are many of these transient changes in RCs during osmoadaptation. This indicates that the extent to which a reaction influences the net flux through a pathway can vary greatly depending on the external conditions and state of the cell. It might therefore be more sensible to speak of a rate limiting reaction or a group of rate limiting reactions under certain conditions.
Acknowledgments
Clemens Kühn is funded by the International Research Training Group 'Genomics and Systems Biology of Molecular Networks', supported by the German Research Foundation (DFG). Elzbieta Petelenz is a fellow in the EC-funded Marie Curie EST project 'Systems Biology' (514169). Jörg Schaber is supported by the European Commission (CELLCOMPUT (043310)). Work in the Hohmann Lab is supported by Vetenskapsrådet and in both the Hohmann and the Klipp labs by the European Commission (QUASI (030710) and UNICELLSYS (201142)).
References

[1] Bedri, A., Kretschmer, M., Schellenberger, W., and Hofmann, E., Kinetics of 6-phosphofructo-2-kinase from Saccharomyces cerevisiae: inhibition of the enzyme by ATP, Biomedica Biochimica Acta, 48(7):403-411, 1989.
[2] Canelas, A. B., van Gulik, W. M., and Heijnen, J. J., Determination of the cytosolic free NAD/NADH ratio in Saccharomyces cerevisiae under steady-state and highly dynamic conditions, Biotechnol Bioeng, Jan 2008.
[3] Cronwright, G. R., Rohwer, J. M., and Prior, B. A., Metabolic control analysis of glycerol synthesis in Saccharomyces cerevisiae, Appl Environ Microbiol, 68(9):4448-4456, Sep 2002.
[4] Dihazi, H., Kessler, R., and Eschrich, K., Phosphorylation and inactivation of yeast 6-phosphofructo-2-kinase contribute to the regulation of glycolysis under hypotonic stress, Biochemistry, 40(48):14669-14678, Dec 2001.
[5] Dihazi, H., Kessler, R., and Eschrich, K., High osmolarity glycerol (HOG) pathway-induced phosphorylation and activation of 6-phosphofructo-2-kinase are essential for glycerol accumulation and yeast cell proliferation under hyperosmotic stress, J Biol Chem, 279(23):23961-23968, Jun 2004.
[6] Gennemark, P., Nordlander, B., Hohmann, S., and Wedelin, D., A simple mathematical model of adaptation to high osmolarity in yeast, In Silico Biol, 6(3):193-214, 2006.
[7] Hohmann, S., Osmotic stress signaling and osmoadaptation in yeasts, Microbiol Mol Biol Rev, 66(2):300-372, Jun 2002.
[8] Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., and Kummer, U., Copasi - a complex pathway simulator, Bioinformatics, 22(24):3067-3074, Dec 2006.
[9] Hynne, F., Danø, S., and Sørensen, P. G., Full-scale model of glycolysis in Saccharomyces cerevisiae, Biophys Chem, 94(1-2):121-163, Dec 2001.
[10] Ingalls, B. P. and Sauro, H. M., Sensitivity analysis of stoichiometric networks: an extension of metabolic control analysis to non-steady state trajectories, J Theor Biol, 222(1):23-36, May 2003.
[11] Klipp, E., Nordlander, B., Krüger, R., Gennemark, P., and Hohmann, S., Integrative model of the response of yeast to osmotic shock, Nat Biotechnol, 23(8):975-982, Aug 2005.
[12] Kretschmer, M. and Fraenkel, D. G., Yeast 6-phosphofructo-2-kinase: sequence and mutant, Biochemistry, 30(44):10663-10672, Nov 1991.
[13] Kühn, C., Kühn, A., Poustka, A. J., and Klipp, E., Modeling development: spikes of the sea urchin, Genome Inform, 18:75-84, 2007.
[14] Mettetal, J. T., Muzzey, D., Gomez-Uribe, C., and van Oudenaarden, A., The frequency dependence of osmo-adaptation in Saccharomyces cerevisiae, Science, 319(5862):482-484, Jan 2008.
[15] Ölz, R., Larsson, K., Adler, L., and Gustafsson, L., Energy flux and osmoregulation of Saccharomyces cerevisiae grown in chemostats under NaCl stress, J Bacteriol, 175:2205-2213, Apr 1993.
[16] Paravicini, G. and Kretschmer, M., The yeast FBP26 gene codes for a fructose-2,6-bisphosphatase, Biochemistry, 31(31):7126-7133, Aug 1992.
[17] Wolfram Research, Mathematica, Version 6.0, Wolfram Research, Champaign, IL, 2007.
[18] Rizzi, M., Baltes, M., Theobald, U., and Reuss, M., In vivo analysis of metabolic dynamics in Saccharomyces cerevisiae: II. Mathematical model, Biotechnol Bioeng, 55:592-608, 1997.
[19] Schaber, J. and Klipp, E., Short-term volume and turgor regulation in yeast, Essays in Biochemistry, 2008.
[20] Teusink, B., Passarge, J., Reijenga, C. A., Esgalhado, E., van der Weijden, C. C., Schepper, M., Walsh, M. C., Bakker, B. M., van Dam, K., Westerhoff, H. V., and Snoep, J. L., Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry, Eur J Biochem, 267(17):5313-5329, Sep 2000.
[21] Theobald, U., Mailinger, W., Baltes, M., Rizzi, M., and Reuss, M., In vivo analysis of metabolic dynamics in Saccharomyces cerevisiae: I. Experimental observations, Biotechnol Bioeng, 55(2):305-316, Jul 1997.
[22] Widmann, C., Gibson, S., Jarpe, M. B., and Johnson, G. L., Mitogen-activated protein kinase: conservation of a three-kinase module from yeast to human, Physiol Rev, 79(1):143-180, Jan 1999.
[23] Zi, Z. and Klipp, E., SBML-PET: a systems biology markup language-based parameter estimation tool, Bioinformatics, 22(21):2704-2705, Nov 2006.
COMPARING FLUX BALANCE ANALYSIS TO NETWORK EXPANSION: PRODUCIBILITY, SUSTAINABILITY AND THE SCOPE OF COMPOUNDS

KAI KRUSE¹
kai.kruse@student.hu-berlin.de

OLIVER EBENHÖH²,³
ebenhoeh@mpimp-golm.mpg.de

¹ Theoretical Biophysics, Institute for Biology, Humboldt University, Berlin, Germany
² Max-Planck-Institute for Molecular Plant Physiology, Potsdam-Golm, Germany
³ Institute for Biochemistry and Biology, University of Potsdam, Germany
The producibility of metabolites from available resources is investigated systematically using flux balance analysis (FBA) and network expansion. Calculations are performed for the genome-scale metabolic networks of Escherichia coli and Methanosarcina barkeri. Strict biological interpretation of the results obtained with FBA leads to the concept of sustainability, which reduces the set of producible metabolites by assuming a growing and dividing cell. A systematic comparison showed that applying network expansion in many cases results in exactly the set of all sustainable metabolites. The purely heuristic approach of allowing for certain cofactors to facilitate reactions during the process of network expansion dramatically helps to improve agreement of the results from the two different approaches. In conclusion, we state that network expansion, due to its enormous advantages in computational speed, is a valuable alternative to determining producible metabolites with FBA.
Keywords: Flux balance analysis (FBA); network expansion; producibility; sustainability; scope; metabolism
1. Introduction
Considering the enormous speed at which sequencing projects are currently proceeding (at present, 769 fully sequenced genomes have been published and 2781 more are ongoing [11]), we now have access to the complete inventories of genes for a large number of organisms across all domains of life. In principle, complete genome-scale metabolic networks can be inferred from this information by sequence comparison to genes or proteins which have been previously characterized. Genome-scale models are extremely useful for a wide variety of theoretical and computational analyses. Once the complete set of reactions is known, the powerful framework of flux balance analysis [10, 14] can be applied to predict optimal flux distributions maximizing the production of biomass or other, potentially exploitable, metabolites [8]. Further, it is possible to assess the effect of gene knock-outs, and comparison of the computational predictions to experimentally measured fluxes can point to erroneous or incomplete structures of the genome-scale network model [3].
Another strategy to analyze genome-scale networks is given by the method of network expansion [2, 7], which particularly aims at relating structural to functional features of large-scale metabolic networks. In this approach, networks of increasing size are constructed starting from an initial set of substrates (the seed) by stepwise adding all those reactions from the analyzed metabolic network which use as substrates only compounds provided by the seed or produced by reactions incorporated in earlier steps. The set of metabolites contained in the final network is called the scope of the seed and comprises all those metabolites which the network is capable of producing when only the seed compounds are initially available. It can, however, be argued that a scope does not realistically describe the true biosynthetic capacity of an organism, because the idealized situation in which exclusively some external compounds and no further internal metabolites are present never occurs under normal circumstances. To account for this fact, we have introduced for several practical applications [1, 6] a modified version of the expansion process which takes into account that it is unrealistic to assume that some key metabolites, the so-called cofactors, have to be synthesized de novo from the available nutrients. This purely heuristic approach is based on common biological knowledge and the fact that some cofactor pairs participate in a considerable number of biochemical reactions. For example, a highly frequent pattern is the transfer of a phosphate group from ATP to some acceptor molecule resulting in the formation of ADP. Similarly, NAD+ may accept electron pairs to yield the reduced form NADH, thus mediating redox reactions. In this work, we systematically investigate the producibility of metabolites from available resources and compare the results from the two approaches.
FBA provides a strict mathematical framework and can be used to assess whether a given metabolic network is capable of carrying a flux such that a particular metabolite can be produced. In contrast, the method of network expansion, especially in its modified form to allow for cofactor functionality, relies on heuristics to assess which metabolites may be synthesized from a given combination of nutrients. By comparing corresponding results obtained by the two approaches, we show that in many cases the resulting sets coincide, which has practical consequences for an efficient, large-scale functional analysis of metabolic networks.
2. Theory
A metabolic network is commonly described by the stoichiometric matrix $N$. A matrix with $m$ rows and $r$ columns describes a network in which $r$ reactions connect $m$ metabolites. An entry $n_{ij}$ denotes the stoichiometric coefficient of metabolite $i$ in reaction $j$, which is negative if the metabolite is consumed by the reaction, positive if it is produced and zero if there is no net production or consumption. The metabolic state of a cell can be described by a vector $v \in \mathbb{R}^r$ containing the rates of all $r$ biochemical reactions. These rates describe the velocities with which chemical conversions are performed and determine the temporal change of the concentrations by

$$\frac{dC}{dt} = N v, \tag{1}$$
where $C \in \mathbb{R}^m$ is the vector of the concentrations. Under physiological conditions, the fluxes $v_j$ are subject to certain limitations. Due to thermodynamic constraints, some reactions may only proceed in one direction, resulting in the constraint $v_j \ge 0$. An upper bound for the reaction rates may result from a limited amount of free enzyme. In the following, we will only consider the former type of constraint since the latter will not play a role in our considerations of principle. For practical reasons, we will treat reversible reactions as two irreversible reactions proceeding in opposite directions. This is achieved by introducing an additional column to the matrix $N$ in which the signs of all stoichiometric coefficients are reversed. This leads to an increased number of reactions which all obey the same sign constraint $v_j \ge 0$. Intuitively, a metabolite $k$ is producible from a combination of nutrient metabolites if there exists a flux distribution such that only the nutrient metabolites are consumed, the metabolite $k$ is produced and all other metabolites are at least not consumed. In this consideration, it is assumed that "side-products", which are additionally produced, pose no problem to the organism and can be degraded or exported by other means. A mathematical description of producibility in the context of flux balance analysis has been given in [9]. If the set of available nutrients, or the seed, is denoted by $U \subset \{1, \dots, m\}$, a metabolite $k$ is producible if there exists a flux vector $v = (v_j)$ with $v_j \ge 0$ such that

$$[Nv]_k > 0 \quad \text{and} \quad [Nv]_i \ge 0 \ \text{ for } i \notin U. \tag{2}$$
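Because any flux vector satisfying (2) can be rescaled by a positive factor, the condition can be checked with one linear program per metabolite $k$. One possible formulation is sketched here; the upper bound of 1 on every flux is an assumed normalisation that merely keeps the program bounded and is not part of the definition taken from [9]:

```latex
\begin{aligned}
z_k^{*} \;=\; \max_{v} \quad & [Nv]_k \\
\text{subject to} \quad & [Nv]_i \ge 0 \quad \text{for all } i \notin U, \\
                        & 0 \le v_j \le 1 \quad \text{for all } j .
\end{aligned}
```

Under this sketch, metabolite $k$ satisfies (2) exactly when $z_k^{*} > 0$; the choice $v = 0$ is always feasible, so $z_k^{*} \ge 0$ in every case.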
For the components $i \in U$, there is no restriction since these compounds may be imported from the environment. Condition (2) can be tested by phrasing it as a linear programming problem. Following the terminology introduced in [9], we call metabolites fulfilling this condition producible from the nutrients $U$. The entirety of all metabolites that are producible from $U$ is denoted $P(U)$. By definition, a network may carry a stationary flux leading to an increase in concentration of the producible metabolites while only the nutrients are consumed. However, this interpretation holds only as long as it is assumed that the cell is in a stationary, non-growing state. If a growing and reproducing organism is considered, stricter conditions for the producibility of metabolites must be imposed. In particular, all metabolites not contained in the set $P(U)$ are not producible and therefore their amount cannot continuously increase. If we assume a persistent increase in cellular volume, the concentrations of such metabolites necessarily decrease and eventually reach zero; as a consequence, they are not available as substrates for other reactions. We take these considerations into account by repeating the calculation of all producible metabolites with the additional constraint that all those reactions are forbidden which use as substrate any metabolite that is not contained in $P(U)$. More precisely, all those metabolites are identified for which flux vectors $v = (v_j)$ exist with $v_j \ge 0$ and $v_l = 0$ if reaction $l$ uses a substrate not contained in $P(U)$. A reaction $l$ fulfils this condition if the set $\{ i \notin P(U) \mid n_{il} < 0 \}$ is non-empty. If this additional restriction results in a reduction of the set of producible compounds, the calculation is repeated with even stricter conditions. This process is iterated until the set of producible metabolites remains unchanged. The final set of metabolites is denoted by $S(U)$ and a metabolite within this set is termed sustainable since it has the property that it can be produced from the nutrients $U$ even if the cell is constantly growing. Sustainable metabolites are determined by repeatedly decreasing the set of producible metabolites until only those remain which can be produced from available nutrients without requiring the presence of any non-sustainable intermediates.

In contrast, in the method of network expansion the scope of the seed $U$ is determined by stepwise expanding a set of metabolites. Starting with the set $U$, all those reactions are identified that use exclusively substrates contained in the set, and their products are included in the expanding set. Expansion stops if no further products are included and the final set is called the scope of the nutrients $U$, denoted $\Sigma(U)$. From the construction of the scope it is evident that every metabolite contained in the scope is also sustainable in the above defined sense. Therefore,

$$\Sigma(U) \subset S(U) \subset P(U). \tag{3}$$
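The expansion process described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation; the three-reaction network and its metabolite names are hypothetical and serve only to show how the seed determines the scope.

```python
def scope(reactions, seed):
    """Return the scope of `seed` by network expansion.

    `reactions` is a list of (substrates, products) pairs of sets.
    """
    available = set(seed)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            # A reaction is incorporated once all its substrates are available.
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


# Hypothetical toy network; names are illustrative only.
toy = [
    (frozenset({"A"}), frozenset({"C"})),        # A -> C
    (frozenset({"C", "D"}), frozenset({"E"})),   # C + D -> E
    (frozenset({"E"}), frozenset({"F", "D"})),   # E -> F + D
]

print(sorted(scope(toy, {"A"})))        # ['A', 'C']
print(sorted(scope(toy, {"A", "D"})))   # ['A', 'C', 'D', 'E', 'F']
```

In this toy case D is regenerated by the last reaction, yet it still has to be in the seed before the expansion can pass the second reaction.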
The concepts of producibility, sustainability and scope can be viewed as different definitions of which metabolites can be synthesized by a given network, with increasingly strict conditions.

Fig. 1. Toy networks illustrating the differences between the presented concepts of producibility. A) A simple network producing biomass B from nutrient A in two consecutive reactions. Intermediates X and Y are essential for the production of biomass but neither is producible. This example demonstrates that a compound may be producible but not sustainable. B) A simple network producing biomass B from nutrient A in three consecutive reactions. All intermediates X, Y and Z and biomass B are sustainable. Since in an expansion starting from A, metabolite X is not available, the scope of A contains only A itself, $\Sigma(A) = \{A\}$. This example demonstrates that a compound may be sustainable but not included in the scope of the nutrients.
The difference between producible and sustainable metabolites is characterized in the toy network depicted in Fig. 1A. Here, clearly a steady state flux distribution exists such that metabolite B may be produced while only consuming nutrient A without a net consumption of X or Y. However, since the sum of X and Y is strictly balanced, it is not possible to produce either of these intermediates while consuming only nutrient A. Therefore, these metabolites are not producible. Imposing the constraint that no reactions may proceed which use one of those compounds as
substrates no longer allows for a production of B. Therefore, B is producible but not sustainable. In the network depicted in Fig. 1B, metabolite B is sustainable from the nutrient A. The difference from the network in Fig. 1A is that from each consumed molecule A one excess molecule X may be produced and therefore the concentrations of all intermediates X, Y and Z may increase simultaneously. Consequently, all intermediates and the product B are sustainable. Fig. 1B also demonstrates why metabolites which are sustainable on U are not necessarily contained in the scope of U. Clearly, since intermediate X is required along the synthesis route to B, the expansion stops with A. In fact, the problem of the biological interpretation of a scope results from the fact that network expansion cannot account for such cyclic dependencies in which the presence of a metabolite is necessary for its own production. A key metabolite exhibiting this kind of dependency is ATP: the early steps in the synthesis of adenine nucleotides involve thermodynamically unfavourable reactions that require the consumption of ATP. Despite the completely different approaches underlying the definitions of sustainability and scopes, we have found that they are often identical.
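The three concepts can be checked numerically on the two toy networks. The sketch below is an illustration under stated assumptions, not the authors' code: the stoichiometries of `fig1a` and `fig1b` are guessed from the verbal description of Fig. 1 (the drawing itself is not reproduced here), nutrients are counted as members of every set so that inclusion (3) holds for the seed, each flux is capped at 1 as a normalisation, and `lp_max` is a small exact simplex written only to keep the example free of external dependencies.

```python
from fractions import Fraction


def lp_max(c, M, b):
    """Maximise c.x subject to M x <= b, x >= 0 (tableau simplex, Bland's rule).

    Assumes b >= 0, so x = 0 is feasible, and a bounded optimum.
    """
    p, n = len(M), len(c)
    T = [[Fraction(a) for a in M[i]]
         + [Fraction(int(i == k)) for k in range(p)]
         + [Fraction(b[i])] for i in range(p)]
    obj = [Fraction(-cj) for cj in c] + [Fraction(0)] * (p + 1)
    basis = [n + i for i in range(p)]
    while True:
        enter = next((j for j in range(n + p) if obj[j] < 0), None)
        if enter is None:
            return obj[-1]          # optimal objective value
        rows = [i for i in range(p) if T[i][enter] > 0]
        if not rows:
            raise ValueError("unbounded linear program")
        leave = min(rows, key=lambda i: (T[i][-1] / T[i][enter], basis[i]))
        piv = T[leave][enter]
        T[leave] = [a / piv for a in T[leave]]
        for i in range(p):
            if i != leave and T[i][enter] != 0:
                f = T[i][enter]
                T[i] = [a - f * r for a, r in zip(T[i], T[leave])]
        f = obj[enter]
        obj = [a - f * r for a, r in zip(obj, T[leave])]
        basis[leave] = enter


def producible(reactions, metabolites, nutrients):
    """Nutrients plus every metabolite k that satisfies condition (2)."""
    others = [m for m in metabolites if m not in nutrients]
    r = len(reactions)
    # Rows -[Nv]_i <= 0 for i not in U, followed by the caps v_j <= 1.
    M = [[-rxn.get(m, 0) for rxn in reactions] for m in others]
    M += [[int(i == j) for j in range(r)] for i in range(r)]
    b = [0] * len(others) + [1] * r
    found = set(nutrients)
    for k in others:
        if lp_max([rxn.get(k, 0) for rxn in reactions], M, b) > 0:
            found.add(k)
    return found


def sustainable(reactions, metabolites, nutrients):
    """Iteratively forbid reactions with a substrate outside the current set."""
    allowed = list(reactions)
    current = producible(allowed, metabolites, nutrients)
    while True:
        allowed = [rxn for rxn in allowed
                   if all(m in current for m, n in rxn.items() if n < 0)]
        restricted = producible(allowed, metabolites, nutrients)
        if restricted == current:
            return current
        current = restricted


def scope(reactions, seed):
    """Network expansion on the same dictionary representation."""
    found = set(seed)
    while True:
        new = {m for rxn in reactions
               if all(s in found for s, n in rxn.items() if n < 0)
               for m, n in rxn.items() if n > 0} - found
        if not new:
            return found
        found |= new


# Assumed stoichiometries (metabolite -> coefficient, negative = consumed).
fig1a = [{"A": -1, "X": -1, "Y": 1}, {"Y": -1, "B": 1, "X": 1}]
fig1b = [{"A": -1, "X": -1, "Y": 1}, {"Y": -1, "Z": 1},
         {"Z": -1, "B": 1, "X": 2}]
mets_a, mets_b = ["A", "X", "Y", "B"], ["A", "X", "Y", "Z", "B"]

print(sorted(producible(fig1a, mets_a, {"A"})))    # ['A', 'B']
print(sorted(sustainable(fig1a, mets_a, {"A"})))   # ['A']
print(sorted(sustainable(fig1b, mets_b, {"A"})))   # ['A', 'B', 'X', 'Y', 'Z']
print(sorted(scope(fig1b, {"A"})))                 # ['A']
```

With these assumed stoichiometries the output reproduces both statements of the text: in the first network B is producible but not sustainable, and in the second network every metabolite is sustainable although the scope of A is only A itself.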
3. Results
We systematically compare sets of producible and sustainable metabolites with the corresponding scopes for the two genome-scale metabolic networks of Escherichia coli [13] and Methanosarcina barkeri [4]. These well-characterized organisms have been fully sequenced and their genome-scale networks have been manually curated and are therefore considered as representative examples. The network of E. coli contains 932 reactions connecting 761 metabolites. For 143 metabolites exchange fluxes are defined, meaning that they may pass through the surface of the cell and are available as nutrients if abundant in the environment. If all these metabolites are assumed to be present, the set of sustainable metabolites amounts to 628. In contrast, the scope of the 143 external metabolites results in a set of only 312 compounds. If, however, it is assumed that ATP is available as a cofactor, the scope of the nutrients contains exactly all 628 sustainable metabolites. Similarly, for M. barkeri, a network with 619 reactions connecting 628 metabolites of which 70 are external, there are 475 metabolites which are sustainable on the set of all 70 nutrients. The scope of the nutrients comprises only 138 metabolites. In this case, the addition of the cofactor functionalities of ATP, NADH and NADPH yields a scope which is identical to the set of sustainable metabolites. Inspired by these findings, we perform a systematic comparison by first considering the idealized situation in which exactly one metabolite is initially available. Further, we assume that water is largely abundant and therefore include water in every seed without explicitly mentioning it. In Fig. 2 the number of metabolites in the scope of a single metabolite (and water) is compared to the number of metabolites which are producible from this metabolite (and water) alone according
to condition (2). In approximately 44% of all cases, the scope is identical to the set of producible metabolites. However, identity is only observed for small sets, with the majority being those cases in which the scope is identical to the seed. In most cases the set of producible metabolites is considerably larger than the corresponding scope. This is not surprising, considering that the criteria for obtaining producible metabolites are weaker than for metabolites in the scope. Interestingly, the size distributions of both sets are clearly structured. For the scopes, this property has been extensively investigated in [2, 7] and the results have been used to derive a hierarchical ordering of metabolism [5, 12]. Apparently, there exists a similar ordering of sets of producible metabolites.

Fig. 2. Comparison between scope size and numbers of producible metabolites for (a) E. coli and (b) M. barkeri. Each dot represents one metabolite. Dots on the straight line represent metabolites for which the scope size equals the number of producible metabolites.

Fig. 3 shows the direct comparison of scope sizes and numbers of sustainable metabolites. In the E. coli network, these sets are identical in 97% of all cases and in M. barkeri identity is observed in almost 99%. Those metabolites for which the corresponding sets differ are labelled by the abbreviations used in [4, 13]. Remarkably, many metabolites in the E. coli network exhibiting differences between the set of sustainable metabolites and the scope are related to important cofactors. In particular, many adenine nucleotide phosphates and nicotinamide dinucleotide phosphates belong to this class. In both networks, many sugar phosphates also show a considerable difference in the corresponding sets. Because cofactors apparently take on a role as key metabolites in both networks, a detailed investigation of their influence on scope size and contents is performed.
We specifically consider the following four cofactor functionalities: 1) transfer of a phosphate group from ATP to an acceptor, yielding ADP, 2) simultaneous hydrolysis of two phosphate groups from ATP yielding AMP, 3) reduction of NAD+ to yield NADH, thereby oxidizing another compound, 4) the analogous process but involving NADP+/NADPH. Evidently, the introduction of a cofactor functionality can only increase the scope. We have systematically compared the scopes resulting for all 16 combinations of cofactor functionalities with the sets of sustainable metabolites.
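A minimal sketch of how a cofactor functionality might enter the expansion is given here. The rule used is an assumption made for illustration, since the text does not spell out the mechanics: a reaction that consumes the first member of an active pair and produces the second is treated as if neither appeared in it. The pair labels in `COFACTORS`, the function name and the two-reaction example are likewise illustrative.

```python
from itertools import combinations

# The four functionalities named in the text, as (consumed, produced) pairs.
COFACTORS = {
    "ATP/ADP": ("ATP", "ADP"),
    "ATP/AMP": ("ATP", "AMP"),
    "NAD+/NADH": ("NAD+", "NADH"),
    "NADP+/NADPH": ("NADP+", "NADPH"),
}


def scope_with_cofactors(reactions, seed, active=()):
    """Network expansion in which active cofactor pairs are ignored in reactions."""
    pairs = [COFACTORS[name] for name in active]
    found = set(seed)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            subs, prods = set(substrates), set(products)
            for consumed, produced in pairs:
                # Assumed rule: the pair mediates the reaction for free.
                if consumed in subs and produced in prods:
                    subs.discard(consumed)
                    prods.discard(produced)
            if subs <= found and not prods <= found:
                found |= prods
                changed = True
    return found


# All 2**4 = 16 subsets of the four functionalities.
all_combinations = [c for k in range(len(COFACTORS) + 1)
                    for c in combinations(COFACTORS, k)]

# Illustrative two-step example: phosphorylation followed by isomerisation.
toy = [
    ({"glc", "ATP"}, {"g6p", "ADP"}),
    ({"g6p"}, {"f6p"}),
]

print(sorted(scope_with_cofactors(toy, {"glc"})))               # ['glc']
print(sorted(scope_with_cofactors(toy, {"glc"}, ["ATP/ADP"])))  # ['f6p', 'g6p', 'glc']
print(len(all_combinations))                                    # 16
```

Because an active pair only removes requirements from a reaction, every combination yields a scope at least as large as the plain one, which is the monotonicity noted in the text.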
Fig. 3. Comparison between scope size and number of sustainable metabolites for (a) E. coli and (b) M. barkeri. Metabolites are represented as dots. Metabolites for which the scope size is not identical to the number of sustainable metabolites (located below the diagonal) are labeled. For clarity, two metabolites were omitted in figure (a): acg5p (sustainable metabolites 279, scope 3) and glu5p (264, 3).
In Fig. 4, the results for the E. coli network are shown for four cofactor combinations: ATP/ADP; ATP/ADP and ATP/AMP; NADH and NADPH; and all cofactors. Interestingly, introduction of the redox cofactors NADH and NADPH leads to a stronger increase in scope size than the introduction of the phosphate transfer cofactors ATP/ADP and ATP/AMP. The latter case, in which both ATP-related cofactor functionalities are introduced, is of particular importance. Here, the scopes of many central metabolites including NAD+, NADP+ and deoxyadenine phosphates are identical to the corresponding sets of sustainable metabolites. There exist, however, other metabolites whose scope is always considerably smaller than the set of sustainable metabolites, which holds true for both investigated networks. A thorough investigation of the reactions preventing the expansion of the scope led to the identification of metabolites whose addition directly to the seed resulted in identity of scope and sustainable metabolites. In Table 1, the most predominant examples are presented.

Table 1. Selection of metabolites that have to be added to the seed in order to obtain the same result for scope and sustainable metabolites.

network       addition to seed          affected metabolites
E. coli       Proton (H+)               ps, 3dglnp, orot5p
E. coli       ATP                       dnad, nadh, nadph
E. coli       D-Ribulose 5-phosphate    e4p, s7p
M. barkeri    D-Ribulose 5-phosphate    man1p, man6p, g1p, f6p, g6p, e4p
both          Proton (H+)               pran, 2cpr5p

Fig. 4. Comparison of scope size for different cofactor functionalities with the numbers of sustainable metabolites for the E. coli network: (a) ATP/ADP; (b) ATP/ADP, ATP/AMP; (c) NADH, NADPH; (d) all cofactors. In (b) metabolites have been labeled whose scope rose to the size of the sustainable metabolites by adding both cofactor functionalities of ATP.

To study whether the finding that the inclusion of cofactor functionality improves the agreement of scopes with sets of sustainable metabolites is of a general nature, we performed a Monte Carlo simulation. For this, we randomly generated 1000 seeds with sizes varying between 10 and 100. For both networks, the scopes for all possible combinations of cofactors as well as the corresponding sets of sustainable metabolites have been determined. In Fig. 5 the degree of agreement of the sets is plotted versus the seed size. Interestingly, the behaviour differs between the two networks. Whereas in both cases the agreement increases with increasing seed size when cofactors are included, this is not true for scopes without cofactors. In the case of E. coli, the best agreement is obtained by considering all cofactor functionalities simultaneously (for seed sizes larger than 10). In contrast, in M. barkeri inclusion of both ATP-related cofactor functionalities for large seed sizes (> 40) yields the highest degree of identity.

Fig. 5. Degree of identity of scopes and sustainable metabolites for both investigated networks, (a) E. coli and (b) M. barkeri, as a function of average seed size. The black line represents network expansion without cofactors, the green line with the two ATP-related cofactors and the red line considering all four cofactor functionalities.
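The sampling scheme of the Monte Carlo comparison can be outlined as follows. This is a sketch under assumptions: the "degree of identity" is read here as the fraction of random seeds of a given size for which scope and sustainable set are exactly equal, which the text does not state explicitly, and the two set-valued functions in the demo are trivial hypothetical stand-ins for a real expansion and a real sustainability calculation.

```python
import random


def agreement_by_seed_size(metabolites, scope_fn, sustainable_fn,
                           n_seeds=1000, sizes=(10, 100), rng=None):
    """Fraction of random seeds, per seed size, with scope == sustainable set."""
    rng = rng or random.Random(0)
    pool = sorted(metabolites)
    hits, totals = {}, {}
    for _ in range(n_seeds):
        size = rng.randint(*sizes)                 # seed size, bounds inclusive
        seed = frozenset(rng.sample(pool, size))   # random nutrient combination
        totals[size] = totals.get(size, 0) + 1
        hits[size] = hits.get(size, 0) + (scope_fn(seed) == sustainable_fn(seed))
    return {size: hits[size] / totals[size] for size in sorted(totals)}


# Hypothetical stand-ins: the "sustainable" set always contains m0, so the two
# sets agree only when m0 happens to be part of the seed.
pool = [f"m{i}" for i in range(120)]
result = agreement_by_seed_size(
    pool,
    scope_fn=lambda seed: set(seed),
    sustainable_fn=lambda seed: set(seed) | {"m0"},
)
small = [v for s, v in result.items() if s <= 30]
large = [v for s, v in result.items() if s >= 80]
print(sum(small) / len(small), sum(large) / len(large))
```

In this artificial setting the agreement grows with seed size simply because larger seeds are more likely to contain the one metabolite that expansion alone cannot reach, which mirrors the trend described for the real networks only qualitatively.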
4. Discussion
We have introduced several mathematical descriptions defining the producibility of metabolites from available nutrients. Simple producibility is given when a steady state flux through the metabolic network may exist such that the concentration of a metabolite increases while exclusively consuming the nutrients. By considering a cell under persistent growth, we arrive at the concept of sustainability, which defines metabolites whose concentrations may be increased even if all intermediates are simultaneously diluted. The method of network expansion provides the concept of a scope of nutrients, describing what a network may produce if exclusively the nutrients are present and all intermediates possess zero concentration. We have systematically compared sets of producible and sustainable metabolites with the scopes obtained from single initial compounds and found that the scope is often identical to the set of sustainable compounds. We could further show that including cofactor functionalities, which are derived from heuristic arguments, can significantly increase the number of identical cases. More importantly, Monte Carlo simulations for larger sets of nutrients showed a tendency towards greater accordance of scope and sustainability with an increasing number of nutrients. For some metabolites, the introduction of cofactor functionality was not sufficient to produce a scope identical to the set of sustainable metabolites. It is to be expected that this also holds true for combinations of seed compounds. In some
cases, the addition of protons to the nutrients was sufficient to enlarge the scope to the sustainable metabolites. Since protons in most cases do not influence the size of a scope, it seems reasonable to generally include them in the seed. This is in particular plausible since we always considered water to be abundant and in aqueous solutions protons are always present. An interesting observation was made for some metabolites occurring in the pentose phosphate pathway. Erythrose-4-phosphate (E4P), for example, exhibits a very small scope but in both networks the corresponding sets of sustainable metabolites are significantly larger. This observation can be explained by considering the structure of the pentose phosphate cycle, which contains many bimolecular reactions. A subset can easily be assembled allowing for a stationary flux producing, for example, xylulose-5-phosphate and glyceraldehyde-3-phosphate from two molecules of E4P. However, since E4P never appears as a single substrate, it is evident that the scope of E4P only contains E4P itself. This fact has practical consequences for a whole class of other organism-specific networks. Most photosynthetic organisms, such as plants or green algae, can fix CO2 by means of the Calvin cycle, which bears high similarities with the pentose phosphate cycle. To realistically assess the biosynthetic capabilities from nutrient combinations including CO2, other compounds of the Calvin cycle, such as ribulose-1,5-bisphosphate, should also be added. A thorough investigation of genome-scale networks of photoautotrophic organisms is still outstanding. Although the concept of sustainability is mathematically more rigorous, it has the drawback that it is computationally very intensive. For some calculations of sets of sustainable metabolites, several hundred linear programming problems have to be solved.
In contrast, the network expansion algorithm is extremely simple and fast and can easily be applied millions of times on a normal personal computer, rendering it suitable for large-scale applications, for example to investigate thousands of nutrient combinations for hundreds of networks. Considering that the agreement of scopes with sets of sustainable metabolites is in most cases extremely accurate, we conclude that the enormous gain in computational speed justifies the inaccuracies that the network expansion method unavoidably displays due to the introduction of heuristic cofactor functionalities.

References
[1] Ebenhöh, O., Handorf, T., and Kahn, D., Evolutionary changes of metabolic networks and their biosynthetic capacities. Syst Biol (Stevenage), 153(5):354-358, Sep 2006.
[2] Ebenhöh, O., Handorf, T., and Heinrich, R., Structural analysis of expanding metabolic networks. Genome Inform, 15(1):35-45, 2004.
[3] Edwards, J. S. and Palsson, B. O., Metabolic flux balance analysis and the in silico analysis of Escherichia coli K-12 gene deletions. BMC Bioinformatics, 1:1, 2000.
[4] Feist, A.M., Scholten, J.C.M., Palsson, B.O., Brockman, F.J., and Ideker, T., Modeling methanogenesis with a genome-scale metabolic reconstruction of Methanosarcina barkeri. Molecular Systems Biology, 2:2006.0004, 2006.
[5] Handorf, T., Ebenhöh, O., Kahn, D., and Heinrich, R., Hierarchy of metabolic compounds based on their synthesising capacity. Syst Biol (Stevenage), 153(5):359-363, Sep 2006.
[6] Handorf, T., Christian, N., Ebenhöh, O., and Kahn, D., An environmental perspective on metabolism. J Theor Biol, 252:530-537, Nov 2007.
[7] Handorf, T., Ebenhöh, O., and Heinrich, R., Expanding metabolic networks: scopes of compounds, robustness, and evolution. J Mol Evol, 61(4):498-512, Oct 2005.
[8] Ibarra, R.U., Edwards, J.S., and Palsson, B.O., Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth. Nature, 420(6912):186-189, Nov 2002.
[9] Imielinski, M., Belta, C., Rubin, H., and Halasz, A., Systematic analysis of conservation relations in Escherichia coli genome-scale metabolic network reveals novel growth media. Biophysical Journal, 90:2659-2672, 2006.
[10] Kauffman, K.J., Prakash, P., and Edwards, J.S., Advances in flux balance analysis. Curr Opin Biotechnol, 14(5):491-496, Oct 2003.
[11] Liolios, K., Mavromatis, K., Tavernarakis, N., and Kyrpides, N.C., The genomes on line database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res, 36(Database issue):D475-D479, Jan 2008.
[12] Matthäus, F., Salazar, C., and Ebenhöh, O., Biosynthetic potentials of metabolites and their hierarchical organization. PLoS Comput Biol, 4(4):e1000049, Apr 2008.
[13] Reed, J.L., Vo, T.D., Schilling, C.H., and Palsson, B.O., An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biology, 4(9):R54.1-R54.12, 2003.
[14] Schilling, C. H., Edwards, J. S., Letscher, D., and Palsson, B. O., Combining pathway analysis with flux balance analysis for the comprehensive study of metabolic systems. Biotechnol Bioeng, 71(4):286-306, 2000.
SEMI-SUPERVISED GRAPH PARTITIONING WITH DECISION TREES
TIMOTHY HANCOCK timhancock@kuicr.kyoto-u.ac.jp
HIROSHI MAMITSUKA mami@kuicr.kyoto-u.ac.jp
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
In this paper we investigate a new framework for graph partitioning using decision trees to search for sub-graphs within a graph adjacency matrix. Graph partitioning by a decision tree seeks to optimize a specified graph partitioning index such as ratio cut by recursively applying decision rules found within nodes of the graph. Key advantages of tree models for graph partitioning are that they provide a predictive framework for evaluating the quality of the solution, determining the number of sub-graphs and assessing overall variable importance. We evaluate the performance of tree based graph partitioning on a benchmark dataset for multiclass classification of tumor diagnosis based on gene expression. Three graph cut indices, ratio cut, normalized cut and network modularity, are compared and assessed in terms of their classification accuracy, power to estimate the optimal number of sub-graphs and ability to extract known important variables within the dataset.
Keywords: graph partitioning; decision trees; multiclass classification
1. Introduction
The recent interest of computational biologists in graph partitioning stems from the idea of a highly organised community structure within biological networks, such as metabolism [12]. These communities manifest themselves as sub-graphs within the larger network. Graph partitioning describes the set of algorithms that seek to identify these sub-graphs. Common solutions to the graph partitioning problem are recursive k-way partitioning methods such as METIS [7] and approximate methods such as spectral approaches [4]. These methods, however, only output the optimal partition and offer no clues as to which features determine each sub-graph. Therefore, after the optimal partition has been found, the sub-graphs must then be analyzed to see if they possess a specific biological function. This second step often proves to be more time consuming than initially finding the optimal partition. If a graph partitioning algorithm could also provide a list of important variables that are related to specific sub-graphs, then this feature would be of considerable use to computational biologists. In this paper we propose such an interpretable solution to graph partitioning through the construction of a decision tree. Graphs can be represented in many forms; however, the most common form is
with an adjacency matrix, $S$, which is an $N \times N$ symmetric matrix that summarizes the distances between the $N$ nodes of the graph (Figure 1). In Figure 1 a partition on a graph has the action of dividing the adjacency matrix into four sub-matrices, $S_L$, $S_R$, $S_C$ and $S_C^T$, where $S_L$, $S_R$ are the sub-graphs created by the partition and $S_C$ contains the edges that connect them.

Fig. 1. Diagrammatic representation of graph partitioning. Panels: Graph; Adjacency Matrix $S$; Partition of $S$ into the blocks $S_L$, $S_C$, $S_C^T$ and $S_R$.

Graph partitioning defines the quality of
a specified partition through a graph cut index, such as ratio cut [4], normalized cut [lOJ and more recently network modularity [12J. However the search for the optimal partition using these indices is an NP hard problem because optimality requires a search through all possible permutations of the nodes of a graph. Tree based models present a framework for predicting a response, y, by a hierarchy of decision rules found within a predictor dataset, X [2]. These decision rules are simple binary inequalities such as all observations that satisfy x :::; 0.71 go into the left down the tree otherwise they follow the right path. Tree models provide a predictive framework to estimate the optimal size of the tree and are able to measure the importance of each variable to the construction of the tree. These powerful features have led to many extensions that enable them to be used in more situations other than just prediction such as a for clustering and for feature selection [5, 13]. In this work we investigate the potential of decision trees to solve the graph partitioning problem. The construction of a tree for graph partitioning requires a greedy search over all binary decision rules within an external (predictor) set of variables X that can partition the adjacency matrix. The restricts the search space of possible partitions allowing the problem to be solved within a realistic computational time frame. In this paper we investigate the feasibility of using decision trees to solve the graph partitioning problem by using graph cut criteria as the homogeneity measure required to construct a tree. We compare the performance of ratio cut, normalized cut and normalized modularity cut on a benchmark multi-classification dataset [8]. The indices are compared with respect to their classification accuracy, their ability to estimate the optimal number of sub-graphs and on their power as a feature selection technique. 
The benchmark dataset used for the comparison of the three indices is the Ramaswamy et al. (2001) microarray dataset [8] for multiclass classification. This is an ideal benchmark because it is a combination of multiple microarray datasets, each on a different tissue type, and contains experiments from both tumor and normal
T. Hancock & H. Mamitsuka
tissues. These two classification problems allow for analysis of this dataset to be performed at two different scales. Firstly, the large scale is to classify tumor vs normal irrespective of tissue type, and secondly the small scale is to classify tissue type irrespective of the tumor/normal classes. These two resolutions provide an ideal test bed to assess the robustness of the three graph cut indices. Furthermore, Ramaswamy et al. (2001) performed a feature selection routine, based on SVM, which ranked each gene according to its ability to classify each tissue type. The gene rankings computed by [8] provide a convenient benchmark for comparing the feature selection power of each graph cut index.
2. Methods
2.1. Preliminaries and Notations
For the purposes of decision tree construction it is only necessary to consider a binary partition of the form in Figure 1, where the sub-graphs for a given partition are defined by $S_L$ and $S_R$. Let $k$ denote either sub-graph, $L$ or $R$, and define $\sigma_k$ to be the sum of the edges within $k$, $\sigma_k = \sum_{(i,j) \in k} S_{ij}$, and define $n_k$ to be the number of nodes within sub-matrix $k$. Additionally, we define $\sigma_C$ to be the sum of the edges between the sub-graphs and $\sigma_T$ to be the sum of all edges within $S$.
2.2. Graph Cut Indices
The graph partition indices under evaluation in this paper are the ratio cut, normalized cut and the normalized modularity. These indices have the following forms:

Table 1. Graph cut indices.
Ratio Cut: $R_R(S) = \min \left\{ \sum_{k \in (L,R)} \frac{\sigma_C}{n_k} \right\}$
Normalized Cut: $R_N(S) = \min \left\{ \sum_{k \in (L,R)} \frac{\sigma_C}{\sigma_k + \sigma_C} \right\}$
Normalized Modularity Cut: $R_M(S) = \max \left\{ \sum_{k \in (L,R)} \frac{n}{n_k} \left( \frac{\sigma_k}{\sigma_T} - \left( \frac{\sigma_k + \sigma_C}{\sigma_T} \right)^2 \right) \right\}$

The indices in Table 1 vary in increasing order of complexity, starting with the ratio cut, which simply searches for the minimum number of edges between the sub-graphs normalized by their size in number of nodes $n_k$. Ratio cut, however, pays no attention to the density of edges within each sub-graph, only to the edges between them. Normalized cut considers the density of each sub-graph by normalizing by the total number of edges within and between each sub-graph. These indices are natural in the context of graph partitioning because both seek to minimize the edges between the sub-graphs.
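The three indices can be illustrated for a single binary partition with a minimal Python sketch. It is not the authors' implementation: the function name `graph_cut_indices`, the list-of-lists matrix format, and the counting convention (within-group sums taken over the full symmetric block, the cut taken over one off-diagonal block, and n as the total number of nodes) are assumptions made for illustration.

```python
def graph_cut_indices(S, left):
    """Return (ratio cut, normalized cut, normalized modularity) for one binary partition.

    S    : symmetric adjacency matrix as a list of lists of edge weights
    left : set of node indices placed in sub-graph L (all other nodes form R)
    """
    n = len(S)
    groups = {"L": [i for i in range(n) if i in left],
              "R": [i for i in range(n) if i not in left]}
    if not groups["L"] or not groups["R"]:
        raise ValueError("both sub-graphs must contain at least one node")

    # sigma_k: sum of S_ij over all ordered pairs (i, j) inside sub-graph k (assumed convention).
    sigma = {k: sum(S[i][j] for i in nodes for j in nodes)
             for k, nodes in groups.items()}
    # sigma_C: sum of the edge weights running between L and R (one off-diagonal block).
    sigma_c = sum(S[i][j] for i in groups["L"] for j in groups["R"])
    # sigma_T: sum over the whole matrix, so the cut block is counted twice.
    sigma_t = sigma["L"] + sigma["R"] + 2 * sigma_c

    ratio = sum(sigma_c / len(nodes) for nodes in groups.values())
    normalized = sum(sigma_c / (sigma[k] + sigma_c)
                     for k in groups if sigma[k] + sigma_c > 0)
    modularity = 0.0
    if sigma_t > 0:
        modularity = sum(
            (n / len(groups[k]))
            * (sigma[k] / sigma_t - ((sigma[k] + sigma_c) / sigma_t) ** 2)
            for k in groups)
    return ratio, normalized, modularity


# Two disconnected pairs of nodes: {0, 1} and {2, 3}.
S = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
good = graph_cut_indices(S, {0, 1})   # partition along the true structure
bad = graph_cut_indices(S, {0, 2})    # partition that cuts both edges
```

In this toy case the partition that follows the two pairs has no cut edges, so both cut indices are zero and the modularity is positive, whereas the partition that separates each pair has a large cut and a negative modularity.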
Semi-Supervised Graph Partitioning with Decision Trees
For modularity cut, however, the sub-graphs are assumed to be communities within the entire network. A community is defined as a sub-graph that has a non-random structure. A random structure is defined to be one where the probability of an edge between two nodes is independent of any specific sub-graph structure. The modularity of a sub-graph is defined to be how much the probability of each edge within a sub-graph differs from the probability of that edge existing by random chance. This definition of sub-graph structure appeals to biologists, and it has been shown that many biological networks, such as metabolism, are organized by a hierarchy of modularity [9].
2.3. Tree Based Graph Partitioning
Graph partitioning by decision trees starts with the adjacency matrix, $S$, as a single sub-graph in the root node of the tree. The tree is then built by recursively finding binary partitions on $S$ given a set of predictor variables $X$. To do this, for all terminal nodes (sub-graphs) of the current tree the next best graph partition is found using a greedy search over all possible decision rules. The tree is then grown at the node that has the optimal graph partition index. This process is shown diagrammatically in Figure 2. The first tree in Figure 2 results in four sub-matrices, where $S_1$ and $S_2$ are the sub-graphs and $S_{12}$ contains the edges between them. The larger sub-graphs are expected to be found first, as identifying them first is more likely to optimize the graph cut index. The second tree in Figure 2 is created by partitioning $S_1$ to create four new sub-matrices, where $S_3$ and $S_4$ are the sub-graphs and $S_{34}$ contains the edges between them. Note that by recursively partitioning the sub-graphs we are not changing the network structure but reordering the rows and columns of the adjacency matrix according to a specified graph cut criterion such that the identified sub-graphs lie on the block diagonal of $S$.
Fig. 2. Diagrammatic representation of tree graph partitioning (panels: First Cut, Second Cut).
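The greedy search described in Section 2.3 can be sketched as follows. This is a simplified illustration rather than the procedure used in the paper: it uses only the ratio cut, evaluates each candidate split on the nodes of the terminal node being split, and takes midpoints between observed values as thresholds. The names `ratio_cut`, `best_split` and `grow_tree` are hypothetical.

```python
def ratio_cut(S, left, right):
    """Ratio cut of one binary partition: cut weight normalized by each sub-graph size."""
    cut = sum(S[i][j] for i in left for j in right)
    return cut / len(left) + cut / len(right)


def best_split(S, X, nodes):
    """Greedy search over all rules 'X[i][var] <= threshold' for the sub-graph `nodes`.

    S     : symmetric adjacency matrix (list of lists)
    X     : predictor values, X[i][var] for observation (node) i
    nodes : indices of the observations in the terminal node being split
    Returns (index value, variable, threshold, left nodes, right nodes) or None.
    """
    best = None
    for var in range(len(X[0])):
        values = sorted({X[i][var] for i in nodes})
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [i for i in nodes if X[i][var] <= threshold]
            right = [i for i in nodes if X[i][var] > threshold]
            score = ratio_cut(S, left, right)
            if best is None or score < best[0]:
                best = (score, var, threshold, left, right)
    return best


def grow_tree(S, X, n_splits):
    """Grow the tree by repeatedly splitting the terminal node with the best index."""
    leaves = [list(range(len(S)))]
    for _ in range(n_splits):
        candidates = [(best_split(S, X, leaf), leaf) for leaf in leaves]
        candidates = [(split, leaf) for split, leaf in candidates if split is not None]
        if not candidates:
            break
        split, leaf = min(candidates, key=lambda c: c[0][0])
        leaves.remove(leaf)
        leaves.extend([split[3], split[4]])
    return leaves


# Nodes 0-1 and 2-3 are strongly linked; the first predictor separates the two pairs.
S = [[0, 5, 1, 0],
     [5, 0, 0, 1],
     [1, 0, 0, 5],
     [0, 1, 5, 0]]
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]]
split = best_split(S, X, [0, 1, 2, 3])
leaves = grow_tree(S, X, n_splits=1)
```

Because only rules on the predictors are considered, the number of candidate partitions grows with the number of variables and observed values rather than with the number of node subsets.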
3. Data and Methodology
The dataset under examination is the Ramaswamy et al. (2001) microarray dataset [8] for multiclass classification. This dataset is an agglomeration of microarrays spanning 16063 genes measured on 14 different tissue types, which are summarized in Table 2. We perform the same data preprocessing steps as described in [8], however,
for our analysis we further reduce the number of genes by taking the 1000 genes with the largest standard deviation. The work on this dataset by Ramaswamy et al. (2001) focused on identifying genes to classify only the tissue types from within the tumor observations; for our purposes we must also consider the normal tissues. It can be seen from Table 2, however, that not all tissue types are observed in the normal observations. The absence of class assignments within the dataset occurs when no normal observation is possible, as would be expected for classes such as leukemia. In the case where it is not possible to obtain a normal observation, microarray experiments on comparable tissue types have been added into the data, such as microarrays of blood from non-leukemia patients. Taking into account the normal microarrays, the full dataset is extended to 18 classes of tissue type. Ramaswamy et al. (2001) [8] define separate test and training datasets of the microarrays within the tumor classes for tissue type classification. However, as our intention is to consider both the tumor/normal and tissue type classification problems, we keep the test set of Ramaswamy et al. (2001) for the tumor tissue types but randomly assign 45 normal observations to our test set and assign the other 45 normal observations to our training set. Our defined training/test partition is described in Table 2.
Table 2. Summary of the Ramaswamy et al. microarray dataset: numbers of tumor and normal observations in the training and test sets for each tissue type. Tissue types (abbreviations): Bladder (BL), Breast (BR), Central Nervous System (CNS), Colorectal (CO), Leukemia (LE), Lymphoma (LY), Melanoma (ML), Mesothelioma (ME), Ovary (OV), Germinal (GERMINAL), Lung (LU), Pancreas (PA), Prostate (PR), Renal (RE), Uterus (UT), Cerebellum (CEREBELLUM), Blood (BLOOD), Brain (BRAIN).
To construct the graph adjacency matrix we consider two major aspects of our problem. Firstly, we are considering a supervised problem and therefore would like the sub-graphs within the adjacency matrix to agree as much as possible with the known response classes. Secondly, we are analyzing the performance of graph partitioning with decision trees and would therefore also like the sub-graphs to be generated by a tree structure. Fortunately, these two issues can be addressed by using the random forest proximity matrix [1] as the graph adjacency matrix. A random forest is an ensemble of classification tree models where each split within each tree is evaluated from a separate random sample of variables and observations [1]. The trees within a random forest are generated independently and the ensemble classification is performed by a majority vote on the predicted classes of each observation made by each tree. It is well established that creating a random forest stabilizes and improves predictive performance compared to a single decision tree. It has also been found that random forests are suitable for
feature selection [1] and observation of response class structure through the random forest proximity matrix [11]. The random forest proximity matrix is a graph adjacency matrix where the microarray experiments are the nodes and the edge between any two experiments is weighted by the number of times they are placed in the same terminal node over all trees within the random forest. As a random forest proximity matrix is built from an ensemble of trees it provides an ideal network structure for evaluating the relative performance of the graph cut measures. In this paper separate random forests are created on the training sets for the tumor/normal and for the tissue type classification. The random forests are built using the randomForest R package [3, 6] and consist of 500 decision trees where each split is evaluated on a random sample of 31 genes. A heat map of the random forest proximity matrix for both tumor/normal and tissue type classification reordered by the known classes is presented in Figure 3. In Figure 3 yellow represents high similarities between the observations within each class and red represents low similarities. It is immediately obvious within Figure 3 that there are two different resolutions within the dataset. Through closer observation of Figure 3 it is clear that the tumor/normal random forest classifies the tumor class more accurately than the normal class. For the tissue type classification we see that the larger tumor classes, CNS, LE and LY, are easily classified but the smaller groups are not as easily found.
Fig. 3. Random forest proximity matrices for training datasets for tumor/normal and tissue type classification (panels: Tumor/Normal Adjacency Matrix, Tissue Type Adjacency Matrix).
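The way a proximity matrix is assembled from terminal node memberships can be sketched in a few lines of Python. The input format `leaf_ids` (one list of terminal node identifiers per tree) and the function name `proximity_matrix` are hypothetical and chosen for illustration; the sketch does not reproduce the randomForest R package, and it leaves the diagonal equal to the number of trees.

```python
def proximity_matrix(leaf_ids):
    """Count how often each pair of observations shares a terminal node.

    leaf_ids : one list per tree; leaf_ids[t][i] is the terminal node that
               tree t assigns to observation i (hypothetical input format)
    Returns an n x n symmetric matrix of co-occurrence counts.
    """
    n = len(leaf_ids[0])
    prox = [[0] * n for _ in range(n)]
    for tree in leaf_ids:
        # Group observations by the terminal node they fall into in this tree.
        members = {}
        for obs, leaf in enumerate(tree):
            members.setdefault(leaf, []).append(obs)
        # Every pair inside a terminal node gains one unit of proximity.
        for group in members.values():
            for i in group:
                for j in group:
                    prox[i][j] += 1
    return prox


# Three toy trees over four observations.
leaf_ids = [
    [1, 1, 2, 2],
    [1, 1, 1, 2],
    [3, 4, 4, 4],
]
prox = proximity_matrix(leaf_ids)
```

The resulting counts can be used directly as edge weights of the adjacency matrix, with observations that are frequently grouped together receiving heavier edges.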
To compare the performance of each graph cut index for decision tree partitioning of the adjacency matrices in Figure 3 we perform 10-fold cross-validation for tree sizes ranging from 1 to 25 partitions, that is, 2 to 26 sub-graphs. To show the sensitivity of each graph cut index at each tree size, the graph cut indices are evaluated on the training set. Furthermore, to assess the predictive power of each index, the correct classification rates (CCR) of each tree classifying the relevant response for both the
training and test sets are also presented. Over the course of each cross-validation we keep a count of which genes are used to construct the tree. This count is then used as a measure of variable importance. To assess which graph cut index is selecting the most informative genes, we compare our importance measure to the top 1000 "One vs. All" (OVA) features for each tissue type identified by [8].
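The two evaluation quantities described above can be sketched as follows. The sketch assumes, for illustration only, that each sub-graph predicts the majority class of the observations it contains and that a gene is counted at most once per cross-validation fold; the function names and input formats are hypothetical.

```python
from collections import Counter


def correct_classification_rate(subgraph_of, labels):
    """CCR when every sub-graph predicts the majority class of its members.

    subgraph_of : sub-graph (terminal node) identifier for each observation
    labels      : known class of each observation
    """
    by_subgraph = {}
    for node, label in zip(subgraph_of, labels):
        by_subgraph.setdefault(node, Counter())[label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_subgraph.values())
    return correct / len(labels)


def variable_importance(genes_used_per_fold):
    """Number of cross-validation folds in which each gene was used by the tree."""
    counts = Counter()
    for genes in genes_used_per_fold:
        counts.update(set(genes))  # count a gene at most once per fold
    return counts


ccr = correct_classification_rate([0, 0, 0, 1, 1], ["T", "T", "N", "N", "N"])
importance = variable_importance([["g1", "g2"], ["g1"], ["g1", "g3", "g3"]])
```

With 10 folds, a gene used in every fold would receive the maximum count of 10 under this convention.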
4. Results and Discussion
The performance results for the 10-fold cross-validation are shown in Figure 4. In Figure 4 the left graphs plot the graph cut indices, the middle graphs plot the training set correct classification rate (CCR), and the rightmost graphs plot the test set CCR for each tree size over the course of 10-fold cross-validation. The top row of plots in Figure 4 shows the results for the tumor/normal classification and the bottom row shows the results for the tissue type classification. It should be noted that, for comparison, each graph cut index has been scaled such that the maximum value is 1, and the error bars are ±1 standard deviation from the mean. Additionally, the best partition for ratio and normalized cut is when the index is minimized, whereas the best partition for modularity is when the index is maximized. For performance comparison purposes, the SVM classifier employed by Ramaswamy et al. (2001) [8] classified the tumor/normal classes at 92% accuracy and the tissue types at 78% accuracy. However, it should be noted that the datasets are not exactly the same, because in our dataset we have divided the normal microarray observations into test/training samples. From Figure 4 it is immediately obvious that for each index, as the tree is grown, the correct classification rate (CCR) of the known classes on both the training and testing subsets increases. The increasing CCR indicates that each measure is finding the predictive structure within the adjacency matrix. It can also be observed that for tumor/normal classification, as the tree size increases each index converges to the same classification performance; for the tissue type classification problem, however, modularity appears to perform slightly worse. The reduced performance of modularity cut for tissue type classification may be a result of a lack of sensitivity to smaller sub-graphs.
The trend of the modularity index, however, is more reliable than that for ratio or normalized cut because it is observed in Figure 4 that after a tree size of 19 the modularity decreases. Interestingly, this decrease in modularity after 19 splits seems not to affect the classification performance, suggesting that there are no more predictive sub-graphs remaining to be found. Therefore, the decrease in modularity after 19 splits indicates that the optimal number of sub-graphs has been reached and that any further partitioning does not improve the result. The power to estimate the optimal tree size is not observed in either the ratio or normalized cut indices. The top 10 important genes for each index for both classification problems are presented in Table 3. The important decision tree variables are sorted in decreasing order of importance, with a decision tree rank of 10 indicating that a gene was
Fig. 4. 10-fold cross-validation results for tumor/normal and tissue type classification models (panels, for both tumor/normal and tissue type: Graph Cut Index, Training Set CCR, Test Set CCR).
selected in the building of each decision tree in all 10 cross-validation training sets. For the Ramaswamy et al. (2001) OVA rank, the lower the value the more important a gene is to a tissue type, with an OVA rank of 1 for the most important and 1000 for the least. The published list can be found in the supplementary materials section of [8]. In Table 3 an examination of the selected genes for both experiments shows a high degree of similarity in the lists for ratio and normalized cut, whereas the list for modularity cut appears different. This is seen most clearly for tissue type classification, where 5 of the 10 genes appear in both the ratio and normalized cut lists, but only 1 gene is found to be similar within the modularity cut list. This result is expected, as modularity and both normalized and ratio cut differ considerably in their definition of the structure of a sub-graph. Comparing the decision tree rankings with the OVA ranking (Table 3), it is seen that for tumor/normal classification the genes selected by all three indices are not well ranked genes in the OVA scheme and seem to span multiple tissue types. For tumor/normal classification the poor OVA rankings and lack of tissue type specificity are expected, as the OVA rank is specific for separating tissue type classes, not for separating tumor/normal classes. For tissue type classification the selected genes for ratio and normalized cut are found to be well ranked genes in the OVA rankings. In particular, 3 genes selected by the decision trees, AB006781_s_at, RC_AA176975_s_at and L20688, are found to be the top ranked genes for the colorectal, prostate and leukemia tissue types respectively. However, it is observed that modularity cut does not find genes that are well ranked in the OVA scheme. In particular, modularity cut identifies 5 genes in its 10 top ranked genes that do not appear in any OVA ranking for any tissue type. The lack of power for feature selection of modularity cut is surprising given the comparable classification rates in Figure 4. Overall, from Table 3 it appears that both ratio and normalized cut are clearly identifying important genes found to be specific to several tissue types, whereas modularity cut is not as useful for this purpose.
Table 3. Decision tree VIP ranking compared with OVA tissue type ranking. For each classification problem (tumor/normal and tissue type) and each graph cut index (ratio cut, normalized cut, modularity cut), the table lists the variable name, the decision tree rank and the OVA rank for each tissue type (BL, BR, CNS, CO, LE, LU, LY, ME, ML, OV, PA, PR, RE, UT).
5. Conclusions
Overall, this work has shown that using decision trees in combination with graph partition indices is an accurate and informative way of identifying sub-graphs within an adjacency matrix. Our experiments show that either ratio cut or normalized cut appears to be more accurate and informative than the modularity cut.
However, it was observed that the modularity index gave more information on the optimal number of sub-graphs. Future work to assess the performance of each index for decision tree construction would have to consider more datasets with differing network structures. Furthermore, further exploration of the feature selection properties of each index is required, focusing on the effect of surrogate splits and position within the decision tree.
6. Acknowledgements
This work was in part supported by a Japan Society for the Promotion of Science (JSPS) fellowship and the Japan Science and Technology Agency - Institute for Bioinformatics Research and Development (JST-BIRD) project.
References
[1] Breiman, L., Random forests. Mach. Learn., 45:5-32, 2001.
[2] Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
[3] Breiman, L., Cutler, A., Liaw, A., and Wiener, M., randomForest: Breiman and Cutler's random forests for classification and regression, 2008.
[4] Hagen, L. and Kahng, A. B., New spectral methods for ratio cut partitioning and clustering. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 11(9):1074-1085, 1992.
[5] Hancock, T., Multivariate Consensus Trees: Tree-based clustering and profiling for mixed data types. PhD thesis, Mathematics and Statistics Department, James Cook University, 2006.
[6] Ihaka, R. and Gentleman, R., R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299-314, 1996.
[7] Karypis, G. and Kumar, V., Multilevel k-way hypergraph partitioning. VLSI Design, 11(3):285-300, 2000.
[8] Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., Poggio, T., Gerald, W., Loda, M., Lander, E. S., and Golub, T. R., Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA, 98(26):15149-15154, December 2001.
[9] Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., and Barabasi, A. L., Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551-1555, August 2002.
[10] Shi, J. and Malik, J., Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[11] Shi, J. and Horvath, S., Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15(38):118-138, 2006.
[12] Shiga, M., Takigawa, T., and Mamitsuka, H., A spectral clustering approach to optimally combining numerical vectors with a modular network. Proceedings of the 13th ACM SIGKDD, pages 647-656, 2007.
[13] Smyth, C., Coomans, D., Everingham, Y., and Hancock, T., Auto-associative multivariate regression trees for cluster analysis. Chemometrics and Intelligent Laboratory Systems, 80:120-129, 2005.
MEASURING CORRELATIONS IN METABOLOMIC NETWORKS WITH MUTUAL INFORMATION
JORGE NUMATA 1
[email protected]
OLIVER EBENHÖH 2
[email protected]
ERNST-WALTER KNAPP 1
[email protected]
1 Macromolecular Modeling Group, Freie Universität Berlin, Takustr. 6, Berlin, 14195 Germany
2 Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, Potsdam-Golm, 14476 Germany
Non-linear correlations based on mutual information are evaluated to measure statistical dependencies among data points measured from metabolism in two dimensional space. While the Pearson correlation coefficient is only rigorously applicable to characterize strictly linear correlations with Gaussian noise, the mutual information coefficient is more generally valid. Here, we use recent distribution-free (non-parametric) mutual information estimators based on k-nearest neighbor distances. The mutual information algorithm of Kraskov et al. is found to yield estimates with low systematic and statistical error. The significance of the different methods is probed for artificial sets of tens to hundreds of data points, a size currently typical for metabolomic data. We analyze experimental data on metabolite concentrations from Arabidopsis thaliana by using these procedures. The mutual information was able to detect additional non-linear correlations undetectable by the Pearson coefficient.
Keywords: statistical correlation; Pearson coefficient; non-linear correlation; mutual information; k-nearest neighbor entropy; metabolomics; Arabidopsis thaliana.
1. Introduction: Linear and Non-linear Correlation Measures
1.1. Correlations of metabolite concentration data
Metabolomics is a crucial tool in systems biology, since it allows insight into the phenotypic result of gene expression. Metabolites show coupled changes in concentration, both under the influence of genomic and stress perturbations, and as part of the intrinsic variability of a biological network. The meaning of these correlations in terms of biochemical network topology and gene expression remains to be fully elucidated [1-3]. One important stumbling block is synthesized in the saying "correlation is not causation", which equally applies to non-linear correlations. Steuer et al. observed that large linear (Pearson) correlation coefficients often do not coincide with metabolite pairs that are neighbors in the biochemical network [4]. Here, we follow a more modest goal in providing suitable measures for statistical correlations among metabolite concentrations and testing their significance. The method is also able to detect non-linear correlations not accessible to the linear Pearson coefficient $r_C$.
Statistical correlation measures based on mutual information are able to capture more features of the data than the linear Pearson correlation coefficient. At the same time, they demand larger data sets than the Pearson coefficient to be significant. Here, we test recent developments in non-parametric methods for entropy estimation [5-7] to provide a general, non-linear measure of statistical dependencies.
1.2. Advantages of mutual information as a measure of correlation
Mutual information is a non-linear measure of statistical dependence based on information theory [8]. It has advantages over other methods,
• since it requires no decomposition of the data into modes, so there is no need to assume additivity of the original variables, as is done in Principal Component (PCA) and Independent Component Analysis (ICA) [9].
• since it makes no assumptions about the functional form (Gaussian or non-Gaussian) of the statistical distribution that produced the data. Hence, it is a non-parametric method.
A numerical implementation based on k-nearest neighbor distances [5-7, 10] is more attractive than other methods to estimate mutual information [11],
• since it requires no binning to generate histograms.
• since it consumes less computational resources and its parameters are easier to tune than for kernel density estimators.
It is a common practice to normalize the data to zero mean and unit variance using a linear transformation, which has no effect on the Pearson correlation coefficient. Linear transformations are smooth and uniquely invertible maps, as are the more general homeomorphic (non-linear) transformations. Mutual information for pairs of variables is not altered by general homeomorphic transformations of the data [5, 12]. These properties are important because metabolomic data rarely yield absolute concentrations, but rather ratios of concentrations [2].
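The statement about linear transformations can be checked numerically with a small Python sketch. The synthetic data, the helper names `pearson` and `standardize`, and the choice of the exponential as an example of a smooth invertible non-linear map are assumptions made for illustration. The sketch only covers the Pearson coefficient; demonstrating the invariance of mutual information would additionally require an estimator.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def standardize(x):
    """Linear transformation to zero mean and unit variance."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / sd for a in x]


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(200)]
y = [a + random.gauss(0.0, 0.5) for a in x]

r_raw = pearson(x, y)
# Standardizing both variables is a linear map and leaves the coefficient unchanged.
r_standardized = pearson(standardize(x), standardize(y))
# A monotone but non-linear map of x changes the Pearson coefficient.
r_exponential = pearson([math.exp(a) for a in x], y)
```

The Pearson coefficient is therefore tied to the scale on which the data are expressed, whereas a measure that is invariant under such maps, as stated above for mutual information, is not.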
1.3. Entropy, mutual information and statistical (in)dependence
We will employ the usual symbol for entropy from information theory ($H$) instead of the thermodynamic notation ($S$). All logarithms (ln) refer to base $e$, so that entropy and mutual information are measured in nats. To convert to bits, divide by ln(2). We are interested in testing the correlation between two random variables $x_i$ and $x_j$, which have marginal probability densities $p_i(x_i)$ and $p_j(x_j)$ and a joint probability density $p_{(i,j)}(x_i, x_j)$. In the current non-parametric approach, no particular functional form for the probability densities is assumed. The corresponding differential entropies (for continuous variables) are:

\[ H_i = -\int p_i(x_i) \ln p_i(x_i)\, dx_i, \qquad H_{(i,j)} = -\iint p_{(i,j)}(x_i, x_j) \ln p_{(i,j)}(x_i, x_j)\, dx_i\, dx_j. \tag{1} \]

The mutual information $I_{(i,j)}$ shared by $x_i$ and $x_j$ is

\[ I_{(i,j)} = \iint p_{(i,j)}(x_i, x_j) \ln \left[ \frac{p_{(i,j)}(x_i, x_j)}{p_i(x_i)\, p_j(x_j)} \right] dx_i\, dx_j. \tag{2} \]

Two variables $x_i$ and $x_j$ are statistically independent if and only if the joint probability density equals the product of the marginal densities, $p_{(i,j)}(x_i, x_j) = p_i(x_i)\, p_j(x_j)$, since in that case the argument of the logarithm in Eq. 2 is unity and the mutual information vanishes. The mutual information $I_{(i,j)}$ can also be written [8] as:

\[ I_{(i,j)} = H_i + H_j - H_{(i,j)}. \tag{3} \]

If the variables $x_i$ and $x_j$ are correlated, $I_{(i,j)}$ will take a positive value up to $\min(H_i, H_j)$. We employ a more intuitive non-linear correlation coefficient $r_I$ [6, 13, 14] that assumes values in the interval (0, 1) for correlated variables. This coefficient $r_I$ is a measure of the generalized statistical dependence between two variables. For strict correlation of the variables $x_i$ and $x_j$ (e.g., $x_i = x_j$ or $x_i = -x_j$), $r_I$ adopts the maximum value of +1; in the absence of correlation $r_I$ vanishes. Although the exact value of $I_{(i,j)}$ cannot become negative, approximate evaluations can. Therefore, we propose here a modification of the coefficient $r_I$ to allow also for negative values (-1, 0) that can quantify possible numerical errors in estimating mutual information
(4)
Note that negative values of $r_I$ should not be interpreted as anti-correlations, since $I_{(i,j)}$ also adopts positive values in that case. In contrast to mutual information, the Pearson correlation coefficient $r_C$ quantifies exclusively linear correlations, and is actually given as a normalized covariance
(5)
where
(6)
Since the Pearson correlation coefficient is based on quadratic forms, it is relatively sensitive to outliers. Negative values of $r_C$ for two variables denote anti-correlation (appearing as a negative slope in a linear fit). In the numerical implementation, a value of $r_C = 0$ is assigned to cases where one of the variances in the denominator of Eq. (5) vanishes. A non-vanishing $r_C$ means that a linear fit can describe the correlation between $x_i$ and $x_j$ approximately. Similarly, a positive non-linear coefficient $r_I$ means that the variables $x_i$ and $x_j$ are correlated, and a non-linear fit could describe this relationship.
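For orientation, expressions for Eqs. 4 to 6 that are consistent with the description above can be sketched as follows. They are assumptions rather than a transcription of the original equations: the first is the generalized correlation coefficient commonly built from mutual information, with a sign factor as one possible way to admit negative estimates, and the other two are the usual normalized covariance.

```latex
% Assumed form of Eq. 4: coefficient in (-1, 1) built from the mutual information
r_I = \operatorname{sign}\!\left(I_{(i,j)}\right) \sqrt{1 - \exp\!\left(-2\,\lvert I_{(i,j)} \rvert\right)}

% Assumed form of Eq. 5: Pearson coefficient as a normalized covariance
r_C = \frac{C_{ij}}{\sqrt{C_{ii}\, C_{jj}}}

% Assumed form of Eq. 6: covariance of x_i and x_j, with <.> the sample average
C_{ij} = \left\langle \left(x_i - \langle x_i \rangle\right) \left(x_j - \langle x_j \rangle\right) \right\rangle
```

Under this assumed form, a bivariate Gaussian with Pearson coefficient $\rho$ has mutual information $-\frac{1}{2}\ln(1 - \rho^2)$ and therefore $r_I = |\rho|$, so the two coefficients would agree up to sign for linear Gaussian relationships.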
This is a very general statement and does not imply any particular functional form (such as a quadratic polynomial).
1.4. Numerical methods: k-nearest neighbor entropy and Kraskov mutual information
We employ two different methods to estimate the mutual information $I_{(i,j)}$ and its correlation coefficient $r_I$ from Eq. 4. One method uses the k-nearest neighbor entropy [10] introduced by Hnizdo et al. [6, 7], which estimates $H_i$, $H_j$ and $H_{(i,j)}$ individually and then calculates the mutual information $I_{(i,j)}$ from Eq. 3, yielding the coefficient $r_I^{kNN}$ from Eq. 4. The second method estimates the mutual information $I_{(i,j)}$ in a more direct way using the Kraskov et al. algorithm [5], which is also based on a nearest-neighbor approach. The more direct estimate is advantageous, since it avoids accumulation of systematic biases inherent in the terms $H_i$, $H_j$ and $H_{(i,j)}$ when using Eq. 3 for $I_{(i,j)}$. For both algorithms, $r_I^{kNN}$ and $r_I^{Kras}$, the only adjustable parameter is the number of neighbors. We employ the k = 6th nearest neighbor, which proved to be a good compromise between systematic and statistical errors (data not shown).
2. Application to Constructed Data
2.1. Non-linear correlations are captured by mutual information
The non-linear correlation coefficient based on mutual information is able to detect additional correlations invisible to the linear Pearson coefficient. Cases A1-A7 in Fig. 1 show comparable performance for both coefficients in linear cases. However, the non-linear nature of the correlation between the variables in cases B1-B6 causes $r_C$ to vanish. Visually it is obvious that a relationship exists, and this is quantified by $r_I^{Kras}$.
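The contrast between the two coefficients can be reproduced on a small scale with the following Python sketch. It is a brute-force illustration, not the implementation used in the paper: the estimator follows the first nearest-neighbor algorithm of Kraskov et al. as it is commonly stated (max-norm distances, strict neighbor counts in the marginals), the mapping from mutual information to a coefficient is an assumed form, and the noisy parabola merely stands in for a non-linear case; all function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function at a positive integer n: psi(n) = -gamma + sum_{m<n} 1/m."""
    return -EULER_GAMMA + sum(1.0 / m for m in range(1, n))


def ksg_mutual_information(x, y, k=6):
    """Kraskov-style k-nearest-neighbor estimate of mutual information in nats."""
    n = len(x)
    total = 0.0
    for i in range(n):
        # Max-norm distances from point i to all other points in the joint space.
        dists = sorted(max(abs(x[i] - x[j]), abs(y[i] - y[j]))
                       for j in range(n) if j != i)
        eps = dists[k - 1]  # distance to the k-th nearest neighbor
        # Count neighbors strictly closer than eps in each marginal space.
        n_x = sum(1 for j in range(n) if j != i and abs(x[i] - x[j]) < eps)
        n_y = sum(1 for j in range(n) if j != i and abs(y[i] - y[j]) < eps)
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)
    return digamma_int(k) + digamma_int(n) - total / n


def r_from_mi(mi):
    """Map mutual information to a coefficient in (-1, 1); an assumed form, not the source's."""
    return math.copysign(math.sqrt(1.0 - math.exp(-2.0 * abs(mi))), mi)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


random.seed(1)
x = [random.uniform(-1.0, 1.0) for _ in range(300)]
y_parabola = [a * a + random.gauss(0.0, 0.02) for a in x]   # non-linear dependence
y_noise = [random.gauss(0.0, 1.0) for _ in x]               # no dependence

r_c_parabola = pearson(x, y_parabola)
r_i_parabola = r_from_mi(ksg_mutual_information(x, y_parabola))
r_i_noise = r_from_mi(ksg_mutual_information(x, y_noise))
```

For the symmetric parabola the Pearson coefficient stays close to zero while the coefficient based on mutual information is large; for independent noise the latter remains much smaller, although at a few hundred points it does not reach exactly zero.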
Figure 1. Comparison of the performance of the Pearson (linear) correlation coefficient $r_C$ and the non-linear measure $r_I^{Kras}$ (an implementation of $r_I$) based on mutual information. Each of the 14 cases is an artificial example showing a different functional relationship between the variables $x_i$ and $x_j$. The artificial data sets are large: $N_{size} = 10^5$ points, some of them with Gaussian noise. The first row (A1-A7) represents linear correlations, and for A4 a lack thereof. Except for the sign, comparable performance is shown for $r_I^{Kras}$ and $r_C$ when the correlation is linear. The second row (B1-B7) displays non-linear cases where $r_C$ fails to detect any correlation, while $r_I^{Kras}$ can quantify it. Case B7 is shown to be uncorrelated by both coefficients.
Note that anti-correlation (negative $r_C$) in A5-A7 is shown simply as statistical dependence in $r_I^{Kras}$, which is never negative in the absence of numerical errors. In any case, the concept of anti-correlation is not applicable to relations with changing slope such as B4 or B6 of Fig. 1. Furthermore, anti-correlation loses its meaning even for linear relations in more than two variables [14].
2.2. Significance of the coefficients given the limited sample size
Metabolomics and gene expression experiments currently yield tens to hundreds of data points. To probe the significance of the correlation coefficients among pairs of variables for such sample sizes, we numerically tested the artificial examples A2 (linear correlation), A4 (uncorrelated) and B4 (non-linear correlation undetectable by $r_C$). From Fig. 2, we observe that the sample size needed to detect correlations reliably is $N_{size} > 40$. One important lesson from Fig. 2 is that weak correlations corresponding to small correlation coefficients cannot be discriminated from background noise, in agreement with the findings of Selbig et al. [11]. There is a gray area in the region $0.550 < r_I^{Kras} < 0.665$ for $N_{size} = 43$, where a more thorough statistical analysis could be made [15]. In this work, we opt for the safe side by considering only large correlations.
Figure 2: For cases A2 (linear correlation), A4 (uncorrelated) and B4 (non-linear correlation), all three correlation coefficients r^PC, r_I^kNN and r_I^Kras (vertical axis) were estimated using the sample size N_size (horizontal axis). The error bars show empirical 95% confidence intervals (p = 0.05). They represent the observed, sometimes asymmetric variation around the mean for 2000 samples of N_size data points each. Among the non-linear methods, r_I^Kras shows smaller statistical and systematic errors than r_I^kNN. Negative values of r^PC denote anti-correlations. Negative values of r_I^kNN and r_I^Kras denote numerical errors in the estimation of mutual information.
The method of Hnizdo et al. [6, 7] yielding r_I^kNN is based on a nearest neighbor entropy estimator. As suggested by Kraskov et al. [5], the systematic bias in the individual entropy estimates of 1-D and 2-D samples will not necessarily cancel out in Eq. 3. In our numerical experiment (Fig. 2, column r_I^kNN), a negative systematic bias is evident for N_size < 58 and a positive bias for N_size > 58, with a larger spread of values and frequent occurrences of negative r_I^kNN values, which are traces of numerical errors. Nevertheless, the kNN method is still useful for very large sample sizes N_size > 1000 and when the 1-D entropy of each variable is of interest [10]. Computing r_I^Kras to obtain the correlation is better suited than r_I^kNN for our small sample sizes, in particular if we are interested in the mutual information I(ij) and not in the 1-D entropies. Thereby, r_I^Kras shows less systematic bias and lower variability (statistical bias) among the different computational methods. Negative values of I(ij) were only found for small to medium correlation values r_I^Kras < 0.665 or very small sample sizes N_size < 40, where the presence of correlation is difficult to detect.
Figure 3: For the test cases A2, A4 and B4 (see Fig. 1 for more details), the absolute value of the Pearson coefficient |r^PC| is plotted against the mutual information coefficient r_I^Kras for 2 × 10^4 samples. Samples yielding points (+) outside the rectangle show significant correlations for both coefficients, while the points (x) from samples inside the rectangle do not. The rectangle marks the cutoff values |r^PC| = 0.545 and r_I^Kras = 0.665, which were chosen to minimize detection of false positives where no correlations are present. For linear correlation (A2) both coefficients provide similar information. But the Pearson coefficient r^PC is not able to detect the non-linear correlations in B4 and reports values similar to the uncorrelated case A4. Negative values of r_I^Kras denote numerical errors in the estimation of mutual information.
The cutoff values r^PC = 0.545 and r_I^Kras = 0.665 were chosen from Fig. 3 to minimize detection of false positives in the absence of correlation (case A4) for N_size = 43. With these conditions, we obtain three false positives for 2 × 10^4 samples using r_I^Kras (see Fig. 3, middle part), corresponding to 0.015% of all samples and a concomitant p value of p = 3/(2 × 10^4) = 0.00015. In the following we will deal with data of sample size N_size = 43, comparing 16290 pairs of metabolic variables. At the same time, we expect to detect 99.8% of linearly highly correlated pairs, but only 28.5% of the non-linear ones (see Fig. 3). This is because the limited sample size of 43 data points restricts reliable detection to large values of non-linear correlations. In an analogous numerical simulation with 2 × 10^4 samples using the larger sample size N_size = 700, non-linear correlations with r_I^Kras > 0.75 could be completely separated from data obtained in the absence of correlation, r_I^Kras = 0 (data not shown). Even for small N_size = 43, applying both methods enriches detection of correlations in comparison to the usage of only r^PC.
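The cutoff selection described above can be illustrated with a small numerical experiment. The following is only a sketch under stated assumptions: it draws independent Gaussian pairs as a stand-in for the uncorrelated case A4 (the exact construction of A4 is not given here), it treats only the Pearson coefficient, and the function name `pearson_null` and the seed are hypothetical choices for illustration. The mutual information estimator is not reproduced.

```python
import numpy as np


def pearson_null(n_size=43, n_samples=20_000, seed=0):
    """Pearson coefficients of n_samples independent Gaussian pairs.

    Assumption: independent standard normal x and y stand in for the
    uncorrelated test case; each row is one sample of n_size points.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, n_size))
    y = rng.standard_normal((n_samples, n_size))
    xc = x - x.mean(axis=1, keepdims=True)  # center each sample
    yc = y - y.mean(axis=1, keepdims=True)
    denom = np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))
    return (xc * yc).sum(axis=1) / denom


r_null = pearson_null()
cutoff = 0.545  # Pearson cutoff quoted in the text for N_size = 43
false_positive_rate = float(np.mean(np.abs(r_null) > cutoff))
print(f"fraction of uncorrelated samples with |r_PC| > {cutoff}: "
      f"{false_positive_rate:.5f}")
```

Repeating the run with a smaller `n_size` shows how quickly the null distribution of the coefficient widens, which is the reason a stringent cutoff is needed at N_size = 43.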
3. Application to Metabolomic Data
3.1. Correlations among metabolite concentrations from Arabidopsis thaliana
The data set to be analyzed consists of a sample of N_size = 43 Arabidopsis thaliana plants comprising four lines, where each line involves between ten and twelve biological replicates. The latter refers to plants of the same line (i.e. possessing the same DNA), which are kept under identical growth conditions. The four plant lines used are ecotypes Col-0 and C24 and two of their crossings: Col-0×C24 and C24×Col-0. For each of these 43 plants, 181 standardized metabolite concentration ratios of the primary metabolism were measured. The data analyzed in the present study are the resulting (181 × 180)/2 = 16290 pairs of metabolites taken from a data set analyzed before [16]. In the following, we look for correlations between metabolite pairs by grouping plant lines and biological replicates to form a sample whose size (N_size = 43) is above the minimum reliable sample size, found to be N_size = 40. However, a larger sample size would allow detecting differences in correlation among plant lines more reliably. The conservative limits of r^PC > 0.545 and r_I^Kras > 0.665 to detect correlations were chosen for the case of N_size = 43. These limits are stringent and rather err on the side of producing false negatives to obtain trustworthy positives with p = 0.00015. More tolerant limits can be used for larger sample sizes or by employing p- and q-value analysis [15].
Table 1: Correlations among the 181 metabolites were tested in a pairwise fashion, yielding 16290 pairs, of which 5.6% were found to be significantly correlated (first three rows).
linear Pearson coefficient | non-linear mutual information coefficient | comments | number of metabolite pairs (fraction)
r^PC < 0.545 | r_I^Kras > 0.665 | non-linear / only discovered by r_I^Kras (Fig. 4 and 5) | 91 (0.56%)
r^PC > 0.545 | r_I^Kras > 0.665 | linear (large r_I^Kras) (Fig. 6) | 243 (1.49%)
r^PC > 0.545 | r_I^Kras < 0.665 | linear (small r_I^Kras) (Fig. 7) | 586 (3.60%)
r^PC < 0.545 | r_I^Kras < 0.665 | uncorrelated / undetectable (Fig. 8) | 15370 (94.4%)
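The counts in Table 1 can be checked with a few lines of arithmetic. This is an illustrative sketch only: the dictionary keys are hypothetical labels, and the estimate of expected false positives assumes that the per-pair rate p = 0.00015 obtained for the uncorrelated test case carries over independently to every metabolite pair, which the text does not claim.

```python
# Pair counts as listed in Table 1 (keys are hypothetical labels).
counts = {
    "only_mutual_information": 91,
    "both_coefficients": 243,
    "only_pearson": 586,
    "neither": 15370,
}

n_metabolites = 181
total_pairs = n_metabolites * (n_metabolites - 1) // 2  # unordered pairs

# Percentage of all pairs falling into each class.
fractions = {name: 100.0 * n / total_pairs for name, n in counts.items()}

# Pairs flagged as significant by at least one coefficient.
significant = total_pairs - counts["neither"]
significant_percent = 100.0 * significant / total_pairs

# Rough expectation of false positives if p = 3 / (2 * 10**4) applied
# independently to every pair (an assumption made for illustration).
p_false_positive = 3 / 2e4
expected_false_positives = total_pairs * p_false_positive

for name, pct in fractions.items():
    print(f"{name:>24s}: {pct:6.2f}%")
print(f"significant pairs: {significant} ({significant_percent:.1f}%)")
print(f"expected false positives under the assumption: {expected_false_positives:.2f}")
```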
3.2. Metabolite pairs with high correlation
The following examples show significantly correlated pairs of metabolites. We present only examples for which the chemical signature allowed a structural identification of the metabolite, namely for 85 out of the 181 recorded metabolites [16]. The calculations of correlations among pairs of standardized metabolite concentrations (x- and y-axes) were done for the merged 43 data points from the four plant lines of Arabidopsis thaliana. However, we differentiate among the plant lines:
pure ecotypes C24 (red cross, ×) and Col-0 (green circle, ○) as well as their crossings Col-0×C24 (blue square, □) and C24×Col-0 (black plus, +).
Figure 4: Examples with low Pearson coefficient (r^PC < 0.545) but significant mutual information coefficient (r_I^Kras > 0.656). Panel titles: r^PC = 0.46, r_I^Kras = 0.74; r^PC = 0.45, r_I^Kras = 0.69; r^PC = -0.02, r_I^Kras = 0.67; r^PC = 0.16, r_I^Kras = 0.67. Legible axis labels: glutamine, phenylalanine, tyrosine.
Figure 5: Aside from the presence of correlations only detectable by r_I^Kras > 0.656, these plots exhibit differences among plant lines. Pearson was not significant, with r^PC < 0.545. Panel titles: r^PC = 0.35, r_I^Kras = 0.73; r^PC = 0.22, r_I^Kras = 0.67; r^PC = 0.47, r_I^Kras = 0.67; r^PC = 0.53, r_I^Kras = 0.71. Legible axis labels: cellobiose (three panels), glycine.
Figure 6: Examples where both correlation coefficients r^PC and r_I^Kras indicate significant correlation (r^PC > 0.545, r_I^Kras > 0.656). Panel titles: r^PC = 0.89, r_I^Kras = 0.92; r^PC = 0.89, r_I^Kras = 0.91; r^PC = 0.63, r_I^Kras = 0.70; r^PC = 0.69, r_I^Kras = 0.68. Legible axis labels: glucose 6-phosphate, leucine, fucose, galactinol.
Figure 7: Examples for the case where only the Pearson coefficient was significant (r^PC > 0.545) but the non-linear coefficient was not (r_I^Kras < 0.656). Panel titles: r^PC = 0.66, r_I^Kras = 0.60; r^PC = 0.67, r_I^Kras = 0.58; r^PC = 0.64, r_I^Kras = 0.56; r^PC = 0.59, r_I^Kras = 0.26. Legible axis labels: succinic acid, threonic acid, glyceric acid, 2,4 hydroxybutyric acid.
Figure 8: The above examples show likely uncorrelated pairs of metabolites, where the limited number of data points does not allow a clearer classification. Panel titles: r^PC = 0.39, r_I^Kras = 0.34; r^PC = 0.10, r_I^Kras = 0.10; r^PC = -0.06, r_I^Kras = -0.41; r^PC = 0.17, r_I^Kras = 0.25. Legible axis labels: glucose 6-phosphate, citric acid, cellobiose, xylitol.
Figs. 4 and 5 present correlations which were only detectable as significant by the mutual information coefficient r_I^Kras, but invisible to the Pearson correlation coefficient r^PC. In Fig. 4, the reason is the presence of outliers. The examples in Fig. 5, in addition to correlation, also present differences among plant lines, which cluster in different concentration regimes. Three of the plots in Fig. 5 involve cellobiose, which in another study [16] using a larger data set was found to be the largest contributor to phenotypic variations. The metabolic data analyzed in the present study are a subset of these data. In particular, the metabolic data with cellobiose are in an experimentally trustworthy concentration regime, where correlations are likely not caused by experimental error. Fig. 6 shows examples where both correlation coefficients r_I^Kras and r^PC adopt values that indicate significant correlation. The first plot corresponds to a large correlation found in a variety of studies [1]. The metabolites glucose 6-phosphate and fructose 6-phosphate are directly connected in the biochemical network by the enzyme EC 5.3.1.9 [17]. In the second plot, both metabolites are hydrophobic amino acids. But the chemical nature of the metabolites is seemingly unrelated in the third and fourth plots. In Fig. 7, we illustrate metabolite pairs where only the Pearson correlation coefficient r^PC points to significant correlation. Most such cases yield an intermediate value for r_I^Kras, in the "gray area" that does not allow clear discrimination. The last plot in Fig. 7 is a rare example where r_I^Kras is particularly small. Lastly, Fig. 8 shows either uncorrelated cases, or cases where the coefficients were not able to detect correlation reliably. The second plot shows two chemically related metabolites, which however show no correlation. The third plot shows a separate cluster for plant line Col-0, but no correlation. In the last plot some correlation seems to appear, but the correlation coefficients are too small to be significant.
4. Conclusion
There are two major advantages in using the mutual information coefficient. The first one is the discovery of additional correlations invisible to the Pearson coefficient, frequently because of the presence of outliers (see Fig. 4). The second advantage is the detection of correlation even if plant lines cluster in different concentration ranges. Although a cluster analysis would be able to detect these differences in concentration regimes, the present
method allows concurrent detection of correlation. For example, cellobiose displays a consistently lower concentration range when compared to galactinol for plant line Col-0, but not for the other three plant lines. Simultaneously, the two metabolites were found to be correlated by the mutual information coefficient, but not by the Pearson coefficient (Fig. 5). In this work, the emphasis was on discovering few but highly significant correlations, with a small risk of false classifications even for small sample sizes of N_size = 43. However, it should be noted that larger sample sizes of a few hundred data points would also allow smaller correlations to be detected.
Acknowledgments
This work was supported by the International Research Training Group "Genomics and Systems Biology of Molecular Networks" (GRK1360 of the DFG). We would like to thank Dr. Matthias Steinfath and Dr. Jan Lisec for useful discussions and for sharing their experimental data [16].
References
[1] Steuer, R., On the analysis and interpretation of correlations in metabolomic data. Briefings in Bioinformatics. 7(2): 151-158, 2006.
[2] Camacho, D., A. de la Fuente, and P. Mendes, The origin of correlations in metabolomics data. Metabolomics. 1(1): 53-63, 2005.
[3] Müller-Linow, M., W. Weckwerth, and M.-T. Hütt, Consistency analysis of metabolic correlation networks. BMC Systems Biology. 1(44), 2007.
[4] Steuer, R., et al., Observing and interpreting correlations in metabolomic networks. Bioinformatics. 19(8): 1019-1026, 2003.
[5] Kraskov, A., H. Stögbauer, and P. Grassberger, Estimating mutual information. Phys. Rev. E. 69: 066138, 2004.
[6] Hnizdo, V., et al., Nearest neighbor estimates of entropy. American J of Math and Manag Sciences. 23: 301-321, 2003.
[7] Hnizdo, V., et al., Nearest-Neighbor Nonparametric Method for Estimating the Configurational Entropy of Complex Molecules. J Comput Chem. 28(3): 655-668, 2007.
[8] Cover, T.M. and J.A. Thomas, Elements of Information Theory. 2nd ed. Wiley Series in Telecommunications, ed. D.L. Schilling. 2006.
[9] Steinfath, M., et al., Metabolite profile analysis: from raw data to regression and classification. Physiologia Plantarum. 132: 150-161, 2008.
[10] Numata, J., M. Wan, and E.W. Knapp, Conformational Entropy of Biomolecules: Beyond the Quasi-Harmonic Approximation. Genome Informatics. 18: 192, 2007.
[11] Steuer, R., et al., The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics. 18 Suppl. 2: S231-S240, 2002.
[12] Matsuda, H., Physical nature of higher-order mutual information: Intrinsic correlations and frustration. Phys Rev E. 3: 3096-3102, 2000.
[13] Dionisio, A., R. Menezes, and D.A. Mendes, Mutual information: a measure of dependency for nonlinear time series. Physica A: Statistical Mechanics and its Applications. 344(1-2): 326-329, 2004.
[14] Lange, O.F. and H. Grubmüller, Generalized Correlation for Biomolecular Dynamics. Proteins: Structure, Function, and Bioinformatics. 62: 1053-1061, 2006.
[15] Storey, J.D., The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value. The Annals of Statistics. 31(6): 2013-2035, 2003.
[16] Lisec, J., et al., Identification of metabolic and biomass QTL in Arabidopsis thaliana in a parallel analysis of RIL and IL populations. The Plant Journal. 53: 960-972, 2008.
[17] Mueller, L.A., P. Zhang, and S.Y. Rhee, AraCyc: A Biochemical Pathway Database for Arabidopsis. Plant Physiology. 132: 453-460, 2003.
OPTIMALITY CRITERIA FOR THE PREDICTION OF METABOLIC FLUXES IN YEAST MUTANTS
EVAN S. SNITKIN 1 [email protected]
DANIEL SEGRE 1,2 [email protected]
1 Graduate Program in Bioinformatics, Boston University, 44 Cummington St., Boston, Massachusetts, 02215, USA
2 Departments of Biology and Biomedical Engineering, Boston University, 24 Cummington St., Boston, Massachusetts, 02215, USA
Constraint-based models of cellular metabolism, such as flux balance analysis (FBA), use convex analysis and optimization to study metabolic networks at a genome scale. The availability of reaction lists for numerous organisms, along with a variety of network analysis and optimization tools, is making these approaches increasingly popular for metabolic engineering and biomedical applications, as well as for addressing fundamental biological questions. It is therefore very important to assess the predictive capacity of these models and to understand how to interpret them in a biologically relevant manner. Typically, model assessment is limited to gauging the ability to predict phenotypes, such as viability under different environmental and genetic conditions. These types of assessments, for the most part, focus only on the growth phenotype of the cells, but ignore the underlying flux predictions. While this may be sufficient for certain types of study, the question of whether flux balance models can reliably predict intracellular and transport fluxes is crucial for more detailed analysis, and remains largely unanswered. Here we compare FBA model predictions of yeast metabolic fluxes to a previously published set of experimentally determined fluxes for 13 different single gene deletion mutants across a variety of possible objective functions. We find that the specific optimization criteria used to determine fluxes have a significant impact on the accuracy of the predicted fluxes. Interestingly, while different optimization methods provide very different levels of agreement relative to experimental fluxes, they tend to provide similar predictions with respect to the effect of the perturbation on growth. This demonstrates that assessment of models at the level of flux predictions is a critical step in assessing the biological validity of different models and optimization criteria.
Keywords: flux balance analysis; gene deletion; optimality criteria; flux measurements
1. Introduction
A century of detailed biochemical studies, in conjunction with the genomic revolution, has culminated in the release of metabolic reconstructions for a number of model organisms. These metabolic reconstructions comprise the stoichiometries of all known enzymatic reactions in a given organism. In addition to enabling the study of metabolic networks in diverse organisms [19], these reconstructions have yielded the ability to create genome-scale predictive models by using the steady state framework of flux balance analysis [12]. Flux balance models have been released for a number of bacterial organisms such as E. coli [7] and H. pylori [14], and more recently also for the eukaryotes yeast [9] and human [5]. With the ability to generate models largely from sequence data, it should be expected that the pace of model development will only increase in the coming months and years.
Along with the increase in model availability has come a widening of the spectrum of reported applications of flux balance models. Recent work has demonstrated the use of flux balance models to address cutting edge research questions ranging from understanding the dynamics of microbial communities [17] to predicting perturbations required to fulfill complex metabolic engineering objectives [2]. These various applications of flux balance models often require different levels of predictive abilities from the models. For instance, for some applications, being able to accurately capture the range of possible metabolic behaviors of an organism is sufficient [3], while for others the ability to predict the precise metabolic state resulting from specific perturbations is required [2]. Given that different model applications may require different levels of predictive proficiency, it is important to be able to evaluate the appropriateness of models for addressing different research questions. A common method for evaluating models is by quantifying their abilities to predict the effects of environmental and genetic perturbations on growth rate. The attractiveness of this approach for model evaluation largely stems from the availability of high-throughput growth phenotype data for many organisms, in addition to the ease with which the effects of environmental and genetic perturbations on growth can be determined using these models. While such assessments evaluate model behavior in response to diverse perturbations, the assessments are typically limited to growth phenotype. An open question is how a model's ability to predict the growth phenotypes under a variety of conditions translates into its ability to predict the fluxes underlying the growth predictions.
Here, we utilized a compendium of experimentally determined fluxes for yeast single gene deletion mutants [1] to gain insight into the ability of yeast flux balance models to predict central carbon metabolic fluxes in response to perturbations. In addition to assessing the relationship between predictions of growth phenotypes and predictions of the underlying fluxes, we also compared the ability of different objective functions to predict the metabolic response to genetic perturbations. Through this analysis we hoped not only to assess the predictive abilities of flux balance models at the level of flux predictions, but also to understand what drives the metabolic response to genetic perturbations. Our results support previous studies which suggested that the metabolic response to genetic perturbations is best described as a minimal rerouting of fluxes around the perturbation. Despite the clear superiority of an objective function implementing minimal flux rerouting to predict mutant fluxes, all tested objective functions correctly predicted the growth phenotype for all 13 mutants considered. This suggests that correct predictions of growth phenotype do not necessarily imply an accurate prediction of the underlying fluxes.
2. Methods
2.1. Experimental flux data
All experimentally measured fluxes and uptake/secretion rates were taken from the supplementary material of the 2005 manuscript by Blank et al. [1]. Among the 38 single gene deletion mutants for which fluxes were measured, we focused on 13 for which the deleted gene did not have any duplicates. The reason for this is that gene duplicates are
implemented in a trivial manner in flux balance models, unless regulation is explicitly taken into account. In a typical flux balance calculation, duplicate genes completely back one another up under all conditions.
2.2. Flux Balance Analysis
Flux balance analysis is a linear constraint based modeling approach which has been described in detail elsewhere [6]. Briefly, flux balance analysis consists of two critical steps: (1) the imposition of linear constraints on fluxes, stemming from the assumption of steady state, and (2) an optimization step by which a particular set of fluxes fulfilling the given constraints is selected. These linear constraints limit the feasible flux solutions to those which result in no net production or consumption of any metabolite. These steady state constraints can be described by the nullspace of the m x n stoichiometric matrix S, i.e. feasible flux vectors v satisfy S v = 0. The columns of S represent the n reactions, and its rows the m different metabolites. An entry S_ij represents the stoichiometric coefficient of metabolite i in reaction j. In addition to the steady state constraints, additional linear constraints are imposed to set upper and lower bounds on individual fluxes (a_j ≤ v_j ≤ b_j). These constraints can be applied to fix maintenance requirements, restrict reversibility of reactions and set limits on nutrient uptake rates. The previously released iLL672 yeast metabolic reconstruction was used for all analyses [13]. Constraints on uptake rates were imposed to mimic the minimal glucose conditions under which the utilized set of experimental fluxes was determined. Gene deletions were implemented in the model by setting the flux to zero for all reactions requiring the protein product of the deleted gene.
2.3. Objective functions to predict mutant fluxes
While the imposition of the linear constraints mentioned above restricts the space of possible metabolic behaviors, there are still potentially an infinite number of flux states which can fulfill the given constraints. To select a particular flux state, which can in turn be compared to the experimentally measured fluxes, one typically maximizes or minimizes a linear combination of fluxes, based on a biologically relevant criterion. Here we evaluated the flux predictions made using several different criteria. A summary of the different objective functions and the motivation for testing them can be found in Table 1.
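The two steps of flux balance analysis described in Section 2.2 (steady state constraints plus bounds, followed by optimization of a linear objective) can be sketched on a toy network. Everything in this example is a hypothetical illustration: the five-reaction network, the bounds, and the helper names `fba` and `knock_out` are assumptions and are unrelated to the iLL672 reconstruction. SciPy's `linprog` is used as a generic LP solver.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network. Rows: metabolites A, B.
# Columns: uptake (-> A), main (A -> B), bypass (A -> B),
#          growth (B ->), secretion (A ->).
S = np.array([
    [1.0, -1.0, -1.0,  0.0, -1.0],   # metabolite A
    [0.0,  1.0,  1.0, -1.0,  0.0],   # metabolite B
])
UPTAKE, MAIN, BYPASS, GROWTH, SECRETION = range(5)

# Lower and upper bounds a_j <= v_j <= b_j (arbitrary units).
bounds = [(0, 10), (0, 1000), (0, 4), (0, 1000), (0, 1000)]


def fba(S, bounds, objective_index):
    """Maximize one flux subject to S v = 0 and the flux bounds."""
    c = np.zeros(S.shape[1])
    c[objective_index] = -1.0          # linprog minimizes, so negate
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x


def knock_out(bounds, reaction_index):
    """Model a gene deletion by forcing the affected reaction flux to zero."""
    ko_bounds = list(bounds)
    ko_bounds[reaction_index] = (0, 0)
    return ko_bounds


v_wt = fba(S, bounds, GROWTH)
v_ko = fba(S, knock_out(bounds, MAIN), GROWTH)
print("wildtype growth flux:", v_wt[GROWTH])
print("knockout growth flux:", v_ko[GROWTH])
```

In this toy case the knockout growth flux is limited by the capacity of the bypass reaction, which mirrors the idea of flux being rerouted around a deleted reaction.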
3. Results
3.1. Experimentally determined fluxes
To evaluate the relative abilities of different objective functions to accurately predict the metabolic flux response to genetic perturbations, we utilized the aforementioned compendium of experimentally determined fluxes for S. cerevisiae single gene deletion mutants [1]. The mutants analyzed by Blank et al. were selected on the basis that the deleted genes encoded enzymes which catalyzed reactions that were active under minimal glucose conditions, but were not essential to growth. In other words, these genes encoded enzymes in flexible reactions, such that by observing how the metabolic network responds to their deletion, insight could be gained into the metabolic basis for
the robustness to gene deletions that has been previously observed in yeast metabolism [1, 4]. Despite the fact that the set of mutants analyzed by Blank et al. targeted genes in various central carbon metabolic processes, the nature of the metabolic flux responses was largely similar. Specifically, it was observed that for most mutants, the metabolic response was a local rerouting of flux around the perturbed reaction, with the relative flux through other pathways remaining similar to the wildtype. The exceptions to this rule were mutants in reactions critical to redox metabolism, where more distant rerouting was observed. An important caveat to the observed similarity in the flux distributions of the different mutants is that the absolute flux of carbon varied greatly. This aspect of the deletion mutant response is demonstrated in Fig. 1, where the glucose uptake and biomass production for the 13 mutants analyzed in the current study are shown. It can be seen that although the efficiency with which carbon is utilized is largely similar across different mutants, the growth rates vary greatly.
Fig. 1. Experimentally determined glucose uptake rates and fitness for strains analyzed in current study. Glucose uptake rates (horizontal axis, mmol/g/h) were plotted against the physiological fitness (vertical axis) for the 13 mutants analyzed in the current study, along with the wildtype. Each point represents an individual strain, which is labeled with the gene which was deleted, or with WT if no gene was deleted. Strain labels legible in the plot include LSC1, MAE1, WT, CTP1, SFC1, GLY1, GCV2, PCK1, OAC1, SDH1, FUM1, PDA1 and RPE1. Physiological fitness was computed by normalizing a strain's growth rate by that of the wildtype. The wide range of glucose uptake rates indicates variation in the absolute metabolic flux carried in the different mutants. On the other hand, the strong correlation between glucose uptake rate and physiological fitness suggests that the glucose is largely being used in a similar manner across the different mutants.
3.2. Objective functions used to predict mutant fluxes
Our assessment of the ability of yeast flux balance models to predict fluxes in single gene deletion mutants included the evaluation of a set of 9 different objective functions (see Table 1). These 9 objective functions can be divided into four categories: growth maximization, minimization of metabolic adjustment, experimentally motivated and alternate maximization criteria.
Table 1. Objective functions used to determine mutant fluxes.
Optimization Method | Primary Optimization Function | Additional Notes
FBA_MIN_AV | \max v_{growth}^{KO} | A secondary optimization was performed to minimize the sum of the absolute values of the fluxes
FBA_WT_MIN_DIST | \max v_{growth}^{KO} | A secondary optimization was performed to minimize the distance from an experimentally constrained WT solution
MOMA_LP | \min \sum_{i=1}^{m} |v_i^{KO} - v_i^{WT}| | LP refers to the use of linear programming to minimize the Manhattan distance
MOMA_QP | \min \sum_{i=1}^{m} (v_i^{KO} - v_i^{WT})^2 | QP refers to the use of quadratic programming to minimize Euclidean distance
MOMA_LP_WT_CONSTR | \min \sum_{i=1}^{m} |v_i^{KO} - v_i^{WT\_EXP}| | The experimentally constrained WT solution was computed minimizing the sum of fluxes, given the experimental constraints [13].
MOMA_QP_WT_CONSTR | \min \sum_{i=1}^{m} (v_i^{KO} - v_i^{WT\_EXP})^2 |
MOMA_LP_GLC_UP_NORM | \min \sum_{i=1}^{m} |v_i^{KO}/v_{GLC}^{KO} - v_i^{WT}/v_{GLC}^{WT}| |
MOMA_LP_BM_SINK | \min \sum_{i=1}^{m} |v_i^{KO} - v_i^{WT}| | During the optimization sink reactions were created for each biomass component
FBA_MAX_ETOH | \max v_{EtOH}^{KO} | For both primary and secondary optimizations biomass was fixed to the experimental value determined for the given mutant
Abbreviations: WT = Wildtype, KO = Knock Out, LP = Linear Programming, QP = Quadratic Programming, BM = Biomass, GLC = Glucose, EXP = Experimental
3.2.1. Growth maximization
This set consisted of two objective functions, which both select flux solutions which maximize biomass production. The two objective functions differ in their secondary objective functions, which are used to select among the set of alternative flux solutions which all result in optimal biomass production. The first, FBA_MIN_AV, performs a
secondary optimization which finds the flux distribution which produces the optimal biomass and has the minimal sum of the absolute values of fluxes through all reactions. The hypothesis underlying this approach is that yeast will attempt to achieve maximal growth at a minimal expense in terms of enzyme usage [10, 15]. The second objective function, FBA_WT_MIN_DIST, performs a secondary optimization which finds the set of fluxes which produces the optimal biomass and has the minimal Manhattan distance from an experimentally constrained wildtype solution. The motivation for this secondary objective was the aforementioned observation that the distribution of flux in deletion mutants is overall very similar to the wildtype.
3.2.2. Minimization of metabolic adjustment
This set consisted of four objective functions, all of which minimize the distance from a wildtype flux solution, given the additional constraint of the gene deletion [16]. These objectives differ in the distance metric used and in the wildtype flux solution to which the distance was minimized. The distance metrics were Manhattan (MOMA_LP and MOMA_LP_WT_CONSTR) and Euclidean (MOMA_QP and MOMA_QP_WT_CONSTR) distances, both of which have been used in previous applications of the minimization of metabolic adjustment criterion [13, 16]. The wildtype flux distributions differed in that one uses experimental flux data to constrain the solution space (MOMA_LP_WT_CONSTR and MOMA_QP_WT_CONSTR), and the other does not (MOMA_LP and MOMA_QP).
3.2.3. Experimentally motivated
Both of the experimentally motivated objective functions are derivatives of minimization of metabolic adjustment, but with additions which were motivated by some of the observations made by Blank et al. [1], and others [8], in the analysis of fluxes in genetic mutants. MOMA_GLC_NORM used an experimentally constrained wildtype solution as above, but minimized the distance between fluxes normalized by the glucose uptake rate (see Table 1). The motivation for MOMA_GLC_NORM was the observed variation in the absolute flux among the different deletion mutants. The second objective is MOMA_BM_SINK, which minimized the Manhattan distance from an experimentally constrained wildtype solution as above, but included sink reactions for all biomass components. The motivation for MOMA_BM_SINK was to alleviate constraints on maintaining wildtype growth, when minimizing distance to the wildtype flux solution.
3.2.4. Alternate maximization criteria
The only objective function in this category maximized ethanol production in the mutant, given that biomass production was fixed to the experimentally observed value. The FBA_MAX_ETOH objective was motivated by the well-known phenomenon whereby yeast preferentially ferments glucose, although glucose can be broken down more efficiently through oxidative phosphorylation [11]. Some have theorized that this aspect of yeast metabolism is a result of a selective advantage in maximizing ethanol production, so as to create a poor environment for potential competitors [18].
3.3. Correlations between experimental and predicted fluxes
Initial evaluation of the different objective functions was done by computing the Spearman rank correlation between predicted fluxes and 36 experimental flux measurements. These 36 fluxes, which consist of fluxes through central carbon metabolism along with uptake/secretion rates, were selected for correlation analysis because they represent a set of linearly independent variables in the genome-scale yeast model used. The results of the correlation analysis are shown in Fig. 2 for four optimization methods, which were found to be representative of the nine evaluated. For all 13 mutants tested, the objective functions which computed minimal distance from an experimentally determined wildtype solution achieved the best correlations. The performance of this set of methods was largely unaffected by the choice of distance metric (Manhattan or Euclidean), the addition of sinks for biomass components, or by computing distances based on fluxes normalized by glucose uptake rates. On the other hand, the nature of the wildtype reference from which the distance was minimized was found to be very important. Specifically, inferior performance was observed across all mutants when using the method which minimizes the distance from a wildtype solution predicted by assuming maximal biomass production.
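As an illustration of the correlation measure used here, the following is a minimal sketch of a Spearman rank correlation in plain Python. The function names (`average_ranks`, `spearman_r`) and the five toy flux values are hypothetical and are not taken from the study, which used 36 fluxes per mutant; ties are assumed to receive averaged ranks.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_r(predicted, measured):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(predicted), average_ranks(measured)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Undefined (division by zero) if either vector is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: five fluxes instead of the 36 used in the study.
measured = [1.00, 0.42, 0.15, 0.08, 0.61]
predicted = [0.95, 0.50, 0.05, 0.10, 0.70]
print(round(spearman_r(predicted, measured), 3))
```

Because only ranks enter the calculation, the measure rewards a correct ordering of fluxes and is insensitive to a uniform scaling of all predicted fluxes, which is relevant to the absolute-flux comparison in Section 3.4.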
[Figure 2: plot of Spearman rank correlation R (y-axis, ticks from 0.65 to 1.00) for each deletion mutant (x-axis; legible labels include CTP1, FUM1, GCV2, GLY1, LSC1, MAE1, OAC1, PCK1, PDA1).]
Fig. 2. Spearman correlations of predicted fluxes with experimentally determined fluxes. Spearman rank correlation R values were computed between experimentally determined fluxes and the fluxes predicted by each of the 9 objective functions for the 13 different gene deletion mutants. Here, the R values for 4 objective functions are shown, as these 4 were found to be representative of all 9. Specifically, MOMA_LP performed the same as MOMA_QP, while MOMA_LP_WT_CONSTR performed the same as MOMA_QP_WT_CONSTR, MOMA_GLC_NORM, and MOMA_BM_SINK. For virtually all mutants the strongest correlation was achieved using an objective which minimized the distance from an experimentally constrained wildtype flux solution (black circles). The reference flux solution was critical, as minimizing the distance from a wildtype solution computed with the assumption of optimal growth resulted in a decreased correlation in all mutants (gray triangles). The objective maximizing production of ethanol (gray diamonds) produced fluxes which were least correlated with the experimental measurements. Notably, despite the respirofermentative behavior of yeast in aerobic glucose conditions, maximization of ethanol did a worse job of describing the flux response than maximization of growth (black squares) for all 13 mutants.

[Figure 3: heatmap; pathway labels: ACETATE SECRETION, ANAPLEROTIC REACTIONS, BIOMASS, CITRATE CYCLE, ETC COMPLEX II, ETC COMPLEX IV, ETHANOL SECRETION, GLUCOSE UPTAKE, GLYCEROL SECRETION, GLYCOLYSIS, PENTOSE PHOSPHATE CYCLE, SUCCINATE SECRETION.]
Fig. 3. Normalized difference of fluxes predicted by MOMA_LP_WT_CONSTR from experimental values. Differences were computed between the experimentally determined and model predicted fluxes. Before taking the difference between fluxes, all fluxes were normalized by the glucose uptake rate for the given mutant. In order to make differences comparable for fluxes of different magnitudes, flux differences were then normalized by the range of a given flux across all experimental measurements. Finally, flux differences for reactions in the same metabolic pathway were averaged together to allow for easier interpretation of incorrect flux predictions. Displaying these data in a heatmap, where black represents maximal difference and white minimal difference, reveals that the largest differences between experimental and model predicted fluxes are for the pda1, zwf1 and rpe1 mutants. This fits with the correlation analysis, as these mutants had three of the lowest Spearman R values for the MOMA_LP_WT_CONSTR objective. Looking at the heatmap to identify the processes with the largest differences for these three mutants provides insight into the cause of the low correlations. For pda1, the large difference in succinate secretion is a result of the model failing to predict that the TCA cycle is used to maintain NADH/NAD balance in the absence of the pyruvate dehydrogenase reaction. For rpe1, the model did not capture rerouting present in many pathways. Most of these reroutings stemmed from differential use of the pentose phosphate pathway resulting from the gene deletion. Finally, for zwf1, there is a large increase in the flux through malic enzyme to compensate for the inability to produce NADPH through the pentose phosphate pathway. The increased flux through malic enzyme is associated with an increase in flux through the TCA cycle and the respiratory chain, which is not predicted by the model. In general, these three gene deletions all result in reroutings to maintain redox balance, and the full scope of these reroutings is missed by the model predictions.
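The normalization described in the caption of Fig. 3 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes that each flux set is divided by its own glucose uptake rate, that the range is taken over the glucose-normalized experimental fluxes, and that absolute differences are averaged per pathway. The function name, the reaction key `GLC_UPTAKE`, and the toy numbers are all hypothetical.

```python
def pathway_differences(experimental, predicted, pathways, glucose="GLC_UPTAKE"):
    """Return {mutant: {pathway: mean normalized |predicted - experimental|}}.

    experimental, predicted: {mutant: {reaction: flux}}
    pathways: {pathway name: [reaction names]}
    """
    def by_glucose(fluxes):
        # Step 1: normalize every flux by the glucose uptake of the same strain.
        return {m: {r: v / f[glucose] for r, v in f.items()} for m, f in fluxes.items()}

    exp_n, pred_n = by_glucose(experimental), by_glucose(predicted)
    reactions = {r for members in pathways.values() for r in members}
    # Step 2: range of each normalized flux across all experimental measurements.
    span = {r: max(e[r] for e in exp_n.values()) - min(e[r] for e in exp_n.values())
            for r in reactions}
    table = {}
    for mutant in exp_n:
        table[mutant] = {}
        for pathway, members in pathways.items():
            # Step 3: range-normalized differences; constant fluxes are skipped.
            diffs = [abs(pred_n[mutant][r] - exp_n[mutant][r]) / span[r]
                     for r in members if span[r] > 0]
            # Step 4: average over the reactions of one pathway.
            table[mutant][pathway] = sum(diffs) / len(diffs) if diffs else 0.0
    return table


# Hypothetical toy data with two strains and two glycolytic reactions.
experimental = {"wt": {"GLC_UPTAKE": 10, "PGI": 5, "PFK": 8},
                "mutA": {"GLC_UPTAKE": 5, "PGI": 5, "PFK": 4}}
predicted = {"wt": {"GLC_UPTAKE": 10, "PGI": 5, "PFK": 8},
             "mutA": {"GLC_UPTAKE": 10, "PGI": 7.5, "PFK": 8}}
heat = pathway_differences(experimental, predicted, {"GLYCOLYSIS": ["PGI", "PFK"]})
print(heat)
```

Dividing by the experimental range puts large and small fluxes on a common scale, so a pathway is only highlighted when the prediction error is large relative to the variation actually observed for that flux.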
While the objective function which minimizes distance from an experimentally constrained wildtype solution was best for all mutants, there is variability in its relative performance across mutants. To explore this variability in more detail, we examined predicted fluxes for MOMA_LP_WT_CONSTR, and assessed how well the fluxes through different metabolic pathways were predicted for different mutants. We hoped that the results of this analysis, which are displayed in a heatmap in Fig. 3, would provide insight into the sources of the decreased performance in certain mutants. The most erroneous flux predictions for most pathways are largely restricted to three mutants: rpe1, pda1, and zwf1. The pda1 and zwf1 mutants are in reactions which utilize redox cofactors, and as described by Blank et al. [1], such mutants tend to enact more distant rerouting to maintain redox balance. It therefore fits with intuition that an objective function which minimizes distance from the wildtype would struggle to capture more distant flux changes. A detailed examination of the predicted fluxes for these two mutants shows that while adjustments are predicted which resolve the redox imbalances caused by the given mutation, they are not the same adjustments found experimentally. For instance, for the pda1 mutant, the NAD/NADH imbalance caused by the mutation is predicted to be resolved using the NADH-dependent acetaldehyde dehydrogenase, but it seems that yeast instead increases respiratory activity to achieve redox balance. For the zwf1 mutant, the model fails to predict the huge increase in flux through the TCA cycle and malic enzyme, which occurs in yeast to counteract the deficiency in NADPH resulting from the lack of an intact pentose phosphate pathway. These examples indicate that the flux rerouting in yeast metabolism which takes place in order to maintain redox balance does not represent a minimal adjustment, or at least not minimal with respect to the distance metrics evaluated here.

3.4. Prediction of absolute flux changes

While the correlations computed above quantify how well the different objective functions predict the nature of flux reroutings in the various deletion mutants, they do not capture how well the different objectives predict the absolute flux through the system. As discussed above, while the 13 different mutants analyzed here largely have the same relative flux through different pathways as observed in the wildtype, the absolute flux varies greatly. To evaluate how well the different objective functions capture the effects of different mutations on absolute flux, we compared predicted biomass production in each mutant to the corresponding experimentally measured values. The results of this comparison are displayed in Fig. 4 for the MOMA_LP_WT_CONSTR objective function. Fig. 4 indicates that despite the strong correlation between predicted and observed fluxes for all deletion mutants, there is little success in predicting the relative effects of the same mutations on the growth rate. The same trend observed in Fig. 4 was seen for all objective functions. Specifically, across all objective functions no mutant was predicted to have less than 90% of the wildtype growth, whereas experimental measurements found that 9 of the 13 mutants in fact had less than 90% of the wildtype growth rate.
[Figure 4: scatter plot of model-predicted fitness (y-axis, ticks from 0.95 to 1.00) against experimental fitness (x-axis); labeled points: PCK1, GLY1, SFC1, GCV2, MAE1, OAC1, CTP1, WT, FUM1, SDH1, LSC1, RPE1, PDA1.]
Fig. 4. Comparison of model predicted and experimentally determined growth rates for different strains. Experimentally determined fitness was plotted against fitness predicted using the MOMA_LP_WT_CONSTR objective function for the 13 gene deletion and wildtype strains. Fitness was defined as the ratio between the growth rate of a given strain and the growth rate of the wildtype. While the experimental fitness values have a wide range across the 13 mutants, the model predicts that no mutant has a growth rate less than 95% that of the wildtype.
4. Discussion
We evaluated the proficiency with which yeast flux balance models can predict the flux response to a variety of gene deletion mutations. Specifically, we assessed the flux predictions made by nine different objective functions in response to 13 different single gene deletions. Comparison of flux predictions to complementary experimentally measured fluxes revealed that for all mutants the best performing objective functions were those which minimized the distance of mutant fluxes from an experimentally constrained wildtype solution. Importantly, while the 9 objective functions showed major differences in the accuracy of their predicted fluxes, all objectives correctly predicted that the 13 mutants would be able to produce biomass. This clearly demonstrates that the ability to correctly predict growth phenotypes does not necessarily translate into the ability to correctly characterize the underlying response at the level of reaction fluxes.

The fact that for all mutants the flux response was best described by objectives which implemented minimal flux rerouting supports previous analyses of the metabolic response to gene deletions. Although the minimal rerouting objectives were consistently the best, predictions were not equally good for all mutants. Specifically, it was found that for mutants in reactions involving redox cofactors, a minimal adjustment was not sufficient to completely describe the flux response. We hypothesize that the reason for this is that there are a number of degrees of freedom in redox balancing, and the minimal rerouting criterion by itself is not sufficient to accurately predict the observed response. Likely, criteria which cannot easily be captured by flux balance models, such as enzyme affinity for redox substrates and kinetic rate constants, are crucial in determining how redox balance is achieved.

In addition to issues with redox mutants, all objective functions failed to predict the absolute flux for different mutants. Specifically, despite accurate predictions of how fluxes were rerouted in the mutants, the model predictions did not capture the reduction in the overall flux observed in the experiments. The inability of any objective function to capture this aspect of the mutant response leaves the mechanism responsible for this observation unidentified. Again, it is likely that features of the metabolic response which cannot be captured by flux balance models are important here. Specifically, the relative efficiency of alternative pathways may limit the overall flux in mutants. Alternatively, regulatory responses to imbalances resulting from the gene deletions may cause an overall reduction in metabolic activity.

Despite some shortcomings in the ability of flux balance models to predict mutant flux responses, overall they largely capture the salient features of the response to the different gene deletions. Importantly, the selection of objective function proved critical to the accuracy of the predicted fluxes, despite having little effect on the prediction of mutant growth.

Acknowledgements
The authors would like to thank Bill Riehl and Hsuan-Chao Chiu for critical reading of the manuscript. The authors would also like to acknowledge support from the NASA Astrobiology Institute, the US Department of Energy, and Boston University.

References

[1] Blank, L.M., Kuepfer, L. and Sauer, U., Large-scale 13C-flux analysis reveals mechanistic principles of metabolic network robustness to null mutations in yeast, Genome Biol, 6(6):R49, 2005.
[2] Burgard, A.P., Pharkya, P. and Maranas, C.D., Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization, Biotechnol Bioeng, 84(6):647-57, 2003.
[3] Burgard, A.P., Nikolaev, E.V., Schilling, C.H., et al., Flux coupling analysis of genome-scale metabolic network reconstructions, Genome Res, 14(2):301-12, 2004.
[4] Deutscher, D., Meilijson, I., Kupiec, M., et al., Multiple knockout analysis of genetic robustness in the yeast metabolic network, Nat Genet, 38(9):993-8, 2006.
[5] Duarte, N.C., Becker, S.A., Jamshidi, N., et al., Global reconstruction of the human metabolic network based on genomic and bibliomic data, Proc Natl Acad Sci USA, 104(6):1777-82, 2007.
[6] Edwards, J.S., Ibarra, R.U. and Palsson, B.O., In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data, Nat Biotechnol, 19(2):125-30, 2001.
[7] Feist, A.M., Henry, C.S., Reed, J.L., et al., A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information, Mol Syst Biol, 3:121, 2007.
[8] Fischer, E. and Sauer, U., Large-scale in vivo flux analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism, Nat Genet, 37(6):636-40, 2005.
[9] Forster, J., Famili, I., Fu, P., et al., Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network, Genome Res, 13(2):244-53, 2003.
[10] Holzhutter, H.G., The principle of flux minimization and its application to estimate stationary fluxes in metabolic networks, Eur J Biochem, 271(14):2905-22, 2004.
[11] Johnston, M. and Kim, J.H., Glucose as a hormone: receptor-mediated glucose sensing in the yeast Saccharomyces cerevisiae, Biochem Soc Trans, 33(Pt 1):247-52, 2005.
[12] Kauffman, K.J., Prakash, P. and Edwards, J.S., Advances in flux balance analysis, Curr Opin Biotechnol, 14(5):491-6, 2003.
[13] Kuepfer, L., Sauer, U. and Blank, L.M., Metabolic functions of duplicate genes in Saccharomyces cerevisiae, Genome Res, 15(10):1421-30, 2005.
[14] Schilling, C.H., Covert, M.W., Famili, I., et al., Genome-scale metabolic model of Helicobacter pylori 26695, J Bacteriol, 184(16):4582-93, 2002.
[15] Schuetz, R., Kuepfer, L. and Sauer, U., Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli, Mol Syst Biol, 3:119, 2007.
[16] Segre, D., Vitkup, D. and Church, G.M., Analysis of optimality in natural and perturbed metabolic networks, Proc Natl Acad Sci USA, 99(23):15112-7, 2002.
[17] Stolyar, S., Van Dien, S., Hillesland, K.L., et al., Metabolic modeling of a mutualistic microbial community, Mol Syst Biol, 3:92, 2007.
[18] Thomson, J.M., Gaucher, E.A., Burgan, M.F., et al., Resurrecting ancestral alcohol dehydrogenases from yeast, Nat Genet, 37(6):630-5, 2005.
[19] Vitkup, D., Kharchenko, P. and Wagner, A., Influence of metabolic network structure and function on enzyme evolution, Genome Biol, 7(5):R39, 2006.
BIOSYNTHETIC POTENTIALS FROM SPECIES-SPECIFIC METABOLIC NETWORKS

GEORG BASLER (1,2) basler@mpimp-golm.mpg.de
ZORAN NIKOLOSKI (1,2) nikoloski@mpimp-golm.mpg.de
OLIVER EBENHÖH (1,2) ebenhoeh@mpimp-golm.mpg.de
THOMAS HANDORF (3) handorf@mpimp-golm.mpg.de

1 Institute for Biochemistry and Biology, University of Potsdam, 14476 Potsdam, Germany
2 Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam, Germany
3 Theoretical Biophysics, Humboldt-University Berlin, 10115 Berlin, Germany

Studies of genome-scale metabolic networks allow for qualitative and quantitative descriptions of an organism's capability to convert nutrients into products. The set of synthesizable products strongly depends on the provided nutrients as well as on the structure of the metabolic network. Here, we apply the method of network expansion and the concept of scopes, which describe the synthesizing capacities of an organism when certain nutrients are provided. We analyze the biosynthetic properties of four species: Arabidopsis thaliana, Saccharomyces cerevisiae, Buchnera aphidicola, and Escherichia coli. Matthaus et al. [12] have recently developed a method to identify clusters of scopes, reflecting specific biological functions and exhibiting a hierarchical arrangement, using the network comprising all reactions in KEGG. We extend this method by considering random sets of nutrients on well-curated networks of the investigated species from BioCyc. We identify structural properties of the networks that allow their biosynthetic capabilities to be differentiated. Furthermore, we evaluate the quality of the clustering of scopes applied to the species-specific networks. Our study provides a novel assessment of the biosynthetic properties of different species.
Keywords: biosynthetic capabilities; clustering; scope; species-specific
1. Introduction
Recently, there has been tremendous interest in the comparison of metabolic network structures in order to quantitatively and qualitatively explain the organizational structure and identify possible intrinsic network design principles. While the research in this field historically concentrated on kinetic modelling of small parts of metabolism, e.g., the glycolytic pathway [15], the emergence of biochemical databases, such as KEGG [10], Brenda [11], and BioCyc [16], has prompted interest in analyses of large-scale metabolic networks. As kinetic data corresponding to genome-wide, species-specific metabolic networks are often difficult to obtain or precisely determine, novel, topology-based methods have been introduced in the last decade to allow a functional analysis of such networks. In particular, such networks have been investigated by graph-theoretic approaches [1, 18, 20], steady-state analysis, e.g., elementary flux modes [17] or the related concept of extreme pathways [14], flux balance methods [5, 19], or, recently, by characterizing their synthesizing capacities using the concept of scopes [7].

The concept of a scope provides an effective method for determining which products a network can synthesize when it is provided with a given set of nutrient metabolites. In [8], it was shown that the synthesizing capacities of the nutrient metabolites, i.e., their scopes, form a complex hierarchy in the species-independent network defined by the KEGG database. This hierarchy is mainly determined by the chemical composition of the metabolites: those with a larger number of chemical elements or chemical groups (and, therefore, with a larger scope) are placed on top of metabolites with a simpler composition. In a recent paper [12], this complex hierarchy was condensed into a terse hierarchy of descriptive consensus scopes resulting from a clustering of scopes originating from all nutrient metabolites, taken individually. These consensus scopes represent sets of highly similar scopes, and could be assigned to characteristic combinations of chemical elements and a few chemical groups. As it is computationally infeasible to calculate the synthesizing capacities of all nutrient combinations, the consensus scopes are useful to efficiently describe the biosynthetic potential of a given metabolic network.

Here, we investigate at which meaningful threshold values the formerly observed hierarchies and corresponding consensus scopes can also be found in species-specific networks. Our analysis comprises the metabolic networks of four model species: Arabidopsis thaliana, Saccharomyces cerevisiae, Buchnera aphidicola, and Escherichia coli, as defined in the BioCyc database.
These species have been chosen as representatives of different domains of life and contrasting living environments. In particular, Arabidopsis thaliana (abbr. Arabidopsis, taxon 3702) is a eukaryotic, multicellular, CO2-fixing plant, while Buchnera aphidicola (abbr. Buchnera, taxon 107806) is a highly specialized, intracellular parasite in aphids. Escherichia coli (abbr. E. coli, taxon 83333) is a well-studied bacterium that can grow in a variety of environments, and Saccharomyces cerevisiae (abbr. Yeast, taxon 4932) is a unicellular eukaryote and fungus that has been extensively used as a model organism. Furthermore, we perform extensive analyses focused on the effect of different parameters on the outcome of the clustering approach. Finally, as the concept of scope strongly depends on the network structure, we discuss the influence on the scopes of properties characteristic of the investigated species-specific networks.

Organization and contributions: The methods employed in this study are presented in Section 2. The employed network representations and the scope algorithm are outlined in Subsections 2.1 and 2.2. In Subsections 2.3 - 2.5, we present the three main methods used in evaluating the influence of different parameters on the scope hierarchies, namely the scope size distribution, (dis)similarity indices, and weighted modularity of a given clustering. The results from our analysis appear in Section 3, while the effect of the network properties on the investigated approach for determining a representative scope hierarchy is discussed in Section 4.

2. Methods
In this section, we describe the methods for testing the sensitivity of the approach proposed by Matthaus et al. [12] in order to investigate the biosynthetic potential of specific species. In Subsection 2.1, we detail the retrieval and representation of networks used in this study. The main method, calculation of the scope, is formally presented in Subsection 2.2. The size distributions of scopes on the investigated networks are discussed in Subsection 2.3, and the approach for determining the relationship between the parameters and methods for clustering is discussed in Subsections 2.4 and 2.5.

2.1. Species-specific networks
A metabolic network is typically represented by a directed bipartite graph $G = (V, E)$. The node set $V$ of $G$ can be partitioned into two subsets: $V_r$, containing reaction nodes, and $V_m$, comprising metabolite nodes, such that $V_r \cup V_m = V$. The edges in $E$ are directed either from a node $u \in V_m$ to a node $v \in V_r$, in which case the metabolite $u$ is called a substrate of the reaction $v$, or from a node $v \in V_r$ to a node $u \in V_m$, in which case $u$ is called a product of the reaction $v$. In the following, we refer to substrates as predecessors (abbr. pred), and products as successors (abbr. succ).

Such a representation of a metabolic network can be retrieved from a publicly available database of biochemical reactions. Here, the metabolic networks of the four investigated species were obtained from the BioCyc database [16]. Similarly to the network retrieval procedure specified in Matthaus et al. [12], the reactions were checked for consistency, and those showing erroneous stoichiometry were removed. In addition, generic reactions and metabolites integrating sets of related metabolites were removed from the network, as proposed in [6]. The curation process was applied to the BioCyc database release from December 5, 2007, and resulted in networks of the following sizes: 1329 compounds and 1404 reactions (Arabidopsis), 1158 compounds and 1256 reactions (E. coli), 620 compounds and 594 reactions (Yeast), 356 compounds and 336 reactions (Buchnera).

The BioCyc database also provides information on the reversibility of biochemical reactions. Every enzymatic reaction (with a given direction), in principle, may also proceed in the reverse direction. However, the direction in which a reaction actually proceeds strongly depends on the metabolite concentrations, and may therefore vary for different physiological conditions. Thus, for analyzing the structure of a metabolic network from a given species, all reactions may be considered as being operable in both directions.
Here, as a result, all reactions are assumed to be reversible. Hence, the network is represented by a bipartite graph $G = (V, E)$, where the successors and predecessors of a reaction are interchangeably considered as reactants or products.
2.2. Biosynthetic potential of metabolites via scope

Given a metabolic network $G$ of an investigated species, the biosynthetic potential for a given set of metabolites, acting as substrates, can be described in terms of their scope, i.e., the metabolites that can be synthesized in the network from the substrates. The scope concept is related to reachability in the metabolic network $G$: A reaction node $v \in V_r$ is reachable if all of its substrates are reachable. Given a subset $S$ of metabolite nodes, called a seed, a node $u \in V_m$ is reachable either if $u \in S$ or if $u$ is a product of a reachable reaction. With these clarifications, we can present a precise mathematical formulation for the scope of a given seed [3]:

Definition 2.1. Given a metabolic network $G = (V, E)$ and a set $S \subseteq V_m$, the scope of the seed $S$, denoted by $R(S)$, is the set of all metabolite nodes reachable from $S$.

For a given metabolic network $G = (V, E)$ and a set $S \subseteq V_m$, the scope $R(S)$ can be determined in polynomial time of the order $O(|E| \cdot |V|)$, as can be established by analyzing the following algorithm:

Algorithm 1: Scope for a set of seed metabolites S in a metabolic network G

Input: Metabolic network $G = (V_m \cup V_r, E)$, set of seed metabolites $S \subseteq V_m$
Output: Scope $R(S)$

    mark all nodes in V_r as unreachable and unvisited
    R(S) = S
    repeat
        if there is a reachable unvisited node r in V_r then
            mark r as visited
            R(S) = R(S) ∪ pred(r) ∪ succ(r)
        end
        foreach node r in V_r do
            if pred(r) ⊆ R(S) or succ(r) ⊆ R(S) then
                mark r as reachable
            end
        end
    until no reachable unvisited nodes in V_r
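Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the network is assumed to be given as a dictionary mapping a reaction name to a pair of metabolite sets (pred, succ), and the three toy reactions are hypothetical. As in the text, every reaction is treated as reversible, so it fires as soon as either side is contained in the current scope.

```python
def scope(reactions, seed):
    """Return R(S): all metabolites reachable from the seed set.

    reactions: {name: (pred, succ)} with pred and succ as sets of metabolites.
    seed: iterable of seed metabolites S.
    """
    reached = set(seed)      # R(S) = S
    visited = set()          # reactions already applied
    progress = True
    while progress:          # repeat until no reachable unvisited reaction is left
        progress = False
        for name, (pred, succ) in reactions.items():
            if name in visited:
                continue
            # Reversible reaction: reachable if either side lies within R(S).
            if pred <= reached or succ <= reached:
                visited.add(name)
                reached |= pred | succ   # R(S) = R(S) ∪ pred(r) ∪ succ(r)
                progress = True
    return reached


# Hypothetical toy network: A + B <-> C, C <-> D, E <-> F.
toy = {
    "r1": ({"A", "B"}, {"C"}),
    "r2": ({"C"}, {"D"}),
    "r3": ({"E"}, {"F"}),
}
print(sorted(scope(toy, {"A", "B"})))
print(sorted(scope(toy, {"A"})))
```

In the toy network, the seed {A, B} reaches C and D but not E or F, whereas the seed {A} alone reaches nothing new, because r1 needs both of its substrates. Each pass over the reactions either fires at least one new reaction or terminates the loop, so the number of passes is bounded by the number of reactions.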
In our analysis, the seed, $S$, is chosen uniformly at random from the set of metabolite nodes in a given network $G$. Algorithm 1 is then applied to each of $f = 3000$ sets $S$ of a specified cardinality $c$. In the following, we describe how one can determine the distribution and clustering of scopes for a given cardinality, $c$, of the seed.
2.3. Distribution of scope sizes
Given a species $X$ with a metabolic network represented by $G_X$, let $\Sigma_X$ be the set of all scopes for $f$ randomly chosen sets $S$, such that $c = |S|$. The scope size distribution for $\Sigma_X$ gives the probability, $P_X(s)$, that a scope, randomly chosen from $\Sigma_X$, is of size $s$. The effect of the parameter $c$ on the distribution $P_X(s)$ can be investigated by plotting the curves $P_X(s)$ for different values of $c$. To investigate the (possible) difference in the scope size distribution for several species, the sizes of the scopes are normalized by the number of metabolites in the corresponding network for each species. The scope size distributions of the investigated species are analyzed in Subsection 3.1.
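The empirical scope size distribution can be sketched as below. The sketch is illustrative only: the helper names, the list-of-pairs network format, the fixed random seed, and the six-metabolite toy network are assumptions made for the example, and a compact reversible scope computation is repeated so that the block runs on its own.

```python
import random
from collections import Counter


def scope(reactions, seed):
    """Compact scope computation with all reactions treated as reversible."""
    reached, pending = set(seed), list(reactions)
    progress = True
    while progress:
        progress = False
        for pred, succ in list(pending):
            if pred <= reached or succ <= reached:
                reached |= pred | succ
                pending.remove((pred, succ))
                progress = True
    return reached


def scope_size_distribution(reactions, metabolites, c, f=3000, rng=None):
    """Estimate P_X(s) from f random seeds of cardinality c.

    Returns {normalized scope size: probability}; sizes are divided by the
    number of metabolites so that networks of different size are comparable.
    """
    rng = rng or random.Random(0)   # fixed seed only to make the sketch repeatable
    pool = sorted(metabolites)
    counts = Counter()
    for _ in range(f):
        seed = rng.sample(pool, c)  # uniform sample without replacement
        counts[len(scope(reactions, seed))] += 1
    n = len(pool)
    return {size / n: count / f for size, count in sorted(counts.items())}


# Hypothetical toy network: A + B <-> C, C <-> D, E <-> F.
toy = [(frozenset("AB"), frozenset("C")),
       (frozenset("C"), frozenset("D")),
       (frozenset("E"), frozenset("F"))]
dist = scope_size_distribution(toy, "ABCDEF", c=2, f=500)
for size, prob in dist.items():
    print(f"{size:.2f}  {prob:.3f}")
```

Plotting the returned pairs for several values of c gives one curve per seed size, as in Fig. 1.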
2.4. Clustering of scopes

Existing studies of biosynthetic potential [8, 12] have identified that a large number of metabolites have scopes similar in size and metabolite composition. Here, we investigate this idea by hierarchical clustering for a set of scopes $\Sigma_X$ generated from seeds with cardinality $c$ and a given metabolic network of a species $X$. Hierarchical clustering is based on a given distance (dissimilarity) matrix for the elements of $\Sigma_X$. Similar to [12], we employ the reversed Jaccard index as a distance measure for a pair of scopes, $R(S_i)$ and $R(S_j)$, $|S_i| = |S_j| = c$, $1 \le i, j \le f$. The computation is in the order of $O(f^2)$ for $f$ scopes. For completeness, we give the definition of the Jaccard distance, $J_{R(S_i)R(S_j)}$:

$$J_{R(S_i)R(S_j)} = 1 - \frac{|R(S_i) \cap R(S_j)|}{|R(S_i) \cup R(S_j)|}$$
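The Jaccard distance and the resulting dissimilarity matrix can be sketched as follows. The function names and the three toy scopes are hypothetical; the convention that two empty sets have distance 0 is an assumption of this sketch, since the definition above leaves that case open.

```python
def jaccard_distance(scope_a, scope_b):
    """Reversed Jaccard index: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(scope_a), set(scope_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty scopes are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(scopes):
    """Symmetric matrix of pairwise Jaccard distances; O(f^2) set comparisons."""
    f = len(scopes)
    matrix = [[0.0] * f for _ in range(f)]
    for i in range(f):
        for j in range(i + 1, f):
            matrix[i][j] = matrix[j][i] = jaccard_distance(scopes[i], scopes[j])
    return matrix


# Hypothetical scopes over metabolites A..F.
scopes = [{"A", "B", "C"}, {"B", "C", "D"}, {"E", "F"}]
for row in distance_matrix(scopes):
    print(row)
```

A matrix of this form is the input expected by a hierarchical clustering routine; identical scopes have distance 0 and scopes without any common metabolite have distance 1.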
We investigate the effect of a nearest neighbor group-average clustering algorithm [9]. Nearest neighbor clustering is a bottom-up clustering method in which clusters are joined iteratively in order of increasing distance, starting with clusters composed of single elements (scopes). Group-averaging refers to defining the distance between two clusters as the average over all distances between pairs of the corresponding cluster elements. The output of a hierarchical clustering algorithm is a tree, which can be cut at a given distance between the clusters to retrieve the clusters of scopes. The clusters obtained from a cut at distance $T$ contain all scopes whose mutual distance is not greater than $T$. The results of the clustering of scopes are presented in Subsection 3.2.

2.5. Evaluation of parameter values

To evaluate the influence of the size of the seed, $c$, and the distance, $T$, at which the clustering tree is cut on the quality of the obtained clusters, we use weighted modularity [2], a generalization of the graph cluster quality measure proposed by Newman and Girvan [13]. To apply graph cluster quality measures, one first has to build a graph from a given matrix of dissimilarity indices. Here, we construct a graph $H$ from the dissimilarity matrix by creating a node for each scope and weighting the edges according to the distances between the scopes: let $J$ be the dissimilarity matrix used in the hierarchical clustering. The weighted adjacency matrix $A$ of the graph $H$ is given by $1 - J_{R(S_i)R(S_j)}$, over all pairs $R(S_i)$ and $R(S_j)$ in $\Sigma_X$. The edges of graph $H$ are thus weighted by the similarity of the scopes $R(S_i)$ and $R(S_j)$. Let $C = \{C_1, \ldots, C_p\}$ be the set of scope clusters obtained by cutting the clustering tree at distance $T$. Given a graph $H$, with node set given by the $f$ scopes and weighted edges as defined above, the modularity of $C$ measures the quality of the clustering, or how separated nodes (scopes) from different clusters are from each other. It is defined as:

$$Q_{c,T} = \frac{1}{2m} \sum_{i,j=1}^{f} \left( A_{ij} - \frac{d(i)\,d(j)}{2m} \right) \delta_{ij}$$

where $m = \sum_{ij} A_{ij}$ is the weighted number of edges in $H$, $A_{ij}$ is the element of the adjacency matrix in row $i$ and column $j$, $d(i)$ is the weighted degree of scope $i$ in $H$, $d(j)$ is the weighted degree of scope $j$ in $H$, and $\delta_{ij} = 1$ if $i$ and $j$ are in the same cluster of $C$, and 0 otherwise. With regard to this definition, the modularity measure assesses the closeness of the scopes placed in the same cluster (according to the employed clustering algorithm) and their "distance" from the scopes placed in the other clusters with respect to the weighted adjacency matrix (i.e., the similarity matrix). We investigate the behavior of the cluster quality for different sizes of the seed and different values of the parameter $T$ at which the clustering tree is cut to obtain the set of clusters $C$ (see Subsection 3.3).

3. Results
Here, we analyze and compare the scope size distributions, cluster agglomeration, and weighted modularities of scope clusters obtained from the networks of the four investigated species. The scope size distributions and cluster agglomeration reveal characteristic features of the networks, while the weighted modularities determined for different values of cut-off and seed size allow us to assess systematically and quantitatively the relative influence of these parameters on the clustering.
3.1. Scope size distributions
Analyses of the scope concept have already shown that metabolites exhibit different biosynthetic potentials, i.e., the number of reachable metabolites strongly depends on the composition of the seed [3]. Therefore, we use the size of the scope to quantitatively characterize the biosynthetic potential of the seed metabolites in a given metabolic network. To this end, we empirically determine the size distributions of scopes resulting from the four investigated species (see Fig. 1). To make the distributions comparable, the scope sizes were normalized by the size of the network, and the counts of scopes were turned into a probability distribution (see Subsection 2.3 for details).
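The normalization described above can be illustrated with a minimal sketch. The function name `scope_size_distribution`, the number of bins, and the example scope sizes are assumptions made for illustration only; the binning actually used is the one described in Subsection 2.3.

```python
from collections import Counter

def scope_size_distribution(scope_sizes, network_size, n_bins=20):
    """Return {bin_index: probability} for scope sizes normalized by network size.

    Bin b covers normalized sizes in [b / n_bins, (b + 1) / n_bins).
    The binning is a hypothetical choice, not taken from the paper.
    """
    if network_size <= 0 or not scope_sizes:
        raise ValueError("need a positive network size and at least one scope")
    counts = Counter()
    for size in scope_sizes:
        # Integer arithmetic avoids floating-point edge effects at bin borders;
        # a scope covering the whole network falls into the last bin.
        bin_index = min(size * n_bins // network_size, n_bins - 1)
        counts[bin_index] += 1
    total = len(scope_sizes)
    # Turn counts into a probability distribution.
    return {b: n / total for b, n in sorted(counts.items())}

# Hypothetical example: five scopes in a network of 100 metabolites.
dist = scope_size_distribution([4, 4, 5, 38, 60], network_size=100, n_bins=10)
```

In this toy example, three of the five scopes fall into the lowest bin, mirroring in miniature the high peak at very small scope sizes.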
[Figure 1: scope size distributions (x-axis: scope size, normalized) for seed sizes 4, 14, and 24; panels (a) Arabidopsis thaliana, (b) E. coli, (c) Saccharomyces cerevisiae, (d) Buchnera aphidicola.]
Fig. 1. Scope size distributions of (a) Arabidopsis, (b) E. coli, (c) Yeast and (d) Buchnera, normalized by the number of metabolites in the corresponding network. The distributions are shown for seed sizes 4 (red), 14 (blue), and 24 (yellow). The highest frequencies for seed size 4 are excluded for clarity: $P_{Arabidopsis}(4) = 0.38$, $P_{E. coli}(4) = 0.35$, $P_{Yeast}(4) = 0.39$, and $P_{Buchnera}(4) = 0.38$.
We observe that with small seeds of four metabolites, the scope size distributions of all investigated networks share a high peak for very small scope sizes, indicating that a large number of seeds exhibit a very low biosynthetic potential. The remaining large isolated peaks in the networks of Arabidopsis (Fig. 1a) and E. coli (Fig. 1b) correspond to characteristic scopes reachable from a relatively large number of different seeds. These characteristic scopes correspond to large subnetworks with a high degree of mutually reachable metabolites, which we refer to as scope communities: if the seed contains metabolites from within such a scope community, then there is a high probability of reaching all the metabolites within the community. In addition, a scope community is self-contained in the sense that metabolites outside of the community can only be reached if the seed contains certain metabolites that are also outside of the community. Note that although one characteristic peak may correspond to several such scope communities with a similar scope size, this is not observed in the networks of Arabidopsis and E. coli. Instead, the subsequent clustering reveals that scopes pertaining to the same characteristic peak are agglomerated into one cluster at a merging distance not greater than 0.2. Furthermore, the relatively large sizes of the communities (approx. 35%, 46%, and 60% of the network size in Arabidopsis, see Fig. 1a, and approx. 38% and 45% in E. coli, see Fig. 1b) suggest that the smaller scope communities form subsets of the larger ones and, thus, exhibit a hierarchical arrangement, as identified by Matthäus et al. [12].
By increasing the seed size, the probability of reaching any particular metabolite increases, and, therefore, one obtains larger scopes. In particular, we observe that for all networks the fraction of small scopes decreases, while the overall scope sizes increase. For the more complex networks of Arabidopsis and E. coli, we observe that the centers of the large peaks shift towards larger scope sizes. This demonstrates that seeds containing metabolites from within a scope community now frequently contain additional metabolites from outside of the community, which account for a small increase of the scope size. Moreover, seeds containing no metabolites from within a scope community continue to have a small scope, regardless of the increased seed size.
Consequently, scope communities in the more complex networks represent an outstanding feature that is robust with respect to the seed size. In contrast to these findings, an increase of the seed size in the smaller networks of Yeast (Fig. 1c) and Buchnera (Fig. 1d) results in more evenly distributed scope sizes. This observation suggests that scope communities do not exist or are less pronounced compared to the cases of Arabidopsis and E. coli. For these two species, there are many scopes containing a distinct fraction of metabolites in the network. Finally, while the scope size distributions of Arabidopsis and E. coli are easily distinguishable by the frequency, relative scope size, and number of scope communities, this is not the case for Yeast and Buchnera.
3.2. Cluster agglomeration
The dissimilarity matrix serves as the basis for the clustering described in Subsection 2.4. During the clustering process, scopes are agglomerated into clusters, starting with the most similar. At a merging distance of 0, every scope forms an individual cluster, so that the number of clusters equals the number of scopes f, i.e., 3000. The number of clusters monotonically decreases with an increasing merging distance, until, at a distance of 1, all scopes form a single cluster. The number of clusters obtained at a certain merging distance provides information on the overall mutual similarities between scopes. In the case of many highly similar scopes, a small number of clusters will be obtained for a small merging distance, while the opposite holds for the case of many dissimilar scopes. For instance, if at a distance of 0.5 the number of clusters is half the number of scopes, then more than half of the scopes have a mutual distance of at most 0.5; therefore, more than half of the scopes share at least two thirds of their metabolites with another scope (cf. Subsection 2.4).
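The two-thirds figure can be checked with a short derivation. It rests on the Jaccard distance mentioned in the caption of Fig. 2 and on a simplifying assumption made here only for illustration: both scopes have the same size n and share k metabolites, so that their union contains 2n - k metabolites.

```latex
d_J = 1 - \frac{k}{2n - k} \le \frac{1}{2}
\;\Longleftrightarrow\; \frac{k}{2n - k} \ge \frac{1}{2}
\;\Longleftrightarrow\; 3k \ge 2n
\;\Longleftrightarrow\; \frac{k}{n} \ge \frac{2}{3}
```

For scopes of unequal size the bound applies to the larger scope only in weakened form, so the statement is best read as a statement about scopes of comparable size.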
[Figure 2: fraction of clusters plotted over the merging distance (0.0 to 1.0) for seed sizes 4, 14, and 24; panels (a) Arabidopsis thaliana, (b) E. coli, (c) Saccharomyces cerevisiae, (d) Buchnera aphidicola.]
Fig. 2. Frequency of observed clusters over the merging distance for (a) Arabidopsis, (b) E. coli, (c) Yeast and (d) Buchnera. While steps appear in the frequencies for seed size of 4 (solid line) as a consequence of numerical effects of the Jaccard distance, the shapes appear continuous for seed sizes of 14 (dashed line) and 24 (dotted line). Furthermore, the overall mutual distances of scopes decrease when increasing the seed size, resulting in a smaller fraction of clusters at a particular merging distance.
As shown in Fig. 2, the mutual similarities of scopes differ significantly between seed sizes. As a trend, the number of clusters obtained at a certain merging distance is reduced with the increase of the seed size, demonstrating that more similar scopes result from a larger seed size. This conforms to intuition, as larger seeds result in larger scopes with a higher probability of sharing common metabolites. While the agglomeration curves from seed sizes 14 and 24 appear continuous, steps appear in the curves from seed size 4. For the latter curves, a large number of scopes is agglomerated into clusters at certain distances. For Arabidopsis (Fig. 2a) and E. coli (Fig. 2b), there are large steps of more than 160 scopes at characteristic distances of 2/3 and 3/4, and steps of more than 530 scopes at a distance of 6/7. In Yeast (Fig. 2c) and Buchnera (Fig. 2d), there are steps of more than 300 scopes at distances of 2/3 and 6/7. These are numerical effects of the Jaccard distance, which provides a discrete number of possible dissimilarity values, decreasing with smaller cardinalities of the compared entities. When using a small seed size, the fraction of small scopes is very large (cf. Subsection 3.1). Consequently, for a large number of scopes there is a small number of possible distances to consider. For instance, at a distance of 2/3, all scopes of size four with two metabolites in common are merged, and all scopes of size six with three metabolites in common, and so on. With many small scopes, these characteristic distances occur more frequently, leading to the observed steps.
For the clustering of Arabidopsis and E. coli with seed sizes of 14 and 24, a significant fraction of scopes is agglomerated with a merging distance of less than 0.1. This indicates that there are many scopes with a high mutual similarity. In contrast, this does not hold for Yeast and Buchnera, where the range of similarities between scopes is more uniformly distributed and, thus, results in cluster agglomerations at higher distances. Again, there are significant differences between the calculated scopes of Arabidopsis and E. coli on the one hand, and Yeast and Buchnera on the other hand.
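The discreteness of the Jaccard distance for small scopes can be made concrete with a small sketch. The helper names `jaccard_distance` and `possible_distances` and the metabolite labels are hypothetical and serve only to reproduce the arithmetic of the example above.

```python
from fractions import Fraction

def jaccard_distance(a, b):
    """Jaccard distance 1 - |a & b| / |a | b|, returned as an exact fraction."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return Fraction(0)  # convention chosen here for two empty sets
    return 1 - Fraction(len(a & b), len(union))

# Two scopes of size four sharing two metabolites (labels are made up).
d4 = jaccard_distance({"m1", "m2", "m3", "m4"}, {"m3", "m4", "m5", "m6"})
# Two scopes of size six sharing three metabolites.
d6 = jaccard_distance(set(range(6)), set(range(3, 9)))

def possible_distances(max_size):
    """All distinct Jaccard distances between two non-empty sets of size <= max_size."""
    values = set()
    for size_a in range(1, max_size + 1):
        for size_b in range(1, max_size + 1):
            for shared in range(min(size_a, size_b) + 1):
                union = size_a + size_b - shared
                values.add(1 - Fraction(shared, union))
    return sorted(values)
```

Both example pairs yield the distance 2/3, and `possible_distances` returns only a handful of values for small sets, which is why many small scopes are merged at the same few characteristic distances.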
3.3. Influence of cut-off and seed size
Due to the observed large impact of the employed seed size and cut-off on the calculated scopes and the resulting clustering, we evaluate the influence of these parameters on the quality of the clustering. In particular, we are interested in those parameter values that yield clusters of highest weighted modularity. Moreover, a thorough investigation of the parameter space may provide insights into the presented approach of scope clustering. We determine scopes from random seeds as described in Subsection 2.2 for seed sizes $2 \le c \le 25$. For each set of scopes resulting from a given network and seed size, we perform the clustering of scopes as described in Subsection 2.4. Finally, we cut the obtained cluster trees at cut-off distances $0.05 \le \tau \le 1$ with a step size of 0.05, and determine the weighted modularities of the resulting sets of clusters, as defined in Subsection 2.5.

[Figure 3: heatmaps of weighted modularity over seed size and cut-off, each with a color key and value histogram; panels (a) Arabidopsis thaliana, (b) E. coli, (c) Saccharomyces cerevisiae, (d) Buchnera aphidicola.]
Fig. 3. Heatmaps of the weighted modularities using different seed sizes and cut-offs, for (a) Arabidopsis, (b) E. coli, (c) Yeast and (d) Buchnera. Histograms of the obtained values are shown in the top-left corners. The best cluster qualities are obtained using a seed size of 2 and cut-off 0.7 for E. coli, and seed size 2 and cut-off 0.95 for the other species.

In Fig. 3, the resulting matrices of weighted modularities for different parameter values are shown as heatmaps with corresponding value histograms. The lowest weighted modularities for all species are slightly below 0 and correspond to a cut-off distance of $\tau = 1$. This supports the intuition that a low value for the modularity should be obtained from an apparently poor clustering. The highest values differ between species: for Arabidopsis (Fig. 3a) $Q_{c=2,\tau=0.95} \approx 0.43$, for E. coli (Fig. 3b) $Q_{c=2,\tau=0.7} \approx 0.31$, for Yeast (Fig. 3c) $Q_{c=2,\tau=0.95} \approx 0.41$, and for Buchnera (Fig. 3d) $Q_{c=2,\tau=0.95} \approx 0.43$. However, these maxima correspond to identical parameters of $c = 2$ and $\tau = 0.95$ in Arabidopsis, Yeast and Buchnera, while the modularity obtained from the same parameters in E. coli is $Q_{c=2,\tau=0.95} \approx 0.18$.
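The modularity values reported here follow the definition in Subsection 2.5. A minimal sketch of that computation is given below; it is not the authors' implementation. The function name `weighted_modularity` and the toy matrix are hypothetical, and the sketch assumes the common convention that m counts each undirected edge once, so that 2m equals the sum of all weighted degrees.

```python
def weighted_modularity(adjacency, labels):
    """Q = 1/(2m) * sum_ij (A_ij - d(i) d(j) / (2m)) * delta_ij.

    adjacency: symmetric list-of-lists similarity matrix A.
    labels:    cluster label of each node; delta_ij = 1 iff labels match.
    Assumption: 2m is the sum of all weighted degrees.
    """
    n = len(adjacency)
    if n != len(labels):
        raise ValueError("one label per node is required")
    degrees = [sum(row) for row in adjacency]  # weighted degrees d(i)
    two_m = sum(degrees)
    if two_m == 0:
        return 0.0  # graph without edge weight: no structure to score
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adjacency[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m

# Toy graph: two pairs of mutually similar scopes, no similarity across pairs.
A = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = weighted_modularity(A, [0, 0, 1, 1])   # clusters match the pairs
q_single = weighted_modularity(A, [0, 0, 0, 0])  # everything in one cluster
```

In this sketch the single-cluster partition scores 0, in line with the observation that a cut-off of 1, which merges all scopes, produces the lowest modularities.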
The evaluation of parameters indicates that the best clustering is achieved for a small seed size of $c = 2$ and a very high cut-off of $\tau = 0.95$ for all species except E. coli, for which $\tau = 0.7$ results in the highest cluster quality. The preference for small seeds demonstrates that small sets of metabolites can be well classified into distinct groups according to their biosynthetic potential using the concept of scopes. On the other hand, our analysis suggests that scopes from more complex seed compositions are harder to classify. Furthermore, the selection of a cut-off value $\tau = 0.95$ indicates that a small number of large clusters, containing scopes up to a very high distance, is preferred. Hence, the arrangement of scopes from small seeds into a few very coarse groups results in the highest separation of clusters.
4. Discussion
Characterizing the biosynthetic potential by only employing the structure of metabolic networks offers a means for comparing and contrasting different species. Here, we investigated to what extent the approach proposed in [12] could be extended to determining scope clusterings and metabolite hierarchies in species-specific networks. To this end, we performed a comprehensive sensitivity analysis of the approach, which depends on the size and composition of random seeds and the cut-off distance for extracting clusters of scopes. The analysis furthermore includes the effect of the size and composition of random seeds on the scope size distributions in the four investigated species. The findings related to the scope size distributions conform to the existing results on species-specific networks [4] as well as the network comprising all reactions from KEGG [12], i.e., a large number of seeds exhibit a small biosynthetic potential. Accordingly, we observe characteristic scope sizes corresponding to scope communities for Arabidopsis and E. coli, which indicates the existence of consensus scopes and supports their hierarchical arrangement. This argument can be further strengthened by our findings regarding the scope size distributions of Arabidopsis and E. coli: with an increase of seed sizes, the overall scope sizes increase, while the scope community structure is preserved. The results from the agglomerative clustering performed on the scopes of 3000 randomly chosen seeds of different sizes suggest a plateau for the fraction of clusters at a merging distance around 0.2, i.e., no significant number of scopes is agglomerated at distances close to 0.2. This is typically pronounced for seeds of larger sizes from the networks of Arabidopsis and E. coli. We point out that the plateau phenomenon was already observed elsewhere [12] and was used as a principle for choosing a threshold in the extraction of scope clusters and the resulting metabolite hierarchies.
However, our analysis warrants caution when extending these observations to the networks of Yeast and Buchnera: while Arabidopsis and E. coli are organisms with complex metabolic networks, the opposite holds for Buchnera. Although Yeast is a generalist model organism with complex metabolic functions, its scope size distribution does not exhibit characteristic peaks and, therefore, does not contain any distinct scope communities. Likewise, there is no plateau observed in the cluster agglomeration of Yeast. The observed differences in scope sizes and clustering between Arabidopsis and E. coli on the one hand, and Yeast and Buchnera on the other hand, may be due to either differing qualities in the curation of the networks or a possible realistic difference in the biosynthetic potential of these species.
To further assess the quality of scope clusterings, we applied a generalization of the modularity measure. While for certain values of parameters (i.e., cut-off distance and seed size) we obtained relatively high modularities for the respective scope clustering, the observed values have significantly different implications: the highest value for the modularity in the investigated species was obtained at cut-off distances of 0.95 and 0.7, corresponding to a small number of clusters comprising scopes with a wide range of similarities. Moreover, for most cut-off distances, the highest modularity is reached for small seed sizes (c = 2), suggesting that the cluster agglomeration may be highly dependent on the discretization capacity of the employed Jaccard distance. We point out that the same empirical analysis was performed and comparable results were obtained using the Manhattan distance as a (dis)similarity measure. Therefore, we conclude that the method for extracting scope clusters and metabolite hierarchies may be most appropriate for large scope sizes, most likely resulting from large seed sizes and complex networks, for which both the plateau principle and the observed scope communities are clearly pronounced.
To conclude, we identified features based on the concept of scopes, which allow for a structural comparison of different species, and indicate the existence of consensus scopes and metabolite hierarchies in Arabidopsis and E. coli.
In addition, our sensitivity analysis revealed a strong influence of the evaluated parameter values on the quality of clustering. Future research may aim at characterizing the scope communities via their metabolite compositions and hierarchical organization, and extending the analysis to additional organisms.
References
[1] Barabasi, A. L. and Albert, R., Emergence of scaling in random networks. Science, 286:509-512, 1999.
[2] Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z., and Wagner, D., On modularity clustering. IEEE Trans. Knowl. Data Eng., 20(2):172-188, 2008.
[3] Ebenhöh, O., Handorf, T., and Heinrich, R., Structural analysis of expanding metabolic networks. Genome Informatics, 15:35-45, 2004.
[4] Ebenhöh, O., Handorf, T., and Heinrich, R., A cross species comparison of metabolic network functions. Genome Informatics, 16(1):203-213, 2005.
[5] Edwards, J. S. and Palsson, B. O., The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc Natl Acad Sci USA, 97:5528-5533, 2000.
[6] Feist, A. M., Henry, C. S., Reed, J. L., Krummenacker, M., Joyce, A. R., Karp, P. D., Broadbelt, L. J., Hatzimanikatis, V., and Palsson, B. O., A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Mol Syst Biol., 3(121), 2007.
[7] Handorf, T., Ebenhöh, O., and Heinrich, R., Expanding metabolic networks: Scopes of compounds, robustness, and evolution. J. Mol. Evol., 61:498-512, 2005.
[8] Handorf, T., Ebenhöh, O., Kahn, D., and Heinrich, R., Hierarchy of metabolic compounds based on their synthesizing capacity. IEE Proc. Systems Biology, 153(5):359-363, 2006.
[9] Hastie, T., Tibshirani, R., and Friedman, J., The elements of statistical learning: Data mining, inference and prediction. Springer, New York, 2001.
[10] Kanehisa, M., Goto, S., Hattori, M., Kinoshita, K. F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M., From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34:D354-357, 2006.
[11] Karp, P. D., Ouzounis, C. A., Moore-Kochlacs, C., Goldovsky, L., Kaipa, P., Ahren, D., Tsoka, S., Darzentas, N., Kunin, V., and Lopez-Bigas, N., Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Research, 19:6083-6089, 2005.
[12] Matthäus, F., Salazar, C., and Ebenhöh, O., Biosynthetic potentials of metabolites and their hierarchical organization. PLoS Comput Biol, 4(4):e1000049, Apr 2008.
[13] Newman, M. E. J. and Girvan, M., Finding and evaluating community structure in networks. Physical Review E, 69(026113), 2004.
[14] Price, N. D., Reed, J. L., Papin, J. A., Wiback, S. J., and Palsson, B. O., Network-based analysis of metabolic regulation in the human red blood cell. Journal of Theoretical Biology, 225:185-194, 2003.
[15] Rapoport, T. A., Heinrich, R., and Rapoport, S. M., The regulatory principles of glycolysis in erythrocytes in vivo and in vitro. A minimal comprehensive model describing steady states, quasi-steady states and time-dependent processes. Biochem J, 154(2):449-469, Feb 1976.
[16] Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., and Schomburg, D., BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research, 32:D431-D433, 2004.
[17] Schuster, S. and Hilgetag, C., On elementary flux modes in biochemical reaction systems at steady state. J. Biol. Syst., 2:165-182, 1994.
[18] Strogatz, S. H., Exploring complex networks. Nature, 410:268-276, 2001.
[19] Varma, A. and Palsson, B. O., Metabolic flux balancing: basic concepts, scientific and practical use. Bio/Technology, 12:994-998, 1994.
[20] Wagner, A. and Fell, D. A., The small world inside large metabolic networks. Proc. R. Soc. Lond. B, 268:1803-1810, 2001.
GENERALIZED REACTION PATTERNS FOR PREDICTION OF UNKNOWN ENZYMATIC REACTIONS
YUGO SHIMIZU [email protected]
MASAHIRO HATTORI [email protected]
SUSUMU GOTO [email protected]
MINORU KANEHISA [email protected]
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
Prediction of unknown enzymatic reactions is useful for understanding biological processes such as reactions to external substances like endocrine disrupters. To make accurate predictions, we need to define a similarity measure between reactions. We have developed the KEGG RPAIR database, which is a collection of chemical structure transformation patterns, called RDM patterns, for substrate-product pairs of enzymatic reactions. In this study, we compared RDM patterns with EC numbers, which are the well-known hierarchical classification scheme for enzymes. Additionally, we performed hierarchical clustering of RDM patterns using the information stating whether each sub-subclass of EC has a particular RDM pattern or not. To represent the variation of RDM patterns in a cluster, we generalized RDM patterns in the same cluster using the hierarchy of KEGG Atomtypes, which are the components of RDM patterns. Using this generalized pattern, we can predict which cluster includes a given RDM pattern even if the reaction of the pattern has not been assigned any EC numbers. Thus we will be able to define the similarity between enzymatic reactions by using this cluster information.
Keywords: EC number; KEGG RPAIR; classification of enzymes; enzymatic reaction
1. Introduction
Recently, a large amount of biochemical information as well as genomic and chemical information has become available [5, 6]. For example, the KEGG LIGAND database provides extensive information about biochemical small molecules, biochemical reactions, enzymes, glycans, and drugs [1, 12]. Enzymes are proteins that catalyze biochemical reactions; however, there are many enzymes whose functions have yet to be unveiled. This results in missing enzymes in metabolic pathways, and many unknown reactions remain to be characterized. Thus, the computational prediction of unknown enzymatic reactions may be useful for understanding biological processes such as xenobiotics biodegradation, i.e., reactions to external substances like endocrine disrupters [2, 7, 8]. To improve the accuracy of prediction, we need to better systematize the reaction mechanisms of known enzymatic activities and to define an appropriate measure of similarity among the enzymatic reactions for further analysis. To achieve these objectives, we performed comprehensive analyses using the EC classification and the KEGG RPAIR database.
Y. Shimizu et al.
The EC (Enzyme Commission) number is a well-known classification scheme for enzymes [9, 11]. In the EC classification, enzymes are hierarchically classified by types of catalyzed reactions and their substrates and products. Each EC number consists of the letters "EC" followed by four numbers separated by periods (e.g. EC 1.1.1.1). The first, second, and third numbers are called class, subclass, and sub-subclass, respectively. The fourth number represents the substrate specificity. The EC numbers have been utilized for many computational applications such as classification or prediction of enzymatic reactions. However, there are also some problems in the EC classification. The EC numbers are assigned manually, based on published experimental data, by the IUPAC-IUBMB Joint Commission on Biochemical Nomenclature. This requirement of published articles leaves many reactions unclassified. Additionally, the structural transformation between a single pair of compounds is unclear, since EC represents the relationships between multiple substrates and multiple products. In order to avoid these problems, we have developed the KEGG RPAIR database, which is a collection of chemical structure transformation patterns, called RDM patterns, for every substrate-product pair of enzymatic reactions [4]. In this study, we compared the RDM patterns with EC numbers and performed hierarchical clustering of the RDM patterns using the information whether each sub-subclass of EC has the RDM pattern or not. To represent the variation of the RDM patterns in a cluster, we introduced the generalized RDM patterns in the same cluster using the hierarchy of KEGG Atomtypes, which are the components of the RDM patterns.
2. Materials and Methods
2.1. KEGG LIGAND database
KEGG LIGAND is a composite database which contains various databases about biochemical compounds. In this study we have used ENZYME, REACTION, and RPAIR from the KEGG LIGAND database (as of 2008/05/13). ENZYME (4976 entries) is a database of EC numbers and contains names of enzymes, catalyzed reactions, genes, as well as other types of information. REACTION (7567 entries) is a database of all biochemical reactions that are included in ENZYME or appear on KEGG metabolic pathways. RPAIR (8706 entries) is a database of chemical structure transformation patterns, called RDM patterns, for every substrate-product pair (reactant pair) in REACTION.
2.2. RDM pattern
2.2.1. KEGG RPAIR database
Each entry in RPAIR contains the alignment of atoms between the substrate-product pairs and the structural transformation pattern called RDM pattern. In general, one enzymatic reaction contains multiple substrates and multiple products, which result in multiple pairs. Each pair of chemical compounds should be distinguished by its biochemical role in the reaction, and the RPAIR database provides five types of such roles with the annotated labels "main", "cofac", "leave", "ligase" and "trans", which are exemplified in Figure 1. In this study, to reduce the noise of poorly characterized pairs, we used only the main type, which corresponds to a major component of pairs in each reaction.
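[Figure 1: two example reactions with the types assigned to their substrate-product pairs. Left: AB + H2O <=> AH + BOH, with pairs labeled main, main, and leave. Right: AB + C <=> A + BC, with pairs labeled main, main, and trans.]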
Fig. 1. Examples of substrate-product pairs and their assigned types. In the left example, both the pair (AB, AH) and the pair (AB, BOH) are classified as the main type and the other pair (H2O, BOH) is defined as the leave type. In the right example, there are also two main types and the trans type is assigned to the last pair. In all cases, the hydrogen atoms are not considered.
Table 1. Definition of KEGG Atomtypes (extracted from http://www.genome.jp/kegg/reaction/KCF.html).

Atom  Atom class  Description  Atomtype  Description
C     C1          alkane       C1a       R-CH3
                               C1b       R-CH2-R
                               C1c       R-CH(-R)-R
                               C1d       R-C(-R)2-R
                               C1x       ring-CH2-ring
                               C1y       ring-CH(-R)-ring
                               C1z       ring-C(-R)2-ring
      C2          alkene       C2a       R=CH2
                               C2b       R=CH-R
                               C2c       R=C(-R)2
                               C2x       ring-CH=ring
                               C2y       ring-C(-R)=ring or ring-C(=R)-ring
O     O1          single bond  O1a       R-OH
                               O1b       N-OH
                               O1c       P-OH
                               O1d       S-OH
      O2                       O2a       R-O-R
                               O2b       P-O-R
                               O2c       P-O-P
                               O2x       ring-O-ring

2.2.2. KEGG Atomtype
In KEGG RPAIR, all atoms are represented by KEGG Atomtypes, which have been hierarchically defined by the physicochemical environment of atoms. Mostly, atomtypes are represented as three-letter codes as shown in Table 1. The first letter indicates the atomic species, the second indicates information about the atomic bonds, and the third indicates the information of the substituted groups. In particular, the second level of hierarchy in KEGG Atomtypes is called the atom class. For example, "C" is the carbon atom itself, the atom class "C1" represents the carbon atom observed in alkanes, and the atomtype "C1a" represents the carbon atom which connects to another carbon atom and three hydrogen atoms. There are 68 atomtypes in the RPAIR database, and a portion of them is shown in Table 1.
2.2.3. RDM pattern
An RDM pattern is defined as a set of KEGG Atomtype changes at the reaction center (R), the difference region (D), and the matched region (M) for each reactant pair (Fig. 2). R atoms are boundary atoms between the matched regions and the unmatched regions. D atoms are next to the reaction center (R atoms) in the unmatched regions. M atoms are adjacent to the R atoms in the matched regions. In most cases R, D, and M atoms are all single pairs and the RDM pattern is represented as "R1-R2:D1-D2:M1-M2" (Fig. 2). Multiple pairs in D or M atoms can be considered and are represented by concatenating all atomtypes using "+", and multiple pairs in R are represented by multiple RDM patterns in which R atoms are a single pair. The asterisk "*" in the RDM patterns indicates that there is no atom or that it is only a hydrogen atom. The structural transformation between a single pair of compounds is now clear, since each entry of the RPAIR database is a binary pair. In addition, the RDM pattern represents the transformational pattern around the reaction center. Hence it can be assumed that the RDM patterns basically reflect the reaction mechanism at the site where each enzyme catalyzes. RDM patterns are first generated computationally by the chemical structure comparison program SIMCOMP, followed by manual curation [3]. There were 2401 kinds of RDM patterns in RPAIR.
[Figure 2: chemical structures of a substrate-product pair annotated with KEGG Atomtypes (N1b, N1a, C1a, C5a, O5a) and the resulting RDM pattern N1a-N1b:*-C5a:C1c-C1c.]
Fig. 2. Example of a substrate-product pair and its RDM pattern. The red colored atoms (N1a and N1b at the boundary of the dashed line) are R, the blue and the yellow atoms are D (C5a connected to N1b) and M (C1b connected to R atoms), respectively. The rest of the matched region is depicted in green.
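The string notation of RDM patterns and the three-level hierarchy of KEGG Atomtypes lend themselves to a small parsing sketch. The function names `atomtype_levels` and `parse_rdm` are hypothetical, and the sketch assumes that multiple atomtypes joined by "+" appear on each side of the "-" within the D or M part; it is not a KEGG API.

```python
def atomtype_levels(atomtype):
    """Split an atomtype such as 'C1a' into (atom species, atom class, atomtype)."""
    if atomtype == "*":
        # '*' means no atom, or only a hydrogen atom.
        return ("*", "*", "*")
    return (atomtype[:1], atomtype[:2], atomtype[:3])

def parse_rdm(pattern):
    """Parse 'R1-R2:D1-D2:M1-M2' into {'R': ..., 'D': ..., 'M': ...}.

    Each value is a pair (substrate side, product side) of atomtype lists.
    """
    parts = pattern.split(":")
    if len(parts) != 3:
        raise ValueError("expected three ':'-separated parts (R, D, M)")
    result = {}
    for region, part in zip("RDM", parts):
        sides = part.split("-")
        if len(sides) != 2:
            raise ValueError("expected 'X-Y' in region " + region)
        # '+' concatenates multiple atomtypes on one side (assumed layout).
        result[region] = tuple(side.split("+") for side in sides)
    return result

# The pattern shown in Fig. 2.
rdm = parse_rdm("N1a-N1b:*-C5a:C1c-C1c")
levels = atomtype_levels("C1a")
```

Keeping each side as a list means that single pairs and "+"-joined multiple pairs are handled uniformly by later steps such as generalization along the atomtype hierarchy.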
2.3. EC-RDM dot matrix
All EC numbers and corresponding RDM patterns were extracted from the databases. Then, the EC-RDM dot matrix was created to give an overview of the relationship between the EC classification and the RDM patterns. The rows of the matrix correspond to sub-subclasses of EC numbers, and the columns of the matrix correspond to the RDM patterns. The characteristic relationship between EC sub-subclasses and RDM patterns in the matrix is shown in the Results section.
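The construction of such a dot matrix can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the input is assumed to be a list of (EC number, RDM pattern) pairs, and the three example assignments are made up for demonstration (the patterns themselves are taken from this paper).

```python
from collections import defaultdict


def ec_sub_subclass(ec_number):
    """Truncate a four-field EC number such as '2.4.1.17' to its sub-subclass '2.4.1'."""
    return ".".join(ec_number.split(".")[:3])


def build_dot_matrix(pairs):
    """Build a binary (EC sub-subclass x RDM pattern) matrix.

    pairs: iterable of (ec_number, rdm_pattern) tuples (assumed input format).
    Returns (row_labels, column_labels, matrix), where matrix[r][c] is 1 if
    the sub-subclass in row r has the RDM pattern in column c, else 0.
    Labels are sorted as plain strings, which is enough for a sketch.
    """
    present = defaultdict(set)
    for ec_number, rdm in pairs:
        present[ec_sub_subclass(ec_number)].add(rdm)
    rows = sorted(present)
    cols = sorted({rdm for patterns in present.values() for rdm in patterns})
    matrix = [[1 if c in present[r] else 0 for c in cols] for r in rows]
    return rows, cols, matrix


# Hypothetical EC-to-pattern assignments, for illustration only.
pairs = [
    ("2.4.1.1", "O1a-O2a:*-C1y:C1b-C1b"),
    ("3.2.1.4", "O1a-O2a:*-C1y:C1b-C1b"),
    ("3.6.1.3", "O1c-O2c:*-P1b:P1b-P1b"),
]
rows, cols, matrix = build_dot_matrix(pairs)
for label, row in zip(rows, matrix):
    print(label, row)
```

Each column of the resulting matrix is the bit vector of one RDM pattern over the EC sub-subclasses, which is the representation used for clustering in the next section.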
2.4. Hierarchical clustering
After obtaining the EC-RDM dot matrix, we performed a hierarchical clustering of the RDM patterns, using the information on whether each EC sub-subclass has a particular RDM pattern or not. The distance (D) between two RDM patterns (RDM1 and RDM2) can be formulated as follows:

$$D(\mathrm{RDM}_1, \mathrm{RDM}_2) = 1 - T_c\bigl(V(\mathrm{RDM}_1), V(\mathrm{RDM}_2)\bigr) \tag{1}$$

where V(RDM1) and V(RDM2) are the respective bit vectors of the RDM patterns RDM1 and RDM2, and each element of a vector corresponds to the existence (1) or nonexistence (0) of each sub-subclass of EC. Tc indicates the Tanimoto coefficient, which is defined as follows:

$$T_c(X, Y) = \frac{\text{number of bits where } x_i = 1 \text{ and } y_i = 1}{\text{number of bits where } x_i = 1 \text{ or } y_i = 1} \tag{2}$$

where {x_i} and {y_i} are bit vectors [10]. We used the average linkage method for the hierarchical clustering.
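As a rough illustration of this clustering step, the sketch below computes the Tanimoto coefficient of two bit vectors, takes the distance to be one minus that coefficient (an assumption of this sketch, following the usual convention), and runs a naive average-linkage agglomeration. The function names and the four toy bit vectors are hypothetical; a real analysis would more likely rely on an established clustering library.

```python
def tanimoto(x, y):
    """Tanimoto coefficient of two equal-length 0/1 vectors."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Two all-zero vectors have an undefined coefficient; 0.0 is an arbitrary choice here.
    return both / either if either else 0.0


def distance(x, y):
    """Distance assumed to be 1 - Tc, so identical vectors are at distance 0."""
    return 1.0 - tanimoto(x, y)


def average_linkage(vectors):
    """Naive agglomerative clustering with average linkage.

    Returns the list of merges as (members_a, members_b, distance), where
    members are tuples of indices into `vectors`.
    """
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                total = sum(distance(vectors[a], vectors[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy bit vectors over four hypothetical EC sub-subclasses.
vectors = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
merges = average_linkage(vectors)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The two identical vectors merge first at distance 0, which is the situation referred to later when clusters are said to have a distance equal to 0.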
2.5. Generalization of RDM patterns
Using the cluster information obtained in the above section, we constructed generalized RDM patterns to represent the variation of the RDM patterns in each cluster. We implemented an algorithm that compares the character strings of the RDM patterns in the same cluster to generate their generalized pattern. In this generalization process, the hierarchy of KEGG Atomtypes (atom species, atom class, and atomtype) is used. The detailed procedure of generalizing two RDM patterns, RDM1 and RDM2, is described as follows:
Step 1: All possible representations of RDM1 are generated and stored in {RDM1}.
Step 2: The following procedures (2-2) are performed for each RDM1i of the set {RDM1}.
Step 2-2: RDM1i is separated into R1i, D1i, and M1i. RDM2 is also separated into R2, D2, and M2. Then R1i and R2, D1i and D2, and M1i and M2 are compared, respectively. When multiple atoms are incorporated into a D or M representation, they are compared at the corresponding positions of atoms. That is, when comparing D1i (= D^1_1i + D^2_1i) with D2 (= D^1_2 + D^2_2), the comparison is done between D^1_1i and D^1_2 and between D^2_1i and D^2_2.
Step 3: The best-matched case is selected and the generalized pattern is generated.
The priority of matching the atom representations when comparing KEGG Atomtypes in Step 2-2 is shown in Table 2. Generalized patterns are made according to the following conditions. Examples of generalization are also shown in Table 2 and Fig. 3.
i) The parts which have a complete match in Step 2-2 are output directly.
ii) The parts which match at the atom class level or atom species level in Step 2-2 are substituted by the atom class or atom species, respectively.
iii) The parts which have no match in Step 2-2 are substituted by both components, separated by a comma and enclosed in parentheses.

Table 2. Definition of the priority in the atomtype comparison and examples of generalization between atomtypes.
Priority  Description                          Original atomtypes  Generalized pattern
1         Complete match                       P1b and P1b         P1b
2         Matching at the atom class level     O2c and O2b         O2
3         Matching at the atom species level   O1c and O3b         O
4         No match in comparison               P1b and C1b         (P1b,C1b)
[Figure 3: two RDM patterns shown together with their generalized pattern.]
Fig. 3. An example of generalization.
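The matching priorities of Table 2 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' algorithm: it assumes every atomtype is one capital letter (atom species), one digit (completing the atom class), and one lowercase letter; it compares two patterns position by position; and it skips the enumeration of alternative representations in Step 1 as well as "+"-concatenated atoms. The function names are hypothetical.

```python
import re

# Assumed atomtype shape: species letter, class digit, atomtype letter (e.g. "C1a").
ATOMTYPE = re.compile(r"^([A-Z])(\d)([a-z])$")


def generalize_atom(a, b):
    """Generalize two atom symbols following the priorities of Table 2."""
    if a == b:
        return a  # priority 1: complete match (also covers "*" vs "*")
    ma, mb = ATOMTYPE.match(a), ATOMTYPE.match(b)
    if ma and mb and ma.group(1) == mb.group(1):
        if ma.group(2) == mb.group(2):
            return ma.group(1) + ma.group(2)  # priority 2: same atom class
        return ma.group(1)  # priority 3: same atom species
    return f"({a},{b})"  # priority 4: no match


def generalize_pattern(p1, p2):
    """Generalize two 'R1-R2:D1-D2:M1-M2' strings position by position."""
    parts = []
    for part1, part2 in zip(p1.split(":"), p2.split(":")):
        atoms1, atoms2 = part1.split("-"), part2.split("-")
        parts.append("-".join(generalize_atom(a, b)
                              for a, b in zip(atoms1, atoms2)))
    return ":".join(parts)


print(generalize_atom("P1b", "C1b"))
print(generalize_pattern("O1a-O2a:*-C1y:C1b-C1b", "O1a-O2a:*-C1y:C1y-C1y"))
```

Generalizing a whole cluster would then amount to folding such pairwise generalizations up the clustering tree, which would additionally require handling inputs that are already generalized (atom classes, atom species, or parenthesized alternatives); that part is deliberately left out of this sketch.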
3. Results
3.1. Relationship between EC sub-subclasses and RDM patterns
There were 3116 EC numbers (195 EC sub-subclasses) which correspond to at least one RDM pattern (1571 main types). Fig. 4 shows the EC-RDM dot matrix. Some RDM patterns correspond to many sub-subclasses of EC numbers. For example, the RDM pattern "O1c-O2c:*-P1b:P1b-P1b" in box A in Fig. 4 corresponds to 25 sub-subclasses of EC numbers, and "S1a-S2a:*-C5a:C1b-C1b" in box B corresponds to 14 sub-subclasses of EC numbers. These patterns are found in reactions such as the hydrolysis of ATP and the formation of a thioester bond, respectively. These reactions are highly significant and can be observed extensively in biochemical reactions, since they are frequently used as the energy source of other reactions.
[Figure 4: dot matrix with EC sub-subclasses on the vertical axis and RDM patterns on the horizontal axis; boxes A to F mark the regions discussed in the text.]
Fig. 4. The EC-RDM dot matrix. For simplicity, only the RDM patterns corresponding to at least 2 sub-subclasses of EC are shown.
Some EC sub-subclasses found in the boxes C, D, E and F correspond to many RDM patterns. For example, EC 1.1.1 (box F), EC 4.2.1 (box C) and EC 2.5.1 (box D) correspond to 98, 83 and 72 RDM patterns, respectively. The number of EC numbers differs among EC sub-subclasses. Therefore we compared the number of RDM patterns within each sub-subclass of EC with the number of enzymes included in that sub-subclass (Fig. 5). As seen in Fig. 5, the
variation of the RDM patterns in each sub-subclass of EC largely depends on the variation of the fourth number of EC, except for EC 2.7.11 and EC 3.6.3, which have only one RDM pattern.
[Figure 5: scatter plot; horizontal axis: number of enzymes included in each sub-subclass of EC; labeled points include EC 2.3.1, EC 2.4.1 and EC 2.7.11.]
Fig. 5. Relationship between the number of enzymes included in each sub-subclass of EC and the number of the RDM patterns within each sub-subclass of EC.
3.2. Hierarchical clustering and generalization of RDM patterns
We performed the hierarchical clustering of the RDM patterns by using the information on the existence or nonexistence of sub-subclasses of EC numbers. A part of the resulting cluster is shown in Fig. 6. It is obvious that the RDM patterns belonging to the same cluster consist of similar character strings and that the diversity of KEGG Atomtypes in these RDM patterns is considerably low. On the other hand, KEGG Atomtypes involved in other clusters are much different from each other, and the difference between such atomtype representations becomes larger and larger at the higher levels of the hierarchy in the clustering tree.

[Figure 6: clustering tree; leaves and their EC sub-subclasses as extracted:]
O1a-O2a:*-C1y:C1b-C1b   2.4.1 2.4.2 3.2.1
O1a-O2a:*-C1y:C1y-C1y   2.4.1 2.4.2 3.2.1
O1a-O2a:*-C1y:C8y-C8y   2.4.1 3.2.1
O1a-O2a:*-C1y:C1z-C1z   2.4.1 3.2.1
O1a-O2a:*-C1y:C1c-C1c   2.4.1 3.2.1
O1a-O2a:*-C1y:C1d-C1d   2.4.1 3.2.1
O1a-O2a:*-C1z:C1c-C1c   2.4.88 3.2.1
O1a-O2a:*-C1z:C1b-C1b   2.4.1 2.4.88
O1a-O2a:*-C1z:C1y-C1y   2.4.1 2.4.88 3.2.1
Fig. 6. A part of the clustering tree of the RDM patterns. For simplicity, only RDM patterns corresponding to at least 2 sub-subclasses of EC are shown. The RDM patterns are shown at the leaves of the clustering tree with their corresponding EC sub-subclasses. The full image of the resulting tree is available at the following URL: http://web.kuicr.kyoto-u.ac.jp/supp/shimizu/ibsb2008/
Next, we performed the generalization of the RDM patterns. The generalization starts at the lowest level of the cluster tree, that is, the leaves of the tree, then proceeds up to higher levels of the hierarchy, and ends at the highest level of the cluster, that is, the root of the tree. The generalization process is exemplified in Fig. 7. In the particular case of this figure, the generalized RDM pattern at the highest level is O1a-O2a:*-C1:C-C, which covers all of the RDM patterns within the whole cluster. We applied this generalization to all clusters. The generalized RDM patterns which correspond to at least 2 sub-subclasses of EC had comparatively simple forms; however, those which correspond to only 1 sub-subclass of EC tended to have somewhat complicated forms (e.g. (N,C,O1b,)(N,C,O7a):*-(C,O):(N1,C,O6)) even if the distance between clusters was equal to 0. This is because such clusters tend to have many RDM patterns, and the diversity within a cluster itself becomes larger.

[Figure 7: the clustering tree of Fig. 6 with generalized patterns at the internal nodes, including O1a-O2a:*-C1y:C1-C1, O1a-O2a:*-C1y:C-C and O1a-O2a:*-C1z:C1-C1, and O1a-O2a:*-C1:C-C at the root.]
Fig. 7. An example of the result of the RDM generalization. The clustering tree of the RDM patterns is the same as that of Fig. 6.
4. Discussion
In this study, we have systematized the reaction mechanisms based on the EC classification by hierarchically clustering the RDM patterns. We could successfully represent the variation of reactions by using the generalized RDM patterns. For example, the generalized pattern "N1b-N1a:C2-*:C1b-C1b" can represent five possible patterns, since the atom class "C2" contains the following five atomtypes: C2a, C2b, C2c, C2x, and C2y. Because a generalized pattern is generated from the patterns in the same cluster, in which the combinations of corresponding EC numbers are similar, variations in generalized patterns indicate the possible reaction patterns in some EC numbers. Using this generalization we will be able to calculate which cluster includes a given RDM pattern even if the relevant reaction has never been assigned to any EC number. Then we will be able to define a similarity measure between known enzymatic reactions (which are found in the database) and unknown enzymatic reactions, and consequently we may improve the accuracy of the prediction of unknown enzymatic reactions. For example, we have developed the e-zyme system, which can automatically assign the EC number to a given compound pair by using the RDM patterns. Incorporating the
generalization and the similarities of the RDM patterns in the EC assignment process of e-zyme, we will be able to improve its accuracy rate.

Acknowledgments
We would like to thank J.B. Brown for the proofreading of our manuscript. This work was supported in part by a grant-in-aid for scientific research on the priority area "Comprehensive Genomics" from the Ministry of Education, Culture, Sports, Science and Technology of Japan, and by the Institute for Bioinformatics Research and Development of the Japan Science and Technology Agency. The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.

References
[1] Goto, S., Okuno, Y., Hattori, M., Nishioka, T., and Kanehisa, M., LIGAND: database of chemical compounds and reactions in biological pathways, Nucleic Acids Res., 30(1):402-404, 2002.
[2] Hendry, L.B., Roach, L.W., and Mahesh, V.B., Multidimensional screening and design of pharmaceuticals by using endocrine pharmacophores, Steroids, 64(9):570-575, 1999.
[3] Hattori, M., Okuno, Y., Goto, S., and Kanehisa, M., Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways, J. Am. Chem. Soc., 125(39):11853-11865, 2003.
[4] Kotera, M., Okuno, Y., Hattori, M., Goto, S., and Kanehisa, M., Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions, J. Am. Chem. Soc., 126(50):16487-16498, 2004.
[5] Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M., From genomics to chemical genomics: new developments in KEGG, Nucleic Acids Res., 34:D354-357, 2006.
[6] Kanehisa, M. and Bork, P., Bioinformatics in the post-sequence era, Nature Genetics, 33:305-310, 2003.
[7] Oh, M., Yamada, T., Hattori, M., Goto, S., and Kanehisa, M., Systematic analysis of enzyme-catalyzed reaction patterns and prediction of microbial biodegradation pathways, J. Chem. Inf. Model., 47(4):1702-1712, 2007.
[8] Phillips, K.P. and Foster, W.G., Key Developments in Endocrine Disrupter Research and Human Health, J. Toxicol. Environ. Health B Crit. Rev., 11(3-4):322-344, 2008.
[9] Webb, E.C. and International Union of Biochemistry and Molecular Biology. Nomenclature Committee, Enzyme Nomenclature, Academic Press, San Diego, California, 1992.
[10] Willett, P., Barnard, J.M., and Downs, G.M., Chemical Similarity Searching, J. Chem. Inf. Comput. Sci., 38:983-996, 1998.
[11] http://www.chem.qmul.ac.uk/iubmb/enzyme/
[12] http://www.genome.jp/kegg/ligand.html
OPTIMAL METABOLIC REGULATION USING A CONSTRAINT-BASED MODEL
WILLIAM J. RIEHL 1
DANIEL SEGRE 1,2
[email protected]
[email protected]
1 Graduate Program in Bioinformatics, Boston University, 44 Cummington St., Boston, Massachusetts, 02215, USA
2 Departments of Biology and Biomedical Engineering, Boston University, 24 Cummington St., Boston, Massachusetts, 02215, USA

Regulation of metabolic enzymes plays a crucial role in the maintenance of metabolic homeostasis, and in the capacity of living systems to undergo physiological adaptation under multiple environmental conditions. Metabolic regulation is achieved through a complex interplay of transcriptional and post-transcriptional mechanisms, some of which have been experimentally characterized for specific pathways and organisms. Many of the details, however, including the values of most kinetic parameters, have proven difficult to elucidate. Hence, understanding the principles that underlie metabolic regulation strategies constitutes an ongoing challenge. In the context of genome-scale steady state models of metabolic networks, it has been shown that evolution may drive metabolic networks towards reaching computationally predictable optimal states, such as maximal growth capacity. Here we develop a new computational approach based on the hypothesis that the regulatory systems operating on metabolic networks have evolved towards an optimal architecture as well. Specifically, we hypothesize that the topology of metabolic regulation networks has been selected for optimally maintaining the system balanced around one or more steady states. Based on these hypotheses, we use methods related to flux balance analysis to construct a model of metabolic regulation based primarily on a metabolic network's topology, bypassing the requirement for the details of all kinetic parameters. This model predicts an optimal regulatory network of metabolic interactions that can resolve perturbations to a given steady state in a metabolic system. 
We explore the ability of the model to predict optimal regulatory responses in both a simple toy network and in a fragment of the well-described glycolysis pathway.
Keywords: metabolic regulation; flux balance analysis; enzyme kinetics; metabolism; optimality; logistic map; chaos
1. Introduction
Genome-scale stoichiometric models of cellular metabolism, such as flux balance analysis (FBA), can provide predictions of mass flow through metabolic networks in a population of cells under steady state conditions [3, 12]. While some recent stoichiometric models and FBA predictions include Boolean regulatory expressions [2, 14], understanding how to best formulate a joint modeling framework for metabolism and its regulatory control constitutes an important ongoing challenge. Regulatory networks allow cells to undergo physiological changes in response to dynamically changing environmental conditions, or possibly to cope with externally imposed genetic modifications (e.g. gene knockouts). In addition, even at unperturbed steady states, regulatory networks may have an important role in ensuring homeostatic stability against stochastic noise [8]. In particular, one might expect that if a metabolic network is
perturbed slightly out of steady state, the regulatory system should help quickly restore homeostasis [10, 11]. Although much is known about transcriptional and post-transcriptional regulation in several central metabolic pathways, many regulatory interactions, and most parameters, still remain to be discovered. One may therefore ask whether it is possible to develop a mathematical framework to study metabolic regulation even in the absence of detailed parameters and of specific experimental knowledge of all interactions. Along these lines, previous work has analyzed the evolution of metabolic networks and their regulation, especially as it pertains to optimality of construction and use of resources [4]. If metabolism has evolved over time to be optimal for several different functions (e.g. generation of biomass, reliable transduction of energy, production of signals and antibiotics), then it is likely that the regulatory mechanisms that control these processes have also evolved to be optimal for the maintenance of different metabolic states. In this work we approach the question of how metabolism is regulated by proposing a model of optimal metabolic regulation that predicts the regulatory capacity of metabolites in a network by using the reactions involved in the system.
2. A Constraint-based Model of Regulation
In genome-scale stoichiometric models of metabolism, the metabolic network is represented as a stoichiometric matrix, S, where each element S_ij represents the number of moles (or molecules) of metabolite i produced (positive sign) or consumed (negative sign) in reaction j. If the metabolic network is at steady state, then the system can be described by the following set of equations:

$$\frac{dX_i}{dt} = \sum_{j=1}^{N} S_{ij} v_j = 0, \qquad i = 1, 2, \ldots, M \tag{1}$$
where X_i is the concentration of metabolite i, v_j is the flux through reaction j, N is the total number of reactions, and M is the total number of metabolites. The flux v_j represents the rate at which reaction j proceeds at steady state. The approach of flux balance analysis (FBA) takes advantage of the fact that the steady state approximation transforms these nonlinear differential equations in the concentrations into linear algebraic equations in the fluxes. The space of feasible flux states identified by these linear constraints is further restricted by linear inequalities that define nutrient availability, followed by a Linear Programming (LP) search for a set of fluxes that is optimal for a given linear objective function (for a more detailed introduction to FBA, see [5]). Deviating from the traditional formulation of a flux balance model, we now ask how to describe the regulatory response to a metabolic perturbation that takes the system out of its original steady state. In particular, we ask how the regulatory system optimally restores homeostasis. Equation (1) describes the steady state fluxes in an FBA model. If the fluxes are not in steady state, the equations need to take into account the time dependent accumulation (or depletion) of metabolite pools:

$$\frac{dX_i}{dt} = \sum_{j=1}^{N} S_{ij} v_j = E_i(t) \tag{2}$$

where E_i(t) represents the excess production (or consumption) of metabolite i at time t. If a control mechanism exists to return the system to a steady state, it would act to change the fluxes so that after some time Δt, the excess production will become zero. In terms of our variables, this would mean that a flux correction Δv, which is a function of the excess production E, should modify the flux distribution v, yielding a regulated set of fluxes v_r:

$$\mathbf{v}_r = \mathbf{v} + \Delta\mathbf{v}(\mathbf{E}) \tag{3}$$

Our goal will be to find how regulation should cause such a change Δv to occur. We base our model of regulation on the hypothesis that the system's control mechanism can detect and respond to this overproduction through allosteric and kinetic effects. If the perturbations to the fluxes are small, it is possible that the rapid control conferred by metabolic regulation (allosteric effects, feedback inhibition, cofactor activation, etc.) will restore homeostasis in an optimal manner [7]. In the simplest formulation of our model, we assume that the overproduction of certain metabolites can directly regulate the fluxes that produce them. Specifically, we implement a feedback mechanism limited to non-competitive inhibition without transcriptional effects. To explain how this may work, we start with an analysis of the simple uni-directional Michaelis-Menten equation:

$$v = \frac{V_{max} X}{K_m + X} \tag{4}$$

where v is the reaction rate (analogous to the fluxes from FBA), V_max is the maximum reaction rate, X is the concentration of substrate, and K_m is the Michaelis constant. If an inhibitor molecule with concentration I and constant of inhibition K_I is included in the system, affecting this enzymatic reaction in a non-competitive way, the Michaelis-Menten Equation (4) is rewritten for the inhibited rate v_I as

$$v_I = \frac{V_{max} X}{(K_m + X)\left(1 + I/K_I\right)} \tag{5}$$

Thus, the change in flux Δv caused by the presence of the inhibitor I can be written as:

$$\Delta v = v_I - v = -v \, \frac{I}{K_I + I} \tag{6}$$

If the concentration of inhibitor present is considerably less than its constant of inhibition (I ≪ K_I), then we can further reduce Equation (6) to:

$$\Delta v \approx -v \, \frac{I}{K_I} \tag{7}$$

In our regulation model we are going to assume that any metabolite j could potentially act as a regulator for any flux i, based on a kinetic law similar to the inhibitor effect described in Equation (7). In other words, we will assume that the flux correction is proportional to the amount of regulating metabolite. Note that because we assume that the metabolites can act as general regulators, they can have an activator effect (positive value) as well as an inhibitory effect (negative value). Hence, we extend Equation (7) to define the regulatory change as the cumulative effect of all metabolites acting on all fluxes, based on a matrix A of weights, to be determined:

$$\Delta v_i = v_i \sum_{j=1}^{M} A_{ij} X_j, \qquad i = 1, 2, \ldots, N \tag{8}$$

Aij is the element (i, j) of the N x M matrix A, representing the regulatory effect that metabolite j has on flux i. This matrix element is related to the classical definition of the constant of inhibition by the following equation:

$$A_{ij} = -\frac{1}{K_{I,ij}} \tag{9}$$

The question we are going to focus on is whether, given specific perturbed flux states, we can infer a regulatory matrix A leading to a Δv that brings the network back to a steady state (Equation 3). One problem we need to face is that we do not necessarily know the concentration X_j of each metabolite. However, we do assume knowledge of the perturbed non-steady state fluxes (i.e. the excess production E). From Equation (2) we know that ΔX_i = E_i Δt. If we simplify our model by taking a fixed Δt set to unity, and by assuming (see also Discussion) that the regulatory response depends on the concentration change (ΔX_i) rather than on the total concentration (X_i), we can rewrite Equation (8) as:

$$\Delta v_i = v_i \sum_{j=1}^{M} A_{ij} E_j, \qquad i = 1, 2, \ldots, N \tag{10}$$

or, in matrix form, as:

$$\Delta\mathbf{v} = A E \mathbf{v} \tag{11}$$

Inserting Equation (11) into Equation (3) we obtain an explicit expression for the newly regulated flux state v_r as a function of the perturbation and the regulatory matrix:

$$\mathbf{v}_r = \mathbf{v} + A E \mathbf{v} \tag{12}$$

or, in a component-wise form,

$$v_{r,i} = v_i + v_i \sum_{j=1}^{M} A_{ij} E_j, \qquad i = 1, 2, \ldots, N \tag{13}$$

The matrix of putative regulatory interactions, A, therefore describes the regulatory effect that all metabolites in a network could have on all reactions. However, it is unlikely that all metabolites can regulate all reactions. Rather, we may expect that evolutionary adaptation may have shaped the regulatory network to use only a small set of optimally useful regulatory interactions. We use LP to predict these interactions, in the form of the matrix A, subject to different objectives and constraints. We define two different possible linear constraints for the mode of optimal regulation: the perturbed system could either be regulated to reach any closely available steady state, or to more specifically go back and restore the original unperturbed steady state. Each of these strategies could have a biological relevance, depending on whether general stability or a specific set of fluxes is functionally advantageous. We also define three possible linear programming objectives for identifying the optimal A matrix. First, one can minimize the number of regulatory interactions (i.e.: nonzero elements of A). This would concentrate the regulatory power of metabolites in the network to just a few key players. However, it may be best to minimize the overall regulatory effort, even if this implies using a non-minimal number of regulatory arrows. We search these types of optima in two different ways: either we minimize the sum of the absolute values of the elements of A, or the sum of squares of the values of A. Both of these may reduce the regulatory ability of any single metabolite, distributing the interactions among different elements in the system.
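To make the optimization concrete, the sketch below handles one special case without an LP solver: a single perturbation, the constraint of returning to the original steady state, and the sum-of-squares objective. Under those assumptions each row of A has to satisfy one linear equation, whose minimum-norm solution is available in closed form. This swaps the general optimization described above for elementary linear algebra, so it is a simplification rather than the authors' procedure; the function names, the three-reaction chain, and the choice Δt = 1 are assumptions of the example.

```python
def excess_production(S, v):
    """E = S v: net production of each metabolite (time step taken as 1)."""
    return [sum(s * f for s, f in zip(row, v)) for row in S]


def regulate(S, v, A):
    """Component-wise regulated fluxes: v_i * (1 + sum_j A_ij * E_j)."""
    E = excess_production(S, v)
    return [v_i * (1.0 + sum(a * e for a, e in zip(row, E)))
            for v_i, row in zip(v, A)]


def min_norm_regulation(S, v_perturbed, v_target):
    """Smallest sum-of-squares A that maps v_perturbed back to v_target.

    Row i must satisfy sum_j A_ij * E_j = v_target_i / v_perturbed_i - 1;
    the minimum-norm solution of that single equation is proportional to E.
    Assumes nonzero perturbed fluxes and a nonzero excess production.
    """
    E = excess_production(S, v_perturbed)
    norm2 = sum(e * e for e in E)
    if norm2 == 0.0:
        raise ValueError("system is already at steady state")
    A = []
    for v_i, t_i in zip(v_perturbed, v_target):
        c_i = t_i / v_i - 1.0
        A.append([c_i * e / norm2 for e in E])
    return A


# Illustrative linear chain: v1 makes X1, v2 converts X1 to X2, v3 removes X2.
S = [[1, -1, 0],
     [0, 1, -1]]
v_target = [1.0, 1.0, 1.0]
v_perturbed = [1.0, 1.1, 1.0]  # middle flux raised by 10%

A = min_norm_regulation(S, v_perturbed, v_target)
v_r = regulate(S, v_perturbed, A)
print(A)
print(v_r)
```

In this example only the row of the perturbed flux is nonzero, and it carries weights of opposite sign on the two metabolites, so both the depleted upstream pool and the accumulating downstream pool act on the same reaction. The other two objectives (number of nonzero entries, sum of absolute values) would need a proper optimization routine and are not covered here.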
3. Model Results
3.1. Toy linear metabolic pathway
We initially applied this algorithm to very simple metabolic networks, so as to be able to explore exhaustively all possible modes of regulation. The simple model we used was a linear metabolic pathway of two metabolites and three uni-directional reactions (Figure 1). At any steady state in this pathway all fluxes must have the same value; in this case, at the initial steady state each flux has a value of 1 mmol/gDWh (i.e., v_i = 1 for all i).
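[Figure 1: (A) map of the linear pathway with fluxes v1, v2, v3 and metabolites X1, X2; (B) its stoichiometric matrix:]

        v1   v2   v3
  X1     1   -1    0
  X2     0    1   -1

Figure 1: Linear metabolic pathway for testing. (A) Map of model. (B) Stoichiometric matrix of model.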
We model perturbations as changes to one or more values of the fluxes, moving the system out of steady state. An optimal regulatory response would entail the minimal amount of feedback necessary to restore the fluxes through the system to steady state. We performed multiple perturbations to the linear system and gathered the resulting regulatory changes.
3.2. Single flux perturbations
The first set of perturbations explored involved a perturbation to a single flux in the system. Each flux was individually increased and decreased by 10%, and A was calculated using all combinations of the above methods. We found that the resulting regulatory structures had two overall approaches to control (Figure 2). If we calculate an A to resolve both perturbations at once with the goal of reaching any steady state, the control exerted acts on all fluxes in the system except for the one perturbed, adjusting flow through the system to match the perturbation (Figure 2A, top row). However, when we calculated A to return to the original steady state, there were two different effects. First, we found that only one perturbation at a time could be resolved: because of the linear dependence of the two perturbed vectors, there is no single A that can resolve both an increase and a decrease to the same initial steady state. Second, we found that regardless of the size of the perturbation, the same regulatory structure is encountered. In either case, the metabolite that is being overproduced (or overconsumed) is predicted to exert control over the fluxes that produce (consume) it. With regard to the different optimization objectives used, we find that when either the first or last flux (v1 or v3) was perturbed, the method used to optimize the regulatory network was irrelevant: in all cases, the same regulatory network was predicted. However, when the second flux was perturbed, the different objectives yielded different, although related, regulatory networks (Figure 2B). 
This suggests that there could be multiple optimal regulatory schemes that would have the same regulatory effect on the network through different mechanisms. These could be a feedback system (as described by the minimization of the number of regulatory interactions), a feedforward system (in the minimization of the sum of absolute values of A) or a combination of both (in the minimization of the sum of squares of A). Any of these mechanisms has the potential to restore homeostasis and, depending on the metabolic pathway, may be most appropriate under different experimental conditions.
[Figure 2: (A) regulatory structures for perturbations of flux v1 and of flux v3, shown for the "any steady state" and "original steady state" constraints; (B) regulatory structures for a perturbation of flux v2 under the three objectives (minimize number of interactions, minimize sum of values of A, minimize sum of squares of A), again for the "any steady state" and "original steady state" constraints.]
Figure 2: Regulatory structures. The network described in Figure 1 was perturbed by increasing the value of a single flux. Bold lines indicate perturbed fluxes. Separate A matrices were calculated using each of the optimization schemes and objectives described in the text. (A) When fluxes v1 or v3 are perturbed, each optimization scheme yields the same regulatory structures (dashed edges). These structures differ based on the regulatory objective, whether they seek any steady state (top) or to return to the original steady state (bottom). (B) When flux v2 is perturbed, each combination of objective and optimization scheme yields a different regulatory structure.
3.3. Single perturbation robustness
Next, we expand on the method of restoring a perturbation to a given steady state by studying the robustness of such a mechanism. Given an A constructed around restoring a single perturbed flux to a given steady state, how well does the same mechanism perform against different perturbations to the same flux? We approach this question by perturbing only v3 (from the model used previously, Figure 1). A single A was calculated, then applied to several different perturbations of v3, with perturbed values ranging from -0.75 to 3.25 (with a value of 1 being the target steady state). We find that when applying one regulatory system to multiple perturbations, the resulting regulated flux varies quadratically with the perturbed flux (Figure 3). Robustness is also observed for any A: we found a range of perturbed fluxes such that when regulation with a single A is applied to them, they approach the steady state solution. However, if a perturbation is out of that range, the application of regulation will move the system even further from steady state. This range appears to be calculable based on the target steady state of the system and the perturbation used to calculate A.
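The quadratic relationship can be reproduced by hand for the toy pathway. In the sketch below, v1 and v2 are held at the steady state value of 1 and v3 is set to a perturbed value p, so the only nonzero excess production is that of X2, equal to 1 - p (taking Δt = 1). Requiring that p = 2 be returned to 1 then fixes the single relevant weight at 0.5. The function name and the weight label `a32` are choices made for this example, not notation from the paper.

```python
def regulated_v3(p, a32=0.5, steady=1.0):
    """Flux v3 after one regulation step, for a perturbed value p.

    The excess production of X2 is (steady - p), since v2 stays at `steady`
    while v3 is perturbed. a32 is the weight with which X2 acts on v3;
    0.5 is the value that maps p = 2 back to 1.
    """
    e2 = steady - p
    return p * (1.0 + a32 * e2)


for p in (-0.75, 0.0, 0.5, 1.0, 2.0, 3.0, 3.25):
    print(f"perturbed {p:5.2f} -> regulated {regulated_v3(p):8.5f}")
```

The printed values trace the parabola p(1.5 - 0.5p): the perturbed values 1 and 2 are both mapped to 1, while 0 and 3 are both mapped to 0, which marks the edges of the range described in the text.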
[Figure 3: plot of the regulated flux value against the perturbed flux value.]
Figure 3: Single perturbation robustness. The A that drives this was based on a perturbation of flux v3 of the linear pathway described in Figure 1A. The steady state value of regulation is 1 (dotted line) and the perturbation on which A was calculated is 2. This flux was then perturbed at values between -0.75 and 3.25 (x-axis) and values after regulation were calculated (y-axis). Note that both "perturbations" of 1 and 2 are regulated to the same steady state value. Perturbation values between 0 and 3 all approach the steady state value of 1 after regulation. Perturbation values outside of this range move further from the steady state solution.
3.4. Single flux perturbation trajectories
As noted above, when a regulatory scheme is applied to a perturbation different from the one for which Λ was optimized, Λ still acts on the perturbed flux, either bringing it closer to steady state or moving it further away. This leads to the hypothesis that after several iterations of applying the same Λ to the adjusted fluxes, the perturbation may either be fully dampened, or diverge. For certain ranges of perturbation (as described above), this is indeed the case. In Figure 4, we plot the trajectory of several iterations of this regulation, which can be thought of as representing a dynamical process of metabolic regulation. This can be described by the recurrence relation:
(14)
where n indicates the time iteration step. For example, v_0 is the original perturbed flux, while v_n denotes the flux after the nth application of the regulation described in Equation (13).
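A quadratic recurrence of this kind can be explored numerically. The following is only a minimal sketch: it iterates the classic logistic map x_{n+1} = r x_n (1 - x_n) as a stand-in for the regulated-flux recurrence, and it is not the authors' Equation (14). The function name `iterate_map`, the parameter values and the divergence bound are all illustrative assumptions.

```python
def iterate_map(r, x0, steps=200, bound=1e6):
    """Iterate x_{n+1} = r * x_n * (1 - x_n); stop early once |x| exceeds `bound`."""
    trajectory = [x0]
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        trajectory.append(x)
        if abs(x) > bound:  # treat the trajectory as divergent and stop
            break
    return trajectory


# Three illustrative regimes (parameter values are assumptions, not taken from the paper)
convergent = iterate_map(r=2.5, x0=0.1)  # settles on the fixed point 1 - 1/r = 0.6
divergent = iterate_map(r=2.5, x0=1.5)   # starts outside [0, 1] and runs off toward -infinity
chaotic = iterate_map(r=3.9, x0=0.4)     # stays bounded in [0, 1] without settling

print(convergent[-1], divergent[-1], chaotic[-5:])
```

The same map produces all three behaviors, depending only on the parameter and on the starting value.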
Figure 4: Single perturbation trajectory. Λ matrices were calculated to restore steady state to perturbations to the flux v3 in the network in Figure 1. Plots A and B use a Λ calculated to return a perturbation of 2 to 1, while plot C uses a Λ that returns a perturbation of 1/3 to 1. In each plot, the dotted line is the diagonal, the dashed line is the parabola described in Figure 3, and the solid line is the trajectory after several iterations of Equation (14). (A) Convergent regulation. A perturbed flux value of 0.1 will return to steady state after several regulatory steps. (B) Divergent regulation. A perturbed flux value of 3.5 will approach -∞. (C) Chaotic regulation. For some values of Λ and initial perturbations (here, the initial perturbation is 0.4), any regulation performed may behave chaotically, never converging on a steady state or diverging toward infinity.
This dynamical regulation process behaves similarly to a logistic map [13], displaying regimes of convergence, divergence or apparently chaotic trajectories, depending on the values of the parameters Λ and v. With regard to metabolic regulation, this finding potentially implies that chaotic or divergent behavior might be easily encountered by regulatory networks, unless specific ranges of parameters are avoided. This may pose constraints on possible regulatory networks optimized through evolutionary adaptation.
4. Glycolysis
An obvious question is whether our method can be used to predict the topology and dynamics of regulation in real-world networks. As a simple example, we chose a simplified (condensed) version of the glycolytic pathway, previously used for similar testing of computational approaches (Figure 5) [15]. As was done for the simple linear pathways (Figure 2), we approach this network by perturbing each flux individually and predicting the optimal network to restore homeostasis.
Figure 5: Perturbations in a simplified model of glycolysis. Solid lines represent metabolic reactions, and dashed lines represent predicted optimal metabolic regulation. Reactions represented as bold lines are the ones being perturbed. G = glucose, F = fructose-6-phosphate, B = fructose-1,6-bisphosphate, P = phosphoenolpyruvate, Y = pyruvate, L = lactate, T = adenosine triphosphate, D = adenosine diphosphate.
In all cases, only one regulatory metabolite was necessary for optimal regulation that restores a given steady state. For each of the reactions involving an energy-carrier, ADP was predicted to act as the main regulatory molecule (Figures 5B, 5C, and data not shown). Lactate also acted as a negative feedback regulator on its own production (Figure 5D), and glucose acted as a negative regulator on the influx of glucose (Figure 5A).
5. Discussion
In this work we developed new algorithms and methods for predicting optimal metabolic regulation based on the topology and stoichiometry of a metabolic network. Thus far, we have applied these algorithms to small pathways that are linear in nature in order to understand how accurate and robust the predictions are. Initially we found that while a single regulatory scheme can be robust for some perturbed values (Figures 3 and 4), it quickly becomes clear that a single regulatory approach predicted by this method is incapable of effectively regulating all perturbations. For example, a regulatory scheme focused on regulating perturbations to a single flux will have little or no effect on other fluxes. We also observed that multiple applications of a single regulatory system can produce unexpected, apparently chaotic results (Figure 4C). While some of these results may be unrealistic consequences of the mathematical approximations used, they may also capture some fundamental properties of biological regulation systems evolved to respond to multiple perturbations. Recent work has shown, for example, that some metabolic states are more stable than others, and that perturbations occurring on top of unstable states can lead to cell death [9]. It is worth emphasizing that each of these predicted optimized regulatory mechanisms represents just that: the optimal amount of regulation necessary to respond to a given perturbation. In all cases explored (perturbations to a single flux in the network), the optimal controlling metabolite turns out to be either a reactant or product in the perturbed reaction. However, it remains a point of interest that for many perturbations in glycolysis, the controlling metabolite predicted most often was ADP. This is interesting because both ADP and ATP are known to be strong regulators (either activators or inhibitors) of glycolysis. 
This may point to the utility of this method as both a quantitative (degree of regulation necessary) and a qualitative (type of metabolite functioning as a regulator) prediction generator. The current model involves simplifying hypotheses and approximations, some of which may be unjustified from the biochemical point of view. These include the assumption that the regulatory response is based on concentration changes, rather than absolute concentration values; the fact that we do not include flux relaxation induced by plain kinetic effects; the use of arbitrary values for flux perturbations; the implementation of a dynamical process based on discrete time points; and the limitation to noncompetitive inhibition as the only form of feedback. In ongoing work, we are addressing each of these assumptions to determine their impact on our results, and possible
strategies for more realistic implementations. We plan to expand on this work and use it to explore more complex systems. At first, we will use this method to understand how it predicts regulation of different and multiple perturbations to a system. We expect that when two or more fluxes are perturbed, the regulatory network will quickly become complex and intricate. Next, we plan to explore the regulation of networks with complex topologies that include branching and cyclical pathways. Eventually we intend to apply this predictive method to whole-genome models of flux balance, such as the Escherichia coli model produced by Feist et al. [6] or the Saccharomyces cerevisiae model produced by Blank et al. [1].
Acknowledgements
The authors wish to thank Hsuan-Chao Chiu, Niels Klitgord, and Evan Snitkin for meaningful discussion and critical reading of the manuscript. Linear Programming calculations were performed using the software Xpress, kindly provided by Dash Optimization under free academic license. This work was partially supported by the NASA Astrobiology Institute, the US Department of Energy and the US National Institutes of Health (NIGMS).
References
[1] Blank, L.M., Kuepfer, L. and Sauer, U., Large-scale 13C-flux analysis reveals mechanistic principles of metabolic network robustness to null mutations in yeast, Genome Biol, 6(6):R49, 2005.
[2] Covert, M.W., Schilling, C.H. and Palsson, B., Regulation of gene expression in flux balance models of metabolism, J Theor Biol, 213(1):73-88, 2001.
[3] Covert, M.W. and Palsson, B.O., Constraints-based models: regulation of gene expression reduces the steady-state solution space, J Theor Biol, 221(3):309-25, 2003.
[4] Ebenhöh, O. and Heinrich, R., Stoichiometric design of metabolic networks: multifunctionality, clusters, optimization, weak and strong robustness, Bull Math Biol, 65(2):323-57, 2003.
[5] Edwards, J.S. and Palsson, B.O., Metabolic flux balance analysis and the in silico analysis of Escherichia coli K-12 gene deletions, BMC Bioinformatics, 1:1, 2000.
[6] Feist, A.M., Henry, C.S., Reed, J.L., Krummenacker, M., Joyce, A.R., Karp, P.D., Broadbelt, L.J., Hatzimanikatis, V. and Palsson, B.O., A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information, Mol Syst Biol, 3:121, 2007.
[7] Fell, D., Understanding the Control of Metabolism, Portland Press Ltd., 1997.
[8] Goyal, S. and Wingreen, N.S., Growth-induced instability in metabolic networks, Phys Rev Lett, 98(13):138105, 2007.
[9] Grimbs, S., Selbig, J., Bulik, S., Holzhütter, H.G. and Steuer, R., The stability and robustness of metabolic states: identifying stabilizing sites in metabolic networks, Mol Syst Biol, 3:146, 2007.
[10] Hatzimanikatis, V., Floudas, C.A. and Bailey, J.E., Optimization of regulatory architectures in metabolic reaction networks, Biotechnology and Bioengineering, 52(4):485-500, 1996.
[11] Heinrich, R. and Rapoport, T.A., A linear steady-state treatment of enzymatic chains. General properties, control and effector strength, Eur J Biochem, 42(1):89-95, 1974.
[12] Kauffman, K.J., Prakash, P. and Edwards, J.S., Advances in flux balance analysis, Curr Opin Biotechnol, 14(5):491-6, 2003.
[13] May, R.M., Simple mathematical models with very complicated dynamics, Nature, 261(5560):459-67, 1976.
[14] Shlomi, T., Eisenberg, Y., Sharan, R. and Ruppin, E., A genome-scale computational study of the interplay between transcriptional regulation and metabolism, Mol Syst Biol, 3:101, 2007.
[15] Vance, W., Arkin, A. and Ross, J., Determination of causal connectivities of species in reaction networks, Proc Natl Acad Sci USA, 99(9):5816-21, 2002.
COMPARATIVE DETERMINATION OF BIOMASS COMPOSITION IN DIFFERENTIALLY ACTIVE METABOLIC STATES
HSUAN-CHAO CHIU (1) [email protected]
DANIEL SEGRÈ (1, 2) [email protected]
(1) Graduate Program in Bioinformatics, Boston University, Boston, MA, 02215, USA
(2) Departments of Biology and Biomedical Engineering, Boston University, Boston, MA, 02215, USA
Flux Balance Analysis (FBA) has been successfully applied to facilitate the understanding of cellular metabolism in model organisms. Standard formulations of FBA can be applied to large systems, but the accuracy of predictions may vary significantly depending on environmental conditions, genetic perturbations, or complex unknown regulatory constraints. Here we present an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Specifically, we seek to use experimental data (such as flux measurements, or mRNA expression levels) to infer best matching stoichiometrically balanced fluxes and metabolite sinks. Our algorithm is designed to provide predictions based on the comparative analysis of two metabolic states (e.g. wild-type and knockout, or two different time points), so as to be independent of possible arbitrary scaling factors. We test our algorithm using experimental data for metabolic fluxes in wild type and gene deletion strains of E. coli. In addition to demonstrating the capacity of our approach to correctly identify known exchange fluxes and biomass compositions, we analyze E. coli central carbon metabolism to show the changes of metabolic objectives and the potential compensation for reducing power following single enzyme gene deletions in the pentose phosphate pathway.
Keywords: flux balance analysis; systems biology; data integration; metabolic objectives
1. Introduction
An important goal of systems biology is to reconstruct and simulate biological networks to facilitate the understanding of complex cellular metabolism. Constraint based approaches have been applied to characterize the cellular flux distribution and predict metabolic phenotypes for cells grown in different conditions. One of the most prominent constraint based approaches, Flux Balance Analysis (FBA), relies on a steady state approximation and optimization algorithms to predict metabolic fluxes at the cellular level [15]. The steady state approximation translates into a set of constraints on the fluxes, namely that the net sum of all fluxes producing or consuming each metabolite has to be zero. FBA determines these steady state fluxes by searching the space of feasible solutions, a polyhedral space defined by multiple constraints, for a choice of fluxes that minimizes/maximizes an objective function associated with a biological task. For instance, for a unicellular organism, one may ask what is the solution that maximizes an appropriately defined growth (or biomass production) flux, reflecting selection for fast growth during evolution [15]. In addition to maximizing growth, van Gulik and Heijnen suggested maximization of ATP yield, based on the assumption that evolution drives maximal energy efficiency [14]. Bonarius et al. suggested minimization of overall intracellular flux, reflecting the hypothesis that organisms have evolved to maximize enzymatic efficiency [1].
Several works have proposed methods to identify objective functions from experimental data. Knorr et al. proposed a Bayesian-based probability ranking method to evaluate multiple objective functions [7]. Schuetz et al. have measured fluxes and evaluated different objectives with a Euclidean metric approach [11]. Among all the objectives studied by Schuetz et al., nonlinear maximization of the ATP yield best described unlimited growth on glucose in oxygen or nitrate respiring batch cultures, while linear maximization of the overall ATP or biomass yields achieved the highest accuracy under nutrient limited continuous cultures [11].
Although FBA optimal growth seems to work well in several cases, it has been shown to be sometimes insufficient for predicting perturbed metabolic states, such as the ones found in gene deletion knockout strains. A better way to determine mutant fluxes is to use Minimization Of Metabolic Adjustment (MOMA) [12], which assumes that the mutants would stay as close to the wild type flux distribution as possible. One lesson learned from MOMA is that metabolic networks perturbed from a simple average behavior may be better described by objective functions different from standard growth rate maximization. One can imagine, in general, that a living system may switch its objective when facing a physiological change. For example, the diauxic shift in yeast, which is the switching from anaerobic growth to aerobic respiration upon depletion of glucose, is known to be correlated with widespread changes in the expression of genes involved in carbon metabolism, protein synthesis, and carbohydrate storage [3, 6]. Understanding the physiology of such a natural progression is still an open challenge.
Lacking knowledge of objectives for perturbed cells and changes of objectives under different metabolic states limits the capacity to correctly describe metabolic networks using FBA methods. An alternative way to study metabolism is to infer metabolic flux objectives from available data. Comparative analyses of biomass compositions in different physiological states, either between wild type and mutants or throughout naturally occurring physiological transitions, could provide insight helpful towards understanding the design of metabolic networks. Previously Burgard and colleagues proposed ObjFind and BOSS to identify putative objective functions from flux measurements. Specifically, these methods identify the coefficients of importance responsible for flux distributions in E. coli and yeast [2, 4]. Uygun et al. proposed a multilayer optimization framework to discover the major fluxes of metabolic objective that account for the flux distribution in a mammalian cell [13]. However, these methods rely on flux measurements and cannot take advantage of other high throughput data. Here we present an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Our method is designed to incorporate high throughput data for comparatively determining metabolic objectives in two physiological states. As a first step, we analyze here flux data from E. coli central carbon metabolism pathways [5] to demonstrate our method for predicting metabolic objectives.
2. Method
2.1. Flux Balance Analysis
FBA describes the cellular level reaction rates (fluxes) under a steady state approximation, thereby imposing linear mass balance constraints. All the nutrients taken from the extracellular environment would be consumed to produce biomass or other byproducts and taken out from the system without intracellular metabolite accumulation. The steady state equation responsible for mass balance can be written as follows:

$$\frac{dx}{dt} = S v = 0 \tag{1}$$

where x is the vector of metabolites, v is the vector of reaction fluxes and S is the stoichiometric matrix of the network. S is an m by n matrix where m is the number of metabolites and n is the number of reactions. The value $S_{ij}$ in S is the stoichiometric coefficient for metabolite i in reaction j. Additional constraints such as lower and upper bounds for specific enzymatic reactions or nutrient uptake rates may also be imposed as $LB_j \le v_j \le UB_j$ for reaction $v_j$. FBA determines a specific flux prediction by maximizing/minimizing a linear objective function associated with a biological task. A typical FBA objective used in microbial systems is the maximization of biomass production [15], based on the assumption that unicellular organisms have been selected to reach maximum growth performance during evolution. Biomass production is approximated by a growth flux $v_{growth}$, which is defined as follows:
(2)
where c is the vector of biomass coefficients, whose component $c_i$ indicates the proportion of metabolite $X_i$ required for the formation of a unit of biomass. The linear programming statement for maximizing growth in FBA could be formulated as:

$$\begin{aligned} \max\ & v_{growth} \\ \text{s.t.}\ & S v = 0 \\ & LB_j \le v_j \le UB_j \end{aligned} \tag{3}$$

2.2. Objective inference
We extend the conventional FBA formulation to concurrently infer metabolic objectives in two different metabolic states of a system. Here we limit our search to maximization of biomass production as an objective function, but we allow the biomass composition to assume in principle any vector of coefficients. For instance, the two states could be the wild type and a given mutant. The goal is then to infer the corresponding $c^1$ and $c^2$ vectors of biomass coefficients best representing the metabolic objectives for the two corresponding physiological states.
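As a point of reference for the standard formulation in Eq. (3), the following minimal sketch solves a growth-maximizing FBA problem on a hypothetical four-reaction toy network. The network, the bounds and the variable names are illustrative assumptions and are not taken from the model used in this work. It relies on `scipy.optimize.linprog`, which minimizes its objective, so the growth coefficient is negated.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network: uptake -> A, A -> B, B -> biomass sink, A -> byproduct
S = np.array([
    [1.0, -1.0,  0.0, -1.0],   # mass balance for metabolite A
    [0.0,  1.0, -1.0,  0.0],   # mass balance for metabolite B
])
growth_index = 2               # column of the (assumed) growth flux

# LB_j <= v_j <= UB_j; the uptake flux is capped at 10 (arbitrary units)
bounds = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0), (0.0, 1000.0)]

# linprog minimizes c @ v, so maximizing v_growth means minimizing -v_growth
c = np.zeros(S.shape[1])
c[growth_index] = -1.0

result = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")
v = result.x
v_growth = v[growth_index]
print("optimal growth flux:", v_growth)
```

In this toy case the growth flux is limited only by the uptake bound, so all of the incoming flux is routed to the biomass sink.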
To reverse engineer the objectives, we implement a linear optimization procedure to identify the FBA objectives maximally compatible with given vectors $E^1$ and $E^2$ encoding reference experimental data:

$$\begin{aligned} \min\ & \sum_{j:\ E_j^1, E_j^2 \neq 0} \left| \frac{v_j^1}{E_j^1} - \frac{v_j^2}{E_j^2} \right| \\ \text{s.t.}\ & S' \cdot v' = 0, \quad \text{where } S' = \begin{bmatrix} S & 0 \\ 0 & S \end{bmatrix},\ v' = \begin{bmatrix} v^1 \\ v^2 \end{bmatrix} \\ & LB_j^i \le v_j^i \le UB_j^i \\ & \sum_j |v_j| \ge v_{min} \end{aligned} \tag{4}$$

where v is the vector of fluxes to be determined, and 0 is a zero matrix with the same dimensions as S. In this optimization problem the overall flux activity (the sum of the absolute values of all fluxes) is required to be above a threshold $v_{min}$ (e.g. 25% of the flux activity obtained with regular FBA). Biomass production reactions for the first and second state are removed from the stoichiometric matrix and a sink reaction for each biomass component is added. Each single biomass component originally flowing into biomass is exported separately, and the inferred sink fluxes correspond to the biomass coefficients for the corresponding metabolic state. Our optimization method optimizes biomass coefficients simultaneously for two metabolic states, hence allowing us to take advantage of the fact that certain data could provide only relative changes between reaction activities in the two states. Here we limit the optimization to intracellular fluxes.
To test our objective function inference approach and demonstrate its performance, we apply our method to experimental flux measurements in E. coli central carbon metabolism pathways, taken from the paper published by Ishii et al. [5]. In their flux measurements, a wild type strain of E. coli K-12 and 24 single gene deletion mutants of glycolysis and the pentose phosphate pathway were grown in glucose-limited chemostat cultures. The mutant cells were grown at a fixed dilution rate of 0.2 h⁻¹, and wild-type cells cultured at the same specific growth rate were used as a reference sample. They also cultured wild type cells at different dilution rates (0.1, 0.4, 0.5, and 0.7 h⁻¹) for comparison. In this work, we apply an E. coli central carbon metabolism FBA model [9] to study these data.
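The structure of the optimization in Eq. (4) can be sketched on a toy problem. This is only an illustration under stated assumptions, not the implementation used in this work: the one-metabolite network, the reference values `E1` and `E2`, the threshold `v_min` and all variable names are hypothetical. All fluxes are taken to be non-negative so that the activity constraint stays linear, the activity threshold is applied to both states combined, and the absolute values are linearized with auxiliary variables t_j in the usual way.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network with one metabolite A: uptake -> A, A -> sink_1, A -> sink_2
S = np.array([[1.0, -1.0, -1.0]])
m, n = S.shape
measured = [0]                 # reactions j with reference data (E_j nonzero)
E1 = {0: 10.0}                 # assumed reference value in state 1
E2 = {0: 5.0}                  # assumed reference value in state 2
v_min = 6.0                    # assumed threshold on total flux activity
k = len(measured)

# Variables: [v1 (n), v2 (n), t (k)]; S' = [[S, 0], [0, S]] acts on v' = [v1; v2]
S_prime = np.block([[S, np.zeros((m, n))], [np.zeros((m, n)), S]])
A_eq = np.hstack([S_prime, np.zeros((2 * m, k))])
b_eq = np.zeros(2 * m)

# Linearize |d_j| with d_j = v1_j/E1_j - v2_j/E2_j:  d_j - t_j <= 0 and -d_j - t_j <= 0
rows = []
for pos, j in enumerate(measured):
    d = np.zeros(2 * n + k)
    d[j] = 1.0 / E1[j]
    d[n + j] = -1.0 / E2[j]
    t = np.zeros(2 * n + k)
    t[2 * n + pos] = 1.0
    rows.append(d - t)
    rows.append(-d - t)

# Activity constraint: fluxes are non-negative here, so sum|v| = sum(v) >= v_min
rows.append(np.concatenate([-np.ones(2 * n), np.zeros(k)]))
A_ub = np.vstack(rows)
b_ub = np.concatenate([np.zeros(2 * k), [-v_min]])

cost = np.concatenate([np.zeros(2 * n), np.ones(k)])   # minimize the sum of t_j
bounds = [(0.0, 100.0)] * (2 * n) + [(0.0, None)] * k
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
v1, v2 = res.x[:n], res.x[n:2 * n]
print("state 1 fluxes:", v1, "state 2 fluxes:", v2)
```

The sink fluxes in `v1` and `v2` play the role of the inferred biomass coefficients. In this toy case the split between the two sinks is not unique, which mirrors the alternative optima found for small models with many sink reactions.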
The biomass production reaction in this model is a sink for the linear combination of several metabolites that are precursors of amino acids, nucleotides or lipids:

0.205 g6p + 0.361 e4p + 1.496 3pg + 1.787 oaa + 1.079 akg + 2.833 pyr + 0.898 r5p + 0.519 pep + 0.129 g3p + 0.071 f6p + 18.225 nadph + 3.748 accoa + 3.547 nad + 55.703 atp + 55.703 h2o → 18.225 nadp + 3.748 coa + 3.547 nadh + 55.703 adp + 55.703 pi + 41.025 h    (5)
Because we are dealing with a small model that contains many sink reactions for the metabolites, the optimization in Eq. (4) has many alternative optima. Therefore we further use Minimization Of Metabolic Adjustment (MOMA) [12] to find the most probable steady state solution for v_i by exploring the solution space obtained from optimizing Eq. (4). Coefficients for the biomass precursors listed above are inferred after this primary and secondary optimization process.
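The idea behind a minimal-adjustment step can be illustrated with a deliberately simplified sketch: find the steady-state flux vector closest (in Euclidean distance) to a reference vector. Full MOMA is a quadratic program that also enforces flux bounds, and the secondary optimization described above additionally restricts the search to the optima of Eq. (4); neither is handled here. The function name `closest_steady_state`, the toy network and the reference fluxes are hypothetical, and S is assumed to have full row rank.

```python
import numpy as np

def closest_steady_state(S, w):
    """Return v minimizing ||v - w||_2 subject to S v = 0 (flux bounds are ignored)."""
    # Orthogonal projection onto the null space of S: v = w - S^T (S S^T)^{-1} S w
    correction = S.T @ np.linalg.solve(S @ S.T, S @ w)
    return w - correction

# Hypothetical toy network with one metabolite: uptake -> A, A -> sink_1, A -> sink_2
S = np.array([[1.0, -1.0, -1.0]])
w = np.array([10.0, 6.0, 2.0])   # assumed reference fluxes, unbalanced by 2 units
v = closest_steady_state(S, w)
print(v, S @ v)
```

The imbalance of the reference vector is spread evenly over the three reactions, which is the smallest overall adjustment that restores mass balance.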
3. Results
Performing gene deletions is a commonly used approach to study how an organism responds to perturbations. FBA and MOMA have been used for generating predictions of these metabolic responses. MOMA, in particular, has been shown to be more accurate for predicting mutant fluxes than FBA. However, there are cases in which neither FBA nor MOMA objectives seem to capture the true metabolic state well enough (Fig. 1). Hence this is a good test case for our algorithm, in search of biomass composition coefficients that would be compatible with experimental data.
Fig. 1. Intracellular flux determination for mutant strain Δzwf. Predicted fluxes are plotted against experimental fluxes, v_i (experiment). Units for both axes are millimoles per gram dry weight per hour (mmol/gDW/h). (a) FBA flux predictions for the mutant do not correlate well with experimental measurements. (b) MOMA predictions are expected to better correlate with experimental fluxes. However, in this case, even MOMA predicted mutant fluxes are not satisfactory enough for inferring biomass coefficients.
We applied our method to flux measurements in E. coli central carbon metabolism pathways [5] to infer the metabolic objectives in wild type and mutant strains. The reference (Ref) strain we use is the average of the four replicates available experimentally. Fig. 2 shows the correlation of predicted and experimental exchange rates (which were not part of the input of the above inference algorithm). Our predictions for glucose uptake rates agree with all wild type and mutant measurements studied here. In general, predicted oxygen uptake rates match well with experiments except for those of the GR03 wild type strain (see Table 1). The less accurate predictions for oxygen uptake or CO2 production rates may be caused by reactions consuming or producing these compounds that are not in central carbon metabolism pathways. For instance,
Ubiquinone-8 biosynthesis requires oxygen. Therefore, under-predicted oxygen uptake rates will result in a corresponding under-prediction of Ubiquinone-8 related reactions, such as NADH dehydrogenase or succinate dehydrogenase. In addition, predictions may also be affected by inaccurate flux measurements. For example, three CO2-associated reactions have large standard deviations (larger than 0.5*mean, see Table S5B in [5]) in the wild type replicates, possibly due to experimental difficulties or to the fitting procedure used for flux corrections to achieve isotopomeric steady state.
Fig. 2. Predicted uptake and secretion rates in wild type and three mutant strains Δzwf, Δpgl and Δgnd (pentose phosphate pathway single gene deletion mutants), plotted against experimental rates, v_j (experiment). Unit for flux is millimoles per gram dry weight per hour (mmol/gDW/h). Negative values refer to uptake rates. (a) Glucose (glc), O2 and CO2 exchange rates. Glucose uptake rates in all cultures are predicted quite well. Some oxygen uptake rates are under-predicted in wild type strains under high dilution rates. The wild type strain with the largest dilution rate (GR04) has a large deviation for CO2 predictions. (b) Ethanol production rates. All significant ethanol production rates are correctly predicted.
3.1. Conserved biomass coefficients across different glucose supply rates
For the wild type strains grown at different dilution rates, we apply our algorithm relative to the reference strain (Ref) mentioned above. In Fig. 3, the predicted production rates of the ten biomass precursors defined in the FBA model are plotted against the corresponding biomass coefficients. A linear correlation is observed across all dilution rates, ranging from an almost glucose-starved state to a nearly unlimited glucose supply. In addition, the slope of the line determined by the aligned data points roughly reflects the growth rate of each wild type strain. For instance, the slope observed in Fig. 3d is roughly 1.8 times the one in Fig. 3b, which matches the fold change of growth rates between these two experiments (0.7 vs. 0.4 h⁻¹, a ratio of 1.75). These results suggest that E. coli grown in glucose supply cultures applies robust metabolic objectives for biomass precursors, in agreement with the FBA optimal growth assumption for central carbon metabolism, regardless of glucose supply rate.
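The relationship between slope and growth rate can be checked with a short calculation. The sketch below uses the ten precursor coefficients from Eq. (5) and synthetic production rates that are exactly proportional to them; these rates are hypothetical stand-ins, not the predicted values plotted in Fig. 3. The helper names `slope_through_origin` and `synthetic_rates` are illustrative.

```python
# Biomass coefficients of the ten precursors, copied from Eq. (5)
coefficients = {
    "g6p": 0.205, "e4p": 0.361, "3pg": 1.496, "oaa": 1.787, "akg": 1.079,
    "pyr": 2.833, "r5p": 0.898, "pep": 0.519, "g3p": 0.129, "f6p": 0.071,
}

def slope_through_origin(xs, ys):
    """Least-squares slope of y = a * x for a line forced through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def synthetic_rates(growth_rate):
    """Hypothetical production rates, exactly proportional to the coefficients."""
    return [growth_rate * c for c in coefficients.values()]

xs = list(coefficients.values())
slope_b = slope_through_origin(xs, synthetic_rates(0.4))   # dilution rate of panel (b)
slope_d = slope_through_origin(xs, synthetic_rates(0.7))   # dilution rate of panel (d)
ratio = slope_d / slope_b
print(slope_b, slope_d, ratio)
```

With perfectly proportional data the fitted slopes equal the growth rates, and their ratio is 0.7/0.4 = 1.75, close to the factor of about 1.8 reported above.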
3.2. Influence of single gene deletion in pentose phosphate pathway
The pentose phosphate pathway is responsible for generating NADPH and nucleotides. A perturbation in the pentose phosphate pathway could change the levels of NADPH and nucleotides and may result in less efficient growth. To study the changes of metabolic objectives caused by the deletion of pentose phosphate pathway genes, we applied our algorithm to three single gene deletion mutants, Δzwf (glucose 6-phosphate 1-dehydrogenase), Δpgl (6-phosphogluconolactonase) and Δgnd (6-phosphogluconate dehydrogenase), relative to the Ref state. Our goal was to see whether considerable changes of biomass coefficients or flux rerouting could be detected.
Fig. 3. Predicted production rates for biomass precursors in the wild type strain under different dilution rates. Units for both axes are millimoles per gram dry weight per hour (mmol/gDW/h). The y coordinate for each data point represents the predicted production rate for the corresponding biomass component, and the x coordinate is the expected biomass coefficient taken from the FBA model (see Eq. (5)). E. coli is cultured at dilution rates of 0.1 h⁻¹ (a, GR01), 0.4 h⁻¹ (b, GR02), 0.5 h⁻¹ (c, GR03) and 0.7 h⁻¹ (d, GR04) [5]. The slope of the data line roughly reflects the growth rate for each experiment. Pyruvate (PYR) coefficients in GR02 and in GR03 are erroneously predicted to be zero. This might be related to the less accurate prediction of the CO2 production rate, since several reactions consuming or producing pyruvate generate CO2.
Fig. 4. Map of the initial reactions in the pentose phosphate pathway, connecting g6p (glycolysis) through 6pgl and 6pgc to r5p and x5p. zwf and gnd are responsible for NADPH production to generate reducing power for growth.
Fig. 4 illustrates the reactions knocked out from the pentose phosphate pathway in our computational study. Detailed predictions for biomass components and several key fluxes are shown in Table 1. Note that all mutants were grown in chemostat cultures at the same dilution rate (0.2 h⁻¹) as the wild type strain (column Ref in Table 1). Hence they can all be considered to grow at the same rate. The predicted production rates for the different biomass precursors can therefore be directly compared between different mutants, and relative to the corresponding biomass composition coefficients used in FBA calculations, appropriately normalized. Our results show that, across different strains, the predicted production rates of most biomass precursors remain proportional to the corresponding biomass coefficients. However, individual deviations from this trend can be seen. Fig. 5 shows the predicted production rates for biomass precursors in wild type and mutants. Amino acid and nucleotide precursors (e4p bar to pep bar) tend to be over-produced in Δpgl and under-produced in Δgnd, compared to the Ref strain. Meanwhile, the measured dry weights for Δpgl and Δgnd show the same trend as the production rates for these biomass precursors. One explanation for the deviations is that these mutants may not grow at exactly the same rate due to possible experimental error, since the dry weight matches the under/over production trend. Another interpretation would be that these mutants reprogram their fluxes differently in response to gene deletion. However, more investigation would be required to draw a clear conclusion.
Fig. 5. Left panel: predicted production rates for biomass precursors (mmol/gDW/h) in wild type and mutants under the same dilution rate (0.2 h⁻¹), for g6p, e4p, 3pg, oaa, akg, pyr, r5p, pep, g3p and f6p. Right panel: measured dry weight (g/L) for these strains. Legend: FBA model, Ref (0.2 h⁻¹), Δzwf (0.2 h⁻¹), Δpgl (0.2 h⁻¹), Δgnd (0.2 h⁻¹).
NADPH serves as the electron donor in reductive biosynthesis. Gene deletions in the pentose phosphate pathway perturb NADPH levels and further cause oxidative damage to the mutants [8]. One possible response for these mutants is to reroute their fluxes and generate NADPH from NADP in other pathways. Our results suggest that these mutants may use another strategy for replenishing the NADPH level. As shown in Table 1, all three mutants are predicted to have higher PntAB transhydrogenase activity, suggesting that mutants may replenish the NADPH level by converting NADH into NADPH. This prediction supports the previous suggestion that PntAB transhydrogenase plays an important role in generating NADPH in E. coli [10]. The predicted PntAB flux ratio for Δzwf/Ref (1.55) agrees with the previously reported Δzwf/wild-type mRNA ratio (about 1.7) [10]. In contrast to the predicted robust metabolic requirements for biomass precursors, the cofactor requirements vary. NAD and NADPH requirements show a considerable increase per unit of biomass production in Δzwf and Δpgl. It is not clear how to interpret biologically the increase of redox requirements in these mutants. One possible explanation is that the mutations result in a redox imbalance to which the regulatory networks react, directing the operation of central carbon metabolism in unusual ways. In all cases analyzed here, ATP coefficients are predicted to be zero. This is because ATP synthase is present in the FBA model, and we have no information about the proton and phosphate uptake rates. Hence, the reverse ATP synthase flux is indistinguishable from the sink flux of ATP in biomass. Therefore the ATP synthase flux actually contains the ATP biomass production (but in the opposite direction). When we block (set to zero) ATP synthase in the model, the ATP biomass coefficient becomes equal to the absolute value of the ATP synthase flux listed in Table 1.
However, this is not enough to draw a conclusion at this point on the actual A TP biomass coefficient for each strain. This issue could be examined in detail in the future, with more experimental information. 4.
Discussion
We proposed an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Our results show that E. coli maintains robust biomass coefficients for biomass precursors in central carbon metabolism pathways in glucose-supplied medium, ranging from an almost glucose-starved state to a nearly unlimited glucose supply. This result suggests that E. coli operates its central carbon metabolism pathways with the same biomass objective, in agreement with optimal growth criteria in glucose-supplied medium. One should keep in mind that this result might be partially biased by the fact that experimental inference of fluxes requires fitting to a stoichiometric model that usually involves a biomass production flux as well. Our predictions for mutants indicate an increased usage of the PntAB transhydrogenase flux, suggesting another potential strategy for the mutants to
compensate for the less efficient NADPH production caused by single gene deletions in the pentose phosphate pathway. Some of our flux predictions cannot be fully understood, partly due to the use of an incomplete model (as opposed to a genome-scale one) and partly due to potential experimental errors in the flux measurements. For instance, if we had information about ubiquinone-8 associated fluxes, we could fill in the missing information in the current model and improve the accuracy of the oxygen uptake rate prediction. On the other hand, it would be difficult to apply a genome-scale E. coli FBA model in our study, since the experimental data are limited to central carbon metabolism pathways. At present, large-scale flux measurements are still unavailable, due to experimental difficulties. One way to overcome this limitation would be to take advantage of other types of high-throughput data. Our method is designed to incorporate not only flux measurements, but also other high-throughput data as the reference vector E in Eq. (4), such as mRNA expression or protein levels for two distinct physiological states. In ongoing work, we are applying our method to time series data such as gene expression along the cell cycle, to provide insights into the physiology of cellular growth. This will allow us to learn more about how living organisms organize their biomass requirements and manage energy or redox balance during their life cycle. The method should provide insights into how a cell allocates its metabolic resources in a time-dependent and condition-specific manner, and can be extended to integrate multiple data sources with FBA models, to shed new light on the system-level organization of metabolic networks.
Biomass coefficients
Biomass component         *FBA model Ref        Δzwf       Δgnd       Δpgl       *GR01      *GR02      *GR03      *GR04
Dilution rate (h⁻¹)                  0.2        0.2        0.2        0.2        0.1        0.4        0.5        0.7
Biomass precursors
g6p                       0.043      0.043      0.044      0.050      0.038      0.040      0.043      0.042      0.044
e4p                       0.076      0.089      0.089      0.101      0.076      0.080      0.086      0.085      0.088
3pg                       0.314      0.299      0.296      0.343      0.257      0.272      0.288      0.292      0.297
oaa                       0.375      $0.398     0.394      0.454      0.345      0.364      0.384      0.387      0.396
akg                       0.226      0.243      0.242      0.279      0.210      0.222      0.235      0.236      0.241
pyr                       0.594      0.615      0.610      0.702      0.000      0.562      0.000      0.000      0.613
r5p                       0.188      0.198      0.197      0.225      0.169      0.182      0.190      0.191      0.196
pep                       0.109      $0.109     0.108      0.124      0.094      0.098      0.104      0.106      0.107
g3p                       0.027      0.033      0.032      0.037      0.029      0.030      0.033      0.032      0.033
f6p                       0.015      0.021      0.022      0.024      0.018      0.018      0.020      0.021      0.021
Cofactors
#atp                      11.684     0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000
nad                       0.744      23.631     40.600     63.507     25.496     0.000      23.516     0.000      0.000
nadph                     3.823      31.252     45.218     61.827     34.733     12.992     28.807     12.322     21.787
accoa                     0.786      0.825      0.817      0.941      0.000      0.752      0.000      0.000      0.000

Specific reactions        Ref        Δzwf       Δgnd       Δpgl       *GR01      *GR02      *GR03      *GR04
NADPH->NADH               0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000
£NADH->NADPH              27.713     £42.818    £57.991    £32.026    9.538      25.244     9.618      15.141
©NADH->NAD                9.279      14.506     21.057     9.053      2.526      8.043      ©1.146     ~.406
#ADP->ATP (ATP synthase)  -5.906     -7.114     -6.839     -8.288     -5.030     -5.695     -6.455     -9.590
EX_ac                     0.000      0.000      0.000      1.245      0.000      1.389      1.597      1.371
EX_akg                    0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000
EX_co2                    8.251      9.747      9.482      9.456      7.512      6.128      6.311      12.380
EX_etoh                   0.000      0.013      0.000      0.000      0.000      0.000      0.058      0.044
EX_for                    0.000      0.000      0.000      0.532      0.000      0.594      0.600      0.000
EX_fum                    0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000
EX_glc                    -2.934     -3.178     -3.361     -2.922     -2.676     -2.525     -2.653     -3.813
EX_h2o                    4.252      8.711      15.300     2.103      -2.172     2.939      -3.947     -6.669
EX_h                      9.924      6.903      0.955      12.474     15.098     8.902      16.164     25.450
EX_lac_D                  0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000
EX_o2                     -5.792     -8.765     -11.867    -6.006     -2.250     -4.785     ©-1.408    ©-2.786
EX_pi                     -0.792     -0.788     -0.904     -0.681     -0.722     -0.763     -0.769     -0.787
EX_succ                   0.000      0.000      0.000      5.2E-6     0.000      0.000      0.000      0.000
Table 1. Predicted biomass production and important fluxes (mmol/gDW/h). Negative values refer to uptake fluxes. *Normalized to the same scale as the Ref column for comparison. £PntAB transhydrogenase activities increase in all three mutants. #The ATP biomass coefficient would be the absolute value of the ATP synthase flux if the ATP synthase reaction were blocked in the model. $One flux pair (Ref and Δzwf) fails to predict the correct value for oaa and pep (results not shown). This is due to an erroneous prediction for a single reaction, ppc (phosphoenolpyruvate carboxylase), which converts pep and co2 into oaa. The deviation of the ppc fluxes between the two predictions of Ref biomass (Ref vs. Δzwf and Ref vs. Δgnd) matches the deviation of the co2 production rates. In addition, the flux measurement for ppc has a large standard deviation [5]. ©The NADH dehydrogenase flux seems to be under-predicted in GR03 and GR04 due to the imprecise oxygen uptake prediction.
Acknowledgements
The authors would like to thank Evan Snitkin, Niels Klitgord and William Riehl for discussion and critical reading of the manuscript. Linear programming calculations were performed using the software Xpress, kindly provided by Dash Optimization under a free academic license. This work is supported by research grants from the US National Institutes of Health (5012846-00) and the US Department of Energy (DE-FG02-07ER64388 and DE-FG02-07ER64483).
References
[1] Bonarius, H.P.J., Hatzimanikatis, V., Meesters, K.P.H., et al., Metabolic flux analysis of hybridoma cells in different culture media using mass balances, Biotechnol Bioeng, 50(3):299-318, 1996.
[2] Burgard, A.P. and Maranas, C.D., Optimization-based framework for inferring and testing hypothesized metabolic objective functions, Biotechnol Bioeng, 82(6):670-7, 2003.
[3] DeRisi, J.L., Iyer, V.R. and Brown, P.O., Exploring the metabolic and genetic control of gene expression on a genomic scale, Science, 278(5338):680-6, 1997.
[4] Gianchandani, E.P., Oberhardt, M.A., Burgard, A.P., et al., Predicting biological system objectives de novo from internal state measurements, BMC Bioinformatics, 9:43, 2008.
[5] Ishii, N., Nakahigashi, K., Baba, T., et al., Multiple high-throughput analyses monitor the response of E. coli to perturbations, Science, 316(5824):593-7, 2007.
[6] Johnston, M. and Carlson, M., The Molecular Biology of the Yeast Saccharomyces: Gene Expression, 1992.
[7] Knorr, A.L., Jain, R. and Srivastava, R., Bayesian-based selection of metabolic objective functions, Bioinformatics, 23(3):351-7, 2007.
[8] Minard, K.I. and McAlister-Henn, L., Antioxidant function of cytosolic sources of NADPH in yeast, Free Radic Biol Med, 31(6):832-43, 2001.
[9] Palsson, B.O., Systems Biology: Properties of Reconstructed Networks, Cambridge University Press, 2006.
[10] Sauer, U., Canonaco, F., Heri, S., et al., The soluble and membrane-bound transhydrogenases UdhA and PntAB have divergent functions in NADPH metabolism of Escherichia coli, J Biol Chem, 279(8):6613-9, 2004.
[11] Schuetz, R., Kuepfer, L. and Sauer, U., Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli, Mol Syst Biol, 3:119, 2007.
[12] Segre, D., Vitkup, D. and Church, G.M., Analysis of optimality in natural and perturbed metabolic networks, Proc Natl Acad Sci USA, 99(23):15112-7, 2002.
[13] Uygun, K., Matthew, H.W. and Huang, Y., Investigation of metabolic objectives in cultured hepatocytes, Biotechnol Bioeng, 97(3):622-37, 2007.
[14] van Gulik, W.M. and Heijnen, J.J., A metabolic network stoichiometry analysis of microbial growth and product formation, Biotechnol Bioeng, 48(6):681-698, 1995.
[15] Varma, A., Boesch, B.W. and Palsson, B.O., Stoichiometric interpretation of Escherichia coli glucose catabolism under various oxygenation rates, Appl Environ Microbiol, 59(8):2465-73, 1993.
SUFFIX TECHNIQUES AS A RAPID METHOD FOR RNA SUBSTRUCTURE SEARCH
RAPHAEL A. BAUER 1,2,* raphael.bauer@charite.de
KRISTIAN ROTHER 3,4,* krother@genesilico.pl
JANUSZ M. BUJNICKI 3,4 iamb@genesilico.pl
ROBERT PREISSNER 1 robert.preissner@charite.de
1 Institute of Molecular Biology and Bioinformatics, Structural Bioinformatics Group, Charité Universitätsmedizin (Medical University), Arnimallee 22, 14195 Berlin, Germany
2 Graduate School: Genomics and Systems Biology of Molecular Networks, Monbijoustr. 2, 10117 Berlin, Germany
3 International Institute of Molecular and Cell Biology in Warsaw, ul. Ks. Trojdena 4, 02-109 Warsaw, Poland
4 Laboratory of Bioinformatics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, ul. Umultowska 89, 61-614 Poznan, Poland
The RNA Ontology Consortium recently proposed a two-letter representation of the RNA backbone conformation. In this study, we compare this suite notation to a custom string representation that utilizes η-θ pseudotorsion angles. Both representations were used to assess similarity and self-similarity in several RNA structure datasets. To detect similarities between two RNA structures we use suffix techniques that allow substructure similarity to be detected within some degree of inexactness. The suite representation as well as the pseudotorsion representation was tested on four diverse RNA datasets. Detecting structural similarities in these datasets allowed us to recover many homologous structural elements that have implications for further understanding of the RNA apparatus in Systems Biology. The software as well as the datasets used are freely available from http://suiterna.sourceforge.net.
Keywords: RNA; structural search; suffix array; suite encoding
1. Introduction
String-based approaches to RNA structure analysis are widely used where secondary structures are concerned. However, there have been few attempts to express 3D features in a string notation. Recently, the RNA Ontology Consortium [11] proposed a string representation for the conformation of RNA backbones. This in turn allows classical string matching methodology to be used to compare structural features. In this manuscript, we explore how suffix techniques can be used to find similar regions in RNA backbone strings.
*Both authors contributed equally to the paper.
RNA secondary structures are most commonly expressed in the dot-bracket grammar, which contains all nested Watson-Crick and wobble base pairs. This string notation is easy to handle, and has therefore been widely used to describe local motifs [10], for computational approaches comparing RNA sequences by tree grammars [16], and for aligning two or more sequences [4]. This notation is not sufficient to distinguish subtle structural features like the sarcin-ricin motif, RNase P, pseudoknots, and tertiary interactions. These features depend on specific base pairing and stacking interactions, and on a specific arrangement of the RNA backbone.
The RNA Ontology Consortium has bundled efforts to describe RNA structures. It provides a platform where structural bioinformaticians can exchange ideas and discuss formal nomenclature. Systematic approaches to describing RNA tertiary structure have been started from many sides: A typology of base pairs as the basic unit of which RNA is built was defined [19]. This made it possible to identify interchangeable pairs of base-base interactions (known as the isostericity principle) [12]. Stacking is regarded as a major stabilizing force, and two complementary typologies have been introduced [13]. To describe larger local structural units, circular topologies (residues interconnected by backbone, base-pair or stacking interactions) have been introduced. Assembly of these building blocks has been used successfully to construct tertiary structures, given that the topology is known or well predicted [15]. Jane Richardson et al. created a string representation of the RNA backbone [17], in which the backbone conformation of ribose-to-ribose 'suite' units is represented by two letters.
To analyze the RNA backbone, the most significant features are the torsion angles. For each base, there are six of them, one for each bond from one phosphodiester unit to the next. These torsion angles show a characteristic distribution. More distinct clusters of the torsions can be found if RNA 'suites' - units from one ribose to another - are considered instead of the traditional phosphate-phosphate units [14]. Each suite consists of seven torsion angles, including both C4'-C3' bonds. The torsion angles were clustered, each cluster being defined as a hyperellipsoid in the 7D space formed by the seven torsions of one suite. In total, 46 distinct conformations of the backbone were identified. Each cluster was assigned a two-character code. The first character corresponds to the first three torsion angles, and the second to the other four. Thus, it is possible to write an entire RNA 3D structure as a 1D string representing the backbone.
The main disadvantage of the suite representation is that its scope is limited to well-defined backbones. For a high-quality dataset, it covers 90-95% of the residues in RNA structures. The other residues are disregarded either because one of the backbone torsions is outside the well-defined boundaries, or because the suite is not close enough to any of the hyperellipsoids in 7D space. Most of the unassigned residues are in flexible regions with a high temperature factor, or they belong to conformations that are too sparsely populated to form a separate cluster.
An alternative description of the RNA backbone is based on pseudotorsion angles. For this, the RNA structure is reduced to its C4' and P atoms, similar to the Cα trace of proteins. Between these atoms, two pseudotorsions η and θ are defined.
Even though this representation is more coarse-grained, the η-θ angles encode important features such as the sugar pucker to a satisfying degree. The Amigos program can be used to calculate pseudotorsions [6]. The P and C4' atoms are frequently used to construct an initial backbone trace in X-ray crystallography. Recently, it was reported that using P-C1' pseudotorsions improves the assignment of the backbone and ribose to electron density maps (K. Keating, personal communication), but it was not explored how these pseudotorsions map to other structural features.
It is very tempting to utilize these backbone representations to compare local structures of RNA to each other. Only a few tools are available to compare RNA structures. Most of them are based on secondary structures and use the dot-bracket grammar. Among them, RNAforester [16], Vienna [9] and ARTS [5] are the most common. Recently, a web server (SARSA) was released [3] that uses a custom vector quantization to cluster the RNA bases into 23 distinct conformers, which are translated into a string representation. SARSA then applies traditional string alignments to find similar motifs. SARSA is especially useful when applied to multiple alignments of RNA structures; however, a search against a database of RNA structures is not supported. The RNAFRABASE web site (http://rnafrabase.ibch.poznan.pl/) contains a large number of loop fragments from RNA structures, but it is very limited both in the kinds of fragments contained and in the search methodology offered. To our knowledge, no method exists that allows fast queries for similar RNA substructures against a database.
Therefore, we decided to use string representations of the RNA backbone in order to take advantage of existing algorithmic solutions for efficient string search. As an alternative, we calculate a representation based on η-θ pseudotorsion angles. To cope with the thousands of motifs and thousands of RNA structures available, we use a suffix technique [7] that holds all information in an index and can be traversed almost linearly. The main objectives of this work are, in brief:
(1) Verification of the applicability of the RNA Ontology Consortium suite code, by examining the suites of differently structured RNA.
(2) Presentation of a suffix method to compare RNAs to each other, giving an overview of which structures and substructures are similar.
(3) Discussion of possible alternatives (regarding the structure-to-string coding and the search algorithms used) and applications.
2. Methods
We constructed suffix arrays from strings consisting of the RNA Ontology Consortium suite codes for four different datasets: motifs from the SCOR database, all tRNA structures, a high-resolution dataset, and the representative RNADB05 set. Each of them was then queried for matching subsequences in the suffix array to detect structural similarities. As an alternative approach, strings representing η-θ angles of the RNA backbone were constructed and processed in the same way.
2.1. Datasets used
2.1.1. SCOR dataset
First, we wanted to know whether known RNA motifs annotated in SCOR can be recovered by the suite representation. SCOR is a database containing 15,945 structural, functional and tertiary interaction motifs that have been annotated manually [18]. A hierarchical classification inspired by the SCOP database [1] has been established, but the database has not been updated since 2004. Therefore, a reliable automatic recognition of motifs could be useful. Currently, no such procedure is available; the circular motif library of the MC-Sym program probably comes closest [15]. For this analysis, all 4,501 structural and 100 tertiary interaction motifs from SCOR (version 2.0.4) were used. The corresponding fragments of PDB structures had lengths of 2-11 suites for structural motifs and 4-60 suites for tertiary interaction motifs. This set was termed "SCOR". Functional motifs annotate entire RNAs; they were excluded here and are considered in the later datasets.
2.1.2. tRNA dataset
Second, as a positive control, we were interested in proving that a set of structurally highly conserved RNAs can be recognized by the suite representation. For this, tRNA was chosen as one of the most conserved molecules in life. Although tRNA sequences started diverging even before the genetic code itself was fixed, and their structures are highly modified by post-transcriptional additions, all of them need to have a highly conserved tertiary structure in order to work in the translation machinery. Thus, it is not surprising that all example tRNAs from the PDB look the same from afar - and we were convinced that they should have very similar backbone conformations when represented as suites. To examine whether this hypothesis holds, all tRNA structures from the NDB database [2] were retrieved. The resulting tRNA set consists of 102 tRNA structures from all kingdoms of life and is termed "TRNA".
2.1.3. RNADB05 and HIRES sets
Third, we wanted to check for similarities among RNAs of different origin. This was done for two sets of RNA structures. One was the dataset used by Richardson et al. (termed RNADB05) [17]. The RNADB05 set is a manually refined representative set of 173 RNA structures from both X-ray and NMR experiments. The second set (HIRES) consists of 74 high-resolution X-ray structures. They were filtered from the PDB by applying the constraints resolution ≤ 2.5 Å and R-value ≤ 0.25. Structures with identical sequences, and sequences with fewer than four bases, were discarded.
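The HIRES selection criteria can be written down as a small filter. The following Python sketch only illustrates the stated thresholds; the `Structure` record, its field names and the function name are hypothetical and are not taken from the software described here.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    """Hypothetical minimal record for one PDB entry."""
    pdb_id: str
    sequence: str
    resolution: float  # in Angstrom
    r_value: float


def filter_hires(structures, max_resolution=2.5, max_r_value=0.25, min_length=4):
    """Keep structures that pass the resolution, R-value and length cut-offs,
    and keep only the first structure for every distinct sequence."""
    seen_sequences = set()
    kept = []
    for s in structures:
        if s.resolution > max_resolution or s.r_value > max_r_value:
            continue  # fails the quality constraints
        if len(s.sequence) < min_length:
            continue  # fewer than four bases
        if s.sequence in seen_sequences:
            continue  # identical sequence already represented
        seen_sequences.add(s.sequence)
        kept.append(s)
    return kept
```

Which of several structures with an identical sequence is retained is not stated in the text; keeping the first one encountered is an assumption of this sketch.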
2.2. Calculation of RNA backbone string representation
For each structure in each of these datasets, one string using the suite representation and another one based on the pseudotorsions were calculated. The same calculation is applied to structures that are queried against one of these datasets. The method to calculate suites from a structure was re-implemented according to the description in [17]. The seven torsion angles were calculated according to Figure 1 in 5' to 3' direction. They were then assigned to one or none of the 46 suite clusters. First, the suites are grouped according to their δ, δ-1, and γ angles to limit the number of clusters to be considered. Second, the 7D distances to the 7D hyperellipsoids of each cluster were calculated. If the suite was inside a hyperellipsoid, the name of that cluster was assigned to the suite. The extents of these hyperellipsoids vary depending on the cluster. In particular, some of the clusters partially overlap; in these cases the closest hyperellipsoid center was used.
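The torsion angles that enter the suite assignment are ordinary dihedral angles over four consecutive backbone atoms. A minimal Python sketch of that calculation follows; the function names and the coordinate format (plain 3-tuples) are illustrative assumptions and do not reproduce the re-implementation mentioned above.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees (range -180..180) around the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

The same function applies to pseudotorsions if P and C4' coordinates are passed instead of covalently bonded atoms.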
Fig. 1. Definition of RNA suites. A suite stretches from one ribose unit to the next, involving seven dihedral angles along the RNA backbone. Note that the δ angle is used by two adjacent suites. In the suite encoding, the first three dihedral angles are represented by a number, the next four by a letter. The example is taken from the tRNA structure with PDB code 1b23.
Even though Richardson et al. recommend not calculating suites for residues with a high B-factor or with clashes, we decided to include them anyway. This was done for two reasons: First, to have a continuous string representation for all RNA structures. This is particularly important considering that 5-15% of the residues cannot be assigned to suites, so that on average only short fragments of a structure would remain for the calculation. Second, we wanted to assess the number of errors that occur in a real-life dataset. There were four kinds of errors: missing atoms in the residue (resulting in a '--' suite code), a single torsion angle outside the boundaries defined in [17] (a so-called triaged residue, resulting in a 'tt' suite code), an outlier suite which is not
close to any cluster (resulting in an 'oo' suite code), and a close outlier inside a 4D hyperellipsoid but outside in 7D space (resulting in a '!!' suite code). The second way to translate the 3D structure of an RNA into a sequence of characters is to calculate the η-θ pseudotorsion angles from the backbone atoms of the same residues as the suites. For η, this was the C4'(i)-P(i+1)-C4'(i+1)-P(i+2) dihedral angle, and for θ the P(i)-C4'(i)-P(i+1)-C4'(i+1) dihedral angle. Each of these angles was divided into 36 ten-degree bins, and an alphanumeric character was assigned to each bin. Thus, a single η-θ tuple - conceptually corresponding to the RNA suite - was represented by two characters as well. Only when one of the atoms defining a dihedral was missing was a '--' code assigned in place of the η-θ tuple.
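The binning of η-θ tuples into two-character codes can be sketched as follows. The text only states that 36 ten-degree bins are mapped to alphanumeric characters; the concrete alphabet used here (digits followed by lowercase letters) and the function names are assumptions made for illustration.

```python
import string

# Assumed bin alphabet: 10 digits + 26 lowercase letters = 36 symbols.
ALPHABET = string.digits + string.ascii_lowercase
BIN_WIDTH = 10.0


def bin_char(angle_deg):
    """Map an angle in degrees to the character of its ten-degree bin."""
    index = int((angle_deg % 360.0) // BIN_WIDTH) % len(ALPHABET)
    return ALPHABET[index]


def encode_pseudotorsions(tuples):
    """Encode a list of (eta, theta) tuples as a string.

    A tuple given as None stands for a residue where an atom defining
    one of the dihedrals is missing; it is written as '--'.
    """
    parts = []
    for item in tuples:
        if item is None:
            parts.append("--")
        else:
            eta, theta = item
            parts.append(bin_char(eta) + bin_char(theta))
    return "".join(parts)
```

Because every tuple yields exactly two characters, the resulting strings have the same layout as suite strings and can be indexed in the same way.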
2.3. Suffix tree and array implementation
Our studies were performed using a suffix array. While even simple implementations of suffix trees can search for a given substring in O(m), with m being the length of the search string, we used the slightly slower suffix array implementation because of its smaller memory footprint. An algorithmic introduction to suffix trees and suffix arrays is given in [8]. The suffix array implementation we used can search in O(m log n), with m being the length of the search string and n the number of strings in the index. This performance is fast enough considering the absolute number of structures to index - even for all RNA structures in the PDB (currently 1500). In principle, a suffix array works in the following manner: To index a string s, each suffix of s (one starting at every position of s) is put into an array. This array is then sorted alphabetically. Once the sorted array is established, a substring of s can be retrieved by binary search over the index, which yields the O(m log n) property. A conceptual disadvantage of suffix techniques is that a substring search can only be performed in an exact manner. To overcome this disadvantage we use n-grams to perform an inexact search and to score one input structure against a whole database. This similarity score (SCORE) is generated by searching all consecutive substrings of length n (n-grams) of the input string against the database.
$$\mathrm{SCORE} = \frac{\text{number\_of\_matches\_found}}{\text{number\_of\_matches\_expected}} \qquad (1)$$
This allows us to generate a ranking of the best matching entries in the database, as well as an all-against-all ranking of the entities in one database. One drawback of this scoring scheme is that ubiquitous repeating substrings (like '1a1a1a1a') are found in nearly every entity in the database and therefore add a huge bias to the calculation. To avoid that, substrings that consist of repeating entities are excluded from the search.
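The indexing and scoring steps described in this section can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the `step` parameter (indexing only suffixes that start on a two-character code boundary) is an added assumption, and "number_of_matches_expected" is interpreted here as the number of n-grams that were actually searched.

```python
def build_suffix_array(text, step=1):
    """Start positions of the indexed suffixes, sorted alphabetically.

    Sorting by slices is simple but not efficient; it is meant for
    illustration only.
    """
    return sorted(range(0, len(text), step), key=lambda i: text[i:])


def contains(text, suffix_array, pattern):
    """Exact substring lookup by binary search over the sorted suffixes."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(suffix_array):
        return False
    start = suffix_array[lo]
    return text[start:start + m] == pattern


def is_repeat(gram, width=2):
    """True if the n-gram consists of one code repeated, e.g. '1a1a1a1a'."""
    return len({gram[i:i + width] for i in range(0, len(gram), width)}) == 1


def ngram_score(query, text, suffix_array, n=8, width=2):
    """Fraction of the non-repetitive n-grams of query that occur in text."""
    grams = [query[i:i + n] for i in range(0, len(query) - n + 1, width)]
    grams = [g for g in grams if not is_repeat(g, width)]
    if not grams:
        return 0.0
    found = sum(contains(text, suffix_array, g) for g in grams)
    return found / len(grams)
```

For a database of several structures, one option is to concatenate their strings with a separator character that occurs in no code, so that no n-gram matches across structure boundaries; this, too, is an assumption of the sketch.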
Apart from the theoretical runtimes given in O-notation, the practical runtime of the n-gram search with the current suffix array implementation is below 5 seconds for an all-against-all search of the RNADB05 set (257 entries) on a commodity PC (dual core 2.2 GHz, 3 GB RAM).
3. Results
In this analysis, we systematically looked for similar backbone conformations, and then checked whether they occur in RNAs that are annotated in a similar way. We calculated the suite strings and η-θ binning strings for 4,950 structures in all datasets. In Table 1, the distribution of suite codes is shown.
Table 1. Ratio of suite codes as they occur in the four datasets examined here. Each entry is the number of suites of a particular kind divided by the total number of suites (including outliers) in the corresponding dataset.
Suite   TRNA     SCOR     RNADB05  HIRES
!!      0.0221   0.0167   0.0100   0.0094
--      0.0005   0.0015   0.0094   0.0343
0b      0.0007   0.0012   0.0020   0.0011
1[      0.0077   0.0065   0.0078   0.0057
1b      0.0110   0.0165   0.0202   0.0244
1e      0.0045   0.0063   0.0049   0.0068
1g      0.0226   0.0217   0.0127   Om05
1t      0.0046   0.0019   0.0025   0.0031
2[      0.0011   0.0057   0.0048   0.0026
2g      0.0056   0.0007   0.0015   0.0017
2o      0.0003   0.0007   0.0009   0.0020
3a      0.0045   0.0084   0.0038   0.0020
3d      0.0026   0.0056   0.0027   0.0011
4b      0.0023   0.0028   0.0045   0.0048
4n      0.0003   0.0013   0.0019   0.0017
5d      0.0009   0.0029   0.0019   0.0017
5n      0.0012   0.0023   0.0010   0.0009
5q      0.0004   0.0006   0.0005   0.0003
6g      0.0001   0.0032   0.0033   0.0028
6p      0.0003   0.0052   0.0044   0.0043
7d      0.0042   0.0046   0.0027   0.0011
7r      0.0012   0.0023   0.0017   0.0000
9a      0.0019   0.0052   0.0042   0.0051
tt      0.1120   0.0352   0.0543   0.0709

Suite   TRNA     SCOR     RNADB05  HIRES
&a      0.0188   0.0252   0.0170   0.0119
0a      0.0007   0.0047   0.0041   0.0034
1L      0.0422   0.0252   0.0269   0.0201
1a      0.4504   0.5760   0.5943   0.6015
1c      0.0769   0.0426   0.0477   0.0471
1f      0.0098   0.0058   0.0044   0.0023
1m      0.0314   0.0177   0.0111   0.0071
1z      0.0001   0.0029   0.0023   0.0011
2a      0.0117   0.0110   0.0109   0.0122
2h      0.0017   0.0010   0.0019   0.0011
2u      0.0016   0.0005   0.0009   0.0020
3b      0.0004   0.0022   0.0022   0.0014
4a      0.0003   0.0026   0.0020   0.0020
4d      0.0042   0.0012   0.0017   0.0017
4p      0.0017   0.0021   0.0019   0.0014
5j      0.0007   0.0020   0.0016   0.0014
5p      0.0007   0.0008   0.0011   0.0009
6d      0.0057   0.0020   0.0030   0.0045
6j      0.0003   0.0008   0.0008   0.0006
7a      0.0078   0.0076   0.0043   0.0014
7p      0.0020   0.0029   0.0028   0.0034
8d      0.0005   0.0026   0.0020   0.0000
oo      0.1179   0.0766   0.0723   0.0590
As expected, the helical stem suite variants (1a, 1m, 1L, &a) are predominant. In the two representative datasets, the 1a suites account for up to 60% of all suites, and its three satellite clusters together contain another 5%. In SCOR these numbers are very similar, indicating that the 1a backbone conformation is apt to form many of the motifs annotated there (verified by visual inspection of the primary suite strings). In TRNA the proportion of 1a is lower (45%). This is a common feature of the tRNA fold, as the same observation holds for all tRNA suite strings. In turn,
some of the other suites are more highly represented. In particular, 1L, 1c, 1m, 2g, 4d, 6d, and 1t seem to play an important structural role in tRNA. The total proportion of all four kinds of invalid suites ('tt', 'oo', '!!', and '--') is 25.25% in the tRNA set, 12.00% in SCOR, and 14.60% and 17.36% in the RNADB05 and HIRES datasets, respectively. At first, the latter seems surprising, because one would expect fewer errors in high-resolution structures. The percentage is partly caused by 3.4% of residues with missing atoms. The remaining 13.9% are caused by 'triaged' dihedral angles, and by outlier suites for which no suitable cluster could be found. One interpretation is that these are unusual backbone conformations which are only visible at a better resolution - in low-resolution structures they probably get smoothed out by the refinement process. In SCOR, the number of invalid suites is much lower. It is clearly biased by the manual selection of motifs, which by definition must occur in well-defined regions. In the tRNA set, the high error rate was examined in more detail. It appears that the three loop regions contain many conformations that do not fit into any cluster (resulting in several 'oo' or 'tt' suites in a row for some structures). This can be a result of strong constraints on the structure during refinement or of interactions with other molecules. In the high-resolution tRNA entry with PDB id 1ehz, the rate of triaged and outlier suites is lower than in the RNADB05 and HIRES sets, and the clusters of outliers do not occur. It is unclear whether modified bases contribute to the problem, but in the examined high-resolution structures this was no problem either. This observation indicates that the lower-resolution RNA structures are to be treated with caution.
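The ratios in Table 1 and the fraction of invalid suites can be obtained from a suite string by simple counting. The following Python sketch assumes that every suite occupies exactly two characters and that the four invalid codes are written as 'tt', 'oo', '!!' and '--'; the function names are illustrative.

```python
from collections import Counter

# The four codes that mark residues without a valid suite assignment.
INVALID_CODES = {"tt", "oo", "!!", "--"}


def suite_ratios(suite_string):
    """Return {code: count / total} for all two-character codes."""
    codes = [suite_string[i:i + 2] for i in range(0, len(suite_string), 2)]
    counts = Counter(codes)
    total = len(codes)
    return {code: count / total for code, count in counts.items()}


def invalid_fraction(suite_string):
    """Fraction of suites that are triaged, outliers or incomplete."""
    ratios = suite_ratios(suite_string)
    return sum(r for code, r in ratios.items() if code in INVALID_CODES)
```

Applied to the concatenated suite strings of a dataset, `suite_ratios` would give one column of the table, under the assumptions named above.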
3.1. Analysis of SCOR motifs
The 4,601 motifs from SCOR were divided into a 20% training set and an 80% test set. The training motifs were stored in the suffix tree, and the test motifs were searched in it by all their subsequences of 12 characters. One would assume that, for example, loops of a given type have similar backbone conformations. Therefore we wanted to know which motifs can be identified this way, and whether they are distinct from other motifs. We counted how many motifs from the test set could be correctly identified based on matches of their suite strings. In Figure 2, the sensitivity and specificity of this analysis are given for each motif class separately. It turns out that the predictability of the SCOR motifs is low. While the specificity is above 0.6 for almost all classes examined, and at 1.0 for many of them, the sensitivity covers almost the entire range from zero to one. The reason is a high number of false negatives in each class. To find out where these come from, the suite strings of several classes were inspected in more detail: The '180 degree turn' class consists of 24 motifs. 17 of them are just two suites (three residues) long, all having the suite string '4b6p'. The remaining 7 contain five suites, which are small variations of '1a3a1g9a1a'. These two groups fully correspond
[Figure 2 plot: "Recognition of SCOR motifs by substring matching"; the axes show sensitivity and specificity, each ranging from 0 to 1.0.]
Fig. 2. SCOR motifs recognized by substring matching. The entire set of SCOR motifs was divided into a 20% training set and an 80% test set. The number of correctly matched substrings of length 12 (or the entire motif, if it was shorter), the number of matches from different SCOR motifs, and the total number of motif pairs compared were used to calculate the sensitivity and specificity of the search.
to two homologous positions in different structures of the 23S rRNA (1874-1876 for the first, and 1789-1794 for the second). A similar effect can be observed for many other motifs like '3 non-WC base pair', 'About 90 Degree Turn With All Bases Simply Stacked', and 'Multiple Twist'. In other cases, like the 'Ustk stack swap' motif, even more variations can be found. On the positive side, it has to be noted that the homologous motifs can be recognized well from as few as 2-4 suites, and their structures are conserved. As stated above, the manual selection of motifs probably facilitates this. No examples were found where two non-homologous motifs belonging to the same class can be identified on the basis of their suites alone. One of the reasons for this observation is that the rules by which SCOR motifs have been annotated are based on individual decisions made by experts. It appears that the base pairing/secondary structure scheme that is specific to a particular motif class does not impose a constraint on the backbone strong enough to allow a prediction. On the other hand, this implies that the RNA backbone could contain an independent set of frequently occurring conformations that has not yet been described.
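The train/test evaluation of Section 3.1 can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the original implementation: a plain dictionary of substrings stands in for the suffix tree, substrings are taken on two-character suite boundaries, and sensitivity and specificity are counted per test motif rather than per motif pair as in Figure 2. All function names are hypothetical.

```python
from collections import defaultdict

WORD = 12   # query length in characters, i.e. six two-character suites
STEP = 2    # assumption: substrings start on suite boundaries


def query_words(suite_string, k=WORD):
    """Substrings of length k, or the whole string if it is shorter."""
    if len(suite_string) <= k:
        return {suite_string}
    return {suite_string[i:i + k]
            for i in range(0, len(suite_string) - k + 1, STEP)}


def build_index(training):
    """Map every substring of up to WORD characters to its motif classes.

    A dictionary is used here in place of a suffix tree (assumption made
    for brevity); it answers the same exact-match queries.
    """
    index = defaultdict(set)
    for label, s in training:
        for i in range(0, len(s), STEP):
            for j in range(i + STEP, min(len(s), i + WORD) + 1, STEP):
                index[s[i:j]].add(label)
    return index


def evaluate(index, test):
    """Per-class sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    classes = {label for label, _ in test}
    result = {}
    for c in sorted(classes):
        tp = fn = fp = tn = 0
        for label, s in test:
            hit = any(c in index.get(w, ()) for w in query_words(s))
            if label == c:
                tp, fn = tp + hit, fn + (not hit)
            else:
                fp, tn = fp + hit, tn + (not hit)
        sens = tp / (tp + fn) if tp + fn else None
        spec = tn / (tn + fp) if tn + fp else None
        result[c] = (sens, spec)
    return result
```

With a training set that contains only the '4b6p' variant of a turn class, a test motif '1a3a1g9a1a' of the same class is counted as a false negative, which mirrors the effect described above for the '180 degree turn' class.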
3.2. Similarities among tRNA
Next, a set of 102 tRNA structures with a well-defined backbone structure was examined. Because all tRNA structures have a highly conserved tertiary structure, one would expect this to be represented in the suite strings as well. In the TRNA dataset, several suites are over-represented compared to the RNADB05 and HIRES sets (particularly '6d', '2g', '7d', '1£', '1c' and '11'). These
can be found in corresponding positions of most tRNAs. We have locally aligned several D-loops from tRNA structures with the corresponding suite strings in Figure 3. While each backbone follows the loop along the same path, there are several small differences in the suite codes. These include local variants, often replacing one suite by one close to it in the 7D dihedral space (e.g. the '1a'-'1L' and '1m'-'1[' exchanges). The structures are also occasionally interrupted by outlier suites. These outliers are visible, but hardly distinguishable in the visualization. They do not alter the direction of the backbone and by no means disrupt the loop structure. Rather, it seems that many of them are results of improper refinement or low structure quality, as high-resolution structures such as PDB code 1ehz and PDB code 1b23 are less affected by this. One important conclusion from this is that the suite codes are a very detailed description of tRNA backbone structure. They are apparently not suitable for describing a well-defined structure such as the D-loop in a general and unambiguous way. For the same loop trace, many combinations of suites are possible.
Fig. 3. The backbone of the dihydrouridine loops from the tRNA structures with PDB codes 1b23, 1efwC, 1gts, 1qf6, and 1qrs, superimposed by their backbone atoms. The labels indicate the residue numbers. The suite codes of the dihydrouridine loops are described in the table on the right. Outlier suites are underlined; valid but singleton suite codes at a given position are highlighted in bold.
Another observation is that up to half of the D-loop suites are of the '1a' type, which was described by [17] as the conformer forming 'A-form helices'. The D-loop contains a noncanonical base pair between residues 54 and 58, and two adjacent GC base pairs (53-61 and 52-62). But apart from that, many of the bases are involved in tertiary stacking (57, 58) and base pairing (59, 60) interactions. In total, the D-loop stem is more than a simple helix, showing that the abundant 1a suite can accommodate different structural roles. Although it was not attempted to align all structures explicitly, this seems feasible from these observations, and can be expected to result in a consensus alignment
of suites. A more detailed analysis could be used to identify individual conformations of tRNA at a high level of detail. An all-against-all search of subsequences of all tRNA suite strings was performed using the suffix array and the n-gram algorithm, as described in section 2.3. In Table 2, the numbers of hits found for different word lengths are given.

Table 2. Results of the all-against-all search in the TRNA, RNADB05, and HIRES datasets using the n-gram approach. The "hits" columns indicate how many exactly matching n-grams were found for the given word length. The "score" columns give the average score for these hits. The score is calculated as the sum of the inverse frequencies from Table 1 for the matching n-gram.

n-gram    TRNA              RNADB05           HIRES
length    hits     score    hits     score    hits     score
4         6824     5.4      27978    6.1      10543    3.7
6         6732     13.0     22543    10.6     10111    7.1
8         6386     19.1     17674    14.9     8917     10.9
10        5381     24.5     13657    16.6     6497     16.5
12        3812     38.1     10436    20.4     4823     20.8
14        2817     60.9     6504     30.7     3321     34.3
16        1990     96.0     3554     45.2     2683     46.8
18        1542     140.3    2376     62.0     2001     59.2
20        1306     175.4    1443     86.7     1283     86.2
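The n-gram search and its scoring can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the published implementation: a dictionary of n-grams replaces the suffix array, n is counted in characters (two per suite), and the inverse frequency of a suite is taken as the total suite count divided by the count of that suite within the searched dataset, since Table 1 is not reproduced here. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def suites(s):
    """Split a suite string into its two-character suite codes."""
    return [s[i:i + 2] for i in range(0, len(s), 2)]


def inverse_frequencies(strings):
    """Inverse relative frequency of every suite in the dataset (assumed)."""
    counts = Counter(x for s in strings.values() for x in suites(s))
    total = sum(counts.values())
    return {suite: total / n for suite, n in counts.items()}


def ngram_index(strings, n):
    """Map each n-gram (n in characters) to the structures containing it."""
    index = defaultdict(set)
    for name, s in strings.items():
        for i in range(0, len(s) - n + 1, 2):
            index[s[i:i + n]].add(name)
    return index


def all_against_all(strings, n):
    """Return {(query, target): [score of each matching query n-gram]}."""
    inv = inverse_frequencies(strings)
    index = ngram_index(strings, n)
    hits = defaultdict(list)
    for query, s in strings.items():
        for i in range(0, len(s) - n + 1, 2):
            gram = s[i:i + n]
            # rare suites contribute more, so '1a' repeats score low
            score = sum(inv[x] for x in suites(gram))
            for target in index[gram]:
                if target != query:
                    hits[(query, target)].append(score)
    return hits
```

Because hits are counted per n-gram position of the query, the result for (query, target) generally differs from that for (target, query), so a score matrix built this way need not be symmetric.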
The tRNA dataset is heterogeneous enough that, on average, only 69 other structures contain a sufficient number of matching n-grams. For the structures found, however, the number of words within one hit is high. With increasing word length, the number of hit structures decreases continuously. This is expected, as it gets increasingly difficult to find a longer word in the set of suite strings, because each of the occasional variations will disrupt the search for a local match. The number of words found within a structure drops correspondingly at first, but starts to rise again at a word length of 16 (data not shown). This observation can be explained by the fact that these hits occur only in a few but highly similar tRNA structures, where little or no variation occurs. We therefore conclude that a word size of 12 or 14 is optimal to find similarities within the set with as little background noise as possible, while not restricting the search to almost-identical structures. The outcome of the all-against-all search has been visualized in Figure 4 (TRNA depicted left). There, the normalized number of word hits for a given pair of structures is plotted. This indicates that an overall level of similarity exists between most pairs of tRNAs. The bright spots result from a small group of highly similar tRNA structures (the ones still remaining with word size 20). The dark regions (the lines at 31, and several between 56 and 68) are structures with very low similarity. The structures in this region (among others, PDB codes 1y14, 2ow8, 2v46, 3tra) were examined more closely. It turned out that these contain a much higher proportion
(up to 40%) of outlier and erroneous suite codes. Three of the examples here are structures of tRNAs bound to ribosomes, with resolutions of 3.7 Å or worse. The fourth (PDB code 3tra) is a free tRNA, but it has also been determined at an inferior resolution. This clearly shows that the suite nomenclature is of very limited use for structures that are not of high resolution.
Fig. 4. Scores of the all-against-all search in the a) TRNA (left), b) RNADB05 (middle), and c) HIRES (right) datasets. On each axis, the structures used are sorted according to their PDB code. The color indicates the score found for a particular structure-structure pair. The scaling was chosen such that dark areas correspond to repeating '1a' matches. The higher the score, the more uncommon suites a particular hit contains. The results shown here are for n-grams of length 12.
3.3. Similarities in the representative RNA sets
To assess whether these observations are meaningful, we compared both the 107 high-resolution structures and the 254 structures from the RNADB05 set. The number of hits found is given in Table 2. The corresponding similarity maps are depicted in Figure 4. First, some of the suite strings in the datasets were too short to match anything (empty rows/columns and an interrupted diagonal in the heat map). Also, both the HIRES and RNADB05 datasets contained a number of sequences with trivial structures, consisting of '1a' repeats and not much more. The scoring also depends on the length of the query string, and therefore the matrices are not necessarily symmetric. In Figure 4, it is clearly visible that the overall number of structures in RNADB05 and HIRES with detected similarities drops more sharply than in the TRNA set. The total number of hits changes in the same way. Even though the RNADB05 set is larger, only a few hit structures remain there at word size 20 (see also Table 2). One reason is that the average size of the structures in both reference datasets is smaller, as they contain many hairpin loops and other short RNAs. In both reference sets, the number of A-form helical stems (repeating regions consisting of '1a' suites) is higher, and they are practically excluded from the evaluation by the scoring function. This leaves only a fraction of hits in the reference sets compared to the tRNA set. In tRNA, not only does a higher number of hits exist, but the hits are also less random because they consist of less frequently occurring suites. This shows that the similarity among tRNAs is non-random, which can be taken as a proof of concept for the method. One structure in the RNADB05 set, rr0082H09, the 23S subunit of the ribosome, was matched by almost every other structure from this database. The structural variety in this single structure easily matches that of the remaining dataset taken together, and any motif found somewhere else is probably found there as well (see the white vertical line in Figure 4 for the RNADB05 dataset). Interestingly, when searching for a set of local RNA structures other than helical stems with either of the methods, we find non-homologous hits. This works for: a) an internal loop of the SRP and the ribosomal SSU, b) a biotin-binding pseudoknot and the tRNA, and c) a tRNA and the E-loop from 5S RNA.
4. Discussion
Geometrically, the suite representation does not cover variations that could occur in the bond lengths and flat angles of the RNA backbone. While bond lengths have a very narrow distribution throughout all structure files, bond angles show significant variation. This means that there is a degree of freedom that makes it impossible to rebuild RNA structures from a string, even if the suite nomenclature determined the dihedrals with perfect precision. There are two obvious possibilities to resolve this: (1) Encode the flat angles in a similar way as the suites. (2) Encode base-base interactions in the string in order to constrain the structure, and use a 3D modeling procedure subsequently. We believe that the second method is more promising, because it would include those interactions that shape the function of RNA instead of restricting the structure of RNA to the backbone alone. Such a reconstruction of structures from a descriptive grammar (not string-based) has already been demonstrated in [15]. Another implication of this approach is that an RNA region without further constraints may be structurally flexible. Therefore, the second approach would indirectly encode the flexibility. Having a rapid method for string-based motif recognition has a number of potential applications. First, it could be used to systematically find frequently occurring backbone motifs in RNA structures, as has been demonstrated here. Further, it can be used to sample large numbers of backbone conformations in order to generate native-like RNA backbones which could be modeled subsequently. Finally, it allows on-the-fly evaluation of RNA models which are generated during manual structure modeling or automatic refinement. The combination of this technique with more
elaborate string representations would bring further improvement. We therefore think it is possible to accurately re-model the structure of RNA from a string representation by including additional structural features like base pairs, base stacking, or even tertiary interactions with energy minimization instead of extensive probing of the local conformational space. The η-θ binning approach was shown to produce too many different local conformations for effective substring matching. One could argue that by decreasing the number of bins, the matching could be improved. However, it has been shown earlier that the pseudotorsion angles contain specific regions that are characteristic of some structural motifs [6]. Decreasing the number of bins would ignore these and therefore be hopelessly inaccurate. Therefore, either explicit clusters in the pseudotorsion space would have to be defined, or string matching techniques allowing for more inexact matches than the current suffix array would be necessary. We emphasize that a fuzzier search method could improve the usefulness of the suite codes as well. In particular, this could eliminate the adverse effects of the occasionally occurring erroneous or undefined suites. Practically, this could be implemented as a classical similarity matrix between the suite codes, and as a starting point its values could simply be based on a normalized 7D distance between the 46 suite clusters. Given the performance of the suffix array, the analysis presented here could easily be extended to the entire NDB [2]. Identifying structures that should be expected to be similar (e.g. based on their function) is more challenging if one does not want to rely on sequence similarity alone.
5. Conclusions
This work presented the first approach that uses an indexing technique to scan the structural space of RNA. The indexing was implemented using suite codes and an η-θ binning approach and tested on four distinct datasets. It could be shown that this approach can be used to rapidly identify similar substructures. This has applications not only for querying the RNA space but also for the modeling of RNAs, by rapidly predicting possible conformations and, in turn, by on-the-fly evaluation of proposed RNA models regarding structural and functional similarities. All datasets as well as the source code are freely available from http://suiterna.sourceforge.net. We hope this will be useful for the community and are looking forward to receiving feedback.
Acknowledgements
This effort is supported by DFG SFB-449, Deutsche Krebshilfe, DFG (Deutsche Forschungsgemeinschaft) International Research Training Group (IRTG) on "Genomics and Systems Biology of Molecular Networks" (GRK1360) and the 6th Marie Curie EU Research Training Network "DNA Enzymes", grant no. MRTN-CT-2005-019566. Without the use of free and/or open source software this effort would not
have been possible.
References
[1] Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C., Murzin, A.G., Data growth and its impact on the SCOP database: new developments. Nucleic Acids Research, 36(Database issue):419-425, January 2008.
[2] Berman, H.M., Westbrook, J., Feng, Z., Iype, L., Schneider, B., and Zardecki, C., The Nucleic Acid Database. Acta Crystallographica Section D, 58(6 Part 1):889-898, June 2002.
[3] Chang, Y.F., Huang, Y.L., and Lu, C.L., SARSA: a web tool for structural alignment of RNA using a structural alphabet. Nucleic Acids Research, 36(Web Server issue):19-24, May 2008.
[4] Dowell, R.D. and Eddy, S.R., Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7:400+, September 2006.
[5] Dror, O., Nussinov, R., and Wolfson, H.J., The ARTS web server for aligning RNA tertiary structures. Nucleic Acids Research, 34(Web Server issue), July 2006.
[6] Duarte, C.M. and Pyle, A.M., Stepping through an RNA structure: a novel approach to conformational analysis. Journal of Molecular Biology, 284(5):1465-1478, December 1998.
[7] Giegerich, R. and Kurtz, S., From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction. Algorithmica, 19(3):331-353, November 1997.
[8] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997.
[9] Hofacker, I.L., Vienna RNA secondary structure server. Nucleic Acids Research, 31(13):3429-3431, July 2003.
[10] Hofacker, I.L., Bernhart, S.H., and Stadler, P.F., Alignment of RNA base pairing probability matrices. Bioinformatics, 20(14):2222-2227, September 2004.
[11] Leontis, N.B., Altman, R.B., Berman, H.M., Brenner, S.E., Brown, J.W., Engelke, D.R., Harvey, S.C., Holbrook, S.R., Jossinet, F., Lewis, S.E., Major, F., Mathews, D.H., Richardson, J.S., Williamson, J.R., and Westhof, E., The RNA Ontology Consortium: an open invitation to the RNA community. RNA, 12(4):533-541, April 2006.
[12] Lescoute, A., Leontis, N.B., Massire, C., and Westhof, E., Recurrent structural RNA motifs, Isostericity Matrices and sequence alignments. Nucleic Acids Research, 33(8):2395-2409, 2005.
[13] Lescoute, A. and Westhof, E., The interaction networks of structured RNAs. Nucleic Acids Research, 34(22):6587-6604, December 2006.
[14] Murray, L.J.W., Richardson, J.S., Arendall, W.B., III, and Richardson, D.C., RNA backbone rotamers - finding your way in seven dimensions. Biochemical Society Transactions, pages 485-487, 2005.
[15] Parisien, M. and Major, F., The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature, 452(7183):51-55, 2008.
[16] Reeder, J., Höchsmann, M., Rehmsmeier, M., Voss, B., and Giegerich, R., Beyond Mfold: Recent advances in RNA bioinformatics. J Biotechnol, March 2006.
[17] Richardson, J.S., Schneider, B., Murray, L.W., Kapral, G.J., Immormino, R.M., Headd, J.J., Richardson, D.C., Ham, D., Hershkovits, E., Williams, L.D., Keating, K.S., Pyle, A.M., Micallef, D., Westbrook, J., Berman, H.M., RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium contribution). RNA, 14(3):465-481, March 2008.
[18] Tamura, M., Hendrix, D.K., Klosterman, P.S., Schimmelman, N.R., Brenner, S.E., and Holbrook, S.R., SCOR: Structural Classification of RNA, version 2.0. Nucleic Acids Research, 32(Database issue), January 2004.
[19] Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H., and Westhof, E., Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Research, 31(13):3450-3460, July 2003.
THE RELATIONSHIP BETWEEN FINE SCALE DNA STRUCTURE, GC CONTENT, AND FUNCTIONAL ELEMENTS IN 1% OF THE HUMAN GENOME
STEPHEN C. J. PARKER (1) [email protected]
ELLIOTT H. MARGULIES (2) [email protected]
THOMAS D. TULLIUS (1, 3) [email protected]
(1) Graduate Program in Bioinformatics, Boston University, Boston MA 02215, U.S.A.
(2) National Human Genome Research Institute, National Institutes of Health, Bethesda MD 20892, U.S.A.
(3) Department of Chemistry, Boston University, Boston MA 02215, U.S.A.
GC content has been shown to be an important aspect of human genomic function. Extending beyond the scope of GC content alone, there is a class of regions in the genome that have especially high GC content and are enriched for the CG dinucleotide, called CpG islands. CpG islands have been linked to biologically functional genomic elements. DNA structure also contributes to biological function. Recent studies found that some DNA structural properties are correlated with CpG island functionality [5, 14]. Here, we use hydroxyl radical cleavage patterns as a measure of DNA structure, to explore the relationship between GC content and fine-scale DNA structure. We show that there is a positive correlation between GC content and the solvent-accessible structural properties of a DNA sequence, and that the strength of this correlation decreases as genomic resolution increases. We demonstrate that regions of the genome that have highly solvent-accessible DNA structure tend to overlap functional genomic elements. Our results suggest that fine-scale DNA structural properties that are encoded in the genome are important for biological function, and that the highly solvent-accessible nature of high GC content regions and some CpG islands may account for some of their functional properties.
Keywords: DNA structure; GC content; CpG islands; hydroxyl radical cleavage; functional element; human genome
1. Introduction
GC content, the fraction of G or C nucleotides within a given window, is variable across the human genome [17, 36]. This observed heterogeneity in sequence composition has been implicated as a marker for some functional genomic regions. One example of this is CpG islands, which are regions of the genome characterized by high GC content and enrichment of the CG dinucleotide [11]. CpG islands have been linked to many regulatory processes [7, 18, 24, 33, 37-39]. Beyond the primary order of nucleotides in a genome that is used to define GC content and CpG islands, the local structural profile of DNA has been implicated in a number of biological processes. Recent studies suggest that DNA structure is important for some of the same processes as CpG islands: namely DNA-protein interactions [20], promoter function [1, 29], epigenetically controlled gene regulation [4, 23, 32, 34, 40],
and DNase I hypersensitivity [14]. However, the precise relationship between GC content, fine-scale DNA structure, and genome function remains unclear. A critical first step in assessing this relationship is the ability to predict the local DNA structural profile for genomic sequences. Hydroxyl radical cleavage patterns of DNA have been used to study structural properties for a wide variety of sequences [13, 19, 30]. The cleavage pattern of naked DNA is a reflection of an important structural parameter, the solvent-accessible surface area of the DNA backbone [2]. The cleavage pattern thus provides a high-resolution quantitative measure of the shape of the DNA backbone and how it varies with respect to its sequence. We have recently shown that using a database of experimentally determined hydroxyl radical cleavage patterns, the cleavage pattern of any DNA sequence can be predicted with a high degree of accuracy [13]. Although GC content has recently been implicated in defining hydroxyl radical cleavage patterns of DNA [35], this analysis was conducted at a relatively low genomic resolution of 333 base pairs. Single-nucleotide, genome-scale DNA structure predictions are feasible [13], which makes exploring the relationship between GC content and fine-scale DNA structure possible. Since different DNA sequences can have similar local structural properties [10, 13], directly correlating GC content with DNA structure is an important experiment. Results from the ENCODE Pilot Project provide a rich resource for functional annotations in 1% of the human genome [3]. These developments facilitate the investigation of the relationship between GC content, DNA structure, and functional elements in this 1% of the human genome. Here, we compare GC content to DNA structure (measured as hydroxyl radical cleavage patterns) at various genomic resolutions, with an emphasis on fine-scale DNA structure.
We then measure the occurrence of significantly over-represented DNA structural motifs with known functional annotations. Our results show that GC content only weakly influences fine-scale DNA structure, and that local structural properties may be important in conferring biological functionality to genomic regions like CpG islands.
2. Materials and Methods
2.1. DNA sequence and functional annotation data sources
The DNA sequence for the NCBI build 36 (March 2006), hg18 version of the ENCODE regions within the human genome was downloaded from the UCSC genome browser (http://genome.ucsc.edu/ENCODE/) [21, 22]. We used the following functional annotations for comparisons with DNA sequence and structural features. All the annotations are available through the UCSC genome browser (see above), unless otherwise noted. For all analyses, the hg18 version of each annotation track was used.
• DNase I hypersensitive sites (DHSs) represent regions of open chromatin architecture where protein-DNA interactions occur. We used a union set of DHSs derived from the human GM06990 cell line, as described in [3, 14].
• Formaldehyde Assisted Isolation of Regulatory Elements (FAIRE) is an alternative method used to locate regions of open chromatin. FAIRE sites are enriched for regulatory elements [12].
• Promoters were defined as the region 2.5 kilobases upstream from gene start sites. We used the GENCODE [16] gene track to define genes.
• Ancestral Repeats (ARs) are mobile elements that inserted before the common ancestor of most mammals. They are thought to be neutrally evolving and are therefore typically used to represent nonfunctional regions of the human genome [9, 15, 28, 31, 41]. We used the AR regions defined in [3].
• CpG islands are regions of the human genome with high GC content and higher-than-expected CG dinucleotide density. We used the CpG islands track from the UCSC genome browser, which was constructed using the CpG island definition described in [11].
• Evolutionarily constrained regions are areas of the human genome that are under purifying selection against nucleotide changes. For this analysis we used the 'moderate track', a summary of regions identified by multiple sequence alignment and constraint detection algorithms, described in [3, 25].
• Transcription start sites used here are described in [3, 8].
• As a control, we constructed a 'random annotation' by randomly selecting 500 base pair intervals within the ENCODE regions. We repeated this process 1000 times to create the random annotation track used here. Since this annotation set was derived randomly, there should be no association with any given set of functional elements.
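The construction of the random annotation can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that each of the 1000 repetitions draws one 500 base pair interval, that regions are given as half-open (name, start, end) tuples, and that every valid start position is equally likely. The function name, the region names in the usage test, and the seed parameter are hypothetical.

```python
import random


def random_annotation(regions, n_intervals=1000, length=500, seed=0):
    """Draw fixed-length intervals uniformly from within the given regions.

    regions: list of (name, start, end) with half-open coordinates.
    Regions shorter than `length` cannot hold an interval and are skipped.
    """
    rng = random.Random(seed)
    eligible = [r for r in regions if r[2] - r[1] >= length]
    if not eligible:
        raise ValueError("no region is long enough for one interval")
    # weight each region by its number of valid interval start positions
    weights = [end - start - length + 1 for _, start, end in eligible]
    intervals = []
    for _ in range(n_intervals):
        name, start, end = rng.choices(eligible, weights=weights)[0]
        s = rng.randint(start, end - length)  # randint is inclusive
        intervals.append((name, s, s + length))
    return intervals
```

Overlapping draws are allowed in this sketch; whether the original track permitted overlaps is not stated in the text.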
2.2. Local DNA structure prediction and GC content analysis
We used predicted hydroxyl radical cleavage patterns as a measure of local DNA structure. Hydroxyl radical cleavage patterns were predicted using the Sliding Tetramer Window algorithm described in [13] for all the ENCODE regions. After the cleavage intensity at each base was predicted, we averaged the cleavage values within a window for all possible windows within the ENCODE regions. For GC content analysis we calculated the fraction of G or C bases within all possible windows of various sizes within the ENCODE regions. To calculate CpG density we counted the observed number of CG dinucleotides within the same windows.
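The windowed quantities described above can be computed with prefix sums, as in the following sketch. It assumes that the per-base cleavage intensities have already been predicted (the Sliding Tetramer Window algorithm of [13] is not reproduced here) and that windows slide by one base. The function names are hypothetical.

```python
def window_means(values, n):
    """Mean of every window of n consecutive values (step of one)."""
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    return [(prefix[i + n] - prefix[i]) / n
            for i in range(len(values) - n + 1)]


def gc_fraction(seq, n):
    """Fraction of G or C bases in every window of n bases."""
    flags = [1.0 if base in "GC" else 0.0 for base in seq.upper()]
    return window_means(flags, n)


def cpg_counts(seq, n):
    """Number of CG dinucleotides lying entirely within each window."""
    seq = seq.upper()
    prefix = [0]
    for i in range(len(seq) - 1):
        # flag is 1 when a CG dinucleotide starts at position i
        prefix.append(prefix[-1] + (1 if seq[i:i + 2] == "CG" else 0))
    # a window of n bases contains n - 1 dinucleotide start positions
    return [prefix[i + n - 1] - prefix[i]
            for i in range(len(seq) - n + 1)]
```

Calling `window_means` on the predicted cleavage values and `gc_fraction` on the sequence with the same n yields two lists of equal length that can be correlated directly.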
2.3. Annotation proximity and overlap statistics
To calculate the proximity of various windows to functional annotations we computed the distance, in base pairs, from the closest base in a given window to the closest base from the nearest element in the specified annotation. To calculate the observed overlap statistics between different annotations (for example, comparing the regions in annotation X to the regions in annotation Y), we first computed the fraction of regions in annotation X that overlap any region from annotation Y. We then constructed a null distribution of the fraction of expected overlaps by using the block bootstrap method described in [3]. We calculated the mean and standard deviation from the null distribution to assess the statistical significance of the observed overlap. This allowed us to determine if the regions in annotation X overlap the regions in annotation Y significantly more or less than random expectation.
3. Results
3.1. Correlation between GC content and local DNA structure
Given the data reported in [35] that show a high correlation between GC content and mean hydroxyl radical cleavage patterns at a window size of 333 base pairs, we first sought to reproduce and supplement these results. We computed the Pearson correlation between GC content and mean hydroxyl radical cleavage for windows of size N, where N = {2, 3, 4, 5, 10, 20, 50, 100, 333, 500, 1000, 10000}, in the ENCODE regions. We observe a positive correlation between the size of a window and the strength of the correlation between GC content and hydroxyl radical cleavage (Figure 1A). That is, while large windows have a high correlation between GC content and mean hydroxyl radical cleavage, small windows, which are a reflection of the fine-scale structure of DNA, do not. To determine if the above result is unique to the DNA in the ENCODE regions we randomized all of the ENCODE sequences. We used a first order Markov model trained on the real ENCODE sequences to preserve all dinucleotide frequencies. The random sequences follow the same correlation trend as the real ENCODE sequences (data not shown), which suggests that the observed correlations are an inherent property of DNA and not an artifact of the ENCODE sequences chosen for this analysis. We next focused on the relationship between CpG density and mean hydroxyl radical cleavage over windows of size N (Figure 1B). For equivalent values of N, the strength of the correlation between DNA structure and CpG density is less than for GC content (compare Figure 1B to Figure 1A).
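The two computational steps of this section, the Pearson correlation and the first order Markov randomization, can be sketched as below. This is an illustration under stated assumptions rather than the code used for the analysis: the Markov chain reproduces dinucleotide frequencies in expectation only, not exactly, and a base with no observed successor falls back to the single-base frequencies. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def train_markov(seq):
    """Count single bases and base-to-base transitions in a sequence."""
    seq = seq.upper()
    start = Counter(seq)
    trans = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1
    return start, trans


def sample_sequence(start, trans, length, seed=0):
    """Generate a random sequence from a first order Markov model."""
    if length <= 0:
        return ""
    rng = random.Random(seed)
    bases = sorted(start)
    out = rng.choices(bases, weights=[start[b] for b in bases])
    while len(out) < length:
        nxt = trans[out[-1]]
        if not nxt:
            # the last base was never followed by anything in training
            nxt = start
        options = sorted(nxt)
        out.append(rng.choices(options,
                               weights=[nxt[o] for o in options])[0])
    return "".join(out)
```

Repeating the windowed correlation on sequences drawn from `sample_sequence` gives the kind of control described above, in which composition is retained while the biological order of the sequence is lost.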
[Figure 1, panels A and B: correlation coefficient (vertical axis, 0 to 1) plotted against window size in bases (horizontal axis, 2 to 10000).]
Fig. 1. Pearson correlation coefficient (r) between GC content or CpG island density and hydroxyl radical cleavage at various genomic scales. A. Correlation between GC content and predicted hydroxyl radical cleavage. B. Correlation between CpG density and predicted hydroxyl radical cleavage. Note that all r-values are positive.
The above results demonstrate that the correlation between GC content or CpG density and DNA structure is variable depending on genomic scale. Importantly, when fine-scale DNA structure is considered the correlation with GC content or CpG density is low. Therefore nucleotide composition is not a good predictor of fine-scale DNA structural properties. To further demonstrate this point we focus the rest of this manuscript on an in-depth analysis of N = 10. We specifically select this scale because it represents about one turn of the DNA double helix and is the approximate size of a
transcription factor binding site [27], which should allow for biologically relevant interpretations of the results. Looking at the entire distribution of GC content and mean hydroxyl radical cleavage for a window of size 10 (Figure 2) clearly demonstrates that GC content is not a good predictor of fine-scale DNA structure. For example, it is possible to have windows with 0% GC content and higher mean hydroxyl radical cleavage than some windows with 100% GC content (Figure 2A). Examples of cleavage profiles for windows with different GC content are shown in Figure 2B-C.
3.2. High hydroxyl radical cleavage regions overlap with functional elements
We focused on the 10 base windows with the highest and lowest mean cleavage intensity. To do this, we calculated Z-scores for all windows using the observed mean cleavage intensity over the window and the mean and standard deviation for all windows in the random sequence distribution we constructed (as described in section 3.1). We used a Z-score threshold of |Z| = 3.09, which is equivalent to a p-value of p = 0.001, to extract windows with the highest and lowest mean cleavage values. This process resulted in 43,096 high cleavage windows and 306,089 low cleavage windows. Overlapping windows for each set were merged so that disjoint and contiguous genomic regions are present in the two resulting annotation sets. This merging process resulted in 14,914 high cleavage regions and 57,307 low cleavage regions. To determine if the resulting high and low hydroxyl radical cleavage regions occur near biologically active areas of the genome we measured their proximity to annotated transcription start sites (Figure 3). We observe that the low cleavage regions cluster near transcription start sites, and the high cleavage regions do so to a more pronounced extent, suggesting that these particular DNA structural features may be associated with some aspect of gene regulation. To further examine the possibility that low or high cleavage regions are associated with biological function, we employed a more rigorous statistical test. We used the block bootstrap method [3] to measure the statistical confidence associated with how often low or high cleavage regions overlap a number of functional annotations (see section 2.3 for an overview of this method and section 2.1 for an explanation of each annotation). Figure 4 shows the results of this analysis. The observed overlap of low and high cleavage regions with a random annotation (see methods) is the same as random expectation.
Low and high cleavage regions overlap ancestral repeats significantly less than random. The fraction of high cleavage regions that overlap with promoters and FAIRE sites is statistically significant (p < 0.05). This result suggests that high cleavage regions may be associated with functional regions of the genome. The observation that 72% of CpG islands overlap high cleavage regions is highly statistically significant (p < 10^-27).
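The window selection and merging steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, windows are represented as half-open (start, end) intervals, and windows that merely abut are kept separate because the text only mentions merging overlapping windows.

```python
def zscores(window_means, null_mean, null_sd):
    """Z-score of each window mean against the random-sequence distribution."""
    return [(m - null_mean) / null_sd for m in window_means]


def extreme_windows(starts, z, threshold=3.09, width=10):
    """Split windows into high (Z >= threshold) and low (Z <= -threshold) sets."""
    high = [(s, s + width) for s, v in zip(starts, z) if v >= threshold]
    low = [(s, s + width) for s, v in zip(starts, z) if v <= -threshold]
    return high, low


def merge_overlapping(windows):
    """Merge overlapping half-open windows into disjoint regions."""
    merged = []
    for start, end in sorted(windows):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy data (hypothetical window starts and Z-scores).
starts = [0, 1, 2, 50, 51, 90]
z = [3.5, 3.2, 1.0, -3.3, -4.0, 3.1]
high, low = extreme_windows(starts, z)
print(merge_overlapping(high))
print(merge_overlapping(low))
```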
Fine Scale DNA Structure, GC Content, and Functional Elements
[Figure 2 plot. Panel A: hydroxyl radical cleavage against GC percent in 10 base windows (x-axis from 0 to 100). Panel B: cleavage profiles for two 90% GC windows, GCGCGTGCGC and CCCCCCCCCA; x-axis, nucleotide position 1 to 10. Panel C: cleavage profiles for two 10% GC windows, TTATATGTAT and AAAAAAAAAC; x-axis, nucleotide position 1 to 10.]
Fig. 2. Hydroxyl radical cleavage in 10 base windows binned by GC content. A. The correlation between GC percent and hydroxyl radical cleavage is not perfect. Windows with high GC content can have low cleavage, and windows with low GC content can have high cleavage. Example hydroxyl radical cleavage profiles for 10 base windows with high (B) and low (C) GC content.
S. C. J. Parker, E. H. Margulies & T. D. Tullius
[Figure 3 plot: histogram comparing low cleavage regions and high cleavage regions; x-axis, distance to nearest transcription start site (bases), from 0 to 50,000; y-axis ticks from 0.00 to 0.35.]
Fig. 3. Proximity analysis of high and low cleavage regions relative to annotated transcription start sites. High and low cleavage regions tend to occur near annotated transcription start sites, and this effect is more pronounced with high cleavage regions.
[Figure 4 plot: bar chart for Low (L) cleavage regions and High (H) cleavage regions across the annotation categories Random, Ancestral repeats, Promoters, CpG islands, DHS, and FAIRE; y-axis ticks from 0.0 to 1.0.]
Fig. 4. Low and high cleavage region overlaps with functional annotations. Black points represent the mean of a null distribution estimated using the block-bootstrap method (see section 2.3 for a summary) and error bars represent +/- one standard deviation. DHS = DNase I Hypersensitive Sites; FAIRE = Formaldehyde Assisted Isolation of Regulatory Elements sites.
3.3. CpG islands overlapping high cleavage regions are more likely to be functional
The CpG islands used in this analysis all meet a common criterion that was developed using primary DNA sequence-based metrics. We observe that most, but not all, CpG islands overlap high cleavage windows (Figure 4), suggesting that DNA structural features can be used to partition CpG islands into different groups. Given that high cleavage windows have a statistically significant association with promoters and FAIRE sites (Figure 4), the set of CpG islands that overlaps these windows may have enhanced functional tendencies compared with the CpG islands that do not overlap high cleavage regions. To test this hypothesis specifically, we first partitioned all CpG islands within ENCODE into two groups: 1) CpG islands that do not overlap high cleavage regions, and 2) CpG islands that overlap high cleavage regions (Figure 5A). We then performed a statistical overlap analysis with these two groups relative to other annotations and compared the overlaps between groups (Figure 5B). All CpG islands have a statistically significant association with promoters, DNase I hypersensitive sites, and evolutionarily constrained regions. However, the group 2 CpG islands overlapped significantly more of each annotation than the group 1 CpG islands (compare open bars to grey bars in Figure 5B). These results suggest that CpG islands that overlap high cleavage regions are more likely to be functional.
4. Discussion
We have performed a general assessment of the relationship between GC content, DNA structure (as measured by hydroxyl radical cleavage patterns), and genome function. Our results demonstrate that the correlation between GC content and DNA structure varies depending on the scale of the comparison. At low resolution the two variables are correlated, but the strength of this correlation decreases as resolution increases. When a biologically meaningful scale is considered (for example, 10 bases represents one turn of the DNA double helix and is the approximate size of a transcription factor binding site [27]), GC content is not a strong predictor of local DNA structure. Even at scales greater than 10 bases, CpG density does not predict local overall structure well. We found more low hydroxyl radical cleavage regions than high cleavage regions in the ENCODE regions. However, the high cleavage regions are more significantly associated with functional genomic elements like promoters, FAIRE sites, and CpG islands. The finding that not all CpG islands are the same with respect to fine-scale DNA structure, despite being equivalent under a common primary DNA sequence-based definition, is particularly interesting. We previously reported a common hydroxyl radical cleavage pattern found among DNase I hypersensitive sites (DHSs) that occurs more often in DHSs overlapping CpG islands compared to DHSs that do not overlap CpG islands [14]. The results presented here suggest that fine-scale local DNA structural
motifs may be associated with differentiating CpG islands that have greater functional potential. The set of CpG islands overlapping high cleavage regions occurs within evolutionarily constrained regions of the human genome significantly more often than does the set of CpG islands that do not overlap high cleavage regions (Figure 5B). It is interesting to speculate that local DNA structural features that distinguish the former set of CpG islands can act as a substrate for natural selection. The above result, along with recent literature perspectives [6, 26], suggests this may be a possibility. Collectively, the results reported here illustrate the importance of considering local DNA structure when investigating the relationship between genomic sequence and the biological functionality encoded therein.
[Figure 5 plot. Panel A: CpG island and high-cleavage region overlaps for all CpG islands (504); high-cleavage overlap 72%, no high-cleavage overlap 28%. Panel B: bar chart comparing all CpG islands with CpG islands overlapping high cleavage regions across the categories Promoters, DHS, and Evolutionarily constrained; y-axis ticks from 0.0 to 1.0.]
Fig. 5. CpG islands with high-cleavage windows are more likely to be functional. A. 72% of annotated CpG islands overlap at least one high-cleavage window. B. CpG islands containing high-cleavage windows overlap significantly more promoters, DHSs, and evolutionarily constrained regions compared to CpG islands that do not contain high-cleavage windows (* p < 0.05 for grey bar compared to open bar in the same category; Fisher exact test). Black points and error bars are as described in Figure 4. DHS = DNase I hypersensitive sites.
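The Fisher exact test named in the caption compares, for one annotation, how many CpG islands in each group overlap it. A minimal sketch of a two-sided test on a 2x2 table is given below. The function name is hypothetical, the two-sided form is an assumption (the caption does not state sidedness), and the counts in the usage example are invented for illustration only; they are not the paper's data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two CpG island groups; columns are "overlaps the
    annotation" and "does not overlap".
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x overlaps in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: 250 of 363 islands in one group and 60 of 141 in
# the other overlap a given annotation.
print(fisher_exact_two_sided(250, 113, 60, 81))
```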
Acknowledgments
We would like to thank Eric Bishop for providing code to calculate the proximity of regions to known transcription start sites and for critical evaluation of the manuscript. We would like to thank Gayle McEwen for providing code to calculate region overlap statistics using the block-bootstrap method. This work was funded by a grant from the National Human Genome Research Institute of the National Institutes of Health (R01 HG003541) to TDT. EHM was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. SCJP was supported by a Ford Foundation Dissertation Fellowship.
References
[1] Abeel, T., Saeys, Y., Bonnet, E., et al., Generic eukaryotic core promoter prediction using structural features of DNA, Genome Res., 18(2):310-323, 2008.
[2] Balasubramanian, B., Pogozelski, W.K. and Tullius, T.D., DNA strand breaking by the hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the DNA backbone, Proc Natl Acad Sci USA, 95(17):9738-43, 1998.
[3] Birney, E., Stamatoyannopoulos, J.A., Dutta, A., et al., Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature, 447(7146):799-816, 2007.
[4] Bock, C., Paulsen, M., Tierling, S., et al., CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure, PLoS Genetics, 2(3):e26, 2006.
[5] Bock, C., Walter, J., Paulsen, M., et al., CpG Island Mapping by Epigenome Prediction, PLoS Computational Biology, 3(6):e110, 2007.
[6] Cooper, G.M. and Brown, C.D., Qualifying the relationship between sequence conservation and molecular function, Genome Res., 18(2):201-205, 2008.
[7] Davuluri, R.V., Grosse, I., and Zhang, M.Q., Computational identification of promoters and first exons in the human genome, Nat Genet, 29(4):412-417, 2001.
[8] Denoeud, F., Kapranov, P., Ucla, C., et al., Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions, Genome Res., 17(6):746-759, 2007.
[9] Ellegren, H., Smith, N.G.C., and Webster, M.T., Mutation rate variation in the mammalian genome, Current Opinion in Genetics & Development, 13(6):562-568, 2003.
[10] Gardiner, E.J., Hunter, C.A., Lu, X.J., et al., A structural similarity analysis of double-helical DNA, J Mol Biol, 343(4):879-89, 2004.
[11] Gardiner-Garden, M. and Frommer, M., CpG Islands in vertebrate genomes, Journal of Molecular Biology, 196(2):261-282, 1987.
[12] Giresi, P.G., Kim, J., McDaniell, R.M., et al., FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin, Genome Res., 17(6):877-85, 2007.
[13] Greenbaum, J.A., Pang, B., and Tullius, T.D., Construction of a genome-scale structural map at single-nucleotide resolution, Genome Res., 17(6):947-53, 2007.
[14] Greenbaum, J.A., Parker, S.C.J., and Tullius, T.D., Detection of DNA structural motifs in functional genomic elements, Genome Res., 17(6):940-6, 2007.
[15] Hardison, R.C., Roskin, K.M., Yang, S., et al., Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution, Genome Res., 13(1):13-26, 2003.
[16] Harrow, J., Denoeud, F., Frankish, A., et al., GENCODE: producing a reference annotation for ENCODE, Genome Biology, 7(Suppl 1):S4, 2006.
[17] International Human Genome Sequencing Consortium, Initial sequencing and analysis of the human genome, Nature, 409(6822):860-921, 2001.
[18] Ioshikhes, I.P. and Zhang, M.Q., Large-scale human promoter mapping using CpG islands, Nat Genet, 26(1):61-63, 2000.
[19] Jain, S.S. and Tullius, T.D., Footprinting protein-DNA complexes using the hydroxyl radical, Nat. Protocols, 3(6):1092-1100, 2008.
[20] Joshi, R., Passner, J.M., Rohs, R., et al., Functional specificity of a Hox protein mediated by the recognition of minor groove structure, Cell, 131(3):530-43, 2007.
[21] Karolchik, D., Baertsch, R., Diekhans, M., et al., The UCSC genome browser database, Nucl. Acids Res., 31(1):51-54, 2003.
[22] Kent, W.J., Sugnet, C.W., Furey, T.S., et al., The human genome browser at UCSC, Genome Res., 12(6):996-1006, 2002.
[23] Kogan, S.B., Kato, M., Kiyama, R., et al., Sequence structure of human nucleosome DNA, Journal of Biomolecular Structure & Dynamics, 24(1):43-8, 2006.
[24] Kudla, G., Lipinski, L., Caffin, F., et al., High guanine and cytosine content increases mRNA levels in mammalian cells, PLoS Biology, 4(6):e180, 2006.
[25] Margulies, E.H., Cooper, G.M., Asimenos, G., et al., Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome, Genome Res., 17(6):760-774, 2007.
[26] Margulies, E.H. and Birney, E., Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes, Nat Rev Genet, 9(4):303-313, 2008.
[27] Maston, G.A., Evans, S.K., and Green, M.R., Transcriptional regulatory elements in the human genome, Annu Rev Genomics Hum Genet, 2006.
[28] Mouse Genome Sequencing Consortium, Initial sequencing and comparative analysis of the mouse genome, Nature, 420(6915):520-562, 2002.
[29] Pedersen, A.G., Baldi, P., Chauvin, Y., et al., DNA structure in human RNA polymerase II promoters, J Mol Biol, 281(4):663-73, 1998.
[30] Price, M.A. and Tullius, T.D., Using hydroxyl radical to probe DNA structure, Methods in Enzymology, 212:194-219, 1992.
[31] Rat Genome Sequencing Project Consortium, Genome sequence of the Brown Norway rat yields insights into mammalian evolution, Nature, 428(6982):493-521, 2004.
[32] Salih, F., Salih, B., Kogan, S., et al., Epigenetic Nucleosomes: Alu Sequences and CG as Nucleosome Positioning Element, Journal of Biomolecular Structure & Dynamics, 26(1):9-16, 2008.
[33] Sandelin, A., Carninci, P., Lenhard, B., et al., Mammalian RNA polymerase II core promoters: insights from genome-wide studies, Nat Rev Genet, 8(6):424-436, 2007.
[34] Segal, E., Fondufe-Mittendorf, Y., Chen, L., et al., A genomic code for nucleosome positioning, Nature, 442(7104):772-778, 2006.
[35] Thomas, D.J., Rosenbloom, K.R., Clawson, H., et al., The ENCODE Project at UC Santa Cruz, Nucl. Acids Res., 35(suppl_1):D663-667, 2007.
[36] Venter, J.C., Adams, M.D., Myers, E.W., et al., The Sequence of the Human Genome, Science, 291(5507):1304-1351, 2001.
[37] Vinogradov, A.E., Isochores and tissue-specificity, Nucl. Acids Res., 31(17):5212-5220, 2003.
[38] Vinogradov, A.E., Dualism of gene GC content and CpG pattern in regard to expression in the human genome: magnitude versus breadth, Trends in Genetics, 21(12):639-643, 2005.
[39] Vinogradov, A.E., Noncoding DNA, isochores and gene expression: nucleosome formation potential, Nucl. Acids Res., 33(2):559-563, 2005.
[40] Wanapirak, C., Kato, M., Onishi, Y., et al., Evolutionary conservation and functional synergism of curved DNA at the mouse epsilon- and other globin-gene promoters, J Mol Evol, 56(6):649-57, 2003.
[41] Yang, S., Smit, A.F., Schwartz, S., et al., Patterns of Insertions and Their Covariation With Substitutions in the Rat, Mouse, and Human Genomes, Genome Res., 14(4):517-527, 2004.
A NOVEL STRATEGY TO SEARCH CONSERVED TRANSCRIPTION FACTOR BINDING SITES AMONG COEXPRESSING GENES IN HUMAN
YOSUKE HATANAKA
MASAO NAGASAKI
TAKESHI OBAYASHI, KAZUYUKI NUMATA
RUI YAMAGUCHI, ANDRE FUJITA
SEIYA IMOTO, TEPPEI SHIMAMURA, YOSHINORI TAMADA, KENGO KINOSHITA
SATORU MIYANO, KENTA NAKAI
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1, Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
We report various transcription factor binding sites (TFBSs) conserved among co-expressed genes in human promoter regions, identified using expression and genomic data. Assuming that similar promoter structure induces similar transcriptional regulation, and hence a similar expression profile, we compared promoter structure between co-expressed genes. Comprehensive TFBS predictions were conducted for 19,777 human promoter regions around the transcription start sites (TSSs) given by DBTSS, and a promoter similarity search was conducted among the coexpressing gene sets provided by the newly developed COXPRESdb. Using a combination of Position Weight Matrix (PWM) motif prediction and the bootstrap method, we found that 7,313 genes have at least one statistically significant conserved TFBS. We also applied market basket analysis to look for combinatorial activity among those conserved TFBSs.
Keywords: co-expressed genes; position weight matrix; promoter structure similarity; conserved TFBS
1. Introduction
In the last decade, a massive amount of gene expression data from DNA microarray experiments and the complete genomic sequences of various organisms have become publicly available. Although the spatiotemporal regulatory mechanisms are still unclear, it is widely accepted that gene expression in higher eukaryotes depends heavily on the recognition of specific promoter sequences by transcriptional regulatory proteins. Transcription regulatory sites, and thus cis-regulatory regions, can be identified using high-throughput methods such as ChIP-chip experiments [1-3]. However, an estimated 2,000 transcription factors are encoded in the human genome [4-5], and many are likely to be expressed and to regulate target genes combinatorially under various conditions, which makes experimental identification of cis-regulatory regions difficult. Therefore, computational identification of TFBSs based on signatures of their presence in the genomic sequence [6-9] remains an attractive alternative.
Search Conserved Transcription Factor Binding Sites
In this paper, we combined genomic data and expression data to gain insight into gene regulatory mechanisms, using a combination of conventional methods and a newly developed database. In the general computational approach, all the genes of an organism are first clustered based on their expression patterns. The promoter regions of genes in the same expression-pattern group are then examined for common sequence motifs, namely transcriptional regulatory sites (transcription factor binding sites) that make these genes active or inactive. For the prediction of TFBSs, we first applied 505 vertebrate Position Weight Matrices (PWMs), matrices of score values that give a weighted match, to promoter sequences covering 1,000 bp upstream and 200 bp downstream of the transcription start sites (TSSs) of 19,777 human genes. Following that, we gathered genes co-expressed in various conditions and cell cultures to look for TFBSs common to the coexpressing genes. The novelty of our approach is that, taking into account the limited structural flexibility of the transcription regulating machinery, we focused on common motifs located at a similar distance from the transcription start site.
2. Methods and Results
2.1. Coexpressing gene sets
The coexpressing gene sets for each human gene were downloaded from the coexpressing gene database COXPRESdb ver.7 (http://coxpresdb.hgc.jp). In COXPRESdb, the coexpression data are calculated from 4,401 Affymetrix GeneChip data (123 experiments) from NCBI GEO. Following RMA normalization of each experiment, genes were normalized by expression level within each microarray experiment. Then, all experiments were combined into one gene expression table, and the weighted Pearson Correlation Coefficients (PCCs) were calculated between genes to give the correlation rank. The recently developed COXPRESdb differs from other coexpressing gene databases in that it introduces a parameter, "Mutual Rank (MR)", derived from correlation rank values. In essence, the correlation ranks calculated from PCCs are asymmetric: the rank of gene B from gene A is not the same as the rank of gene A from gene B. Thus, rather than using the rank based on the PCC directly, the geometric mean of the two directional ranks, the Mutual Rank (MR), is introduced to give the best combination of coexpressing gene sets.
$$\mathrm{MR}(AB) = \sqrt{\mathrm{MR}(A \to B) \times \mathrm{MR}(B \to A)} \qquad (1)$$
We retrieved the coexpressing gene lists arranged in descending order of MR for each gene.
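The Mutual Rank of Eq. (1) can be sketched as follows. This is a minimal illustration, not the COXPRESdb implementation: the function names and the toy correlation values are hypothetical, and the correlation table is assumed to be a symmetric nested dictionary in which rank 1 is the most strongly correlated partner.

```python
from math import sqrt


def directional_ranks(corr, gene):
    """Rank every other gene by decreasing correlation with `gene` (1 = best)."""
    partners = sorted(corr[gene], key=lambda g: corr[gene][g], reverse=True)
    return {g: rank for rank, g in enumerate(partners, start=1)}


def mutual_rank(corr, a, b):
    """Geometric mean of the rank of B from A and the rank of A from B."""
    return sqrt(directional_ranks(corr, a)[b] * directional_ranks(corr, b)[a])


# Toy symmetric correlation table (hypothetical genes and values).
corr = {
    "A": {"B": 0.90, "C": 0.80, "D": 0.10},
    "B": {"A": 0.90, "C": 0.95, "D": 0.20},
    "C": {"A": 0.80, "B": 0.95, "D": 0.30},
    "D": {"A": 0.10, "B": 0.20, "C": 0.30},
}
print(mutual_rank(corr, "A", "B"))
```

In the toy table, B is the top-ranked partner of A, but A is only the second-ranked partner of B. The two directional ranks therefore differ, while the Mutual Rank is the same in both directions.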
Y. Hatanaka et al.
2.2. Promoter sequence
For the promoter sequences, we retrieved 1,000 bp upstream and 200 bp downstream of the TSSs from the UCSC hg18 assembly for 19,777 human genes. The TSS locations were retrieved from DBTSS v.6.0 (http://dbtss.hgc.jp), which is based on a unique collection of experimentally determined 5'-end sequences of full-length cDNAs.
2.3. Transcription factor binding site (TFBS) prediction
We collected 505 vertebrate position weight matrices (PWMs), equivalent to 313 transcription factors, registered in TRANSFAC v.8.3. To predict TFBSs, we used MATCH [11], which uses the Mann-Whitney U-test with a random gene list as a reference, to map all locations of predicted TFBSs on the human genomic sequences assembled above. The basic concept of the prediction is shown in Fig. 1. This mapping algorithm is highly error-prone, mainly producing false positive hits, because the known binding sites are short and sometimes degenerate. Thus, we adopted a cutoff (≥ 0.98) on the integrated value of the "matrix similarity score" and the "core similarity score" to minimize false positives. We then divided the genome into 50 bp regions and scored each region for the presence or absence of each PWM. We chose this region size because PWMs tend to produce large numbers of possible TFBSs in the genome; 50 bp regions are small enough to prevent most regions from containing most motifs. Also, experimental data from TRANSCompel (v.10.3) [12] show that over 99% of the distances between experimentally determined transcription factor binding sites are less than 100 bp. Therefore, a range of 50 bp is compatible with the size of known cis-regulatory regions and small enough to avoid including too many predicted TFBSs. Following the mapping of predicted TFBSs, we defined the matrix of genomic regions and the frequencies of the TFBSs contained in each region as the "TFBS location matrix", as shown in Fig. 2.
[Figure 1 schematic: a score matrix scanned along the genome from the 5' end.]
Fig. 1. An image of the score matrix mapping algorithm of MATCH. Regions with a match score ≥ 0.98 were selected.
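[Figure 2 table: example TFBS location matrix. Rows are transcription factors (TF1, TF2, TF3, ..., TF190); columns are 50 bp promoter regions running from (-999, -950], (-949, -900], ..., (-99, -50], (-49, 0], ..., (+101, +150], (+151, +200]; each entry in the example is 0 or 1.]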
Fig. 2. TFBS location matrix. The element in the matrix is the frequency of mapped TFBSs.
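Building such a matrix from a list of predicted hits can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the hit format (transcription factor name, position relative to the TSS) are assumptions, and the 24 bins follow the column layout of Fig. 2, covering positions -999 to +200 in 50 bp steps.

```python
REGION_START, REGION_END, BIN_SIZE = -999, 200, 50
N_BINS = (REGION_END - REGION_START + 1) // BIN_SIZE  # 24 regions of 50 bp


def location_matrix(hits, tf_names):
    """Count predicted TFBSs per transcription factor and 50 bp region.

    hits: iterable of (tf_name, position relative to the TSS); hits outside
    the promoter window or for unknown factors are ignored.
    """
    row_of = {tf: i for i, tf in enumerate(tf_names)}
    matrix = [[0] * N_BINS for _ in tf_names]
    for tf, pos in hits:
        if tf in row_of and REGION_START <= pos <= REGION_END:
            matrix[row_of[tf]][(pos - REGION_START) // BIN_SIZE] += 1
    return matrix


# Hypothetical hits for one gene.
hits = [("TF1", -30), ("TF2", -999), ("TF3", -920), ("TF3", 180), ("TF1", 500)]
for name, row in zip(["TF1", "TF2", "TF3"],
                     location_matrix(hits, ["TF1", "TF2", "TF3"])):
    print(name, row)
```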
2.4. Bootstrap method
Using the "TFBS location matrix" for each gene, we calculated the intensity of conservation of the predicted TFBSs in combination with the coexpression data. Because of the comparatively strict condition in the TFBS prediction, some "TFBS location matrices" were zero matrices; we excluded such genes from the following process, leaving 9,330 genes. For each gene, we selected the top 20 genes from COXPRESdb in ascending order of "Mutual Rank" as the group of highly co-expressed genes, or N (≤ 20) genes if some were missing for the above reason, because the expression similarity decreases rapidly after the top 20 genes [13]. Let $TF_{g,i,j}$ denote the element in the $i$th row and $j$th column of the "TFBS location matrix" for the $g$th gene, and let $TF^{\mathrm{coex}}_{c,i,j}$ represent the corresponding element of the "TFBS location matrix" for the $c$th of the N genes. Then $TF_{g,i,j}$ is transformed to the arithmetic average $x^{\mathrm{seed}}_{g,i,j}$ according to

$$x^{\mathrm{seed}}_{g,i,j} = \frac{\sum_{c=1}^{N} TF^{\mathrm{coex}}_{c,i,j} + TF_{g,i,j}}{N+1} \qquad (2)$$

where $x^{\mathrm{seed}}_{g,i,j}$ is interpreted as the "intensity of conservation". In order to evaluate the significance of the conservation, we applied a testing procedure that exploits a technique of the Bootstrap method. The testing procedure is as follows:

1. For N randomly selected genes, calculate the arithmetic average $x^{\mathrm{random}}_{g,i,j,k}$ repeatedly according to

$$x^{\mathrm{random}}_{g,i,j,k} = \frac{\sum_{c=1}^{N} TF^{\mathrm{random}}_{c,i,j} + TF_{g,i,j}}{N+1} \qquad (3)$$

where $k = 1, \ldots, 10{,}000$.

2. Arrange the $x^{\mathrm{random}}_{g,i,j,k}$ in ascending order, and place $x^{\mathrm{seed}}_{g,i,j}$ among them as

$$x^{\mathrm{random}}_{g,i,j,1} \le x^{\mathrm{random}}_{g,i,j,2} \le \cdots \le x^{\mathrm{random}}_{g,i,j,Z} \le x^{\mathrm{seed}}_{g,i,j} \le \cdots \le x^{\mathrm{random}}_{g,i,j,10000} \qquad (4)$$

3. Compute an integrated p-value $P_{g,i,j}$ by

$$P_{g,i,j} = 1 - \frac{Z_{g,i,j} + 1}{10{,}000 + 1} \qquad (5)$$

where the following conditions were applied to $P_{g,i,j}$:

• If $TF_{g,i,j} = 0$, then $P_{g,i,j} \leftarrow 1$
• If $P_{g,i,j} > 0.05$, then $P_{g,i,j} \leftarrow 1$
• If $P_{g,j} = 1$, then the $j$th row was excluded from the matrix.

The p-value matrix for the $g$th gene is denoted by $P_g$ and is shown in Fig. 3.
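The testing procedure for a single matrix element can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the random genes are drawn without replacement from a background list holding the same matrix element for other genes, and random values tied with the seed value are counted as ranking at or below it.

```python
import random


def conservation(seed_value, partner_values):
    """Eq. (2)/(3): mean over the seed gene and its N partner genes."""
    return (sum(partner_values) + seed_value) / (len(partner_values) + 1)


def bootstrap_p(seed_value, coexpressed, background, n_iter=10_000, rng=None):
    """Eq. (5) p-value for one element (i, j) of the TFBS location matrix.

    seed_value:  TF[g, i, j] of the seed gene
    coexpressed: the same element for the N co-expressed genes
    background:  the same element for the genes available for random draws
    """
    if seed_value == 0:
        return 1.0  # first condition: no TFBS in the seed gene
    rng = rng or random.Random(0)
    n = len(coexpressed)
    x_seed = conservation(seed_value, coexpressed)
    # Z = number of random averages ranked at or below the seed average.
    z = sum(conservation(seed_value, rng.sample(background, n)) <= x_seed
            for _ in range(n_iter))
    p = 1 - (z + 1) / (n_iter + 1)
    return p if p <= 0.05 else 1.0  # second condition


# Hypothetical element: present in 15 of 20 partners, rare in the background.
background = [0] * 90 + [1] * 10
print(bootstrap_p(1, [1] * 15 + [0] * 5, background))
```

Rows whose p-values are all 1 would then be dropped before drawing the heatmap, which is one reading of the third condition above.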
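[Figure 3 table: example p-value matrix. Rows are the retained transcription factors (TF1, TF4, TF15, TF65, TF123); columns are the 50 bp regions from (-999, -950), (-949, -900), ... to (+101, +150), (+151, +200); entries are either significant p-values (for example 0.0021, 0.0015, 0.012) or 1.]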
Fig. 3. P-value matrix for each transcription factor (TF). Elements of the matrix > 0.05 are converted to 1.
Of the 9,330 human genes, 7,313 had at least one conserved TFBS. The frequency of the significantly conserved TFBSs for 200 transcription factors among these 7,313 genes is shown as a histogram (Fig. 4).
Fig. 4. The frequency histogram of conserved TFBSs for 200 transcription factors. The frequency ranges from 1 to over 2,500.
2.5. Heatmap
Following the bootstrap testing, the p-value matrices were depicted as two-colored heatmaps.
[Figure 5 plot: heatmaps for (a) AIP and (b) HBA2. Transcription factor labels legible in the figure include AhR, ATF, alpha-CP1, Churchill, VDR, AP-2rep, DEAF1, STAT1, AR, PU, SREBP-1, HFH4, Sp1, HNF4, STAT5A, STAT3, AP-2gamma, FOXO1, FOXO4 and SMAD.]
Fig. 5. Heatmap of the conserved transcription factor binding sites for (a) AIP and (b) HBA2. Red represents conservation with higher statistical significance.
2.6. Association rule data mining
Determining significantly conserved TFBSs may help identify transcription factor partners with co-acting biological roles, including less well-studied combinations of transcription factors. Therefore, we applied the "association rule" used in market basket analysis. This method determines which items are frequently purchased together by using a database of transactions in which each tuple is a list of items purchased together in one customer's transaction. The mining seeks to discover rules such as "beer => snacks", meaning "People who buy beer also often buy snacks." Association rules can be formally described as follows:
• I = {i_1, i_2, ..., i_n} is a set of literals called items.
• D is a set of transactions. Each transaction T is a set of items such that T ⊆ I.
• A transaction T contains X, a set of items in I, if X ⊆ T.
• An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅.
• c is the confidence of a rule X => Y in transaction set D if c% of the transactions in D that contain X also contain Y. It is also known as the conditional probability of Y given X, or P(Y | X).
• s is the support of a rule X => Y in set D if s% of the transactions in D contain both X and Y. It is also known as the joint probability of both X and Y, or P(X ∧ Y).
Since a transcription factor like STAT1 has more than 2,000 conserved TFBSs among the transactions (genes), it may obscure meaningful rules for genes with fewer conserved TFBSs. Therefore, we limited the rule search to genes with fewer than 10 TFBSs (Fig. 6).
Fig. 6. Histogram of conserved TFBS frequency for each transcription factor.
Restricting the analysis to genes with at most 9 conserved TFBSs left 39 transcription factors and 102 genes corresponding to the TFBSs found. For those TFBSs, frequent TFBS sets were mined and association rules were calculated using the Apriori algorithm. The Apriori algorithm employs a level-wise search for frequent itemsets (TFBS pairs). The result is shown in Table 1.
Table 1. The rule of basket analysis: (maxlen, support) = (9, 0.01).
Rule No.  X                ==>  Y                support  confidence  lift
1         E2F-4:DP-1       ==>  pRb:E2F-1.DP-1   0.0373   1.000       26.75
2         pRb:E2F-1.DP-1   ==>  E2F-4:DP-1       0.0373   1.000       26.75
3         HFH8             ==>  Freac.7          0.0233   0.8333      25.47
4         NF.kappaB        ==>  c.Rel            0.0187   0.8333      21.40
5         Zic2             ==>  Zic1             0.0187   1.000       53.50
6         Zic1             ==>  Zic2             0.0187   1.000       53.50
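The three measures reported in Table 1 can be computed for a single rule as sketched below. This is a minimal illustration and not the Apriori search itself: the function name and the toy gene-to-TFBS sets are hypothetical, each gene is treated as one transaction, and lift, which the text does not define, is taken in its standard form as the confidence divided by the fraction of transactions containing Y.

```python
def rule_metrics(transactions, x, y):
    """Support, confidence and lift of the rule X => Y.

    transactions: list of sets of items (here, conserved TFBSs per gene).
    Assumes X and Y each occur in at least one transaction.
    """
    x, y = frozenset(x), frozenset(y)
    n = len(transactions)
    n_x = sum(x <= t for t in transactions)         # transactions containing X
    n_y = sum(y <= t for t in transactions)         # transactions containing Y
    n_xy = sum((x | y) <= t for t in transactions)  # transactions containing both
    support = n_xy / n
    confidence = n_xy / n_x
    lift = confidence / (n_y / n)
    return support, confidence, lift


# Toy transactions (hypothetical genes and their conserved TFBSs).
genes = {
    "gene1": {"Zic1", "Zic2"},
    "gene2": {"Zic1", "Zic2", "STAT3"},
    "gene3": {"STAT3"},
    "gene4": {"Sp1"},
}
print(rule_metrics(list(genes.values()), {"Zic1"}, {"Zic2"}))
```

A lift well above 1, as in every row of Table 1, indicates that the two TFBSs occur together more often than expected if they were independent.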
3. Discussion
In this paper, we conducted TFBS prediction using PWMs and a bootstrap method to search for conserved TFBSs. Under the assumption that the location and combination of TFBSs are restricted by the limited structural flexibility of the transcriptional regulating machinery, co-expressed genes may share a common structure in the promoter region. As a result of the bootstrap testing, a number of statistically significant conserved TFBSs were found, and evidence of functionality was confirmed for some of them. For example, in Fig. 5(a), AIP (aryl hydrocarbon receptor interacting protein) shows a significantly conserved TFBS of AhR (aryl hydrocarbon receptor) close to the TSS. The heatmap also captures HNF4-alpha as a binding transcription factor, which has also been validated by a ChIP-chip experiment [14]. As for HBA2 (alpha2-globin), in Fig. 5(b), the heatmap shows an Sp1 binding site that has been experimentally validated by ChIP-chip [15]. These results imply that the conserved TFBSs are indeed functional, owing to the restricted structural flexibility of the transcription regulating machinery. We further searched for significantly co-occurring conserved TFBSs using basket analysis, with the expectation that such combinatorial occurrence implies potential cis-regulatory regions. According to rules No. 1 and No. 2, pRb:E2F-1:DP-1 and E2F-4:DP-1 tend to co-occur in the conserved region. In the transcription factor complex pRb:E2F-1:DP-1, association with pRB is known to alter the binding site specificity of E2F-1/DP-1 complexes [16]. Therefore, pRB may act as a switching device for gene regulation. The co-occurrence of E2F-4:DP-1, a complex similar to the E2F-1:DP-1 subunit, is yet to be experimentally validated, but pRB may have a similar function as a regulatory switching device there. As for rule No. 4, NF.kappaB and c.Rel are known to form a complex and bind DNA for transcriptional activation of various genes [17-18].
As for Zic1 and Zic2, they are known to bind and trans-activate the apolipoprotein E gene promoter [19]. As for rule No. 3, although there has been no previous report of an interaction between HFH8 and Freac.7, they may be good candidates for potential interacting partners in further experiments. As for future work, in this study we restricted the "conserved TFBSs" to those located in exactly the same region among co-expressing genes. Allowing for more flexibility in the transcription regulating machinery, by including slightly shifted TFBSs, may reveal better candidates for common functional TFBSs among coexpressing genes. In addition, searching for co-occurring TFBSs among coexpressing genes using the basket method may reveal novel candidates for cis-regulatory elements. In conclusion, our strategy of searching for conserved TFBSs among coexpressing genes revealed that there is indeed a significant number of conserved TFBSs. Together with the analysis of co-occurrence, it is likely that such co-occurring conserved TFBSs act as cis-regulatory elements in the human genome.
References
[1] Kim, J., Bhinge, AA., Morgan, XC., Iyer, VR., Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment, Nat. Methods, 2(1):47-53, 2005.
220
Y. H atanaka et al.
[2] Lee, TI., Rinaldi, NJ., Robert, F., Odom, DT., Bar-Joseph, Z., Gerber, GK., Hannett, NM., Harbison, CT., Thompson, CM., Simon, I., Zeitlinger, J., Jennings, EG., Murray, HL., Gordon, DB., Ren, B., Wyrick, JJ., Tagne, JR., Volkert, TL., Fraenkel, E., Gifford, DK., Young, RA., Transcriptional regulatory networks in Saccharomyces cerevisiae, Science, 298(5594):799-804, 2002.
[3] Carroll, JS., Meyer, CA., Song, J., Li, W., Geistlinger, TR., Eeckhoute, J., Brodsky, AS., Keeton, EK., Fertuck, KC., Hall, GF., Wang, Q., Bekiranov, S., Sementchenko, V., Fox, EA., Silver, PA., Gingeras, TR., Liu, XS., Brown, M., Genome-wide analysis of estrogen receptor binding sites, Nat. Genet., 38(11):1289-1297, 2006.
[4] Tupler, R., Perini, G., Green, MR., Expressing the human genome, Nature, 409(6822):832-833, 2001.
[5] Messina, DN., Glasscock, J., Gish, W., Lovett, M., An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression, Genome Res., 14(10B):2041-2047, 2004.
[6] Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I. and Schacherer, F., TRANSFAC: an integrated system for gene expression regulation, Nucleic Acids Res., 28:316-319, 2000.
[7] Siggia, ED., Computational methods for transcriptional regulation, Curr. Opin. Genet. Dev., 15(2):214-221, 2005.
[8] Tavazoie, S., Hughes, JD., Campbell, MJ., Cho, RJ., Church, GM., Systematic determination of genetic network architecture, Nat. Genet., 22(3):281-285, 1999.
[9] Birnbaum, K., Benfey, PN., Shasha, DE., cis element/transcription factor analysis (cis/TF): a method for discovering transcription factor/cis element relationships, Genome Res., 11(9):1567-1573, 2001.
[10] Bussemaker, HJ., Li, H., Siggia, ED., Regulatory element detection using correlation with expression, Nat. Genet., 27(2):167-171, 2001.
[11] Kel, A., Gossling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O., Wingender, E., MATCH: A tool for searching transcription factor binding sites in DNA sequences, Nucleic Acids Res., 31:3576-3579, 2003.
[12] Matys, V., et al., TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes, Nucleic Acids Res., 34(Database issue):D108-110, 2006.
[13] Takeshi, O., Shinpei, Hayashi., Masayuki, Shibaoka., Motoshi, S., Hiroyuki, Ohta., Kengo, Kinoshita., COXPRESdb: a database of co-expressed gene networks in mammals, Nucleic Acids Res., 36(Database issue):D77-82, 2008.
[14] Odom, D.T., Zizlsperger, N., Gordon, D.B., Bell, G.W., Rinaldi, N.J., Murray, H.L., Volkert, T.L., Schreiber, J., Rolfe, P.A., Gifford, D.K., Fraenkel, E., Bell, G.I., Young, R.A., Control of pancreas and liver gene expression by HNF transcription factors, Science, 303:1378-1381, 2004.
[15] TRANSFAC_Team, New ChIP-on-chip data. TRANSFAC Reports., Re1121:0002, 2008.
[16] Tao, Y., Kassatly, R.F., Cress, W.D., Horowitz, J.M., Subunit composition determines E2F DNA-binding site specificity, Mol. Cell. Biol., 17:6994-7007, 1997.
[17] Sun, S.C., Elwood, J., Beraud, C., Greene, W.C., Human T-cell leukemia virus type I Tax activation of NF-kappaB/Rel involves phosphorylation and degradation of IkappaBalpha and RelA (p65)-mediated induction of the c-rel gene, Mol. Cell. Biol., 14:7377-7384, 1994.
[18] Hansen, S. K., Nerlov, C., Zabel, U., Verde, P., Johnsen, M., Baeuerle, P., Blasi, F., A novel complex between the p65 subunit of NF-kappaB and c-Rel binds to a DNA element involved in the phorbol ester induction of the human urokinase gene, EMBO J., 11:205-213, 1992.
[19] Salero, E., Perez-Sen, R., Aruga, J., Gimenez, C., Zafra, F., Transcription factors Zic1 and Zic2 bind and transactivate the apolipoprotein E gene promoter, J. Biol. Chem., 276:1881-1888, 2001.
MODELING IL-2 GENE EXPRESSION IN HUMAN REGULATORY T CELLS

MANUELA BENARY 1
manuela.benary@cms.hu-berlin.de

HANNA BENDFELDT 2
bendfeldt@drfz.de

HANSPETER HERZEL 1
h.herzel@biologie.hu-berlin.de

RIA BAUMGRASS 2
baumgrass@drfz.de

1 Institute for Theoretical Biology, Humboldt University of Berlin, Invalidenstr. 43, 10115 Berlin, Germany
2 German Rheumatism Research Centre, Chariteplatz 1, 10117 Berlin, Germany

Interleukin-2 (IL-2) is one of the first cytokines to be expressed by T helper cells (Th cells) after antigen-specific stimulation. In contrast, regulatory T cells (Treg cells) do not express IL-2, although they are activated via the same pathways. In regulatory T cells the additional transcription factor FoxP3 is expressed. Using intracellular measurement of the transcription factors NFAT and FoxP3 as well as the cytokine IL-2 at the single-cell level, we revealed a small fraction of IL-2 expressing Treg cells. Furthermore, these data enabled us to develop initial mathematical models describing gene expression of IL-2 in individual cells. The models are adapted to data from human regulatory T cells. Based on statistical tests of available flow cytometric data, it seems reasonable that not only the amount of the transcription factors NFAT and FoxP3 is important but also their concentration ratio. We discuss specific problems of modeling gene expression at the single-cell level, taking IL-2 expression as an example.
Keywords: mathematical model; interleukin-2; regulatory T cells
1. Introduction
T helper cells transcribe the cytokine interleukin 2 (IL-2) in response to an antigen-specific stimulus. IL-2 is an immediate early gene and a key molecule in adaptive immunity [12]. This cytokine affects T cell proliferation and differentiation [11] and therefore leads to an expansion of activated T cells. In particular, growth and maintenance of Treg cells are crucially dependent on IL-2 [5, 16]. Treg cells suppress other immune cells and thus protect the body from autoimmunity. Treg cells are considered unable to express IL-2 themselves [10, 18].
The transcriptional regulation of interleukin 2

The antigen-specific activation of T helper cells leads to the activation of signaling networks, which in turn activate different transcription factors. The transcription factors NFAT (nuclear factor of activated T cells) and AP-1 (activator protein 1), as well as NFκB (nuclear factor-kappa B), are essential for the expression of IL-2 [9].
These transcription factors bind at multiple sites in the minimal promoter region upstream of the transcription start site [12]. The minimal promoter region stretches about 300 bp [12] and includes at least four NFAT binding sites [9] and four AP-1 binding sites which are close to the NFAT binding sites [12] (see Fig. 1). Regulatory T (Treg) cells are a subpopulation of T helper cells. Treg cells are characterized by the expression of the master transcription factor FoxP3 (forkhead box P3). FoxP3 can bind to the DNA-binding sites of AP-1 and might thereby repress the expression of IL-2. Potential suppressive mechanisms of FoxP3 in Treg cells are currently under discussion [4, 19]. Other transcription factors, such as NFκB, also bind to the minimal enhancer, but for simplicity they are not considered in the models.
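[Figure 1: schematic of the IL-2 promoter with positions -300, -200 and -100 relative to the transcription start site (TSS); the minimal promoter is marked, with symbols for NFAT binding sites and for AP-1 and FoxP3 binding sites.]

Fig. 1. The interleukin-2 promoter. The minimal promoter stretches from -300 bp to +40 bp and contains a minimal set of transcription factor binding sites regulating the IL-2 gene (adapted from [9]).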
The gene expression of interleukin 2 shows bimodal behavior in regulatory T cells

Although Treg cells have so far not been considered to express IL-2, our flow cytometric data show that a subpopulation of Treg cells clearly expresses IL-2. This bimodal behavior (see Fig. 2) leads to several interesting questions. What mechanism is responsible for induction or repression of IL-2 in the IL-2 expressing and nonexpressing Treg cell subpopulations, respectively? Are the concentrations of the transcription factors NFAT and FoxP3 important for IL-2 expression? Is there a biological function of the IL-2 expressing Treg cell subpopulation?

2. Methods
In four experiments, T helper cells from four different donors were stimulated for 5 hours using phorbol 12-myristate 13-acetate (PMA) and ionomycin. The cells were costained for NFAT, FoxP3 and IL-2, and fluorescence intensities (FI) were measured by flow cytometry [11]. The flow cytometric data were normalized using the highest fluorescence intensity. They were clustered using k-means (k=2) into FoxP3 expressing (Treg cells) and nonexpressing cells (Th cells), as well as into IL-2 expressing and nonexpressing cells. The normalized data of transcription factors and IL-2 in Treg cells were compared using Pearson's and Kendall's correlation coefficients. Distribution fitting and simulation of cell populations were performed with Matlab programs.
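The analysis steps named above (normalization by the highest intensity, two-cluster k-means, Pearson's and Kendall's coefficients) can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: all function names are hypothetical, the k-means is a plain one-dimensional variant initialized at the minimum and maximum, and the Kendall coefficient is the simple tau-a form without tie correction, which may differ from the variant used in the original analysis.

```python
import math


def normalize(values):
    """Scale intensities by the highest value so that the maximum becomes 1."""
    top = max(values)
    return [v / top for v in values]


def kmeans_two_clusters(values, max_iter=100):
    """1-D k-means with k=2. Returns one label per value: 0 = low, 1 = high."""
    low, high = min(values), max(values)  # initial cluster centres (assumption)
    labels = [0] * len(values)
    for _ in range(max_iter):
        new_labels = [1 if abs(v - high) < abs(v - low) else 0 for v in values]
        lows = [v for v, lab in zip(values, new_labels) if lab == 0]
        highs = [v for v, lab in zip(values, new_labels) if lab == 1]
        if not highs or new_labels == labels:
            labels = new_labels
            break  # converged, or all values identical
        labels = new_labels
        low, high = sum(lows) / len(lows), sum(highs) / len(highs)
    return labels


def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)
```

In such a workflow, the labels returned by `kmeans_two_clusters` for the FoxP3 channel would select the Treg cells, and the two coefficients would then be computed on the NFAT, FoxP3 and IL-2 values of that subset.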
3. Results
Linear correlations cannot sufficiently explain IL-2 expression in regulatory T cells

Using Pearson's correlation coefficient we found a positive correlation between the amount of the transcription factors NFAT and FoxP3 and the amount of expressed IL-2 per cell, both in all Treg cells and in Treg cells expressing IL-2, in all four data sets. The mean of Pearson's correlation coefficient between NFAT and IL-2 is higher in all Treg cells (mean = 0.49, std = 0.11) than in Treg cells expressing IL-2 (mean = 0.40, std = 0.22). The mean of Pearson's correlation coefficient between FoxP3 and IL-2 is also positive, but higher in Treg cells expressing IL-2 (mean = 0.40, std = 0.07) than in all Treg cells (mean = 0.28, std = 0.12). This indicates a positive linear correlation between the concentration of the transcription factors NFAT and FoxP3 and the concentration of expressed IL-2. To verify these results we used Kendall's correlation coefficient. NFAT and IL-2 are positively correlated, but with smaller values, in all Treg cells (mean = 0.29, std = 0.10) as well as in Treg cells expressing IL-2 (mean = 0.29, std = 0.08). The correlation between FoxP3 and IL-2 fluctuates around zero and therefore does not allow any specific conclusions. Both correlation coefficients indicate a positive correlation between the amount of NFAT and the amount of IL-2. The correlation coefficients are less conclusive in the case of FoxP3 and IL-2. Furthermore, the values of the coefficients are low and they can only explain a small fraction of the data. This leads us to the conclusion that linear correlations cannot sufficiently explain IL-2 expression. Moreover, the correlation coefficients give no clear answer to the question why a subpopulation of Treg cells expresses IL-2.

[Figure 2: panels (a) and (b); x-axis: fluorescence intensity of IL-2 (logarithmic scale); panel (a) is annotated with the fractions 75%-90% and 10%-25%.]

Fig. 2. (a) Scatter plot of FoxP3 and IL-2 expression in human T helper cells. The black dots indicate cells which have a high concentration of FoxP3. FoxP3 is a marker for regulatory T cells. The x-axis shows the fluorescence intensity of IL-2 expression. The fraction of regulatory T cells which express IL-2 varies between 10% and 25% (upper right). (b) Bimodal IL-2 expression in regulatory T cells. Two peaks can be distinguished, corresponding to the number of cells in the two subpopulations. The high peak corresponds to the Treg cells not expressing IL-2, whereas the small one corresponds to Treg cells expressing IL-2.
Simple decision models reproduce bimodal behavior

We introduce three simple models which cover certain aspects of the regulation of IL-2 gene expression. We first explain the general modeling assumptions, then introduce the specific models, and finally test whether the models can capture the properties of the data.
General assumptions

The IL-2 gene is assumed to have two states according to Biggar and Crabtree [3]. If the gene is in the OFF state ($G=0$), only a basal expression (with rate $r_b$) can occur. If the gene is in the ON state ($G=1$), an additional induced expression (with rate $r_i$) is assumed. The IL-2 protein can be degraded with rate $d$. This leads to the following differential equation for the time course of IL-2 expression:

$$\frac{d\,\mathrm{IL2}}{dt} = r_b + r_i \cdot G - d \cdot \mathrm{IL2} \tag{1}$$
In the experiments the cells are fully stimulated with PMA and ionomycin for 5 hours, so the gene expression of IL-2 is regarded to be in steady state:

$$\mathrm{IL2}_{\mathrm{st.st.}} = \frac{r_b + r_i \cdot G}{d} \tag{2}$$
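As a brief worked step (a sketch under the assumption that $G$, $r_b$, $r_i$ and $d$ are constant in time for a given cell), Eq. (2) follows from Eq. (1) by setting the time derivative to zero, and the full time course relaxes exponentially towards that value:

```latex
\begin{align*}
0 &= r_b + r_i \cdot G - d \cdot \mathrm{IL2}_{\mathrm{st.st.}}
  \quad\Longrightarrow\quad
  \mathrm{IL2}_{\mathrm{st.st.}} = \frac{r_b + r_i \cdot G}{d}, \\
\mathrm{IL2}(t) &= \mathrm{IL2}_{\mathrm{st.st.}}
  + \bigl(\mathrm{IL2}(0) - \mathrm{IL2}_{\mathrm{st.st.}}\bigr)\, e^{-d\,t}.
\end{align*}
```

The characteristic relaxation time is $1/d$, so the steady-state approximation presupposes a stimulation period that is long compared with $1/d$.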
As we are looking at a population of cells, cell-to-cell variability is introduced by drawing the expression rates $r_b$ and $r_i$ from normal distributions. The parameters of these distributions are shown in Table 1. The independent ratios $r_b/d$ and $r_i/d$ describe the positions of the peaks.
Random model

For the random switch we assume that IL-2 expression is not regulated by a defined mechanism, but that the IL-2 gene in a cell is in state ON with a 15% chance. This model does not give mechanistic explanations, but rather serves as the default for comparisons with the other models.

$$G = \begin{cases} 1, & \mathrm{rand} \le 0.15 \\ 0, & \text{else} \end{cases} \tag{3}$$
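A minimal Python sketch of the random model may help to make the procedure concrete. It combines the random switch of Eq. (3) with the steady state of Eq. (2) and uses the Table 1 values as defaults. The function name, the fixed seed, and the clipping of negative rate draws to zero are assumptions of this sketch; the source does not state how negative draws from the normal distributions are handled.

```python
import random


def simulate_random_model(n_cells=10000, p_on=0.15,
                          ri_mean=0.075, ri_std=0.075,
                          rb_mean=0.0075, rb_std=0.005,
                          d=0.00001, seed=0):
    """Return (steady-state IL-2 per cell, gene state per cell)."""
    rng = random.Random(seed)
    il2_levels, gene_states = [], []
    for _ in range(n_cells):
        # Eq. (3): the gene is ON with probability p_on.
        g = 1 if rng.random() <= p_on else 0
        # Cell-to-cell variability: rates drawn from normal distributions.
        # Assumption of this sketch: negative draws are clipped to zero.
        r_b = max(rng.gauss(rb_mean, rb_std), 0.0)
        r_i = max(rng.gauss(ri_mean, ri_std), 0.0)
        # Eq. (2): steady-state IL-2 level.
        il2_levels.append((r_b + r_i * g) / d)
        gene_states.append(g)
    return il2_levels, gene_states


if __name__ == "__main__":
    levels, states = simulate_random_model()
    print("fraction of cells with the gene ON:", sum(states) / len(states))
```

A histogram of `levels` on a logarithmic axis would be the natural way to compare such a simulation with the bimodal distribution in the data.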
Cooperation model

We assume that IL-2 expression is regulated only by NFAT and that all NFAT binding sites in the minimal promoter have to be occupied to allow IL-2 gene expression [12]. This can be modeled as cooperative binding using a Hill function of NFAT, assuming a mechanism similar to that of multiple phosphorylation events [13].

$$G = \begin{cases} 1, & \mathrm{rand} \le \mathbb{P}(G=1) \\ 0, & \text{else} \end{cases} \tag{4}$$

$$\mathbb{P}(G=1) = \frac{\mathrm{NFAT}^n}{\mathrm{NFAT}^n + K^n} \tag{5}$$

The values of the Hill function lie between zero and one and can thus be seen as the probability of the IL-2 gene being in state ON. The concentration of NFAT in all cells is estimated with a lognormal distribution. Using this distribution in the Hill function yields a probability distribution for the IL-2 gene to be in state ON. The dissociation constant $K$ of NFAT binding to DNA and the concentration of NFAT govern the number of cells in each subpopulation.
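The cooperation model of Eqs. (4) and (5) can be sketched in Python as follows. The function names and the fixed seed are hypothetical, and the lognormal parameters and the value of `k` passed by a caller are illustrative placeholders rather than fitted values from the paper.

```python
import random


def hill(nfat, k, n):
    """Eq. (5): probability that the IL-2 gene is ON for a given NFAT level."""
    return nfat ** n / (nfat ** n + k ** n)


def fraction_on_cooperation(nfat_mu, nfat_sigma, k, n=4,
                            n_cells=10000, seed=0):
    """Fraction of simulated cells whose IL-2 gene is ON (Eq. 4)."""
    rng = random.Random(seed)
    on = 0
    for _ in range(n_cells):
        # NFAT level per cell drawn from a lognormal distribution.
        nfat = rng.lognormvariate(nfat_mu, nfat_sigma)
        # Eq. (4): compare a uniform random number with the Hill probability.
        if rng.random() <= hill(nfat, k, n):
            on += 1
    return on / n_cells
```

For example, `fraction_on_cooperation(0.0, 0.5, k=3.0)` returns the ON fraction for a hypothetical population with median NFAT level 1 and a dissociation constant of 3 in the same arbitrary units.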
Inhibition model

In this model we assume that IL-2 expression is regulated jointly by NFAT and FoxP3. NFAT acts as an enhancer of IL-2 gene expression, whereas FoxP3 is assumed to be a repressor. For simplicity, we assume that FoxP3 inhibits IL-2 expression directly and that this transcription factor binds to the DNA independently of NFAT. Four complexes can form, and gene expression of IL-2 can occur only if NFAT, and not FoxP3, is bound to the DNA.

$$G = \begin{cases} 1, & \mathrm{rand} \le \mathbb{P}(G=1) \\ 0, & \text{else} \end{cases} \tag{6}$$

$$\mathbb{P}(G=1) = \frac{K_F \cdot \mathrm{NFAT}}{\mathrm{FoxP3} \cdot \mathrm{NFAT} + K_N \cdot \mathrm{FoxP3} + K_N \cdot K_F + K_F \cdot \mathrm{NFAT}} \tag{7}$$

The concentrations of NFAT and FoxP3 as well as their dissociation constants $K_N$ and $K_F$ determine the number of cells in the subpopulations. The experimental data of NFAT and FoxP3 can be estimated via lognormal distributions with parameters $\mathrm{NFAT}_\mu$, $\mathrm{NFAT}_\sigma$ and $\mathrm{FoxP3}_\mu$, $\mathrm{FoxP3}_\sigma$, respectively. The values of the other parameters have been assigned to represent the data and can be found in Table 1.
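Equation (7) can be read as the product of two independent binding probabilities: NFAT bound, $\mathrm{NFAT}/(\mathrm{NFAT}+K_N)$, and FoxP3 not bound, $K_F/(\mathrm{FoxP3}+K_F)$. The following Python sketch checks this reading numerically and simulates the switch of Eq. (6). Function names, the seed and the lognormal parameters supplied by a caller are hypothetical; only the default dissociation constants are taken from Table 1.

```python
import random


def p_on_inhibition(nfat, foxp3, k_n=6.7, k_f=6.7):
    """Eq. (7) written exactly as in the text."""
    return (k_f * nfat) / (foxp3 * nfat + k_n * foxp3 + k_n * k_f + k_f * nfat)


def p_on_factorized(nfat, foxp3, k_n=6.7, k_f=6.7):
    """Same probability as P(NFAT bound) * P(FoxP3 not bound)."""
    return (nfat / (nfat + k_n)) * (k_f / (foxp3 + k_f))


def fraction_on_inhibition(nfat_mu, nfat_sigma, foxp3_mu, foxp3_sigma,
                           k_n=6.7, k_f=6.7, n_cells=10000, seed=0):
    """Fraction of simulated cells whose IL-2 gene is ON (Eq. 6)."""
    rng = random.Random(seed)
    on = 0
    for _ in range(n_cells):
        nfat = rng.lognormvariate(nfat_mu, nfat_sigma)
        foxp3 = rng.lognormvariate(foxp3_mu, foxp3_sigma)
        if rng.random() <= p_on_inhibition(nfat, foxp3, k_n, k_f):
            on += 1
    return on / n_cells
```

The factorized form makes the model assumption of independent binding explicit, which is why the two functions agree for any pair of concentrations.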
Table 1. Parameters for the models. The last column indicates the relevant models ("random" = R, "cooperation" = C and "inhibition" = I).

parameter   description                               value     used in
N           number of cells                           10000     R, C, I
r_iμ        mean of induced expression rate           0.075     R, C, I
r_iσ        std of induced expression rate            0.075     R, C, I
r_bμ        mean of basal expression rate             0.0075    R, C, I
r_bσ        std of basal expression rate              0.005     R, C, I
d           degradation rate                          0.00001   R, C, I
K           dissociation constant                               C
n           Hill coefficient                          4         C
K_N         dissociation constant of NFAT to DNA      6.7       I
K_F         dissociation constant of FoxP3 to DNA     6.7       I
[Figure 3: simulated IL-2 distributions; x-axis: IL-2 (a.u.), logarithmic scale; legend: #1 Random, #2 Cooperation, #3 Inhibition.]

Fig. 3. IL-2 distributions in the steady state simulated by three different models. The "random model" (black) has about 12% cells expressing IL-2, the "cooperation model" (grey dashed) has about 22% cells expressing IL-2 and the "inhibition model" (grey solid) has about 11% cells expressing IL-2. These numbers are consistent with the data from the donors. All models have the same parameter values for the transcription rates.
The distributions of NFAT and FoxP3 favor the model of inhibition

All models lead to a good fit of the bimodal behavior of IL-2 expression (see Fig. 3). Unlike the "random model", the "cooperation model" and the "inhibition model" are affected by the amounts of NFAT and FoxP3. The "cooperation" and "inhibition" models can be distinguished by the effect of the transcription factors. The estimated lognormal distributions of the transcription factors are described by the parameters $\mu$ and $\sigma$. Changing these parameters for NFAT and FoxP3, we observed a change in the number of cells expressing IL-2 (see Table 2). Based on the values of $\mathrm{NFAT}_\sigma$ one can distinguish the "cooperation model" and the "inhibition model".

Table 2. Change in number of Treg cells which express IL-2 after changing parameters of NFAT and FoxP3 distributions.

[Table 2 body: rows for the changed parameters NFAT_μ, NFAT_σ, FoxP3_μ and FoxP3_σ; columns for IL-2 expressing cells in model #2 Cooperation and model #3 Inhibition. The individual arrow entries are not recoverable from the source.]
Using data from four different donors we can exploit individual variations. Each individual has different absolute levels of transcription factors in the T cell populations. These individual differences are possibly due to regulatory processes and the immunological history of an individual. Our data favor the "inhibition model" (dashed line in Fig. 4), because the number of IL-2 expressing cells increases when $\mathrm{NFAT}_\sigma$ increases. However, the number of IL-2 expressing Treg cells should also decrease with increasing $\mathrm{FoxP3}_\sigma$. This is true for small values of $\mathrm{FoxP3}_\sigma$ but cannot explain all the data.
[Figure 4: x-axis: estimation of parameter σ of the lognormal distribution; legend: * NFAT_σ, dashed line = linear regression of NFAT_σ, + FoxP3_σ.]

Fig. 4. Estimated values for the parameters NFAT_σ and FoxP3_σ. The number of IL-2 expressing Treg cells increases when NFAT_σ increases (black * and dashed line), as would be expected from the "inhibition model". The number of IL-2 expressing Treg cells also decreases with increasing FoxP3_σ (grey +), but only in a certain range.
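The kind of parameter scan summarized in Table 2 and Fig. 4 can be explored with a small Python sketch that averages the ON probability of Eq. (5) or Eq. (7) over a simulated population while the width $\sigma$ of the NFAT lognormal distribution is varied. Everything numerical here is a placeholder: the function name, the seed, the FoxP3 distribution parameters and the cooperation constant `k` are assumptions of this sketch, and the direction of the response it produces depends on these choices, so it should not be read as a reproduction of the published result.

```python
import random


def mean_on_probability(model, nfat_sigma, nfat_mu=0.0,
                        foxp3_mu=0.0, foxp3_sigma=0.5,
                        k=6.7, n=4, k_n=6.7, k_f=6.7,
                        n_cells=20000, seed=1):
    """Population average of P(G=1) for the chosen model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_cells):
        nfat = rng.lognormvariate(nfat_mu, nfat_sigma)
        foxp3 = rng.lognormvariate(foxp3_mu, foxp3_sigma)
        if model == "cooperation":
            # Eq. (5): Hill function of NFAT only.
            total += nfat ** n / (nfat ** n + k ** n)
        elif model == "inhibition":
            # Eq. (7), written as the product of independent binding terms.
            total += (nfat / (nfat + k_n)) * (k_f / (foxp3 + k_f))
        else:
            raise ValueError("model must be 'cooperation' or 'inhibition'")
    return total / n_cells


if __name__ == "__main__":
    for sigma in (0.2, 0.6, 1.0, 1.5):
        coop = mean_on_probability("cooperation", sigma)
        inhib = mean_on_probability("inhibition", sigma)
        print(f"NFAT sigma={sigma:.1f}  cooperation={coop:.4f}  inhibition={inhib:.4f}")
```

Averaging the probability instead of drawing the gene state per cell removes one source of sampling noise, which makes trends across $\sigma$ values easier to see in a small simulation.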
4. Discussion
Verification via overexpression experiments is needed

Because the effect of FoxP3 in the "inhibition model" is not conclusive, overexpression experiments are needed to verify the effect of FoxP3 on the number of cells in the subpopulations. With the "cooperation model" and the "inhibition model" we can provide some hints as to how many Treg cells will express IL-2 for a given distribution of NFAT and FoxP3 in the cells.
Regulation by other transcription factors

So far these models only take into account the regulation via NFAT and FoxP3. However, other transcription factors also play a role. First, AP-1 binding to NFAT stabilizes NFAT-DNA binding. Second, AP-1 and FoxP3 compete for NFAT binding [19]. Third, NFκB [8], Oct [6] and other factors are important for IL-2 expression, too. The three models shown here focus only on the regulation via the minimal promoter region of the IL-2 gene. In addition, enhancer regions in distal promoter areas [17] and epigenetic modifications [1, 2] play a crucial role in opening up the minimal promoter. Furthermore, the models do not include any regulation of the transcription factors themselves. This might be very important, as the gene expression of FoxP3 is enhanced by NFAT [15]. The gene expression of NFATc1, which is a member of the NFAT family, is also induced by an autoregulatory loop [14].
Time course data

In our current models IL-2 expression is regarded to be in steady state. This captures the cell-to-cell variability only in a restricted way. The stochastic effects of transcriptional and translational bursts are not taken into account and might yield insights into the behavior of the cells over the course of time. Simulating the time course can also incorporate the dynamic changes in transcription factor concentrations, for example the increase of NFATc1 induced by its positive autoregulation. Measurements of time-dependent concentrations of several involved transcription factors will allow a refinement of our current initial models.

Acknowledgements

This work was funded by the DFG and BMBF.

References
[1] Adachi, S. and Rothenberg, E. V., Cell-type-specific epigenetic marking of the IL2 gene at a distal cis-regulatory region in competent, nontranscribing T-cells, Nucleic Acids Res, 33(10):3200-10, 2005.
[2] Attema, J. L., Reeves, R., Murray, V., Levichkin, I., Temple, M. D., Tremethick, D. J. and Shannon, M. F., The human IL-2 gene promoter can assemble a positioned nucleosome that becomes remodeled upon T cell activation, J Immunol, 169(5):2466-76, 2002.
[3] Biggar, S. R. and Crabtree, G. R., Cell signaling can direct either binary or graded transcriptional responses, EMBO J, 20(12):3167-76, 2001.
[4] Chatila, T. A., Li, N., Garcia-Lloret, M., Kim, H. J. and Nel, A. E., T-cell effector pathways in allergic diseases: transcriptional mechanisms and therapeutic targets, J Allergy Clin Immunol, 121(4):812-23; quiz 824-5, 2008.
[5] de la Rosa, M., Rutz, S., Dorninger, H. and Scheffold, A., Interleukin-2 is essential for CD4+CD25+ regulatory T cell function, Eur J Immunol, 34(9):2480-8, 2004.
[6] Kamps, M. P., Corcoran, L., LeBowitz, J. H. and Baltimore, D., The promoter of the human interleukin-2 gene contains two octamer-binding sites and is partially activated by the expression of Oct-2, Mol Cell Biol, 10(10):5464-72, 1990.
[7] Kang, S. M., Beverly, B., Tran, A. C., Brorson, K., Schwartz, R. H. and Lenardo, M. J., Transactivation by AP-1 is a molecular target of T cell clonal anergy, Science, 257(5073):1134-8, 1992.
[8] Kang, S. M., Tran, A. C., Grilli, M. and Lenardo, M. J., NF-kappa B subunit regulation in nontransformed CD4+ T lymphocytes, Science, 256(5062):1452-6, 1992.
[9] Kim, H. P., Imbert, J. and Leonard, W. J., Both integrated and differential regulation of components of the IL-2/IL-2 receptor system, Cytokine Growth Factor Rev, 17(5):349-66, 2006.
[10] Papiernik, M., de Moraes, M. L., Pontoux, C., Vasseur, F. and Penit, C., Regulatory CD4 T cells: expression of IL-2R alpha chain, resistance to clonal deletion and IL-2 dependency, Int Immunol, 10(4):371-8, 1998.
[11] Podtschaske, M., Benary, U., Zwinger, S., Hofer, T., Radbruch, A. and Baumgrass, R., Digital NFATc2 activation per cell transforms graded T cell receptor activation into an all-or-none IL-2 expression, PLoS ONE, 2(9):e935, 2007.
[12] Rothenberg, E. V. and Ward, S. B., A dynamic assembly of diverse transcription factors integrates activation and cell-type information for interleukin 2 gene regulation, Proc Natl Acad Sci USA, 93(18):9358-65, 1996.
[13] Salazar, C. and Hofer, T., Allosteric regulation of the transcription factor NFAT1 by multiple phosphorylation sites: a mathematical analysis, J Mol Biol, 327(1):31-45, 2003.
[14] Serfling, E., Chuvpilo, S., Liu, J., Hofer, T. and Palmetshofer, A., NFATc1 autoregulation: a crucial step for cell-fate determination, Trends Immunol, 27(10):461-9, 2006.
[15] Tone, Y., Furuuchi, K., Kojima, Y., Tykocinski, M. L., Greene, M. I. and Tone, M., Smad3 and NFAT cooperate to induce Foxp3 expression through its enhancer, Nat Immunol, 9(2):194-202, 2008.
[16] Waldmann, T. A., The biology of interleukin-2 and interleukin-15: implications for cancer therapy and vaccine design, Nat Rev Immunol, 6(8):595-601, 2006.
[17] Ward, S. B., Hernandez-Hoyos, G., Chen, F., Waterman, M., Reeves, R. and Rothenberg, E. V., Chromatin remodeling of the interleukin-2 gene: distinct alterations in the proximal versus distal enhancer regions, Nucleic Acids Res, 26(12):2923-34, 1998.
[18] Wolf, M., Schimpl, A. and Hunig, T., Control of T cell hyperactivation in IL-2-deficient mice by CD4(+)CD25(-) and CD4(+)CD25(+) T cells: evidence for two distinct regulatory mechanisms, Eur J Immunol, 31(6):1637-45, 2001.
[19] Wu, Y., Borde, M., Heissmeyer, V., Feuerer, M., Lapan, A. D., Stroud, J. C., Bates, D. L., Guo, L., Han, A., Ziegler, S. F., Mathis, D., Benoist, C., Chen, L. and Rao, A., FOXP3 controls regulatory T cell function through cooperation with NFAT, Cell, 126(2):375-87, 2006.
TOXICITY VERSUS POTENCY: ELUCIDATION OF TOXICITY PROPERTIES DISCRIMINATING BETWEEN TOXINS, DRUGS, AND NATURAL COMPOUNDS

SWANTJE STRUCK 1
[email protected]

ULRIKE SCHMIDT 1,*
[email protected]

BJOERN GRUENING 1
[email protected]

INES S. JAEGER 1,2
[email protected]

JULIA HOSSBACH 1
[email protected]

ROBERT PREISSNER 1,†
robert.[email protected]

1 Structural Bioinformatics Group, Institute of Molecular Biology and Bioinformatics, Charite-University Medicine, Arnimallee 22, 14195 Berlin, Germany
2 Department of Cardiology and Angiology, Charite-University Medicine, Schumannstr. 20/21, 10117 Berlin, Germany
In our everyday life we are confronted with a variety of toxic substances. A number of these compounds are already used as lead structures for the development of new drugs, but the pool of toxic substances is still a rich resource of new bioactive compounds. During the identification and development of new potential drugs, risk estimation of health hazards is an essential and topical subject in the pharmaceutical industry. To face this challenge, an extensive investigation of known toxic compounds will help to estimate the toxicity of potential drugs. "Toxicity properties" found during those investigations will also function as a guideline for the toxicological classification of other unknown substances. We have compiled a dataset of approximately 50,000 toxic compounds from literature and web sources. All compounds were classified according to their toxicity. During this study the collection of toxic compounds was investigated extensively regarding their chemical, functional, and structural properties and compared with a dataset of drugs and natural compounds. We were able to identify differences in properties within the toxic compounds as well as in comparison to drugs and natural compounds. These properties include molecular weight, hydrogen bond donors and acceptors, and functional groups, which can be regarded as "toxicity properties", i.e. attributes defining toxicity.

Keywords: toxins; drugs; natural compounds; toxicity
1. Introduction
Toxins are hazardous substances causing illness or damage to an exposed organism when inhaled, swallowed or absorbed through the skin. The famous physician Paracelsus (1493-1541) already stated: "Dosis sola venenum facit" ("Only the dose makes the poison"), which is still a central concept of toxicology. This dose-dependency implies that even water may evoke toxic effects when given in high amounts and, conversely, that small doses of a powerful toxin may lead to healing. Toxins constitute a very diverse group of substances, ranging from enzymes to small chemical compounds, affecting just as many different targets [1]. It is of great scientific interest to classify toxins and to compare their toxic effects in order to identify new toxins and to understand the biological mechanisms they are involved in. There are different measurements to estimate toxicity: LD50 and LC50 (lethal dose or concentration at which 50% of a population dies) are widely established, but TGI (total growth inhibition), NOEL (no observable effects limit), and LOEL (lowest observable effects level) are also used.

Recent publications have proposed QSAR (Quantitative Structure-Activity Relationship) or QSPR (Quantitative Structure-Property Relationship) analyses to predict toxicity from chemical and physical properties without further experimental investigations [2,3]. The motivation for the study at hand was the collection of a dataset of structures with corresponding toxicity information, which formed the basis of a training set for QSAR toxicity predictions. This dataset also enabled a detailed investigation of the correlations between chemical, functional, and structural properties of toxic compounds. We found that highly toxic compounds possess a higher molecular weight and more hydrogen bond donors and acceptors than less toxic compounds, drugs, or natural compounds. Furthermore, an increased occurrence of certain functional groups and structural properties (e.g. chiral centers) was detected in highly toxic compounds. These "toxicity properties" form a very promising basis for the prediction of the toxicity of unknown compounds.

* Equally contributing authors
† Corresponding author
2. Data and Methods

2.1. Data
As no comprehensive collection of toxic compounds is publicly available, sufficient information regarding the structural and physicochemical commonalities of toxic compounds cannot be obtained. Therefore, we collected more than 50,000 toxic compounds from literature and different web resources and stored them in a database. This provides a unique collection of data which enables an extensive investigation of their structural and physicochemical attributes. The dataset was enlarged by adding natural compounds and drugs from our databases [4,5]. Toxins often show modes of action similar to those of drugs, which makes them ideal lead structures for the development of new drugs in pharmaceutical research. On the other hand, drugs as well as toxins are natural compounds or derivatives thereof. Since these substance classes are related, it seems appropriate to consider all of them in our investigations regarding similarities and differences.

Toxic compounds

To investigate the relation between a molecule's structure and its toxic impact, we compiled a database containing about 50,000 small molecule compounds, their structures and experimentally measured values of toxicity. About 44,000 structures were taken from the Developmental Therapeutics Program (DTP) of the National Cancer Institute (NCI). Each compound was tested on 60 different cancer cell lines. Values for growth inhibition (GI50), total growth inhibition (TGI) and lethal concentration (LC50) were collected. Structures and toxicity information are freely available on the DTP website [6]. Toxicity information for about 4,500 molecules was extracted from the NLM [7], whereas the corresponding molecular files were taken from PubChem [8]. Furthermore, about 1,200 structures were taken from the literature [9] and the corresponding toxicity values were extracted from the text.

The toxic compounds were investigated regarding their chemical and physicochemical properties. To elucidate the correlation of these properties, the compounds were subdivided into three groups according to their toxicity (-log (LC50)). Compounds with -log (LC50) values from 3 to 6 were combined in a slightly toxic group. The group of medium toxicity comprised compounds with -log (LC50) values from 6 to 9. The third group contained the highly toxic compounds, which feature -log (LC50) values above 9. For a more detailed investigation, the group of medium toxic compounds was subdivided into the -log (LC50) intervals 6-7, 7-8, and 8-9.

Drugs

For the comparative analyses we used the structures of about 2,500 drugs. These data were extracted from the free database SuperDrug containing WHO-classified drugs [5,10]. Entire plants, extracts, mixtures, colloids, and biopolymers are not included in this dataset.

Natural compounds

A second reference group is composed of natural compounds. About 47,000 structures were taken from the free database SuperNatural containing natural compounds, derivatives, and analogues [4].
This complete dataset enables the investigation of chemical, functional and structural properties and sheds light on the complex topic of toxicity. In the following, the attributes of natural compounds and drugs are discussed in relation to those of toxic compounds.
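The grouping of the toxic compounds by -log (LC50) described above can be sketched in a few lines of Python. The function names are hypothetical, LC50 is assumed to be given as a molar concentration, and the treatment of values lying exactly on a group boundary (6 counted as medium, 9 counted as medium) is an assumption, since the source does not specify it.

```python
import math


def toxicity_score(lc50_molar):
    """-log10(LC50); LC50 is assumed to be a molar concentration."""
    return -math.log10(lc50_molar)


def toxicity_group(score):
    """Map a -log(LC50) value to the three groups used in the text."""
    if score < 3:
        return "below studied range"
    if score < 6:
        return "slightly toxic"   # 3 to 6
    if score <= 9:
        return "medium toxic"     # 6 to 9 (boundary handling is an assumption)
    return "highly toxic"         # above 9


if __name__ == "__main__":
    for lc50 in (1e-5, 1e-7, 1e-10):
        score = toxicity_score(lc50)
        print(f"LC50={lc50:g} M  -log(LC50)={score:.1f}  {toxicity_group(score)}")
```

A finer split of the medium group into the intervals 6-7, 7-8 and 8-9 would follow the same pattern with additional thresholds.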
2.2. Methods
Calculation of chemical properties

The chemical properties, e.g. molecular weight and number of hydrogen bond donors and acceptors, were calculated with functions from OpenBabel 2.1, an open source chemical toolbox [11, 12]. To compute the properties for the structures, the software MyChem was used, which is an implementation of the OpenBabel 2.1 library for MySQL [13].
Analysis of functional and structural properties

For the analysis of functional and structural properties, SMARTS patterns encoding functional and structural elements were defined [14]. The distributions of these patterns were compared between the different groups: the three groups of toxic compounds, the drugs, and the natural compounds.
3. Results and Discussion
3.1. Chemical properties

Toxicity

In this study the toxicity of the compounds is defined as -log (LC50), where LC50 is the median lethal concentration for exposed animals or cells. The maximum was reached in the -log (LC50) class "4-5" (Figure 1), indicating that most of the compounds fall within the lower toxic range. As the distribution of 50,000 compounds is plotted, the classes 6-9 and higher still comprise thousands of compounds.
[Figure 1: bar chart titled "Toxicity"; y-axis: compounds (%), 0 to 90; x-axis: -log (LC50) classes <4, 4-5, 5-6, 6-7, 7-8, 8-9, >9.]

Fig. 1. Distribution of the compounds according to their toxicity. The ratio of compounds is plotted against the -log (LC50) values as a measurement of toxicity.
Molecular weight

Figure 2 depicts the distribution of the molecular weight of natural compounds, drugs and toxic compounds. It is noteworthy that the drugs have the lowest weight, followed by the groups of slightly, medium, and highly toxic compounds, whereas natural compounds have intermediate weights.
[Figure 2: line chart titled "Molecular Weight"; y-axis: compounds (%), 0 to 40; x-axis: weight [g/mol] in bins from <100 to >1000; series: TC 3-6, TC 6-9, TC >9, NC, Drugs.]

Fig. 2. Distribution of the molecular weight of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic).
[Figure 3: line chart "Molecular Weight"; x-axis: weight [g/mol] (<100 to >1000); y-axis: compounds [%]; series: TC 6-7, TC 7-8, TC 8-9]
Fig. 3. Detailed distribution of the molecular weight regarding the group of medium toxicity (-log (LC50): 6-9). (TC = toxic compounds)
Figure 3 shows a detailed distribution of the medium toxic compounds regarding their molecular weight. This diagram reflects the same trend as shown in Figure 2. The slightly toxic compounds are characterized by a lower molecular weight compared to the more toxic compounds. These findings support the tendency that toxic compounds have a higher molecular weight than non-toxic compounds. In summary, the investigated groups of compounds differ according to their molecular weight forming a clear sequence: drugs, slightly toxic compounds, natural
compounds, medium toxic compounds, and highly toxic compounds. Thus, a clear correlation between the toxicity and the molecular weight can be found. As drugs are designed as small molecules which can enter cells easily, these compounds are comparatively small. Within the highly toxic compounds, toxins like valinomycin (Streptomyces fulvissimus) or halichondrin (Axinella sp.) can be found. These large compounds function by binding to receptors or forming pores in membranes and are therefore very effective, resulting in a high toxicity.
Hydrogen bond donors and acceptors
Hydrogen atoms attached to a relatively electronegative atom gain a positive partial charge, which makes them very reactive. Thus, they act as hydrogen bond donors in the formation of a hydrogen bond to electronegative atoms such as fluorine, oxygen, or nitrogen, which serve as hydrogen bond acceptors. Hydrogen bond donors and acceptors are ideal components of toxic compounds due to their high reactivity. Therefore, toxic compounds are active even at very low concentrations by interacting with biological macromolecules such as enzymes or cellular receptors.
[Figure 4: line chart "H-Bond Acceptors"; x-axis: acceptors [n] (0 to >20); y-axis: compounds [%]; series: TC 3-6, TC 6-9, TC >9, Drugs, NC]
Fig. 4. Distribution of the amounts of hydrogen bond acceptors of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic).
[Figure 5: line chart "H-Bond Donors" with a small inset chart; x-axis: donors [n] (0 to >20); y-axis: compounds [%]; series: TC 3-6, TC 6-9, TC >9, Drugs, NC]
Fig. 5. Distribution of the amounts of hydrogen bond donors of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic). The small diagram shows a detailed distribution of the amounts of hydrogen bond donors regarding the group of medium toxicity (-log (LC50): 6-9).
To analyze this supposition, the numbers of hydrogen bond donors and acceptors were compared between toxic compounds, natural compounds, and drugs (Figures 4 and 5). The natural compounds, the slightly and medium toxic compounds, and the drugs have very similar numbers of hydrogen bond acceptors and donors, ranging between three and six acceptors and between zero and two donors. The lowest number of hydrogen bond acceptors was found within the drugs, as they are chemically designed to fulfill Lipinski's rule of five [15]. According to this rule, they are supposed to comprise not more than 10 hydrogen bond acceptors in order to have adequate ADME properties [16]. In contrast, the group of highly toxic compounds shows more hydrogen bond donors as well as more acceptors. From the slightly over the medium to the highly toxic compounds, the number of hydrogen bond acceptors and donors rises. This was confirmed by a more detailed investigation of the medium toxic compounds, which show the same trend regarding the hydrogen bond acceptors (data not shown) and donors (Figure 5, small graph). Comparing the molecular weight and the hydrogen bond acceptors, the same sequence of compound groups can be found: the drugs feature the lowest number of hydrogen bond acceptors, followed by the slightly toxic compounds, the natural compounds, and the medium toxic compounds, concluding with the highly toxic compounds as the group with the highest number of hydrogen bond acceptors. The same order occurs for the hydrogen bond donors, except that the natural compounds show the lowest number of hydrogen bond donors and the drugs follow the slightly toxic compounds. Thus, the assumption was confirmed that the more toxic a compound is, the more hydrogen bond donors and acceptors can be found in its structure.
3.2. Functional properties
The distribution of functional groups in toxic compounds, drugs and natural compounds was analyzed and is depicted for selected groups in Figure 6. The occurrence of functional groups rises with increasing toxicity, whereas the natural compounds and the drugs exhibit frequencies within the range of those of the toxic compounds. The highly toxic compounds differ significantly in the amounts of alcohol and sugar groups compared to the other compounds. The more hydroxyl groups a molecule contains, the more hydrogen bond donors are available and the higher is the reactivity. Sugar molecules have many chiral centers and are therefore characterized by a high stereoselectivity. Given the huge number of different sugar molecules, there is a vast number of possible combinations, resulting in a high specificity with respect to the binding affinity to their targets. Alcohols, and phenol as an aromatic alcohol, are characterized by their reactivity and corrosiveness, resulting in a high toxicity. These properties are explained by the denaturing effect of phenol on membrane proteins, forming pores which may lead to cell death. Acetal includes a hydroxyl group which, as mentioned above, makes molecules more reactive. Acetals are stable with respect to hydrolysis by bases. This is an important property for toxic compounds, since the more protected they are from hydrolysis the better they can perform their effects. In summary, an order can be defined, starting with the slightly toxic compounds with the lowest amounts of the depicted functional groups, followed by the natural compounds, the drugs, and the medium toxic compounds, and concluding with the highly toxic compounds, which possess the highest frequencies of the mentioned functional groups.
[Figure 6: bar chart "Functional properties"; categories: Alcohol, Acetal/Acetal-like, Phenol, Sugar; x-axis: compounds [%] (0 to 100); series: TC 3-6, TC 6-9, TC >9, NC, Drugs]
Fig. 6. Distribution of the occurrences of functional groups of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic).
3.3. Structural properties
Structural properties were also investigated as toxicity indicators. The most distinct ones are represented in Figure 7. The analyses of the structural characteristics in the three groups of toxicity show results analogous to the analyses of the functional properties: the more toxic a compound, the more pronounced the property. Since chiral centers can be found in high numbers in sugar molecules, their distributions correlate with those of the sugar group and have the same origin: the high specificity and selectivity they provide ensure a very efficient and specific mode of action of toxic compounds. Conjugated double bonds contribute to the stability of a molecule, so that a high number of them hampers degradation and enables the toxin to perform its effects. Earlier studies revealed that the centers of aromatic rings act as hydrogen bond acceptors [17], which is expected to play a significant role in molecular associations. This ensures a very specific and selective mode of action, which explains the increasing amount of ring systems with increasing toxicity.
[Figure 7: bar chart "Structural properties"; categories: Ring system, Conjugated double bond, Chiral center; x-axis: compounds [%] (0 to 100); series: TC 3-6, TC 6-9, TC >9, NC, Drugs]
Fig. 7. Distribution of the occurrences of structural properties of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = low, 6-9 = medium, >9 = high).
3.4. Case study
Amatoxins are cyclic non-ribosomal oligopeptides found in several members of the Amanita genus of mushrooms, one being the death cap (Amanita phalloides). The most deadly of all the amatoxins is α-amanitin, with an oral LD50 of approximately 0.1 mg/kg. It is an inhibitor of RNA polymerase II, blocking the transcription of DNA into RNA [18]. This leads to a total failure of protein synthesis, causing severe effects on liver and kidney [19]. Death usually occurs around a week after ingestion [20]. A map of the purine and pyrimidine pathway, which can be found in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [21], is shown in Figure 8. It displays in detail the function of RNA polymerase II and the effects its inhibition by α-amanitin would cause.
[Figure 8: KEGG pathway map excerpt]
Fig. 8. Excerpt of the purine pathway extracted from KEGG. The enzyme colored in red with the number "2.7.7.6" depicts the RNA polymerase II.
With a molecular weight of 918.97 g/mol, 13 hydrogen bond donors, and 15 hydrogen bond acceptors, the chemical "toxicity properties" of α-amanitin are consistent with our findings for the highly toxic compounds. Its many ring systems, conjugated double bonds, and chiral centers also fit our results for the structural "toxicity properties" of the highly toxic compounds.
4. Conclusion and Future Perspectives
In this work we were able to elucidate a continuous trend in structural, chemical, and functional properties within the different groups of toxic compounds. The analysis of hydrogen bond donors and acceptors as well as certain functional groups and structural features revealed a positive correlation between occurrence and toxicity, whereas the values for drugs and natural compounds are similar to those of the slightly toxic compounds. Toxic compounds function in a variety of ways, and subgroups, like the highly toxic ones, react with their targets in a completely different manner than drugs. While drugs are usually small compounds, able to enter the cell and to affect targets within the cell, many toxic compounds function by forming pores in membranes (e.g. alpha toxin from Staphylococcus aureus), by permanent activation of, for example, sodium channels (aconitine) or by interaction with neurotransmitter receptors (strychnine). With the help of such mechanisms these toxic compounds are able to affect critical pathways which often cannot be circumvented. Therefore, these molecules are very effective. The data presented here provide valuable insight into the phenomenon of toxicity by elucidating "toxicity properties", i.e. characteristics of toxic compounds. Thus, the properties analyzed here can serve as additional criteria to predict toxicities with the help of QSAR. Additional toxicity-relevant properties, as presented here, will be helpful to improve such analyses. Further efforts will be made in the prediction of potential targets of unknown compounds.
Acknowledgements
This work was supported by the International Research Training Group Boston-Kyoto-Berlin, funded by the German Research Foundation (DFG).
References
[1] Watson, P. and Spooner, R.A., Toxin entry and trafficking in mammalian cells, Adv Drug Deliv Rev, 58: 1581-1596, 2006.
[2] Hong, H., Xie, Q., Ge, W., Qian, F., Fang, H., Shi, L., Su, Z., Perkins, R. and Tong, W., Mold(2), Molecular Descriptors from 2D Structures for Chemoinformatics and Toxicoinformatics, J Chem Inf Model, 2008.
[3] Hughes, L.D., Palmer, D.S., Nigsch, F. and Mitchell, J.B., Why are some properties more difficult to predict than others? A study of QSPR models of solubility, melting point, and Log P, J Chem Inf Model, 48: 220-232, 2008.
[4] Dunkel, M., Fullbeck, M., Neumann, S. and Preissner, R., SuperNatural: a searchable database of available natural compounds, Nucleic Acids Res, 34: D678-683, 2006.
[5] Goede, A., Dunkel, M., Mester, N., Frommel, C. and Preissner, R., SuperDrug: a conformational drug database, Bioinformatics, 21: 1751-1753, 2005.
[6] http://dtp.nci.nih.gov/
[7] http://chem.sis.nlm.nih.gov/chemidplus
[8] http://pubchem.ncbi.nlm.nih.gov/
[9] Teuscher, E. and Lindequist, U., Biogene Gifte, Gustav Fischer Verlag, Germany, 1994.
[10] Gunther, S., Kuhn, M., Dunkel, M., Campillos, M., Senger, C., Petsalaki, E., Ahmed, J., Urdiales, E.G., Gewiess, A., Jensen, L.J. et al., SuperTarget and Matador: resources for exploring drug-target relationships, Nucleic Acids Res, 36: D919-922, 2008.
[11] Guha, R., Howard, M.T., Hutchison, G.R., Murray-Rust, P., Rzepa, H., Steinbeck, C., Wegner, J. and Willighagen, E.L., The Blue Obelisk-interoperability in chemical informatics, J Chem Inf Model, 46: 991-998, 2006.
[12] http://openbabel.sourceforge.net/
[13] http://mychem.sourceforge.net/
[14] http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
[15] Lipinski, C.A., Lombardo, F., Dominy, B.W. and Feeney, P.J., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings, Adv Drug Deliv Rev, 46: 3-26, 2001.
[16] van de Waterbeemd, H. and Gifford, E., ADMET in silico modelling: towards prediction paradise?, Nat Rev Drug Discov, 2: 192-204, 2003.
[17] Levitt, M. and Perutz, M.F., Aromatic rings act as hydrogen bond acceptors, J Mol Biol, 201: 751-754, 1988.
[18] Lindell, T.J. et al., Specific Inhibition of Nuclear RNA Polymerase II by α-Amanitin, Science, 170: 447-449, 1970.
[19] Wieland, T., Poisonous Principles of Mushrooms of the Genus Amanita: Four-carbon amines acting on the central nervous system and cell-destroying cyclic peptides are produced, Science, 159: 946-952, 1968.
[20] Mas, A., Mushrooms, amatoxins and the liver, Journal of Hepatology, 42: 166-169, 2005.
[21] http://www.genome.jp/kegg/
COMPARATIVE VEGF RECEPTOR TYROSINE KINASE MODELING FOR THE DEVELOPMENT OF HIGHLY SPECIFIC INHIBITORS OF TUMOR ANGIOGENESIS
ULRIKE SCHMIDT 1
JESSICA AHMED 1
MICHAEL HOEPFNER 2
ELKE MICHALSKY 1
ROBERT PREISSNER 1
1 Structural Bioinformatics Group, Institute for Molecular Biology and Bioinformatics, Charité (CBF), Arnimallee 22, 14195 Berlin, Germany, http://bioinformatics.charite.de
2 Molecular Tumor Therapy and Tumor Angiogenesis Group, Institute of Physiology, Charité (CBF), Arnimallee 22, 14195 Berlin, Germany
The Vascular Endothelial Growth Factor receptors (VEGF-Rs) play a significant role in tumor development and tumor angiogenesis and are therefore interesting targets in cancer therapy. Targeting the VEGF-R is of special importance as the supply of the tumor has to be reduced. In general, this can be carried out by inhibiting the tyrosine kinase function of the VEGF-R. Nevertheless, there are some problems with the specificity of known kinase inhibitors: they bind to the ATP-binding site and inhibit a number of kinases; moreover, the most specific inhibitors so far act on at least these three major types of VEGF-Rs: Flt-1, Flk-1/KDR, and Flt-4. The goal is a selective VEGF-R-2 (Flk-1/KDR) inhibitor, because this receptor triggers rather unspecific signals from VEGF-A, -C, -D and -E. Here, we describe a protocol starting from an established inhibitor (Vatalanib) with 2D-/3D-searching and property filtering of the in silico screening hits and the "negative docking approach". With this approach we were able to identify a compound which shows a fourfold higher reduction of the proliferation rate of endothelial cells compared to the reduction effect of the lead structure.
Keywords: VEGF; cancer; tumor angiogenesis; homology modeling; in silico screening; docking
1. Introduction
Angiogenesis, the formation of new blood vessels, normally occurs moderately in adults, e.g. during wound healing and during the menstrual cycle. The process of angiogenesis is regulated by activators and inhibitors [1]. Tumor angiogenesis is the formation of networks of blood vessels supplying the tumor with oxygen and nutrients. Tumor cells induce this process by releasing signaling proteins to the surrounding normal tissue. The most important signaling proteins, which are also released by most of the cancer cells, are the vascular endothelial growth factors (VEGFs). The VEGF family consists of the following secreted glycoproteins: VEGF-A, VEGF-B, VEGF-C, VEGF-D, VEGF-E and the placental growth factors (PlGF-1 and -2)
[2-4]. The VEGFs bind to VEGF receptor (VEGF-R) proteins on the endothelial cell surface with different binding affinities for each of the VEGF-Rs. Expression of VEGF-Rs varies in specific endothelial cell layers. The VEGF-R-2 is located on almost all endothelial cells; however, the VEGF-R-1 and -3 are alternatively located on endothelial cells in distinct vascular layers [5]. Since angiogenesis was found to be necessary for tumor growth [6], the inhibition of pathological angiogenesis is a main goal in cancer therapy. In particular, the VEGF/VEGF-R pathway plays a significant role in the development of angiogenesis and therefore represents a point of interference for therapy in oncology [5]. Different strategies to inhibit tumor angiogenesis exist: it is possible to interfere with angiogenesis from the extracellular as well as from the intracellular side. In the extracellular region, for example, antibodies and soluble receptors can prevent binding of the VEGF to the binding site of the receptor [6]. Moreover, VEGF antagonists block the ligand binding site of the VEGF-R on the extracellular side. Another way is the inhibition of the VEGF-R in the intracellular region by blocking the ATP-binding site of the tyrosine kinase [7]. However, there are some problems concerning the specificity of known tyrosine kinase inhibitors: they bind into the ATP-binding site and inhibit a number of kinases. So far the most specific inhibitors act on the VEGF-Rs. The goal would be to find a selective inhibitor for the VEGF-R-2 (KDR), because it is expressed on almost all endothelial cells and the majority of the effects in angiogenesis, including cell proliferation, micro-vascular permeability [8], invasion, migration, and survival [9, 10], are mediated by VEGF-R-2. To find new compounds by using structure-based drug design, structural information about the target is needed. But today, no complete crystal structures of the VEGF-Rs are available.
Here, we describe a protocol to find novel potential VEGF-R inhibitors starting from an established inhibitor (Vatalanib, see Figure 1) [11]. A known inhibitor was used as lead structure for an in silico two- and three-dimensional search in an "Inhouse" database to identify novel potential VEGF-R tyrosine kinase inhibitors. Moreover, the structures of the ATP-binding site of three VEGF-Rs were modeled, starting from an incomplete crystal structure of the VEGF-R-2. These homology models were then used for comparative docking as qualitative evaluation of the in silico screening results.
2. Methods
The in silico searching protocol consists of several steps, which are described in this section. In Figure 1 the procedure is schematically depicted.
2.1. Compound database
To search for new potential VEGF-R inhibitors we used our Inhouse database, which contains about four million compounds and more than 140 million conformers, which were pre-calculated by using the MedChemExplorer of Accelrys [12, 13]. Around 95% of the compounds stored in the Inhouse database are commercially available for experimental validation.
[Figure 1: flow chart of the screening protocol, starting from the lead structure (Vatalanib)]
Fig. 1. Scheme of the in silico and in vitro screening protocol.
2.2. Two-dimensional searching
To search for similar structures in our Inhouse database we performed a 2D search. The screening is based on the chemical similarity between two molecules according to the similar property principle of Johnson and Maggiora [14].
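The fingerprint comparison behind this 2D search can be sketched as follows. The sketch assumes that each fingerprint is represented as the set of its set-bit positions and uses the Tanimoto coefficient N_ab / (N_a + N_b - N_ab) with the 0.85 cutoff applied in this section. The function names, the inclusive cutoff and the toy fingerprints in the usage note are illustrative assumptions, not part of the described pipeline.

```python
from typing import AbstractSet, Dict, List


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of set-bit positions."""
    n_a = len(fp_a)           # bits set in fingerprint a
    n_b = len(fp_b)           # bits set in fingerprint b
    n_ab = len(fp_a & fp_b)   # bits set in both fingerprints
    denominator = n_a + n_b - n_ab
    # Two empty fingerprints share no information; 0.0 is a choice of this sketch.
    return n_ab / denominator if denominator else 0.0


def screen(lead_fp: AbstractSet[int],
           database: Dict[str, AbstractSet[int]],
           threshold: float = 0.85) -> List[str]:
    """Return the names of database entries at or above the similarity cutoff."""
    return [name for name, fp in database.items()
            if tanimoto(lead_fp, fp) >= threshold]
```

With a hypothetical lead fingerprint `set(range(1, 11))`, an entry `set(range(1, 10))` shares nine of ten bits (similarity 0.9) and passes the cutoff, while an entry `{1, 2, 20}` (similarity 2/11) does not.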
A structural fingerprint [15], a binary string encoding the chemical characteristics of a compound, was calculated for the lead structure as well as for the database compounds. To screen the database, the fingerprint of the lead structure was compared to the fingerprints of the database entries by using the Tanimoto coefficient [16]. The Tanimoto coefficient is defined as:

$$T(a, b) = \frac{N_{ab}}{N_a + N_b - N_{ab}}$$

N_a describes the number of bits which are set to 1 in the fingerprint of compound a, N_b stands for the number of bits which are set to 1 for compound b, and N_ab is the number of bits which compound a and compound b have set to 1 in common. A molecule with a similarity of at least 85% (≥ 0.85) to an active compound is assumed to be biologically active itself [17]. Therefore, only compounds with a similarity of at least 85% to the lead structure were considered.
2.3. Three-dimensional searching
A 3D-similarity search was applied to identify potential scaffold hoppers. For this purpose, the lead structure was compared to the conformers of drug-like compounds stored in our database. A plane representing the moment of inertia was put into all structures. For a comparison of two structures, the long and short sides of the planes were superimposed, which resulted in four different superimposition possibilities. The superimpositions were evaluated by using a scoring function, which includes the number of superimposed atoms and the Root Mean Square Deviation (RMSD). This scoring function is defined as:

$$\text{score} = (\text{percentage of superimposed atoms}) \cdot e^{-\mathrm{RMSD}}$$

2.4. Homology modeling
For homology modeling of the three VEGF-Rs several steps were necessary and were performed with the aid of the Swiss-PdbViewer [18]. A crystal structure of the VEGF-R-2 (PDB code: 1YWN) was obtained from the Protein Data Bank (PDB). This structure is not complete; two gaps are located in and near the ATP-binding site. The ATP-binding pocket was completed by using the SuperLooper web server [19]. Loops were extracted from the LIP database [20] and inserted into the structure via the web service. Furthermore, the completed model of the VEGF-R-2 was used as template structure for the VEGF-R-1 and VEGF-R-3. Finally, the models were subjected to an energy minimization using the respective function of the Swiss-PdbViewer.
2.5. Property filtering
To estimate the drug-likeness of the 2D/3D-searching results, the compounds were filtered according to their molecular properties by using the "Lipinski rule of five". These are four empirical rules, which state that an orally available drug has:
• not more than 5 hydrogen bond donors
• not more than 10 hydrogen bond acceptors
• a molecular weight below 500 g/mol and
• a logP (water/n-octanol partition) < 5.
If a compound breaks more than one rule, it is unlikely to become a drug [21]. Therefore, only compounds with no or at most one violation of the Lipinski rules were considered. The properties were calculated with the Accord for Excel Add-On [22].
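The filter described above can be sketched as follows. This is a minimal illustration with hypothetical names (`Compound`, `lipinski_violations`, `drug_like`); it assumes the four properties have already been computed by some other tool and only counts rule violations.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Compound:
    """Pre-computed properties of one screening hit (hypothetical record type)."""
    name: str
    h_donors: int
    h_acceptors: int
    mol_weight: float  # g/mol
    log_p: float       # water/n-octanol partition coefficient


def lipinski_violations(c: Compound) -> int:
    """Count how many of the four rules listed above are broken."""
    return sum([
        c.h_donors > 5,        # not more than 5 hydrogen bond donors
        c.h_acceptors > 10,    # not more than 10 hydrogen bond acceptors
        c.mol_weight >= 500,   # molecular weight below 500 g/mol
        c.log_p >= 5,          # logP < 5
    ])


def drug_like(compounds: Iterable[Compound], max_violations: int = 1) -> List[Compound]:
    """Keep compounds with no or at most one violation, as in the text."""
    return [c for c in compounds if lipinski_violations(c) <= max_violations]
```

A hypothetical compound with 2 donors, 5 acceptors, 350 g/mol and logP 3 passes with zero violations, whereas one with 6 donors, 11 acceptors and 600 g/mol breaks three rules and is discarded.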
2.6. Docking
To evaluate the remaining drug-like candidates, they were docked into the ATP-binding site of the modeled VEGF-Rs by using the docking program Glide from Schrödinger [23]. The Glide scoring function (Glide SP score) was used to rank the docking results. The docking scores and the visual inspection of the docked ligand-protein complexes were used as qualitative evaluation of the candidates and resulted in a ranking of those compounds. The best molecules were used for further in vitro screening.
2.7. In vitro screening
A kinase assay was used to test the drug candidates for their inhibitory effect on VEGF-Rs. The inhibitory potential is expressed by the IC50 value (the concentration at which kinase activity is reduced to 50%). Cytotoxicity was measured using an LDH assay. The ability to inhibit cell proliferation was tested on different cell lines (endothelial cell line EA-HY 926) for each of the potential angiogenesis inhibitors.
3. Results and Discussion
3.1. Sequence alignment and homology modeling
The sequence alignment of the VEGF-Rs, as shown in Figure 2, is the basis of homology modeling. In a second step the non-identical amino acids of the template structure were exchanged according to the VEGF-R sequences. Only gaps in the ATP-binding pocket were filled in.
[Figure 2: sequence alignment of VEGFR-3, VEGFR-2 and VEGFR-1; the displayed blocks start at residues 827, 816 and 809, respectively]
Fig. 2. Sequence alignment of the three VEGF-Rs after the homology modeling. Amino acid differences in the ATP-binding site are highlighted in black; other differences in grey.
Figure 3 shows a superimposition of the ATP-binding sites of all three homology modeled VEGF-Rs. Different amino acid residues in the ATP-binding site are shown in stick representation.
Fig. 3. Superimposition of the homology models of the VEGF-R-l (light grey), VEGF-R-2 (dark grey) and VEGF-R-3 (black). Different amino acid residues are shown in stick representation.
3.2. In silico screening
The 2D-/3D-similarity screening of the Inhouse database for chemically and structurally similar compounds resulted in about 60 compounds which resemble the lead structure (with a Tanimoto coefficient ≥ 0.85). The number of potential candidates could be reduced to 21 drug-like compounds by applying the Lipinski rule of five as molecular property filter.
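The 3D part of this screen ranks superimpositions with the scoring function given in Section 2.3. A minimal sketch, assuming that function reads score = (percentage of superimposed atoms) · e^(-RMSD) and that the best of the four plane orientations is kept for each conformer; the function names and the example numbers in the usage note are hypothetical.

```python
import math
from typing import Iterable, Tuple

# One superimposition: (percentage of superimposed atoms, RMSD).
# The RMSD unit (presumably Angstrom) is an assumption of this sketch.
Superimposition = Tuple[float, float]


def superimposition_score(percent_superimposed: float, rmsd: float) -> float:
    """Score = percentage of superimposed atoms times exp(-RMSD)."""
    if rmsd < 0:
        raise ValueError("RMSD cannot be negative")
    return percent_superimposed * math.exp(-rmsd)


def best_superimposition(candidates: Iterable[Superimposition]) -> Superimposition:
    """Pick the highest-scoring candidate, e.g. among the four plane orientations."""
    return max(candidates, key=lambda c: superimposition_score(*c))
```

Under this reading, a superimposition covering 70% of the atoms at an RMSD of 0.1 outscores one covering 90% at an RMSD of 1.0, because the exponential term penalizes poor geometric fit strongly.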
3.3. Docking
The remaining 21 structures were docked into the ATP-binding site of the VEGF-Rs. The docking scores and the visual inspection of the docked ligand-receptor complexes were combined as qualitative evaluation of the in silico screening results. The docked structures of the lead compound Vatalanib and of compound 10 in VEGF-R-1, -2 and -3 are shown as examples in Figure 4a-c and Figure 4d-f, respectively.
Fig. 4. Ligand docked into the ATP-binding site (surface representation of the VEGF-Rs). Lead structure (Vatalanib): a) in VEGF-R-1, b) in VEGF-R-2 and c) in VEGF-R-3. Compound 10: d) in VEGF-R-1, e) in VEGF-R-2 and f) in VEGF-R-3.
In Table 1 the docking scores for Vatalanib and compound 10 are listed. The evaluation of the docking results reveals better scores for compound 10 than for the lead structure. This suggests that compound 10 should have similar or even better biological activity. Therefore, compound 10 was one of the 21 substances selected for experimental validation.

Table 1: Docking scores (Glide Score SP)
                    VEGF-R-1   VEGF-R-2   VEGF-R-3
Lead (Vatalanib)    -4.51      -4.27      -4.92
Compound 10         -5.01      -4.86      -5.15
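The comparison in Table 1 can be made explicit with a small calculation. The sketch below uses only the scores from Table 1; the dictionary layout and the function name are illustrative, and it relies on the usual convention that a more negative Glide score indicates a better predicted fit.

```python
# Glide SP docking scores taken from Table 1.
GLIDE_SP = {
    "Vatalanib":   {"VEGF-R-1": -4.51, "VEGF-R-2": -4.27, "VEGF-R-3": -4.92},
    "Compound 10": {"VEGF-R-1": -5.01, "VEGF-R-2": -4.86, "VEGF-R-3": -5.15},
}


def score_difference(ligand: str, reference: str, scores=GLIDE_SP) -> dict:
    """Per-receptor score difference (ligand minus reference), rounded to 2 decimals.

    Negative values mean the ligand scores better than the reference.
    """
    return {
        receptor: round(scores[ligand][receptor] - scores[reference][receptor], 2)
        for receptor in scores[reference]
    }


if __name__ == "__main__":
    for receptor, delta in score_difference("Compound 10", "Vatalanib").items():
        print(f"{receptor}: {delta:+.2f}")
```

The differences are -0.50, -0.59 and -0.23 for VEGF-R-1, -2 and -3, so compound 10 scores better than the lead on all three models, with the largest margin on VEGF-R-2.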
3.4. Experimental validation
The twelve compounds were tested in vitro for VEGF-R kinase activity inhibition, cell proliferation, migration inhibition and cytotoxicity. Figure 5 shows, as an example, the result of a cell proliferation assay on the endothelial cell line EA-HY 926 for compound 10 compared to the lead structure Vatalanib.
It can be concluded that compound 10, at a concentration of 10 µM, reduces cell proliferation by ~40% (light grey), whereas cell proliferation decreases by about 8% when treated with the lead compound. The results shown here confirm the in silico screening results.
[Figure 5: chart "Cell proliferation (EA-HY 926)"; x-axis: concentration [µM]; series: Vatalanib (dark grey), Comp10 (light grey)]
Fig. 5. Cell proliferation assay (endothelial cell line EA-HY 926).
4. Conclusion and Future Work
Using this approach, we were able to identify new potential VEGF-R tyrosine kinase inhibitors. One of the hits was found to have a better effect on the inhibition of cell proliferation than the lead structure. Therefore, we reason that this compound is a specific inhibitor of tumor angiogenesis. This compound will undergo further in vitro and in vivo experiments and will be the starting point for further refinement cycles.
Acknowledgements
This work was supported by the International Research Training Group Boston-Kyoto-Berlin, funded by the DFG.
References
[1] Nishida, N., et al., Angiogenesis in cancer. Vasc Health Risk Manag, 2(3): 213-219, 2006.
[2] Ferrara, N., H.P. Gerber, and J. LeCouter, The biology of VEGF and its receptors. Nat Med, 9(6): 669-676, 2003.
[3] Tischer, E., et al., The human gene for vascular endothelial growth factor. Multiple protein forms are encoded through alternative exon splicing. J Biol Chem, 266(18): 11947-11954, 1991.
[4] Houck, K.A., et al., The vascular endothelial growth factor family: identification of a fourth molecular species and characterization of alternative splicing of RNA. Mol Endocrinol, 5(12): 1806-1814, 1991.
[5] Hicklin, D.J. and L.M. Ellis, Role of the vascular endothelial growth factor pathway in tumor growth and angiogenesis. J Clin Oncol, 23(5): 1011-1027, 2005.
[6] Los, M., J.M. Roodhart, and E.E. Voest, Target practice: lessons from phase III trials with bevacizumab and vatalanib in the treatment of advanced colorectal cancer. Oncologist, 12(4): 443-450, 2007.
[7] Underiner, T.L., B. Ruggeri, and D.E. Gingrich, Development of vascular endothelial growth factor receptor (VEGFR) kinase inhibitors as anti-angiogenic agents in cancer therapy. Curr Med Chem, 11(6): 731-745, 2004.
[8] Dvorak, H.F., Vascular permeability factor/vascular endothelial growth factor: a critical cytokine in tumor angiogenesis and a potential target for diagnosis and therapy. J Clin Oncol, 20(21): 4368-4380, 2002.
[9] Zeng, H., H.F. Dvorak, and D. Mukhopadhyay, Vascular permeability factor (VPF)/vascular endothelial growth factor (VEGF) receptor-1 down-modulates VPF/VEGF receptor-2-mediated endothelial cell proliferation, but not migration, through phosphatidylinositol 3-kinase-dependent pathways. J Biol Chem, 276(29): 26969-26979, 2001.
[10] Millauer, B., et al., High affinity VEGF binding and developmental expression suggest Flk-1 as a major regulator of vasculogenesis and angiogenesis. Cell, 72(6): 835-846, 1993.
[11] Drevs, J., PTK/ZK (Novartis). IDrugs, 6(8): 787-794, 2003.
[12] Smellie, A., et al., Conformational analysis by intersection: CONAN. J Comput Chem, 24(1): 10-20, 2003.
[13] MedChemExplorer, Accelrys Inc., http://www.accelrys.com/dstudio/ds_medchem.
[14] Johnson, M. and G. Maggiora, Concepts and Applications of Molecular Similarity. Wiley, NY, 1998.
[15] 960 bit MDL (Molecular Design Ltd.) MACCS keys.
[16] Delaney, J.S., Assessing the ability of chemical similarity measures to discriminate between active and inactive compounds. Mol Divers, 1(4): 217-222, 1996.
[17] Martin, Y.C., J.L. Kofron, and L.M. Traphagen, Do structurally similar molecules have similar biological activity? J Med Chem, 45(19): 4350-4358, 2002.
[18] Guex, N. and M.C. Peitsch, SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18(15): 2714-2723, 1997.
[19] SuperLooper, http://bioinformatics.charite.de/superlooper. 2007.
[20] Michalsky, E., A. Goede, and R. Preissner, Loops in Proteins (LIP) - a comprehensive loop database for homology modelling. Protein Eng, 16: 979, 2003.
[21] Lipinski, C.A., et al., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev, 46(1-3): 3-26, 2001.
[22] Accelrys Inc., http://accelrys.com/
[23] Schrödinger, Glide, version 4.5, Schrödinger, LLC, New York, NY. 2007.
NETWORK ANALYSIS OF ADVERSE DRUG INTERACTIONS
MASATAKA TAKARABE¹ [email protected]
TOSHIAKI TOKIMATSU¹ [email protected]
SHUJIRO OKUDA¹ [email protected]
SUSUMU GOTO¹ [email protected]
MASUMI ITOH¹ [email protected]
MINORU KANEHISA¹,² [email protected]
¹Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
²Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
Harmful effects associated with the use of drugs arise from their side effects and from the combined use of different drugs. These drug interactions increase or decrease drug effects, or produce new unwanted effects, and are serious problems for medical institutions and pharmaceutical companies. In this study, we created a drug-drug interaction network from drug package inserts and characterized drug interactions. The known information about the potential risk of drug interactions is described in drug package inserts. Japanese drug package inserts are stored in the JAPIC (Japan Pharmaceutical Information Center) database, and GenomeNet provides the GenomeNet pharmaceutical products database, which integrates the JAPIC and KEGG databases. We extracted drug interaction data from GenomeNet, where interactions are classified according to risk as contraindications or cautions for coadministration, and some entries include information about the enzymes metabolizing the drugs. We defined drug targets and drug-metabolizing enzymes as interaction factors using the corresponding information in KEGG DRUG, and classified drugs into pharmacological/chemical subgroups. In the resulting drug-drug interaction network, the drugs that are associated with the same interaction factors are closely interconnected. The mechanisms of these interactions were then identified from each interaction factor. To characterize the other interactions, which lack interaction factors, we used the ATC classification system and found an association between interaction mechanisms and pharmacological/chemical subgroups.
Keywords: drug interaction; network; KEGG
1. Introduction
Adverse drug events caused by drug interactions are significant problems in medication and in the development of new drugs. These drug interactions lead to increased or decreased drug effects or to other serious reactions. For example, cyclosporin, which is widely used as an immunosuppressant drug, is known to interact with many other drugs such as ketoconazole and erythromycin [1, 2]. Cyclosporin is metabolized by CYP3A4, a member of the cytochrome P450 family that catalyzes the oxidation of a number of substrates, whereas ketoconazole and erythromycin inhibit CYP3A4 enzyme activity. Thus, the combined use of these drugs results in delayed clearance and an elevated blood level of cyclosporin, which increases or prolongs both its therapeutic and adverse effects. Assessing and managing such drug interactions are significant problems for clinical practice and drug development. In this study, we focused on adverse drug interactions
and created drug-drug interaction networks to characterize and investigate the drug interactions. To create the drug-drug interaction networks, we extracted drug interaction data from Japanese drug package inserts, which contain known information about the potential risk of drug interactions. The Japanese drug package inserts are stored in the JAPIC (Japan Pharmaceutical Information Center) database [12]. We have integrated the JAPIC and KEGG databases [3] and provide the result as the GenomeNet pharmaceutical products database [13]. Additionally, we defined interaction factors and merged drugs into pharmacological/chemical subgroups to characterize the drug interactions. In the resulting drug-drug interaction networks, drugs that are associated with the same interaction factors are closely interconnected, and the mechanisms of the drug interactions were identified from the interaction factors (the CYP enzyme family or monoamine receptors, for example). Some other drug interactions without interaction factors were characterized by using information from pharmacological/chemical subgroups.
2. Method
2.1. Datasets
The GenomeNet pharmaceutical products database provides Japanese drug package insert data linked to the KEGG DRUG database. Each entry contains information on the brand/generic name, physicochemical/pharmacokinetic properties, drug interactions, etc. The drug interaction section lists the drugs or the classes of drugs that cause adverse interactions with the product, and these interactions are classified according to risk as contraindications or cautions for coadministration. Some entries contain additional sections with information on the enzymes that metabolize the product, such as members of the cytochrome P450 family. Most entries are assigned KEGG DRUG IDs (D numbers), which correspond to the active ingredient of the product. The KEGG DRUG database is a chemical structure-based database in which each entry includes information on chemical structure, efficacy, drug target, pathway, ATC code, etc.
2.2. Drug interaction network
We used the data from the GenomeNet pharmaceutical products database as of March 26, 2008. The database stored 13,973 pharmaceutical product entries, of which 7,562 contained drug interaction information. We extracted drug names from the drug interaction section of each entry and listed the JAPIC IDs that correspond to these drug names to create drug interaction data between JAPIC IDs. Next, JAPIC IDs were merged according to the D numbers assigned to them, because we considered that products with the same medicinal properties have the same potential risk of drug interactions. Consequently, we obtained drug interaction data between D numbers and used these data to create drug interaction networks.
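The merging step described above can be pictured with a minimal sketch. It is not the authors' implementation: the identifiers, the mapping table and the decision to drop pairs that collapse onto a single D number are assumptions made purely for illustration.

```python
# Hypothetical toy data: JAPIC product IDs mapped to KEGG DRUG D numbers.
# All identifiers below are invented for illustration only.
japic_to_d = {
    "J001": "D00001",
    "J002": "D00001",  # second product with the same active ingredient
    "J003": "D00002",
    "J004": "D00003",
}

# Product-level interactions as unordered pairs of JAPIC IDs.
japic_interactions = [
    ("J001", "J003"),
    ("J002", "J003"),  # collapses onto the same D-number pair as above
    ("J003", "J004"),
    ("J001", "J002"),  # both products share one D number
]


def merge_to_d_numbers(pairs, id_map):
    """Collapse product-level pairs into unique, unordered D-number pairs."""
    merged = set()
    for a, b in pairs:
        da, db = id_map.get(a), id_map.get(b)
        # Assumption: skip products without a D number and pairs that
        # map onto one and the same D number.
        if da is None or db is None or da == db:
            continue
        merged.add(tuple(sorted((da, db))))
    return merged


d_interactions = merge_to_d_numbers(japic_interactions, japic_to_d)
print(sorted(d_interactions))
```

Storing each pair as a sorted tuple in a set is one simple way to make the interaction list both undirected and free of duplicates, which is why many product-level pairs reduce to far fewer D-number pairs.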
To characterize the drug interactions, we defined drug targets and drug-metabolizing enzymes as interaction factors for each D number and searched for drug interactions in which both drugs are associated with the same interaction factor. Information on the interaction factors was collected from the package insert data and the KEGG DRUG database. The drug target gene data stored in the KEGG DRUG database were merged by functional type of protein according to KEGG BRITE, which is a collection of hierarchical classifications [3].
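As a rough illustration of this search, the sketch below intersects the interaction factor sets of the two drugs in each interaction and counts how often each factor is shared. The annotations and D numbers are hypothetical; only the idea of a shared interaction factor comes from the text.

```python
from collections import Counter

# Hypothetical annotations: D number -> set of interaction factors.
factors = {
    "D00001": {"CYP3A"},
    "D00002": {"CYP3A", "CYP2D"},
    "D00003": {"Adrenaline receptor"},
    "D00004": set(),
}

# Hypothetical D-number interactions.
interactions = [
    ("D00001", "D00002"),
    ("D00002", "D00003"),
    ("D00003", "D00004"),
]


def shared_factors(pair, annotations):
    """Return the interaction factors associated with both drugs of a pair."""
    a, b = pair
    return annotations.get(a, set()) & annotations.get(b, set())


characterized = {}         # interactions explained by a shared factor
factor_counts = Counter()  # number of interactions per shared factor
for pair in interactions:
    common = shared_factors(pair, factors)
    if common:
        characterized[pair] = common
    factor_counts.update(common)

print(characterized)
print(factor_counts.most_common())
```

Interactions that end up outside `characterized` are the ones that would need a different kind of evidence, such as the pharmacological/chemical subgroups introduced next.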
2.3. Pharmacological/chemical subgroups
We used the Anatomical Therapeutic Chemical classification system (ATC classification system), developed by the WHO Collaborating Centre for Drug Statistics Methodology [14], to group D numbers. The ATC classification system divides drugs at 5 different levels according to their sites of action and their therapeutic and chemical characteristics. Each level is assigned a code consisting of 1 letter or 2 digits, which corresponds to the pharmacological/chemical subgroup at that level. Drugs assigned the same ATC code therefore belong to the same pharmacological/chemical subgroup. Thus, D numbers were grouped into chemical substance subgroups in terms of the pharmacological/chemical categories of the ATC classification system.
3. Results
We extracted 29,663 and 1,196,494 interactions between JAPIC IDs for contraindications and cautions for coadministration, respectively, and merged the JAPIC IDs into D numbers. As a result, 1,513 and 36,040 interactions between D numbers were obtained, respectively (Table 1).

Table 1. Number of drug interactions and entries involved in the interactions.

             Contraindications          Cautions
             Interaction    Entry       Interaction    Entry
JAPIC ID     29,663         3,043       1,196,494      9,432
D number     1,513          517         36,040         1,431
3.1. Interaction factors
We created network graphs from the resulting data on drug interactions and interaction factors. Figure 1 shows the obtained network of contraindications for coadministration. In the network, nodes represent the D numbers that correspond to the drugs, and edges represent interactions. Node sizes are proportional to the number of edges of each node. Bold edges indicate interactions between drugs associated with the same interaction factor and are colored according to that interaction factor.
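The graph construction can be sketched in a few lines. This is a minimal illustration on invented D numbers, assuming an undirected network stored as an adjacency dictionary; the node degree computed here is the quantity to which the node sizes are proportional.

```python
from collections import defaultdict

# Hypothetical D-number interactions (undirected edges).
edges = [
    ("D00001", "D00002"),
    ("D00001", "D00003"),
    ("D00001", "D00004"),
    ("D00002", "D00003"),
]


def build_adjacency(edge_list):
    """Build an undirected adjacency map: node -> set of neighboring nodes."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


adjacency = build_adjacency(edges)

# Degree = number of edges of a node.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}

# Nodes ordered from highest to lowest degree (ties broken by ID);
# the first entries are the hub candidates.
hubs = sorted(degree, key=lambda node: (-degree[node], node))
print(degree)
print(hubs[:3])
```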
Fig. 1. Drug interaction network of contraindications for coadministration. Interaction factors were merged into the CYP enzyme family, monoamine receptor, and others. Bold edges were colored according to these interaction factor groups.
We obtained 12 and 38 interaction factors for contraindications and cautions for coadministration, respectively. Table 2 shows the top 5 interaction factors with which both drugs of an interaction are associated. CYP families and monoamine (adrenaline, serotonin, dopamine, histamine, etc.) receptors are the interaction factors most frequently associated with both drugs of an interaction. Drugs that are associated with the same interaction factor are closely interconnected.
Table 2. Number of interactions and drugs with interaction factor.

Contraindications
Interaction factor      # of interactions    # of drugs
CYP3A                   181                  77
Adrenaline receptor     33                   17
Serotonin receptor      28                   8
CYP2D                   17                   14
CYP1A                   16                   16

Cautions
Interaction factor      # of interactions    # of drugs
CYP3A                   1,916                147
Adrenaline receptor     200                  52
CYP2C                   200                  50
Dopamine receptor       182                  42
CYP1A                   113                  31
Information on the action mechanisms of these interactions is provided in the package inserts. For instance, drug interactions involving CYP families are caused by inhibition/induction of the enzymes and result in an increase/decrease in the effects of the drugs. In the case of drug interactions involving monoamine receptors, both drugs affect the same receptors, which results in additive effects on the receptors. Next, we investigated the other interactions, which lack interaction factors, by using information from pharmacological/chemical subgroups. In the network of contraindications for coadministration, 398 D numbers were assigned ATC codes and merged into 331 pharmacological/chemical subgroups. In the network of cautions for coadministration, 1,042 D numbers were merged into 941 subgroups. To explore an association between interaction mechanisms and pharmacological/chemical subgroups, we searched for hub nodes and for the common pharmacological/chemical categories of their neighboring nodes. Figure 2 shows the example of D00951 (Medroxyprogesterone acetate) and its neighboring nodes with pharmacological/chemical subgroup information in the network of contraindications for coadministration. D00951 interacts with 97 different drugs, of which 43 are included in the most common category "Corticosteroids, plain", which corresponds to the third level ATC code "D07A". The interactions between D00951 and the "Corticosteroids, plain" subgroup increase the risk of side effects of both drugs, such as cardiovascular disease [4, 5, 6].
4. Discussion
We created drug interaction networks from Japanese drug package insert information to explore adverse drug interactions. In the resulting networks, many drugs are associated with the same interaction factors and are closely connected with each other. Consequently, many drugs interact mostly with drugs associated with the same interaction factors. For example, D02211 (Dihydroergotamine mesilate) interacts with 37 different drugs, of which 30 are associated with CYP3A, and D00560 (Pimozide) interacts with 23 different drugs, of which 21 are associated with CYP3A. Dihydroergotamine mesilate and pimozide are reported to be metabolized by CYP3A [7, 8], and coadministration of either drug with CYP3A inhibitors or with other drugs metabolized by CYP3A causes serious side effects such as QT prolongation or ventricular arrhythmia. These interaction factors enabled us to characterize drug interactions and identify their mechanisms, because the interaction mechanisms and clinical symptoms depend on the interaction factors. The resulting drug interaction networks include many nodes and edges. In the network of cautions for coadministration in particular, it is difficult to explore drug interactions from the network graph. For efficient analysis,
elimination of drugs and interactions associated with the same interaction factors may be an effective way to reduce the number of nodes and edges in the drug networks. Next, we used the ATC classification system to investigate interactions between drugs that have no interaction factor information or that are assigned different interaction factors. We applied the information on pharmacological/chemical subgroups to the neighboring nodes of each node and searched for their common pharmacological/chemical categories, which correspond to the third or fourth level of the ATC code. For some drugs, common pharmacological/chemical categories were found among the neighboring nodes, and characteristic interaction mechanisms or clinical symptoms are related to these categories. Figure 2 illustrates one example of the association between interaction mechanisms and pharmacological/chemical subgroups, and Figure 3A shows another. D00386 (Triamterene) interacts with 8 different drugs, of which 6 are classified in the "Acetic acid derivatives" subgroup, and these interactions cause acute renal failure [9, 10]. Figure 3B illustrates the case of D00089 (Oxytocin), whose interactions with drugs of the "Prostaglandins" subgroup enhance the effects of both drugs and lead to serious events [11]. The results indicate that this method using pharmacological/chemical subgroups is effective for investigating drug interactions that lack interaction factor information. However, some drug interactions remain uncharacterized. Further research will need more exhaustive data, including drug interactions, targets and other new pharmacological/chemical properties, to explain the uncharacterized drug interactions.
Acknowledgments
We thank J.B. Brown for critical reading of our manuscript. This work was supported by grants from the Ministry of Education, Culture, Sports, Science and Technology of Japan and the Japan Science and Technology Agency. The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.
Fig. 2. D00951 and its neighboring nodes in the network of contraindications for coadministration. Red nodes represent nodes that are included in the "Corticosteroids, plain" subgroup ("D07A").
Fig. 3. Associations between interaction mechanisms and pharmacological/chemical subgroups in the network of contraindications for coadministration. Red nodes represent nodes that are included in the same pharmacological/chemical subgroup. (A) D00386 (Triamterene) interacts with 6 drugs classified in the "Acetic acid derivatives" subgroup. (B) D00089 (Oxytocin) interacts with 5 drugs classified in the "Prostaglandins" subgroup.
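The neighbor-category search behind Figs. 2 and 3 can be sketched as follows. This is a hedged illustration rather than the authors' code: the neighbor names are placeholders, the ATC codes are used only as format examples, and counting each subgroup at most once per drug is an assumption. The prefix lengths follow the ATC code layout, in which the third level corresponds to the first four characters (as in "D07A").

```python
from collections import Counter

# Number of leading characters of an ATC code that identify each level.
ATC_PREFIX_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_subgroup(code, level):
    """Truncate a full ATC code to the subgroup code of the given level."""
    return code[:ATC_PREFIX_LENGTH[level]]


def most_common_category(neighbor_codes, level=3):
    """Return (subgroup, number of neighbors) for the most frequent subgroup."""
    counts = Counter()
    for codes in neighbor_codes.values():
        # Assumption: a drug with several ATC codes contributes each
        # subgroup only once.
        counts.update({atc_subgroup(code, level) for code in codes})
    return counts.most_common(1)[0] if counts else None


# Placeholder neighbors of one hub node with illustrative ATC codes.
neighbor_codes = {
    "neighbor_a": ["D07AA02"],
    "neighbor_b": ["D07AB09", "A01AC01"],
    "neighbor_c": ["D07AC01"],
    "neighbor_d": ["M01AB01"],
}

top = most_common_category(neighbor_codes)
print(top)
```

Calling `most_common_category(neighbor_codes, level=4)` would repeat the search at the fourth ATC level, which the text also uses.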
References
[1] Wadhwa, N.K., Schroeder, T.J., Pesce, A.J., Myre, S.A., Clardy, C.W., First, M.R., Cyclosporine drug interactions: a review, Ther. Drug Monit., 9(4):399-406, 1987.
[2] Pichard, L., Fabre, I., Fabre, G., Domergue, J., Saint Aubert, B., Mourad, G., Maurel, P., Cyclosporin A drug interactions. Screening for inducers and inhibitors of cytochrome P-450 (cyclosporin A oxidase) in primary cultures of human hepatocytes and in liver microsomes, Drug Metab. Dispos., 18(5):595-606, 1990.
[3] Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., Yamanishi, Y., KEGG for linking genomes to life and the environment, Nucleic Acids Res., 36:D480-D484, 2008.
[4] Falkeborn, M., Persson, I., Adami, H.O., Bergstrom, R., Eaker, E., Lithell, H., Mohsen, R., Naessen, T., The risk of acute myocardial infarction after oestrogen and oestrogen-progestogen replacement, Br. J. Obstet. Gynaecol., 99(10):821-828, 1992.
[5] Lacroix, K.A., Bean, C., Reilly, R., Curran-Celentano, J., The effects of hormone replacement therapy on antithrombin III and protein C levels in menopausal women, Clin. Lab. Sci., 10(3):145-148, 1997.
[6] Al-Farra, H.M., Al-Fahoum, S.K., Tabbaa, M.A., First, M.R., Effect of hormone replacement therapy on hemostatic variables in post-menopausal women, Saudi Med. J., 26(12):1930-1935, 2005.
[7] Moubarak, A.S., Rosenkrans, C.F. Jr., Johnson, Z.B., Modulation of cytochrome P450 metabolism by ergonovine and dihydroergotamine, Vet. Hum. Toxicol., 45(1):6-9, 2003.
[8] Desta, Z., Kerbusch, T., Soukhova, N., Richard, E., Ko, J.W., Flockhart, D.A., Identification and characterization of human cytochrome P450 isoforms interacting with pimozide, J. Pharmacol. Exp. Ther., 285(2):428-437, 1998.
[9] Favre, L., Glasson, P., Vallotton, M.B., Reversible acute renal failure from combined triamterene and indomethacin: a study in healthy subjects, Ann. Intern. Med., 96(3):317-320, 1982.
[10] Favre, L., Vallotton, M.B., Relationship of renal prostaglandins to three diuretics, Prostaglandins Leukot. Med., 14(3):313-319, 1984.
[11] Tomialowicz, M., Florjanski, J., Zimmer, M., The use of oxytocin and prostaglandin in pregnancies after cesarean delivery or uterine surgery, Ginekol. Pol., 71(4):242-246, 2000.
[12] http://database.japic.or.jp/nw/index
[13] http://www.genome.jp/kusuri/
[14] http://www.whocc.no/atcddd/
SAMPLING GEOMETRIES OF PROTEIN-PROTEIN COMPLEXES
AYSAM GUERLER [email protected]
STEPHAN LORENZEN [email protected]
FLORIAN KRULL [email protected]
ERNST-WALTER KNAPP [email protected]
Freie Universität Berlin, Department of Chemistry and Biochemistry, Fabeckstr. 36a, 14195, Berlin-Dahlem, Germany
Protein-protein docking is a major task in structural biology. In general, the geometries of protein pairs are sampled by generating docked conformations, analyzing them with scoring functions and selecting appropriate geometries for further refinement. Here, we present an algorithm in real space to sample geometries of protein pairs. To this end, we initially determine uniformly distributed points on the surfaces of the two protein structures to be docked and additionally define a set of uniformly distributed rotations. Then, the sampling method generates structures of protein pairs as follows: (i) we rotate one protein of the protein pair according to a selected rotation and (ii) translate it along a line connecting two surface points belonging to different proteins such that these surface points coincide. The resulting protein pair geometries are then analyzed and selected using a scoring function that considers residues and atom pairs. We applied this approach to a set of 22 enzyme-inhibitor complexes and demonstrate that a discretisation of the rigid-body search in real space provides an efficient and robust sampling scheme. Our method generates decoy sets with a considerable fraction of near-native geometries for all considered enzyme-inhibitor complexes.
Keywords: protein-protein docking; rigid-body geometry search; interface analysis
1. Introduction
Proteins are important regulators of biochemical processes in biological cells. For instance, they catalyze chemical reactions, transport substrates through membranes and stabilize cellular structures. Interactions with other molecules can affect a protein's macromolecular structure and functionality. For proteins whose function is to form specific complexes with other proteins, the shape of the contact surface and the residue pair interactions at the contact surface are especially relevant [1]. This protein-protein interaction obeys the key-lock principle and is driven by free energy contributions, resulting in high binding affinities. Binding can influence the function of proteins in diverse ways, from total inhibition to enhancement or induction. Although genome-wide proteomics studies indicate that many proteins interact with each other, the number of complexes in the Protein Data Bank (PDB) increases very slowly. Possibly, this is related to the instability of transient protein-protein interactions, which makes a crystallographic analysis difficult. Therefore, theoretical approaches for the identification and prediction of protein-protein interactions can be of great importance. Many efforts have been made to find a computational solution to this problem. Unlike the prediction of the binding modes for small molecules (e.g. FlexX [2],
ICM [3] and Fado [4]), most protein-protein docking approaches consider the structures of the individual proteins in the complex to be rigid. Initially, a wide variety of docked conformations are generated and simultaneously evaluated by scoring functions. In general, these methods perform well when applied to individual protein conformations that are taken directly from the corresponding co-crystallized structures. However, predicting protein complex geometries using protein structures obtained from separate crystallization assays remains difficult, often leading to many false positives. The binding process often involves conformational changes. Although these are generally subtle, they make it more difficult to find the proper complex geometry. Therefore, a further refinement of the proposed complex geometries by other methods, e.g. Monte Carlo approaches, is often necessary. Currently, most established methods for rigid-body analysis of protein-protein interactions are based on the convolution technique in Fourier space as initially utilized by Katchalski-Katzir et al. in 1992 [5]. These approaches include ZDOCK [6], MolFit [7], 3D-Dock [8], DOT [9], GRAMM [10] and others. These methods use a scoring function defined on a discrete grid for each of the two proteins. Instead of evaluating the scoring function in real space, which is computationally expensive, the values of the scoring function are obtained by multiplying the corresponding Fourier transformed grids. This is done by assigning the atomic interaction parameters for each protein on separate grids, which are subsequently transformed by the fast Fourier transform (FFT) algorithm. In Fourier space the Fourier coefficients are multiplied and the results are transformed back to real space. This is done for a large set of protein orientations [5]. Besides the FFT-based approaches, a variety of other procedures have also been applied to the protein-protein docking problem. Nussinov et al. proposed an algorithm based on geometric matching of knobs on the interacting surfaces [11]. Others, such as Baker [12] and Abagyan [13], have developed highly accurate methods using Monte Carlo simulations. The protein complex geometries are clustered [14] and their stability is analyzed by perturbation studies using different scoring functions [15]. The development of proper scoring functions is a non-trivial problem in protein-protein docking. A large variety of scoring functions attempt to capture the biophysically relevant properties for protein complex formation, e.g. interactions based on physical principles, on residue pair distributions or on geometric fit [16-20]. In this work, we describe a real space rigid-body protein-protein docking approach. Instead of assigning atom specific interaction parameters to each grid point, as necessary for FFT methods, we can take into consideration all interactions of atom pairs within a certain cutoff distance from the protein surfaces. In order to reduce the computational costs in real space, an efficient sampling strategy of the search space is used, which in turn allows additional parameters to be considered in the scoring function. Two proteins are translated and rotated by a discrete set of transformations. To obtain the corresponding parameters for the transformations, the protein surfaces are uniformly covered by surface points. In addition, a set Q of uniformly distributed quaternions is generated from which the rotations are obtained. The translational vector is defined by the line connecting the
pair of surface points selected from each of the two proteins. The residues interacting in the resulting geometry are evaluated by a statistical scoring function, which comprises geometrical and physicochemical components by considering residue pairs and atom pairs. The parameters of the scoring function were determined by Heuser et al. for enzyme-inhibitor complexes [20, 21].
2. Methods
2.1. Preparing surface and grid representation
From now on, we call the smaller of the two proteins the ligand (L) and the larger the receptor (R). We embed both proteins in a grid with a grid constant of 1.0 Å. Points of the receptor grid GR that lie in the van der Waals (vdW) sphere of a receptor atom (radius of 1.8 Å for all atoms) are inside the receptor and are marked as receptor points. Grid points outside the vdW volume of the corresponding protein carry a neighbor list of the protein atoms that are within a distance cutoff of rcut(neighbor) = 7 Å. This neighbor list provides an efficient way to find atomic interaction partners between the two proteins in the complex structure.
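The classification of a single grid point can be sketched as below. The two radii come from the text; everything else (the function name, the toy coordinates and the brute-force distance loop) is an assumption for illustration, and a practical implementation would loop over all grid points and use a spatial index instead.

```python
import math

VDW_RADIUS = 1.8       # Å, same radius for all atoms (value from the text)
NEIGHBOR_CUTOFF = 7.0  # Å, rcut(neighbor) (value from the text)


def classify_grid_point(point, atoms):
    """Classify one grid point relative to a protein.

    Returns ("inside", []) if the point lies in the vdW sphere of any atom,
    otherwise ("outside", indices of atoms within the neighbor cutoff).
    """
    distances = [math.dist(point, atom) for atom in atoms]
    if any(d <= VDW_RADIUS for d in distances):
        return "inside", []
    neighbors = [i for i, d in enumerate(distances) if d <= NEIGHBOR_CUTOFF]
    return "outside", neighbors


# Hypothetical atom coordinates in Å.
atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]

inside_result = classify_grid_point((1.0, 0.0, 0.0), atoms)
outside_result = classify_grid_point((0.0, 5.0, 0.0), atoms)
print(inside_result)
print(outside_result)
```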
Fig. 1. Generation of neighbor list and surface points. Small spheres denote the protein atoms. a) Atom neighbor list of a reference grid point (center of large sphere) contains the numbers of atoms within the cut-off distance (largest sphere). b) Initial surface points (thicker red points of the grid) are all grid points that are within a specified minimal and maximal distance (medium size blue spheres denoted by dashed lines) to the nearest protein atoms. c) The initial surface points are translated towards the center of the nearest protein atom until the vdW surface of the atom is reached (blue points on the surface of the gray spheres).
For both proteins (ligand and receptor) the grids are also used to determine surface points and surface normal vectors (see Fig. 1 for more details). In a first approximation the protein surface points are those grid points whose distances to the nearest protein atoms are between 4.0 and 6.0 Å. These points are then projected onto the vdW surface of the nearest atom sphere. For each such surface point, we calculate a surface normal vector connecting the assigned atom center with the surface point. Then, we compute for
all atoms of a residue the average of the surface normal vectors. Now we reduce the number of surface points. To obtain an even distribution of surface points we randomly select a single surface point and delete all other surface points within a distance of rcut(surface) = 7 Å. Next, we select the nearest remaining surface point and repeat the procedure until all surface points have been selected or deleted. We denote the resulting sets of surface points SR and SL and the sets of corresponding normal vectors VR and VL for the receptor and ligand, respectively. For the rotations a set Q of 8000 uniformly distributed quaternions is calculated with the approach described by Kuffner [22].
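The thinning procedure just described can be sketched as a short greedy loop. This is a minimal illustration under stated assumptions: the function name, the fixed random seed and the toy points on a line are invented, and points exactly at the cutoff distance are kept.

```python
import math
import random


def thin_surface_points(points, r_cut=7.0, seed=0):
    """Greedily thin surface points so that kept points are >= r_cut apart."""
    if not points:
        return []
    rng = random.Random(seed)
    remaining = list(points)
    # Start from a randomly chosen surface point.
    current = remaining.pop(rng.randrange(len(remaining)))
    selected = [current]
    while True:
        # Delete every remaining point closer than r_cut to the current one.
        remaining = [p for p in remaining if math.dist(p, current) >= r_cut]
        if not remaining:
            break
        # Continue with the nearest surviving point.
        current = min(remaining, key=lambda p: math.dist(p, current))
        remaining.remove(current)
        selected.append(current)
    return selected


# Toy example: 21 points spaced 1 Å apart on a line.
points = [(float(i), 0.0, 0.0) for i in range(21)]
selected = thin_surface_points(points)
print(selected)
```

Because each newly selected point has survived the deletion step of all previously selected points, the kept points are pairwise at least `r_cut` apart, and every discarded point lies within `r_cut` of a kept one.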
2.2. Sampling strategy
During the generation of the protein-protein geometries (called decoys), the receptor stays fixed, while the ligand is moved, i.e. translated and rotated. A decoy is defined by the triplet [q(k), sR(i), sL(j)] of a quaternion q(k) ∈ Q and surface points sR(i) and sL(j) of receptor and ligand, respectively. For each pair [sR(i), sL(j)] of surface points we compute the angle between the corresponding surface normal vectors.

$$ g_{\mathrm{feature}} = \sum_{m,n \in \mathrm{features}} C_{\mathrm{feature}}(m,n)\, W_{\mathrm{feature}}(m,n) \qquad (1) $$

where $C_{\mathrm{feature}}(m,n)$ is the number of interactions occurring for an atom pair type with the features m and n, and $W_{\mathrm{feature}}(m,n)$ is the corresponding element of the weighting matrix. The total score $g_{\mathrm{total}}$ of a generated decoy is defined by

$$ g_{\mathrm{total}} = g_{\mathrm{atom}} \cdot g_{\mathrm{residue}} \qquad (2) $$

where atom-based and residue-based weighting matrices $W_{\mathrm{atom}}$ and $W_{\mathrm{residue}}$ are employed.
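To make eqs. (1) and (2) concrete, the following sketch evaluates both on toy data. The weights and contact lists are invented and do not correspond to the published parameters; treating a feature pair as an ordered (receptor feature, ligand feature) tuple is also an assumption.

```python
from collections import Counter


def feature_score(contacts, weights):
    """Eq. (1): sum over feature pairs (m, n) of C(m, n) * W(m, n).

    contacts: list of observed (m, n) feature pairs, one per interaction.
    weights:  dict mapping (m, n) to the weighting matrix element W(m, n).
    """
    counts = Counter(contacts)  # C(m, n): number of interactions per pair type
    return sum(c * weights.get(pair, 0.0) for pair, c in counts.items())


def total_score(atom_contacts, residue_contacts, w_atom, w_residue):
    """Eq. (2): product of the atom-based and residue-based scores."""
    return (feature_score(atom_contacts, w_atom)
            * feature_score(residue_contacts, w_residue))


# Invented toy weights and contacts, for illustration only.
w_atom = {("C", "C"): 0.5, ("N", "O"): 2.0}
w_residue = {("ASP", "LYS"): 1.5, ("LEU", "LEU"): 1.0}
atom_contacts = [("C", "C"), ("C", "C"), ("N", "O")]
residue_contacts = [("ASP", "LYS"), ("LEU", "LEU")]

g_atom = feature_score(atom_contacts, w_atom)
g_residue = feature_score(residue_contacts, w_residue)
g_total = total_score(atom_contacts, residue_contacts, w_atom, w_residue)
print(g_atom, g_residue, g_total)
```

With these toy numbers the atom-based score is 2 · 0.5 + 1 · 2.0 = 3.0, the residue-based score is 1.5 + 1.0 = 2.5, and their product is 7.5.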
3. Results
3.1. Docking performance
We applied the described sampling approach to a set of 22 enzyme-inhibitor complexes (see Fig. 2 for a list of the corresponding PDB codes) from the ZDOCK 1.0 benchmark set [23]. We generated a set of uniformly distributed surface points for each individual protein structure using rcut(surface) = 7 Å. Thus, we obtained on average 60 surface points for the receptors and 25 for the ligands (Fig. 2), yielding about 1500 pairs of
surface points per protein complex on average. Hence, we consider 1500 translations and 8000 rotations and check that the normal vectors of the selected surface point pair enclose an angle larger than α_threshold. This yields decoys in the range of 10^7 per protein pair. For each of these decoys we verified that at most 10% of the ligand atoms overlap with receptor points. For the remaining decoys the scoring function g_total, eq. (2), was evaluated, keeping for each rotation only the decoy with the highest score. This results in about 8000 decoys per protein-protein complex. Figure 3 shows the number of generated near-native receptor-ligand geometries with an interface root mean square displacement (iRMSD) relative to the native complex structure below 5.0 Å. On average, about 50 near-native decoys out of the 8000 were generated per protein structure pair. For 1UDI, only 15 near-native decoys were generated, while the maximum number of 650 near-native decoys was obtained for 1BRC (Fig. 3). With a higher density of surface points using rcut(surface) = 3 Å the results remained qualitatively similar.

Fig. 2. Number of surface points of the considered 22 protein complexes consisting of receptor and ligand protein pairs (receptors: diamonds; ligands: squares).

Fig. 3. Results of the protein docking approach. For each protein complex the highest ranked decoy per rotation was kept (about 8000 decoys per complex in total). The diamonds illustrate the number of decoys with an interface RMSD (iRMSD) below 5.0 Å.
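The geometric core of decoy generation, a rotation followed by the translation that makes the two surface points coincide, can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are invented, rotations are taken about the coordinate origin, and no value is assumed for the angle threshold.

```python
import math


def rotate(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + q_vec x t
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )


def place_ligand(q, s_receptor, s_ligand, ligand_atoms):
    """Rotate the ligand by q, then shift it so that the rotated ligand
    surface point coincides with the receptor surface point."""
    rotated_point = rotate(q, s_ligand)
    shift = tuple(r - p for r, p in zip(s_receptor, rotated_point))
    return [
        tuple(c + d for c, d in zip(rotate(q, atom), shift))
        for atom in ligand_atoms
    ]


def normal_angle(n_a, n_b):
    """Angle in degrees between two (not necessarily unit) normal vectors."""
    dot = sum(a * b for a, b in zip(n_a, n_b))
    norm = math.hypot(*n_a) * math.hypot(*n_b)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


# Toy example: rotation by 90 degrees about the z axis.
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
s_ligand = (1.0, 0.0, 0.0)
s_receptor = (5.0, 5.0, 5.0)
placed = place_ligand(q, s_receptor, s_ligand, [s_ligand, (0.0, 0.0, 0.0)])
print(placed)
print(normal_angle((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)))
```

In a full sampling loop, the rotated ligand normal would be compared with the receptor normal via `normal_angle` before the overlap check and the scoring are applied to the placed ligand.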
3.2. Sampling of a serine-protease-inhibitor complex
In the following, we briefly illustrate the sampling results obtained for the first enzyme-inhibitor complex of the ZDOCK 1.0 benchmark set, which is a serine-protease-inhibitor complex (1ACB) [24]. We applied the algorithm to the separately crystallized protein structures 5CHA and 1CSE. The surface of the serine-protease was covered with 55 and the inhibitor with 23 surface points. With the uniform set of 8000 rotations, more than 10^7 decoys were generated. Less than 5% (387,047 in total) of these decoys fulfilled the geometrical criteria probing the overlap of receptor points with ligand atoms and the angle between the normal vectors of the assigned surface points. We calculated the iRMSD of these decoys relative to the native reference complex, which was generated by aligning the separately crystallized protein structures on the co-crystallized true native complex structure. The iRMSD of this reference structure with the true native complex structure is 0.7 Å. About 10% of the decoys have an iRMSD below 10 Å to the reference complex. The decoys were scored and the highest ranked decoy per rotation was kept (see details in 2.2), resulting in 8000 decoys. Figure 4 shows the scores with respect to the iRMSD for the 2000 highest ranked decoys. The complete set of 8000 decoys comprises 186 cases with an iRMSD below 5.0 Å, whereby the decoy with the lowest iRMSD of 4.8 Å is ranked at position 33. Considering all 8000 decoys, in eleven cases an iRMSD below 2.5 Å was detected. Here, the highest rank is 1743 with an iRMSD of 2.1 Å.
Fig. 4. Diagram correlating, for the protein complex 1ACB [24], the iRMSD of the 2000 highest ranked decoys with the corresponding scores given by eq. (2).
Figure 5a shows the surface of the serine-protease and the centers of mass of the inhibitor coordinates (dots) in the 8000 generated decoys. In Fig. 5b the serine-protease is shown together with the inhibitor in the native reference structure. The conserved residues of the serine-protease detected with BLAST [25] and CLUSTALW [26] are highlighted in dark red (Fig. 5b). It is evident that the residues in the interface between serine-protease and the inhibitor are highly conserved. Furthermore, we find that the binding cavity allows a better geometric match between the two protein structures than any other region detected on the serine-protease surface. Probably, the physicochemical specificity and the geometrical fit both contribute to the large number of hits in the generated decoy set.
Fig. 5. Illustration of the docking results for the protein complex 1ACB. a) Surface of the receptor with the centers of mass of the 8000 highest ranked decoys (green dots). b) Surface of the receptor and cartoon of the ligand structure (dark blue). The conserved residues of the receptor are highlighted in dark red.
To refine and rerank the decoys obtained with the initial sampling procedure, we performed a Monte Carlo stability analysis [27] using the program ROTAFIT [28]. Briefly, this procedure uses the 2000 top ranked decoys to perform 500 steps of a replica exchange Monte Carlo simulation using 10 replicas. After the simulation, the pair-wise iRMSDs of the last 250 time steps of the five lowest temperature replicas of each decoy are calculated. We then plot the number of structure pairs below a given iRMSD threshold versus the iRMSD threshold. The structural stability score of a decoy is calculated as the integral under this curve. Near-native decoys show a considerably higher structural stability score than false hits (Fig. 6).
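The stability integral described above can be sketched as follows. This is a minimal illustration and not the ROTAFIT implementation: the function name, the threshold grid (0 to 10 Å in steps of 0.5 Å) and the trapezoidal rule are assumptions, and the input is taken to be a precomputed list of pairwise iRMSD values for one decoy.

```python
def stability_score(pairwise_irmsd, max_threshold=10.0, step=0.5):
    """Integrate the number of structure pairs below an iRMSD threshold
    over the threshold (trapezoidal rule on a regular grid)."""
    n_steps = int(round(max_threshold / step))
    thresholds = [i * step for i in range(n_steps + 1)]
    # Number of structure pairs whose iRMSD lies below each threshold.
    counts = [sum(1 for d in pairwise_irmsd if d < t) for t in thresholds]
    area = 0.0
    for i in range(n_steps):
        area += 0.5 * (counts[i] + counts[i + 1]) * step
    return area


# Hypothetical data: a stable decoy keeps its structures close together,
# an unstable one drifts apart during the simulation.
stable_pairs = [0.5] * 10
unstable_pairs = [8.0] * 10
print(stability_score(stable_pairs) > stability_score(unstable_pairs))  # True
```

A decoy whose simulated structures stay close to each other accumulates pairs at small thresholds and therefore receives a larger integral, which is the behaviour reported for near-native decoys.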
[Figure 6: scatter plot not reproduced; axis label: stability integral]
Fig. 6. Structural stability scores of the first 2000 decoys versus iRMSD.
4. Discussion
Initial-stage approaches in protein-protein docking are commonly based on the fast Fourier transform technique (FFT approach). This method is well established and capable of searching an extensive variety of receptor-ligand geometries. However, the FFT approach carries inherent limitations. It can account only for interactions between pairs of coinciding grid points from the two proteins, where the contribution to the scoring function is given as the product of the parameters of the two corresponding grid points. The real space sampling of decoys allows more general expressions to be used for the scoring function. The present application to a set of 22 enzyme-inhibitor complexes demonstrated that efficient sampling and scoring of receptor-ligand geometries in real space is computationally feasible. The method provides decoy sets with near-native geometries for all of the 22 enzyme-inhibitor complexes considered. The analysis of 1ACB, a serine-protease-inhibitor complex, emphasizes that the method is capable of generating a large fraction of near-native binding modes (see Fig. 3). In 186 out of 8000 cases, the protein-protein decoys exhibit an iRMSD of less than 5.0 Å. The highest number of near-native binding modes, 650, was generated for the protein complex 1BRC. The subsequent rescoring by structural stability analysis greatly improves the rank of near-native decoys. Interestingly, the stability integral proved to be a better way of identifying near-native decoys than various energy functions (data not shown).

In future studies, we plan to use our method to evaluate a variety of other all-atom, heavy-atom, or residue-based scoring functions that can be described as a summation of weighted amino acid or atom pair interactions (see 2.2). We will also try to implement new scoring schemes. Here, a preliminary analysis of potential interface residues can be of particular interest. It can significantly improve the performance, since the described real space approach can use preliminary residue selections to reduce the search space or to increase the surface resolution at particular protein surface sites. In addition, clustering the generated decoys can be used to improve the detection of near-native complex structures. Finally, we aim to incorporate further rigid-body optimization procedures and perturbation studies to evaluate the stability of docked conformers, as well as approaches to model the intramolecular flexibility of the two interacting protein structures.

Acknowledgments
We would like to thank Connie Wang for useful contributions. This study was funded by the International Research Training Group (IRTG) on "Genomics and Systems Biology of Molecular Networks" (GRK1360, Deutsche Forschungsgemeinschaft (DFG)).
References
[1] Shulman-Peleg, A., Shatsky, M., Nussinov, R., and Wolfson, H. J., Spatial chemical conservation of hot spot interactions in protein-protein complexes, BMC Biology, 5, 2007.
[2] Rarey, M., Kramer, B., Lengauer, T., and Klebe, G., A fast flexible docking method using an incremental construction algorithm, J. Mol. Biol., 261:470-89, 1996.
[3] Totrov, M. M., Abagyan, R. A., and Kuznetsov, D., ICM - a new method for protein modeling and design. Applications to docking and structure prediction from the distorted native conformation, J. Comp. Chem., 15:488-506, 1994.
[4] Guerler, A., Moll, S., Weber, M., Meyer, H., and Cordes, F., Selection and flexible optimization of binding modes from conformation ensembles, Biosystems, 92:42-8, 2008.
[5] Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A. A., Aflalo, C., and Vakser, I. A., Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques, PNAS, 89:2195-9, 1992.
[6] Chen, R., Li, L., and Weng, Z., ZDOCK: an initial-stage protein-docking algorithm, Proteins, 52:80-7, 2003.
[7] Heifetz, A., Katchalski-Katzir, E., and Eisenstein, M., Electrostatics in protein-protein docking, Protein Science, 11:571-87, 2002.
[8] Gabb, H. A., Jackson, R. M., and Sternberg, M. J. E., Modelling protein docking using shape complementarity, electrostatics and biochemical information, J. Mol. Biol., 272, 1997.
[9] Mandell, J. G., Roberts, V. A., Pique, M. E., Kotlovyi, V., Mitchell, J. C., Nelson, E., Tsigelny, I., and Eyck, L. F. T., Protein docking using continuum electrostatics and geometric fit, Protein Eng., 14:105-13, 2001.
[10] Tovchigrechko, A. and Vakser, I. A., Development and testing of an automated approach to protein docking, Proteins, 60, 2005.
[11] Duhovny, D., Nussinov, R., and Wolfson, H. J., Efficient unbound docking of rigid molecules, Lecture Notes in Computer Science, 2452:185-200, 2002.
[12] Gray, J. J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C. A., and Baker, D., Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations, J. Mol. Biol., 331:281-99, 2003.
[13] Fernandez-Recio, J., Totrov, M., and Abagyan, R., ICM-DISCO docking by global energy optimization with fully flexible side-chains, Proteins, 52:113-7, 2003.
[14] Lorenzen, S. and Zhang, Y., Identification of near-native structure by clustering protein docking conformations, Proteins, 68:187-94, 2007.
[15] Kozakov, D., Schueler-Furman, O., and Vajda, S., Discrimination of near-native structure in protein-protein docking by testing the stability of local minima, Proteins, 72(3):993-1004, 2008.
[16] Kortemme, T. and Baker, D., Computational design of protein-protein interactions, Curr. Opin. in Struct. Biol., 8:91-7, 2004.
[17] Lei, H. and Duan, Y., Incorporating intermolecular distance into protein-protein docking, Protein Eng., 17:837-45, 2004.
[18] Keskin, O., Ma, B., Nussinov, R., Hot regions in protein-protein interactions: The organization and contribution of structurally conserved hot spot residues, J. Mol. Biol., 345:1281-94, 2005.
[19] Shulman-Peleg, A., Shatsky, M., Nussinov, R., Wolfson, H. J., Spatial chemical conservation of hot spot interactions in protein-protein complexes, BMC Biology, 5, 2007.
[20] Heuser, P., Schomburg, D., Combination of scoring schemes for protein docking, BMC Bioinformatics, 8, 2007.
[21] Heuser, P., Schomburg, D., Optimised amino acid specific weighting factors for unbound protein docking, BMC Bioinformatics, 7, 2006.
[22] Kuffner, J. J., Effective sampling and distance metrics for 3D rigid body path planning, In Proc. IEEE Int. Conf. on Robotics and Automation, 2004.
[23] Chen, R., Mintseris, J., Janin, J., Weng, Z., A protein-protein docking benchmark, Proteins, 52:88-91, 2003.
[24] Frigerio, F., Coda, A., Pugliese, L., Lionetti, C., Menegatti, E., Amiconi, G., Schnebli, H. P., Ascenzi, P., Bolognesi, M., Crystal and molecular structure of the bovine α-chymotrypsin-eglin c complex at 2.0 Å resolution, J. Mol. Biol., 225:107-23, 1992.
[25] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, 25:3389-402, 1997.
[26] Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson, J. D., Gibson, T. J., Higgins, D. G., ClustalW and ClustalX version 2.0, Bioinformatics, 21:2947-8, 2007.
[27] Lorenzen, S., Detecting near-native docking decoys by Monte Carlo stability analysis, Genome Informatics, 18:206-14, 2007.
[28] Lorenzen, S., Zhang, P. F., Monte Carlo refinement of rigid-body protein docking structures with backbone displacement and side-chain optimization, Protein Science, 16:2716-25, 2007.
COMPUTER AIDED OPTIMIZATION OF CARBON ATOM LABELING FOR TRACER EXPERIMENTS
BENJAMIN SEFA MENKÜC  benjamin@menkuec.de
CHRISTOPH GILLE  christoph.gille@charite.de
HERMANN-GEORG HOLZHÜTTER  hergo@charite.de
Medical Faculty of the Humboldt University Berlin, Charité, Institute of Biochemistry, Monbijoustr. 2, 10117 Berlin, Germany

Isotopomer tracer experiments are indispensable for the determination of flux rates in already known pathways as well as for the identification of new pathways. The information gained from such experiments depends on the labeling of the feed tracer metabolite, i.e. the atom positions carrying a label. Here we present an algorithm and a software tool to find an optimal carbon labeling pattern that ensures that the label disseminates predominantly into those parts of the network that are under study. Our implementation is based on carbon fate maps and distinguishes between homotopic and prochiral atoms. In addition, the software can be used to generate carbon transition probability matrices, which can be used for the study of biochemical reaction mechanisms. In this article we present the algorithms and show an application of the software to glycolysis and the TCA cycle.
Keywords: isotopomer tracer experiments; metabolic network; compound transition matrix; systems biology
1. Introduction
Isotopomer tracer experiments are essential for determining flux rates of known pathways [4, 8] as well as for elucidating new pathways [3, 11] and reaction mechanisms. In typical isotopomer tracer experiments labeled compounds are taken up by an organism and the distribution of the label within certain compounds is observed. The set of metabolites which will carry the label after a certain amount of time depends on the labeling pattern of the feed metabolite. For the analysis it is advantageous to achieve a preferential labeling of those metabolites that belong to the pathway under investigation. If, however, the label is disseminated into many different pathways, the labeled fraction of each metabolite is low and the accuracy of the measurement is therefore reduced. Computer programs already exist to simulate the distribution of labels in metabolic networks over time. Simulation of isotopomer tracer experiments requires atom mappings between substrates and products [2]. The mapping can be constructed using common subgraphs in molecule structures [1]. Another database of atom correspondence [10] was established by Mu et al. using an MCS (maximum common subgraph) algorithm [5]. In this database prochirality is considered and the systematic InChI naming scheme for compound atoms is used. Here we present a novel software tool to track the distribution of the label using a graphical pathway view. It can be used to optimize the labeling pattern by probing all possible feed labelings.
2. Methods
Using the carbon fate maps [10] we created carbon transition probability matrices. These matrices enable us to compute the fate of a labeled carbon atom in a given reaction. For example, the glycolytic enzyme Fructose Bisphosphate Aldolase (R01070), which converts beta D-Fructose 1,6-Bisphosphate (C05378) to Dihydroxyacetone Phosphate (C00118) and Glyceraldehyde 3-Phosphate (C00111), has two transition matrices:

$$
P_{C05378,C00118} = \begin{pmatrix} 0&0&0&1&0&0\\ 1&0&0&0&0&0\\ 0&0&1&0&0&0 \end{pmatrix},
\qquad
P_{C05378,C00111} = \begin{pmatrix} 0&0&0&0&1&0\\ 0&1&0&0&0&0\\ 0&0&0&0&0&1 \end{pmatrix}
\tag{1}
$$
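As a minimal sketch of how the matrices in Eq. (1) are used, the code below multiplies each of them with a label vector of the substrate. The helper name `apply_transition` and the chosen feed labeling (only substrate atom 1 labeled) are assumptions made for illustration; rows are read as product atoms and columns as substrate atoms.

```python
# Transition matrices from Eq. (1): rows = product atoms, columns = substrate atoms.
P_C05378_C00118 = [
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
P_C05378_C00111 = [
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]


def apply_transition(matrix, label_vector):
    """Return the product label vector P * l (plain matrix-vector product)."""
    return [sum(p * l for p, l in zip(row, label_vector)) for row in matrix]


# Hypothetical feed: only substrate atom 1 carries the label.
label_C05378 = [1, 0, 0, 0, 0, 0]
label_C00118 = apply_transition(P_C05378_C00118, label_C05378)
label_C00111 = apply_transition(P_C05378_C00111, label_C05378)
print(label_C00118)  # [0, 1, 0]
print(label_C00111)  # [0, 0, 0]
```

With this labeling the label ends up at atom 2 of C00118 and does not appear in C00111, because each substrate atom is mapped to exactly one product atom.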
The numbering of atom positions in a compound follows the InChI canonicalization scheme [12] and can be computed with the InChI software. However, homotopic atoms cannot be numbered uniquely. This fact becomes important for prochiral compounds. For Acetylcholine and Choline, the substrate and one product of the hydrolysis reaction discussed below, the numbering is shown in Figures 1 and 2.
Fig. 1. Acetylcholine with InChI atom numbering applied.
The distribution of labeled atoms in the products of the substrate beta D-Fructose 1,6-Bisphosphate in reaction R01070 can be calculated as

$$
l_{C00111} = P_{C05378,C00111}\, l_{C05378}, \qquad
l_{C00118} = P_{C05378,C00118}\, l_{C05378}
\tag{2}
$$
where P are the transition matrices and l are the label vectors, which contain zeros for unlabeled positions and otherwise the probability of being labeled. For compounds that do not contain homotopic atoms, the transition matrices result directly from the carbon fate maps. For example, if substrate atom number 5 becomes product atom number 4, the matrix entry at row 4 and column 5 will be 1, since rows refer to product atoms and columns to substrate atoms. However, if there are homotopic atoms in a compound, the compound is checked for prochirality. Only in the absence of prochirality are the corresponding rows or columns of the carbon transition probability matrix P permuted, summed up and divided by n!, where n is the number of homotopic atoms. For example, in Acetylcholine (C01996) the atoms 2, 3 and 4 are homotopic to each other. Thus columns have to be permuted to generate 6 transition matrices. These matrices are then summed up and the resulting matrix is divided by 6 to form the final transition matrix. Acetylcholinesterase (R01026), which hydrolyses Acetylcholine to Acetate (C00033) and Choline (C00114), is used as an example here. To create the matrix P_{C01996,C00114} the algorithm starts with the transition matrix that was created without considering homotopicity:
$$
P_{C01996,C00114} = \begin{pmatrix}
0&1&0&0&0&0&0\\
0&0&1&0&0&0&0\\
0&0&0&1&0&0&0\\
0&0&0&0&1&0&0\\
0&0&0&0&0&1&0
\end{pmatrix}
\tag{3}
$$
Since the nitrogen-bound methyl groups with carbons 2, 3 and 4 of the substrate are homotopic, columns 2, 3 and 4 are permuted according to the scheme (2,3,4), (2,4,3), (3,4,2), (3,2,4), (4,2,3), (4,3,2), which results in the following matrix:
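$$
P_{C01996,C00114} = \begin{pmatrix}
0&\tfrac13&\tfrac13&\tfrac13&0&0&0\\
0&\tfrac13&\tfrac13&\tfrac13&0&0&0\\
0&\tfrac13&\tfrac13&\tfrac13&0&0&0\\
0&0&0&0&1&0&0\\
0&0&0&0&0&1&0
\end{pmatrix}
\tag{4}
$$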
Fig. 2. Choline with InChI atom numbering applied.

If there are homotopic atoms that are not prochiral in the product, the same procedure has to be applied to the corresponding rows of the matrix, except if the homotopic atoms are the same as in the substrate, which is the case here: atoms 1, 2 and 3 of Choline are homotopic. However, if homotopic groups contribute to the prochirality of a compound, the corresponding matrix lines are not permuted. As an example, the second carbon of Acetylcholine is labeled, which results in the label vector l_{C01996} = (0,1,0,0,0,0,0)^T. The labeled atoms in Choline are then calculated as

$$
l_{C00114} = P_{C01996,C00114}\, l_{C01996} =
\begin{pmatrix}
0&\tfrac13&\tfrac13&\tfrac13&0&0&0\\
0&\tfrac13&\tfrac13&\tfrac13&0&0&0\\
0&\tfrac13&\tfrac13&\tfrac13&0&0&0\\
0&0&0&0&1&0&0\\
0&0&0&0&0&1&0
\end{pmatrix}
\begin{pmatrix} 0\\1\\0\\0\\0\\0\\0 \end{pmatrix}
=
\begin{pmatrix} \tfrac13\\ \tfrac13\\ \tfrac13\\ 0\\ 0 \end{pmatrix}
\tag{5}
$$
This example demonstrates how homotopic atoms lead to dissemination of labels in metabolic networks.
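The permutation step for homotopic atoms can be sketched in a few lines. This is an illustration under stated assumptions and not the code of the software described here: the function names are hypothetical, column indices are zero-based (so InChI atoms 2, 3, 4 become columns 1, 2, 3), and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import permutations


def average_over_homotopic_columns(matrix, columns):
    """Sum the matrix over all permutations of the given columns and divide by n!."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    total = [[Fraction(0)] * n_cols for _ in range(n_rows)]
    perms = list(permutations(columns))
    for perm in perms:
        for r in range(n_rows):
            for c in range(n_cols):
                # A homotopic column takes its values from the permuted source column.
                src = perm[columns.index(c)] if c in columns else c
                total[r][c] += matrix[r][src]
    return [[value / len(perms) for value in row] for row in total]


def apply_transition(matrix, label_vector):
    """Return the product label vector P * l."""
    return [sum(p * l for p, l in zip(row, label_vector)) for row in matrix]


# Transition matrix of Eq. (3), built without considering homotopicity.
P_base = [
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
]
P_avg = average_over_homotopic_columns(P_base, [1, 2, 3])

# Second carbon of Acetylcholine labeled, as in the example above.
label_substrate = [0, 1, 0, 0, 0, 0, 0]
label_product = apply_transition(P_avg, label_substrate)
print([str(x) for x in label_product])  # ['1/3', '1/3', '1/3', '0', '0']
```

The averaged matrix has entries of 1/3 in the three homotopic columns of the first three rows, so a label on a single methyl carbon is spread evenly over the three methyl positions of Choline.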
3. Results
We have created a software tool, Metabolic Network Navigator, that assists in finding the appropriate atoms to be labeled in tracer experiments. The user specifies a labeled feed metabolite and lets Metabolic Network Navigator perform in silico tracer experiments. The reactions used for the simulation are selected by choosing a KEGG [7] organism or by manually selecting KEGG reaction IDs. It is recommended to refine the list of reactions manually, because the genome-based organisms that are predefined in Metabolic Network Navigator usually differ considerably from real organisms. We observed that a broad dissemination of the label is predicted when all reactions are regarded as reversible. Therefore we added the possibility to assign directions to reactions manually.

To demonstrate how a labeled carbon atom propagates in a metabolic network, we have chosen glycolysis and the TCA cycle as an example. Reaction directions have been assigned manually. As a feed metabolite, alpha D-Glucose was labeled at the first carbon atom. Fig. 3 shows how this label changes its number within the different molecules along glycolysis until it enters the TCA cycle as the first carbon of Acetyl-CoA. In the first cycle the labeled carbon propagates as a single label through the TCA cycle until it reaches Succinate. Here, the label is partitioned equally because of the homotopic groups within Succinate. The two labels then reach Oxaloacetate, where they enter, together with a new Acetyl-CoA, a new round of the TCA cycle. The labels that originate from the labeled Oxaloacetate of the first round are shown in the lower brackets.
[Figure 3: pathway diagram not reproduced; it traces the labeled carbon from alpha D-Glucose via Pyruvate into the TCA cycle (Citrate, 2-Oxoglutarate, Succinyl-CoA, Succinate, Fumarate, Malate), with the labeled carbon positions given in brackets and CO2 released along the way]
Fig. 3. Labeled Glucose enters glycolysis and the TCA cycle, where its label disseminates.
Metabolic Network Navigator offers the possibility to simulate the tracer distribution for every possible label position within a certain feed compound and to list the total numbers of labeled compounds and iterations in a table (see Fig. 4). For example, if it is desired that the label goes only into the pentose phosphate pathway, the result suggests labeling the third or fourth carbon of alpha D-Glucose, which will become CO2 during glycolysis and will therefore not enter the TCA cycle.
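The idea of probing every feed position can be sketched on a toy network. Everything in the example is hypothetical: the compounds A, B, C and CO2, the atom mappings, and the function name `propagate` are invented for illustration and are not taken from Metabolic Network Navigator or from KEGG. For brevity the sketch tracks only whether an atom can become labeled, not the labeling probability.

```python
# Hypothetical atom maps: (compound, atom) -> list of (compound, atom) it is converted to.
# Toy reactions: A -> B + CO2 and B -> C, each with an assigned direction.
ATOM_MAPS = {
    ("A", 1): [("B", 1)],
    ("A", 2): [("B", 2)],
    ("A", 3): [("CO2", 1)],
    ("B", 1): [("C", 2)],
    ("B", 2): [("C", 1)],
}


def propagate(feed, position, atom_maps):
    """Follow a single labeled atom until no new atoms become labeled.

    Returns the set of labeled compounds and the number of propagation steps."""
    labeled = {(feed, position)}
    frontier = {(feed, position)}
    steps = 0
    while frontier:
        new_atoms = set()
        for atom in frontier:
            for target in atom_maps.get(atom, []):
                if target not in labeled:
                    new_atoms.add(target)
        if not new_atoms:
            break
        labeled |= new_atoms
        frontier = new_atoms
        steps += 1
    return {compound for compound, _ in labeled}, steps


# Probe all feed positions of the toy compound A, similar in spirit to the table in Fig. 4.
for position in (1, 2, 3):
    compounds, steps = propagate("A", position, ATOM_MAPS)
    print(position, sorted(compounds), steps)
```

In this toy case, labeling position 3 of the feed confines the label to CO2, whereas positions 1 and 2 carry it further downstream; this is the kind of comparison the table of labeled compounds and iterations supports.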
Fig. 4. Screenshot of Metabolic Network Navigator: Tracer propagation for different positions of D-Glucose as feed compound. The atoms of the target metabolite Oxaloacetate that are labeled by the tracer are shown in the second column. The third column contains the number of labeled metabolites and the fourth the number of steps which were necessary to complete the simulation.
4. Discussion
At the moment there are two critical factors hampering the application of our method: the first is the lack of reactions in genome-based databases (and thus in the carbon fate maps), the second is the correct assignment of reversibility to reactions. We have found that rendering every reaction reversible leads to an unrealistically broad dissemination of the labeled atoms. Therefore it is desirable for the future to assign a direction to each reaction automatically. The reversibility of reactions depends mainly on ΔG and on the substrate and product concentrations [6]. For most reactions there are estimates of the ΔG values [9]; it is therefore possible to predict the direction when the metabolite concentrations are known. If enough is known about the organism, it is possible to perform a flux balance analysis and derive the reaction directions from it. The reason for missing reactions in genome-based databases is that most metabolic networks contain reactions that are very specific to the purpose for which the network is modelled, i.e. biomass reactions. However, using the methods from [10] it is possible to create carbon fate maps for new reactions. For the future it would be useful to make the creation of carbon fate maps for new reactions easier, e.g. by providing a wizard that assists the user.

References
[1] Arita, M., Metabolic reconstruction using shortest paths, Simulation Pract. Theory, 8:109-125, 2000.
[2] Arita, M., In Silico Atomic Tracing by Substrate-Product Relationships in Escherichia coli Intermediary Metabolism, Genome Res., 13(11):2455-2466, 2003.
[3] Berl, S., Nicklas, W. J., Clarke, D. D., Compartmentation of citric acid cycle metabolism in brain: labeling of glutamate, glutamine, aspartate and gaba by several radioactive tracer metabolites, J. Neurochem., 17(7):1009-1015, 1970.
[4] Buxton, D. B., Schwaiger, M., Nguyen, A., Phelps, M. E., Schelbert, H. R., Radiolabeled acetate as a tracer of myocardial tricarboxylic acid cycle flux, Circ. Res., 63(3):628-634, 1988.
[5] Hattori, M., Okuno, Y., Goto, S., Kanehisa, M., Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways, J. Am. Chem. Soc., 125(39):11853-11865, 2003.
[6] Hoppe, A., Hoffmann, S., Holzhütter, H.-G., Including metabolite concentrations into flux-balance analysis: Thermodynamic realizability as a constraint on flux distributions in metabolic networks, BMC Syst. Biol., 1(1):23, 2007.
[7] Kanehisa, M., Goto, S., KEGG: kyoto encyclopedia of genes and genomes, Nucleic Acids Res., 28(1):27-30, 2000.
[8] Kelleher, J. K., Analysis of tricarboxylic acid cycle using [14C]citrate specific activity ratios, Am. J. Physiol., 248(2 Pt 1):E252-E260, 1985.
[9] Mavrovouniotis, M. L., Estimation of standard Gibbs energy changes of biotransformations, J. Biol. Chem., 266(22):14440-14445, 1991.
[10] Mu, F., Williams, R. F., Unkefer, C. J., Unkefer, P. J., Faeder, J. R., Hlavacek, W. S., Carbon-fate maps for metabolic reactions, Bioinformatics, 23(23):3193-3199, 2007.
[11] Noronha, S. B., Yeh, H. J., Spande, T. F., Shiloach, J., Investigation of the TCA cycle and the glyoxylate shunt in Escherichia coli BL21 and JM109 using (13)C-NMR/MS, Biotechnol. Bioeng., 68(3):316-327, 2000.
[12] http://old.iupac.org/inchi/
[13] http://www.genome.ad.jp/
WEB-LINKS AS A MEANS TO DOCUMENT ANNOTATED SEQUENCE AND 3D-STRUCTURE ALIGNMENTS IN SYSTEMS BIOLOGY
CHRISTOPH GILLE  christoph.gille@charite.de
ANDREAS HOPPE  andreas.hoppe@charite.de
HERMANN-GEORG HOLZHÜTTER  hergo@charite.de
Medical Faculty of the Humboldt University Berlin, Charité Berlin, Institute of Biochemistry, 10117 Berlin, Monbijoustraße 2, Germany

Reconstructed biological networks are the essence of knowledge originating from experiments, scientific literature, databases and modeling. Proteins are the major players in biological networks. If the function of a protein is not yet known, it can often be deduced from homologous proteins that are already experimentally characterized. As such conclusions are not as reliable as experimental evidence, they should be well documented and reviewed when experimental data become available. Inconsistent operation of the resulting network may indicate invalid functional assignments. Here we present a novel technique to refer to annotated sequence and 3D-structure alignments in terms of Web links. By clicking the Web link the alignment is viewed in the protein viewer STRAP. References to public protein databases such as EMBL, KEGG, GENBANK, PDB, PFAM, PRODOM and UNIPROT/SWISSPROT are encoded in the Web link, whereas the alignment gaps are computed dynamically. Site-specific annotations and 3D-rendering commands may also be included in the Web link. In contrast, sequence features such as active site residues, phosphorylation sites and ligand binding sites do not need to be specified, as long as they are retrievable from public databases. The method has been developed for an information management system that is used for the reconstruction of metabolic pathways. The alignment viewer may also be of interest for experimentalists, as it can be used to document sites of interest in the proteins under experimental investigation. These alignment Web links may be included in project Web sites.
Availability: The STRAP program is published under the GNU license and is automatically downloaded from http://3d-alignment.eu/ or http://www.charite.de/bioinf/strap/ when an alignment reference is clicked.
Keywords: sequence alignment; protein structure alignment; prediction; subcellular localization; compartmentalization
1. Introduction
Amino acid sequence and 3D-structure alignment is a widely applied method to identify regions of similarity in proteins that may be a consequence of functional, structural, or evolutionary relationships [17, 23]. An alignment maps sequence positions of one protein to the sequence positions of the other protein. By counting the number of identical positions or measuring the root mean square distance (RMSD), the similarity of two proteins can be quantified. If the amino acid sequences of two proteins are similar, then functional properties such as substrate specificity or even enzyme kinetics [7] may be similar. Sequence and 3D-structure alignments are therefore often taken as a clue to biochemical features when experimental evidence is not yet available for the protein of interest. This is a common approach to assign crude functions to the genes of newly sequenced genomes and is increasingly used in systems biology, since the amount of sequence data is rapidly increasing.

One central goal of computational systems biology is the mathematical modeling of complex reaction networks. The first and most time-consuming step in the development of such models is the stoichiometric reconstruction of the network, i.e. the compilation of all metabolites, reactions and transport processes relevant to the considered network and their assignment to the various cellular compartments [15]. The biochemical function of proteins is often not yet characterized experimentally. A multiple alignment of a protein together with homologs of well characterized enzymatic function allows the deduction that the given protein catalyzes particular reactions. Hypotheses based on sequence and structure comparison require revalidation when more data are available. Therefore these alignments must be well documented.

Alignments can be stored in three different ways with currently available software: (I) as a project file for a certain program package like JALVIEW [4] or PFAAT [3], (II) as a text document with styles, (III) as plain text. With project files the alignment can be loaded and viewed in the respective alignment viewer. These viewers have specialized features for sequence visualization like color shading, pattern highlighting and sequence reordering. Some even support the association of protein structures with sequences.
As alignment projects can usually not be embedded directly into text documents, they are stored as separate files. However, this bears the risk of file loss, and a firm association of the alignment with the referencing text would be desirable. Therefore we developed a method to include annotated sequence and structure alignments directly in any text document. The alignment is represented as a Web link that contains the database references of the aligned proteins. Activating the Web link opens the sequence alignment in the protein viewer STRAP, which provides advanced display options to elucidate functional properties of the proteins. These sequence features and protein properties are updated from the SWISSPROT database or with various prediction services each time the alignment is viewed.

Table 1. 3D-rendering styles for residue selections in Web-links and their translation into commands suitable for 3D-viewers.

3D style   Pymol command           Rasmol command   JMol command
spheres    show spheres,RESIDUES   cpk on           cpk on
dots       show dots,RESIDUES      dots on          dots on
ribbon     show ribbon,RESIDUES    ribbons on       ribbons on
sticks     show sticks,RESIDUES
2. Methods
The program STRAP [9] is an alignment viewer written in JAVA. STRAP is started with the JAVA-Webstart technique. The list of proteins is passed to the Web server using POST or GET. A CGI script on the server generates an XML file from the list of proteins. This XML file is the argument of the program /bin/javaws, which is part of a typical JAVA installation. The protein entries in the Web variable "align" optionally contain additional information: a protein name other than the accession ID, the protein icon image, residue highlightings and 3D-rendering commands (Table 1). These additional data sections (Table 2) are separated by a vertical bar (Fig. 1).

Table 2. Definition of the five sections in protein entries within alignment links. The fields are separated by a vertical bar as shown in Fig. 1. The first field is required, the others are optional.

field   data                                              example
1       Database reference or URL of the protein file     PDB:1jd2
2       Sequence name if differing from accession ID      b1_S_Cerevisiae.pdb
3       Icon png, gif or jpg file                         http://www.server.org/path/file.png
4       Highlighted residues and 3D-rendering styles      #0000FF,10-20,#FF00FF,30-40,dots
5       Reading frame, start and end positions of exons   reverse(100..200,300-400)
Table 3. Web variables for advanced control. Bit-masks are given as hexadecimal numbers. Their bits refer to the proteins. Dialogs are DialogSubcellularLocalization, DialogBlast, DialogDotPlot, DialogPlot.

Web-variable    Explanation
noAL=bitmask    Do not align sequences. E.g. noAL=F0 avoids alignment of sequences 5 to 8.
no3D=bitmask    Do not include proteins in the 3D-view
noSP=bitmask    Do not superimpose 3D-structures
dialog=dialog   Loading the sequences into a dialog
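The bit-mask convention of Table 3 can be illustrated with a short sketch. It assumes that the least significant bit refers to the first protein, which is consistent with the example noAL=F0 for sequences 5 to 8; the function names are hypothetical and not part of STRAP.

```python
def protein_bitmask(protein_numbers):
    """Build the hexadecimal bit-mask for a list of 1-based protein numbers
    (assumption: bit 0 refers to protein 1)."""
    mask = 0
    for number in protein_numbers:
        mask |= 1 << (number - 1)
    return format(mask, "X")


def proteins_in_bitmask(hex_mask):
    """Return the 1-based protein numbers whose bits are set in the mask."""
    mask = int(hex_mask, 16)
    return [bit + 1 for bit in range(mask.bit_length()) if mask >> bit & 1]


print(protein_bitmask([5, 6, 7, 8]))  # F0
print(proteins_in_bitmask("F0"))      # [5, 6, 7, 8]
```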
UNIPROT:O23708|a2_A_Thaliana.swiss|http:// ... 40NOO.jpg
PDB:1jd2_N|b1_S_Cerevisiae.pdb||#CB82BF,19:N-24:N,#0000FF,14-24,spheres
Fig. 1. Creation of the Web link for an alignment of two proteins. Each line refers to one protein and is divided into sections by vertical bars. The example comprises a protein icon and two residue selections. The second selection is shown as spheres in the 3D-structure. The fields are described in Table 2. In the final step to form the Web link, special characters like the vertical bar are encoded by a hexadecimal representation of their ASCII code.
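The final encoding step described in the caption of Fig. 1 can be sketched with the Python standard library. The helper names are hypothetical, and two details are inferred from the alcohol dehydrogenase link quoted in the Results section rather than stated explicitly: protein entries are joined by a space that is encoded as "+", and the colon is left unencoded. The base URL is the one the text gives for the ALIGN: prefix.

```python
from urllib.parse import quote_plus

BASE_URL = "http://3d-alignment.eu/ce.php?align="


def protein_entry(reference, name="", icon="", highlights="", exons=""):
    """Join the five fields of Table 2 with vertical bars, dropping empty trailing fields."""
    fields = [reference, name, icon, highlights, exons]
    while fields and fields[-1] == "":
        fields.pop()
    return "|".join(fields)


def alignment_link(entries):
    """Percent-encode the entries and append them to the base URL.

    Assumption: entries are separated by a space, which quote_plus turns into '+'."""
    return BASE_URL + quote_plus(" ".join(entries), safe=":")


entries = [
    protein_entry("SWISS:adh7_human", highlights="#00FFFF,223"),
    protein_entry("SWISS:adh7_mouse"),
    protein_entry("SWISS:adh2_yeast"),
]
print(alignment_link(entries))
# http://3d-alignment.eu/ce.php?align=SWISS:adh7_human%7C%7C%7C%2300FFFF%2C223+SWISS:adh7_mouse+SWISS:adh2_yeast
```

The vertical bar becomes %7C, the hash sign %23 and the comma %2C, which matches the percent-hexadecimal codes mentioned in the text.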
3. Results

We have developed a novel technique to refer to sequence and 3D-structure alignments by means of Web links that comprise the sequence database identifiers of the aligned proteins (Fig. 1). The alignment is viewed in the program STRAP [9] when the Web link is activated. Instead of storing the complete alignment data, only the protein file accession IDs, together with some optional information, are included in the Web link. As a consequence, the protein files are downloaded from the databases (Table 4) prior to alignment computation and visualization.

Table 4. Sequence and structure databases in alignment Web-links.

Database ID   Database                                   Example
SWISS         SwissProt database                         SWISS:hslv_ecoli
UNIPROT       Uniprot database                           UNIPROT:Q8ICC3
PDB           3D-structure database                      PDB:1jd2
EMBL          EMBL nucleotide sequence database          EMBL:M57965
NCBI-NT       NCBI nucleotide sequence database          NCBI-NT:M57965
NCBI-AA       NCBI amino acid sequence database          NCBI-AA:AAC20075
PFAM          Protein family database                    PFAM:PF00097
PRODOM        Protein domain database                    PRODOM:PD000003
KEGG-AA       Proteins at www.genome.jp                  KEGG-AA:btl:BALH_1771
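How a reference of the form shown in Table 4 separates into a database ID and an accession can be sketched as follows. This is a hypothetical helper, not STRAP's parser. It assumes that the first colon ends the database ID, and the spelling of the two NCBI identifiers follows the example NCBI-AA:AAC20075.

```python
# Database IDs from Table 4; the spelling of the two NCBI entries is an
# assumption taken from the example column.
KNOWN_DATABASES = {
    "SWISS": "SwissProt database",
    "UNIPROT": "Uniprot database",
    "PDB": "3D-structure database",
    "EMBL": "EMBL nucleotide sequence database",
    "NCBI-NT": "NCBI nucleotide sequence database",
    "NCBI-AA": "NCBI amino acid sequence database",
    "PFAM": "Protein family database",
    "PRODOM": "Protein domain database",
    "KEGG-AA": "Proteins at www.genome.jp",
}


def split_reference(reference):
    """Split 'DATABASE:accession' at the first colon and check the database ID."""
    database_id, separator, accession = reference.partition(":")
    if not separator or not accession:
        raise ValueError(f"not of the form DATABASE:accession: {reference!r}")
    if database_id not in KNOWN_DATABASES:
        raise ValueError(f"unknown database ID: {database_id!r}")
    return database_id, accession


print(split_reference("SWISS:hslv_ecoli"))       # ('SWISS', 'hslv_ecoli')
print(split_reference("KEGG-AA:btl:BALH_1771"))  # ('KEGG-AA', 'btl:BALH_1771')
```

Splitting only at the first colon keeps accessions that themselves contain a colon, such as the KEGG example, intact.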
Optional data fields may hold user-defined information listed in Table 3. However, they make the Web link longer and more complicated. Abbreviating the Web link may improve usability, as explained in the following using METANNOGEN [10] as an example. METANNOGEN is an information management system which assists the reconstruction of complex metabolic networks. It allows convenient access to public data resources such as LIGAND [11], EHM [5], the Human cell metabolic network [15] and Brenda [1]. METANNOGEN consists of two parts. The graphical user interface is a Java application and allows browsing and editing of the reactions of the metabolic pathway. The central data storage system is a Perl script on a Web server. By storing each data set in a single flat file on the server, it exploits the capability of file systems to store and retrieve thousands of files in much less than a second. Users can attach free text to reactions of the metabolic network, such as supporting literature references and protein alignments.

Alcohol dehydrogenases and co-substrate specificity: Sequence patterns are often associated with biochemical properties. For example, the specificity of alcohol dehydrogenases for the co-substrate NADH/NAD+ is linked to the presence of an aspartate at sequence position 223 [6, 19]. To determine the co-substrate specificity of related oxidoreductases, an alignment can reveal the presence or absence of Asp 223. The following hyperlink in METANNOGEN opens an alignment of three alcohol dehydrogenases and highlights sequence position 223 in the first sequence. The percent-encoded hexadecimal numbers code for the vertical bar (%7C), the hash sign (%23) and the comma (%2C); the six-digit hexadecimal number 00FFFF specifies the color of Asp 223.
Web-Links as a Means to Document Annotated Sequence
ALIGN:SWISS:adh7_human%7C%7C%7C%2300FFFF%2C223+SWISS:adh7_mouse+SWISS:adh2_yeast
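To make the structure of the hyperlink above visible, the following Python sketch reverses the percent-encoding and splits the reference into proteins and their fields. The function name is hypothetical. The meaning of the individual fields is defined in Table 2 and is not interpreted here.

```python
from urllib.parse import unquote


def decode_alignment_reference(reference, prefix="ALIGN:"):
    """Return one list of vertical-bar separated fields per protein.

    Proteins are separated by '+'. The split is done before decoding so that
    an encoded character can never be mistaken for a separator.
    """
    if not reference.startswith(prefix):
        raise ValueError(f"reference does not start with {prefix!r}")
    body = reference[len(prefix):]
    return [unquote(part).split("|") for part in body.split("+")]


reference = ("ALIGN:SWISS:adh7_human%7C%7C%7C%2300FFFF%2C223"
             "+SWISS:adh7_mouse+SWISS:adh2_yeast")
for fields in decode_alignment_reference(reference):
    print(fields)
# ['SWISS:adh7_human', '', '', '#00FFFF,223']
# ['SWISS:adh7_mouse']
# ['SWISS:adh2_yeast']
```

The decoded first entry shows that the color and the residue position sit in the fourth field, after two empty optional fields.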
The prefix ALIGN: is an abbreviation for http://3d-alignment.eu/ce.php?align= in METANNOGEN, which reduces the length of alignment references in text annotations.

Proving the absence of enzymes: If certain analyses are required to demonstrate structural or biochemical properties, the proteins can be loaded into the respective dialog form. Dialogs are opened with the option dialog= (see Table 3). If, for example, the absence of a protein in the organism under consideration needs to be documented, a BLAST search with the enzyme from another organism may be conducted to show that no similar sequence is found. For convenience, the hyperlink prefix BLAST: has been defined. One advantage of STRAP as a BLAST front-end compared to the BLAST Web forms is that all BLAST results are kept and can be reviewed without delay. As the sequence databases keep growing, this allows novel hits to be distinguished from those that were already present in earlier searches.

Hexokinase and subcellular localization: Often, the information on subcellular localization in the literature is incomplete or misleading. For example, hexokinase may be present in the mitochondrial fraction of cell preparations, even though it does not act in the mitochondrial matrix [24-26]. The expression PUBMED{mitochondria[TI] hexokinase[TI]} is a hyperlink in METANNOGEN and invokes a PUBMED search. Several publications suggest that hexokinase is a mitochondrial enzyme, whereas others point out that hexokinase is attached to the mitochondrial membrane but operates in the cytosolic compartment. With the hyperlink LOC:NCBI-AA:AAC20075 the hexokinase is loaded and its subcellular location is predicted from the amino acid sequence using SHERLoc [12, 20], HUM-MPLoc [21] and PLOC [16]. LOC: is an abbreviation in METANNOGEN for http://3d-alignment.eu/ce.php?dialog=DialogSubcellularLocalization&align= and is used when the cellular compartment of the given proteins is of interest.
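The translation of such abbreviations into full Web links can be sketched in a few lines of Python. Only the two expansions stated in the text (ALIGN: and LOC:) are included. BLAST: and PUBMED{...} are left out because their expansions are not given here. The function name is hypothetical and does not reflect METANNOGEN's actual implementation.

```python
# Prefix expansions exactly as stated in the text.
ABBREVIATIONS = {
    "ALIGN:": "http://3d-alignment.eu/ce.php?align=",
    "LOC:": ("http://3d-alignment.eu/ce.php"
             "?dialog=DialogSubcellularLocalization&align="),
}


def expand_abbreviation(reference):
    """Return the full Web link for an abbreviated reference, or None."""
    for prefix, url in ABBREVIATIONS.items():
        if reference.startswith(prefix):
            return url + reference[len(prefix):]
    return None  # not a recognized abbreviation


print(expand_abbreviation("ALIGN:SWISS:adh7_mouse+SWISS:adh2_yeast"))
# http://3d-alignment.eu/ce.php?align=SWISS:adh7_mouse+SWISS:adh2_yeast
print(expand_abbreviation("LOC:SWISS:hslv_ecoli"))
# http://3d-alignment.eu/ce.php?dialog=DialogSubcellularLocalization&align=SWISS:hslv_ecoli
```

A table of prefixes keeps the mechanism extensible: supporting a further abbreviation would only require one more entry once its expansion is known.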
Convenience abbreviations need to be recognized as hyperlinks and translated into the Web links by the underlying software, in this case METANNOGEN. For information management systems with Web-browser clients such as META-ALL [27] and AMAZE [14], the Firefox add-on Greasemonkey can be used to translate the abbreviations into the respective Web links.

4. Discussion

We propose a method to refer to multiple sequence and 3D-structure alignments by sequence accession IDs rather than by the sequences and their alignment itself. The main advantage of this method is that an alignment can be included as a Web link in any text document. A further advantage is that the most recent version of the so-called sequence features is downloaded from EXPASY [8] and CSA [18], since these data are not included statically in a preformed immutable alignment. They comprise functionally important data like phosphorylation sites, interacting domains and catalytic sites. These databases are continuously curated and extended. Our concept further allows dynamic access to the prediction servers, which are continuously improved.

Our method is implemented in the multiple alignment program STRAP which, as a stand-alone Java application, is more stable than a possible alternative implementation as a browser applet, which would be more likely to be affected by memory leaks and instability induced by the browser-applet interface. Because of its robustness, STRAP is a recommended protein structure viewer in the public Web resources CE [22], SUPERIMPOSE [2] and PDBSUM [13]. The main disadvantage of using STRAP is the initial delay when an alignment is viewed for the first time: the Java interpreter and its libraries must be loaded, the program and the protein files must be retrieved, and the alignment must be computed. However, there is no delay when the alignments are viewed subsequently because all downloaded files and all computed results are kept for further use by the program. Through caching and multi-threading, the delay is reduced to what is absolutely necessary.

In our opinion, the new concept of encoding multiple alignments in Web links is a very valuable improvement for state-of-the-art computational systems biology when referring to annotated multiple alignments. It is superior to a static annotated alignment document (PDF, PostScript or HTML), which does not benefit from the ongoing correction process of many of the public information databases and prediction services.
It is also superior to saved project files of alignment programs like JALVIEW and PFAAT because these files are likely to become invalid if the alignment is computed with an updated alignment algorithm (Blast, FASTA), if the sequence data are corrected, and probably also after major updates of the software that writes and reads these files. It may also be beneficial for experimentalists who visualize site-specific information of the protein family under experimental investigation. They may include alignment links in their protocols or project Web pages. Since experiments often take several months or years, occasional revision of sequence features and BLAST searches against sequence and structure databases are advisable. Our concept reduces the effort for this updating process to the absolute minimum (correction is necessary only if the annotated positions are directly affected). A further potential application of the described concept lies in Web services dealing with sequences and alignments, which may exploit the program as a viewer for annotated sequence alignments.

References

[1] Barthelmes, J., Ebeling, C., Chang, A., Schomburg, I., and Schomburg, D., Brenda, amenda and frenda: the enzyme information system in 2007. Nucleic Acids Res., 35:511-514, 2007.
[2] Bauer, R. A., Bourne, P. E., Formella, A., Frommel, C., Gille, C., Goede, A., Guerler, A., Hoppe, A., Knapp, E. W., Poschel, T., Wittig, B., Ziegler, V., and Preissner, R., Superimpose: a 3D structural superposition server. Nucleic Acids Res., 36(Web Server issue):W47-54, 2008.
[3] Caffrey, D. R., Dana, P. H., Mathur, V., Ocano, M., Hong, E. J., Wang, Y. E., Somaroo, S., Caffrey, B. E., Potluri, S., and Huang, E. S., PFAAT version 2.0: a tool for editing, annotating, and analyzing multiple sequence alignments. BMC Bioinformatics, 8:381, 2007.
[4] Clamp, M., Cuff, J., Searle, S. M., and Barton, G. J., The jalview java alignment editor. Bioinformatics, 20:426-427, 2004.
[5] Duarte, N. C., Becker, S. A., Jamshidi, N., Thiele, I., Mo, M. L., Vo, T. D., Srivas, R., and Palsson, B. O., Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc. Natl. Acad. Sci. USA, 104:1777-1782, 2007.
[6] Fan, F., Lorenzen, J. A., and Plapp, B. V., An aspartate residue in yeast alcohol dehydrogenase I determines the specificity for coenzyme. Biochemistry, 30:6397-6401, 1991.
[7] Gabdoulline, R. R., Stein, M., and Wade, R. C., qpipsa: relating enzymatic kinetic parameters and interaction fields. BMC Bioinformatics, 8:373, 2007.
[8] Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R. D., and Bairoch, A., Expasy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., 31:3784-3788, 2003.
[9] Gille, C. and Frommel, C., Strap: editor for structural alignments of proteins. Bioinformatics, 17:377-378, 2001.
[10] Gille, C., Hoffmann, S., and Holzhutter, H. G., Metannogen: compiling features of biochemical reactions needed for the reconstruction of metabolic networks. BMC Syst Biol, 1:5, 2007.
[11] Goto, S., Okuno, Y., Hattori, M., Nishioka, T., and Kanehisa, M., Ligand: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res., 30:402-404, 2002.
[12] Hoglund, A., Blum, T., Brady, S., Donnes, P., Miguel, J. S., Rocheford, M., Kohlbacher, O., and Shatkay, H., Significantly improved prediction of subcellular localization by integrating text and protein sequence data. Pac Symp Biocomput, 11:16-27, 2006.
[13] Laskowski, R. A., Hutchinson, E. G., Michie, A. D., Wallace, A. C., Jones, M. L., and Thornton, J. M., PDBsum: a web-based database of summaries and analyses of all PDB structures. Trends Biochem Sci, 22:488-490, 1997.
[14] Lemer, C., Antezana, E., Couche, F., Fays, F., Santolaria, X., Janky, R., Deville, Y., Richelle, J., and Wodak, S. J., The aMAZE LightBench: a web interface to a relational database of cellular processes. Nucleic Acids Res., 32:443-448, 2004.
[15] Ma, H., Sorokin, A., Mazein, A., Selkov, A., Selkov, E., Demin, O., and Goryanin, I., The Edinburgh human metabolic network reconstruction and its functional analysis. Mol Syst Biol, 3:135, 2007.
[16] Park, K. J. and Kanehisa, M., Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19:1656-1663, 2003.
[17] Pei, J., Multiple protein sequence alignment. Curr Opin Struct Biol, 18:382-386, 2008.
[18] Porter, C. T., Bartlett, G. J., and Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res., 32:129-133, 2004.
[19] Rosell, A., Valencia, E., Ochoa, W. F., Fita, I., Pares, X., and Farres, J., Complete reversal of coenzyme specificity by concerted mutation of three consecutive residues in alcohol dehydrogenase. J. Biol. Chem., 278:40573-40580, 2003.
[20] Shatkay, H., Hoglund, A., Brady, S., Blum, T., Donnes, P., and Kohlbacher, O., SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics, 23:1410-1417, 2007.
[21] Shen, H. B. and Chou, K. C., Hum-mPLoc: an ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochem Biophys Res Commun, 355:1006-1011, 2007.
[22] Shindyalov, I. N. and Bourne, P. E., Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11:739-747, 1998.
[23] Simossis, V., Kleinjung, J., and Heringa, J., An overview of multiple sequence alignment. Curr Protoc Bioinformatics, Chapter 3:Unit 3.7, 2003.
[24] Southard, J. H. and Hultin, H. O., On latent hexokinase activity in skeletal muscle mitochondria. FEBS Lett., 19:349-351, 1972.
[25] Sydow, G., The hexokinase activity in the mitochondria of the precancerous rat liver and transplanted diethylnitrosamine hepatoma. Acta Biol Med Ger, 13:97-98, 1964.
[26] Viitanen, P. V., Geiger, P. J., Erickson-Viitanen, S., and Bessman, S. P., Evidence for functional hexokinase compartmentation in rat skeletal muscle mitochondria. J. Biol. Chem., 259:9679-9686, 1984.
[27] Weise, S., Grosse, I., Klukas, C., Koschutzki, D., Scholz, V., Schreiber, F., and Junker, B. H., Meta-All: a system for managing metabolic pathway information. BMC Bioinformatics, 7:465, 2006.
AUTHOR INDEX

Ahmed, J., 243
Basler, G., 135
Bauer, R. A., 183
Baumgrass, R., 222
Benary, M., 222
Bendfeldt, H., 222
Bruck, J., 1
Bujnicki, J. M., 183
Chiu, H., 171
Ebenhoh, O., 91, 112, 135
Falcke, M., 15
Flottmann, M., 52
Fujita, A., 37, 212
Gille, C., 270, 277
Goto, S., 149, 252
Gruening, B., 231
Guerler, A., 260
Hancock, T., 102
Handorf, T., 135
Hatanaka, Y., 212
Hattori, M., 149
Heopfner, M., 243
Herzel, H., 222
Hohmann, S., 77
Holzhütter, H.-G., 270, 277
Hoops, S., 52
Hoppe, A., 277
Hossbach, J., 231
Imoto, S., 37, 212
Itoh, M., 252
Jaeger, I. S., 231
Jeong, E., 25
Kanehisa, M., 149, 252
Kinoshita, K., 212
Klipp, E., 1, 52, 77
Knapp, E.-W., 112, 260
Kojima, K., 37
Krull, F., 260
Kruse, K., 91
Kuhn, C., 77
Liebermeister, W., 1
Lorenzen, S., 260
Mamitsuka, H., 64, 102
Margulies, E. H., 199
Mendes, P., 52
Menküc, B. S., 270
Michalsky, E., 243
Miyano, S., 25, 37, 212
Nagasaki, M., 25, 212
Nakai, K., 212
Nikoloski, Z., 135
Nordlander, B., 77
Numata, J., 112
Numata, K., 212
Obayashi, T., 212
Okuda, S., 252
Parker, S. C. J., 199
Petelenz, E., 77
Preissner, R., 183, 231, 243
Riehl, W. J., 159
Rother, K., 183
Schaber, J., 52, 77
Schmidt, V., 231, 243
Segre, D., 123, 159, 171
Shimamura, T., 37, 212
Shimizu, Y., 149
Skupin, A., 15
Snitkin, E. S., 123
Struck, S., 231
Takarabe, M., 252
Tamada, Y., 212
Tokimatsu, T., 252
Tullius, T. D., 199
Wan, R., 64
Wheelock, A. M., 64
Yamaguchi, R., 212