METHODS IN ENZYMOLOGY

Editors-in-Chief
JOHN N. ABELSON AND MELVIN I. SIMON
Division of Biology
California Institute of Technology
Pasadena, California

Founding Editors
SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
Academic Press is an imprint of Elsevier
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
32 Jamestown Road, London NW1 7BY, UK

First edition 2011

Copyright © 2011, Elsevier Inc. All Rights Reserved.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: permissions@elsevier.com. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.

Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

For information on all Academic Press publications visit our website at elsevierdirect.com

ISBN: 978-0-12-381270-4
ISSN: 0076-6879

Printed and bound in United States of America
11 12 13 14  10 9 8 7 6 5 4 3 2 1
CONTRIBUTORS
Wassim Abou-Jaoudé
INRIA Sophia-Antipolis, Sophia-Antipolis, France

Kurt S. Anderson
Computational Dynamics Lab, Mechanical, Nuclear and Aerospace Engineering Department, Rensselaer Polytechnic Institute, Troy, New York, USA

Jordan Ang
Department of Chemical and Physical Sciences, and Institute for Optical Sciences, University of Toronto Mississauga, Mississauga, Ontario, Canada

F. Angulo-Brown
Departamento de Física, Escuela Superior de Física y Matemáticas, Instituto Politécnico Nacional, México D.F., México

David Baker
Department of Biochemistry, University of Washington, Seattle, Washington, USA

Yih-En Andrew Ban
Arzeda Corporation, Seattle, Washington, USA

John F. Beausang
Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, Pennsylvania, USA

Monica Berrondo
Rosetta Design Group, Fairfax, Virginia, USA

Kishor D. Bhalerao
Department of Mechanical Engineering, The University of Melbourne, Victoria, Australia

Mustafa Burak Boz
School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia, USA

Philip Bradley
Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
Patrik R. Callis
Department of Chemistry and Biochemistry, Montana State University, Bozeman, Montana, USA

Gregory S. Chirikjian
Department of Mechanical Engineering, Johns Hopkins University, Baltimore, Maryland, USA

Carson C. Chow
Laboratory of Biological Modeling, NIDDK/CEB, National Institutes of Health, Bethesda, Maryland, USA

Seth Cooper
Department of Computer Science, University of Washington, Seattle, Washington, USA

Jacob E. Corn
Department of Biochemistry, University of Washington, Seattle, Washington, USA

Michelle N. Costa
Pacific Northwest National Laboratory, Computational Biology and Bioinformatics Group, Richland, Washington, USA

Rhiju Das
Stanford University, Stanford, California, USA

Ian W. Davis
GrassRoots Biotechnology, Durham, North Carolina, USA

Batsal Devkota
Research Collaboratory for Structural Biology, Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey, USA

Edward J. Dougherty
Steroid Hormones Section, NIDDK/CEB, National Institutes of Health, Bethesda, Maryland, USA

Hassan M. Fathallah-Shaykh
Department of Mathematics, Department of Neurology, and Department of Cell Biology, The University of Alabama at Birmingham; The UAB Comprehensive Neuroscience and Cancer Centers, Birmingham, Alabama, USA

Sarel J. Fleishman
Department of Biochemistry, University of Washington, Seattle, Washington, USA

Samuel C. Flores
Simbios Center, Bioengineering Department, Stanford University, Clark Center S231, Stanford, California, USA
Yale E. Goldman
Pennsylvania Muscle Institute, and Department of Physiology, University of Pennsylvania, Philadelphia, Pennsylvania, USA

Didier Gonze
Laboratoire de Bioinformatique des Génomes et des Réseaux, Université Libre de Bruxelles, Bruxelles, Belgium

Jeffrey J. Gray
Chemical & Biomolecular Engineering and the Program in Molecular Biophysics, Johns Hopkins University, Baltimore, Maryland, USA

Maria Luisa Guerriero
Centre for Systems Biology at Edinburgh, University of Edinburgh, Edinburgh, United Kingdom

Shailendra K. Gupta
Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany, and Indian Institute of Toxicology Research (CSIR), Lucknow, India

L. Guzmán-Vargas
Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas, Instituto Politécnico Nacional, México D.F., México

José Halloy
Service d'Ecologie Sociale, Université Libre de Bruxelles, Bruxelles, Belgium

Stephen C. Harvey
School of Biology, and School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia, USA

James J. Havranek
Washington University, St. Louis, Missouri, USA

John K. Heath
School of Biosciences and Centre for Systems Biology, University of Birmingham, Edgbaston, Birmingham, United Kingdom

R. Hernández-Pérez
SATMEX, Av. de las Telecomunicaciones S/N CONTEL Edif. SGA-II, México D.F., México

Brian Ingalls
Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, Canada

Ron Jacak
Department of Biochemistry, University of North Carolina, Chapel Hill, North Carolina, USA
John Karanicolas
Center for Bioinformatics, and Department of Molecular Biosciences, University of Kansas, Lawrence, Kansas, USA

Kristian Kaufman
Department of Biochemistry, Vanderbilt University, Nashville, Tennessee, USA

Caner Kazanci
Department of Mathematics/Faculty of Engineering, University of Georgia, Athens, Georgia, USA

David E. Kim
Department of Biochemistry, University of Washington, Seattle, Washington, USA

Tanja Kortemme
University of California, San Francisco, California, USA

Brian Kuhlman
Department of Biochemistry, University of North Carolina, Chapel Hill, North Carolina, USA

Don Kulasiri
Department of Molecular Biosciences, Centre for Advanced Computational Solutions (C-fACS), Lincoln University, Lincoln, Christchurch, New Zealand

Alain Laederach
Department of Biomedical Sciences, University at Albany, and Developmental Genetics and Bioinformatics, Wadsworth Center, Albany, New York, USA

Oliver F. Lange
Department of Biochemistry, University of Washington, Seattle, Washington, USA

Andrew Leaver-Fay
Department of Biochemistry, University of North Carolina, Chapel Hill, North Carolina, USA

Daniel Levine
Department of Biochemistry, Robert Wood Johnson Medical School, UMDNJ, and The Center for Advanced Biotechnology and Medicine, Piscataway, New Jersey, USA

Steven M. Lewis
Department of Biochemistry, University of North Carolina, Chapel Hill, North Carolina, USA

Feng Li
Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland, USA
Sergey Lyskov
Chemical & Biomolecular Engineering and the Program in Molecular Biophysics, Johns Hopkins University, Baltimore, Maryland, USA

Daniel J. Mandell
University of California, San Francisco, California, USA

Alberto Marin-Sanguino
Department of Membrane Biochemistry, Max Planck Institute of Biochemistry, Martinsried, Germany

David McMillen
Department of Chemical and Physical Sciences, and Institute for Optical Sciences, University of Toronto Mississauga, Mississauga, Ontario, Canada

Jens Meiler
Department of Biochemistry, Vanderbilt University, Nashville, Tennessee, USA

Stuart Mentzer
Objexx Engineering, Boston, Massachusetts, USA

Radhakrishnan Nagarajan
Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA

Vikas Nanda
Department of Biochemistry, Robert Wood Johnson Medical School, UMDNJ, and The Center for Advanced Biotechnology and Medicine, Piscataway, New Jersey, USA

Philip C. Nelson
Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, Pennsylvania, USA

Karen M. Ong
Laboratory of Biological Modeling, NIDDK/CEB, National Institutes of Health, Bethesda, Maryland, USA

Djomangan Adama Ouattara
INERIS, Parc Technologique Alata, Verneuil-en-Halatte, France, and UMR-CNRS 6600, Université de Technologie de Compiègne, France

Anton S. Petrov
School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA

Zoran Popović
GrassRoots Biotechnology, Durham, North Carolina, USA
Mohammad Poursina
Computational Dynamics Lab, Mechanical, Nuclear and Aerospace Engineering Department, Rensselaer Polytechnic Institute, Troy, New York, USA

P. Douglas Renfrew
Center for Genomics and Systems Biology, New York University, New York, USA

Haluk Resat
Pacific Northwest National Laboratory, Computational Biology and Bioinformatics Group, Richland, Washington, USA

I. Reyes-Ramírez
Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas, Instituto Politécnico Nacional, México D.F., México

Joshua S. Richman
Department of Medicine, Division of Preventive Medicine, University of Alabama School of Medicine, Birmingham, Alabama, USA

Florian Richter
Department of Biochemistry, University of Washington, Seattle, Washington, USA

M. Santillán
Centro de Investigación y Estudios Avanzados del IPN, Unidad Monterrey, Parque de Investigación e Innovación Tecnológica, Apodaca, NL, México, and Centre for Applied Mathematics in Bioscience and Medicine, 3655 Promenade Sir William Osler, McIntyre Medical Building, Montreal, Canada

Elizabeth Y. Scribner
Department of Mathematics, The University of Alabama at Birmingham, Birmingham, Alabama, USA

Françoise Seillier-Moiseiwitsch
Infectious Disease Clinical Research Program, Department of Preventive Medicine and Biometrics, Uniformed Services University of the Health Sciences, Bethesda, Maryland, USA

Harish Shankaran
Pacific Northwest National Laboratory, Computational Biology and Bioinformatics Group, Richland, Washington, USA

Will Sheffler
Department of Biochemistry, University of Washington, Seattle, Washington, USA
S. Stoney Simons Jr.
Steroid Hormones Section, NIDDK/CEB, National Institutes of Health, Bethesda, Maryland, USA

Colin A. Smith
University of California, San Francisco, California, USA

James Thompson
Department of Biochemistry, University of Washington, Seattle, Washington, USA

Adrien Treuille
Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Michael Tyka
Department of Biochemistry, University of Washington, Seattle, Washington, USA

Meenakshi Upreti
Division of Radiation Oncology, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA

Julio Vera
Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany

Eberhard O. Voit
Integrative BioSystems Institute, and The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA

Fei Xu
Department of Biochemistry, Robert Wood Johnson Medical School, UMDNJ, and The Center for Advanced Biotechnology and Medicine, Piscataway, New Jersey, USA

Necmettin Yildirim
Division of Natural Sciences, New College of Florida, Sarasota, Florida, USA

Sohail Zahid
Department of Biochemistry, Robert Wood Johnson Medical School, UMDNJ, and The Center for Advanced Biotechnology and Medicine, Piscataway, New Jersey, USA

E. S. Zeron
Centro de Investigación y de Estudios Avanzados del IPN, Departamento de Matemáticas, Av. Instituto Politécnico Nacional 2508, México DF, México
PREFACE
The general process of conducting scientific research involves comparing experimental results with hypothesis-driven predictions of those results. Traditionally, such predictions were commonly based on "back of the envelope" calculations and "hand waving" arguments. Fortunately, these traditional methods have been replaced by more rigorous mathematical and computational approaches. The use of computers and computational methods has become ubiquitous in biological and biomedical research. This has been driven by numerous factors: the emphasis placed on computers and computational methods within the National Institutes of Health (NIH) Roadmap; the increased level of mathematical and computational sophistication among researchers, particularly among junior scientists, students, journal reviewers, and Study Section members; and rapid advances in computer hardware and software that make these methods far more accessible to the rank-and-file research community.

A common perception is that the only applications of computers and computational methods in biological and biomedical research are basic statistical analysis and the searching of DNA sequence databases. While these are obviously important applications, they only scratch the surface of the current and potential applications of computational methods in biomedical research. The chapters within this volume cover a wide variety of applications that extend well beyond this limited perception.

The training of most senior M.D.s and Ph.D.s in clinical or basic disciplines at academic research and medical centers commonly does not include advanced coursework in mathematics, numerical analysis, statistics, or computer science. Thus, the chapters within this volume, and our previous volumes, have been written to be accessible to this target audience.

MICHAEL L. JOHNSON AND LUDWIG BRAND
METHODS IN ENZYMOLOGY
VOLUME I. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

VOLUME II. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

VOLUME III. Preparation and Assay of Substrates
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

VOLUME IV. Special Techniques for the Enzymologist
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

VOLUME V. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

VOLUME VI. Preparation and Assay of Enzymes (Continued), Preparation and Assay of Substrates, Special Techniques
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

VOLUME VII. Cumulative Subject Index
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

VOLUME VIII. Complex Carbohydrates
Edited by ELIZABETH F. NEUFELD AND VICTOR GINSBURG

VOLUME IX. Carbohydrate Metabolism
Edited by WILLIS A. WOOD

VOLUME X. Oxidation and Phosphorylation
Edited by RONALD W. ESTABROOK AND MAYNARD E. PULLMAN

VOLUME XI. Enzyme Structure
Edited by C. H. W. HIRS

VOLUME XII. Nucleic Acids (Parts A and B)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE

VOLUME XIII. Citric Acid Cycle
Edited by J. M. LOWENSTEIN

VOLUME XIV. Lipids
Edited by J. M. LOWENSTEIN

VOLUME XV. Steroids and Terpenoids
Edited by RAYMOND B. CLAYTON

VOLUME XVI. Fast Reactions
Edited by KENNETH KUSTIN

VOLUME XVII. Metabolism of Amino Acids and Amines (Parts A and B)
Edited by HERBERT TABOR AND CELIA WHITE TABOR

VOLUME XVIII. Vitamins and Coenzymes (Parts A, B, and C)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT

VOLUME XIX. Proteolytic Enzymes
Edited by GERTRUDE E. PERLMANN AND LASZLO LORAND

VOLUME XX. Nucleic Acids and Protein Synthesis (Part C)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN

VOLUME XXI. Nucleic Acids (Part D)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE

VOLUME XXII. Enzyme Purification and Related Techniques
Edited by WILLIAM B. JAKOBY

VOLUME XXIII. Photosynthesis (Part A)
Edited by ANTHONY SAN PIETRO

VOLUME XXIV. Photosynthesis and Nitrogen Fixation (Part B)
Edited by ANTHONY SAN PIETRO

VOLUME XXV. Enzyme Structure (Part B)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME XXVI. Enzyme Structure (Part C)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME XXVII. Enzyme Structure (Part D)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME XXVIII. Complex Carbohydrates (Part B)
Edited by VICTOR GINSBURG

VOLUME XXIX. Nucleic Acids and Protein Synthesis (Part E)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE

VOLUME XXX. Nucleic Acids and Protein Synthesis (Part F)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN

VOLUME XXXI. Biomembranes (Part A)
Edited by SIDNEY FLEISCHER AND LESTER PACKER

VOLUME XXXII. Biomembranes (Part B)
Edited by SIDNEY FLEISCHER AND LESTER PACKER

VOLUME XXXIII. Cumulative Subject Index Volumes I-XXX
Edited by MARTHA G. DENNIS AND EDWARD A. DENNIS

VOLUME XXXIV. Affinity Techniques (Enzyme Purification: Part B)
Edited by WILLIAM B. JAKOBY AND MEIR WILCHEK
VOLUME XXXV. Lipids (Part B)
Edited by JOHN M. LOWENSTEIN

VOLUME XXXVI. Hormone Action (Part A: Steroid Hormones)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN

VOLUME XXXVII. Hormone Action (Part B: Peptide Hormones)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN

VOLUME XXXVIII. Hormone Action (Part C: Cyclic Nucleotides)
Edited by JOEL G. HARDMAN AND BERT W. O'MALLEY

VOLUME XXXIX. Hormone Action (Part D: Isolated Cells, Tissues, and Organ Systems)
Edited by JOEL G. HARDMAN AND BERT W. O'MALLEY

VOLUME XL. Hormone Action (Part E: Nuclear Structure and Function)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN

VOLUME XLI. Carbohydrate Metabolism (Part B)
Edited by W. A. WOOD

VOLUME XLII. Carbohydrate Metabolism (Part C)
Edited by W. A. WOOD

VOLUME XLIII. Antibiotics
Edited by JOHN H. HASH

VOLUME XLIV. Immobilized Enzymes
Edited by KLAUS MOSBACH

VOLUME XLV. Proteolytic Enzymes (Part B)
Edited by LASZLO LORAND

VOLUME XLVI. Affinity Labeling
Edited by WILLIAM B. JAKOBY AND MEIR WILCHEK

VOLUME XLVII. Enzyme Structure (Part E)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME XLVIII. Enzyme Structure (Part F)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME XLIX. Enzyme Structure (Part G)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME L. Complex Carbohydrates (Part C)
Edited by VICTOR GINSBURG

VOLUME LI. Purine and Pyrimidine Nucleotide Metabolism
Edited by PATRICIA A. HOFFEE AND MARY ELLEN JONES

VOLUME LII. Biomembranes (Part C: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER

VOLUME LIII. Biomembranes (Part D: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER

VOLUME LIV. Biomembranes (Part E: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER

VOLUME LV. Biomembranes (Part F: Bioenergetics)
Edited by SIDNEY FLEISCHER AND LESTER PACKER

VOLUME LVI. Biomembranes (Part G: Bioenergetics)
Edited by SIDNEY FLEISCHER AND LESTER PACKER

VOLUME LVII. Bioluminescence and Chemiluminescence
Edited by MARLENE A. DELUCA

VOLUME LVIII. Cell Culture
Edited by WILLIAM B. JAKOBY AND IRA PASTAN

VOLUME LIX. Nucleic Acids and Protein Synthesis (Part G)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN

VOLUME LX. Nucleic Acids and Protein Synthesis (Part H)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN

VOLUME 61. Enzyme Structure (Part H)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME 62. Vitamins and Coenzymes (Part D)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT

VOLUME 63. Enzyme Kinetics and Mechanism (Part A: Initial Rate and Inhibitor Methods)
Edited by DANIEL L. PURICH

VOLUME 64. Enzyme Kinetics and Mechanism (Part B: Isotopic Probes and Complex Enzyme Systems)
Edited by DANIEL L. PURICH

VOLUME 65. Nucleic Acids (Part I)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE

VOLUME 66. Vitamins and Coenzymes (Part E)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT

VOLUME 67. Vitamins and Coenzymes (Part F)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT

VOLUME 68. Recombinant DNA
Edited by RAY WU

VOLUME 69. Photosynthesis and Nitrogen Fixation (Part C)
Edited by ANTHONY SAN PIETRO

VOLUME 70. Immunochemical Techniques (Part A)
Edited by HELEN VAN VUNAKIS AND JOHN J. LANGONE
VOLUME 71. Lipids (Part C)
Edited by JOHN M. LOWENSTEIN

VOLUME 72. Lipids (Part D)
Edited by JOHN M. LOWENSTEIN

VOLUME 73. Immunochemical Techniques (Part B)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS

VOLUME 74. Immunochemical Techniques (Part C)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS

VOLUME 75. Cumulative Subject Index Volumes XXXI, XXXII, XXXIV–LX
Edited by EDWARD A. DENNIS AND MARTHA G. DENNIS

VOLUME 76. Hemoglobins
Edited by ERALDO ANTONINI, LUIGI ROSSI-BERNARDI, AND EMILIA CHIANCONE

VOLUME 77. Detoxication and Drug Metabolism
Edited by WILLIAM B. JAKOBY

VOLUME 78. Interferons (Part A)
Edited by SIDNEY PESTKA

VOLUME 79. Interferons (Part B)
Edited by SIDNEY PESTKA

VOLUME 80. Proteolytic Enzymes (Part C)
Edited by LASZLO LORAND

VOLUME 81. Biomembranes (Part H: Visual Pigments and Purple Membranes, I)
Edited by LESTER PACKER

VOLUME 82. Structural and Contractile Proteins (Part A: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM AND DIXIE W. FREDERIKSEN

VOLUME 83. Complex Carbohydrates (Part D)
Edited by VICTOR GINSBURG

VOLUME 84. Immunochemical Techniques (Part D: Selected Immunoassays)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS

VOLUME 85. Structural and Contractile Proteins (Part B: The Contractile Apparatus and the Cytoskeleton)
Edited by DIXIE W. FREDERIKSEN AND LEON W. CUNNINGHAM

VOLUME 86. Prostaglandins and Arachidonate Metabolites
Edited by WILLIAM E. M. LANDS AND WILLIAM L. SMITH

VOLUME 87. Enzyme Kinetics and Mechanism (Part C: Intermediates, Stereochemistry, and Rate Studies)
Edited by DANIEL L. PURICH

VOLUME 88. Biomembranes (Part I: Visual Pigments and Purple Membranes, II)
Edited by LESTER PACKER
VOLUME 89. Carbohydrate Metabolism (Part D)
Edited by WILLIS A. WOOD

VOLUME 90. Carbohydrate Metabolism (Part E)
Edited by WILLIS A. WOOD

VOLUME 91. Enzyme Structure (Part I)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME 92. Immunochemical Techniques (Part E: Monoclonal Antibodies and General Immunoassay Methods)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS

VOLUME 93. Immunochemical Techniques (Part F: Conventional Antibodies, Fc Receptors, and Cytotoxicity)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS

VOLUME 94. Polyamines
Edited by HERBERT TABOR AND CELIA WHITE TABOR

VOLUME 95. Cumulative Subject Index Volumes 61–74, 76–80
Edited by EDWARD A. DENNIS AND MARTHA G. DENNIS

VOLUME 96. Biomembranes [Part J: Membrane Biogenesis: Assembly and Targeting (General Methods; Eukaryotes)]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 97. Biomembranes [Part K: Membrane Biogenesis: Assembly and Targeting (Prokaryotes, Mitochondria, and Chloroplasts)]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 98. Biomembranes (Part L: Membrane Biogenesis: Processing and Recycling)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 99. Hormone Action (Part F: Protein Kinases)
Edited by JACKIE D. CORBIN AND JOEL G. HARDMAN

VOLUME 100. Recombinant DNA (Part B)
Edited by RAY WU, LAWRENCE GROSSMAN, AND KIVIE MOLDAVE

VOLUME 101. Recombinant DNA (Part C)
Edited by RAY WU, LAWRENCE GROSSMAN, AND KIVIE MOLDAVE

VOLUME 102. Hormone Action (Part G: Calmodulin and Calcium-Binding Proteins)
Edited by ANTHONY R. MEANS AND BERT W. O'MALLEY

VOLUME 103. Hormone Action (Part H: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN

VOLUME 104. Enzyme Purification and Related Techniques (Part C)
Edited by WILLIAM B. JAKOBY
VOLUME 105. Oxygen Radicals in Biological Systems
Edited by LESTER PACKER

VOLUME 106. Posttranslational Modifications (Part A)
Edited by FINN WOLD AND KIVIE MOLDAVE

VOLUME 107. Posttranslational Modifications (Part B)
Edited by FINN WOLD AND KIVIE MOLDAVE

VOLUME 108. Immunochemical Techniques (Part G: Separation and Characterization of Lymphoid Cells)
Edited by GIOVANNI DI SABATO, JOHN J. LANGONE, AND HELEN VAN VUNAKIS

VOLUME 109. Hormone Action (Part I: Peptide Hormones)
Edited by LUTZ BIRNBAUMER AND BERT W. O'MALLEY

VOLUME 110. Steroids and Isoprenoids (Part A)
Edited by JOHN H. LAW AND HANS C. RILLING

VOLUME 111. Steroids and Isoprenoids (Part B)
Edited by JOHN H. LAW AND HANS C. RILLING

VOLUME 112. Drug and Enzyme Targeting (Part A)
Edited by KENNETH J. WIDDER AND RALPH GREEN

VOLUME 113. Glutamate, Glutamine, Glutathione, and Related Compounds
Edited by ALTON MEISTER

VOLUME 114. Diffraction Methods for Biological Macromolecules (Part A)
Edited by HAROLD W. WYCKOFF, C. H. W. HIRS, AND SERGE N. TIMASHEFF

VOLUME 115. Diffraction Methods for Biological Macromolecules (Part B)
Edited by HAROLD W. WYCKOFF, C. H. W. HIRS, AND SERGE N. TIMASHEFF

VOLUME 116. Immunochemical Techniques (Part H: Effectors and Mediators of Lymphoid Cell Functions)
Edited by GIOVANNI DI SABATO, JOHN J. LANGONE, AND HELEN VAN VUNAKIS

VOLUME 117. Enzyme Structure (Part J)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME 118. Plant Molecular Biology
Edited by ARTHUR WEISSBACH AND HERBERT WEISSBACH

VOLUME 119. Interferons (Part C)
Edited by SIDNEY PESTKA

VOLUME 120. Cumulative Subject Index Volumes 81–94, 96–101

VOLUME 121. Immunochemical Techniques (Part I: Hybridoma Technology and Monoclonal Antibodies)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS

VOLUME 122. Vitamins and Coenzymes (Part G)
Edited by FRANK CHYTIL AND DONALD B. MCCORMICK
VOLUME 123. Vitamins and Coenzymes (Part H)
Edited by FRANK CHYTIL AND DONALD B. MCCORMICK

VOLUME 124. Hormone Action (Part J: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN

VOLUME 125. Biomembranes (Part M: Transport in Bacteria, Mitochondria, and Chloroplasts: General Approaches and Transport Systems)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 126. Biomembranes (Part N: Transport in Bacteria, Mitochondria, and Chloroplasts: Protonmotive Force)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 127. Biomembranes (Part O: Protons and Water: Structure and Translocation)
Edited by LESTER PACKER

VOLUME 128. Plasma Lipoproteins (Part A: Preparation, Structure, and Molecular Biology)
Edited by JERE P. SEGREST AND JOHN J. ALBERS

VOLUME 129. Plasma Lipoproteins (Part B: Characterization, Cell Biology, and Metabolism)
Edited by JOHN J. ALBERS AND JERE P. SEGREST

VOLUME 130. Enzyme Structure (Part K)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME 131. Enzyme Structure (Part L)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF

VOLUME 132. Immunochemical Techniques (Part J: Phagocytosis and Cell-Mediated Cytotoxicity)
Edited by GIOVANNI DI SABATO AND JOHANNES EVERSE

VOLUME 133. Bioluminescence and Chemiluminescence (Part B)
Edited by MARLENE DELUCA AND WILLIAM D. MCELROY

VOLUME 134. Structural and Contractile Proteins (Part C: The Contractile Apparatus and the Cytoskeleton)
Edited by RICHARD B. VALLEE

VOLUME 135. Immobilized Enzymes and Cells (Part B)
Edited by KLAUS MOSBACH

VOLUME 136. Immobilized Enzymes and Cells (Part C)
Edited by KLAUS MOSBACH

VOLUME 137. Immobilized Enzymes and Cells (Part D)
Edited by KLAUS MOSBACH

VOLUME 138. Complex Carbohydrates (Part E)
Edited by VICTOR GINSBURG
VOLUME 139. Cellular Regulators (Part A: Calcium- and Calmodulin-Binding Proteins)
Edited by ANTHONY R. MEANS AND P. MICHAEL CONN

VOLUME 140. Cumulative Subject Index Volumes 102–119, 121–134

VOLUME 141. Cellular Regulators (Part B: Calcium and Lipids)
Edited by P. MICHAEL CONN AND ANTHONY R. MEANS

VOLUME 142. Metabolism of Aromatic Amino Acids and Amines
Edited by SEYMOUR KAUFMAN

VOLUME 143. Sulfur and Sulfur Amino Acids
Edited by WILLIAM B. JAKOBY AND OWEN GRIFFITH

VOLUME 144. Structural and Contractile Proteins (Part D: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM

VOLUME 145. Structural and Contractile Proteins (Part E: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM

VOLUME 146. Peptide Growth Factors (Part A)
Edited by DAVID BARNES AND DAVID A. SIRBASKU

VOLUME 147. Peptide Growth Factors (Part B)
Edited by DAVID BARNES AND DAVID A. SIRBASKU

VOLUME 148. Plant Cell Membranes
Edited by LESTER PACKER AND ROLAND DOUCE

VOLUME 149. Drug and Enzyme Targeting (Part B)
Edited by RALPH GREEN AND KENNETH J. WIDDER

VOLUME 150. Immunochemical Techniques (Part K: In Vitro Models of B and T Cell Functions and Lymphoid Cell Receptors)
Edited by GIOVANNI DI SABATO

VOLUME 151. Molecular Genetics of Mammalian Cells
Edited by MICHAEL M. GOTTESMAN

VOLUME 152. Guide to Molecular Cloning Techniques
Edited by SHELBY L. BERGER AND ALAN R. KIMMEL

VOLUME 153. Recombinant DNA (Part D)
Edited by RAY WU AND LAWRENCE GROSSMAN

VOLUME 154. Recombinant DNA (Part E)
Edited by RAY WU AND LAWRENCE GROSSMAN

VOLUME 155. Recombinant DNA (Part F)
Edited by RAY WU

VOLUME 156. Biomembranes (Part P: ATP-Driven Pumps and Related Transport: The Na, K-Pump)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 157. Biomembranes (Part Q: ATP-Driven Pumps and Related Transport: Calcium, Proton, and Potassium Pumps)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 158. Metalloproteins (Part A)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE

VOLUME 159. Initiation and Termination of Cyclic Nucleotide Action
Edited by JACKIE D. CORBIN AND ROGER A. JOHNSON

VOLUME 160. Biomass (Part A: Cellulose and Hemicellulose)
Edited by WILLIS A. WOOD AND SCOTT T. KELLOGG

VOLUME 161. Biomass (Part B: Lignin, Pectin, and Chitin)
Edited by WILLIS A. WOOD AND SCOTT T. KELLOGG

VOLUME 162. Immunochemical Techniques (Part L: Chemotaxis and Inflammation)
Edited by GIOVANNI DI SABATO

VOLUME 163. Immunochemical Techniques (Part M: Chemotaxis and Inflammation)
Edited by GIOVANNI DI SABATO

VOLUME 164. Ribosomes
Edited by HARRY F. NOLLER, JR., AND KIVIE MOLDAVE

VOLUME 165. Microbial Toxins: Tools for Enzymology
Edited by SIDNEY HARSHMAN

VOLUME 166. Branched-Chain Amino Acids
Edited by ROBERT HARRIS AND JOHN R. SOKATCH

VOLUME 167. Cyanobacteria
Edited by LESTER PACKER AND ALEXANDER N. GLAZER

VOLUME 168. Hormone Action (Part K: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN

VOLUME 169. Platelets: Receptors, Adhesion, Secretion (Part A)
Edited by JACEK HAWIGER

VOLUME 170. Nucleosomes
Edited by PAUL M. WASSARMAN AND ROGER D. KORNBERG

VOLUME 171. Biomembranes (Part R: Transport Theory: Cells and Model Membranes)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 172. Biomembranes (Part S: Transport: Membrane Isolation and Characterization)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 173. Biomembranes [Part T: Cellular and Subcellular Transport: Eukaryotic (Nonepithelial) Cells]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 174. Biomembranes [Part U: Cellular and Subcellular Transport: Eukaryotic (Nonepithelial) Cells]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER

VOLUME 175. Cumulative Subject Index Volumes 135–139, 141–167

VOLUME 176. Nuclear Magnetic Resonance (Part A: Spectral Techniques and Dynamics)
Edited by NORMAN J. OPPENHEIMER AND THOMAS L. JAMES

VOLUME 177. Nuclear Magnetic Resonance (Part B: Structure and Mechanism)
Edited by NORMAN J. OPPENHEIMER AND THOMAS L. JAMES

VOLUME 178. Antibodies, Antigens, and Molecular Mimicry
Edited by JOHN J. LANGONE

VOLUME 179. Complex Carbohydrates (Part F)
Edited by VICTOR GINSBURG

VOLUME 180. RNA Processing (Part A: General Methods)
Edited by JAMES E. DAHLBERG AND JOHN N. ABELSON

VOLUME 181. RNA Processing (Part B: Specific Methods)
Edited by JAMES E. DAHLBERG AND JOHN N. ABELSON

VOLUME 182. Guide to Protein Purification
Edited by MURRAY P. DEUTSCHER

VOLUME 183. Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences
Edited by RUSSELL F. DOOLITTLE

VOLUME 184. Avidin-Biotin Technology
Edited by MEIR WILCHEK AND EDWARD A. BAYER

VOLUME 185. Gene Expression Technology
Edited by DAVID V. GOEDDEL

VOLUME 186. Oxygen Radicals in Biological Systems (Part B: Oxygen Radicals and Antioxidants)
Edited by LESTER PACKER AND ALEXANDER N. GLAZER

VOLUME 187. Arachidonate Related Lipid Mediators
Edited by ROBERT C. MURPHY AND FRANK A. FITZPATRICK

VOLUME 188. Hydrocarbons and Methylotrophy
Edited by MARY E. LIDSTROM

VOLUME 189. Retinoids (Part A: Molecular and Metabolic Aspects)
Edited by LESTER PACKER
VOLUME 190. Retinoids (Part B: Cell Differentiation and Clinical Applications) Edited by LESTER PACKER VOLUME 191. Biomembranes (Part V: Cellular and Subcellular Transport: Epithelial Cells) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 192. Biomembranes (Part W: Cellular and Subcellular Transport: Epithelial Cells) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 193. Mass Spectrometry Edited by JAMES A. MCCLOSKEY VOLUME 194. Guide to Yeast Genetics and Molecular Biology Edited by CHRISTINE GUTHRIE AND GERALD R. FINK VOLUME 195. Adenylyl Cyclase, G Proteins, and Guanylyl Cyclase Edited by ROGER A. JOHNSON AND JACKIE D. CORBIN VOLUME 196. Molecular Motors and the Cytoskeleton Edited by RICHARD B. VALLEE VOLUME 197. Phospholipases Edited by EDWARD A. DENNIS VOLUME 198. Peptide Growth Factors (Part C) Edited by DAVID BARNES, J. P. MATHER, AND GORDON H. SATO VOLUME 199. Cumulative Subject Index Volumes 168–174, 176–194 VOLUME 200. Protein Phosphorylation (Part A: Protein Kinases: Assays, Purification, Antibodies, Functional Analysis, Cloning, and Expression) Edited by TONY HUNTER AND BARTHOLOMEW M. SEFTON VOLUME 201. Protein Phosphorylation (Part B: Analysis of Protein Phosphorylation, Protein Kinase Inhibitors, and Protein Phosphatases) Edited by TONY HUNTER AND BARTHOLOMEW M. SEFTON VOLUME 202. Molecular Design and Modeling: Concepts and Applications (Part A: Proteins, Peptides, and Enzymes) Edited by JOHN J. LANGONE VOLUME 203. Molecular Design and Modeling: Concepts and Applications (Part B: Antibodies and Antigens, Nucleic Acids, Polysaccharides, and Drugs) Edited by JOHN J. LANGONE VOLUME 204. Bacterial Genetic Systems Edited by JEFFREY H. MILLER VOLUME 205. Metallobiochemistry (Part B: Metallothionein and Related Molecules) Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 206. Cytochrome P450 Edited by MICHAEL R. WATERMAN AND ERIC F. JOHNSON VOLUME 207. Ion Channels Edited by BERNARDO RUDY AND LINDA E. IVERSON VOLUME 208. Protein–DNA Interactions Edited by ROBERT T. SAUER VOLUME 209. Phospholipid Biosynthesis Edited by EDWARD A. DENNIS AND DENNIS E. VANCE VOLUME 210. Numerical Computer Methods Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 211. DNA Structures (Part A: Synthesis and Physical Analysis of DNA) Edited by DAVID M. J. LILLEY AND JAMES E. DAHLBERG VOLUME 212. DNA Structures (Part B: Chemical and Electrophoretic Analysis of DNA) Edited by DAVID M. J. LILLEY AND JAMES E. DAHLBERG VOLUME 213. Carotenoids (Part A: Chemistry, Separation, Quantitation, and Antioxidation) Edited by LESTER PACKER VOLUME 214. Carotenoids (Part B: Metabolism, Genetics, and Biosynthesis) Edited by LESTER PACKER VOLUME 215. Platelets: Receptors, Adhesion, Secretion (Part B) Edited by JACEK J. HAWIGER VOLUME 216. Recombinant DNA (Part G) Edited by RAY WU VOLUME 217. Recombinant DNA (Part H) Edited by RAY WU VOLUME 218. Recombinant DNA (Part I) Edited by RAY WU VOLUME 219. Reconstitution of Intracellular Transport Edited by JAMES E. ROTHMAN VOLUME 220. Membrane Fusion Techniques (Part A) Edited by NEJAT DÜZGÜNEŞ VOLUME 221. Membrane Fusion Techniques (Part B) Edited by NEJAT DÜZGÜNEŞ VOLUME 222. Proteolytic Enzymes in Coagulation, Fibrinolysis, and Complement Activation (Part A: Mammalian Blood Coagulation Factors and Inhibitors) Edited by LASZLO LORAND AND KENNETH G. MANN
VOLUME 223. Proteolytic Enzymes in Coagulation, Fibrinolysis, and Complement Activation (Part B: Complement Activation, Fibrinolysis, and Nonmammalian Blood Coagulation Factors) Edited by LASZLO LORAND AND KENNETH G. MANN VOLUME 224. Molecular Evolution: Producing the Biochemical Data Edited by ELIZABETH ANNE ZIMMER, THOMAS J. WHITE, REBECCA L. CANN, AND ALLAN C. WILSON VOLUME 225. Guide to Techniques in Mouse Development Edited by PAUL M. WASSARMAN AND MELVIN L. DEPAMPHILIS VOLUME 226. Metallobiochemistry (Part C: Spectroscopic and Physical Methods for Probing Metal Ion Environments in Metalloenzymes and Metalloproteins) Edited by JAMES F. RIORDAN AND BERT L. VALLEE VOLUME 227. Metallobiochemistry (Part D: Physical and Spectroscopic Methods for Probing Metal Ion Environments in Metalloproteins) Edited by JAMES F. RIORDAN AND BERT L. VALLEE VOLUME 228. Aqueous Two-Phase Systems Edited by HARRY WALTER AND GÖTE JOHANSSON VOLUME 229. Cumulative Subject Index Volumes 195–198, 200–227 VOLUME 230. Guide to Techniques in Glycobiology Edited by WILLIAM J. LENNARZ AND GERALD W. HART VOLUME 231. Hemoglobins (Part B: Biochemical and Analytical Methods) Edited by JOHANNES EVERSE, KIM D. VANDEGRIFF, AND ROBERT M. WINSLOW VOLUME 232. Hemoglobins (Part C: Biophysical Methods) Edited by JOHANNES EVERSE, KIM D. VANDEGRIFF, AND ROBERT M. WINSLOW VOLUME 233. Oxygen Radicals in Biological Systems (Part C) Edited by LESTER PACKER VOLUME 234. Oxygen Radicals in Biological Systems (Part D) Edited by LESTER PACKER VOLUME 235. Bacterial Pathogenesis (Part A: Identification and Regulation of Virulence Factors) Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL VOLUME 236. Bacterial Pathogenesis (Part B: Integration of Pathogenic Bacteria with Host Cells) Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL VOLUME 237. Heterotrimeric G Proteins Edited by RAVI IYENGAR VOLUME 238. Heterotrimeric G-Protein Effectors Edited by RAVI IYENGAR
VOLUME 239. Nuclear Magnetic Resonance (Part C) Edited by THOMAS L. JAMES AND NORMAN J. OPPENHEIMER VOLUME 240. Numerical Computer Methods (Part B) Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 241. Retroviral Proteases Edited by LAWRENCE C. KUO AND JULES A. SHAFER VOLUME 242. Neoglycoconjugates (Part A) Edited by Y. C. LEE AND REIKO T. LEE VOLUME 243. Inorganic Microbial Sulfur Metabolism Edited by HARRY D. PECK, JR., AND JEAN LEGALL VOLUME 244. Proteolytic Enzymes: Serine and Cysteine Peptidases Edited by ALAN J. BARRETT VOLUME 245. Extracellular Matrix Components Edited by E. RUOSLAHTI AND E. ENGVALL VOLUME 246. Biochemical Spectroscopy Edited by KENNETH SAUER VOLUME 247. Neoglycoconjugates (Part B: Biomedical Applications) Edited by Y. C. LEE AND REIKO T. LEE VOLUME 248. Proteolytic Enzymes: Aspartic and Metallo Peptidases Edited by ALAN J. BARRETT VOLUME 249. Enzyme Kinetics and Mechanism (Part D: Developments in Enzyme Dynamics) Edited by DANIEL L. PURICH VOLUME 250. Lipid Modifications of Proteins Edited by PATRICK J. CASEY AND JANICE E. BUSS VOLUME 251. Biothiols (Part A: Monothiols and Dithiols, Protein Thiols, and Thiyl Radicals) Edited by LESTER PACKER VOLUME 252. Biothiols (Part B: Glutathione and Thioredoxin; Thiols in Signal Transduction and Gene Regulation) Edited by LESTER PACKER VOLUME 253. Adhesion of Microbial Pathogens Edited by RON J. DOYLE AND ITZHAK OFEK VOLUME 254. Oncogene Techniques Edited by PETER K. VOGT AND INDER M. VERMA VOLUME 255. Small GTPases and Their Regulators (Part A: Ras Family) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 256. Small GTPases and Their Regulators (Part B: Rho Family) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 257. Small GTPases and Their Regulators (Part C: Proteins Involved in Transport) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 258. Redox-Active Amino Acids in Biology Edited by JUDITH P. KLINMAN VOLUME 259. Energetics of Biological Macromolecules Edited by MICHAEL L. JOHNSON AND GARY K. ACKERS VOLUME 260. Mitochondrial Biogenesis and Genetics (Part A) Edited by GIUSEPPE M. ATTARDI AND ANNE CHOMYN VOLUME 261. Nuclear Magnetic Resonance and Nucleic Acids Edited by THOMAS L. JAMES VOLUME 262. DNA Replication Edited by JUDITH L. CAMPBELL VOLUME 263. Plasma Lipoproteins (Part C: Quantitation) Edited by WILLIAM A. BRADLEY, SANDRA H. GIANTURCO, AND JERE P. SEGREST VOLUME 264. Mitochondrial Biogenesis and Genetics (Part B) Edited by GIUSEPPE M. ATTARDI AND ANNE CHOMYN VOLUME 265. Cumulative Subject Index Volumes 228, 230–262 VOLUME 266. Computer Methods for Macromolecular Sequence Analysis Edited by RUSSELL F. DOOLITTLE VOLUME 267. Combinatorial Chemistry Edited by JOHN N. ABELSON VOLUME 268. Nitric Oxide (Part A: Sources and Detection of NO; NO Synthase) Edited by LESTER PACKER VOLUME 269. Nitric Oxide (Part B: Physiological and Pathological Processes) Edited by LESTER PACKER VOLUME 270. High Resolution Separation and Analysis of Biological Macromolecules (Part A: Fundamentals) Edited by BARRY L. KARGER AND WILLIAM S. HANCOCK VOLUME 271. High Resolution Separation and Analysis of Biological Macromolecules (Part B: Applications) Edited by BARRY L. KARGER AND WILLIAM S. HANCOCK VOLUME 272. Cytochrome P450 (Part B) Edited by ERIC F. JOHNSON AND MICHAEL R. WATERMAN VOLUME 273. RNA Polymerase and Associated Factors (Part A) Edited by SANKAR ADHYA
VOLUME 274. RNA Polymerase and Associated Factors (Part B) Edited by SANKAR ADHYA VOLUME 275. Viral Polymerases and Related Proteins Edited by LAWRENCE C. KUO, DAVID B. OLSEN, AND STEVEN S. CARROLL VOLUME 276. Macromolecular Crystallography (Part A) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 277. Macromolecular Crystallography (Part B) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 278. Fluorescence Spectroscopy Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 279. Vitamins and Coenzymes (Part I) Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER VOLUME 280. Vitamins and Coenzymes (Part J) Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER VOLUME 281. Vitamins and Coenzymes (Part K) Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER VOLUME 282. Vitamins and Coenzymes (Part L) Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER VOLUME 283. Cell Cycle Control Edited by WILLIAM G. DUNPHY VOLUME 284. Lipases (Part A: Biotechnology) Edited by BYRON RUBIN AND EDWARD A. DENNIS VOLUME 285. Cumulative Subject Index Volumes 263, 264, 266–284, 286–289 VOLUME 286. Lipases (Part B: Enzyme Characterization and Utilization) Edited by BYRON RUBIN AND EDWARD A. DENNIS VOLUME 287. Chemokines Edited by RICHARD HORUK VOLUME 288. Chemokine Receptors Edited by RICHARD HORUK VOLUME 289. Solid Phase Peptide Synthesis Edited by GREGG B. FIELDS VOLUME 290. Molecular Chaperones Edited by GEORGE H. LORIMER AND THOMAS BALDWIN VOLUME 291. Caged Compounds Edited by GERARD MARRIOTT VOLUME 292. ABC Transporters: Biochemical, Cellular, and Molecular Aspects Edited by SURESH V. AMBUDKAR AND MICHAEL M. GOTTESMAN
VOLUME 293. Ion Channels (Part B) Edited by P. MICHAEL CONN VOLUME 294. Ion Channels (Part C) Edited by P. MICHAEL CONN VOLUME 295. Energetics of Biological Macromolecules (Part B) Edited by GARY K. ACKERS AND MICHAEL L. JOHNSON VOLUME 296. Neurotransmitter Transporters Edited by SUSAN G. AMARA VOLUME 297. Photosynthesis: Molecular Biology of Energy Capture Edited by LEE MCINTOSH VOLUME 298. Molecular Motors and the Cytoskeleton (Part B) Edited by RICHARD B. VALLEE VOLUME 299. Oxidants and Antioxidants (Part A) Edited by LESTER PACKER VOLUME 300. Oxidants and Antioxidants (Part B) Edited by LESTER PACKER VOLUME 301. Nitric Oxide: Biological and Antioxidant Activities (Part C) Edited by LESTER PACKER VOLUME 302. Green Fluorescent Protein Edited by P. MICHAEL CONN VOLUME 303. cDNA Preparation and Display Edited by SHERMAN M. WEISSMAN VOLUME 304. Chromatin Edited by PAUL M. WASSARMAN AND ALAN P. WOLFFE VOLUME 305. Bioluminescence and Chemiluminescence (Part C) Edited by THOMAS O. BALDWIN AND MIRIAM M. ZIEGLER VOLUME 306. Expression of Recombinant Genes in Eukaryotic Systems Edited by JOSEPH C. GLORIOSO AND MARTIN C. SCHMIDT VOLUME 307. Confocal Microscopy Edited by P. MICHAEL CONN VOLUME 308. Enzyme Kinetics and Mechanism (Part E: Energetics of Enzyme Catalysis) Edited by DANIEL L. PURICH AND VERN L. SCHRAMM VOLUME 309. Amyloid, Prions, and Other Protein Aggregates Edited by RONALD WETZEL VOLUME 310. Biofilms Edited by RON J. DOYLE
VOLUME 311. Sphingolipid Metabolism and Cell Signaling (Part A) Edited by ALFRED H. MERRILL, JR., AND YUSUF A. HANNUN VOLUME 312. Sphingolipid Metabolism and Cell Signaling (Part B) Edited by ALFRED H. MERRILL, JR., AND YUSUF A. HANNUN VOLUME 313. Antisense Technology (Part A: General Methods, Methods of Delivery, and RNA Studies) Edited by M. IAN PHILLIPS VOLUME 314. Antisense Technology (Part B: Applications) Edited by M. IAN PHILLIPS VOLUME 315. Vertebrate Phototransduction and the Visual Cycle (Part A) Edited by KRZYSZTOF PALCZEWSKI VOLUME 316. Vertebrate Phototransduction and the Visual Cycle (Part B) Edited by KRZYSZTOF PALCZEWSKI VOLUME 317. RNA–Ligand Interactions (Part A: Structural Biology Methods) Edited by DANIEL W. CELANDER AND JOHN N. ABELSON VOLUME 318. RNA–Ligand Interactions (Part B: Molecular Biology Methods) Edited by DANIEL W. CELANDER AND JOHN N. ABELSON VOLUME 319. Singlet Oxygen, UV-A, and Ozone Edited by LESTER PACKER AND HELMUT SIES VOLUME 320. Cumulative Subject Index Volumes 290–319 VOLUME 321. Numerical Computer Methods (Part C) Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 322. Apoptosis Edited by JOHN C. REED VOLUME 323. Energetics of Biological Macromolecules (Part C) Edited by MICHAEL L. JOHNSON AND GARY K. ACKERS VOLUME 324. Branched-Chain Amino Acids (Part B) Edited by ROBERT A. HARRIS AND JOHN R. SOKATCH VOLUME 325. Regulators and Effectors of Small GTPases (Part D: Rho Family) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 326. Applications of Chimeric Genes and Hybrid Proteins (Part A: Gene Expression and Protein Purification) Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON VOLUME 327. Applications of Chimeric Genes and Hybrid Proteins (Part B: Cell Biology and Physiology) Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON
VOLUME 328. Applications of Chimeric Genes and Hybrid Proteins (Part C: Protein–Protein Interactions and Genomics) Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON VOLUME 329. Regulators and Effectors of Small GTPases (Part E: GTPases Involved in Vesicular Traffic) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 330. Hyperthermophilic Enzymes (Part A) Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY VOLUME 331. Hyperthermophilic Enzymes (Part B) Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY VOLUME 332. Regulators and Effectors of Small GTPases (Part F: Ras Family I) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 333. Regulators and Effectors of Small GTPases (Part G: Ras Family II) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 334. Hyperthermophilic Enzymes (Part C) Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY VOLUME 335. Flavonoids and Other Polyphenols Edited by LESTER PACKER VOLUME 336. Microbial Growth in Biofilms (Part A: Developmental and Molecular Biological Aspects) Edited by RON J. DOYLE VOLUME 337. Microbial Growth in Biofilms (Part B: Special Environments and Physicochemical Aspects) Edited by RON J. DOYLE VOLUME 338. Nuclear Magnetic Resonance of Biological Macromolecules (Part A) Edited by THOMAS L. JAMES, VOLKER DÖTSCH, AND ULI SCHMITZ VOLUME 339. Nuclear Magnetic Resonance of Biological Macromolecules (Part B) Edited by THOMAS L. JAMES, VOLKER DÖTSCH, AND ULI SCHMITZ VOLUME 340. Drug–Nucleic Acid Interactions Edited by JONATHAN B. CHAIRES AND MICHAEL J. WARING VOLUME 341. Ribonucleases (Part A) Edited by ALLEN W. NICHOLSON VOLUME 342. Ribonucleases (Part B) Edited by ALLEN W. NICHOLSON VOLUME 343. G Protein Pathways (Part A: Receptors) Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT VOLUME 344. G Protein Pathways (Part B: G Proteins and Their Regulators) Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT
VOLUME 345. G Protein Pathways (Part C: Effector Mechanisms) Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT VOLUME 346. Gene Therapy Methods Edited by M. IAN PHILLIPS VOLUME 347. Protein Sensors and Reactive Oxygen Species (Part A: Selenoproteins and Thioredoxin) Edited by HELMUT SIES AND LESTER PACKER VOLUME 348. Protein Sensors and Reactive Oxygen Species (Part B: Thiol Enzymes and Proteins) Edited by HELMUT SIES AND LESTER PACKER VOLUME 349. Superoxide Dismutase Edited by LESTER PACKER VOLUME 350. Guide to Yeast Genetics and Molecular and Cell Biology (Part B) Edited by CHRISTINE GUTHRIE AND GERALD R. FINK VOLUME 351. Guide to Yeast Genetics and Molecular and Cell Biology (Part C) Edited by CHRISTINE GUTHRIE AND GERALD R. FINK VOLUME 352. Redox Cell Biology and Genetics (Part A) Edited by CHANDAN K. SEN AND LESTER PACKER VOLUME 353. Redox Cell Biology and Genetics (Part B) Edited by CHANDAN K. SEN AND LESTER PACKER VOLUME 354. Enzyme Kinetics and Mechanisms (Part F: Detection and Characterization of Enzyme Reaction Intermediates) Edited by DANIEL L. PURICH VOLUME 355. Cumulative Subject Index Volumes 321–354 VOLUME 356. Laser Capture Microscopy and Microdissection Edited by P. MICHAEL CONN VOLUME 357. Cytochrome P450, Part C Edited by ERIC F. JOHNSON AND MICHAEL R. WATERMAN VOLUME 358. Bacterial Pathogenesis (Part C: Identification, Regulation, and Function of Virulence Factors) Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL VOLUME 359. Nitric Oxide (Part D) Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 360. Biophotonics (Part A) Edited by GERARD MARRIOTT AND IAN PARKER VOLUME 361. Biophotonics (Part B) Edited by GERARD MARRIOTT AND IAN PARKER
VOLUME 362. Recognition of Carbohydrates in Biological Systems (Part A) Edited by YUAN C. LEE AND REIKO T. LEE VOLUME 363. Recognition of Carbohydrates in Biological Systems (Part B) Edited by YUAN C. LEE AND REIKO T. LEE VOLUME 364. Nuclear Receptors Edited by DAVID W. RUSSELL AND DAVID J. MANGELSDORF VOLUME 365. Differentiation of Embryonic Stem Cells Edited by PAUL M. WASSARMAN AND GORDON M. KELLER VOLUME 366. Protein Phosphatases Edited by SUSANNE KLUMPP AND JOSEF KRIEGLSTEIN VOLUME 367. Liposomes (Part A) Edited by NEJAT DÜZGÜNEŞ VOLUME 368. Macromolecular Crystallography (Part C) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 369. Combinatorial Chemistry (Part B) Edited by GUILLERMO A. MORALES AND BARRY A. BUNIN VOLUME 370. RNA Polymerases and Associated Factors (Part C) Edited by SANKAR L. ADHYA AND SUSAN GARGES VOLUME 371. RNA Polymerases and Associated Factors (Part D) Edited by SANKAR L. ADHYA AND SUSAN GARGES VOLUME 372. Liposomes (Part B) Edited by NEJAT DÜZGÜNEŞ VOLUME 373. Liposomes (Part C) Edited by NEJAT DÜZGÜNEŞ VOLUME 374. Macromolecular Crystallography (Part D) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 375. Chromatin and Chromatin Remodeling Enzymes (Part A) Edited by C. DAVID ALLIS AND CARL WU VOLUME 376. Chromatin and Chromatin Remodeling Enzymes (Part B) Edited by C. DAVID ALLIS AND CARL WU VOLUME 377. Chromatin and Chromatin Remodeling Enzymes (Part C) Edited by C. DAVID ALLIS AND CARL WU VOLUME 378. Quinones and Quinone Enzymes (Part A) Edited by HELMUT SIES AND LESTER PACKER VOLUME 379. Energetics of Biological Macromolecules (Part D) Edited by JO M. HOLT, MICHAEL L. JOHNSON, AND GARY K. ACKERS VOLUME 380. Energetics of Biological Macromolecules (Part E) Edited by JO M. HOLT, MICHAEL L. JOHNSON, AND GARY K. ACKERS
VOLUME 381. Oxygen Sensing Edited by CHANDAN K. SEN AND GREGG L. SEMENZA VOLUME 382. Quinones and Quinone Enzymes (Part B) Edited by HELMUT SIES AND LESTER PACKER VOLUME 383. Numerical Computer Methods (Part D) Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 384. Numerical Computer Methods (Part E) Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 385. Imaging in Biological Research (Part A) Edited by P. MICHAEL CONN VOLUME 386. Imaging in Biological Research (Part B) Edited by P. MICHAEL CONN VOLUME 387. Liposomes (Part D) Edited by NEJAT DÜZGÜNEŞ VOLUME 388. Protein Engineering Edited by DAN E. ROBERTSON AND JOSEPH P. NOEL VOLUME 389. Regulators of G-Protein Signaling (Part A) Edited by DAVID P. SIDEROVSKI VOLUME 390. Regulators of G-Protein Signaling (Part B) Edited by DAVID P. SIDEROVSKI VOLUME 391. Liposomes (Part E) Edited by NEJAT DÜZGÜNEŞ VOLUME 392. RNA Interference Edited by DAVID R. ENGELKE AND JOHN J. ROSSI VOLUME 393. Circadian Rhythms Edited by MICHAEL W. YOUNG VOLUME 394. Nuclear Magnetic Resonance of Biological Macromolecules (Part C) Edited by THOMAS L. JAMES VOLUME 395. Producing the Biochemical Data (Part B) Edited by ELIZABETH A. ZIMMER AND ERIC H. ROALSON VOLUME 396. Nitric Oxide (Part E) Edited by LESTER PACKER AND ENRIQUE CADENAS VOLUME 397. Environmental Microbiology Edited by JARED R. LEADBETTER VOLUME 398. Ubiquitin and Protein Degradation (Part A) Edited by RAYMOND J. DESHAIES
VOLUME 399. Ubiquitin and Protein Degradation (Part B) Edited by RAYMOND J. DESHAIES VOLUME 400. Phase II Conjugation Enzymes and Transport Systems Edited by HELMUT SIES AND LESTER PACKER VOLUME 401. Glutathione Transferases and Gamma Glutamyl Transpeptidases Edited by HELMUT SIES AND LESTER PACKER VOLUME 402. Biological Mass Spectrometry Edited by A. L. BURLINGAME VOLUME 403. GTPases Regulating Membrane Targeting and Fusion Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 404. GTPases Regulating Membrane Dynamics Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 405. Mass Spectrometry: Modified Proteins and Glycoconjugates Edited by A. L. BURLINGAME VOLUME 406. Regulators and Effectors of Small GTPases: Rho Family Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 407. Regulators and Effectors of Small GTPases: Ras Family Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 408. DNA Repair (Part A) Edited by JUDITH L. CAMPBELL AND PAUL MODRICH VOLUME 409. DNA Repair (Part B) Edited by JUDITH L. CAMPBELL AND PAUL MODRICH VOLUME 410. DNA Microarrays (Part A: Array Platforms and Web-Bench Protocols) Edited by ALAN KIMMEL AND BRIAN OLIVER VOLUME 411. DNA Microarrays (Part B: Databases and Statistics) Edited by ALAN KIMMEL AND BRIAN OLIVER VOLUME 412. Amyloid, Prions, and Other Protein Aggregates (Part B) Edited by INDU KHETERPAL AND RONALD WETZEL VOLUME 413. Amyloid, Prions, and Other Protein Aggregates (Part C) Edited by INDU KHETERPAL AND RONALD WETZEL VOLUME 414. Measuring Biological Responses with Automated Microscopy Edited by JAMES INGLESE VOLUME 415. Glycobiology Edited by MINORU FUKUDA VOLUME 416. Glycomics Edited by MINORU FUKUDA
VOLUME 417. Functional Glycomics Edited by MINORU FUKUDA VOLUME 418. Embryonic Stem Cells Edited by IRINA KLIMANSKAYA AND ROBERT LANZA VOLUME 419. Adult Stem Cells Edited by IRINA KLIMANSKAYA AND ROBERT LANZA VOLUME 420. Stem Cell Tools and Other Experimental Protocols Edited by IRINA KLIMANSKAYA AND ROBERT LANZA VOLUME 421. Advanced Bacterial Genetics: Use of Transposons and Phage for Genomic Engineering Edited by KELLY T. HUGHES VOLUME 422. Two-Component Signaling Systems, Part A Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE VOLUME 423. Two-Component Signaling Systems, Part B Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE VOLUME 424. RNA Editing Edited by JONATHA M. GOTT VOLUME 425. RNA Modification Edited by JONATHA M. GOTT VOLUME 426. Integrins Edited by DAVID CHERESH VOLUME 427. MicroRNA Methods Edited by JOHN J. ROSSI VOLUME 428. Osmosensing and Osmosignaling Edited by HELMUT SIES AND DIETER HAUSSINGER VOLUME 429. Translation Initiation: Extract Systems and Molecular Genetics Edited by JON LORSCH VOLUME 430. Translation Initiation: Reconstituted Systems and Biophysical Methods Edited by JON LORSCH VOLUME 431. Translation Initiation: Cell Biology, High-Throughput and Chemical-Based Approaches Edited by JON LORSCH VOLUME 432. Lipidomics and Bioactive Lipids: Mass-Spectrometry–Based Lipid Analysis Edited by H. ALEX BROWN
VOLUME 433. Lipidomics and Bioactive Lipids: Specialized Analytical Methods and Lipids in Disease Edited by H. ALEX BROWN VOLUME 434. Lipidomics and Bioactive Lipids: Lipids and Cell Signaling Edited by H. ALEX BROWN VOLUME 435. Oxygen Biology and Hypoxia Edited by HELMUT SIES AND BERNHARD BRÜNE VOLUME 436. Globins and Other Nitric Oxide-Reactive Proteins (Part A) Edited by ROBERT K. POOLE VOLUME 437. Globins and Other Nitric Oxide-Reactive Proteins (Part B) Edited by ROBERT K. POOLE VOLUME 438. Small GTPases in Disease (Part A) Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 439. Small GTPases in Disease (Part B) Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 440. Nitric Oxide, Part F: Oxidative and Nitrosative Stress in Redox Regulation of Cell Signaling Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 441. Nitric Oxide, Part G: Oxidative and Nitrosative Stress in Redox Regulation of Cell Signaling Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 442. Programmed Cell Death, General Principles for Studying Cell Death (Part A) Edited by ROYA KHOSRAVI-FAR, ZAHRA ZAKERI, RICHARD A. LOCKSHIN, AND MAURO PIACENTINI VOLUME 443. Angiogenesis: In Vitro Systems Edited by DAVID A. CHERESH VOLUME 444. Angiogenesis: In Vivo Systems (Part A) Edited by DAVID A. CHERESH VOLUME 445. Angiogenesis: In Vivo Systems (Part B) Edited by DAVID A. CHERESH VOLUME 446. Programmed Cell Death, The Biology and Therapeutic Implications of Cell Death (Part B) Edited by ROYA KHOSRAVI-FAR, ZAHRA ZAKERI, RICHARD A. LOCKSHIN, AND MAURO PIACENTINI VOLUME 447. RNA Turnover in Bacteria, Archaea and Organelles Edited by LYNNE E. MAQUAT AND CECILIA M. ARRAIANO
VOLUME 448. RNA Turnover in Eukaryotes: Nucleases, Pathways and Analysis of mRNA Decay Edited by LYNNE E. MAQUAT AND MEGERDITCH KILEDJIAN VOLUME 449. RNA Turnover in Eukaryotes: Analysis of Specialized and Quality Control RNA Decay Pathways Edited by LYNNE E. MAQUAT AND MEGERDITCH KILEDJIAN VOLUME 450. Fluorescence Spectroscopy Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 451. Autophagy: Lower Eukaryotes and Non-Mammalian Systems (Part A) Edited by DANIEL J. KLIONSKY VOLUME 452. Autophagy in Mammalian Systems (Part B) Edited by DANIEL J. KLIONSKY VOLUME 453. Autophagy in Disease and Clinical Applications (Part C) Edited by DANIEL J. KLIONSKY VOLUME 454. Computer Methods (Part A) Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 455. Biothermodynamics (Part A) Edited by MICHAEL L. JOHNSON, JO M. HOLT, AND GARY K. ACKERS (RETIRED) VOLUME 456. Mitochondrial Function, Part A: Mitochondrial Electron Transport Complexes and Reactive Oxygen Species Edited by WILLIAM S. ALLISON AND IMMO E. SCHEFFLER VOLUME 457. Mitochondrial Function, Part B: Mitochondrial Protein Kinases, Protein Phosphatases and Mitochondrial Diseases Edited by WILLIAM S. ALLISON AND ANNE N. MURPHY VOLUME 458. Complex Enzymes in Microbial Natural Product Biosynthesis, Part A: Overview Articles and Peptides Edited by DAVID A. HOPWOOD VOLUME 459. Complex Enzymes in Microbial Natural Product Biosynthesis, Part B: Polyketides, Aminocoumarins and Carbohydrates Edited by DAVID A. HOPWOOD VOLUME 460. Chemokines, Part A Edited by TRACY M. HANDEL AND DAMON J. HAMEL VOLUME 461. Chemokines, Part B Edited by TRACY M. HANDEL AND DAMON J. HAMEL VOLUME 462. Non-Natural Amino Acids Edited by TOM W. MUIR AND JOHN N. ABELSON VOLUME 463. Guide to Protein Purification, 2nd Edition Edited by RICHARD R. BURGESS AND MURRAY P. DEUTSCHER
VOLUME 464. Liposomes, Part F Edited by NEJAT DÜZGÜNEŞ VOLUME 465. Liposomes, Part G Edited by NEJAT DÜZGÜNEŞ VOLUME 466. Biothermodynamics, Part B Edited by MICHAEL L. JOHNSON, GARY K. ACKERS, AND JO M. HOLT VOLUME 467. Computer Methods, Part B Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 468. Biophysical, Chemical, and Functional Probes of RNA Structure, Interactions and Folding: Part A Edited by DANIEL HERSCHLAG VOLUME 469. Biophysical, Chemical, and Functional Probes of RNA Structure, Interactions and Folding: Part B Edited by DANIEL HERSCHLAG VOLUME 470. Guide to Yeast Genetics: Functional Genomics, Proteomics, and Other Systems Analysis, 2nd Edition Edited by GERALD FINK, JONATHAN WEISSMAN, AND CHRISTINE GUTHRIE VOLUME 471. Two-Component Signaling Systems, Part C Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE VOLUME 472. Single Molecule Tools, Part A: Fluorescence Based Approaches Edited by NILS G. WALTER VOLUME 473. Thiol Redox Transitions in Cell Signaling, Part A: Chemistry and Biochemistry of Low Molecular Weight and Protein Thiols Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 474. Thiol Redox Transitions in Cell Signaling, Part B: Cellular Localization and Signaling Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 475. Single Molecule Tools, Part B: Super-Resolution, Particle Tracking, Multiparameter, and Force Based Methods Edited by NILS G. WALTER VOLUME 476. Guide to Techniques in Mouse Development, Part A: Mice, Embryos, and Cells, 2nd Edition Edited by PAUL M. WASSARMAN AND PHILIPPE M. SORIANO VOLUME 477. Guide to Techniques in Mouse Development, Part B: Mouse Molecular Genetics, 2nd Edition Edited by PAUL M. WASSARMAN AND PHILIPPE M. SORIANO VOLUME 478. Glycomics Edited by MINORU FUKUDA
VOLUME 479. Functional Glycomics Edited by MINORU FUKUDA VOLUME 480. Glycobiology Edited by MINORU FUKUDA VOLUME 481. Cryo-EM, Part A: Sample Preparation and Data Collection Edited by GRANT J. JENSEN VOLUME 482. Cryo-EM, Part B: 3-D Reconstruction Edited by GRANT J. JENSEN VOLUME 483. Cryo-EM, Part C: Analyses, Interpretation, and Case Studies Edited by GRANT J. JENSEN VOLUME 484. Constitutive Activity in Receptors and Other Proteins, Part A Edited by P. MICHAEL CONN VOLUME 485. Constitutive Activity in Receptors and Other Proteins, Part B Edited by P. MICHAEL CONN VOLUME 486. Research on Nitrification and Related Processes, Part A Edited by MARTIN G. KLOTZ VOLUME 487. Computer Methods, Part C Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
CHAPTER ONE
Predicting Fluorescence Lifetimes and Spectra of Biopolymers

Patrik R. Callis

Contents
1. Introduction
1.1. General principles for predicting fluorescence properties
2. Qualitative Concepts: Developing Intuition
2.1. Trajectories
2.2. How to recognize a quenching environment
3. Methods
3.1. Philosophy
3.2. MD simulations
3.3. Quantum mechanics
3.4. Interface between QM and MD
4. Nonexponential Fluorescence Decay
5. Final Remarks
Acknowledgments
References
Abstract

Use of fluorescence in biology and biochemistry for imaging and characterizing equilibrium and dynamic processes is growing exponentially. Much progress has been made in the last few years on the microscopic understanding of the underlying principles of what controls the wavelength and quenching of fluorescence in biopolymers, both of which are central to the utility of fluorescent probes. This chapter is concerned with the quantitative microscopic understanding and prediction of the fluorescence wavelength and/or intensity of a fluorescent probe molecule attached to a biopolymer as revealed by hybrid quantum and classical mechanical computation procedures. The aim is not only to provide a recipe but also, even more importantly, to communicate the qualitative basic concepts of interpretation of fluorescence. These are surprisingly simple, although not broadly appreciated at this time. In addition, an effort has been made to show how these techniques have led to an emerging understanding of the relation between time-dependent wavelength shifts due to solvent relaxation and population decay of conformational subensembles.

Department of Chemistry and Biochemistry, Montana State University, Bozeman, Montana, USA

Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87001-1
© 2011 Elsevier Inc. All rights reserved.
1. Introduction

Use of fluorescence is enjoying rapid growth and development in its application to an array of fundamental and practical studies in the biological sciences. The usefulness of fluorescence falls roughly into three broad categories:

(1) imaging,
(2) quantifying equilibrium systems, and
(3) following dynamic processes.

This chapter deals specifically with the quantitative microscopic understanding and prediction of the fluorescence wavelength and/or intensity of a fluorescent probe molecule attached to a biopolymer, whether in an equilibrium state or during a dynamic process. (Of course, equilibrium and dynamic processes are now often quantified with the aid of imaging.) Even at equilibrium, most biopolymers are increasingly perceived as ensembles of conformations (Luo et al., 2006; Prakash and Marcus, 2007, 2008; Yang et al., 2003), which means that conformations fluctuate on long time scales and fluorescence properties appear heterogeneous on short time scales. The theme of this chapter is that the sensitivity of a probe's intensity, that is, its quantum yield (Φf), excited-state lifetime (τf), and wavelength (λmax), to the precise local environment often lies in responsiveness to the local electric field strength and direction. More usefully, it is the change in electric potential (volts) between the space initially occupied by the electron and the space occupied after the electronic transition that determines whether the fluorescence will be red- or blue-shifted, and whether an electron acceptor will or will not quench the fluorescence. Electrons, of course, are at lower energy if the potential is more positive (less negative), and such positive differences in potential can be provided by ions and electric dipoles. For example, the ruggedness of the electrostatic landscape in proteins is mirrored in the ruggedness of the tryptophan (Trp) fluorescence quantum yield and lifetime landscape.
This is quite pertinent to the study of proteins because the diversity and strength of electric fields in proteins underlie their function as enzymes (Kamerlin et al., 2009; Roca et al., 2008; Warshel et al., 2006). In practice, the mechanism of the detected change in wavelength or intensity of fluorescence is almost always unknown in detail, or plausibly surmised at best. While this does not limit measurement of the rate of the
process, lack of knowledge can lead to incorrect conclusions as to the nature of the process. For example, folding of a protein is often followed via the extent of quenching of a fluorophore by a quencher residue thought to be spatially close in the folded state, but this alone does not assure that the intensity change precisely follows the folding process. Much progress has been made in the last few years on the microscopic understanding of the underlying principles of what controls the wavelength and quenching of fluorescence in biopolymers (Beierlein et al., 2006; Callis, 2009; Callis and Liu, 2004; Callis et al., 2007; Clark et al., 1999, 2002; Muiño and Callis, 2009; Rusu et al., 2008; Vivian and Callis, 2001). This chapter attempts to relate the surprisingly simple ideas that have emerged from quantum mechanics–classical molecular dynamics (QM–MD) simulations applied to proteins, including insight into the correlation of wavelength and subpopulation decay, and to present details of the procedures.
1.1. General principles for predicting fluorescence properties

All light may be formally classified as fluorescence inasmuch as photons are emitted from their source during downward energy transitions in matter. In this chapter, we take fluorescence to have the usual meaning common to chemistry and biology: light emitted from a molecule by virtue of being electronically excited via a substantially allowed transition (i.e., in times <1 ms), typically by absorption of UV or visible light. As with everything involving electrons in our physical world, understanding and predicting fluorescence emission wavelengths, spectral distribution, and intensity can, in principle, be reduced to a combination of stationary and time-dependent QM applied to the system of interest. The most effective probes, however, exhibit great sensitivity to their microenvironment, meaning that the property being detected will fluctuate with high amplitude and display a large variance (Loring, 1990), reflecting the dynamic nature of the polymer. Thus, any method seeking to simulate the response must simulate the microenvironment in an appropriately dynamic manner that converges to correct average values. While it is becoming feasible to apply QM to a single biopolymer including solvent using specialized GPUs (Ufimtsev and Martinez, 2009), the large number of QM calculations needed to adequately simulate the fluctuating environment in biopolymers requires, in practice, propagating the environment classically as a set of point charges while treating only the probe molecule with QM.

1.1.1. Wavelength prediction
The wavelength at which the probe absorbs or emits photons comes directly from the original Planck expression relating frequency to the quantum energy level difference, ΔE:
ΔE = hν = hc/λ

in which h is Planck's constant, c the speed of light, ν the frequency, and λ the wavelength. This simple expression, however, begs the question as to which energy levels. At ambient temperatures, the assumption that molecules begin in the zero-point level is reasonable, but because the geometry of a molecule changes considerably upon excitation, and electronic transitions are fast compared to the motion of nuclei (vertical excitation), there is a range of photon wavelengths that can take the molecule to the new electronic state, with some degree of vibrational excitation making up the ΔE. The final states are therefore known as vibronic states. Generally, the vibronic state with the strongest intensity is that corresponding to vertical excitation. Calculating ΔE at a fixed geometry, therefore, leads to the relation of direct interest:

ΔE = hν = hc/λmax    (1.1)

wherein λmax is the wavelength at which the absorptivity or emission intensity is a maximum. The bandwidth and vibronic structure are directly related to the shape and amplitude of the geometry change, with the relative intensities given by the Franck–Condon (FC) factors. These can be modeled quite precisely by simple ab initio quantum methods (Callis et al., 1995), but are normally not needed to make semiempirical predictions of λmax.
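Equation (1.1) is straightforward to apply numerically. The short Python sketch below (helper names are ours, not from the chapter) converts a wavelength to a transition energy in eV and in wavenumbers:

```python
# Convert between vacuum wavelength and transition energy using
# Delta E = h*nu = h*c/lambda (Eq. 1.1). Helper names are illustrative.

HC_EV_NM = 1239.84193  # h*c in eV*nm (CODATA-derived constant)

def energy_ev(wavelength_nm):
    """Vertical transition energy (eV) for a given wavelength (nm)."""
    return HC_EV_NM / wavelength_nm

def wavenumber_cm1(wavelength_nm):
    """Transition energy in cm^-1: 1e7 / lambda(nm)."""
    return 1.0e7 / wavelength_nm

# Trp fluorescence shifting from 360 nm (water) to 310 nm (hydrocarbon)
# spans roughly 27,800 to 32,300 cm^-1.
print(round(wavenumber_cm1(360)), round(wavenumber_cm1(310)))
```

The 4500 cm⁻¹ spread between these two λmax values is the same order as the S1 fluctuations discussed later in Section 2.1.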
1.1.2. Intensities and lifetimes
The ratio of photons emitted to photons absorbed is called the quantum yield, Φf, and it determines the integrated fluorescence signal relative to a perfectly efficient emitter during steady illumination:

Φf = kr/(kr + kn) = kr τf

where kr is the radiative rate constant (which is proportional to the integrated extinction coefficient) and kn is the total nonradiative rate constant describing how fast the molecule becomes de-excited by any means other than emission of fluorescence from the emitting excited state. Time-resolved methods are increasingly used because of their ease and higher information content (Beechem and Brand, 1985; Eftink, 1991; Lakowicz, 2006; Prendergast, 1991; Ross et al., 2007). If only a single
species and process is involved, the fluorescence lifetime is given by τf = 1/(kr + kn). The strategy for predicting Φf and τf is as follows: the rate constant describing the spontaneous emission of photons, kr, associated with an electronic transition between two states is relatively insensitive to environment. The distinguishing property of intensity-sensitive probes, therefore, is that the nonradiative rate constant, kn, must be strongly modulated by the microenvironment; that is, we must be in the business of calculating kn. A crucial, and not widely appreciated, aspect of calculating kn is to recognize that it virtually always consists of two components: kn0, which is intrinsic to the chromophore and relatively insensitive to the environment, and what may be called a quenching rate constant, kq, which is strongly modulated by environmental factors and is reasonably easy to calculate. kn0 stems from strong internal coupling of the electrons and the vibrations of the molecule, leading to intramolecular transitions from the fluorescing state to a high vibrational state of a lower electronic state (e.g., triplet or ground state), that is, what is commonly called intersystem crossing and internal conversion; it is generally still not predictable in a reliable manner.¹ Intensity-sensitive probes function because kq is extremely sensitive to the environment in some way. The most common examples involve quenching caused by electron transfer (ET) and by resonance energy transfer, both of which are effectively treated using the Fermi golden rule form of time-dependent QM. Other types of quenching will be detailed below. At the lowest order, the above expressions for quantum yield and lifetime become
Φf = kr/(kr + kn0 + kq)  and  τf = 1/(kr + kn0 + kq)    (1.2)
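Equation (1.2) reduces to a two-line calculation once the rate constants are known. In the sketch below, the rate values are round-number illustrations of our own choosing (of the right order of magnitude for Trp, where Φf ≈ 0.3 in the absence of quenching), not data from the chapter:

```python
# Quantum yield and lifetime from rate constants (Eq. 1.2):
#   Phi_f = kr / (kr + kn0 + kq),  tau_f = 1 / (kr + kn0 + kq)
# The rate values below are illustrative round numbers, not fitted data.

def quantum_yield(kr, kn0, kq):
    return kr / (kr + kn0 + kq)

def lifetime_ns(kr, kn0, kq):
    """Fluorescence lifetime in ns when all rates are in ns^-1."""
    return 1.0 / (kr + kn0 + kq)

kr, kn0 = 0.05, 0.10            # ns^-1
for kq in (0.0, 0.5, 5.0):      # no, moderate, and strong quenching
    print(kq, round(quantum_yield(kr, kn0, kq), 3),
          round(lifetime_ns(kr, kn0, kq), 2))
```

With kq = 0 this gives Φf = 1/3; raising kq by two orders of magnitude collapses both the yield and the lifetime, which is exactly the sensitivity that makes Trp a useful intensity probe.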
where typically the environmental dependence is virtually all carried in kq, which is computed as a function of the electrostatic environment and/or the distance to the quencher in the methods considered in this chapter. The two primary mechanisms by which the environment modulates kq are modulation of the spatial distance between the fluorophore and the
¹ A note regarding the "insensitivity" of kr and kn0 to the environment: because both rate constants depend on the chromophore excited-state wavefunction, which in turn depends somewhat on the environment, variations by factors of 2 or 3 can be observed. This is not the extreme sensitivity one normally seeks in a fluorescent probe, and it is often ignored. As a tangible example, for the Trp indole ring, kr increases by a factor of 3 on going from water to hydrocarbon solvent, with only a factor of 1.8 accounted for by the frequency-cubed factor accompanying the blue shift from 360 to 310 nm (Meech et al., 1983). The remaining factor arises because the transition dipole is larger as the charge-transfer character of the wavefunction decreases in a nonpolar environment. For Trp, kn0 also increases considerably on going from water to nonpolar solvent. As a result, the quantum yield remains relatively independent of surroundings (near 0.3 in the absence of quenching), while the lifetime becomes shorter by a factor of 2–3 for Trps buried in hydrophobic pockets (e.g., Trp48 of azurin), relative to those in nonquenching polar environments.
quencher, and modulation of the energy gap (ΔE00) by the local electrostatic fields and potentials affecting the fluorescing state and the final state created either by ET or by excitation transfer. The energy gap is crucial because, as for all processes, energy must be conserved; that is, transitions happen only when there is resonance between the initial and the final states. The general time-dependent quantum mechanical rate for many processes is given by the Fermi golden rule:

kq(t) = (2π/ℏ) V(r(t))² ρ(ΔE00(t)) = 4π²c V(r(t))² ρ(ΔE00(t))    (1.3)
where V is the Hamiltonian matrix element that connects the initial and final electronic states, and ρ(ΔE00) is the density of final states, which physically may be thought of as the number of states in resonance with the initial fluorescing state. In the convenient right-hand version, c is the speed of light in cm s⁻¹, V is in energy units of cm⁻¹, and ρ is in states per cm⁻¹. V is shown as a function of the fluctuating distance r(t) between the chromophore and quencher, and ρ is given as a function of time to emphasize the fluctuating energy gap due to the fluctuating local electric field.

Special cases of kq include ET and (Förster or "fluorescence") resonance energy transfer (FRET).

For ET: V = ∫ΨS1 H ΨCT dv, where H is the Hamiltonian operator and ΨS1 and ΨCT are the wavefunctions of the fluorescing and charge transfer states; ρ(ΔE00) is given by ρ(ΔE00) = ∫FI FA dΔE00, where FI and FA are the FC-weighted densities of states for the ionization of the S1 state of the donor and of the acceptor anion, respectively.

For FRET:

V = ∫ΨD*A H ΨDA* dv = (1/r³)[μD·μA − 3(μD·r)(μA·r)/r²]

where in the initial state, D*A, the chromophore is excited, and in the final state, DA*, the excitation has been transferred to the acceptor A. μD and μA are the transition dipole vectors for the ground → S1 excitations of donor chromophore D and acceptor A, and r is the intermolecular distance vector. The procedures outlined below are mostly about ET, but they map closely onto what has been done in QM–MD predictions of FRET (Beierlein et al., 2006). Computations of FRET are, in principle, somewhat more straightforward, because an accurate form of V is easy to compute, and ρ(ΔE00) is simply the well-known overlap integral of the fluorescence spectrum of the donor with the absorption spectrum of the acceptor.
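Both expressions are easy to evaluate numerically. The sketch below is ours: the 10 cm⁻¹ coupling echoes the constant V of Algorithm I in Fig. 1.1, but the density of states is an arbitrary placeholder, and the dipole coupling uses unitless vectors purely for illustration:

```python
# Sketch of Eq. (1.3) and of the point-dipole FRET coupling V.
# Magnitudes are placeholders except V = 10 cm^-1 (cf. Algorithm I, Fig. 1.1).

import math

C_CM_S = 2.99792458e10  # speed of light, cm/s

def fermi_rate(v_cm1, rho_per_cm1):
    """kq = 4*pi^2*c*V^2*rho, with V in cm^-1 and rho in states per cm^-1."""
    return 4.0 * math.pi**2 * C_CM_S * v_cm1**2 * rho_per_cm1

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dipole_coupling(mu_d, mu_a, r_vec):
    """V = [mu_D.mu_A - 3*(mu_D.r_hat)*(mu_A.r_hat)] / r^3 (arbitrary units)."""
    r = math.sqrt(dot(r_vec, r_vec))
    r_hat = [x / r for x in r_vec]
    return (dot(mu_d, mu_a) - 3.0 * dot(mu_d, r_hat) * dot(mu_a, r_hat)) / r**3

print(fermi_rate(10.0, 1e-4))   # on the order of 1e10 s^-1 for this placeholder rho
# collinear head-to-tail dipoles give the familiar -2*mu^2/r^3 form
print(dipole_coupling([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [10.0, 0.0, 0.0]))
```

The r⁻³ falloff of V, squared in Eq. (1.3), is what gives FRET its familiar steep distance dependence.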
2. Qualitative Concepts: Developing Intuition

In this section, we present a tour of what the methods presented in this chapter will provide. We use the properties of Trp fluorescence in proteins as the main example, but the concepts and procedures remain the same for any probe molecule. We have ported this technology to the quenching of flavins (Callis and Liu, 2006) and of fluorescein in an antibody (Hutcheson, 2009); note that the quenching of flavins and dyes in proteins involves ET from Trp and also from tyrosine (Tyr). Figure 1.1 shows plots of calculated versus experimental quantum yields from Algorithms I and II applied to 24 Trps in 17 proteins; Table 1.1 matches the point numbers with PDB codes. Although the agreement is far from perfect, the method can clearly distinguish between strong and weak quenching of fluorescence by the closest backbone amides.
2.1. Trajectories

Figure 1.2 shows typical QM–MD trajectories of the semiempirical QM-computed transition energies for the fluorescing state S1 (points) and the CT state (upper trace) for typical high- and low-fluorescing cases of single-Trp proteins. The predictions of quantum yield come from these trajectories, with the energy gap between the S1 and CT states and the amplitude of the fluctuations being the primary controlling factors. Figure 1.2 shows trajectories for two weakly emitting Trps and two strongly emitting Trps in single-Trp proteins. Note the fluctuations of both states. In three of the four cases, the S1 state fluctuates on the order of 4000 cm⁻¹, corresponding to the well-known broad, featureless fluorescence spectra of Trp in most proteins. The mean value of the S1 energies provides a prediction of the fluorescence λmax. The CT state fluctuations are much larger, on the order of 8000 cm⁻¹ or 1 eV, reflecting the larger amount of charge transferred and the larger distance of transfer. Both the CT–S1 energy gap and its fluctuations are key determinants of kq and therefore of the fluorescence quantum yield and lifetime. A small energy gap and/or large fluctuations signify a low quantum yield, because there are many opportunities for the CT and S1 states to have the same energy. This is clearly more the case for the weak emitters in Fig. 1.2A and B than for the high quantum yield cases in Fig. 1.2C and D, where the gap is large and even large fluctuations of the CT state cannot bring it into resonance with the fluorescing state.

Looking again at Fig. 1.2A for Trp126 of dsba, we focus on a quenching event. Arbitrarily, at the 150 ps point, a transition to the CT state is simulated by removing the restraint that the Trp charges are those of S1. This is just to
Figure 1.1 Plots of calculated versus experimental quantum yields for 24 Trps in 17 proteins by two algorithms. (A) Algorithm I: a constant V = 10 cm⁻¹ is used and the CT–S1 gap is made 4000 cm⁻¹ smaller than given by the vertical Zindo energies. Table 1.1 matches numbers with PDB codes. (B) Algorithm II: D95 ab initio ⟨V²⟩ values and a slightly modified FC density-of-states scheme. The error bars are the standard deviations for three 50 ps MD trajectories. Here, 4700 cm⁻¹ is added to the Zindo CT energy computed at the CT geometry, and a Gaussian with standard deviation = 3000 cm⁻¹ is used for the amide electron-attachment FC density spectrum. For the buried Trp59 of rnt (#23) only, the quantum yield was computed from the small equilibrium constant (positive ΔG0) for the electron transfer process stemming from its small reorganization energy (the relaxed CT state lies above the ¹La state). For dsb76 (#20) and pfk (#22), the present accuracy of our ΔG0 estimates leaves open the possibility that equilibrium will favor the ¹La state, in which case the quantum yield predictions would be much higher.
Table 1.1 Key to names for 24 Trps in 17 proteins for Fig. 1.1A and B

Num  Description, PDB code
1    T4 lysozyme W158-asn2, 1lyd
2    dsBa W126, 1dsb
3    Barnase W94-H18 pH 5, 1a2p
4    Human cyclophilin A, 2cpl
5    TrpCage, 1l2y (1 & 2)
6    fkb506 binding protein, 1d6o
7    Subtilisin C, 1sbc
8    Phospholipase A2, 2bpp
9    NSCP W57 W4F W170F
10   dsba W126 Q74A, N127A
11   T4 lysozyme W138, 1lyd
12   T4 lysozyme W126, 1lyd
13   Barnase W94 pH 8, 1a2p
14   Cobra toxin, 1ctx
15   Melittin, 2mlt
16   Glucagon, 1gcn
17   Barnase W71 His neutral, 1a2p
18   Parvalbumin, 1b8r
19   Barnase W35 His neutral, 1a2p
20   dsba W76, 1dsb/1fvk average
21   Staph. nuclease, 1stn
22   Phosphofructokinase, 6pfk
23   Ribonuclease T1, 9rnt
24   Apo-azurin W48, 1azb
see what would happen. At the 156 ps point, a random fluctuation causes the CT state to become S1, and the atom charges of the chromophore are switched by the software to those given by the QM for the lowest CT state. Solvent reorganization about the drastically different charge distribution, in which the indole ring has become +1 and the amide has a charge of −1, stabilizes the CT state to well below the S1 state within less than 100 fs. This is quenching. The system will return to the ground state well before being able to mount the large barrier to return to the S1 state. Figure 1.2B shows the trajectories for another weak emitter, Trp3 of phospholipase A2, which displays similar behavior. Figure 1.2C and D shows trajectories for Trp140 of staphylococcal nuclease (SNase) and for Trp48 of azurin, both of which have quantum yields near 0.30, corresponding to kq = 0. These two cases show how a high quantum yield is possible despite quite different environments. For SNase, the Trp is near the protein surface and is surrounded by several charged residues. The fluctuations are therefore very large, yet the
Figure 1.2 Typical QM–MD trajectories of S1 (points) and CT (solid lines) state energies from Zindo, plotted as transition energy (cm⁻¹/1000) against time (ps): (A) Trp126 of Dsba (weakly fluorescent); (B) phospholipase A2 (weakly fluorescent); (C) Trp140 in Staph. nuclease (strongly fluorescent); and (D) Trp48 in apo-azurin (strongly fluorescent). In each panel, during the early part of the trajectory, the Trp charge distribution is that of S1 (the fluorescing state). The sharp drop near the end is the response of the system to arbitrarily changing the charge distribution to that of the CT state at some point, as if an electron transfer to the amide had taken place; the drop is caused by the classical response of the environment (mainly water), through the MD, to the new charge distribution.
charged groups conspire to create a large energy gap. In contrast, for Trp48 of azurin, the environment is extremely hydrophobic and impenetrable by water. This has the effect of greatly reducing the fluctuations of the CT state. There is a corresponding reduction in the reorganization stabilization of the CT state following simulated ET. The small fluctuations and lack of reorganization imply that no quenching can take place.
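The qualitative rule above, that a small mean gap and/or large CT fluctuations imply frequent near-resonance and hence quenching, can be mimicked with synthetic trajectories. Everything below is an invented illustration (Gaussian noise standing in for real QM–MD energies), not data from the chapter:

```python
# Toy analysis of the S1/CT energy gap along a trajectory: count how often
# the two states come within a resonance window. Synthetic numbers only.

import random

random.seed(1)
s1 = [29000 + random.gauss(0, 2000) for _ in range(5000)]         # cm^-1
ct_weak = [33000 + random.gauss(0, 4000) for _ in range(5000)]    # small gap, big swings
ct_strong = [45000 + random.gauss(0, 2000) for _ in range(5000)]  # large gap

def near_resonance_fraction(s1, ct, window=1000.0):
    """Fraction of frames with |E_CT - E_S1| below `window` (cm^-1)."""
    return sum(abs(c - s) < window for s, c in zip(s1, ct)) / len(s1)

print(near_resonance_fraction(s1, ct_weak))    # sizable -> quenching likely
print(near_resonance_fraction(s1, ct_strong))  # essentially zero -> no quenching
```

The contrast between the two printed fractions is the numerical counterpart of comparing panels A/B with panels C/D of Fig. 1.2.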
2.2. How to recognize a quenching environment

One naturally asks what makes these dramatic differences in computed properties, which seem to correlate well with observed fluorescence yields. As the title of this section suggests, in most cases of extremely strong or weak fluorescence, it is possible to see the reason by inspection, using a simple and natural criterion: electrons are at lower energy if they are near positive charge (i.e., at positive voltage) or far from negative charge (small negative voltage).
The large energy fluctuations of the CT state seen in Fig. 1.2 arise because the fluctuations in electric potential difference (voltage difference) across the region of electron density shift for the ground → CT transition of Trp in proteins are enormous, spanning a range of 8000 cm⁻¹ (= 1 eV = 23 kcal/mol = 96 kJ/mol) and corresponding to electric fields of about 5 × 10⁷ V/cm. To put this in perspective, only 1.4 kcal/mol in ΔG0 (equivalent to about one-third of a typical H-bond) changes an equilibrium constant or a rate constant by a factor of 10. Figure 1.3 is a collection of cartoons illustrating the major factors leading to quenching (or no quenching) of Trp fluorescence by the nearby backbone
Figure 1.3 Cartoons of major electrostatic interactions that dictate whether the amide will or will not quench Trp fluorescence in three proteins. (A) Dsba in a stabilizing environment, corresponding to a low CT energy in Fig. 1.2A. (B) Dsba in a destabilizing environment, corresponding to a high CT energy in Fig. 1.2A. (C) Phospholipase A2, showing stabilization of the CT state by the helix dipole, by two nearby waters, and by positive charges. (D) Staph. nuclease; the CT state is destabilized primarily by two Lys residues near the indole ring. Arrows indicate the path of electron transfer.
amide. The arrow shows the path of ET. Transfer toward a positive charge or away from a negative charge stabilizes the CT state, leading to quenching by the amide. For Trp126 of Dsba, seen in Fig. 1.3A, the amide CT state is greatly stabilized by the positioning of Lys132, Asp71, Asp123, and a water molecule. In the CT state, the ring is positive and the amide C is negative; therefore, the negative charges near the ring are stabilizing. Likewise, the positive charge of Lys132 and the nearness of the positive H atom of the water to the amide C equally well stabilize the CT state. Such H-bonds, whether from water or protein donors, are quite effective, because the large partial positive charge on the H (modeled as a +0.4 point charge) can get quite close to the C when it H-bonds to the carbonyl O of the amide (Fig. 1.4). Figure 1.3B shows the same Trp and surrounding groups 20 ps earlier, at a time of an extreme upward fluctuation. A major factor in the higher energy of this configuration is that the positively charged end of the flexible Lys132 is a few Ångstroms more distant, and the water has a different H-bond partner and has almost no influence on the CT energy. Note that these cartoons show only the few most prominent interactions that capture the overall effect of the environment. In fact, the net effect is
Figure 1.4 Separation of the CT and S1 energies for Trp as given by Zindo calculations in vacuum and when the amide O is H-bonded by a single water. The single water makes the amide more positive by 0.5 V, thereby lowering the CT state by about 0.5 eV, or 4000 cm⁻¹. A second interaction of this magnitude would certainly cause quenching.
always a small difference between two extremely large sums of stabilizing and destabilizing interactions. For example, in the frame in Fig. 1.3A, the four entities in the cartoon contribute 2000 cm⁻¹ of stabilization to the CT state energy. But all together, protein residues contribute a total of 30,000 cm⁻¹ of stabilizing interactions and 18,000 cm⁻¹ of destabilizing interactions, resulting in a net stabilization of 12,000 cm⁻¹. A similar pattern is seen for water: a total of 44,000 cm⁻¹ of stabilization and 43,000 cm⁻¹ of destabilization, for a net stabilization of 1000 cm⁻¹. In the frame of Fig. 1.3B, corresponding to a high-energy fluctuation, the net stabilization by protein is only 7000 cm⁻¹, and the water has a net destabilization of 4000 cm⁻¹.

Figure 1.3C shows the essential features of the environment of Trp3 of phospholipase A2, another highly quenched Trp. Here, the Trp amide is at the N-terminus of a section of helix, where the collective dipoles of the backbone atoms reinforce to create a large positive potential at the amide. The nearby positively charged N-terminal Ala and Lys10 also contribute strongly to stabilizing the CT state, as do two waters whose negative O atoms are near the indole ring. In contrast, Fig. 1.3D shows the main contributors to CT state destabilization in a high quantum yield case, Trp140 of SNase. Here, the positively charged Lys133 and Lys110 close to the indole ring strongly destabilize the CT state, dominating the stabilizing interactions of the water and Glu129. No cartoon is needed to illustrate why Trp48 of azurin has a high quantum yield. As noted above in relation to Fig. 1.2D, there are simply no waters or charged groups within 10 Å of this Trp, which lies in an extremely hydrophobic pocket and has one of the most blue-shifted λmax values known. Figure 1.4 illustrates more quantitatively that even a single H-bond from water in a strategic location can affect the quantum yield of fluorescence.
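The bookkeeping behind such stabilization sums can be sketched with Coulomb's law over point charges: what matters for the CT state is the potential difference between the amide and the ring, the two ends of the electron transfer. The geometry and charges below are invented for illustration, not taken from any structure:

```python
# Rough estimate of CT-state (de)stabilization by environmental point charges.
# The transferred electron ends on the amide, so the energy change is
# -e * (V_amide - V_ring); a negative result stabilizes the CT state.
# Sites and charges below are invented for illustration.

import math

COULOMB_CM1 = 116140.0  # e^2/(4*pi*eps0) ~ 14.40 eV*A ~ 116,140 cm^-1 * Angstrom

def potential(site, charges):
    """Electrostatic potential (units of e/Angstrom) at `site` from point charges."""
    return sum(q / math.dist(site, pos) for q, pos in charges)

ring = (0.0, 0.0, 0.0)
amide = (4.0, 0.0, 0.0)
env = [(+1.0, (7.0, 0.0, 0.0)),    # e.g., a Lys ammonium near the amide
       (-1.0, (-3.0, 2.0, 0.0))]   # e.g., an Asp carboxylate near the ring

# energy change (cm^-1) for moving one electron from ring to amide
delta = -COULOMB_CM1 * (potential(amide, env) - potential(ring, env))
print(round(delta))  # negative -> CT stabilized -> quenching favored
```

Running sums of such terms over every protein and water charge, for both signs, is exactly how the large stabilizing and destabilizing totals quoted above arise.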
The numbers presented in Fig. 1.4 are based on an actual quantum calculation of the CT–S1 energy gap for the model Trp + backbone chromophore system used in our QM–MD simulations. Without the water (modeled by three point charges), the energy of the CT state lies about 1 eV (8000 cm⁻¹) above S1, and no quenching is expected. The presence of the water protons in the H-bonded complex with the amide O stabilizes the CT state by 0.5 eV. When augmented by other stabilizing groups or by another such water H-bond, quenching will be efficient.

The above discussion focused on electrostatic stabilization as it modulates fluorescence intensity. Entirely parallel ideas govern fluorescence wavelength (Vivian and Callis, 2001). In the case of Trp, the electron density shifts from the pyrrole ring to the benzene ring during excitation to S1. Therefore, positive charges near the benzene ring or negative charges near the pyrrole ring contribute a shift to lower energy (longer wavelengths); the reverse contributes a blue shift. As with CT state stabilization, the net result is a small difference of large numbers of contributions. Figure 1.2D shows how the S1 energy of Trp48 of azurin in a hydrophobic pocket exhibits a high
mean value and much lower fluctuations than is the case for the more water-exposed Trps in Fig. 1.2A–C. This corresponds well with the extremely blue-shifted fluorescence (308 nm) and the vibrational structure seen in the experimental fluorescence. Note, however, that the character of the fluorescence is ¹La, as shown by recent fluorescence anisotropy experiments (Broos et al., 2007).
3. Methods

3.1. Philosophy

The general scheme is to perform hybrid quantum mechanics–classical molecular dynamics (QM–MD) simulations of sufficient length that the configuration space of the surroundings of the chromophore is adequately sampled. The choice of QM is a compromise between the accuracy and the speed required to sample the changing environment over long simulations.
3.2. MD simulations There exist now several widely accepted and used MD packages. A useful introduction to the use of MD methods, including a recent compendium of modeling software sources, has been published in this series (Saxena et al., 2009). We have had direct experience primarily with Discover, Charmm, and Gromacs and have not seen a significant difference in performance with regard to predictive capability.
3.3. Quantum mechanics The choice of the QM method is conditioned by the criteria that it must be reasonably accurate for a variety of low-lying electronic state types, while also being simple enough to deliver energies and eigenfunction information on several excited states in a short time (ideally, in a few seconds). We have used Michael Zerner’s spectroscopically calibrated INDO/S-CIS method (Zindo) (Cory et al., 1998; Ridley and Zerner, 1973; Thompson and Zerner, 1991), incorporating a modification that allows for input of electrostatic potentials and fields generated from the partial charges of every atom in the environment, for example, protein and solvent. Zindo and similar semiempirical methods have been used in earlier QM–MM studies (Gehlen et al., 1994; Marchi et al., 1993; Warshel, 1982, 1991; Cory et al., 1998), and these studies have influenced our choices to various extents. Another essential requirement of the QM is the ability to incorporate the differing electric potentials at the chromophore atoms generated by (typically) the point charges assigned to the atoms of the surrounding biopolymer
and solvent. The one-electron Hamiltonian is modified to include the electric fields and potentials at each atom center, computed from the Coulomb sum of the point charges provided by the forcefield for all non-QM atoms, including all waters (Sreerama et al., 1994; Theiste et al., 1991). Alternatively, the original program has the option to read the point charges as part of the input file. An absolutely necessary addition to the Zindo parameter set for oxygen has been made and must be used whenever oxygen plays a significant role in a chromophore (Li et al., 1999). The response of Zindo to electric fields (Callis and Burgess, 1997) is quite accurate, as indicated by a more fundamental quantum chemical study on indole (Donder-Lardeux et al., 2003). For the methods described here, it is essential that the quantum mechanical calculation be relatively accurate and faithful to the electrostatic perturbations from the environment. In the case of Trp, CASPT2 (Andersson et al., 1992, 2002) calculations for several formamide–indole pairs show that Zindo is effective at tracking the ring → amide CT state energies as a function of electrostatic environment (Liu et al., 2005). Zindo has been incorporated into Gaussian (Frisch et al., 2002) and is also available in the VAMP (Clark et al., 2008) package, but care must be taken that the parameters of Li et al. are included. The VAMP package contains other semiempirical methods, based on the PM3 and AM1 Hamiltonians, that are likely quite suitable for the type of QM–MD computations described here. Regardless of the QM and MD methods, quantum calculations are usually performed at 10 fs intervals on an appropriate chromophore fragment clipped from the biopolymer and capped with hydrogens. QM bond lengths are typically fixed in the MD at values representative of either the ground state or the CT state.
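The electrostatic embedding step just described, summing Coulomb potentials and fields from every MM point charge at each QM atom center, can be sketched in a few lines. The function and variable names are ours; a real interface would read coordinates and charges from the MD program rather than hard-coding them:

```python
# Sketch of the QM/MM electrostatic embedding: the one-electron Hamiltonian
# receives the Coulomb potential (and field vector) at each QM atom center,
# summed over all MM point charges. Sites and charges are illustrative.

import math

def embed(qm_sites, mm_charges):
    """Return (potential, field_vector) at each QM atom from MM point charges."""
    out = []
    for site in qm_sites:
        v, f = 0.0, [0.0, 0.0, 0.0]
        for q, pos in mm_charges:
            d = [s - p for s, p in zip(site, pos)]
            r = math.sqrt(sum(x * x for x in d))
            v += q / r                         # potential, e/Angstrom units
            for k in range(3):
                f[k] += q * d[k] / r**3        # field = q * r_vec / r^3
        out.append((v, f))
    return out

qm_sites = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)]  # e.g., two ring atoms
mm_charges = [(-0.8, (0.0, 3.0, 0.0)),         # a TIP3P-like water above the ring
              (0.4, (0.9, 3.6, 0.0)),
              (0.4, (-0.9, 3.6, 0.0))]
for v, f in embed(qm_sites, mm_charges):
    print(round(v, 4), [round(x, 4) for x in f])
```

Feeding these per-atom potentials and fields into the semiempirical Hamiltonian is what lets the QM energies respond to every fluctuation of the classical environment.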
It is useful to use the CT state geometry when the average CT energy lies well above the S1 state, to avoid excessive mixing with the large number of excited states at high energy. CASPT2 calculations for the formamide–indole pairs also show that the S1 state energy increases by 4000 cm−1 relative to the CT state energy as the geometry is changed from that at the 1La energy minimum to that at the CT energy minimum (Liu et al., 2005). In addition, CASPT2 CT state energies are found to be 1000–3000 cm−1 higher than those estimated using Zindo (Liu et al., 2005). Therefore, Zindo-computed vertical excitation energies with the CT state geometry should be increased by about 6000 ± 1000 cm−1 to make a realistic estimate of the true difference in CT and S1 zero-point energies.
3.4. Interface between QM and MD

The prediction scheme we use has a number of steps, none of which is especially daunting. A block diagram in Fig. 1.5 gives an overview. Whereas the MD and QM procedures in this scheme are generic and can, in principle, be used "off the shelf," the procedures labeled with bold
Patrik R. Callis
[Figure 1.5 flow chart. Files (cylinders): raw .pdb file; unequilibrated coordinates; atom charges; QM input file; coordinates at time t and t + Δt; S1, CT energies, and group charges at time t and as functions of time; state energies and molecular orbitals. Programs (rectangles): (1) edit for MD; (2) cut/cap the chromophore and add the point charges of the environment; (3) MD minimization/equilibration and trajectory; (4) update chromophore charges; (5) QM: create atom charges and group charge differences, identify S1 and the lowest CT state; (6) refine and output S1 and CT energies and group charge differences at time t, then compute electron transfer rates, lifetimes, quantum yields, and spectra.]
Figure 1.5 Flow chart indicating components of the general procedure. The cylinders represent files and rectangles represent programs. The MD and QM are generic. The procedures identified with the bold numbers represent components of the interface that were programmed in Fortran locally. If QM charges are fed back to the MD, the block arrow is ignored, and the dashed arrows complete the loop. If MD charges are kept fixed, the block arrow is followed and the solid arrows represent procedures that may proceed asynchronously behind the MD cycles if the option of updating charges on the chromophore is not elected.
numbers in Fig. 1.5 represent Fortran programs written by the author's group that interface the MD and QM and extract the information needed to make wavelength and/or intensity predictions. A brief synopsis of the different modules follows.

3.4.1. Editing the raw structure file
This is the hardest step to automate, because of the nonuniform nature of the X-ray structure files available from the RCSB Protein Data Bank (http://www.rcsb.org/pdb), combined with the nonuniform input formats used by different MD platforms. Examples of typical editing tasks are: (1) most crystal structures have multiple protein molecules per unit cell; if the molecule is a monomer
in solution, one must separate out one of these for the calculation; (2) very few X-ray structures provide information about the location of hydrogens, which must therefore be added by the program; (3) one of the three common forms of histidine must be specified, because the structure seldom indicates which tautomer and charge form is likely; (4) crystallographic waters are often of value, given the importance of water in determining wavelength and quenching; these may or may not be present, and which ones to retain is not so clear when there are multiple molecules per unit cell; (5) often there are disulfide bridges that must be connected; (6) metal ions may or may not need to be retained; (7) cocrystallizing agents of no relevance may appear in the structure and should be removed; (8) we have defined new atom types used for the CT or S1 geometries, available on request from the author. The entire protein is then solvated. We have most often used a sphere of TIP3P model waters, with a radius such that all parts of the protein are solvated to a depth of at least 5 Å. A quartic confining potential localized on the surface of the spherical droplet prevents "evaporation" of any of the waters during the course of the trajectory. The fully solvated protein structure is energy-minimized and equilibrated before the production simulation. Use of periodic boundary conditions is preferred for simulations longer than 100 ps.

3.4.2. Program to create the chromophore QM input file with environment point charge information
(1) When the chromophore is not covalently bonded to the polymer, the process is quite straightforward. The appropriate states of the chromophore are found by the QM procedure, augmented either by adding the electric potential at each atomic center due to all the point charges of the nonchromophore atoms (the environment), or by supplying the coordinates and numerical values of the point charges as part of the input.
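The quartic confining potential mentioned in Section 3.4.1 can be sketched in a few lines. This is only an illustration of the droplet restraint, not the author's implementation; the function name, the force constant, and the droplet radius below are all hypothetical.

```python
import numpy as np

def quartic_wall_energy(coords, center, radius, k=2.0):
    """Quartic confining potential localized at the surface of a spherical
    water droplet: atoms inside the radius feel nothing, while atoms that
    stray outside are pulled back by U = k*(r - R)**4.  The value of k
    (in kcal/mol/A^4) is a hypothetical choice, not from the chapter."""
    r = np.linalg.norm(coords - center, axis=1)
    over = np.clip(r - radius, 0.0, None)   # zero for atoms inside the droplet
    return k * np.sum(over**4)

# a water oxygen 1 A outside a 25 A droplet feels the wall; one inside does not
outside = quartic_wall_energy(np.array([[26.0, 0.0, 0.0]]), np.zeros(3), 25.0)
inside = quartic_wall_energy(np.array([[10.0, 0.0, 0.0]]), np.zeros(3), 25.0)
```

Because the restraint is zero everywhere inside the sphere, it leaves the equilibrated interior structure unperturbed and acts only on escaping waters.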
(2) When the chromophore is covalently attached to the polymer, for example, Trp in a protein, one must artificially create a realistic model of the chromophore in which atoms that have minimal influence on the spectral properties are omitted. This has been implemented in most MD packages by defining "linker atoms" that insulate the chromophore of interest from the rest of the polymer. In our procedure, the chromophore remains attached to the polymer at all times during MD propagation, just as in the original structure file. The input file of coordinates for the QM procedure is created by judiciously converting the carbons that define the "point of attachment" of the chromophore to the polymer into hydrogens, and shortening the bonds accordingly. The resulting coordinates, point charge information, and other details required by the QM procedure of choice are written to a file in a format appropriate to the QM method. All other atoms in the system form the environment point charge array.
In either case, we commonly constrain the bond lengths of the chromophore so that fluctuations of the energy more closely reflect changes in the environment alone. In most of our applications to date, we fed the QM charge information back to the MD system using the USERSB subroutine provided in the CHARMM program. This is formally necessary if wavelengths are to be computed without sizable empirical adjustment. Aside from simply reading in the coordinates and charges from the generic files of the MD program, one must map the chromophore atom numbers used by the QM onto the MD system numbers and change the connecting atoms to H. Three important subroutines are reproduced below. The first is called to shorten bonds to the capping hydrogen atoms; the other two perform the Coulomb sums that create the potentials and fields at each chromophore atom.

      SUBROUTINE BOND(XX,YY,ZZ,I,J,RR)
c This routine scales a bond that connects the chromophore to the
c polymer, wherein a heavy atom j, initially located at xx,yy,zz
c and connected to the chromophore atom i, becomes an H and is
c assigned new xx,yy,zz values that make the bond an appropriate
c X-H bond length, where X = C, N, ...
      dimension xx(50),yy(50),zz(50)
c....Scale the i-->j bond...
c Create the unit vector pointing from the chromophore heavy atom, i,
c to the polymer atom, j, that has already been changed to H in the
c main program.  The x,y,z components are:
      xvec=xx(j)-xx(i)
      yvec=yy(j)-yy(i)
      zvec=zz(j)-zz(i)
c rvec = length of vector
      rvec=xvec*xvec + yvec*yvec + zvec*zvec
      rvec=sqrt(rvec)
c Make the unit vector
      xvec=xvec/rvec
      yvec=yvec/rvec
      zvec=zvec/rvec
c New xx(j),yy(j),zz(j): position atom j along the original bond from
c atom i so the bond length = rr
      xx(j)=xx(i)+xvec*rr
      yy(j)=yy(i)+yvec*rr
      zz(j)=zz(i)+zvec*rr
      return
      end
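The geometric operation performed by BOND is compact enough to verify independently. The following Python sketch reproduces the same idea (move the capping atom along the original bond direction to a standard X–H length); the function name and the 1.09 Å C–H length used in the example are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def cap_bond(xyz_i, xyz_j, r_xh):
    """Move atom j (a heavy atom converted to a capping H) along the
    original i->j bond direction so the new bond length equals r_xh,
    mirroring the logic of the Fortran BOND subroutine."""
    xyz_i = np.asarray(xyz_i, float)
    vec = np.asarray(xyz_j, float) - xyz_i   # bond vector i -> j
    return xyz_i + r_xh * vec / np.linalg.norm(vec)

# a carbon 1.5 A from a ring atom at the origin, capped at a 1.09 A C-H length
new_j = cap_bond([0.0, 0.0, 0.0], [1.5, 0.0, 0.0], 1.09)
```

The new position always lies on the original bond axis, so the cap perturbs only the bond length, not the local geometry.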
The potentials at the chromophore atoms are the trivial Coulomb sum over the nonchromophore atoms, seen below as the double loop, in which the outer loop runs over all atoms and the inner loop over chromophore atoms. Self-terms are avoided via the index id_chrom(k), which equals 1 if k corresponds to a chromophore atom:

      do k=1,natoms
        if(id_chrom(k).eq.0)then    !...then include the contribution
          do kk=1,nchrom
            r=((x(k)-xx(kk))**2 + (y(k)-yy(kk))**2
     &        + (z(k)-zz(kk))**2)**0.5
            call potential(x,y,z,k,xx,yy,zz,kk,r,q,vv)
            vtrp(kk)=vtrp(kk) + vv
            call field(x,y,z,k,xx,yy,zz,kk,r,q,ex,ey,ez)
            extrp(kk)=extrp(kk) + ex
            eytrp(kk)=eytrp(kk) + ey
            eztrp(kk)=eztrp(kk) + ez
          enddo
        endif
      enddo

      SUBROUTINE POTENTIAL(X,Y,Z,K, XX,YY,ZZ,KK, R,Q, V)
c Computes the potential in volts at chromophore atom kk due to
c environment atom k; the atoms are located at (x,y,z) and (xx,yy,zz),
c carry charges q(k) and q(kk) in units of e, and are a distance r
c apart in Angstroms
      dimension x(30000),y(30000),z(30000), xx(50),yy(50),zz(50)
      dimension q(30000)
      v = q(k)/r*14.399644
      return
      end

      SUBROUTINE FIELD(X,Y,Z,K, XX,YY,ZZ,KK, R,Q, EX,EY,EZ)
c Computes the electric field vector in volts/Angstrom at chromophore
c atom kk due to environment atom k; the atoms are located at (x,y,z)
c and (xx,yy,zz), carry charges q(k) and q(kk) in units of e, and are
c a distance r apart in Angstroms
      dimension x(30000),y(30000),z(30000), xx(50),yy(50),zz(50)
      dimension q(30000)
      ex = ((xx(kk)-x(k))*q(k)/r**3)*14.399644
      ey = ((yy(kk)-y(k))*q(k)/r**3)*14.399644
      ez = ((zz(kk)-z(k))*q(k)/r**3)*14.399644
      return
      end
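As a cross-check of these subroutines, the same Coulomb sums can be written in vectorized form. The sketch below is not part of the author's code; the array names are illustrative, and only the conversion factor 14.399644 V·Å per unit charge is taken from the listings above.

```python
import numpy as np

K = 14.399644   # Coulomb constant in volt*angstrom per unit charge e

def potentials_and_fields(env_xyz, env_q, chrom_xyz):
    """Vectorized version of the POTENTIAL and FIELD Coulomb sums:
    returns the potential (volts) and field (volts/angstrom) at each
    chromophore atom due to all environment point charges."""
    # displacement vectors from each environment atom to each chromophore atom
    d = chrom_xyz[:, None, :] - env_xyz[None, :, :]    # shape (nchrom, nenv, 3)
    r = np.linalg.norm(d, axis=2)                      # shape (nchrom, nenv)
    v = K * np.sum(env_q / r, axis=1)
    e = K * np.sum(env_q[None, :, None] * d / r[:, :, None] ** 3, axis=1)
    return v, e

# one +1e charge 2 A away along x from a single chromophore atom
v, e = potentials_and_fields(np.array([[0.0, 0.0, 0.0]]), np.array([1.0]),
                             np.array([[2.0, 0.0, 0.0]]))
```

For this single charge the potential is K/2 volts and the field K/4 V/Å, directed away from the positive charge, matching the sign convention of the Fortran FIELD routine.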
The chromophore dynamics and atomic coordinates are governed entirely by the MD (perhaps with QM-modified charges). The transition energy calculation, with the potentials and fields added, is performed on the chromophore only. The effect of the environment surrounding the chromophore atoms is incorporated directly into the QM calculation through a straightforward modification of the vacuum matrix elements of the Fock operator involving atomic orbitals μ and ν:

$$F'_{\mu\mu} = F_{\mu\mu} - eV_a, \qquad F'_{\mu\nu} = F_{\mu\nu} + e\,\vec{E}_a\cdot\vec{r}_{\mu\nu}$$
in which e is the electronic charge, Va is the electrostatic potential at quantum mechanical atom a created by all non-QM protein and solvent atoms, and Ea is the associated electrostatic field. The potential and field are evaluated at the QM atoms, a, via straightforward Coulomb sums:

$$V_a = \sum_k \frac{q_k}{r_{ak}}, \qquad \vec{E}_a = \sum_k \frac{q_k\,\vec{r}_{ak}}{r_{ak}^{3}}$$

where the summations extend over all non-QM atoms, k. No dielectric constant is assumed, and electronic polarizability is not included. We find that the potentials make by far the most important contribution to the state energy shifts. In practice, the fields may be considered optional. The effect of the fields is typically only to create a minor
additional hybridization of local s and p orbitals if a minimal basis set is used. Electronic excitation of the chromophore is simulated by instantly switching the charges (and in some studies also the geometry) on the QM system to the excited state (S1) values. Most of the internal Stark effect (Callis and Burgess, 1997; Lockhart and Kim, 1992) is expressed implicitly by the difference of the potentials at different atoms. The chromophore atoms are assigned charges given by the electron density matrix, for example, in the Löwdin basis (which assumes orthogonal atomic orbitals) from the INDO/S calculation:

$$q_a = -e\left(\sum_{\mu} P_{\mu\mu} - Z_a\right)$$
where the Pμμ are diagonal elements of the density matrix for all valence atomic orbitals, μ, centered on atom a, and Za is the atomic core charge. The charge distribution of the excited state is sufficiently sensitive to the local field (potential differences) that these charges should be updated by a QM calculation at least every 10 fs of simulated time to capture the fastest relaxation times (Jimenez et al., 1994; Maroncelli, 1993; Muino and Callis, 1994).

3.4.3. Identifying the S1 and CT states
The QM output from a CIS treatment (a good compromise) minimally comprises the electronic transition energies and the canonical molecular orbitals. We have adapted the original Zerner subroutine ONEOP, which creates the electron density matrix for the ground and excited states, thereby giving the approximate charge on each atom in each state. What is of interest is the charge difference, at least at the resolution of atoms, accompanying each transition. The QM program may output the permanent dipole difference, but this should only be used to help identify states; the point dipole approximation is likely to be poor, since some of the stronger interactions involve distances on the order of the dipole length. A small file, for example, name_t.eden, is created at each time step t, listing the first 50 excited state energies, oscillator strengths, and the charge differences experienced by the groups of interest upon excitation to each state. The groups are chosen typically as the electron donor and acceptor(s), such that charge transfer (CT) states are readily identified by eye and by software. The large number of states is necessary because the CT states fluctuate over a range of 2 eV in some cases, but must be identified to compute the ET rate. Another reason is that the CT states repeatedly cross other states that are less sensitive to the local fields, and so
are often mixed with other states to different extents. This problem is dealt with in Section 3.4.5.

3.4.4. Update MD charges
The interface for informing the MD of the QM charges (if chosen) is accomplished by a user-written program, either by modifying the MD code or through a provided subroutine, for example, the USERSB subroutine in the case of CHARMM. This is most pertinent when simulating transient and steady-state Stokes shifts of fluorescence spectra caused by solvent and protein response to electron density changes upon excitation. We find that the polarization of the excited state provided by the feedback of QM charges affects the shift by 20–30%. Qualitative information concerning the dynamics of the relaxation can probably be deduced without the feedback, and fluorescence maxima can be corrected to provide good estimates.

3.4.5. Program to refine CT energies
The primary job of this program is to examine each of the name_t.eden files of Section 3.4.3 and create a charge difference-weighted average lowest CT energy value. A similar weighted average may also be needed for the S1 state if it is often mixed. This is a crucial program, because the energy gap between the lowest CT state and the S1 state is a key variable in determining the ET rate. At least two factors complicate the process: there may be more than one electron donor/acceptor in the quenching process, and there may be "excited CT states" corresponding to the "hole" residing in alternate high-energy occupied MOs. The process consists of averaging over the lowest CT states whose transferred charge does not exceed one electron by more than 20%. The output of this program is a single trajectory file, in which the charge-weighted average CT and S1 energies are tabulated as a time series along with the group charge differences.

3.4.6. Compute lifetimes, quantum yields, and wavelengths
ET rate, fluorescence lifetime, and quantum yield: As in earlier work (Callis and Liu, 2004; Callis and Vivian, 2003; Callis et al., 2007), the fluorescence quantum yield (Φf) and lifetime (τf) are estimated from Eq. (1.2). For quenching by ET, kq becomes kET and can be computed according to different assumptions. Algorithm I: Our first procedure (Callis and Liu, 2004; Callis and Vivian, 2003) was somewhat ad hoc and specialized to cases such as Trp
in proteins, wherein the donor–acceptor distance is nearly constant and the coupling large, such that the rate could be assumed to be controlled primarily by the energy gap. The operative form of the golden rule became

$$\langle k_{\mathrm{ET}}\rangle = \frac{2\pi}{\hbar}\,|V|^{2}\,(2\pi\sigma^{2})^{-1/2}\int \rho_{\mathrm{FC}}(\Delta E_{00})\,\exp\!\left[-\frac{(\Delta E_{00}-\langle\Delta E_{00}\rangle)^{2}}{2\sigma^{2}}\right]d\Delta E_{00} \qquad (1.4)$$

where ⟨ΔE00⟩ is the mean energy gap between the zero-point energies of the CT and S1 states, σ is the standard deviation of ΔE00, and ρFC is the FC-weighted density of states for the ET process, which is a function of the energy gap. The Gaussian is normalized and gives the probability of finding a particular value of ΔE00. ρFC was estimated following the method we used in several recent studies (Callis et al., 1995; Fender et al., 1999; Short and Callis, 2000), using a direct product vibrational space established for the geometry changes for electron ejection from indole and electron attachment to formamide, with optimization using HF/3-21g and CIS/3-21g. The vibrational frequencies and modes used were those of the neutral ground-state molecules, determined at the B3LYP/6-311Gpd level. This formula gives excellent agreement with isolated-molecule vibronic spectra of indole and is assumed to do the same for ET, in which the process is treated as simultaneous electron ejection and attachment. In other words, it is the overlap of the photoelectron spectrum of the indole ring with that of the acceptor anion. This is precisely analogous to the overlap of the donor emission and acceptor absorption spectra in the more familiar Förster resonance excitation energy transfer (FRET) (Förster, 1959, 1971), as emphasized by Hopfield (1974). In this procedure, the unknown Vel and ΔE00 values were essentially treated as empirically adjusted constants: ΔE00 = ΔEvert + D, where ΔEvert is the CT–S1 energy gap as read off the trajectories of Fig. 1.2, and D and Vel are the adjustable constants. A good fit is shown in Fig. 1.1A using V = 10 cm−1 and D = −4000 cm−1. Algorithm II: Recently, we refined the procedure to be more general by computing V from ab initio Gaussian CIS Hamiltonian elements, while continuing to use D as a fitting parameter.
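The Gaussian-weighted average in Eq. (1.4) is simple to evaluate numerically. The sketch below is an illustration, not the author's code: the grid, the flat toy Franck–Condon density, and the function names are assumptions; only the structure of Eq. (1.4) and the value of ℏ = 1/(2πc) ≈ 5.3089 × 10⁻¹² cm⁻¹·s (for energies in wavenumbers) are fixed.

```python
import numpy as np

HBAR = 5.3089e-12   # hbar in cm^-1 * s, so wavenumber energies give rates in s^-1

def golden_rule_rate(v, de00_mean, sigma, rho_fc, grid):
    """Numerical version of Eq. (1.4): (2*pi/hbar)|V|^2 times the
    FC-weighted density of states averaged over a normalized Gaussian
    distribution of the zero-point energy gap DE00 (all in cm^-1)."""
    gauss = np.exp(-0.5 * ((grid - de00_mean) / sigma) ** 2)
    gauss /= sigma * np.sqrt(2.0 * np.pi)          # normalized Gaussian
    integrand = rho_fc(grid) * gauss
    return (2.0 * np.pi / HBAR) * v**2 * np.sum(integrand) * (grid[1] - grid[0])

# toy flat FC density of 1e-4 states per cm^-1 over a wide grid
grid = np.linspace(-20000.0, 20000.0, 4001)
k_et = golden_rule_rate(10.0, 4000.0, 1500.0, lambda e: 1e-4 + 0.0 * e, grid)
```

With a flat toy density the Gaussian integrates to one and the rate reduces to (2π/ℏ)|V|²ρ, a useful sanity check before substituting a realistic, strongly energy-dependent ρFC.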
The typical ab initio V values are in the range 200–1000 cm−1, which would give rates 4 × 10² to 10⁴ times those obtained with 10 cm−1. The best D was on the order of +5000 cm−1, which is physically more reasonable given that the CT geometry was used in these calculations: using the CT geometry lowers the CT state energy while raising the S1 energy relative to the zero-point energies. The resulting expression for ⟨kET⟩ is

$$\langle k_{\mathrm{ET}}\rangle = \frac{2\pi}{\hbar}\,\langle V^{2}\rangle\,\langle\rho_{\mathrm{FC}}(\Delta E_{\mathrm{vert}}+D)\rangle \qquad (1.5)$$
in which ⟨V²⟩ comes from the average of 500 ab initio values of V taken from 15,000 points of a 150 ps trajectory, and the average density of states came from all 15,000 points.

Calculation of λmax and fluorescence spectra: λmax values are calculated simply as the reciprocal of the average, over the corresponding MD trajectory, of the raw Zindo vertical S1 transition energies. Often the molecular geometry is not that of the S1 state, which means that the vertical S1 transition energy is not a good approximation for λmax. We find, however, that the relative shifts for different electrostatic environments are rather insensitive to the actual geometry used; therefore, good estimates may be obtained by displacing the directly calculated value by a constant. Simulated fluorescence spectra may be constructed from the S1 energy distributions from the MD trajectory. To construct authentic fluorescence spectra from the distributions of vertical 1La energies, we use a vibronically resolved ab initio computed 1La spectrum that has been shown to represent well the resolved 1La fluorescence of 3MI complexed with triethylamine in a cold jet (Short and Callis, 2000). This 3MI spectrum is convoluted with the instantaneous vertical 1La energies from the trajectory and weighted by frequency cubed to produce the computed fluorescence spectrum on the energy scale. Statistical noise from a limited number of points can be reduced by smoothing the resultant histograms. A wavelength to compare with experiment comes from the average over the trajectory of calculated transition energies following equilibration. Examination of transition energy time correlation functions and direct-response results does not reveal a relaxation component beyond 5 ps of the S0 → S1 excitation that is significant compared to fluctuations during a single trajectory.
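The spectrum-construction step just described can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the reference lineshape here is a toy delta-like line rather than the actual vibronically resolved 3MI spectrum, and all array names are hypothetical.

```python
import numpy as np

def simulated_spectrum(vertical_energies, ref_e, ref_i, grid):
    """Histogram the instantaneous vertical 1La energies from a trajectory,
    convolute with a reference lineshape (given on a relative energy axis
    ref_e with intensities ref_i), and weight by frequency cubed to obtain
    a normalized emission spectrum on the energy (cm^-1) scale."""
    counts, edges = np.histogram(vertical_energies, bins=grid)
    centers = 0.5 * (edges[:-1] + edges[1:])
    spec = np.zeros_like(centers)
    for e0, w in zip(centers, counts):
        if w:
            # shift the reference lineshape so its origin sits at e0
            spec += w * np.interp(centers - e0, ref_e, ref_i, left=0.0, right=0.0)
    spec *= centers**3            # frequency-cubed emission weighting
    return centers, spec / spec.max()

# delta-like toy reference line and a narrow band of trajectory energies
centers, spec = simulated_spectrum(
    np.full(1000, 30000.0), [-50.0, 0.0, 50.0], [0.0, 1.0, 0.0],
    np.linspace(25000.0, 35000.0, 101))
```

Replacing the toy lineshape with a realistic vibronic profile and smoothing the histogram, as described above, yields spectra comparable to experiment.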
Averaging several hundred trajectories of ns length is now possible (Li et al., 2007, 2008); such averages are able to capture slower relaxations and could presumably yield more precisely predicted wavelength maxima and ns relaxation times (Toptygin et al., 2001, 2006).

3.4.7. Contributions to S1 and CT energies relative to vacuum
As noted above, the wavelength and CT energy predictions from the QM–MM method come from approximating the electrostatic environment of each atom of the QM molecule by augmenting the diagonal Fock matrix terms in the AO representation with the electric potential coming from the Coulomb's law summation over the point charges of all non-QM atoms in the simulation. The environment-caused energy change of a particular state is therefore easily divided into contributions from individual amino acid residues and water molecules using the previously described analysis tool (Vivian and Callis, 2001), in which the contribution from a given peptide residue or water is given by the sum of contributions from the individual atoms of the residue. This is done effectively, at a given point in a trajectory, from the scalar projection of the electric potentials at the QM atoms and the
electron density changes at those atoms accompanying excitation or ET. The energy change relative to vacuum, ΔEj, caused by the point charges of residue j is given by

$$\Delta E_j = \sum_a V_{aj}\,\Delta\rho_a$$
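The per-residue decomposition just given is a single dot product per residue. A minimal sketch (the function name and the numbers are purely illustrative):

```python
import numpy as np

def residue_shifts(v_atom_residue, delta_rho):
    """Per-residue decomposition of the environment-caused energy change:
    Delta E_j = sum_a V_aj * Delta rho_a, where V_aj is the potential at
    QM atom a from the point charges of residue j alone, and Delta rho_a
    is the electron density change at atom a upon excitation or ET."""
    # v_atom_residue has shape (n_qm_atoms, n_residues); delta_rho (n_qm_atoms,)
    return np.asarray(delta_rho) @ np.asarray(v_atom_residue)

# two QM atoms, two residues
shifts = residue_shifts([[0.5, -0.2],
                         [0.1,  0.3]], [-0.4, 0.4])
```

Summing `shifts` over all residues and waters recovers the total environment-caused shift of the state, which is what makes the decomposition useful for pinpointing which residues control the wavelength.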
4. Nonexponential Fluorescence Decay

Understanding the ubiquitous nonexponential decay exhibited by Trp fluorescence in proteins is becoming crucial for the interpretation of ultrafast decay experiments, especially when the power of such experiments is used to extract dynamical information about water near protein surfaces (Halle and Nilsson, 2009; Li et al., 2007; Zhang et al., 2009). In principle, nonexponential decay is expected for any useful fluorescent probe: as noted in Section 1, the high sensitivity of a useful probe to some environment property, in conjunction with the multiple conformers typical of biopolymers, often leads to multiexponential (or nonexponential) fluorescence decay. No consensus exists as to the cause of the complex decay. For Trp, opinions divide mostly between the view that discrete ground-state subpopulations (often different rotamers) exhibit different decay times (heterogeneity model) (Dahms et al., 1995; Donzel et al., 1974; Engh et al., 1986; Gordon et al., 1992; Szabo and Rayner, 1980) and, at the other extreme, the view that the excited population is homogeneous but has a time-dependent fluorescence spectrum that shifts to longer wavelengths on a nanosecond time scale (relaxation model) (Lakowicz, 2000, 2006). Other views have also been presented (Bajzer and Prendergast, 1993; Hudson et al., 1999; Wlodarczyk and Kierdaszuk, 2003). Heterogeneity in the form of different rotamers was modeled in a QM–MD simulation of FRET for Trp43 in the TetR–tetracycline complex and found to give rise to nonexponential decay, primarily through different orientation factors of V (Beierlein et al., 2006). The observation by Xie and coworkers (Luo et al., 2006; Yang et al., 2003) of 100- to 1000-fold fluctuations in lifetime while tracking the fluorescence decay times of single molecules of FAD and of fluorescein in proteins, at 100 ms intervals over periods of seconds, emphasizes the generality of this phenomenon.
For these bulky chromophores, dynamic interchange of rotamer conformations on the ms time scale is not attractive, though perhaps possible, and they have suggested that the heterogeneity is in the chromophore–quencher intermolecular distance. This is reasonable considering that energy gap heterogeneity may not be as important as in the
Trp case. The single-molecule fit to four exponential decay constants was found to agree well with bulk measurements. Entangled in this division of views regarding Trp is a strong correlation of fluorescence lifetime and fluorescence wavelength, such that the lifetime virtually always becomes longer at longer wavelengths. This intriguing correlation is beginning to be understood (Pan et al., 2010) and plays a central role in defining interpretations. Relaxation models naturally predict such a correlation (Lakowicz, 2000; Maroncelli, 1993), because simple solvent response to an excited state dipole always lowers the excited state energy. The spectrum shifts in time to lower energies, and the loss of intensity at short wavelengths adds a fast decay component at the short-wavelength edge and a complementary rising component at the long-wavelength side, both in addition to an underlying excited state population decay. A rising component appears as a negative amplitude if one globally fits the decay profiles as a linear combination of exponentials with different decay constants. The heterogeneity model can cause apparent shifts of the fluorescence spectrum in time as the faster decaying populations disappear, leaving the longer decaying components; in contrast to the relaxation mechanism, however, the heterogeneity model cannot produce negative amplitudes. For proteins, the heterogeneity model often seems more appropriate, because there are very few cases in which negative amplitudes are found. Two studies on the protein monellin highlight the difficulty of interpretation in the 10–100 ps range (Xu et al., 2006; Peon et al., 2002). Heterogeneity has appeared in more clear-cut cases: in the weakly fluorescing Trp68 of human γD- and γS-crystallins, a major 50 ps decay component appears along with substantial amplitudes of 0.2 and 2 ns components, with no measurable shift in λmax over a time range of 0–300 ps (Xu et al., 2009; Chen et al., 2009).
Strong evidence for this heterogeneity comes from a 2-ns QM–MD simulation reported in that study, in which a water modulates ET to the amide by forming and breaking an H-bond at intervals on the order of 100 ps. Extreme lifetime–wavelength correlation is also seen in cyclic hexapeptides on the few-ns time scale (Pan et al., 2006), a time scale on which solvent relaxation for exposed Trps is not expected. In contrast to the relaxation model, no comparably attractive principle explains why heterogeneity would lead to the correlation in the absence of relaxation. The absence of a plausible hypothesis for the underlying principle has been used as evidence against the standard heterogeneity (rotamer) model (Lakowicz, 2000, 2006), and the absence of rising long-wavelength fluorescence (or negative amplitudes) has been used as evidence against the relaxation model. Our finding that the short-range ET coupling in these systems is strong (Callis et al., 2007), with a concomitant greatly increased sensitivity of the ET quenching rate to the energy gap between the S1 and CT states (Callis, 1991, 1997), leads naturally to the missing underlying
universal physical principle accounting for the strong correlation between λmax and τf (Pan et al., 2010). As seen in Fig. 1.2, the familiar broad fluorescence spectrum of a solvent-exposed chromophore is actually an ensemble average of single-molecule λmax values, fluctuating on a femtosecond time scale, typically over 3000–4000 cm−1 (fwhm), or about 40 nm. This leads naturally to a picture in which those conformers having shorter-wavelength emission spectra, that is, higher average energy, have an increased probability of transient fast quenching during large fluctuations in environment that bring the nonfluorescent CT state and the fluorescing state into resonance. Figure 1.6 illustrates this point. The lower panel shows the distribution of λmax values from the trajectory in the top panel, together with a bar graph of the instantaneous kq values at each point in time, plotted versus the S1 transition energy. One sees that the bulk of the quenching probability comes from the most extreme high-energy
[Figure 1.6: the upper and middle panels plot the instantaneous S1 transition energy (kcm−1) and electron transfer rate (ns−1) versus time over 0–140 ps; the lower panel shows the distribution of instantaneous S1 transition energies and the rates versus S1 (1La) energy over 28–34 kcm−1.]
Figure 1.6 Explanation of the physical principle underlying the correlation of lifetime and wavelength expected in cases of heterogeneity. The lower panel shows the histogram of S1 energies (solid curve), coming from the distribution seen in the upper panel. The vertical lines on the lower panel represent a bar graph of instantaneous electron transfer rates at the S1 energy indicated on the horizontal axis. Instantaneous rates as a function of time are plotted on the middle panel.
fluctuations, that is, those having the greatest probability of being in resonance with the CT state. Therefore, one concludes that the subpopulations with the most blue-shifted spectra will decay fastest, provided the subpopulations persist for the excited state lifetime and have comparable fluctuation amplitudes. The principle leading to the strong correlation is actually restricted to cases, for example, Trp, for which the CT state lies well above the S1 state and the wavelength is quite sensitive to the local electric field. In these cases, heterogeneity and relaxation give superficially similar behavior, making heterogeneity a rather elusive concept. Two studies using nonnatural amino acids, however, underscore the reality of heterogeneity by contrasting behavior. The fluorescence decays of 5-fluoroTrp incorporated in proteins are almost always monoexponential (Broos et al., 2004). This probe is not as easily quenched by ET as Trp because it has a higher ionization potential (Liu et al., 2005), but it retains a wavelength sensitivity to solvent quite close to that of Trp. Apparently, suppressing ET quenching removes the major environmentally sensitive nonradiative decay pathway of Trp in proteins (i.e., causes kq to become quite small). Another probe behaves opposite to Trp in regard to the lifetime–wavelength correlation: Boxer and coworkers (Abbyad et al., 2007) find that time-resolved fluorescence spectra of Aladan incorporated at several sites in the protein GB1 shift to shorter wavelength on the nanosecond time scale. This striking behavior unequivocally reveals ground-state heterogeneity, because relaxation always requires a red shift in time; it is consistent with the observation that τf for this probe decreases with increasing solvent polarity by an internal mechanism.
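The physical principle illustrated in Fig. 1.6 can be sketched with a toy Monte Carlo model. All parameters below (mean S1 energies, fluctuation widths, the resonance window, and the peak rate) are hypothetical and chosen only to show the qualitative behavior; this is not the author's simulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_quench_rate(s1_mean, s1_sigma, e_ct, width=300.0, k0=500.0, n=200000):
    """Sketch of the Fig. 1.6 principle: the S1 energy of a conformer
    fluctuates about its mean (Gaussian, cm^-1), and quenching turns on
    only during excursions that bring S1 into resonance with a CT state
    at e_ct.  The resonance is modeled as a Gaussian window of the given
    width; k0 is a hypothetical peak rate in ns^-1."""
    e = rng.normal(s1_mean, s1_sigma, n)
    return k0 * np.mean(np.exp(-0.5 * ((e - e_ct) / width) ** 2))

blue = mean_quench_rate(31000.0, 1500.0, 34000.0)   # blue-shifted conformer
red = mean_quench_rate(29000.0, 1500.0, 34000.0)    # red-shifted conformer
```

With these toy numbers the blue-shifted subpopulation samples the resonance region far more often and quenches much faster, reproducing the conclusion above that the most blue-shifted subpopulations decay fastest.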
5. Final Remarks

The QM–MD studies over the past 7 years using the methods described here have produced an understanding of the phenomenally large range of fluorescence wavelengths, lifetimes, and quantum yields observed for Trp in proteins. No competing explanation has been put forward to our knowledge. The method is not just a "black box": a visual examination of structure files of solvated and equilibrated proteins often yields excellent qualitative predictions of wavelength and intensity, provided one has a minimal grasp of electrostatic interactions at the level of Figs. 1.3 and 1.4. The principles put forth are quite general within a broad class of fluorescent probes. Improvements in the form of ab initio coupling elements and longer simulations, out to 25 ns, have not helped the ragged fit of calculation versus experiment in the intermediate region between low and very high quantum yield. Future studies will have to examine at least three
independent questions: (1) whether the Fermi rule is adequate in all cases; (2) whether ab initio quantum methods are needed; and (3) whether the MD parameters used so far are adequate.
Appendix: Ab Initio Computation of Electron Transfer Coupling Matrix Elements

Obtaining better estimates of ET coupling elements has enabled an understanding of the ubiquitous correlation of λmax and τf, in the absence of solvent relaxation, that can arise from heterogeneity in ET rates for Trp in proteins. This Appendix details a procedure for quickly estimating ab initio ET coupling elements, V in Eq. (1.3). Using Gaussian 03, revision D.01, diabatic electronic matrix elements were taken from the singles configuration interaction (CIS) Hamiltonian matrix element coupling the two single configurations that constitute nearly pure S1 (1La) and amide CT states: the highest occupied molecular orbital (HOMO) → ring π* lowest unoccupied molecular orbital (LUMO) and HOMO → amide LUMO (π*). We find that a CI basis consisting of only the three excitations from HOMO to LUMO + n (n = 0, 1, 2) adequately spans the desired space for the systems studied here, because mixing of ring and amide MOs is not extensive. Subroutine MrgCIS in link 914 was slightly modified, using the recommended link modification procedure, so that it always writes the CI ⟨AA,BB|AA,BB⟩ matrix, its eigenvalues, and its eigenvectors, independent of other print options. The corresponding singlet- and triplet-adapted CI matrices were constructed from the output and diagonalized to ensure that the results agreed with the normally printed singlet and triplet transition energies. The largest difficulty was the mixing between the amide π* and ring LUMO + 1 MOs and the corresponding mixing of the configurations involving these MOs. This problem was solved by determining which linear combination of the mixed MOs gave maximum amide π* density; this linear combination of CI matrix elements then gave the desired interaction element. Three basis sets were compared: STO-3G, 3-21G, and the Dunning/Huzinaga full double-ζ basis, D95. Below is the result of the Linux diff command:

diff l914.F l914.F.modified
> c modified by callis 27may06 so it will always write the CI
>   matrix and its eigenvecs
3140c3141
< If(IPrint.ge.6) then
---
> c If(IPrint.ge.6) then
Patrik R. Callis
3144c3145
< endIf
---
> c endIf
5059,5060c5060,5061
< If(IPrint.ge.6)
<      $ Call OutMtS(IOut,' matrix after diagonals:',0,0,
---
> c If(IPrint.ge.6)
>      Call OutMtS(IOut,' matrix after diagonals:',0,0,

The following .gjf file requires about 35 s with 1 CPU, and about 10 s with the STO-3G basis.

%nproc=1
%Mem=100MW
%subst l914 /home/callis/l914
# hf/3-21g cis(singlet,icdiag,rw) pop=reg iop(6/7=0,6/8=2,6/9=2)

v12 hf/3-21g for villin

0 1
1  32.81637  33.61985  32.28627
6  33.63000  33.29000  32.88000
1  34.50000  33.68000  32.36000
6  33.48000  33.99000  34.24000
1  33.65000  35.06000  34.12000
1  32.49000  33.90000  34.69000
6  34.36000  33.69000  35.40000
6  34.00000  32.96000  36.48000
1  33.00000  32.56000  36.54000
6  35.78000  33.96000  35.61000
7  35.11000  32.64000  37.24000
1  35.11000  31.94000  37.97000
6  36.24000  33.29000  36.78000
6  36.74000  34.73000  34.92000
1  36.44000  35.20000  34.00000
6  37.50000  33.47000  37.35000
1  37.81000  33.05000  38.29000
6  38.06000  34.89000  35.36000
1  38.69000  35.51000  34.75000
6  38.38000  34.19000  36.53000
1  39.38000  34.40000  36.91000
6  33.83000  31.78000  32.93000
8  32.78000  31.14000  32.99000
Predicting Fluorescence
7  35.03000  31.20000  32.87000
1  35.21010  30.21638  32.87693
1  35.90063  31.62840  32.62816

50,53

The program getv.f, reproduced below, reads the .log file produced from a run using the above input file from standard input, assembles the non-spin-adapted matrix elements into the correct singlet and triplet linear combinations, and outputs the desired single matrix element. This is a "debugging" version, and checking is commented out for production.

c 29may06 getv.f callis: program to test algorithm for constructing
c singlet CI Hamiltonian matrix elements from the CI matrix elements
c from Gaussian03. This gives exactly the values for singlet and
c triplet CIS energy eigenvalues and eigenvectors in the Gaussian
c output.
      program getv
      IMPLICIT REAL*8(A-H,O-Z)
      character line*80
      dimension H(6,6),HL(21),C(6,6),E(6)
      dimension V(3,3),VL(6),CS(3,3),ES(3)
c dimension of HL = n*(n+1)/2 if H(n,n)
   10 read(5,'(a80)')line
      if(line(1:4).ne.' <AA')goto 10
      read(5,*)
  900 format(i7,5g13.6)
  901 format(i3,2x,6f10.6)
  902 format(2x,6f10.6)
      do i=1,6
        read(5,900)ii,(H(i,j),j=1,5)
      enddo
      read(5,*)
      do i=1,6
        read(5,900)ii,H(i,6)
      enddo
      write(6,'(6i10)')(i,i=1,6)
      do i=1,6
        write(6,901)i,(H(i,j),j=1,6)
      enddo
      write(6,*)
      write(6,*)
c elements of the matrix being diagonalized are stored in a linear
c array h in order 11, 12, 22, 13, 23, 33, 14, 24, 34, 44, .....
c
c if i=row and j=col:  K=I+J*(J-1)/2
c                      IF(I.GT.J)K=J+I*(I-1)/2
c                      H(K)=HIJ
      do i=1,6
        do j=1,i
          k=i+j*(j-1)/2
          IF(I.GT.J)K=J+I*(I-1)/2
          HL(k)=H(i,j)
        enddo
      enddo
c check the eigenvalues, vectors
      call givens(HL,C,E,6,6)
      write(6,902)(E(i),i=1,6)
      write(6,*)
      do i=1,6
        write(6,901)i,(C(i,j),j=1,6)
      enddo
c get the singlet eigenvalues and vectors
      V(1,1)= H(1,1)+abs(H(1,4))
      V(2,2)= H(2,2)+abs(H(2,5))
      V(3,3)= H(3,3)+abs(H(3,6))
      V(1,2)= H(1,2)+H(1,5)
      V(1,3)= H(1,3)+H(1,6)
      V(2,3)= H(2,3)+H(2,6)
      print*,' Singlet CI Hamiltonian'
      do i=1,3
        write(6,901)i,(V(i,j),j=1,3)
      enddo
      write(6,*)
      write(6,*)
      do i=1,3
        do j=i,3
          k=i+j*(j-1)/2
          IF(I.GT.J) K=J+I*(I-1)/2
          VL(k)=V(i,j)
          print*,k,VL(k)
        enddo
      enddo
      call givens(VL,CS,ES,3,3)
      print*, 'Singlet eigenvalues and vectors'
      print*,' Coef. divided by root 2 as in G03 output'
      print*,' '
      write(6,902)(ES(i),i=1,3)
c in eV
      write(6,902)(ES(i)*27.2116,i=1,3)
      write(6,*)
      do i=1,3
        write(6,901)i,(CS(i,j)/2**.5,j=1,3)
      enddo
c check Triplet eigenvalues
      V(1,1)= H(1,1)-abs(H(1,4))
      V(2,2)= H(2,2)-abs(H(2,5))
      V(3,3)= H(3,3)-abs(H(3,6))
      V(1,2)= H(1,2)-H(1,5)
      V(1,3)= H(1,3)-H(1,6)
      V(2,3)= H(2,3)-H(2,6)
      print*,' Triplet CI Hamiltonian'
      do i=1,3
        write(6,901)i,(V(i,j),j=1,3)
      enddo
      write(6,*)
      write(6,*)
      do i=1,3
        do j=i,3
          k=i+j*(j-1)/2
          IF(I.GT.J) K=J+I*(I-1)/2
          VL(k)=V(i,j)
          print*,k,VL(k)
        enddo
      enddo
c givens is a commonly used routine for finding eigenvalues and
c eigenvectors of symmetric matrices, as result from the linear
c variation method.
      call givens(VL,CS,ES,3,3)
      print*, 'Triplet eigenvalues and vectors'
      print*,' Coef. divided by root 2 as in G03 output'
      print*,' '
      write(6,902)(ES(i),i=1,3)
c in eV
      write(6,902)(ES(i)*27.2116,i=1,3)
      write(6,*)
      do i=1,3
        write(6,901)i,(CS(i,j)/2**.5,j=1,3)
      enddo
      end
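For readers who prefer not to compile Fortran, the spin-adaptation step in getv.f can be reproduced in a few lines of NumPy. The sketch below is an illustrative reimplementation, not part of the original distribution: it assumes the 6 x 6 non-spin-adapted CIS matrix has already been extracted from the .log file (a synthetic symmetric matrix stands in for it here), and numpy.linalg.eigh plays the role of the givens routine.

```python
import numpy as np

def spin_adapted(H, spin="singlet"):
    """Build the 3x3 singlet or triplet CI Hamiltonian from the 6x6
    non-spin-adapted matrix, mirroring getv.f:
      diagonal:      V(i,i) = H(i,i) +/- |H(i,i+3)|
      off-diagonal:  V(i,j) = H(i,j) +/- H(i,j+3)
    with + for singlet and - for triplet combinations."""
    s = 1.0 if spin == "singlet" else -1.0
    V = np.empty((3, 3))
    for i in range(3):
        V[i, i] = H[i, i] + s * abs(H[i, i + 3])
        for j in range(i + 1, 3):
            V[i, j] = V[j, i] = H[i, j] + s * H[i, j + 3]
    return V

HARTREE_TO_EV = 27.2116  # same conversion factor used in getv.f

def analyze(H):
    """Return {spin: (eigenvalues in eV, G03-style coefficients)}."""
    out = {}
    for spin in ("singlet", "triplet"):
        V = spin_adapted(H, spin)
        E, C = np.linalg.eigh(V)   # replaces the givens() eigensolver
        out[spin] = (E * HARTREE_TO_EV, C / np.sqrt(2.0))
    return out
```

Because the non-spin-adapted matrix has the block form [[A, B], [B, A]], the singlet (A + B) and triplet (A - B) eigenvalues together reproduce the spectrum of the full 6 x 6 matrix, which is the same internal consistency check performed by getv.f against the normally printed transition energies.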
ACKNOWLEDGMENTS

This work was supported by NSF grants MCB-013306, MCB-0446542, and MCB-0847047. The author thanks Pedro Muino, James Vivian, Tiqing Liu, Alexander Petrenko, Jose Tusell, Ryan Hutcheson, and Carl Fahlstrom for work contributing to this chapter; he is also indebted to Andy Albrecht, Bruce Hudson, Lenny Brand, Mary Barkley, Frank Prendergast, Jay Knutson, Jaap Broos, Dongping Zhong, Jonathan King, Yves Engelborghs, and J. Michael Schurr for inspiration and discussions. Final thanks are to the late Michael Zerner for generously providing his program so many years ago.
CHAPTER TWO
Modeling of Regulatory Networks: Theory and Applications in the Study of the Drosophila Circadian Clock

Elizabeth Y. Scribner* and Hassan M. Fathallah-Shaykh*,†,‡,§

Contents
1. Introduction
2. Developmental History of the Drosophila Circadian Clock
3. Comparative Analysis of Three Network Regulatory Models
   3.1. Michaelis–Menten enzyme kinetics: "The gold standard"
   3.2. A probabilistic model for the Drosophila circadian clock
   3.3. A new regulatory network model
4. The CWO Anomaly and a New Network Regulatory Rule
5. Concluding Remarks
References
Abstract

Biological networks can be very complex. Mathematical modeling and simulation of regulatory networks can assist in resolving unanswered questions about these complex systems, which are often impossible to explore experimentally. The network regulating the Drosophila circadian clock is particularly amenable to such modeling given its complexity and what we call the clockwork orange (CWO) anomaly. CWO is a protein whose function in the network as an indirect activator of genes per, tim, vri, and pdp1 is counterintuitive—in isolated experiments, CWO inhibits transcription of these genes. Although many different types of modeling frameworks have recently been applied to the Drosophila circadian network, this chapter focuses on the application of continuous deterministic dynamic modeling to this network. In particular, we present three unique systems of ordinary differential equations that have been used to successfully model different aspects of the circadian network. The last model incorporates the newly identified protein CWO, and we explain how this model's unique mathematical equations can be used to explore and resolve the CWO

* Department of Mathematics, The University of Alabama at Birmingham, Birmingham, Alabama, USA
† Department of Neurology, The University of Alabama at Birmingham, Birmingham, Alabama, USA
‡ Department of Cell Biology, The University of Alabama at Birmingham, Birmingham, Alabama, USA
§ The UAB Comprehensive Neuroscience and Cancer Centers, Birmingham, Alabama, USA
Methods in Enzymology, Volume 487
ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87002-3
© 2011 Elsevier Inc. All rights reserved.
anomaly. Finally, analysis of these equations gives rise to a new network regulatory rule, which clarifies the unusual role of CWO in this dynamical system.
1. Introduction

The Drosophila circadian clock is a self-regulating intracellular network with a chain of biochemical reactions that generate oscillations of its key molecular components, with peak-to-peak time periods of approximately 24 h. Major advances in molecular biology and genome sequencing over the past few decades have identified the underlying architecture of the circadian clock, thus making this network an archetype for mathematical modeling and simulation.

In the section that follows, we present several static diagrams of the Drosophila circadian network as it evolved with the discovery of new clock components. Red arrows denote activation of a molecule, whereas blue lines with bars denote repression, or negative feedback. Blue lines with an "x" denote degradation of one molecule by another. These molecules consist of mRNAs and proteins, which are represented with lowercase italics (per, tim, vri, etc.) and uppercase font (PER, TIM, VRI, etc.), respectively. While these static diagrams provide useful illustrations of the network's architecture, a more quantitative approach is required in order to understand the system's behaviors and complex dynamics (Kitano, 2002). In particular, we are interested in using quantitative modeling to resolve an anomaly in the clock's architecture.

Like any scientific investigation, a good mathematical model begins with a question. In the case of the Drosophila circadian network, we are trying to resolve the counterintuitive effects of a recently identified network molecule (clockwork orange, CWO) on its direct target genes (per, tim, pdp1, vri) (Kadener et al., 2007). Direct target genes in this network are defined as mRNAs that are transcriptionally activated or repressed by regulating proteins. The per and tim transcripts encode the proteins PER and TIM, respectively, which participate in a negative feedback loop that suppresses the expression of their own mRNA (Gekakis et al., 1995).
We henceforth refer to this loop as the PER/TIM negative feedback loop. The vri and pdp1 transcripts encode the proteins VRI and PDP1, respectively. VRI suppresses its own transcription in a negative feedback loop, while PDP1 activates its expression in a feed-forward loop (Blau and Young, 1999; Cyran et al., 2003). These three loops (PER/TIM, VRI, PDP1), modular in structure, comprise the basic framework of the clock's architecture. Very recent studies have revealed the existence of an additional clock network gene, cwo (Kadener et al., 2007). Laboratory experiments
performed in vitro show that, in addition to suppressing its own expression in a negative feedback loop (like PER/TIM and VRI), cwo's encoded protein CWO represses the expression of its other direct target genes per, tim, pdp1, and vri. However, detected levels of these proteins at the peak of their oscillations are higher in wild-type (wt) flies than in cwo-mutant flies, suggesting that CWO acts as an activator (rather than a repressor) of its direct target genes in the overall network (Matsumoto et al., 2007; Richier et al., 2008). This anomaly highlights the limitations of isolated in vitro experiments when trying to answer broader questions about a molecule's overall role in a dynamic network (Endy and Brent, 2001). As with any engineering system, it is impossible to fully understand the function of an individual clock component (like cwo) without analyzing its behavior holistically within the network. In fact, the clock's underlying control mechanisms exhibit many of the properties characteristic of robust engineering systems, including the ability to respond to environmental stimuli, the presence of negative and positive feedback loops, redundancy among these loops in the event of one component's failure, and finally modularity (Kitano, 2002).

For this reason, many scientists have attempted to model the behavior of the clock using mathematical equations and computer simulations. One of the most important characteristics of any mathematical model is its ability to replicate the observed biology of a network system (Szallasi et al., 2006). In Section 3, we present a comparative analysis of three different mathematical models, each of which has succeeded in reproducing 24-h oscillations of the key network components used in that model. The first model we present utilizes a system of rate equations with Michaelis–Menten and Hill-type kinetics, which we refer to as the "Gold Standard" given its prolific use among models of intracellular regulatory networks.
These types of equations, however, require several variables and parameters to model a single molecular interaction, thus making the method prohibitive—or at least very difficult to understand—when the number of network molecules is large, as in the case of the Drosophila circadian clock (see Fig. 2.5, Section 2). Therefore, for comparison, we present two other models with very different types of rate equations that also simulate the Drosophila clock network. The second model utilizes first-order kinetic equations that incorporate the binding probabilities of transcription factors (both repressors and activators) to gene promoter regions, also known as E-boxes (Xie and Kulasiri, 2007). However, this particular model was developed prior to the discovery of the new clock gene cwo, and hence it does not help in resolving the aforementioned anomaly in the clock's architecture. Finally, we present a third mathematical model with a novel set of rate equations, which are simple yet effective in replicating the observed biology of the circadian clock, such as its response to light stimuli as well as the phenotypes
associated with various gene mutations (Fathallah-Shaykh et al., 2009). In addition to reproducing experimental observations of network behaviors, a good model attempts to answer a proposed question. This model incorporates the new network gene cwo and its corresponding protein CWO, with the added advantage that the proposed system of equations allows for easy analysis of the effects of CWO on its direct target genes (per, tim, vri, pdp1) at the peaks of their oscillations. We explain the unique properties of these equations and the results of their analysis in Section 4. Finally, a good model not only generates insights into the dynamic behavior of a system but also helps to make predictions about network behaviors in similar systems (Endy and Brent, 2001). Indeed, it is reasonable to assume that control mechanisms employed by the Drosophila circadian clock might also be effective in regulating other types of intracellular networks, such as the mammalian circadian clock. With this in mind, we use the insights gained from our analysis of the CWO anomaly to develop a new network regulatory rule, which can be applied generally to any network system exhibiting control mechanisms similar to that of the Drosophila clock.
2. Developmental History of the Drosophila Circadian Clock

Until recently, the molecular mechanism of the Drosophila circadian clock had eluded scientists. In the 1980s, scientists used genetic screens to identify the first of many key molecular actors in this complex dynamic network (Hall and Rosbash, 1987). Aware of the clock's endogenous nature (Moore-Ede et al., 1982), they were looking for a molecule whose biochemical structure might shed light on the nature of this self-sustaining oscillator. The discovery of the per gene offered new insights by revealing a negative feedback loop, in which the per transcript and PER protein oscillate in abundance (Baylies et al., 1987; Hardin et al., 1990). Peaks in concentration of per mRNA and PER protein are separated by a 4–6 h gap, after which activation of per expression begins to decline, suggesting that the PER protein either directly or indirectly inhibits the transcription of its own mRNA (Zerr et al., 1990).

In an effort to understand the mechanism of this interaction in more detail, Goldbeter (1995) proposed a continuous, deterministic mathematical model of a single negative feedback loop in the Drosophila circadian clock. A schematic diagram of his 1995 model is shown in Fig. 2.1. The study particularly focused on the function of PER phosphorylation, which delays entry of the protein into the nucleus, and hence the repression of per transcription (Goldbeter, 1995). A more detailed analysis of this model and its accompanying system of rate equations will be presented in Section 3.
[Figure 2.1 diagram: per transcription from E-boxper at rate VS, PER phosphorylation states P0, P1, P2 (rates V1–V4), nuclear entry/exit (k1, k2), and degradation (Vm, Vd); see the legend below.]
Figure 2.1 The 1995 Goldbeter model. per mRNA is transcribed and transported to the cytoplasm at a rate VS, where it is translated into unphosphorylated PER protein (P0) at a rate kS and degraded at a rate Vm. Multiple and reversible phosphorylation of PER (P0, P1, P2) delays entry of the fully phosphorylated protein (P2) into the nucleus (PN), where it represses the transcription of per mRNA by binding to the E-box. P0 is phosphorylated to P1 and P1 to P2 at rates V1 and V3, respectively, while P2 is dephosphorylated to P1 and P1 to P0 at rates V4 and V2, respectively; P2 is also degraded at a rate Vd. P2 is transported into the nucleus at a rate k1, and PN is transported back into the cytoplasm at a rate k2.
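The five-variable system sketched in Fig. 2.1 can be integrated directly. The sketch below is a minimal Python implementation assuming the rate laws commonly quoted for the 1995 model (Hill-type repression of transcription, Michaelis–Menten phosphorylation and degradation steps); the parameter values are those widely reproduced for that model, not taken from this chapter, and a hand-rolled RK4 stepper stands in for a library integrator.

```python
# Minimal integration sketch of the Goldbeter (1995) five-variable PER model.
# Variables: M (per mRNA), P0, P1, P2 (PER phosphoforms), PN (nuclear PER).

def goldbeter(state, p):
    M, P0, P1, P2, PN = state
    dM  = p["vs"] * p["KI"]**4 / (p["KI"]**4 + PN**4) - p["vm"] * M / (p["Km"] + M)
    dP0 = p["ks"] * M - p["V1"] * P0 / (p["K1"] + P0) + p["V2"] * P1 / (p["K2"] + P1)
    dP1 = (p["V1"] * P0 / (p["K1"] + P0) - p["V2"] * P1 / (p["K2"] + P1)
           - p["V3"] * P1 / (p["K3"] + P1) + p["V4"] * P2 / (p["K4"] + P2))
    dP2 = (p["V3"] * P1 / (p["K3"] + P1) - p["V4"] * P2 / (p["K4"] + P2)
           - p["k1"] * P2 + p["k2"] * PN - p["vd"] * P2 / (p["Kd"] + P2))
    dPN = p["k1"] * P2 - p["k2"] * PN
    return [dM, dP0, dP1, dP2, dPN]

def rk4(f, state, p, dt, steps):
    """Classic fourth-order Runge-Kutta integration; returns the trajectory."""
    traj = [state]
    for _ in range(steps):
        k1 = f(state, p)
        k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)], p)
        k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)], p)
        k4 = f([s + dt * k for s, k in zip(state, k3)], p)
        state = [s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
        traj.append(state)
    return traj

# Commonly quoted parameter set for the 1995 model (concentrations in uM, time in h).
params = dict(vs=0.76, vm=0.65, Km=0.5, ks=0.38, vd=0.95, Kd=0.2,
              k1=1.9, k2=1.3, KI=1.0, V1=3.2, V2=1.58, V3=5.0, V4=2.5,
              K1=2.0, K2=2.0, K3=2.0, K4=2.0)
traj = rk4(goldbeter, [0.5, 0.5, 0.5, 0.5, 0.5], params, dt=0.01, steps=12000)  # 120 h
```

With these values the per mRNA concentration should rise and fall with a period near 24 h, as reported for the original model; the same skeleton extends to the 10-variable model of Fig. 2.2 by adding the TIM branch and the PER–TIM complex.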
In the same year, Gekakis et al. (1995) isolated a second circadian network gene—timeless (tim)—whose protein TIM binds with PER. Moreover, Saez and Young (1996) soon discovered that cytoplasmic assembly of the PER/TIM complex is a prerequisite for nuclear transport of either protein, without which oscillations cease to exist. Further studies revealed that light stimulates the degradation of TIM and thereby helps to regulate the phase of the circadian cycle (Hunter-Ensor et al., 1996; Lee et al., 1996). TIM's response to light is mediated by a separate oscillating protein CRY, whose expression peaks during daylight hours in response to external stimuli (Emery et al., 1998). Intuitively, the addition of a second tim feedback loop in the circadian network would increase the system's resistance to small perturbations or variability. With the help of Leloup, Goldbeter proceeded to verify this hypothesis by updating his 1995 model to fit these new discoveries. The updated schematic diagram is shown in Fig. 2.2. Not included in their expanded model is the exact mechanism through which light degrades phosphorylated TIM via the photoreceptor CRY. Nevertheless, Leloup and Goldbeter (1998) discovered that the domain of oscillatory behavior in parameter space was indeed increased by the addition of a second feedback loop. Meanwhile, Bae et al. (1998) reported the discovery of a new Drosophila circadian network gene, dClock (Clk), homologous to the circadian clock gene found in mammals. A previously identified 69-bp promoter region (or E-box) upstream of the per gene suggested the presence of a transcriptional activator (Hao et al., 1997), and scientists soon confirmed that the Drosophila CLOCK protein (CLK) drives expression of both PER and TIM by
44
Elizabeth Y. Scribner and Hassan M. Fathallah-Shaykh
Figure 2.2 1998 Goldbeter and Leloup model (updated from 1995). The expanded diagram includes 10 molecules. Like PER, TIM is phosphorylated multiple times (T0, T1, T2) before binding with fully phosphorylated PER (P2) to form the PER–TIM complex (P–T). After being transported to the nucleus, the complex (P–TN) represses transcription of both the per and tim genes by binding to E-box.
binding with the protein CYCLE (CYC) to form a heterodimer and transcriptional activator, CLK–CYC. Accumulation of the PER–TIM dimer in the nucleus subsequently inhibits CLK–CYC's transcriptional regulation of the per and tim genes (Darlington et al., 1998), as shown in Fig. 2.3. These findings are consistent with evidence that PER and TIM do not bind directly to their own DNA (Sassone-Corsi, 1998). Another gene with a mammalian homolog, double-time (dbt), was found to encode a protein kinase, DBT, which positively regulates the accumulation and nuclear transport of PER–TIM through cytoplasmic phosphorylation of the dimer, without which per and tim cease to oscillate (Kloss et al., 1998; Price et al., 1998). More recent inquiries into the role of DBT have revealed that it actually binds to PER, using the molecule as a vehicle for entry into the nucleus, where it directly suppresses the activity of CLK–CYC (Kim et al., 2007; Yu et al., 2006). A year later, Glossop et al. (1999) observed that protein concentrations of CLK oscillate in counter phase to PER and TIM concentrations, suggesting
45
Modeling of Regulatory Networks
Figure 2.3 Expanded model of the circadian network. Protein kinase DBT positively regulates PER–TIM (P–T) accumulation and transport through phosphorylation of the dimer (P–TP). The nuclear PER–TIM dimer (P–TN) then inhibits the transcriptional activity of the nuclear CLK–CYC heterodimer (C–CN), which otherwise binds to promoter regions (E-boxes) in the per and tim genes to activate transcription.
the presence of an additional feedback loop with transcriptional activators and/or repressors of the clk gene. These predictions were supported by reports of a novel clock-controlled gene, vrille (vri), whose transcripts oscillate in the same phase as per and tim (Blau and Young, 1999). Ueda et al. (2001) revised the mathematical model of the Drosophila circadian clock to include this additional clk feedback loop, using equations similar to those introduced by Goldbeter (1995) and Leloup and Goldbeter (1998). Though this model confirmed the feasibility of interlocked feedback loops, it was not until 2003 that the exact role of the protein VRILLE as a transcriptional repressor of clk was confirmed (Glossop et al., 2003). In the same year, Cyran et al. (2003) identified the Par-Domain-Protein (pdp1) gene as part of a positive feedback loop that regulates the expression of the clk gene. In turn, CLK activates the expression of the pdp1 gene, without which clock components become arrhythmic, or cease to oscillate. Following this discovery, Smolen et al. (2004) proposed an updated mathematical model of the Drosophila clock that incorporated the vrille negative feedback loop and the newly discovered Pdp1 positive feedback loop into the network. In 2007, Xie and Kulasiri modeled the same network using a novel set of continuous, deterministic mathematical equations, which will be outlined in more detail in Section 3. Figure 2.4 provides a schematic diagram of the additional feedback loops used in the model by Xie and Kulasiri (2007).
Figure 2.4 2007 Xie and Kulasiri model. Two negative loops (PER–TIM and VRI) and one positive loop (PDP1) regulate this network of 14 molecules. PER–TIM binds to CLK–CYC to inhibit its transcriptional activity at E-box promoters. VRI binds to the clk VP-box to inhibit transcription of the clk gene. Meanwhile, PDP1 competes with VRI to activate transcription of clk. For simplification, phosphorylation and compartmentalization are ignored in this diagram, as are the effects of both light on TIM degradation and the protein DBT on PER/TIM transport.
The latest update to the Drosophila circadian clock occurred in 2007 with the much anticipated discovery of the cwo gene, a core clock component that functions as both an activator and repressor of clock gene expression (Kadener et al., 2007; Richier et al., 2008). To date, the only mathematical model to incorporate this new discovery was published in 2009 by Fathallah-Shaykh, Bona, and Kadener. Figure 2.5 provides a schematic diagram of this new model. A more detailed analysis of the accompanying mathematical equations used to model the network in Fathallah-Shaykh et al. (2009) will be presented in Section 3. In the diagram, CWO acts as a direct transcriptional repressor of pdp1, vri, tim, per, and cwo itself, indicating that cwo-mutants should exhibit higher expression levels of each gene compared to wt Drosophila. However, several key studies indicate that the opposite is true (Matsumoto et al., 2007; Richier et al., 2008). The absence of cwo actually increases peak levels of the four transcripts (pdp1, vri, tim, and per) indicating that cwo also functions as an activator—an anomaly that will be resolved in more detail in Section 4, using the above model (Fathallah-Shaykh et al., 2009). All the aforementioned models in this section and those discussed in Sections 3 and 4 are continuous and deterministic, meaning the behavior of each molecular species is analyzed in the absence of unknown perturbations (or noise). Each model is accompanied by a set of ordinary differential equations, which measure the time-evolution of the concentration of the
Figure 2.5 A schematic diagram incorporating the regulatory effects of cwo. For simplicity, transcriptional activation and repression through binding to E-box is summarized with single red arrows and blue lines, connecting the transcriptional regulator to the final mRNA transcript produced by the regulated gene. CLK–CYC activates expression of all gene products, while CWO directly inhibits expression of each gene product. The model also incorporates the protein CRY, which is known to accelerate the degradation of TIM, indicated by a blue X. DBT positively regulates phosphorylation and transport of PER–TIM into the nucleus, where it represses CLK–CYC.
mRNAs and proteins involved in the proposed network. These rate equations are functions of both kinetic parameters (which include the expression and degradation rates of each molecule) and the varying concentrations of the other molecular species. Using numerical integration, these proposed models are capable of producing robust oscillations of the mRNAs and proteins, which replicate the period and phase-peak timing observed in nature. However, given the unpredictable variation (or stochasticity) in gene expression, which is not taken into account in the deterministic model, questions have been raised regarding the effects of noise on the dynamical behavior of the system (Gonze et al., 2002; Li and Lang, 2008). To address these concerns, several stochastic models have been proposed that examine the Drosophila circadian rhythm's resistance to molecular noise (see Table 2.1). The majority of these models utilize a birth and death chain stochastic process governed by a chemical master equation (CME), which measures the time evolution of the probability function of the system's state. This probability function, P(x, t | x0, t0), computes the likelihood that the quantity of molecules of each variable (or species) in the system will assume certain values x within the state space Ω at time t, given an initial state x0 at time t0. Here, x and x0 are n-dimensional vectors representing the number of molecules of the n molecular species in the network.
Table 2.1 A historical selection of mathematical models and computer simulations (both stochastic and deterministic) of the Drosophila circadian clock over the past two decades

Authors, date | Det. vs. stoch. | Methods/equations | Molecules/loops
Goldbeter (1995) | Deterministic | Michaelis–Menten and Hill-type | PER negative feedback loop
Leloup and Goldbeter (1998) | Deterministic | Michaelis–Menten and Hill-type | PER/TIM negative feedback loop
Ueda et al. (2001) | Deterministic | Michaelis–Menten and Hill-type | Interlocked PER/TIM and CLK–CYC negative feedback loops
Smolen et al. (2001) | Deterministic | Michaelis–Menten and Hill-type with delay terms | Interlocked PER feedback loop (negative) and CLK feedback loop (positive)
Gonze et al. (2002) | Stochastic | Birth and death chain governed by a CME, solved using the Gillespie algorithm | PER negative feedback loop
Smolen et al. (2002) | Deterministic/stochastic (comparison) | Michaelis–Menten with delay terms (deterministic); birth and death chain governed by a CME, solved using both the Gillespie algorithm and a fixed-time algorithm (stochastic) | Interlocked PER feedback loop (negative) and CLK feedback loop (positive)
Gonze et al. (2003) | Stochastic | Birth and death chain governed by a CME, solved using the Gillespie algorithm | PER/TIM negative feedback loop
Smolen et al. (2004) | Deterministic | Michaelis–Menten and Hill-type | Interlocked PER and VRILLE negative feedback loops and PDP1 positive feedback loop
Leise and Moin (2007) | Deterministic | Michaelis–Menten and Hill-type | Interlocked PER/TIM and CLK–CYC negative feedback loops
Xie and Kulasiri (2007) | Deterministic | First-order reactions with binding probability equations | Interlocked PER/TIM and VRILLE negative feedback loops and PDP1 positive feedback loop
Li and Lang (2008) | Deterministic/stochastic (comparison) | Michaelis–Menten with time delays (deterministic); birth and death chain governed by a CME, solved using the Gillespie algorithm and the Gillespie Chemical Langevin Equation (Gillespie, 2000) (stochastic) | Interlocked PER feedback loop (negative) and CLK feedback loop (positive)
Fathallah-Shaykh et al. (2009) | Deterministic | Regulatory equations with sigmoidal and logistic functions | Interlocked PER/TIM, VRILLE, and CWO negative feedback loops and the PDP1 positive feedback loop
Unfortunately, the time evolution of this function, ∂P(x, t | x0, t0)/∂t, cannot be solved analytically for a system as complex as the Drosophila circadian network (Szallasi et al., 2006). These models, therefore, must utilize predesigned algorithms in order to produce numerical simulations of the CME. To predict the system's advancement from one reaction state to the next, most of these models implement the stochastic simulation algorithm (SSA), or Gillespie algorithm, which is capable of producing numerical solutions to the CME (Gillespie, 1977; Szallasi et al., 2006). This algorithm is a discrete method of analysis that replaces the differentiable concentrations of each molecular species in the deterministic model with the probabilities that a given reaction will occur in a given time step. These "transitional probabilities" are proportional to the number of molecules of each species in the network as well as the kinetic rate parameters from the deterministic model (Gonze et al., 2002). Using transition probabilities, the algorithm determines stochastically the next reaction that will take place as well as the time interval to the next reaction (Li and Lang, 2008). Meanwhile, the transition probabilities and numbers of molecules are updated at each time step, as different species are produced and consumed in what is commonly referred to as a birth and death chain (Levin et al., 2009). A more detailed description of these methods can be found in the models listed in Table 2.1, as well as the sources referenced above. Several of the stochastic models listed in Table 2.1 introduce slight variations to the Gillespie algorithm by advancing the system by a predetermined time step t, instead of the time to the next reaction event (Li and Lang, 2008; Smolen et al., 2002). Intended to increase computational speed, this coarser-grained approximation is equally capable of producing sustained oscillations under proper choices of t (Szallasi et al., 2006).
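The two stochastic choices described above (an exponential waiting time set by the summed propensities, then a reaction drawn in proportion to its propensity) are short enough to sketch. The toy below applies the direct method to a single birth and death species, a constitutively transcribed mRNA with first-order degradation; the rate constants are arbitrary illustrative values, not drawn from any of the circadian models in Table 2.1.

```python
import math
import random

def gillespie_birth_death(k_birth=2.0, k_death=0.1, x0=0, t_end=100.0, seed=1):
    """Direct-method SSA for a birth-death chain:
    transcription (0 -> X) with propensity k_birth, and
    degradation (X -> 0) with propensity k_death * x."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    trajectory = [(t, x)]
    while t < t_end:
        a_birth = k_birth          # transition probability per unit time of a birth
        a_death = k_death * x      # transition probability per unit time of a death
        a_total = a_birth + a_death
        # Waiting time to the next reaction: exponential with rate a_total.
        t += -math.log(1.0 - rng.random()) / a_total
        # Pick the reaction with probability proportional to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1
        trajectory.append((t, x))
    return trajectory

traj = gillespie_birth_death()
# The molecule count should fluctuate around k_birth / k_death = 20.
```

Because the state is an integer count rather than a concentration, the noise this loop produces is exactly the molecular noise the stochastic studies in Table 2.1 examine; the full circadian models differ only in having many species and reaction channels.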
Nevertheless, numerical analysis of these models becomes very time-consuming as the number of mRNAs and proteins in the network increases (Li and Lang, 2008). As a result, most of these stochastic simulations only incorporate one or two feedback loops of the Drosophila circadian network into their model, thus limiting their ability to perform system-wide analysis of the entire network (Fig. 2.5). For this reason, we focus on the use of continuous and deterministic models as tools to explain the behavior of the system as a whole. Important results, however, can be drawn from the conclusions of these stochastic simulations. Namely, these models are indeed capable of producing sustained oscillations when the numbers of mRNAs and proteins are relatively low, in the tens and hundreds, respectively (Gonze et al., 2002, 2003). Moreover, the results of these simulations largely agree with the predictions of the deterministic model, thus validating its use as a tool for system-wide analysis of the network (Gonze et al., 2002). Given the aforementioned limitations of stochastic simulations, however, and our desire to resolve a system-wide anomaly in the Drosophila circadian network, we now turn to the use of continuous and deterministic models of regulatory networks.
3. Comparative Analysis of Three Network Regulatory Models

Having established the advantages of continuous deterministic models of regulatory networks in the previous section, this section presents examples of three such models, which have been used to simulate the Drosophila circadian clock over the last 15 years. More specifically, we introduce these models within the framework of the discovery of the various components of the circadian network, and we offer analysis relating the methods used to the known complexity of the system. In our third example, we introduce a new and simpler network model that is capable of both addressing the complicated dynamics of the system and answering important questions about the relationships among various components of the circadian network. In an effort to optimize our presentation of the mathematics used in the second and third models, we have elected to use reduced versions of the current network model shown in Fig. 2.5, Section 2. For comparison, these reduced networks each consist of a single positive and a single negative feedback loop.
3.1. Michaelis–Menten enzyme kinetics: "The gold standard"

In 1995, Goldbeter proposed a model for a single negative feedback loop exerted by the PER protein in the Drosophila circadian clock on the transcription of its own per mRNA (Goldbeter, 1995). Many of the other transcription factors, proteins, and feedback loops were unknown at the time. However, scientists had made important advances in elucidating the underlying mechanism of cellular rhythmic behavior, which was and continues to be associated with the periodic repression of the network's various components. The model (Fig. 2.1, Section 2) correctly hypothesizes that multiple phosphorylation of the PER protein (P0, P1, P2) in the cytoplasm delays entry of the protein (PN) into the nucleus (dashed line), where it then represses the transcription of its own mRNA. Since the publication of this model, further analysis has revealed that the system is more complicated than a single negative feedback loop. Despite limited knowledge of the Drosophila circadian clock, however, Goldbeter succeeded in using his model to produce sustained periodic oscillations of both total PER protein (PT = P0 + P1 + P2 + PN) and per mRNA (per) through numerical integration of a system of five differential equations. These equations estimate the rates of change of the five system molecules (P0, P1, P2, PN, MP) by employing Michaelis–Menten and Hill-type enzyme kinetics, a method that we refer to as the "Gold Standard" given its frequent use among the continuous deterministic models listed in Table 2.1.
In 1913, chemists Michaelis and Menten used rate equations and the Law of Mass Action to conclude that the rate of any enzyme-catalyzed reaction can be expressed as

−dS/dt = dP/dt = k2·E0·S / (Km + S) = kmax·S / (Km + S)

where S is the concentration of substrate that reacts with the enzyme (E0) to produce the product (P), and where Km = (k−1 + k2) / k1 is known as the Michaelis–Menten constant (Michaelis and Menten, 1913). A detailed derivation of this equation can be found in a number of sources, including Murray (2002) and Ellner and Guckenheimer (2006). The reaction diagram in Fig. 2.6 illustrates the transformation of substrate S into product P, according to reaction rates k1, k−1, and k2. Comparison of these equations with Goldbeter's equations reveals that both degradation and phosphorylation rates of the molecules (P0, P1, P2, per) in his model adhere to Michaelis–Menten kinetics, while translation and nuclear transport of the PER protein are linear according to the Law of Mass Action, which states that the translation (or transport) of a protein occurs at a rate proportional to the concentration of its mRNA (or protein, respectively):

[transcription] − [degradation]
d(per)/dt = vs·K1^n / (K1^n + PN^n) − vm·per / (Km + per)    (2.1)

[translation] − [phos. P0 → P1] + [unphos. P1 → P0]
dP0/dt = ks·per − V1·P0 / (K1 + P0) + V2·P1 / (K2 + P1)    (2.2)

[phos. P0 → P1] − [unphos. P1 → P0] − [phos. P1 → P2] + [unphos. P2 → P1]
dP1/dt = V1·P0 / (K1 + P0) − V2·P1 / (K2 + P1) − V3·P1 / (K3 + P1) + V4·P2 / (K4 + P2)    (2.3)

[phos. P1 → P2] − [unphos. P2 → P1] − [nuc. transport] + [cyto. transport] − [degradation]
dP2/dt = V3·P1 / (K3 + P1) − V4·P2 / (K4 + P2) − k1·P2 + k2·PN − vd·P2 / (Kd + P2)    (2.4)

[nuc. transport] − [cyto. transport]
dPN/dt = k1·P2 − k2·PN    (2.5)

S + E ⇌ ES → P + E (forward binding rate k1, unbinding rate k−1, catalytic rate k2)

Figure 2.6 Enzyme-catalyzed transformation of substrate S into product P.
Equation (2.2), for example, models the rate of change in concentration of the unphosphorylated protein P0. The protein is produced at a rate ks, proportional to the concentration of cytoplasmic per mRNA. In the second term, P0 is phosphorylated into P1 according to Michaelis–Menten kinetics, where V1 = vmax and K1 is the Michaelis–Menten constant. Likewise, P1 is dephosphorylated back into P0 in a similar fashion, with V2 = vmax and K2 as the Michaelis–Menten constant. In order to model the negative feedback by PER on the concentration of per mRNA, Goldbeter introduced a Hill-type equation with degree of binding cooperativity n = 4. The equation H(PN) = K1^n / (K1^n + PN^n) produces a sigmoidal function that assumes a value close to H(PN) = 1, and hence the maximum rate of transcription vs, when the concentration of PN is low. As the concentration of nuclear PER increases, H(PN) → 0, and the rate of transcription becomes negligible. Hence, in Goldbeter's Eq. (2.1), the transcription of per mRNA is regulated by a Hill sigmoidal function, and the molecule degrades at a rate vm·MP / (Km + MP), where vm is the maximum rate of degradation and Km is the Michaelis–Menten constant. Goldbeter selected parameter values that would produce sustained oscillations of PER and per with a period close to 24 h, and his mathematical model provides an excellent illustration of the regulatory effects of an isolated negative feedback loop. In 1998, he and Leloup expanded this model to incorporate the formation of the PER/TIM complex (Fig. 2.2, Section 2), and more recent discoveries of additional negative and positive feedback loops in the circadian network have prompted even more detailed mathematical models that employ Michaelis–Menten and Hill-type kinetics (see Table 2.1).
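As a concrete illustration, Eqs. (2.1) through (2.5) can be integrated numerically with a simple forward-Euler scheme. The sketch below is dependency-free and deliberately crude; the parameter values are quoted from memory of Goldbeter (1995) and should be verified against the original paper before any serious reuse.

```python
def goldbeter_rhs(state, p):
    """Right-hand sides of Eqs. (2.1)-(2.5): per mRNA (M) and the PER
    forms P0, P1, P2 (cytoplasmic) and PN (nuclear)."""
    M, P0, P1, P2, PN = state
    # Hill-repressed transcription minus Michaelis-Menten degradation, Eq. (2.1).
    dM = (p["KI"] ** p["n"] / (p["KI"] ** p["n"] + PN ** p["n"])) * p["vs"] \
         - p["vm"] * M / (p["Km"] + M)
    v1 = p["V1"] * P0 / (p["K1"] + P0)   # phosphorylation P0 -> P1
    v2 = p["V2"] * P1 / (p["K2"] + P1)   # dephosphorylation P1 -> P0
    v3 = p["V3"] * P1 / (p["K3"] + P1)   # phosphorylation P1 -> P2
    v4 = p["V4"] * P2 / (p["K4"] + P2)   # dephosphorylation P2 -> P1
    dP0 = p["ks"] * M - v1 + v2
    dP1 = v1 - v2 - v3 + v4
    dP2 = v3 - v4 - p["k1"] * P2 + p["k2"] * PN - p["vd"] * P2 / (p["Kd"] + P2)
    dPN = p["k1"] * P2 - p["k2"] * PN
    return [dM, dP0, dP1, dP2, dPN]

# Approximate Goldbeter (1995) values (recalled, not copied; verify before reuse).
params = dict(vs=0.76, vm=0.65, Km=0.5, ks=0.38, vd=0.95, Kd=0.2,
              k1=1.9, k2=1.3,
              KI=1.0, n=4,   # KI plays the role of K1 in Eq. (2.1)
              V1=3.2, V2=1.58, V3=5.0, V4=2.5,
              K1=2.0, K2=2.0, K3=2.0, K4=2.0)

state = [0.1, 0.25, 0.25, 0.25, 0.25]   # arbitrary initial concentrations
dt = 0.01                                # hours; forward Euler, illustration only
for _ in range(int(72 / dt)):            # simulate 72 h
    state = [x + dt * dx for x, dx in zip(state, goldbeter_rhs(state, params))]
```

A production implementation would replace the Euler loop with an adaptive integrator such as scipy.integrate.solve_ivp; Euler is used here only to keep the sketch self-contained.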
Some of these latest models include up to five interlocking positive and negative loops, the dynamics of which can grow quite complicated and become difficult to verify when using equations like the ones above. Other criticisms of the use of Michaelis–Menten kinetics to model protein–protein interactions in regulatory networks have been raised by Maini et al. (1991) and Sabouri-Ghomi et al. (2008). They point out that the Michaelis–Menten rate law assumes that the concentration of the regulating element, or enzyme, is much less than the concentration of the element being regulated, or substrate (see Fig. 2.6). Hence, in Eq. (2.2), for example, one must paradoxically assume that the concentrations of unphosphorylated PER protein (P0) and phosphorylated PER protein (P1) are simultaneously in excess of one another. Nevertheless, as in Goldbeter's model, these models can be quite effective in reproducing oscillations of clock molecules with 24-h peak-to-peak times, as observed in nature.
3.2. A probabilistic model for the Drosophila circadian clock

In 2006, with the knowledge of additional positive and negative regulatory loops in the Drosophila circadian network, Xie and Kulasiri proposed a new and robust mathematical model with a set of equations that incorporate
first-order reactions for translation, degradation, and dissociation, as opposed to Michaelis–Menten and Hill-type equations (Xie and Kulasiri, 2007). In their model, binding probabilities of transcription factors to E-box determine the rates of transcription of the mRNAs, rather than Hill-type equations, which they argue imply an unnatural "switch-like behavior" of transcription. Xie and Kulasiri also use first-order linear expressions to model both translation and degradation, as opposed to Michaelis–Menten equations. The schematic diagram in Fig. 2.4 from Section 2 describes the feedback loops included in their model. CLK–CYC activates the transcription of four genes: pdp1, vri, tim, and per. The protein PDP1 exerts a positive feedback loop on the transcription of its regulator by binding to the V/P box for the clk gene. VRI, however, competes with PDP1 as a transcriptional repressor of clk, thereby exerting a negative feedback loop on the system. PER and TIM also introduce negative feedback loops into the network by inhibiting the transcriptional effects of CLK–CYC. Xie and Kulasiri make several assumptions in their model. For example, unlike the Goldbeter model, they disregard separate nuclear and cytoplasmic compartments, and hence the nuclear transport of proteins like PER–TIM. Also, for simplification, they do not consider phosphorylation of proteins but instead focus on the effects of transcriptional regulation. Even with this simplification, however, their entire model consists of 13 rate equations for the different proteins and mRNAs (CYC is assumed to be constant) and 5 binding probability equations. In an effort to explain their method, we will consider the reduced schematic diagram shown in Fig. 2.7, which only consists of one positive PDP1 loop and one negative VRI loop. Notice that even with this reduced model, there are already 28 parameters to account for in the system (Fig. 2.7).
With the exception of CYC (assumed constant), each of these molecules is associated with a degradation rate (di). Other parameters include binding rates (bi) to and unbinding rates (ui) from E-boxes, translation rates (ki), transcription rates (ai and si), association/dissociation rates of the CLK–CYC dimer (vi), and finally the number of E-boxes in gene promoter regions (ngene). The number of E-boxes in the vri promoter (nvri) and the pdp1 promoter (npdp1) is 4 and 6, respectively. Xie and Kulasiri also include rate of change equations for the binding probabilities of transcription factors to promoter regions, or E-boxes. Here, the fraction of E-boxes occupied by CLK–CYC in the promoter regions of vri and pdp1 is Pr1 and Pr2, respectively. Hence, (1 − Pr1) and (1 − Pr2) represent the fraction of unoccupied E-boxes in the vri and pdp1 gene promoter regions, respectively. Finally, Pr3 and Pr4 denote the fraction of VP-boxes occupied by VRI and PDP1, respectively, in the promoter region of the clk gene, which encodes CLK. Figure 2.8 provides a pictorial cartoon of this interaction between transcriptional regulator CLK–CYC (C/C) and the promoter region of the vri gene. The solid purple balls represent bound and unbound
b1 = binding rate of CLK–CYC to E-boxvri; b2 = binding rate of CLK–CYC to E-boxpdp1; b3 = binding rate of VRI to VP-boxclk; b4 = binding rate of PDP1 to VP-boxclk
u1, u2 = unbinding rates of CLK–CYC from E-boxvri and E-boxpdp1; u3 = unbinding rate of VRI from VP-boxclk; u4 = unbinding rate of PDP1 from VP-boxclk
d1, d2, d3 = degradation rates of vri, pdp1, and clk mRNA; d4, d5, d6 = degradation rates of CLK, VRI, and PDP1; d7 = degradation rate of the CLK–CYC dimer
k1, k2, k3 = translation rates of vri, pdp1, and clk mRNA
a1, a2 = transcription rates of the CLK/CYC-activated vri and pdp1 genes; a3 = transcription rate of the PDP1-activated clk gene
s1 = transcription rate of the VRI-repressed clk gene; s2 = transcription rate of the deactivated vri and pdp1 genes; s3 = basal transcription rate of the unbound clk gene
v1 = association rate of the CLK/CYC dimer; v2 = dissociation rate of the CLK/CYC dimer
nvri = number of E-boxes in the vri promoter; npdp1 = number of E-boxes in the pdp1 promoter

Figure 2.7 Reduced diagram of the circadian network model. One negative loop (−) and one positive loop (+) regulate this network of eight molecules.
CLK–CYC heterodimers, which bind to E-box sites (yellow boxes) to enhance vri transcription. In this figure,

Pr1 = (CC-bound E-boxes) / (total vri E-boxes) = 2/4
The binding probability equations for our truncated system are listed below. In these equations, the CLK–CYC dimer is denoted as CC. Each of these equations takes the general form: rate of change = binding rate − unbinding rate. For example, the first term, or binding rate, in the rate of change Eq. (2.6) for the binding probability of CC to the promoter region of the vri gene is the fraction of unoccupied E-box, multiplied by the binding rate of CC (b1) and the concentration of CC available to bind to these unoccupied sites. The second term, or unbinding rate, in the equation measures the decrease in bound E-box sites, which is the product of the
Figure 2.8 Cartoon depicting activity in the promoter region of the vri gene. The fraction of occupied E-boxes over the total number of E-boxes, or Pr1, is 2:4 (or 1/2) in this diagram of the vri gene and its promoter region. The rate of change in this ratio is the binding rate of unbound C/C minus the unbinding rate of the bound C/C (dPr1/dt = binding rate − unbinding rate). The rate of vri transcription is the sum of its C/C-activated rate and its basal transcription rate (the transcription rate of unbound E-box).
fraction of occupied E-box and the unbinding rate of CC (u1). In Fig. 2.8, the binding rate at this particular time is 0.5·b1·CC, and the unbinding rate is 0.5·u1. A more detailed derivation of these probability rate equations can be found in the Appendix of Xie and Kulasiri (2006):

[binding rate] − [unbinding rate]
d(Pr1)/dt = (1 − Pr1)·b1·CC − Pr1·u1    (2.6)

d(Pr2)/dt = (1 − Pr2)·b2·CC − Pr2·u2    (2.7)

d(Pr3)/dt = (1 − Pr3 − Pr4)·b3·VRI − Pr3·u3    (2.8)

d(Pr4)/dt = (1 − Pr4 − Pr3)·b4·PDP1 − Pr4·u4    (2.9)
The remaining rate equations for the change in concentrations of the mRNA and protein molecules are shown below. Equations (2.10)–(2.12) represent the rates of change of the mRNA concentrations, while Eqs. (2.13)–(2.16) model the change in concentration of the four proteins and protein complexes. Recall that the concentration of CYC is assumed to be constant in these equations:

[repressed transcription] + [activated transcription] + [basal transcription] − [degradation]
d(clk)/dt = Pr3·s1 + Pr4·a3 + (1 − Pr3 − Pr4)·s3 − d3·clk    (2.10)
[activated transcription] + [basal transcription] − [degradation]
d(vri)/dt = (1 − (1 − Pr1)^4)·a1 + (1 − Pr1)^4·s2 − d1·vri    (2.11)

[activated transcription] + [basal transcription] − [degradation]
d(pdp1)/dt = (1 − (1 − Pr2)^6)·a2 + (1 − Pr2)^6·s2 − d2·pdp1    (2.12)

[translation] − [association CC] + [dissociation CC] − [degradation]
d(CLK)/dt = k3·clk − v1·CLK + v2·CC − d4·CLK    (2.13)

[translation] − [degradation]
d(VRI)/dt = k1·vri − d5·VRI    (2.14)

[translation] − [degradation]
d(PDP1)/dt = k2·pdp1 − d6·PDP1    (2.15)

[association CC] − [dissociation CC] − [degradation]
d(CC)/dt = v1·CLK − v2·CC − d7·CC    (2.16)
In Eq. (2.10), for example, the first term is the rate of transcription of VRI-bound clk VP-box; the second term is the rate of transcription of PDP1-bound clk VP-box; and the third term is the basal transcription rate of unbound VP-box. The last term in every equation is the rate of degradation of each molecule (di·molecule), which is a linear rate, as opposed to the Michaelis–Menten expressions used in Goldbeter's model. In Eq. (2.11), (1 − Pr1)^4 is the probability that all four vri gene E-boxes are unbound, and likewise, 1 − (1 − Pr1)^4 is the probability that at least one of the four E-boxes is bound by CLK–CYC. Hence, the first term, or CC-activated transcription, is described by (1 − (1 − Pr1)^4)·a1, where a1 is the transcription rate of a CC-saturated vri promoter region, when Pr1 = 1. Note that when Pr1 = 1, all E-boxes are bound, and hence the expression 1 − (1 − Pr1)^4 = 1. The expression (1 − Pr1)^4·s2 describes the basal transcription rate of vri, which is equal to s2 when all E-boxes are unoccupied, and hence Pr1 = 0. Equations (2.13)–(2.16) follow simple first-order reaction rates. For example, in Eq. (2.13), the first term is the rate of translation of CLK, the second term is the association rate of the CLK–CYC (CC) complex, the third term is the dissociation rate of CC, and the last term is the degradation rate of CLK. Like Goldbeter's model, which employed very different types of rate equations, this model is capable of producing sustained 24-h periodic oscillations of the interacting circadian network molecules. Also similar to other proposed models in Table 2.1, this model replicates observed biology by
responding to changes in light cycles with phase adjustments (or entrainment) in the periodic rhythms. The results of various simulated mutations in the per gene were also consistent with experimental observations (Xie and Kulasiri, 2007). Hence, this model validates the idea that models with very different mathematical equations can replicate the same observed biological behaviors of an intracellular regulatory network (Murray, 2002). Accordingly, when studying a regulatory network, it is prudent to select an appropriate model, which best serves as a tool for exploration and understanding of the question at hand. With this in mind, we now introduce a new intracellular network model whose appeal lies in its relative simplicity in comparison to the previously introduced models.
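The rate laws in Eqs. (2.6) through (2.16) transcribe almost line for line into code. The sketch below integrates the reduced system with forward Euler; every numerical value is a placeholder chosen only to keep this toy integration stable, not one of Xie and Kulasiri's fitted parameters.

```python
def reduced_xk_rhs(s, p):
    """Right-hand sides of Eqs. (2.6)-(2.16) for the reduced vri/pdp1/clk network.
    State: binding probabilities Pr1..Pr4; clk, vri, pdp1 mRNAs; CLK, VRI,
    PDP1 proteins; and the CLK-CYC dimer (CC). CYC is held constant, so its
    concentration is folded into the association term, as in Eq. (2.13)."""
    Pr1, Pr2, Pr3, Pr4, clk, vri, pdp1, CLK, VRI, PDP1, CC = s
    return [
        (1 - Pr1) * p["b1"] * CC - Pr1 * p["u1"],                                   # (2.6)
        (1 - Pr2) * p["b2"] * CC - Pr2 * p["u2"],                                   # (2.7)
        (1 - Pr3 - Pr4) * p["b3"] * VRI - Pr3 * p["u3"],                            # (2.8)
        (1 - Pr4 - Pr3) * p["b4"] * PDP1 - Pr4 * p["u4"],                           # (2.9)
        Pr3 * p["s1"] + Pr4 * p["a3"] + (1 - Pr3 - Pr4) * p["s3"] - p["d3"] * clk,  # (2.10)
        (1 - (1 - Pr1) ** 4) * p["a1"] + (1 - Pr1) ** 4 * p["s2"] - p["d1"] * vri,  # (2.11)
        (1 - (1 - Pr2) ** 6) * p["a2"] + (1 - Pr2) ** 6 * p["s2"] - p["d2"] * pdp1, # (2.12)
        p["k3"] * clk - p["v1"] * CLK + p["v2"] * CC - p["d4"] * CLK,               # (2.13)
        p["k1"] * vri - p["d5"] * VRI,                                              # (2.14)
        p["k2"] * pdp1 - p["d6"] * PDP1,                                            # (2.15)
        p["v1"] * CLK - p["v2"] * CC - p["d7"] * CC,                                # (2.16)
    ]

# Placeholder rate constants (not the published, fitted values).
p = dict(b1=0.5, b2=0.5, b3=0.5, b4=0.5, u1=0.2, u2=0.2, u3=0.2, u4=0.2,
         a1=1.0, a2=1.0, a3=1.0, s1=0.05, s2=0.05, s3=0.1,
         k1=0.8, k2=0.8, k3=0.8, v1=0.5, v2=0.1,
         d1=0.2, d2=0.2, d3=0.2, d4=0.2, d5=0.2, d6=0.2, d7=0.2)

state = [0.0, 0.0, 0.0, 0.0] + [0.1] * 7   # empty promoters, trace molecules
dt = 0.01
for _ in range(int(48 / dt)):              # 48 simulated hours
    state = [x + dt * dx for x, dx in zip(state, reduced_xk_rhs(state, p))]
```

Note how the binding probabilities Pr1 through Pr4 remain confined to [0, 1] by construction: their gain terms vanish as the occupied fraction approaches one, which is the structural feature Xie and Kulasiri use in place of Hill functions.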
3.3. A new regulatory network model

In this section, we introduce a new set of equations, originally presented by Fathallah-Shaykh et al. (2009) to model the CWO-expanded network in Fig. 2.5 of Section 2. For comparison, however, we will only use these equations to model a simplified version of the full network. The schematic diagram in Fig. 2.9, shown below, displays one positive and one negative
Figure 2.9 Reduced diagram of the circadian network model in Fig. 2.5, Section 2. One negative loop (−) and one positive loop (+) regulate this network of five molecules. Parameter legend: $a_{CC\text{-}vri}$ = strength of CC-activation of vri; $a_{vri\text{-}VRI}$ = strength of vri-activation of VRI; $a_{VRI\text{-}CC}$ = strength of VRI-repression of CC; $a_{CC\text{-}pdp1}$ = strength of CC-activation of pdp1; $a_{pdp1\text{-}PDP1}$ = strength of pdp1-activation of PDP1; $a_{PDP1\text{-}CC}$ = strength of PDP1-activation of CC; $b_1$–$b_5$ = maximum rates of formation of vri mRNA, pdp1 mRNA, VRI, PDP1, and the CLK–CYC dimer; $s_1$–$s_5$ = saturation levels of vri mRNA, pdp1 mRNA, VRI, PDP1, and the CLK–CYC dimer; $d_1$–$d_5$ = degradation rates of vri mRNA, pdp1 mRNA, VRI, PDP1, and the CLK–CYC dimer.
Modeling of Regulatory Networks
feedback loop, with a shorter list of 11 parameters, in comparison to the 28 parameters used in the previous model. Note also that this model only includes five network molecules, as opposed to eight, thus reducing the number of rate equations and simplifying the task of numerical computation. The saturation levels of each molecule at its site of action ($s_i$) and the degradation rates of the molecules ($d_i$) are fixed at 100% and 0.01, respectively. Hence, each of the five network molecules is associated with one individual parameter, $b_i$, which measures the maximum rate of formation of the molecule. The remaining six parameters measure the activation and repression strengths of each molecular interaction in the diagram.

This model (Fig. 2.9) preserves the fundamental structures, or modules, of the previous schematic diagram (Fig. 2.7), while reducing intermediary tasks, like phosphorylation and E-box binding, to single rate constants. Module identification in genetic regulatory networks can help reduce a dynamic system to a simpler network that maintains the important molecular interactions while preserving the observed biology of the system (Szallasi et al., 2006). In System Modeling in Cellular Biology, Szallasi, Stelling, and Periwal explain that modules in genetic regulatory networks can be characterized by their circular feedback structure, in which each molecule in a particular module is linked to at least two other molecules in that same modular structure. Figure 2.9 is a reduced model of the more detailed circadian network shown in Fig. 2.5 (Section 2). In this modified version of the model, the positive PDP1-feedback loop and the negative VRI-feedback loop represent two separate modules. Unlike the previous model, however, this reduced diagram does not ignore compartmentalization of molecules and diffusion rates across different cellular membranes. Rather, it incorporates these rates into single activation/repression parameters. In the VRI loop, for example, CLK–CYC-activation of vri transcription, vri mRNA translation, and VRI-repression of CLK–CYC have all been reduced to single activation/repression parameters ($a_{CC\text{-}vri}$, $a_{vri\text{-}VRI}$, $a_{VRI\text{-}CC}$). These rates ($a_{x\text{-}y}$) are either positive (+) or negative (−), depending on whether the action of molecule $x$ on molecule $y$ is positive (activation) or negative (repression). Hence, we can dissect $a_{vri\text{-}VRI}$ into a sum of several positive growth rates, which include the rate of translation of VRI and the rate of transport of VRI to the nucleus, where it then represses CLK–CYC. An analogy can be made to a car factory that puts ready-to-sell cars onto a car dealer's lot at a rate $a$; in this case, $a$ is the sum of the rate at which the factory produces cars and the rate at which those cars are transported to the dealer's lot. As a result, the five molecular symbols (vri, VRI, etc.) denote only the concentration of each molecule at the site of its action in the network system (Fathallah-Shaykh et al., 2009). However, if one desires to further investigate the influence of nuclear transport on the dynamics of the network, these parameters can be broken down into their
components, and one can compare cytoplasmic and nuclear concentrations of each molecule. Another advantage of this simplified system is that the binding probabilities from Xie and Kulasiri (2007) are incorporated directly into the rate equations for each molecule. The rate of change of each molecule takes the general form

$$\frac{dx_i(t)}{dt} = b_i\,\frac{u_i(t)}{\sqrt{1 + u_i(t)^2}}\;x_i(t)\,\bigl(s_i - x_i(t)\bigr) \qquad (2.17)$$

In this general Eq. (2.17), $n$ represents the total number of molecules in the system (in this case, $n = 5$), and

$$u_i(t) = \sum_{j=1}^{n} a_{ji}\,x_j(t) - d_i\,x_i(t), \qquad 1 \le i \le n,$$

models the overall sum of the regulatory influences, including degradation, on a given molecule $x_i$ at time $t$. If molecule $j$ does not regulate molecule $i$, then $a_{ji} = 0$, a setting that is also used to simulate null mutations of specific circadian network genes in Fathallah-Shaykh et al. (2009). Equations (2.18)–(2.22) are the five rate-of-change equations needed to analyze the system shown in Fig. 2.9; in each $u$, the first bracketed term is the activation/repression signal and the second is the degradation signal:

$$\frac{d(vri)}{dt} = b_1\,\frac{u_{vri}}{\sqrt{1 + u_{vri}^2}}\,(vri)\,(s_1 - vri), \qquad u_{vri} = [a_{CC\text{-}vri}\,CC] - [d_1\,vri] \qquad (2.18)$$

$$\frac{d(pdp1)}{dt} = b_2\,\frac{u_{pdp1}}{\sqrt{1 + u_{pdp1}^2}}\,(pdp1)\,(s_2 - pdp1), \qquad u_{pdp1} = [a_{CC\text{-}pdp1}\,CC] - [d_2\,pdp1] \qquad (2.19)$$

$$\frac{d(VRI)}{dt} = b_3\,\frac{u_{VRI}}{\sqrt{1 + u_{VRI}^2}}\,(VRI)\,(s_3 - VRI), \qquad u_{VRI} = [a_{vri\text{-}VRI}\,vri] - [d_3\,VRI] \qquad (2.20)$$

$$\frac{d(PDP1)}{dt} = b_4\,\frac{u_{PDP1}}{\sqrt{1 + u_{PDP1}^2}}\,(PDP1)\,(s_4 - PDP1), \qquad u_{PDP1} = [a_{pdp1\text{-}PDP1}\,pdp1] - [d_4\,PDP1] \qquad (2.21)$$

$$\frac{d(CC)}{dt} = b_5\,\frac{u_{CC}}{\sqrt{1 + u_{CC}^2}}\,(CC)\,(s_5 - CC), \qquad u_{CC} = [(a_{PDP1\text{-}CC}\,PDP1) + (a_{VRI\text{-}CC}\,VRI)] - [d_5\,CC] \qquad (2.22)$$
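To make the structure of Eqs. (2.17)–(2.22) concrete, the following minimal Python sketch integrates the five-molecule system with a forward-Euler step. The $b_i$ values and the activation/repression strengths below are illustrative assumptions, not the parameters fitted by Fathallah-Shaykh et al. (2009); only $s_i = 1$ (100%) and $d_i = 0.01$ follow the text.

```python
import math

# Minimal sketch of Eqs. (2.17)-(2.22): the five-molecule reduced model,
# integrated with a forward-Euler step. b_i and a_xy values are assumed
# for illustration; s_i = 1 and d_i = 0.01 as stated in the text.

S = [1.0] * 5                    # saturation levels s_i
D = [0.01] * 5                   # degradation rates d_i
B = [0.5, 0.5, 0.5, 0.5, 0.5]    # maximum formation rates b_i (assumed)

# Indices: 0 = vri, 1 = pdp1, 2 = VRI, 3 = PDP1, 4 = CC.
# A[j][i] = strength of the action of molecule j on molecule i (a_ji).
A = [[0.0] * 5 for _ in range(5)]
A[4][0] = 2.0    # a_CC-vri    : CC activates vri         (assumed value)
A[4][1] = 2.0    # a_CC-pdp1   : CC activates pdp1        (assumed value)
A[0][2] = 2.0    # a_vri-VRI   : vri activates VRI        (assumed value)
A[1][3] = 2.0    # a_pdp1-PDP1 : pdp1 activates PDP1      (assumed value)
A[3][4] = 1.0    # a_PDP1-CC   : PDP1 activates CC        (assumed value)
A[2][4] = -2.0   # a_VRI-CC    : VRI represses CC (negative sign)

def g(u):
    """Odd sigmoidal function g(u) = u / sqrt(1 + u^2), bounded by -1 and 1."""
    return u / math.sqrt(1.0 + u * u)

def derivatives(x):
    """dx_i/dt = b_i * g(u_i) * x_i * (s_i - x_i), the form of Eq. (2.17)."""
    dx = []
    for i in range(5):
        u = sum(A[j][i] * x[j] for j in range(5)) - D[i] * x[i]
        dx.append(B[i] * g(u) * x[i] * (S[i] - x[i]))
    return dx

def integrate(x0, dt=0.01, steps=10_000):
    x = list(x0)
    for _ in range(steps):
        x = [xi + dt * dxi for xi, dxi in zip(x, derivatives(x))]
    return x

x_final = integrate([0.2, 0.2, 0.2, 0.2, 0.2])
# The logistic factor x_i * (s_i - x_i) keeps every concentration strictly
# between 0 and its saturation level:
print(all(0.0 < xi < 1.0 for xi in x_final))
```

A production run would substitute an adaptive solver (the chapter integrates in Matlab) for the fixed-step Euler loop, but the wiring of the $u_i$ sums is the same.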
Figure 2.10 Sigmoidal graph of $g(u) = u/\sqrt{1 + u^2}$ against $u$, the sum of regulatory influences. The range of the function $g$ is between −1 and 1: the curve approaches 1 when the positive regulatory influences far outweigh the negative ones (PRI ≫ NRI), crosses 0 when the sum of PRI equals the sum of NRI, and approaches −1 when PRI ≪ NRI.
The odd sigmoidal function, $g(u_i(t)) = u_i(t)/\sqrt{1 + u_i(t)^2}$, is bounded by −1 and 1, and hence the rate of formation and degradation, $b_i\,g(u_i(t))$, of a given product $x_i$ fluctuates between $b_i$, its maximum rate, and $-b_i$ at any given time $t$. Figure 2.10 displays a graphical representation of this function $g$. This function $g$ exhibits behavior similar to the binding probabilities $P_i$ in the previous model, which fluctuate between 0 and 1 depending on the fraction of occupied E-boxes in the promoter region of each gene. Note that $g(u_i(t)) = 0$ when $u_i(t) = 0$, and hence when the sum of the negative regulatory influences (NRI) (degradation and repression) is equal to the sum of the positive regulatory influences (PRI) (activation). The curve approaches $g(u_i(t)) = 1$ when the sum of the PRI is much greater than that of the NRI. Likewise, when repression and degradation outweigh activation, the curve approaches $g(u_i(t)) = -1$, and hence the maximum rate of decay. The overall structure of Eqs. (2.18)–(2.22) is similar to that of a logistic equation, $dx(t)/dt = r\,x(t)\,(s - x(t))$, where $r$ is the growth or decline rate of population $x$ and $s$ is the carrying capacity, or saturation level. However, in the equations used by Fathallah-Shaykh et al. (2009), the constant rate $r$ has been replaced by a varying rate of change, $b_i\,g(u_i(t))$, which depends on the summation of a molecule's regulatory signals at a given time $t$. This varying rate is essential to generating oscillations in the concentrations of each molecular species: $dx_i(t)/dt = b_i\,g(u_i(t))\,x_i(t)\,(s_i - x_i(t))$. At any given time, the population (or concentration of a molecular species) is growing or decaying at a rate $b_i\,g(u_i(t)) = r_i(t)$. Since $g(u_i(t))$ fluctuates between −1 and 1, the varying rate $r_i(t)$ likewise fluctuates between a maximum rate, $r_i(t) = b_i$, and a minimum rate, $r_i(t) = -b_i$.
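The logistic comparison above is easy to verify numerically. The short sketch below integrates $dx/dt = r\,x\,(s - x)$ at constant rates $r$; the step size, iteration counts, and the Euler scheme are illustrative choices, not those of the original paper.

```python
# Quick numerical sketch of the logistic form dx/dt = r * x * (s - x) with a
# constant rate r. Step size and iteration counts are illustrative choices.

def logistic_trajectory(x0, r, s=1.0, dt=0.01, steps=60_000):
    x = x0
    for _ in range(steps):
        x += dt * r * x * (s - x)
    return x

# Positive rate: the population grows toward its saturation level s = 1.
grown = logistic_trajectory(x0=0.01, r=0.35)

# Negative rate: the concentration decays toward 0; x = s is an unstable
# fixed point, so we start just below it. A larger |r| decays faster.
decayed_fast = logistic_trajectory(x0=1.0 - 1e-6, r=-0.35)
decayed_slow = logistic_trajectory(x0=1.0 - 1e-6, r=-0.15)

print(round(grown, 3))              # close to the saturation level 1.0
print(decayed_fast < decayed_slow)  # larger |r| decays faster -> True
```
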
The closer this rate is to the maximum rate of formation, for example, the faster molecule $i$ will reach its saturation level $s_i$. When integrating the simple logistic equation in Matlab with initial condition $x_0$ at time $t = 0$, the graph of the concentration of
molecule $x_i$ approaches either $s_i = 1$ or 0, depending on whether the rate $b_i\,g(u_i(t))$ is positive or negative, respectively. To illustrate this behavior, Fig. 2.11 displays the graphs of several solutions $x(t)$ with varying parameter rates $b_i\,g(u_i(t_1)) = r(t_1)$ and $b_i\,g(u_i(t_2)) = r(t_2)$.

Figure 2.11 Simulations of the logistic equation $dx(t)/dt = r\,x(t)\,(s - x(t))$. In (A), with initial condition $x_0 = 1$, the concentration of molecular species $x(t)$ declines to zero at different negative rates $r(t_1) = -0.0035$ and $r(t_2) = -0.0015$. In (B), with initial condition $x_0 = 0.01$, the population grows and approaches its saturation level 1 with positive rates $r(t_1) = 0.35$ and $r(t_2) = 0.15$. The larger the absolute value of the growth rate $r$, the faster the concentration $x(t)$ approaches 0 or 1.

One of the many advantages of this modeling method is that the regulatory equation $u_i(t)$, which models the time-dependent relationship between the regulatory molecules that act on molecule $i$, has a special property at the peaks and troughs of each molecule's oscillation: here, the rate of change of the molecule is equal to zero. Thus, as long as the concentration of each molecular species never reaches its saturation level or zero, the regulatory equation is also equal to zero, and hence the concentration of each molecule can be represented by a linear equation at these peaks and troughs. This linearity affords a greater understanding of network interactions, particularly when analyzing the comparative strengths of competing regulatory signals on molecule $i$. When a molecule reaches its peak concentration, the rate of change (or derivative) of the function $x_i(t)$ is zero, and hence the sum of the positive and negative regulatory influences must be zero as well. For example, when the concentration of CLK–CYC (CC) is at its peak,

$$u_{CC} = 0 = (a_{PDP1\text{-}CC}\,PDP1) + (a_{VRI\text{-}CC}\,VRI) - (d_5\,CC)$$

and hence,

$$(a_{PDP1\text{-}CC}\,PDP1) + (a_{VRI\text{-}CC}\,VRI) = d_5\,CC$$

The simplicity of the model is also useful when trying to answer a big-picture question about a network, such as why the absence of a direct
repressor CWO might actually cause a decrease in peak levels of certain molecules, like per and PER, which it directly represses. Section 4 explains the activity of CWO in more detail and illustrates how these equations were used in Fathallah-Shaykh et al. (2009) to resolve this Drosophila circadian network anomaly.
4. The CWO Anomaly and a New Network Regulatory Rule

As explained in the history section of this paper, CWO was not identified as a Drosophila circadian network molecule until 2007, and its counterintuitive effects on the overall network are now well documented. Kadener et al. (2007) reported that the protein CWO directly inhibits CLK–CYC-mediated transcription of all direct target genes (per, tim, vri, pdp1), as well as cwo itself, by binding to and repressing E-boxes. Almost simultaneously, Matsumoto et al. (2007) determined that CWO's negative transcriptional effects actually contribute to sustained high-amplitude oscillations of direct target genes, suggesting that CWO behaves as an activator of these direct target genes in the overall network. In 2009, Fathallah-Shaykh and his colleagues devised the system of equations presented in Section 3 to model this CWO-expanded version of the Drosophila circadian network (Fig. 2.5, Section 2). Their goal was to develop a better understanding of CWO's regulatory influences in the overall network and to resolve this apparent anomaly in its behavior as both an activator and a repressor of direct target genes. Parameters for the rate equations (Eq. (2.17), Section 3.3) were chosen so that numerical integration yielded timely peaks of all target mRNAs (per, tim, vri, pdp1, cwo) and proteins, with a period of 24 h, consistent with observed biology. Their model was also robust in the sense that it replicated several other behaviors and control mechanisms intrinsic to the clock, like entrainment. For example, the computer model responded to a 12-h time shift in the light–dark (LD) cycle, simulated by advancing the level of cry mRNA at midnight, with a complete phase reversal over the course of 4.5 days. This entrainment period is consistent with that observed in the mammalian clock in nature (Chen et al., 2008).
Also, enhancing the activity of the transcriptional activator CLK–CYC on direct target genes reproduces the observed increases in peak mRNA levels and shortened periods in constant darkness (DD) (Kadener et al., 2008). Finally, the model duplicates the clock's response to null mutations, simulated in silico by setting the appropriate $a_{x\text{-}y} = 0$. For example, the clock is arrhythmic, or ceases to oscillate, with constant high levels of PER
Figure 2.12 Wild-type (A) and CWO-mutant (B) models of a simplified per feedback loop. (A) CWO directly represses per transcription by competing with the transcriptional enhancing activities of CLK–CYC. (B) In the CWO mutant, only CLK–CYC regulates per transcription. In both models, per represses the activity of CLK–CYC.
and TIM, when a mutation of the DBT-binding domain on PER is introduced ($a_{DBT\text{-}PER/TIM} = 0$) (Kim et al., 2007). Mutations in the cwo gene are also consistent with biological data, resulting in depressed amplitudes in the oscillations of the per, tim, vri, and pdp1 mRNAs. However, the peak level of cwo is higher in the cwo-mutant model than in the wt model, which is also observed in nature (Matsumoto et al., 2007). These observed results present an interesting paradox, as CWO is known to directly repress four direct target genes (per, tim, vri, and pdp1), whose loops all intersect at CLK–CYC. In consequence, intuition might suggest that the peak levels of per, tim, vri, and pdp1 would increase in the absence of CWO. Figure 2.12 presents a simplified network diagram of CWO's interaction with one of these feedback loops in both the wild-type and cwo-mutant models. Supported by biological observations and the results of their model simulations, Fathallah-Shaykh et al. (2009) characterized the unusual effects of CWO on the network, which we now present as a novel network regulatory rule:

Box 2.1. In the network presented in Fig. 2.12A, a weak direct repressor (CWO) of per, in competition with a strong direct activator (CLK–CYC), can become an indirect activator of per in the network, if certain conditions apply (see Box 2.2).
To explain this anomaly, we consider a generic model of the simplified network from Fig. 2.12, where a strong activator (SA) competes with a weak repressor (WR), like CWO, to regulate a target gene (tg), like per, which in turn represses the activity of SA. The illustrative cartoon in Fig. 2.13 provides a phase portrait of this interaction across five discrete time intervals, as the activity of WR is downregulated to simulate a cwo mutation. Although the target gene (tg) initially increases with the downregulation of WR, the
Figure 2.13 Illustrative cartoon of the new regulatory network rule, across five time intervals $t_0$–$t_4$. At time $t_0$, all molecules are oscillating and peaking at normal concentration levels. At $t_1$, we introduce a WR mutation, and WR protein levels drop. This decrease causes a temporary increase in tg concentration at $t_2$, as WR normally competes with SA to bind to and repress the tg E-box. At $t_3$, the concentration of SA declines, since tg inhibits the activity of SA. Finally, at $t_4$, tg levels decline because the decrease in SA (i.e., CC) outweighs the effects of the change in WR (i.e., CWO), and system levels remain low for tg, thus preventing high-amplitude oscillations (see Boxes 2.1 and 2.2).
consequent increased repression of SA causes a long-term decrease in the peak concentration of tg. Mathematical analysis of the model confirms this phenomenon. One of the advantages of the model proposed by Fathallah-Shaykh et al. (2009) is that their master equation (Eq. (2.17), Section 3.3) affords easy analysis of the behavior of the network at peak concentration levels of the direct target genes, like per. As explained in the previous section, since the rate equation's derivative (or rate of change) is zero at the peaks and troughs, the concentration of per at its peak in wt flies is given by:

$$[per]^{WT} = \frac{a_{CC\text{-}per}\,CC^{WT}_{peak} - |a_{CWO\text{-}per}|\,CWO^{WT}_{peak}}{d_{per}} \qquad (2.23)$$

where $CC^{WT}_{peak}$ and $CWO^{WT}_{peak}$ represent the concentrations of CLK–CYC and CWO, respectively, in wt flies at the time when per mRNA concentration peaks (Fig. 2.12A). Likewise, in cwo-mutant flies, the concentration of per at its peak is given by:

$$[per]^{cwo} = \frac{a_{CC\text{-}per}\,CC^{cwo}_{peak} - |a_{CWO\text{-}per}|\,[0]}{d_{per}} = \frac{a_{CC\text{-}per}\,CC^{cwo}_{peak}}{d_{per}} \qquad (2.24)$$

where again, $CC^{cwo}_{peak}$ represents the new concentration of CLK–CYC in cwo-mutant flies at the time when per mRNA concentration peaks, and $CWO^{cwo}_{peak} = 0$ (Fig. 2.12B). Hence, the difference in Eqs. (2.23) and (2.24),
$$[per]^{WT} - [per]^{cwo} = \frac{\bigl[a_{CC\text{-}per}\,\bigl(CC^{WT}_{peak} - CC^{cwo}_{peak}\bigr)\bigr] - \bigl[|a_{CWO\text{-}per}|\,\bigl(CWO^{WT}_{peak} - 0\bigr)\bigr]}{d_{per}}$$

is positive, indicating a decrease in peak per levels in cwo mutants (i.e., $[per]^{WT} > [per]^{cwo}$), precisely when

$$a_{CC\text{-}per}\,\bigl(CC^{WT}_{peak} - CC^{cwo}_{peak}\bigr) > |a_{CWO\text{-}per}|\,\bigl(CWO^{WT}_{peak} - 0\bigr)$$

A more detailed algebraic proof of this relationship can be found in the supporting material for Fathallah-Shaykh et al. (2009), but the results can be summarized with the following inequality (Box 2.2):
Activating signal of strong activator > repressive signal of weak repressor.
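The inequality in Box 2.2 and the peak-balance expressions of Eqs. (2.23)–(2.24) amount to a few lines of arithmetic. In the sketch below, every concentration and strength value is invented for illustration; only the algebraic form comes from the text.

```python
# Numerical illustration of Eqs. (2.23)-(2.24) and the Box 2.2 inequality.
# All concentration and strength values are assumed for illustration only;
# they are not the fitted values of Fathallah-Shaykh et al. (2009).

def per_peak(a_cc_per, cc_peak, a_cwo_per, cwo_peak, d_per):
    """Peak per level from the linear balance at the peak (Eqs. 2.23/2.24)."""
    return (a_cc_per * cc_peak - abs(a_cwo_per) * cwo_peak) / d_per

a_cc_per, a_cwo_per, d_per = 2.0, -0.5, 0.01   # strong activator, weak repressor

# Wild type: CLK-CYC peaks higher because CWO mollifies the negative loops.
per_wt = per_peak(a_cc_per, cc_peak=0.8, a_cwo_per=a_cwo_per,
                  cwo_peak=0.4, d_per=d_per)
# cwo mutant: CWO is absent (cwo_peak = 0), but CLK-CYC peaks lower.
per_cwo = per_peak(a_cc_per, cc_peak=0.6, a_cwo_per=a_cwo_per,
                   cwo_peak=0.0, d_per=d_per)

delta_cc, delta_cwo = 0.8 - 0.6, 0.4 - 0.0
activating = a_cc_per * delta_cc           # signal from the change in SA
repressive = abs(a_cwo_per) * delta_cwo    # signal from the change in WR

# Box 2.2 holds, so the weak direct repressor acts as an indirect activator:
print(activating > repressive)   # True
print(per_wt > per_cwo)          # peak per is higher in wild type -> True
```
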
In other words, if the activating signal from the change in peak levels of SA ($\Delta CC = CC^{WT}_{peak} - CC^{cwo}_{peak}$) is greater than the repressive signal from the change in peak levels of WR ($\Delta CWO = CWO^{WT}_{peak} - 0 = CWO^{WT}_{peak}$), then per will peak at higher levels in the wild-type model. These activating and repressive signals are described by $a_{CC\text{-}per}\,\Delta CC$ and $|a_{CWO\text{-}per}|\,\Delta CWO$, respectively. Hence, if the inequality in Box 2.2 is satisfied and the signal from the competing activator (i.e., CC) is stronger, then the network can transform a WR (i.e., CWO) into an activator (Box 2.1). Indeed, Fathallah-Shaykh et al. (2009) observed that peak levels of CLK–CYC are higher in the wild-type model than in the cwo-mutant model, and thus $\Delta CC = CC^{WT}_{peak} - CC^{cwo}_{peak} > 0$. Moreover, the observed depressed amplitudes of per in the simulated cwo mutants imply that the above inequality is satisfied and that the positive regulatory signals from $\Delta CC$ are stronger than the repressive signals from $\Delta CWO$. Though oscillations may still persist in the cwo-mutant model, as verified by models preceding the discovery of this unusual network molecule, CWO functions to mollify the repressive effects of the vri and per/tim negative loops, as evidenced by the increase in peak CLK–CYC levels in wt flies. Further, by lowering this threshold of CLK–CYC repression, the Drosophila clock is able to maintain high-amplitude oscillations of the four direct target genes that intersect at this transcriptional activator. In addition to these findings, Fathallah-Shaykh has recently completed a second analysis of these equations, which further clarifies the role of CWO in the Drosophila circadian network (Fathallah-Shaykh, 2010). Namely, theoretical results suggest that CWO regulates an antijitter control system,
which detects and then stabilizes variability in the oscillating rhythms of network molecules. As explained in Section 2, CWO's four target genes (per, tim, vri, and pdp1) oscillate with a 24-h peak-to-peak time. However, slight variations in peak timing, also known as jitter, occur naturally and are also predicted by the model. These variations are not unlike the "jitters," or deviations in periodic signals, observed in digital clocks. To eliminate these discrepancies, most digital clocks employ an antijitter device that detects and then corrects minor deviations in peak timing. Analysis of the model predicts that CWO may function in a similar manner. Fathallah-Shaykh (2010) cites evidence that, in simulations of the CWO-mutant model, cycle-to-cycle peak-time variability of direct target genes is driven by the variability of CLK–CYC, since CLK–CYC directly stimulates the production of these genes. In simulations of the wt model, however, CWO appears to reduce cycle-to-cycle jitter in the direct target genes because its own variability, proportional to that of these direct target genes, is subtracted from the variability induced by CLK–CYC. Again, the linearity of each molecule's regulatory equation at the peak of its oscillation was used to develop the theory of this causal relationship, which can be found in the supplementary material for Fathallah-Shaykh (2010).
5. Concluding Remarks

In this chapter, we introduced the theory and applications behind the modeling of regulatory networks. Although we focused on modeling's particular application to the dynamics of the Drosophila circadian clock, its spectrum of use is extensive and includes the study of population dynamics, the spread of infectious diseases, tumor growth, and numerous other dynamical systems (Murray, 2002; Stamper et al., 2010). In particular, we have discussed the use of different deterministic and continuous mathematical models, whose purpose is to both simulate and better understand the network behavior of circadian clock molecules. First, we explained the theory behind the use of Michaelis–Menten and Hill-type equations in models of intracellular regulatory networks. The majority of the models listed in Table 2.1 employ Michaelis–Menten and Hill-type equations, whose primary limitation lies in the large number of parameters needed to simulate a single molecular interaction. Second, we introduced a probabilistic model, with binding-probability equations and first-order reactions in place of Michaelis–Menten and Hill-type equations. Finally, the novel set of equations presented in Section 3.3 eliminates many of these parameters, with the added advantage that the relationship between the regulatory signals that act on a particular network molecule is linear at the peaks and troughs of the molecule's oscillations. This linearity affords new opportunities for
analysis of the system's behavior, thus leading to testable hypotheses, as demonstrated in the discussion of Section 4. We explained how these equations helped resolve an anomaly in the clock's architecture. The revelation that CWO, a direct weak repressor of specific network genes, actually activates these direct target genes led to the formulation of a new network regulatory rule, which can now be applied to network constructs in general. Moreover, analysis of these equations in Fathallah-Shaykh (2010) suggested that CWO acts as an antijitter regulator of direct target genes. These findings demonstrate the many ways that models may serve as tools for exploration of intracellular network behavior, thus saving much of the expensive and time-consuming work that must otherwise be done in vivo or in vitro (Szallasi et al., 2006). The findings of Fathallah-Shaykh et al. (2009) also have significant implications for the broader study of intracellular molecular networks. First and foremost, because each molecule's regulatory equation $u_i(t)$ is linear, and equal to zero, at the peaks and troughs of the molecule's oscillation, one is able to quantify the repressive and activating signals, which can lead to the development of novel network regulatory rules, like the one in Section 4. In addition, their model underscores the importance of looking at a regulator's broader role within a network as opposed to its isolated interaction with specific genes or proteins. As we have seen, the elimination of a direct repressor in a system may lead to unintended consequences in the overall network. We observe this phenomenon on a larger scale in the study of predator–prey populations, where, for example, the removal of one predator opens the door for another, and perhaps more lethal, predator to eliminate the prey.
On a molecular level, understanding the complete behavior of a system over time and its reactions to small perturbations or mutations can aid in drug development by identifying potential systemic side effects that could counteract the intended use of the drug (Endy and Brent, 2001). For example, Robe et al. (2009) recently published a study of a clinical trial of the anticancer drug sulfasalazine, which had exhibited strong preclinical evidence of its ability to treat progressing malignant gliomas in animals. However, results of the human trial revealed no beneficial effects at best, and several patients actually developed lethal side effects. The authors concluded that future trials should exercise extreme caution in testing the effects of sulfasalazine on human patients. A more complex dynamic model of this drug's interaction with its target molecules might have prevented a harmful human clinical trial. As Kitano (2002) explains, it is probable that one day the U.S. Food and Drug Administration may require dynamic network simulation of all drugs, in the same way that large-scale engineering projects undergo comprehensive environmental impact assessments before receiving their final stamp of approval. In the long run, such precautions may not only save lives but also reduce the financial impact of large-scale human trials.
REFERENCES

Bae, K., Lee, C., Sidote, D., Chuang, K. Y., and Edery, I. (1998). Circadian regulation of a Drosophila homolog of the mammalian Clock gene: PER and TIM function as positive regulators. Mol. Cell. Biol. 18, 6142–6151.
Baylies, M. K., Bargiello, T. A., Jackson, F. R., and Young, M. W. (1987). Changes in abundance or structure of the per gene product can alter periodicity of the Drosophila clock. Nature 326, 390–392.
Blau, J., and Young, M. W. (1999). Cycling vrille expression is required for a functional Drosophila clock. Cell 99, 661–671.
Chen, R., Seo, D. O., Bell, E., von Gall, C., and Lee, C. (2008). Strong resetting of the mammalian clock by constant light followed by constant darkness. J. Neurosci. 28, 11839–11847.
Cyran, S. A., Buchsbaum, A. M., Reddy, K. L., Lin, M. C., Glossop, N. R., Hardin, P. E., Young, M. W., Storti, R. V., and Blau, J. (2003). Vrille, Pdp1, and dClock form a second feedback loop in the Drosophila circadian clock. Cell 112, 329–341.
Darlington, T. K., Wager-Smith, K., Ceriani, M. F., Staknis, D., Gekakis, N., Steeves, T. D., Weitz, C. J., Takahashi, J. S., and Kay, S. A. (1998). Closing the circadian loop: CLOCK-induced transcription of its own inhibitors per and tim. Science 280, 1599–1603.
Ellner, S. P., and Guckenheimer, J. (2006). Dynamic Models in Biology. Princeton University Press, Princeton/Oxford.
Emery, P., So, W. V., Kaneko, M., Hall, J. C., and Rosbash, M. (1998). CRY, a Drosophila clock and light-regulated cryptochrome, is a major contributor to circadian rhythm resetting and photosensitivity. Cell 95, 669–679.
Endy, D., and Brent, R. (2001). Modelling cellular behaviour. Nature 409, 391–395.
Fathallah-Shaykh, H. M. (2010). Dynamics of the Drosophila circadian clock: Theoretical anti-jitter network and controlled chaos. PLoS ONE 5, 1–7.
Fathallah-Shaykh, H. M., Bona, J. L., and Kadener, S. (2009). Mathematical model of the Drosophila circadian clock: Loop regulation and transcriptional integration. Biophys. J. 97, 2399–2408.
Gekakis, N., Saez, L., Delahaye-Brown, A. M., Myers, M. P., Sehgal, A., Young, M. W., and Weitz, C. J. (1995). Isolation of timeless by PER protein interaction: Defective interaction between timeless protein and long-period mutant PERL. Science 270, 811–815.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361.
Gillespie, D. T. (2000). The chemical Langevin equation. J. Chem. Phys. 113, 297–306.
Glossop, N. R., Lyons, L. C., and Hardin, P. E. (1999). Interlocked feedback loops within the Drosophila circadian oscillator. Science 286, 766–768.
Glossop, N. R., Houl, J. H., Zheng, H., Ng, F. S., Dudek, S. M., and Hardin, P. E. (2003). VRILLE feeds back to control circadian transcription of Clock in the Drosophila circadian oscillator. Neuron 37, 249–261.
Goldbeter, A. (1995). A model for circadian oscillations in the Drosophila period protein (PER). Proc. Biol. Sci. 261, 319–324.
Gonze, D., Halloy, J., and Goldbeter, A. (2002). Robustness of circadian rhythms with respect to molecular noise. C. R. Biol. 326, 189–203.
Gonze, D., Halloy, J., Leloup, J., and Goldbeter, A. (2003). Stochastic models for circadian rhythms: Effect of molecular noise on periodic and chaotic behavior. PNAS 99(2), 673–678.
Hall, J. C., and Rosbash, M. (1987). Genetic and molecular analysis of biological rhythms. J. Biol. Rhythms 2, 153–178.
Hao, H., Allen, D. L., and Hardin, P. E. (1997). A circadian enhancer mediates PER-dependent mRNA cycling in Drosophila melanogaster. Mol. Cell. Biol. 17, 3687–3693.
Hardin, P. E., Hall, J. C., and Rosbash, M. (1990). Feedback of the Drosophila period gene product on circadian cycling of its messenger RNA levels. Nature 343, 536–540.
Hunter-Ensor, M., Ousley, A., and Sehgal, A. (1996). Regulation of the Drosophila protein timeless suggests a mechanism for resetting the circadian clock by light. Cell 84, 677–685.
Kadener, S., Stoleru, D., McDonald, M., Nawathean, P., and Rosbash, M. (2007). Clockwork Orange is a transcriptional repressor and a new Drosophila circadian pacemaker component. Genes Dev. 21, 1675–1686.
Kadener, S., Menet, J. S., Schoer, R., and Rosbash, M. (2008). Circadian transcription contributes to core period determination in Drosophila. PLoS Biol. 6, 0965–0977.
Kim, E. Y., Ko, H. W., Yu, W., Hardin, P. E., and Edery, I. (2007). A DOUBLETIME kinase binding domain on the Drosophila PERIOD protein is essential for its hypophosphorylation, transcriptional repression, and circadian clock function. Mol. Cell. Biol. 27, 5014–5028.
Kitano, H. (2002). Systems biology: A brief overview. Science 295, 1662–1664.
Kloss, B., Price, J. L., Saez, L., Blau, J., Rothenfluh, A., et al. (1998). The Drosophila clock gene double-time encodes a protein closely related to human casein kinase Iε. Cell 94, 97–107.
Lee, C., Parikh, V., Itsukaichi, T., Bae, K., and Edery, I. (1996). Resetting the Drosophila clock by photic regulation of PER and PER-TIM complex. Science 271, 1740–1744.
Leise, T., and Moin, E. (2007). A mathematical model of the Drosophila circadian clock with emphasis on post-translational mechanisms. J. Theor. Biol. 28, 48–63.
Leloup, J. C., and Goldbeter, A. (1998). A model for the circadian rhythms in Drosophila incorporating the formation of a complex between the PER and TIM proteins. J. Biol. Rhythms 13, 70–87.
Levin, D. A., Peres, Y., and Wilmer, E. L. (2009). Markov Chains and Mixing Times. American Mathematical Society, Providence, RI.
Li, Q., and Lang, X. (2008). Internal noise-sustained circadian rhythms in a Drosophila model. Biophys. J. 94, 1983–1994.
Maini, P. K., Burke, M. A., and Murray, J. D. (1991). On the quasi-steady-state assumption applied to Michaelis-Menten and suicide substrate reactions with diffusion. Phil. Trans. R. Soc. Lond. 337, 299–306.
Matsumoto, A., Ukai-Tadenuma, M., Yamada, R. G., Houl, J., Kasukawa, T., et al. (2007). A functional genomics strategy reveals clockwork orange as a transcriptional regulator in the Drosophila circadian clock. Genes Dev. 21, 1687–1700.
Michaelis, L., and Menten, M. (1913). Die Kinetik der Invertinwirkung. Biochem. Z. 49, 333–369.
Moore-Ede, M. C., Sulzman, F. M., and Fuller, C. A. (1982). The Clocks That Time Us. Harvard University Press, Cambridge, MA.
Murray, J. (2002). Mathematical Biology: I. An Introduction, third ed. Springer-Verlag, Berlin/Heidelberg.
Price, J. L., Blau, J., Rothenfluh, A., Abodeely, M., Kloss, B., et al. (1998). double-time is a novel Drosophila clock gene that regulates PERIOD protein accumulation. Cell 94, 83–95.
Richier, B., Michard-Vanhee, C., Lamouroux, A., Papin, C., and Rouyer, F. (2008). The clockwork orange Drosophila protein functions as both an activator and a repressor of clock gene expression. J. Biol. Rhythms 23, 103–116.
Modeling of Regulatory Networks
71
Robe, P. A., Martin, D. H., Nguyen-Khac, M. T., Artesi, M., Deprez, M., Albert, A., Vanbelle, S., Califice, S., Bredel, M., and Bours, V. (2009). Early termination of ISRCTN45828668, a phase 1/2 prospective, randomized study of sulfasalazinie for the treatment of progressing malignant gliomas in adults. BMC Cancer 19, 372. Sabouri-Ghomi, M., Ciliberto, A., Kar, S., KarNovak, B., and Tyson, J. (2008). Antagonism and bistability in the protein interaction networks. J. Theor. Biol. 250, 209–218. Saez, L., and Young, M. W. (1996). Regulation of nuclear entry of the Drosophila clock proteins period and timeless. Neuron 17, 911–920. Sasson-Corsi, P. (1998). Molecular clocks: Mastering time by gene regulation. Nature 392, 871–874. Smolen, P., Baxter, D. A., and Byrne, J. H. (2001). Modeling circadian oscillations with interlocking positive and negative feedback loops. J. Neurosci. 21, 6644–6656. Smolen, P., Baxter, D. A., and Byrn, J. H. (2002). A reduced model clarifies the role of feedback loops and time delays in the Drosophila Circadian Oscillator. Biophys. J. 83, 2349–2359. Smolen, P., Hardin, P. E., Lo, B. S., Baxter, D. A., and Byrne, J. H. (2004). Simulation of Drosophila circadian oscillations, mutations, and light responses by a model with VRI, PDP-1, and CLK. Biophys. J. 86, 2786–2802. Stamper, I. J., Owen, M. R., Maini, P. K., and Byrne, H. M. (2010). Oscillatory dynamics in a model of vascular tumour growth—Implications for chemotherapy. Biol. Direct 5(27), 1–17. Szallasi, Z., Stelling, J., and Perival, V. (eds.), (2006). System Modeling in Cellular Biology, The MIT Press, Cambridge/London. Ueda, H. R., Hagiwara, M., and Kitano, H. (2001). Robust oscillations within the interlocked feedback model of Drosophila circadian rhythm. J. Theor. Biol. 210, 401–406. Xie, Z., and Kulasiri, D. (2007). Modelling of circadian rhythms in Drosophila incorporating the interlocked PER/TIM and VRI/PDP1 feedback loops. J. Theor. Biol. 245, 290–304. Yu, W., Zheng, H., Houl, J. 
H., Dauwalder, B., and Hardin, P. E. (2006). PER-dependent rhythms in CLK phosphorylation and E-box binding regulate circadian transcription. Genes Dev. 20, 723–733. Zerr, D. M., Hall, J. C., et al. (1990). Circadian fluctuations of period protein immunoreactivity in the CNS and the visual system of Drosophila. J. Neurosci. 10(8), 2749–2762.
CHAPTER THREE

Strategies for Articulated Multibody-Based Adaptive Coarse Grain Simulation of RNA

Mohammad Poursina,* Kishor D. Bhalerao,† Samuel C. Flores,‡ Kurt S. Anderson,* and Alain Laederach§,¶

Contents
1. Introduction
2. Need for the Development of Adaptive Coarse-Graining Machinery
2.1. Model description
2.2. Inadequacy of the static coarse graining based on simulation results
3. Metrics to Guide Transitions in Adaptive Modeling
3.1. Metrics to guide transitions from finer to coarser models
3.2. Metrics to guide transitions from coarser to finer models
4. Adaptive Modeling Framework in DCA Scheme
5. Conclusions
Acknowledgments
References
Abstract
Efficient modeling approaches are necessary to accurately predict large-scale structural behavior of biomolecular systems like RNA (ribonucleic acid). Coarse-grained approximations of such complex systems can significantly reduce the computational costs of the simulation while maintaining sufficient fidelity to capture the biologically significant motions. However, given the coupling and nonlinearity of RNA systems (and effectively all biopolymers), it is expected that different parameters such as geometric and dynamic boundary conditions, and applied forces will affect the system's dynamic behavior. Consequently, static

* Computational Dynamics Lab, Mechanical, Nuclear and Aerospace Engineering Department, Rensselaer Polytechnic Institute, Troy, New York, USA
† Department of Mechanical Engineering, The University of Melbourne, Victoria, Australia
‡ Simbios Center, Bioengineering Department, Stanford University, Clark Center S231, Stanford, California, USA
§ Department of Biomedical Sciences, University at Albany, Albany, New York, USA
¶ Developmental Genetics and Bioinformatics, Wadsworth Center, Albany, New York, USA
Methods in Enzymology, Volume 487 ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87003-5
© 2011 Elsevier Inc. All rights reserved.
coarse-grained models (i.e., models for which the coarse graining is time invariant) are not always able to adequately sample the conformational space of the molecule. We introduce here the concept of adaptive coarse-grained molecular dynamics of RNA, which automatically adjusts the coarseness of the model in an effort to increase simulation speed while maintaining accuracy. Adaptivity requires two basic algorithmic developments: first, a set of integrators that seamlessly allow transitions between higher and lower fidelity models while preserving the laws of motion; and second, metrics, which we propose and validate, for determining when and where more or less fidelity needs to be integrated into the model to allow sufficiently accurate dynamics simulation. Given the central role that multibody dynamics plays in the proposed framework, and the nominally large number of dynamic degrees of freedom being considered in these applications, a computationally efficient multibody method which lends itself well to adaptivity is essential to the success of this effort. A suite of divide-and-conquer algorithm (DCA)-based approaches is employed to this end. These algorithms have been selected and refined for this purpose because they offer a good combination of computational efficiency and modular structure.
1. Introduction
The development and application of efficient techniques to model highly complex biomolecular systems are pursued by engineers and scientists in an effort to predict and understand the structural dynamics of these systems and to explain various biological processes (Chen, 2008; Dill et al., 2008; Lebrun and Lavery, 1998; Parisien and Major, 2008; Scheraga et al., 2007). Among the various types of biomolecular systems, nucleic acids, and in particular RNA, play a central regulatory role in the cell (Grundy and Henkin, 2006; Schroeder et al., 2004; Tucker and Breaker, 2005; Ying and Lin, 2006). Key to RNA function is structure, in particular, its ability to fold into a functional molecule capable of gene regulation and catalysis (Guo and Cech, 2002; Woodson, 2002; Zaug et al., 1998). The structural dynamics of biomolecular systems, including RNA, can be modeled using a variety of different techniques. Conventional molecular dynamics (MD) (Haile, 1992; Leach, 2001), which benefits from a fully atomistic representation of the system, as shown schematically in Fig. 3.1A, is conceptually the simplest approach, and results from the direct application of Newton's laws of motion. As such, these fully atomistic models capture all the dynamics of the system. The formulation and solution of the associated equations are trivial once the forcing terms (force field calculations) have been determined; the overwhelming majority of the computational cost in MD simulations lies in these force field calculations. Additionally, due to the high frequency motion of the atoms, it is necessary to use exceedingly
[Figure 3.1 panels: (A) fully atomistic representation; (B) flexible body representation of groups of atoms, and rigid body representation of groups of atoms]
Figure 3.1 Different types of modeling of biomolecular systems: (A) conventional MD simulations with fully atomistic representation of the system and (B) using articulated multibody concepts with different rigid and flexible subdomains connected to each other via kinematic joints.
small temporal integration step sizes when explicit integrators are used. High-stability implicit schemes for stiff differential equations, such as implicit-Euler (IE) (Hairer and Wanner, 1996; Peskin and Schlick, 1989; Schlick and Peskin, 1989) and implicit-midpoint (IM) (Mandziuk and Schlick, 1995), are also unsatisfactory for proteins and nucleic acids at atomic resolution at large time steps because of numerical damping (Nyberg and Schlick, 1992; Schlick and Peskin, 1995; Zhang and Schlick, 1993). Consequently, these techniques are not recommended for modeling systems with a large number of degrees of freedom. Alternatively, the development of coarse-grained models, either by introducing superatoms (beads) (Praprotnik et al., 2005) or by using articulated multibody dynamics (Chun et al., 2000; Mukherjee et al., 2007), as shown in Fig. 3.1B, reduces the cost of the simulation while still capturing the overall conformational motion. These models can contain multiple resolutions, ranging from fine-scale atomistic domains and coarse-grained macromolecules to continuum-level system descriptions. Eliminating the high frequency modes of motion within certain subdomains of the system allows one to significantly increase the size of the temporal integration steps. Additionally, it is not necessary to solve the equations governing the dynamics of the system for the Cartesian coordinates of all atoms; the dynamics need only be solved for a much smaller number of degrees of freedom, which can be more efficient in capturing the overall conformational motion. Furthermore, the structure of these models is such that the force field calculations may be performed more efficiently. To roughly estimate and compare the computational cost associated with the force field calculations in traditional atomistic MD simulation and in the coarse-grained approximation, let us consider a system with n atoms.
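To make the comparison developed next concrete, the pair counts can be sketched as below; the sample values of n and s and the function names are ours, chosen only for illustration.

```python
# Illustrative sketch of the pairwise force-evaluation counts compared in the
# text (symbols n, s, p as defined there, with n = s * p).

def atomistic_pair_count(n):
    """All unique atom pairs: n(n - 1)/2 = (n^2/2)(1 - 1/n)."""
    return n * (n - 1) // 2

def coarse_pair_count(n, s):
    """Pairs excluding atoms within the same rigid body of p = n/s atoms:
    n(n - 1)/2 - s * p(p - 1)/2 = (n^2/2)(1 - 1/s)."""
    p = n // s
    return atomistic_pair_count(n) - s * (p * (p - 1) // 2)

n, s = 18_000, 18  # e.g., 18 rigid substructures of 1000 atoms each
full = atomistic_pair_count(n)
coarse = coarse_pair_count(n, s)
print(full, coarse, coarse / full)
```

With these values the coarse model discards only the within-body pairs, a reduction by roughly the factor (1 − 1/s); the savings grow as s decreases, consistent with the s ≪ n argument in the text.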
An MD simulation requires (n²/2)(1 − 1/n) force field evaluations to find either the van der Waals or electrostatic forces. If the system is modeled by the multibody dynamics approach with s rigid substructures, each containing p atoms, the force field calculations (van der Waals or electrostatic) are limited to interactions between atoms not located in the same rigid body (see Fig. 3.2). As such, the force field calculations reduce to (n²/2)(1 − 1/s). This rough estimate demonstrates that a significant reduction in the number of rigid substructures in the model (i.e., s ≪ n) can produce a significant decrease in the computational cost of the force field calculations. In biological systems of nucleic acids and proteins, which tend to possess many (nominally O(10³)–O(10⁷)) degrees of freedom, important physical phenomena can occur at vastly different spatial and temporal scales. The small, high-frequency oscillations of individual tightly bonded atoms occupy the subfemtosecond, O(10⁻¹⁶) s, temporal domain, while the conformational motions of interest occur on millisecond, O(10⁻³) s, or larger time scales. Additionally, RNA molecules are not static, but in fact are
Figure 3.2 Force field calculation using multirigid-body dynamics approach. The interactions between the atoms inside the same rigid bodies are ignored.
highly dynamic (Kent et al., 2000). The dynamic behavior of RNA is complex because of the dominant effect of base-pairing and stacking, which leads to regions of the polymer that are highly rigid, connected by flexible loops (Rázga et al., 2004, 2006). Stacking and pairing forces are cumulative, and with time, different regions of the RNA can become rigid or flexible (Shcherbakova et al., 2008). These important properties indicate that the quasi-static approach (Redon and Lin, 2006; Rossi et al., 2007) is not appropriate for modeling such systems, and consequently, additional challenges arise in the development of adaptive multiscale methods to model the dynamics of RNA. In the adaptive multiscale framework presented herein, the underlying dynamics formulation used within each system subdomain, as well as the subdomain definitions themselves, needs to be intelligently chosen such that the system level solutions are determined with acceptable accuracy in a timely manner. For instance, Fig. 3.1B provides an example system which comprises multiple subdomains, each containing a potentially different form of model. In such a system level model, the system is treated as multiple rigid and/or flexible bodies connected to each other via kinematic joints. In this case, the coarse-graining algorithm should, at a minimum, be capable of adaptively adding/removing degrees of freedom to/from the model, as well as changing the definition of the flexible and rigid regions of the model during the simulation, to efficiently provide appropriate simulation results. Physics-, mathematics-, or knowledge-based internal metrics (Rossi et al., 2007) initiate and guide the change in the number and definition of the system subdomains. Additionally, these metrics may be used to assess the
model performance and guide the types of models (atomistic, articulated multi-rigid, and/or flexible body) used within those subdomains to obtain the optimal combination of speed and accuracy. Thus, mechanisms need to be put in place for identifying critical locations where constraints can be added or removed as needed to enhance the simulation performance. Finally, a multibody formulation needs to be used which both lends itself well to adaptive (on-the-fly) changes in model and domain definition and is efficient when applied to large complex systems. In this chapter, important features of adaptive coarse grain modeling and simulation built on articulated multibody dynamics, with application to RNA, are addressed.
2. Need for the Development of Adaptive Coarse-Graining Machinery
In this section, we investigate the effects of specific parameters on the behavior of different kinds of RNAs, establishing the necessity of developing general and robust adaptive machinery for modeling complex RNA systems.
2.1. Model description
We study different sequences of the nucleotides adenine (A), cytosine (C), guanine (G), and uracil (U) and construct representative single-stranded RNA segments, each 18 nucleotides long, as shown in Fig. 3.3. All the simulations are performed using RNABuilder (Flores et al., 2010). The RNABuilder package is written using the Simbody (Schmidt et al., 2008)
Figure 3.3 RNA with 18 residues shown in different colors. The prescribed motion is applied to the first residue located at the left side of the RNA. The last residue at the right side of the RNA is fixed in space.
internal coordinate mechanics library and its molecular mechanics extension, Molmodel, both available from SimTK.org. RNABuilder provides a series of mobilizer commands which set the flexibility of the molecule (Flores et al., 2010). In this work, we use the "rigid" mobilizer to rigidify each residue in the system. As such, each nucleotide (residue) is treated as a rigid body, attached to its neighboring nucleotides by kinematic joints, producing an articulated multibody system. Defining the BondMobility as "torsion," each nucleotide is connected to its parent residue via a revolute joint whose joint axis is coincident with the local bond. Each of the 18-residue segments investigated may be viewed as a portion of a very much longer RNA, and the prescribed motions at the segment boundaries represent the dynamic effect of the rest of the system on the 18-residue RNA segment. For the performed simulations, one boundary (the 18th residue) is fixed in space using the constraint called "Weld." The other boundary of the RNA (the first residue) receives a prescribed motion in the form of simple harmonic functions in both the x and y directions (in the Newtonian frame) as

x(t) = 0.05[cos(ω1 t) − 1],   (3.1)

y(t) = 0.05[cos(ω2 t) − 1],   (3.2)
where x and y are measured in nanometers. This task has been performed using the command "prescribed motion" and modifying the class "function" to create the functions indicated in Eqs. (3.1) and (3.2). We conduct the simulations on RNAs with the various sequences described previously, with two different prescribed motions, for 20 ps. The first prescribed motion is defined with the same frequencies in both the x and y directions, ω1 = ω2 = 20 rad/ps, while the second prescribed motion has frequency content ω2 = 2ω1 = 40 rad/ps. The values of the frequencies and amplitudes are selected such that the range of the prescribed motion is in accordance with the real motion. The simulations are performed at 300 K, and the AMBER (Parm99) potential field (Wang et al., 2000) is used to determine the bonded and nonbonded forces. The overall conformation of the RNA is important for identifying the structural dynamics of the system. This overall conformation is determined by the values of the generalized coordinates (internal coordinates) defined at the joints connecting consecutive residues. Since, in our truth model, all these joints are revolute, and torsional motion is allowed about the axes of the bonds which connect consecutive residues, the joint angles are monitored during the course of the simulation to assess the overall conformation.
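A minimal sketch of the boundary excitation of Eqs. (3.1) and (3.2), assuming NumPy; the function name is illustrative and does not correspond to an RNABuilder command.

```python
import numpy as np

# Sketch of the prescribed boundary motion of Eqs. (3.1) and (3.2);
# amplitudes in nm, frequencies in rad/ps, times in ps.

def prescribed_motion(t, omega1, omega2):
    """Harmonic displacement (nm) applied to the first residue at time t (ps)."""
    x = 0.05 * (np.cos(omega1 * t) - 1.0)
    y = 0.05 * (np.cos(omega2 * t) - 1.0)
    return x, y

t = np.linspace(0.0, 20.0, 2001)            # 20-ps trajectory, 0.01-ps steps
x1, y1 = prescribed_motion(t, 20.0, 20.0)   # first case:  w1 = w2 = 20 rad/ps
x2, y2 = prescribed_motion(t, 20.0, 40.0)   # second case: w2 = 2*w1 = 40 rad/ps
```

Both components start from rest at the origin and sweep displacements between 0 and −0.1 nm, matching the amplitude scale chosen in the text.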
2.2. Inadequacy of the static coarse graining based on simulation results
Some representative simulation results for the systems described previously are provided here to convey the nature of the associated dynamic behavior. Figure 3.4 illustrates how differently representative joint angles at various locations of the system and at different instants may behave when the systems are excited by the prescribed motion characterized by ω2 = 2ω1 = 40 rad/ps. In Fig. 3.4A, high amplitude motion is observed in the dynamic behavior of the joint angle between residues 1 and 2. Consequently, this joint should not be locked in the coarse-graining process. Figure 3.4B shows that after the passage of a couple of picoseconds, the joint angle between residues 8 and 9 experiences a transition from one regime of motion (experiencing little relative motion) to another regime in which the motion across the joint is significantly greater. Therefore, the coarse-grained model in which this joint is initially locked, though potentially valid at the beginning of the simulation, does not provide a reliable conformation for the entire course of the simulation. The prescribed motion at the RNA segment boundary represents the effect of the dynamics of the rest of the system as an input to the portion of interest. To study the effects of changes in the input parameters on the response of the system in more detail, the behavior of selected joint angles at different locations of the poly-A RNA is shown in Fig. 3.5. Due to a subtle change in one of the input frequencies, an interesting behavior is observed which had not been predicted. It would be expected that the joint angle between residues 1 and 2, located closest to the prescribed motion, would never be a good candidate for locking during the course of the simulation. However, the results shown in Fig. 3.5A indicate that when the input motions contain the same frequencies, ω1 = ω2 = 20 rad/ps, this joint can be locked after the passage of time, while the same coarsening is inappropriate when the system is driven by prescribed motions in which the frequency associated with the y component is doubled to 40 rad/ps. As another example, in Fig. 3.5B, the dynamics of the joint angle between residues 8 and 9 is compared when the system is excited by the two different prescribed motions. This joint can be locked throughout the course of the simulation, due to its small variations, when both frequencies in the prescribed motion are chosen as 20 rad/ps. However, the results in Fig. 3.5B indicate that if the value of ω2 changes to 40 rad/ps, for the same system, this joint angle cannot be locked during the whole course of the simulation. The simulation results affirm the nonlinear nature of the systems described previously. They demonstrate that the dynamic behavior of each joint angle is highly time variant and is significantly affected by the
Figure 3.4 Illustration of the nonlinear dynamics of RNAs with sequences: 18 As (red), 18 Us (blue), 9 GCs (green), and 9 AUs (black), when the prescribed motion is characterized by ω2 = 2ω1 = 40 rad/ps. (A) High amplitude motion of the joint angle between residues 1 and 2. (B) Significantly different regimes of motion of the joint angle between residues 8 and 9.
Figure 3.5 Effect of changing the input dynamics from ω1 = ω2 = 20 rad/ps (red) to ω2 = 2ω1 = 40 rad/ps (blue) on the behavior of the poly-A RNA with a sequence of 18 As. (A) The joint angle between residues 1 and 2, the closest one to the excitation point, can be locked after a passage of time when ω2 = 20 rad/ps; however, it cannot be locked when ω2 = 40 rad/ps. (B) The joint angle between residues 8 and 9 can be frozen for the whole course of the simulation when ω2 = 20 rad/ps; however, when ω2 = 40 rad/ps, the dynamic behavior of this joint angle prevents model reduction at this location.
changes in the dynamics of the rest of the system. Consequently, even for a specific structure, a suggested coarse model may not be valid when the dynamics of the boundaries change, or when the conformation of the structure changes with time. As such, static (i.e., time invariant) coarse graining may not provide appropriate results, and coarse grain modeling of these systems must be implemented in an adaptive framework. These conclusions can be generalized to other biomolecular systems such as DNA and proteins. In this process, some degrees of freedom of the system are adaptively constrained or released at different instants and at different locations in the system. Additionally, based on the behavior of different subdomains of the system, the definition of the rigid and/or flexible regions may change.
3. Metrics to Guide Transitions in Adaptive Modeling
Key to this effort is the need to develop metrics that guide the adaptive machinery. These metrics may be knowledge-based (derived empirically), math-based (derived from strictly mathematical relations), and/or physics-based (derived directly from physical laws). Herein, we introduce two different types of metrics for guiding model transitions to coarser and finer models.
3.1. Metrics to guide transitions from finer to coarser models
The behavior of the individual degrees of freedom of the model can be used to assess which of them may be removed while the resulting reduced-order model still produces essentially the same behavior, but with less computational effort. As such, the significant (active) degrees of freedom are retained in the coarsened model, while the less significant degrees of freedom are identified and removed. In biomolecular systems, the high frequency modes of motion associated with the finer fidelity models produce large instantaneous relative velocities and accelerations. However, such modes do not contribute significantly to the global conformation of the system. Thus, velocity- and acceleration-based metrics are not well suited for identifying those degrees of freedom which are more significant in the overall conformational motions. Additionally, the dynamics of biomolecular systems is highly nonlinear and chaotic, and as such, coarse-graining metrics based on the instantaneous values of the states of the system are not expected to (and have been shown not to) yield improved results. Therefore, we propose to monitor the moving-window statistical properties of the generalized coordinates of the system as math-based metrics to assess and guide the coarse-graining process, instead of the instantaneous velocity- and acceleration-based metrics described in Redon and Lin (2006) and Rossi et al. (2007). The metric of choice for determining whether an existing joint should be kept or removed is the standard deviation of the generalized coordinate defined at the joint, collected within a sliding window, as given by

S_w = sqrt( (1/n) Σ_{k=1}^{n} (x_k − x̄_w)² ).   (3.3)
In the above equation, x̄_w is the moving-window average of the sequence of data within the window of size n. Each generalized coordinate contributes differently to the overall conformation. In other words, the overall conformation of the system may be more sensitive to some specific generalized coordinates, and this sensitivity varies with time. Consequently, if the weighted (scaled) value of the moving-window standard deviation of any generalized coordinate defined at a joint of the system (i.e., an internal coordinate) is less than a predefined threshold, then the associated degree of freedom is considered less significant, and thus eligible to be frozen. In this work, since torsional motion is the only motion allowed by the kinematic joints of the system, if the moving-window standard deviation of any specific joint angle is insignificant in comparison to the predefined threshold, the entire joint is locked. Herein, we pick the poly-GC 18-nucleotide RNA to perform the coarse-graining process based on the results obtained during a 1-ps simulation. Both frequencies in the prescribed motions are considered to be 20 rad/ps. The size of the window used to calculate the standard deviation of the joint angles, as well as the value of the threshold on the standard deviation, are both system dependent. In this case, the values of the joint angles are sampled every 0.01 ps, and the length of the moving window is chosen to be 100 sampling times. We also ignore the weights on the standard deviations associated with the different joint angles of the system. Examining different values for the threshold, and comparing the results of the associated coarse model to those of the truth model, we lock the joints of the system if the moving-window standard deviation of their angles within the first 100 sampling times is less than 1°.
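The locking criterion just described can be sketched as follows; the window length (100 samples at 0.01 ps) and the 1° threshold follow the values quoted above, while the angle histories and function names are hypothetical.

```python
import numpy as np

# Sketch of the fine-to-coarse metric of Eq. (3.3): a joint is flagged for
# locking when the moving-window standard deviation of its angle falls below
# a threshold.

def window_std(angles, window=100):
    """Eq. (3.3): standard deviation over the most recent `window` samples."""
    x = np.asarray(angles)[-window:].astype(float)
    return np.sqrt(np.mean((x - x.mean()) ** 2))

def joints_to_lock(angle_history, window=100, threshold_deg=1.0):
    """angle_history maps joint index -> sampled joint angles (deg).
    Returns indices of joints eligible to be frozen (unit weights assumed)."""
    return [j for j, a in angle_history.items()
            if len(a) >= window and window_std(a, window) < threshold_deg]

t = np.arange(100) * 0.01                          # 1 ps of samples
history = {1: -90.0 + 30.0 * np.sin(40.0 * t),     # strongly oscillating joint
           5: -75.0 + 0.2 * np.sin(40.0 * t)}      # nearly quiescent joint
print(joints_to_lock(history))  # -> [5]
```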
Based on this threshold, the joints between residues 5 and 6, 7 and 8, 9 and 10, 11 and 12, 13 and 14, 14 and 15, 16 and 17, and 17 and 18 are allowed to be frozen. The adequacy of the proposed coarse graining can be assessed by comparing the results achieved with the coarse-grained model to those obtained exclusively from the fine-grained "truth model" over extended simulation periods. The simulation of the suggested coarse model is, therefore, conducted for 30 ps with the same initial conditions and prescribed motions applied to the truth model. Figure 3.6 shows the behavior of some
Figure 3.6 Comparison between the dynamic behavior of the fine system (red) and the suggested coarse model (blue) for poly-GC RNA. (A) Joint angle between residues 2 and 3, (B) Joint angle between residues 9 and 10, (C) Joint angle between residues 11 and 12, and (D) Joint angle between residues 12 and 13.
representative joint angles for both models. The joint angle between residues 2 and 3 experiences the same behavior in both models, as shown in Fig. 3.6A. Although the joint angle between residues 9 and 10 has been locked in the coarsening process, Fig. 3.6B indicates that the associated behavior in the proposed coarse model is very close to that obtained with the finer model. Based on the results shown in Fig. 3.6C, the joint angle between residues 11 and 12 in the coarse model deviates from that of the truth model as time passes. Depending on the required accuracy, it may be necessary to unlock this joint during the adaptive coarse-graining process. Finally, in Fig. 3.6D, similar trends are observed in the dynamics of the relative motion between residues 12 and 13 for both models.
3.2. Metrics to guide transitions from coarser to finer models
An inadequate coarse-grained model, resulting from constraining certain critical degrees of freedom of the system, can lead to an incorrect representation of the system behavior and conformation. An important metric for checking the validity of the selected coarse-grained model is the spatial constraint loads (forces and moments) acting on all the kinematic joints of the system and on intermediate locations within the rigid and flexible bodies which represent segments of the articulated molecular system. The system's internal and constraint loads arise from the interactions between the bodies, the imposed boundary conditions, and the kinematic constraints imposed on adjacent body-to-body motions by the connecting joints, as shown in Fig. 3.7. For instance, consider the coarse-grained model provided in
[Figure 3.7 shows bodies k and k + 1 connected by a joint, with a constraint force and an imposed motion due to the dynamics of the rest of the system]
Figure 3.7 Magnitude of constraint load gives an estimate of errors introduced due to locking the joint.
Section 3.1. If the frequency content in the y-direction of the prescribed motion in Eq. (3.2) changes from 20 to 60 rad/ps at t = 10 ps, the behavior of the constraint torques changes significantly at some joints of the system that were locked previously in the coarse-graining process. For instance, Fig. 3.8 shows the significant increase in the value of the constraint torque in the direction of the bond connecting residues 16 and 17, which had been locked in the coarsening process. Thus, the constraint load magnitude gives an indication of the degree to which the body or joint in question is attempting to deform at a location. If, during the course of the simulation, the value of the spatial constraint load at any location in the system exceeds the nominal load which would notionally cause mechanical failure, the associated joint is released. This effectively releases (removes) the constraint and permits the adjacent segments to move relative to one another in a manner permitted by the joint and in response to the system forcing terms. The required mechanisms to change the definition of the joints and the derivation of the appropriate mathematics for the model transitions remain a contemporary challenge in the field. Adding degrees of freedom to the system poses additional challenges. In real mechanical systems, since energy is not "created" by unlocking a
Figure 3.8 Constraint torque between residues 10 and 11. The constraint torque about the bond direction (locked in the coarsening process) changes significantly when a modest change occurs in the frequency of the prescribed motion at t = 10 ps.
joint or the failure of a member, the "creation" of a new mode of motion by the removal of a constraint does not pose any problem; there is no jump in the system velocity variables. However, in the coarse-graining process of biomolecular systems, naturally existing higher modes of motion are ignored in the modeling because the internal metric had previously indicated that these modes were less relevant. Therefore, in the transition from the coarse model to the finer model, the current value of the kinetic energy of these ignored modes must be estimated and accounted for appropriately. Details of choosing the optimum solution and the various issues in adding joints to the model are discussed in detail in Anderson and Poursina (2009a) and Poursina et al. (2009).
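A minimal sketch of the coarser-to-finer check described in this section: a previously locked joint is flagged for release when the constraint torque about its bond axis exceeds a nominal load. The threshold value, function names, and data layout are hypothetical; as noted above, a complete transition must also account for the kinetic energy of the restored mode.

```python
# Sketch of the coarse-to-fine metric: a joint locked during coarsening is
# released when the magnitude of the constraint torque about its bond axis
# exceeds a nominal (figurative failure) load.

NOMINAL_LOAD = 2.0e4  # Da*(nm/ps)^2, illustrative threshold only

def joints_to_release(locked_joints, constraint_torque):
    """locked_joints: ids of currently frozen joints;
    constraint_torque: dict mapping joint id -> torque about the bond axis."""
    return [j for j in locked_joints
            if abs(constraint_torque.get(j, 0.0)) > NOMINAL_LOAD]

# Hypothetical snapshot after the input frequency change at t = 10 ps:
torques = {16: 3.5e4, 13: 4.0e2}
print(joints_to_release([13, 16], torques))  # -> [16]
```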
4. Adaptive Modeling Framework in the DCA Scheme

Within the adaptive framework, a divide-and-conquer algorithm (DCA) is used to solve the forward dynamics problem for articulated multibody systems composed of rigid and flexible bodies in arbitrary configurations, including kinematically open loops and multiple closed loops (Featherstone, 1999a,b; Mukherjee and Anderson, 2007b,c). These DCA-based methods are used in the context of large-scale adaptive molecular problems because: (1) they are relatively efficient for large-scale sequential computer implementation; (2) their formulations have a highly modular structure, which makes their implementation and use within an adaptive framework relatively straightforward; and (3) they are highly parallelizable. The computational complexity of the algorithm is O(n) in serial and O(log(n)) in parallel implementations, where n denotes the number of degrees of freedom of the system. The basic idea of the DCA is to treat a large multibody system by recursively assembling adjacent articulated bodies/subsystems (Fig. 3.9A) into larger encompassing subsystems (Fig. 3.9B). A brief overview of the method is provided here. Consider the two consecutive articulated bodies k and k+1 shown in Fig. 3.9A, which are connected to each other via the kinematic joint Jk. The term "handle," which appears throughout this chapter, denotes any selected point on a body that is used in modeling the interactions of the body with its environment. The handles on a body can correspond to the joint locations, the center of mass, or any desired reference points, and each body can have any number of handles. For the algorithm presented here, the joint locations are chosen as the handles. The two-handle equations of motion for each body are those in which the spatial acceleration of each handle of the body is expressed as a linear combination of the spatial constraint forces applied to the inward and outward handles.
These equations for bodies k and k+1 are expressed as
Figure 3.9 Assembling of two consecutive bodies to form a new body: (A) consecutive bodies k and k+1 connected to each other via a kinematic joint and (B) fictitious body formed from two distinct bodies by eliminating the constraint force and spatial accelerations corresponding to the common joints.
A_1^k = \phi_{11}^k \mathcal{F}_{1c}^k + \phi_{12}^k \mathcal{F}_{2c}^k + \phi_{13}^k,    (3.4)

A_2^k = \phi_{21}^k \mathcal{F}_{1c}^k + \phi_{22}^k \mathcal{F}_{2c}^k + \phi_{23}^k,    (3.5)

A_1^{k+1} = \phi_{11}^{k+1} \mathcal{F}_{1c}^{k+1} + \phi_{12}^{k+1} \mathcal{F}_{2c}^{k+1} + \phi_{13}^{k+1},    (3.6)

A_2^{k+1} = \phi_{21}^{k+1} \mathcal{F}_{1c}^{k+1} + \phi_{22}^{k+1} \mathcal{F}_{2c}^{k+1} + \phi_{23}^{k+1},    (3.7)
respectively. In the above equations, all the coefficients \phi_{ij} are known quantities at each time step. The terms \phi_{ij} (i, j = 1, 2) are associated with the inertia of the body and, consequently, are constant for each body over the course of the simulation if expressed in the body basis. The terms \phi_{i3} (i = 1, 2) are associated with the known applied forces, as well as the centripetal and Coriolis terms, and must be updated at each time step. At the joint Jk, the kinematic constraint at the acceleration level is expressed as

A_1^{k+1} = A_2^k + P^{J_k} \dot{u}^{J_k} + \dot{P}^{J_k} u^{J_k},    (3.8)

where P^{J_k} is the known matrix associated with the joint free modes of motion (Roberson and Schwertassek, 1988), and u^{J_k} represents the known generalized speeds defined at the joint J_k. Using the relations provided above, and introducing the matrix D^{J_k} as the orthogonal complement of the joint free-motion map at J_k, one can arrive at the two-handle equations governing the dynamics of the assembly k:k+1 as
A_1^k = \phi_{11}^{k:k+1} \mathcal{F}_{1c}^k + \phi_{12}^{k:k+1} \mathcal{F}_{2c}^{k+1} + \phi_{13}^{k:k+1},    (3.9)

A_2^{k+1} = \phi_{21}^{k:k+1} \mathcal{F}_{1c}^k + \phi_{22}^{k:k+1} \mathcal{F}_{2c}^{k+1} + \phi_{23}^{k:k+1},    (3.10)
where

\phi_{11}^{k:k+1} = \phi_{11}^k - W \phi_{21}^k,    (3.11)

\phi_{12}^{k:k+1} = W \phi_{12}^{k+1},    (3.12)

\phi_{13}^{k:k+1} = \phi_{13}^k - W Y,    (3.13)

\phi_{21}^{k:k+1} = Z \phi_{21}^k,    (3.14)

\phi_{22}^{k:k+1} = \phi_{22}^{k+1} - Z \phi_{12}^{k+1},    (3.15)
Mohammad Poursina et al.
\phi_{23}^{k:k+1} = \phi_{23}^{k+1} + Z Y,    (3.16)

X = D^{J_k} \left[ \left(D^{J_k}\right)^T \left( \phi_{22}^k + \phi_{11}^{k+1} \right) D^{J_k} \right]^{-1} \left(D^{J_k}\right)^T,    (3.17)

Y = \phi_{23}^k - \phi_{13}^{k+1} + \dot{P}^{J_k} u^{J_k},    (3.18)

W = \phi_{12}^k X,    (3.19)

Z = \phi_{21}^{k+1} X.    (3.20)
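The assembly step of Eqs. (3.9)-(3.20) amounts to a handful of matrix products and one small linear solve. The sketch below assumes 6x6 coefficient blocks stored in dictionaries; this data layout is chosen here for illustration and is not taken from the references:

```python
import numpy as np

def assemble(phi_k, phi_k1, D, Pdot_u):
    """Two-handle assembly of bodies k and k+1 (Eqs. 3.11-3.20).

    phi_k, phi_k1 : dicts with keys '11','12','21','22' (6x6 blocks)
                    and '13','23' (6-vectors) -- an assumed layout
    D             : 6xm orthogonal complement of the joint free-motion map
    Pdot_u        : the 6-vector Pdot^{Jk} u^{Jk} from Eq. (3.8)
    """
    M = D.T @ (phi_k['22'] + phi_k1['11']) @ D
    X = D @ np.linalg.solve(M, D.T)               # Eq. (3.17)
    Y = phi_k['23'] - phi_k1['13'] + Pdot_u       # Eq. (3.18)
    W = phi_k['12'] @ X                           # Eq. (3.19)
    Z = phi_k1['21'] @ X                          # Eq. (3.20)
    return {
        '11': phi_k['11'] - W @ phi_k['21'],      # Eq. (3.11)
        '12': W @ phi_k1['12'],                   # Eq. (3.12)
        '13': phi_k['13'] - W @ Y,                # Eq. (3.13)
        '21': Z @ phi_k['21'],                    # Eq. (3.14)
        '22': phi_k1['22'] - Z @ phi_k1['12'],    # Eq. (3.15)
        '23': phi_k1['23'] + Z @ Y,               # Eq. (3.16)
    }
```

For a locked joint, D^{Jk} becomes the 6x6 identity and X reduces to the plain matrix inverse of Eq. (3.21).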
The DCA is implemented in two main passes, assembly and disassembly, as shown in Fig. 3.10. The assembly process starts at the level of the individual bodies (the leaf nodes of a binary tree) by coupling pairs of adjacent bodies to form assemblies. Proceeding in this manner, one can recursively eliminate both the unknown constraint loads and the joint accelerations at the common joints of consecutive bodies/assemblies to form the two-handle equations of the resulting assemblies.

[Figure 3.10: binary tree with leaf nodes 1 through 8 coupled pairwise into assemblies 1-2, 3-4, 5-6, 7-8, then into 1-2-3-4 and 5-6-7-8, and finally into the root node 1-2-3-4-5-6-7-8; in panel (B), the disassembly of the three new (red) assemblies stops one level above the leaves.]

Figure 3.10 Adaptive framework in the DCA scheme. Red joints are locked, and red assemblies are composed of consecutive bodies whose connecting joint is locked: (A) the assembly process couples pairs of adjacent bodies/subassemblies to form new assemblies until the all-encompassing node is reached, and (B) the disassembly process finds the spatial constraint forces and the generalized speeds at the common joints.

This process works hierarchically, exploiting the structure of a binary tree. At the end of the hierarchic assembly process, the whole articulated system may be modeled in terms of the two-handle equations of motion of a single, all-encompassing assembly. Different types of boundary conditions may be applied to this root node of the binary tree: each boundary of the system may be free floating, constrained by a kinematic joint, or subject to prescribed motion. The disassembly process starts by applying the boundary conditions to the system and solving for the unknown spatial constraint forces and/or accelerations at the terminal handles of the whole system; this problem is addressed in detail in Mukherjee and Anderson (2007c). These known values of the constraint loads and accelerations of the terminal handles are then substituted into the two-handle equations of the associated subassemblies (Eqs. 3.4-3.7) to find the spatial constraint forces and accelerations at the common joints of the subassemblies. This process is repeated in a hierarchic disassembly of the binary tree, where the known boundary conditions are used to solve the two-handle equations of the subassemblies, until the constraint loads and the generalized speeds of all bodies in the system are determined.

In performing the adaptive modeling of the dynamics of the system in the DCA framework, if, based on the values of the moving-window standard deviation of the joint angles, it is desired to lock (remove) a joint of the system, the associated orthogonal complement of the joint free-motion map becomes the identity matrix. As such, Eq. (3.17) simplifies to

X = \left[ \phi_{22}^k + \phi_{11}^{k+1} \right]^{-1}.    (3.21)
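The coarsening metric mentioned above (lock a joint when the moving-window standard deviation of its angle stays small) can be sketched as follows. The window length and cutoff here are illustrative assumptions; their tuning is problem dependent and is not specified in the text:

```python
import math

def moving_window_std(angles, window):
    """Standard deviation of the most recent `window` samples of a
    joint-angle time series."""
    tail = angles[-window:]
    mean = sum(tail) / len(tail)
    return math.sqrt(sum((a - mean) ** 2 for a in tail) / len(tail))

def should_lock(angles, window=50, cutoff=1e-3):
    """Flag a joint as a locking candidate when its angle has been
    nearly constant over the window (values are illustrative)."""
    return len(angles) >= window and moving_window_std(angles, window) < cutoff
```

A nearly constant angle history triggers the lock; an actively moving joint does not.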
If, based on monitoring the values of the constraint loads, it is deemed necessary to change the definition of a joint within the adaptive modeling, the only change that occurs at the leaf level is the change in the joint free-motion map P^{J_k} of the associated joint. For instance, consider the current coarse model shown in Fig. 3.11A, composed of eight bodies. All the connecting joints are revolute, except the last one, shown in red, which is locked. The desired model is formed by releasing the joint between bodies 7 and 8 due to the corresponding high strain. Additionally, in this model, the joints between bodies 1 and 2, 3 and 4, and 5 and 6 are to be locked, as shown in Fig. 3.11B, because they are determined to make an insignificant contribution to the overall conformation of the system. In this case, the spatial joint free-motion maps of the locked joints become zero matrices. Consequently, the associated orthogonal complement of the joint free-motion map is replaced by the identity matrix if expressed in the joint coordinate system. For the revolute joints of the system, the spatial joint free-motion map expressed in the joint basis becomes
Figure 3.11 Illustration of transitioning between coarse models; red joints are locked. (A) Current coarse model, in which a high strain is measured in the joint between bodies 7 and 8. (B) New coarse model, where the joint between bodies 7 and 8 is released and the joints between bodies 1 and 2, 3 and 4, and 5 and 6 are locked.
P^{J_k} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.    (3.22)
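The revolute-joint free-motion map of Eq. (3.22), its orthogonal complement given in Eq. (3.23) below, and the locked-joint substitution can be checked numerically with a small sketch:

```python
import numpy as np

# Revolute joint expressed in the joint basis: one free mode (rotation
# about the first axis), Eq. (3.22), and its orthogonal complement,
# Eq. (3.23), which spans the remaining five constrained directions.
P = np.zeros((6, 1))
P[0, 0] = 1.0

D = np.zeros((6, 5))
D[1:, :] = np.eye(5)

assert np.allclose(D.T @ P, 0.0)   # complement property: D^T P = 0

# Locking the joint replaces P by a zero map and the orthogonal
# complement by the 6x6 identity (in the joint coordinate system),
# which reduces Eq. (3.17) to Eq. (3.21).
D_locked = np.eye(6)
```

The same construction generalizes to other joint types by changing which columns of the 6x6 identity are assigned to P versus D.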
As such, the associated orthogonal complement of the joint free-motion map is expressed as

D^{J_k} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}.    (3.23)

The assembly and disassembly processes are performed as described before, using the appropriate matrices characterizing the joints of the desired coarse model. In these processes, the subassemblies shown in red in Fig. 3.10 are those composed of consecutive bodies whose connecting joint is locked. For the new coarse model, the disassembly process of the three new bodies stops one level above the leaf level of the binary tree, as shown in Fig. 3.10B, since all the associated spatial constraint forces and generalized speeds are already available at this level. However, the process continues to the leaf level to determine the unknown values of the constraint loads and generalized speeds associated with the common joint of bodies 7 and 8. Any violation of the conservation of the generalized momentum of the system in the transition between different models leads to nonphysical results, since the instantaneous switch in the system model definition is
incurred without the influence of any external forces. In other words, the momentum of each differential element, projected onto the space of admissible motions permitted by the more restrictive model (whether pre- or post-transition) and integrated over the entire system, must be conserved across the model transition (Kane and Levinson, 1985). The jumps in the system partial velocities (free modes of motion; Roberson and Schwertassek, 1988) due to the sudden change in the model resolution result in jumps in the generalized speeds corresponding to the new set of degrees of freedom. Therefore, the adaptive framework must be equipped with the machinery to provide the momentum balance equations. The formation of the impulse-momentum equations within the transitions can also be performed in a divide-and-conquer scheme (Mukherjee and Anderson, 2007a). As in the DCA, the DCA-based generalized momentum assembly-disassembly procedures are performed on the two-handle impulse-momentum equations of each body as follows (Mukherjee and Anderson, 2007a):

\Delta V_1^k = \Phi_{11}^k \int_{t^-}^{t^+} \mathcal{F}_{1c}^k \, dt + \Phi_{12}^k \int_{t^-}^{t^+} \mathcal{F}_{2c}^k \, dt + \Phi_{13}^k,    (3.24)

\Delta V_2^k = \Phi_{21}^k \int_{t^-}^{t^+} \mathcal{F}_{1c}^k \, dt + \Phi_{22}^k \int_{t^-}^{t^+} \mathcal{F}_{2c}^k \, dt + \Phi_{23}^k,    (3.25)
where \Phi_{ij}^k is an intermediate known term. The terms \Delta V_i^k (i = 1, 2) denote the jumps in the spatial velocities of the ith handle of body k. The spatial impulsive constraint forces acting on body k at its inward (parent) and outward (child) handles within the instantaneous transition time period from t^- to t^+ are represented by \int_{t^-}^{t^+} \mathcal{F}_{1c}^k \, dt and \int_{t^-}^{t^+} \mathcal{F}_{2c}^k \, dt, respectively. Since the impulses associated with the external loads, as well as the Coriolis and centripetal terms, can be ignored, the terms \Phi_{13}^k and \Phi_{23}^k do not appear in the equations governing the impulse-momentum of the individual bodies at the leaf level of the binary tree. Additionally, at this level of the binary tree, for each body, the coefficients of the spatial constraint forces and of the impulsive constraint forces, which are essentially functions of the inertia and the geometry, are the same; in other words,

\Phi_{ij}^k = \phi_{ij}^k, \quad i, j = 1, 2.    (3.26)
As such, setting up the generalized momentum balance equations is fast and easy. This algorithm is also capable of efficiently providing the required equations for the knowledge-, math-, or physics-based optimization problem in the transitions from the coarse models to the finer models as explained in detail in Anderson and Poursina (2009b).
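At the leaf level, the impulse-momentum relations collapse to two matrix-vector products per body: Eqs. (3.24)-(3.25) with the applied-load and Coriolis/centripetal impulse terms dropped, and the coefficients given by Eq. (3.26). A sketch, reusing an assumed dictionary-of-blocks layout for the phi coefficients (an illustrative convention, not from the references):

```python
import numpy as np

def leaf_velocity_jumps(phi, imp_1c, imp_2c):
    """Jumps in the spatial velocities of the two handles of a leaf-level
    body across an instantaneous model transition, given the impulsive
    (time-integrated) constraint forces at its inward and outward handles.
    `phi` holds the 6x6 blocks '11','12','21','22' (Phi_ij = phi_ij at
    the leaf level, Eq. 3.26)."""
    dV1 = phi['11'] @ imp_1c + phi['12'] @ imp_2c   # Eq. (3.24), leaf level
    dV2 = phi['21'] @ imp_1c + phi['22'] @ imp_2c   # Eq. (3.25), leaf level
    return dV1, dV2
```

These leaf-level relations are what the assembly-disassembly momentum passes operate on during a model transition.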
5. Conclusions

We have addressed different issues associated with adaptive coarse grain modeling of biomolecular systems using articulated multibody dynamics concepts. The simulations of different types of RNAs demonstrate that a static (time-invariant) coarse graining of the model is not likely to provide appropriate information about the structural behavior of the system over the entire course of the simulation, owing to the nonlinearities embedded in the system's dynamics. Therefore, an adaptive molecular modeling and simulation framework is needed. The adaptive algorithm should be capable of identifying the critical locations in the system at which to remove and add degrees of freedom, or to change model types and definitions, as necessary. In this chapter, we have suggested the moving-window standard deviation of the joint angles and the values of the constraint loads as metrics to guide the transitions to coarser and finer models, respectively. Since the system is treated as an articulated multibody system, an efficient algorithm is necessary for the modeling. The DCA has been proposed as a convenient technique for the efficient modeling of the system. This method is highly modular and, as such, lends itself well to adaptivity and massive parallelization. The computational complexity of the method is O(n) in serial and O(log(n)) in parallel implementations, where n is the number of degrees of freedom of the system.
ACKNOWLEDGMENTS This work was supported through the NSF award No. CMMI-0757936 to Kurt Anderson and in part through the US National Institutes of Health grant R00 GM079953 (NIGMS) to Alain Laederach. The authors would like to thank the funding agencies and also show their gratitude to Dr. Russ Altman and Michael Sherman from Simbios group at Stanford University for their help in this effort.
REFERENCES

Anderson, K. S., and Poursina, M. (2009a). Energy concern in biomolecular simulations with transition from a coarse to a fine model. Proceedings of the Seventh International Conference on Multibody Systems, Nonlinear Dynamics and Control, ASME Design Engineering Technical Conference 2009 (IDETC09), IDETC2009/MSND-87297, San Diego, CA.
Anderson, K. S., and Poursina, M. (2009b). Optimization problem in biomolecular simulations with DCA-based modeling of transition from a coarse to a fine fidelity. Proceedings of the Seventh International Conference on Multibody Systems, Nonlinear Dynamics and
Control, ASME Design Engineering Technical Conference 2009 (IDETC09), IDETC2009/MSND-87319, San Diego, CA.
Chen, S. J. (2008). RNA folding: Conformational statistics, folding kinetics, and ion electrostatics. Annu. Rev. Biophys. 37, 197–214.
Chun, H. M., Padilla, C. E., Chin, D. N., Watenabe, M., Karlov, V. I., Alper, H. E., Soosaar, K., Blair, K. B., Becker, O. M., Caves, L. S. D., Nagle, R., Haney, D. N., et al. (2000). MBO(N)D: A multibody method for long-time molecular dynamics simulations. J. Comput. Chem. 21, 159–184.
Dill, K. A., Ozkan, S. B., Shell, M. S., and Weikl, T. R. (2008). The protein folding problem. Annu. Rev. Biophys. 37, 289–316.
Featherstone, R. (1999a). A divide-and-conquer articulated body algorithm for parallel O(log(n)) calculation of rigid body dynamics. Part 1: Basic algorithm. Int. J. Rob. Res. 18, 867–875.
Featherstone, R. (1999b). A divide-and-conquer articulated body algorithm for parallel O(log(n)) calculation of rigid body dynamics. Part 2: Trees, loops, and accuracy. Int. J. Rob. Res. 18, 876–892.
Flores, S., Wan, Y., Russell, R., and Altman, R. B. (2010). Predicting RNA structure by multiple template homology modeling. Proceedings of the Pacific Symposium on Biocomputing, 216–227.
Grundy, F., and Henkin, T. (2006). From ribosome to riboswitch: Control of gene expression in bacteria by RNA structural rearrangements. Crit. Rev. Biochem. Mol. Biol. 41, 329–338.
Guo, F., and Cech, T. (2002). Evolution of Tetrahymena ribozyme mutants with increased structural stability. Nat. Struct. Biol. 9, 855–861.
Haile, J. (1992). Molecular Dynamics Simulation: Elementary Methods. Wiley Interscience, New York.
Hairer, E., and Wanner, G. (1996). Solving Ordinary Differential Equations II. Stiff and Differential-Algebraic Problems, 2nd ed. Springer Series in Computational Mathematics, Vol. 14. Springer-Verlag, New York.
Kane, T. R., and Levinson, D. A. (1985). Dynamics: Theory and Application. McGraw-Hill, New York.
Kent, O., Chaulk, S., and MacMillan, A. (2000). Kinetic analysis of the M1 RNA folding pathway. J. Mol. Biol. 304, 699–705.
Leach, A. R. (2001). Molecular Modelling: Principles and Applications, 2nd ed. Prentice Hall, Harlow, England.
Lebrun, A., and Lavery, R. (1998). Modeling the mechanics of a DNA oligomer. J. Biomol. Struct. Dyn. 16, 593–604.
Mandziuk, M., and Schlick, T. (1995). Resonance in the dynamics of chemical systems simulated by the implicit-midpoint scheme. Chem. Phys. Lett. 237, 525–535.
Mukherjee, R. M., Crozier, P. S., Plimpton, S. J., and Anderson, K. S. (2007). Substructured molecular dynamics using multibody dynamics algorithms. Int. J. Non-Linear Mech. 43(10), 1040–1055.
Mukherjee, R. M., and Anderson, K. S. (2007a). Efficient methodology for multibody simulations with discontinuous changes in system definition. Multibody Syst. Dyn. 18, 145–168.
Mukherjee, R. M., and Anderson, K. S. (2007b). A logarithmic complexity divide-and-conquer algorithm for multi-flexible articulated body systems. J. Comput. Nonlinear Dyn. 2, 10–21.
Mukherjee, R. M., and Anderson, K. S. (2007c). An orthogonal complement based divide-and-conquer algorithm for constrained multibody systems. Nonlinear Dyn. 48, 199–215.
Nyberg, A., and Schlick, T. (1992). Increasing the time step in molecular dynamics. Chem. Phys. Lett. 198, 538–546.
Parisien, M., and Major, F. (2008). The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 452, 51–55.
Peskin, C. S., and Schlick, T. (1989). Molecular dynamics by the backward Euler's method. Commun. Pure Appl. Math. 42, 1001–1031.
Poursina, M., Bhalerao, K. D., and Anderson, K. S. (2009). Energy concern in biomolecular simulations with discontinuous changes in system definition. Proceedings of the ECCOMAS Thematic Conference on Multibody Systems Dynamics, Warsaw, Poland.
Praprotnik, M., Site, L., and Kremer, K. (2005). Adaptive resolution molecular-dynamics simulation: Changing the degrees of freedom on the fly. J. Chem. Phys. 123, 224106–224114.
Rázga, F., Spackova, N., Réblová, K., Koca, J., Leontis, N., and Sponer, J. (2004). Ribosomal RNA kink-turn motif: A flexible molecular hinge. J. Biomol. Struct. Dyn. 22, 183–194.
Rázga, F., Zacharias, M., Réblová, K., Koca, J., and Sponer, J. (2006). RNA kink-turns as molecular elbows: Hydration, cation binding, and large-scale dynamics. Structure 14, 825–835.
Redon, S., and Lin, M. C. (2006). An efficient, error-bounded approximation algorithm for simulating quasi-statics of complex linkages. Comput. Aided Des. 38, 300–314.
Roberson, R. E., and Schwertassek, R. (1988). Dynamics of Multibody Systems. Springer-Verlag, Berlin.
Rossi, R., Isorce, M., Morin, S., Flocard, J., Arumugam, K., Crouzy, S., Vivaudou, M., and Redon, S. (2007). Adaptive torsion-angle quasi-statics: A general simulation method with applications to protein structure analysis and design. ISMB/ECCB (Supplement of Bioinformatics), 408–417.
Scheraga, H. A., Khalili, M., and Liwo, A. (2007). Protein-folding dynamics: Overview of molecular simulation techniques. Annu. Rev. Phys. Chem. 58, 57–83.
Schlick, T., and Peskin, C. S. (1989). Can classical equations simulate quantum-mechanical behavior? A molecular dynamics investigation of a diatomic molecule with a Morse potential. Commun. Pure Appl. Math. 42, 1141–1163.
Schlick, T., and Peskin, C. (1995). Comment on: The evaluation of LI and LIN for dynamics simulations. J. Chem. Phys. 103, 9888–9889.
Schmidt, J. P., Delp, S. L., Sherman, M. A., Taylor, C. A., Pande, V. S., and Altman, R. B. (2008). The Simbios National Center: Systems biology in motion. Proc. IEEE 96, 1266–1280, special issue on Computational System Biology.
Schroeder, R., Barta, A., and Semrad, K. (2004). Strategies for RNA folding and assembly. Nat. Rev. Mol. Cell Biol. 5, 908–919.
Shcherbakova, I., Mitra, S., Laederach, A., and Brenowitz, M. (2008). Energy barriers, pathways, and dynamics during folding of large, multidomain RNAs. Curr. Opin. Chem. Biol. 12, 655–666.
Tucker, B., and Breaker, R. (2005). Riboswitches as versatile gene control elements. Curr. Opin. Struct. Biol. 15, 342–348.
Wang, J., Cieplak, P., and Kollman, P. A. (2000). How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J. Comput. Chem. 21, 1049–1074.
Woodson, S. (2002). Recent insights on RNA folding mechanisms from catalytic RNA. Cell. Mol. Life Sci. 57, 796–808.
Ying, S., and Lin, S. (2006). Current perspectives in intronic microRNAs (miRNAs). J. Biomed. Sci. 13, 5–15.
Zaug, A., Grosshans, C., and Cech, T. (1998). Sequence-specific endoribonuclease activity of the Tetrahymena ribozyme: Enhanced cleavage of certain oligonucleotide substrates that form mismatched ribozyme-substrate complexes. Biochemistry 27, 8924–8931.
Zhang, G., and Schlick, T. (1993). LIN: A new algorithm combining implicit integration and normal mode techniques for molecular dynamics. J. Comput. Chem. 14, 1212–1233.
CHAPTER FOUR
Modeling Loop Entropy

Gregory S. Chirikjian

Contents
1. Introduction 100
  1.1. Literature review 101
  1.2. Statistical mechanics 103
  1.3. Mathematics review 107
2. Computing Bounds on the Entropy of the Unfolded Ensemble 113
  2.1. End-to-end position and orientation distributions and the Cartesian conformational entropy of serial polymer chains 113
  2.2. Modeling excluded volume effects 116
  2.3. Bounding Cartesian conformational entropy 118
3. Approximating Entropy of the Loops in the Folded Ensemble 119
4. Examples 120
  4.1. Model 1: Long loops modeled as Gaussian chains 120
  4.2. Model 2: Short loops modeled as semiflexible polymers 122
  4.3. From covariance matrices to entropy 124
5. Conclusions 127
Acknowledgments 128
References 128
Abstract

Proteins fold from a highly disordered state into a highly ordered one. Traditionally, the folding problem has been stated as one of predicting "the" tertiary structure from sequential information. However, new evidence suggests that the ensemble of unfolded forms may not be as disordered as once believed, and that the native form of many proteins may not be described by a single conformation, but rather an ensemble of its own. Quantifying the relative disorder in the folded and unfolded ensembles as an entropy difference may therefore shed light on the folding process. One issue that clouds discussions of "entropy" is that many different kinds of entropy can be defined: entropy associated with overall translational and rotational Brownian motion, configurational entropy, vibrational entropy, conformational entropy computed in internal or Cartesian coordinates (which can even be different from each other), conformational entropy computed on a lattice, each of the above with different solvation and solvent models, thermodynamic entropy measured experimentally, etc. The focus of this work is the conformational entropy of coil/loop regions in proteins. New mathematical modeling tools for the approximation of changes in conformational entropy during transition from unfolded to folded ensembles are introduced. In particular, models for computing lower and upper bounds on entropy for polymer models of polypeptide coils both with and without end constraints are presented. The methods reviewed here include kinematics (the mathematics of rigid-body motions), classical statistical mechanics, and information theory.

Department of Mechanical Engineering, Johns Hopkins University, Baltimore, Maryland, USA

Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87004-7. © 2011 Elsevier Inc. All rights reserved.
1. Introduction

In a classic experiment, Anfinsen observed the spontaneous and repeatable folding of a protein from a highly disordered state into a highly ordered one (Anfinsen, 1973). From this result and others that followed, it has been inferred over the years that similar processes work for wide classes of proteins. But exactly how unstructured is the unfolded/denatured state? And how structured is the native state? New evidence suggests that the ensemble of unfolded forms may not be as disordered as once believed and that the native form may not be as rigid as one might expect. In this light, protein folding is a transformation of a high-conformational-entropy ensemble into a lower-entropy one. But how high is high, and how low is low? To answer such questions, some new mathematical and computer models will be helpful. Therefore, new mathematical tools for the approximation of conformational entropy in the unfolded and folded ensembles are introduced here. A number of related tools already exist in other fields. These are reviewed, adapted, and developed further. In particular, lower and upper bounds on entropy are derived for polymer models of polypeptide chains, both with and without constraints on the positions and orientations of the ends. The methods reviewed here include kinematics (the mathematics of rigid-body motions as studied in the field of Robotics), information theory, and functional analysis on Lie groups (which, in part, considers how probability density functions of group-valued argument combine and propagate). In particular, we attach reference frames to polypeptide chains, as shown in Fig. 4.1, where the origin of the ith frame is located at the ith Ca atom with a unique orientation defined by the Ca-C' bond and the plane defined by the Ca-C'=O atoms, as in Lee and Chirikjian (2005). We use the distributions of relative motion between consecutive residues to characterize backbone conformational entropy.
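The frame-attachment convention just described can be sketched numerically. The axis assignments below are an illustrative reading of the convention (origin at the C-alpha atom, one axis along the Ca-C' bond, another normal to the Ca-C'=O plane); see Lee and Chirikjian (2005) for the authoritative definition:

```python
import numpy as np

def ca_frame(ca, c_prime, o):
    """Attach a right-handed frame to a residue: origin at the C-alpha
    atom, x-axis along the Ca-C' bond, z-axis normal to the plane of
    the Ca, C', and O atoms, and y = z x x. The specific axis labels
    are an illustrative assumption, not the exact published convention."""
    x = c_prime - ca
    x = x / np.linalg.norm(x)
    z = np.cross(x, o - ca)
    z = z / np.linalg.norm(z)
    y = np.cross(z, x)
    return np.column_stack([x, y, z]), ca   # rotation matrix and origin
```

The relative pose between consecutive residues i and i+1 is then the rigid-body motion (R_i^T R_{i+1}, R_i^T (p_{i+1} - p_i)), whose statistics over an ensemble characterize the backbone conformational entropy.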
Side-chain motions are computed relative to these reference frames, and we show how to compute the associated side-chain entropy. These new and powerful methods make it possible to approximate changes in entropy between relatively ordered and disordered states without using traditional sampling techniques.

Figure 4.1 Reference frames attached to a polypeptide chain: (left) dihedral angle definitions; (right) attaching frames to Ca atoms in a canonical way.

To summarize, the main contributions of this work are:
- A method for generating the distribution of relative positions and orientations of polymer-like polypeptide coils is presented, building on prior work in the Robotics literature.
- The distribution of end-constrained loop conformations is obtained from this information by applying Bayes' rule.
- Quantitative bounds on the associated loop entropy are derived, and the change in loop entropy resulting from constraining one end relative to the other is computed.
- The computational complexity of this approach is low enough that it can be implemented on a single-processor personal computer running standardized software such as Matlab.

The remainder of this work is structured as follows: A comprehensive review of the literature is provided in Section 1.1. This is followed by a review of the necessary concepts from Statistical Mechanics in Section 1.2, and of background mathematics in Section 1.3. Section 2 applies these techniques to compute lower and upper bounds on the entropy of the unfolded ensemble of polypeptide conformations. Section 3 develops bounds on the entropy of the folded ensemble. Section 4 demonstrates the methodology with some closed-form examples. Finally, Section 5 summarizes the results and maps out future directions.
1.1. Literature review

Protein folding is often viewed graphically as a funnel from the polymer-like ensemble of unfolded states to the native state (Bryngelson et al., 1995). Changes in backbone entropy between unfolded and native states have been measured experimentally (D'Aquino et al., 1996). And NMR has been
shown to be a useful tool for experimentally observing conformational fluctuations in proteins in general (Li et al., 1996; Palmer, 1997; Yang and Kay, 1996). A growing body of literature suggests that “the native state” of certain proteins may not be as ordered as once believed (Bracken et al., 2004; Dunker et al., 2001, 2005; Radivojac et al., 2004; Vucetic et al., 2005). On the other hand, recent studies suggest that the unfolded ensemble is not as disordered as once believed (Shortle and Ackerman, 2001), and that sequential interactions and sterics provide strong constraints on possible folding pathways (Baldwin and Rose, 1999; Fitzkee and Rose, 2004, 2005; Gong and Rose, 2005; Pappu et al., 2000). Furthermore, conformational entropy of the native ensemble is believed to play an important role in binding (Boehr et al., 2009; Frederick et al., 2007). For these reasons, the development of analytical and computational models of entropy in protein loops with and without end constraints provides a way to compare the relative amount of disorder in the folded and unfolded cases. Many statistical mechanical treatments of protein folding have been performed (e.g., Crippen, 2001, 2004; Dill et al., 1993; Wang and Crippen, 2004). In some studies, full chemical detail is used in molecular dynamics (Karplus and Weaver, 1976; Levitt, 1983), yet it appears that this level of detail may not be required for successful prediction of folding (Rhee and Pande, 2006). Furthermore, when computing statistical quantities such as entropy, sufficient data must be obtained in high-dimensional configuration or phase spaces in order to obtain robust results. This is almost impossible to do at a fully detailed level. Therefore, simplified statistical models such as those presented in this work may be useful. Furthermore, while the emphasis here is loop entropy in proteins, the methodology presented here can in principle be applied to RNA structures. 
Models of loop entropy in nucleic acids have been presented in Chirikjian (2010); Liu and Chen (2010); Zhang et al. (2008). The author’s original field is the kinematic geometry of snakelike (or “hyper-redundant”) robot arms with many degrees of freedom (Chirikjian and Burdick, 1992, 1994). A tool which is useful for the analysis of all positions and orientations reachable by the “gripper” at the distal end of this kind of arm is noncommutative harmonic analysis (Chirikjian and Kyatkin, 2001). This mathematical tool combines ideas from group theory and Fourier analysis (Gel’fand et al., 1963; Miller, 1968; Sugiura, 1990; Talman, 1968; Vilenkin and Klimyk, 1991), and can be used to compute convolutions and diffusions of functions on Lie groups, such as the rotation group or rigid-body motion group (Chirikjian and Kyatkin, 2001; Wang and Chirikjian, 2004). This is a particularly useful tool in the quantitative analysis of the distribution of all possible reachable gripper positions and orientations. Such quantities are quite similar to those encountered in polymer statistical mechanics. In polymer theory, distributions of relative end-to-end distance and orientation of backbone points and their tangents play central roles, as described in Birshtein and Ptitsyn (1966), Boyd and
Modeling Loop Entropy
Phillips (1993), de Gennes (1979), des Cloizeaux and Jannink (1990), Doi and Edwards (1986), Flory (1969), Grosberg and Khokhlov (1994), Mattice and Suter (1994), and Skliros and Chirikjian (2008). With the tool of noncommutative harmonic analysis, distributions in all six dimensions of rigid-body motion (three translational and three rotational) can be obtained, and marginals of these distributions can be taken to yield those that are commonly of interest in polymer physics (such as the distribution of end-to-end distances or relative orientations; Chirikjian, 2001). This approach has been taken by the author in a series of papers, particularly concerned with semiflexible polymers in which there is internal bending and torsional stiffness (Chirikjian and Kyatkin, 2000; Chirikjian and Wang, 2000). The case of statistical distributions when semiflexible polymers have internal joints and rigid bends has also been addressed using these methods (Zhou and Chirikjian, 2003, 2006). And it has been shown that this method can be applied to more general polymers, including unfolded polypeptide chains (Kim and Chirikjian, 2005). Similar tools can be used to analyze large amounts of geometric data in the protein data bank (Berman et al., 2000), such as statistics of helix–helix crossing angle (Lee and Chirikjian, 2004) and the relative pose (position and orientation) between alpha carbons in proteins (Chirikjian, 2001; Lee and Chirikjian, 2005). Of course, the author is not the only (and not even the first) member of the robotics community to attempt to transfer theoretical and computational tools from that field to study structural biology and biophysical phenomena. Lozano-Perez and coworkers have applied methods from robot motion planning and artificial intelligence to a number of problems in structural biology and rational drug design (Rienstra et al., 2002; Wang et al., 1998).
Latombe and his students have applied methods from robot motion planning to explore configuration spaces and do energy minimization in the context of protein structures (Hsu et al., 1999; Lavalle et al., 2000; Lotan et al., 2004). These build on the method of probabilistic roadmaps (Kavraki et al., 1996). Amato and coworkers (Amato and Song, 2002; Amato et al., 2003; Tang et al., 2005; Thomas et al., 2005) and Kavraki (Das et al., 2006; Shehu et al., 2006; Teodoro et al., 2001; Zhang et al., 2005) have been leaders in the application of robotics techniques in computational biology. The cyclic descent algorithm for robot kinematics has been applied to protein loops (Canutescu and Dunbrack, 2003), as have other methods from kinematics (Kazerounian, 2004; Kazerounian et al., 2005; Manocha et al., 1995). Of late, it has been fashionable in engineering to consider proteins as examples of molecular machines (Kim et al., 2005; Mavroidis et al., 2004).
1.2. Statistical mechanics

In classical equilibrium statistical mechanics, the Boltzmann distribution is defined as
\[
f(p,q) = \frac{1}{Z} \exp(-\beta H(p,q)), \tag{4.1}
\]
where the partition function is defined as
\[
Z = \int_q \int_p \exp(-\beta H(p,q)) \, dp \, dq. \tag{4.2}
\]
Here, $\beta = 1/k_B T$ ($k_B$ is the Boltzmann constant and $T$ is the temperature in kelvin), $p_i = \mathbf{p} \cdot \mathbf{e}_i$ is the momentum conjugate to the $i$th generalized coordinate $q_i = \mathbf{q} \cdot \mathbf{e}_i$, $H$ is the Hamiltonian for the system, and $dp\,dq = dp_1 \cdots dp_N \, dq_1 \cdots dq_N$ for a system with $N$ degrees of freedom. The range of integration is over all possible states of the system. The Boltzmann distribution describes the probability density of all states of a system at equilibrium. The full set of generalized coordinates, $\{q\}$, describes the configuration of the system, which includes the overall rigid-body motion and the intrinsic structural degrees of freedom. These intrinsic degrees of freedom can be further broken down into "hard" degrees of freedom such as bond angles and bond lengths that do not vary substantially from referential values, and "soft" degrees of freedom such as torsion angles that can vary widely. The hard degrees of freedom describe vibrational states and the soft degrees of freedom describe conformational changes, that is, motions due to rotations around covalent chemical bonds. While the words "configuration" and "conformation" are often used interchangeably in the literature, the distinction between them as defined above is important in this work. For any classical mechanical system, the Hamiltonian is of the form
\[
H(p,q) = \frac{1}{2} p^T M^{-1}(q)\, p + V(q), \tag{4.3}
\]
where $V(q)$ is the potential energy and $M(q)$ is the mass matrix (also called the mass metric tensor; Patriciu et al., 2004). The Gibbs formula for the entropy of an ensemble described by $f(p,q)$ is
\[
S = -k_B \int_p \int_q f(p,q) \log f(p,q) \, dp \, dq. \tag{4.4}
\]
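As a concrete numerical illustration of Eq. (4.4) (not part of the original text), the following sketch evaluates the Gibbs integral for a one-dimensional Gaussian and compares it against the closed-form differential entropy $\frac{1}{2}\log(2\pi e \sigma^2)$; units are chosen so that $k_B = 1$ and entropy is in nats. The grid and the value of $\sigma$ are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of Gibbs' formula (4.4) for a 1D Gaussian, with kB = 1
# so that entropy is in nats. The closed-form differential entropy of a
# Gaussian is 0.5*log(2*pi*e*sigma^2). sigma and the grid are arbitrary
# illustrative choices, not values from the chapter.

kB = 1.0
sigma = 0.7
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
f = np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

S_numeric = -kB * np.sum(f * np.log(f)) * dx          # Riemann sum of -f log f
S_exact = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

print(abs(S_numeric - S_exact) < 1e-4)   # True
```

Note that for $\sigma < (2\pi e)^{-1/2}$ this continuous entropy is negative, consistent with the remarks on continuous entropy that follow.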
Mathematically, “continuous entropy” as defined above can take on negative values (and the entropy in the limiting case of a Dirac delta function goes to negative infinity). As explained in Chirikjian (2009), this is very different than discrete entropy. Physically, continuum theory and classical mechanics break down at very small scales in phase space. By definition, a discretization of phase space is chosen such that S ¼ 0 corresponds to the
most ordered system that is physically possible, which is when all the states in an ensemble are contained in the same smallest possible element of discretized phase space. This is not the same as discretizing conformational space on a coarse lattice, as is often done in polymer simulations. The effects of discretization of continuous entropy are discussed in Chirikjian (2009). As a practical matter, there are several limitations in using Eq. (4.4) as a computational tool. First, there is some debate about what molecular potentials to use. On the one hand, the accuracy of ab initio potentials derived from first principles for small molecules and then applied to macromolecular simulations can be questioned. On the other hand, the accuracy of statistical potentials derived from structural data is limited by the richness of the databases from which they are extracted. For different perspectives on this debate, see Fang and Shortle (2005), Jernigan and Bahar (1996), Kortemme et al. (2003), Lazaridis and Karplus (1999), Moult (1997), and Vajda et al. (1997). Second, the number of degrees of freedom in macromolecules is so high (many thousands for a protein in continuum solvent, and perhaps millions when including explicit solvent degrees of freedom) that it is not possible to approximate $f(p,q)$ with any degree of fidelity. (If the number of sample values required to accurately estimate a pdf in one degree of freedom is $K$, then one would expect to need $K^{2N}$ samples to approximate a pdf in a $2N$-dimensional phase space.) If $K$ is on the order of 10–100 and $N$ ranges from thousands to millions, this is clearly intractable. One way to circumvent this problem is to compute only marginals of the full Boltzmann distribution, which, as explained below, allows one to establish bounds on the true value of entropy. Due to the structure of the Hamiltonian (Eq.
(4.3)), it is easy to see that in general the Boltzmann distribution cannot be separated into a product of configurational and momentum distributions, $f(p,q) \neq f_p(p) f_q(q)$ (due to the dependence of the mass matrix on configuration), and so the thermodynamic entropy is bounded by the entropies of each marginal as¹ $S \leq S_p + S_q$, where the configurational entropy is
\[
S_q = -k_B \int_q f_q(q) \log f_q(q) \, dq, \tag{4.5}
\]
and it is often assumed that $S_p$ is constant. In fact, when the generalized coordinates are the Cartesian coordinates of the positions of all atoms in a macromolecule, so that $q$ becomes the $3n$-dimensional vector of all such positions, denoted here as $x = (x_1^T, \ldots, x_n^T)^T$, then $f(p,x) = f_p(p) f_x(x)$ and $S = S_p + S_x$. Furthermore, in this special case, the mass matrix is diagonal and constant, $M(x) = M_0$, and
¹ Using results from information theory (Chirikjian, 2009; Shannon, 1948).
\[
f_p(p) = \frac{1}{Z_p} \exp\left( -\frac{1}{2} p^T M_0^{-1} p / k_B T \right),
\]
and so
\[
S_p = k_B \log\left\{ (2\pi e k_B T)^{3n/2} |M_0|^{1/2} \right\} \tag{4.6}
\]
is in fact constant at constant temperature, without having to assume anything. It follows that $\Delta S = \Delta S_x$, which is not necessarily true for general choices of coordinates, including dihedral angles. Under a change of coordinates $x = x(q)$, it is generally the case that $S_x \neq S_q$ because the computation of
\[
S_x = -k_B \int_x f_x(x) \log f_x(x) \, dx \tag{4.7}
\]
in an alternative coordinate system (such as dihedral angles) becomes
\[
S_x = -k_B \int_q f_x(x(q)) \log f_x(x(q)) \, |\det J(q)| \, dq,
\]
which is not generally equal to $S_q$ unless $|\det J(q)| = 1$. Therefore, when referring to configurational entropy, it is important to distinguish between Cartesian configurational entropy and dihedral configurational entropy unless $|\det J(q)| = 1$. In some scenarios, it is convenient to subdivide the configurational degrees of freedom into the categories rigid body, hard, and soft, so that $q = (q_{rb}, q_{hard}, q_{soft})$. It can be shown that the determinants of the mass and Jacobian matrices for chain structures can be written as functions proportional to the form $w_1(q_{rb}) \, w_2(q_{hard})$. Similarly, it is a common modeling assumption that, for a system not subjected to an external force field and with sufficiently hard degrees of freedom,
\[
V(q_{rb}, q_{hard}, q_{soft}) = V_1(q_{hard}) + V_2(q_{soft}).
\]
Assumptions such as these lead to the separability of the partition function into a product, and the separability of entropy into a sum of terms:
\[
S = S_{rb} + S_{hard} + S_{soft}.
\]
Since the rigid-body term is the same for all ensembles of a given system in the same volume and temperature,
\[
\Delta S = \Delta S_{hard} + \Delta S_{soft}.
\]
We will focus on methods for computing Cartesian conformational entropy, $\Delta S_x$, using the concept of convolution on the rigid-body motion group. When the hard degrees of freedom are treated as rigid, $\Delta S_{hard} \to 0$ and $\Delta S_x \to \Delta S_{soft}$. In the remainder of this work, we will examine $S_x$ for (a) polymer-like ensembles with rotatable bonds and free ends and (b) polymer-like loop regions with end constraints.
1.3. Mathematics review

When considering models of polypeptide chains, it often will be convenient to treat parts of the chain as rigid. For example, the plane of the peptide bond can be considered rigid, as can a cluster of side-chain atoms such as a methyl group. At a coarser level, one might consider an alpha helix to be a rigid object. At a coarser level still, a whole domain might be approximated as a rigid body. Therefore, at various levels of detail, attaching reference frames to the rigid elements and recording the set of all possible rigid-body motions between these elements is a way to describe the conformational part of the Boltzmann distribution, and therefore to get at the conformational entropy via Gibbs' formula. In this section, a coordinate-free review of rigid-body motions is presented. More detailed reviews and comparisons of various parametrizations such as Euler angles and Cayley parameters can be found in Chirikjian and Kyatkin (2001).

1.3.1. Mathematics of rigid-body motion

The group of rigid-body motions, which is also called the special Euclidean group and is denoted SE(3), is the semidirect product of $(\mathbb{R}^3, +)$ (three-dimensional Euclidean space endowed with the operation of vector addition) with the special orthogonal group, SO(3), which consists of $3 \times 3$ rotation matrices together with the operation of matrix multiplication. In both instances, the word "special" means that reflections are excluded and only physically allowable isometries of three-dimensional space are allowed. We denote elements of SE(3) as $g = (a, A) \in SE(3)$, where $A \in SO(3)$ and $a \in \mathbb{R}^3$. For any $g = (a, A)$ and $h = (r, R) \in SE(3)$, the group law is written as $g \circ h = (a + Ar, AR)$, and $g^{-1} = (-A^T a, A^T)$. Alternately, one may represent any element of SE(3) as a $4 \times 4$ homogeneous transformation matrix of the form
\[
g = \begin{pmatrix} A & a \\ 0^T & 1 \end{pmatrix},
\]
in which case the group law is matrix multiplication. The bottom row in these matrices, which consists of three zeros (i.e., $0^T$ is the transposed, or row, vector corresponding to the column vector of zeros, $0$) and the number one, is a placeholder which ensures that the matrix multiplication reproduces the correct group operation. In the above matrix, $A \in SO(3)$ denotes rotations and $a \in \mathbb{R}^3$ denotes translations of a reference frame which, when attached to a rigid body, represent the motion of that body from the reference position and orientation defined by the identity element $e = (0, I)$. In Lie theory,² the exponential mapping from the Lie algebra to a corresponding Lie group plays an important role (Chirikjian and Kyatkin, 2001). In the current context, the Lie group of interest is SE(3), and the corresponding Lie algebra is se(3), which consists of all matrices formed by linear combinations of the following basis elements:
\[
E_1 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
E_2 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
E_3 = \begin{pmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},
\]
\[
E_4 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
E_5 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
E_6 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
\]
For small (infinitesimal) motions around the identity (null motion), $g \approx I + X \in SE(3)$, where $X \in se(3)$. However, for larger motions, this is not true. For those unfamiliar with this terminology, definitions and properties important to our formulation have been provided in the book Chirikjian and Kyatkin (2001). The essential thing to know is that elements of se(3) and SE(3) can both be viewed as $4 \times 4$ matrices; however, while it makes sense to add elements of se(3) (i.e., velocities add), it only makes sense to multiply elements of SE(3). Furthermore, by the matrix exponential mapping, it is
² Named after Norwegian mathematician Marius Sophus Lie (1842–1899).
possible to produce elements of SE(3) from those in se(3), and vice versa using the matrix logarithm:
\[
\exp : se(3) \to SE(3) \quad \text{and} \quad \log : SE(3) \to se(3).
\]
Figure 4.2(left) illustrates that the composition of rigid-body motions is not a commutative operation. Figure 4.2(right) shows the relationship between the Lie algebra se(3), consisting of infinitesimal motions (which form a linear vector space), and SE(3), consisting of large motions (which form a curved manifold, which is a Lie group). For small translational (rotational) displacements from the identity along (about) the $i$th coordinate axis, the homogeneous transforms representing infinitesimal motions look like
\[
\exp(\epsilon E_i) \approx I + \epsilon E_i, \tag{4.8}
\]
where $I$ is the $4 \times 4$ identity matrix, $|\epsilon| \ll 1$, and $\exp(X) = I + X + X^2/2 + \cdots$ is the matrix exponential defined by the Taylor series of the usual exponential function evaluated with a matrix rather than a scalar. For example,
\[
\exp(\theta E_3) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad
\exp(\theta E_5) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & \theta \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},
\]
Figure 4.2 (left) Rigid-body transformations between reference frames form a noncommutative Lie group ($g_1 \circ g_2 \neq g_2 \circ g_1$); (right) the exponential map.
and for small values, expanding $\sin\theta \approx \theta$ and $\cos\theta \approx 1$, it is easy to see that Eq. (4.8) holds for the example on the left. For the example on the right, Eq. (4.8) holds even for large values of $\theta$. The "exponential parametrization"
\[
g = g(w_1, w_2, \ldots, w_6) = \exp\left( \sum_{i=1}^{6} w_i E_i \right) \tag{4.9}
\]
is a useful way to describe relatively small rigid-body motions because, unlike the Euler angles, it does not have singularities near the identity. One defines the "vee" operator, $\vee$, such that for any
\[
X = \sum_{i=1}^{6} w_i E_i, \qquad X^\vee = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_6 \end{pmatrix}.
\]
The $6 \times 6$ adjoint matrix, $\mathrm{Ad}_g$, is defined by the expression
\[
\mathrm{Ad}_g\left( X^\vee \right) = \left( g X g^{-1} \right)^\vee,
\]
and explicitly, if $g = (a, A)$, then
\[
\mathrm{Ad}_g = \begin{pmatrix} A & 0 \\ a \times A & A \end{pmatrix},
\]
where $a \times A$ denotes the matrix resulting from the cross product of $a$ with each column of $A$. The vector of exponential parameters, $x \in \mathbb{R}^6$, can be obtained from $g \in G$ with the formula
\[
x = (\log g)^\vee. \tag{4.10}
\]
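The se(3) machinery above lends itself to a short numerical sketch (illustrative only; the function names are my own, not the chapter's). The code builds the basis $E_1, \ldots, E_6$, implements the hat/vee maps and a truncated-series matrix exponential, and checks the adjoint identity $\mathrm{Ad}_g(X^\vee) = (gXg^{-1})^\vee$ numerically.

```python
import numpy as np

# Numerical sketch of the se(3) machinery: the basis E1..E6, the hat/vee
# maps, a truncated-series matrix exponential, and a check of the adjoint
# identity Ad_g(X^vee) = (g X g^-1)^vee. Names are illustrative.

def E(i):
    """4x4 basis element E_i of se(3), i = 1..6 (rotations, then translations)."""
    M = np.zeros((4, 4))
    if i <= 3:
        a, b = [(1, 2), (2, 0), (0, 1)][i - 1]   # skew-symmetric rotation generator
        M[a, b], M[b, a] = -1.0, 1.0
    else:
        M[i - 4, 3] = 1.0                        # translation generator
    return M

def hat(w):
    """se(3) element X = sum_i w_i E_i from a 6-vector w."""
    return sum(w[i] * E(i + 1) for i in range(6))

def vee(X):
    """Extract (w1..w6) from X in se(3); inverse of hat."""
    return np.array([X[2, 1], X[0, 2], X[1, 0], X[0, 3], X[1, 3], X[2, 3]])

def expm(X, terms=40):
    """Matrix exponential by truncated Taylor series (adequate for small 4x4 X)."""
    out, term = np.eye(4), np.eye(4)
    for k in range(1, terms):
        term = term @ X / k
        out = out + term
    return out

def Ad(g):
    """6x6 adjoint of g = [[A, a], [0^T, 1]], block form [[A, 0], [a x A, A]]."""
    A, a = g[:3, :3], g[:3, 3]
    ahat = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.block([[A, np.zeros((3, 3))], [ahat @ A, A]])

rng = np.random.default_rng(0)
g = expm(hat(0.3 * rng.standard_normal(6)))      # a random modest motion
X = hat(rng.standard_normal(6))
lhs = Ad(g) @ vee(X)
rhs = vee(g @ X @ np.linalg.inv(g))
print(np.allclose(lhs, rhs))   # True
```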
The action of an element of the motion group, $g = (a, A)$, on a vector $x$ in three-dimensional space is defined as $g \cdot x = Ax + a$. In contrast, given a function $f(x)$, we can translate and rotate the function by $g$ as $f(g^{-1} \cdot x) = f(A^T(x - a))$. The fact that the inverse of the transformation applies under the function (rather than the transformation itself) in order to implement the desired motion is directly analogous to the case of translation
Figure 4.3 (left) Action of a motion on a function; (right) convolution of functions of rigid-body motion.
on the real line. For example, given a function on the real line, $f(x)$, with its mode at $x = 0$, if we want to translate the whole function in the positive $x$ direction by an amount $x_0$ so that the mode is at $x = x_0$, we compute $f(x - x_0)$ (not $f(x + x_0)$). This is a very important point to understand in order for the rest of this work to make sense. Figure 4.3(left) illustrates the shifting of a function under rigid-body motion geometrically.

1.3.2. Manipulations of functions of rigid-body motion

Suppose that three rigid bodies labeled 0, 1, and 2 are given, with reference frames attached to each, and assume that only sequentially adjacent bodies interact. Suppose also that body 0 is fixed in space and the ensemble of all possible motions of body 1 with respect to 0 are recorded, and motions of 2 with respect to 1 are also recorded. Then, we have two functions of motion, $f_{0,1}(g)$ and $f_{1,2}(g)$, which together describe the conformational variability of this simple system. If we are interested in knowing the probability distribution describing the ensemble of all possible ways that body 2 can move relative to body 0, how is this obtained? In fact, it is computed via the convolution on SE(3) (Chirikjian and Kyatkin, 2001):
\[
f_{0,2}(g) = (f_{0,1} * f_{1,2})(g) = \int_G f_{0,1}(h) \, f_{1,2}\left( h^{-1} \circ g \right) dh. \tag{4.11}
\]
What this says is that the distribution $f_{1,2}(g)$ is shifted through all possible rigid-body motions, $h$, weighted by the frequency of occurrence of these motions, $f_{0,1}(h)$, and integrated over all values of $h \in G$ ($G$ is just short for "Group," which throughout this work is the group of rigid-body motions, SE(3)). Figure 4.3(right) illustrates this geometrically.
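Before working on SE(3), it may help to see the analog of Eq. (4.11) on the simplest possible group, $(\mathbb{R}, +)$, where the group convolution reduces to ordinary convolution of densities and, for Gaussian steps, means and variances simply add. The sketch below (illustrative grid and parameters, not from the chapter) verifies this numerically:

```python
import numpy as np

# 1D analog of Eq. (4.11): on (R, +) the group convolution is ordinary
# convolution, so f02 = f01 * f12 composes the two step distributions, and
# Gaussian means/variances add. Grid and parameters are arbitrary.

x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]

def gauss(x, mu, var):
    return np.exp(-(x - mu)**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

f01 = gauss(x, 1.0, 0.5)     # body 1 relative to body 0
f12 = gauss(x, -0.5, 1.5)    # body 2 relative to body 1

# f02(y) = integral f01(h) f12(y - h) dh, evaluated on the same symmetric grid.
f02 = np.convolve(f01, f12, mode="same") * dx

mean = np.sum(x * f02) * dx
var = np.sum((x - mean)**2 * f02) * dx
print(round(mean, 2), round(var, 2))   # 0.5 2.0 (means and variances add)
```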
Explicitly, what is meant by this integral? Let us assume for the moment that rotations are parameterized using Euler angles. The range of the Euler angles is $0 \leq \alpha, \gamma \leq 2\pi$ and $0 \leq \beta \leq \pi$. In this parametrization, the volume element for $G$ is given by
\[
dg = \frac{1}{8\pi^2} \sin\beta \, d\alpha \, d\beta \, d\gamma \, dr_1 \, dr_2 \, dr_3,
\]
which is the product of the volume elements for $\mathbb{R}^3$ ($dr = dr_1 dr_2 dr_3$) and for SO(3) ($dR = \frac{1}{8\pi^2} \sin\beta \, d\alpha \, d\beta \, d\gamma$). The normalization factor in the definition of $dR$ is such that $\int_{SO(3)} dR = 1$. The volume element for SE(3) can also be expressed in the exponential coordinates described in Section 1.3.1, in which case
\[
dg = |J(x)| \, dw_1 \cdots dw_6,
\]
where $|J(x)|$ is a Jacobian determinant for this parametrization. The Jacobian can be computed using the formula
\[
J(x) = \left[ \left( g^{-1} \frac{\partial g}{\partial w_1} \right)^\vee, \ldots, \left( g^{-1} \frac{\partial g}{\partial w_6} \right)^\vee \right],
\]
and it can be shown that $|J(0)| = 1$, so that close to the identity the Jacobian factor in this parametrization can be ignored (which is not true for many other parametrizations, including the Euler angles). The fact that the volume element is invariant to right and left translations, that is,
\[
dg = d(h \circ g) = d(g \circ h),
\]
is well known in certain communities (see, e.g., Sugiura, 1990; Vilenkin and Klimyk, 1991). A convolution integral of the form in Eq. (4.11) can be written in the following equivalent ways:
\[
(f_{0,1} * f_{1,2})(g) = \int_G f_{0,1}\left( z^{-1} \right) f_{1,2}(z \circ g) \, dz = \int_G f_{0,1}\left( g \circ k^{-1} \right) f_{1,2}(k) \, dk, \tag{4.12}
\]
where the substitutions $z = h^{-1}$ and $k = h^{-1} \circ g$ have been made, and the invariance of integration under shifts and inversions is used. The concept of convolution on SE(3) will be central in the formulation that follows.
One can define a Gaussian distribution on the six-dimensional Lie group SE(3) much in the same way as is done on $\mathbb{R}^6$, provided that (1)
the covariances are small and (2) the mean is located at the identity. The reason for these conditions is that near the identity, SE(3) resembles $\mathbb{R}^6$, which means that $dg \approx dw_1 \cdots dw_6$, and we can define the Gaussian in the exponential parameters as
\[
f(g(x)) = \frac{1}{(2\pi)^3 |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} x^T \Sigma^{-1} x \right). \tag{4.13}
\]
Given two such distributions that are shifted as $f_{i,i+1}(g_{i,i+1}^{-1} \circ g)$, each with $6 \times 6$ covariance $\Sigma_{i,i+1}$, it can be shown that the mean and covariance of the convolution $f_{0,1}(g_{0,1}^{-1} \circ g) * f_{1,2}(g_{1,2}^{-1} \circ g)$ will be, respectively, of the form $g_{0,2} = g_{0,1} \circ g_{1,2}$ and (Wang and Chirikjian, 2008)
\[
\Sigma_{0,2} = \mathrm{Ad}_{g_{1,2}}^{-1} \Sigma_{0,1} \mathrm{Ad}_{g_{1,2}}^{-T} + \Sigma_{1,2}. \tag{4.14}
\]
This provides a method for computing covariances of two concatenated segments, and this formula can be iterated to compute covariances of chains without having to compute convolutions directly. This is demonstrated numerically in the context of robotic arms in Wang and Chirikjian (2008).
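A minimal numerical sketch of Eq. (4.14), with invented relative poses and covariances, is given below; the adjoint is built directly from its block form [[A, 0], [a × A, A]]:

```python
import numpy as np

# Sketch of the covariance propagation formula (4.14),
#   Sigma_{0,2} = Ad_{g12}^{-1} Sigma_{0,1} Ad_{g12}^{-T} + Sigma_{1,2},
# iterated down a serial chain. The relative poses and covariances below
# are invented for illustration, not values from the chapter.

def skew(a):
    return np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])

def Ad(A, a):
    """6x6 adjoint of g = (a, A), block form [[A, 0], [a x A, A]]."""
    return np.block([[A, np.zeros((3, 3))], [skew(a) @ A, A]])

def propagate(poses, covs):
    """Fold Eq. (4.14) along a chain of segments (A_i, a_i) with covariances."""
    Sigma = covs[0]
    for (A, a), S in zip(poses[1:], covs[1:]):
        Adinv = np.linalg.inv(Ad(A, a))
        Sigma = Adinv @ Sigma @ Adinv.T + S
    return Sigma

# Two identical segments: a 90-degree turn about z plus a unit step in x.
A = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
a = np.array([1.0, 0.0, 0.0])
S = np.diag([0.01, 0.01, 0.02, 0.1, 0.1, 0.05])

Sigma02 = propagate([(A, a), (A, a)], [S, S])
print(np.allclose(Sigma02, Sigma02.T))   # True: still a valid covariance
```

When both relative poses are the identity, the formula reduces to simple addition of covariances, which is a quick sanity check on any implementation.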
2. Computing Bounds on the Entropy of the Unfolded Ensemble

2.1. End-to-end position and orientation distributions and the Cartesian conformational entropy of serial polymer chains

Consider a polymer consisting of a serial chain of $n+1$ essentially rigid monomer units numbered from 0 to $n$. Attach a frame of reference to the $i$th such unit. Let $g_i$ denote the rigid-body motion from the reference frame of the zeroth unit to that attached to the $i$th. Let $g_{k,k+1}$ denote the relative motion from body $k$ to body $k+1$. Then, $g_i = g_{0,i} = g_{0,1} \circ g_{1,2} \circ \cdots \circ g_{i-1,i}$ will be the cumulative motion from body 0 to body $i$. The relationship between these reference frames is described in Fig. 4.4. In a purely pairwise energy model, only the interactions between adjacent units are important. In this simplest model, the probability of the relative pose $g_{i,i+1} = g_i^{-1} \circ g_{i+1}$ taking a particular value is given by
\[
f_{i,i+1}(g_{i,i+1}) = \frac{1}{Z_{i,i+1}} \exp\left( -\beta V(g_{i,i+1}) \right).
\]
Figure 4.4 (left) Relative and absolute reference frames attached to the chain; (right) the relative positions of mass points within body i.
Then, the conformational distribution described in terms of rigid-body poses is
\[
f(g_1, g_2, \ldots, g_n) = \prod_{i=0}^{n-1} f_{i,i+1}\left( g_i^{-1} \circ g_{i+1} \right), \tag{4.15}
\]
where $g_0 = e$, the identity. This is related to the end-to-end position and orientation distribution
\[
f_{0,n}(g_n) = (f_{0,1} * f_{1,2} * \cdots * f_{n-1,n})(g_n), \tag{4.16}
\]
which is an $n$-fold convolution of the form in Section 1.3.2, by marginalization of Eq. (4.15) as
\[
f_{0,n}(g_n) = \int_G \cdots \int_G f(g_1, g_2, \ldots, g_n) \, dg_1 \cdots dg_{n-1}.
\]
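The spreading of probability density described by the $n$-fold convolution of Eq. (4.16) can also be seen by sampling rather than by integration: draw independent relative motions, compose them, and watch the end-to-end spread grow with chain length. The sketch below uses a planar chain of unit links with Gaussian bond-angle noise as a simplified stand-in for the full SE(3) model; all parameters are illustrative.

```python
import numpy as np

# Monte Carlo counterpart of the n-fold convolution in Eq. (4.16):
# composing independent random relative motions spreads out the
# end-to-end distribution as the chain grows. A planar chain of unit
# links with Gaussian bond-angle noise is a simplified stand-in for the
# SE(3) model; all parameters are illustrative.

rng = np.random.default_rng(1)

def end_positions(n_links, n_samples, angle_sd=0.3):
    """Sample end positions of a planar serial chain of unit-length links."""
    turns = rng.normal(0.0, angle_sd, size=(n_samples, n_links))
    theta = np.cumsum(turns, axis=1)          # absolute link directions
    return np.sum(np.cos(theta), axis=1), np.sum(np.sin(theta), axis=1)

for n in (5, 20, 80):
    x, y = end_positions(n, 20000)
    spread = np.sqrt(np.var(x) + np.var(y))   # grows with chain length
    print(n, round(spread, 2))
```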
This is illustrated in Fig. 4.5. Equation (4.15) represents a generalization of the classical polymer models in which only pairwise interactions are considered. If the frames of reference $g_i$ and $g_{i+1}$ are attached at the C$\alpha$ atoms of residues $i$ and $i+1$ in a polypeptide, then the function $f_{i,i+1}(g_{i,i+1})$ would be the six-dimensional generalization of a Ramachandran map that could include small bond angle bending, warping of the peptide plane, and even bond stretching. If one chooses not to model these effects, then the classical Ramachandran map
Figure 4.5 Kinematic covariance propagation: (left) in the absence of other constraints, distributions describing the allowable rigid-body motions between consecutive residues "add" by convolution, resulting in a spreading out of probability density in position and orientation, $f_{0,i}(g_i)$, as $i$ increases; (right) a zoomed-in view of the probabilistic relationship between reference frames $i$ and $i+1$ embodied by the functions $f_{i,i+1}(g_{i,i+1})$.
(Ramachandran et al., 1963) can be reflected by appropriately defining $f_{i,i+1}(g_{i,i+1})$, as has been done in Kim and Chirikjian (2005). This is consistent with the Flory isolated pair model (Flory, 1969), which has been challenged in recent years (Pappu et al., 2000). However, as an upper bound on conformational entropy, it may still be useful in some contexts. Note that since $g_n = (r, R)$ describes both the end-to-end position and orientation of the distal end of the chain relative to the proximal end, we can marginalize further to obtain quantities such as the end-to-end distance distribution or end-to-end orientational distribution. These quantities (or several of their moments) can be measured directly from a variety of experimental measurements. In order to convert these probabilities into a form that is directly useful for computing Cartesian conformational entropy, we must know the positions of all atoms in each of the rigid monomer units. Given $f(g_1, g_2, \ldots, g_n)$ in Eq. (4.15) and given the family of probability density functions $\{D_i(x_{i_1}, \ldots, x_{i_k})\}$, each of which describes the distribution of motions of the $i_k - i_1 + 1$ atoms within body $i$, it is possible to compute the full Cartesian conformational distribution as
\[
\rho(x_1, \ldots, x_N) = \int_G \cdots \int_G f(g_1, g_2, \ldots, g_n) \prod_{i=1}^{n} D_i\left( g_i^{-1} \cdot x_{i_1}, \ldots, g_i^{-1} \cdot x_{i_k} \right) dg_1 \cdots dg_n, \tag{4.17}
\]
where $N$ is the total number of atoms in the chain and $x_i = [x_{i_1}^T, \ldots, x_{i_k}^T]^T$ is the composite vector of Cartesian coordinates of all positions in the $i$th body. $D_i(x_{i_1}, \ldots, x_{i_k})$ is a probability density on $3(i_k - i_1 + 1)$-dimensional Euclidean space. In other words,
\[
\int_{\mathbb{R}^3} \cdots \int_{\mathbb{R}^3} D_i(x_{i_1}, \ldots, x_{i_k}) \, dx_{i_1} \cdots dx_{i_k} = 1.
\]
As an example, when the $i$th body is modeled as being perfectly rigid,
\[
D_i(x_{i_1}, \ldots, x_{i_k}) = \prod_{j=i_1}^{i_k} \delta\left( x_j - x_j^0 \right),
\]
where $x_j^0$ is the fixed position of atom $j$ as seen in the frame of reference $g_i$ affixed to rigid body $i$. In contrast, if body $i$ is an articulated side chain, averaging over all of its conformational states would result in a $D_i$ which is not a sum of Dirac delta functions. In some cases, it may be useful to compute the full pose entropy of the chain:
\[
S_g = -k_B \int_G \cdots \int_G f(g_1, g_2, \ldots, g_n) \log f(g_1, g_2, \ldots, g_n) \, dg_1 \cdots dg_n. \tag{4.18}
\]
2.2. Modeling excluded volume effects

The phantom polymer chain model, in which the effects of excluded volume are ignored, is clearly not a realistic model, but it can be used as a baseline onto which self-avoidance can be built. In a polypeptide, residue $i$ interacts substantially with residues $i+1, \ldots, i+4$, as well as with more sequentially distant residues. These interactions are not only responsible for the formation of secondary structures, but also substantially winnow down the available conformational space (Fitzkee and Rose, 2005). Clearly, this has implications for the entropy. More specifically, polymer models can be used to compute upper bounds on the conformational entropy in polypeptides. And these bounds can be made tighter by incorporating the effects of steric clash into modified versions of the conformational probability distributions. To begin, let us compute the density of body $i$. This can either be done directly by, for example, averaging body $i$ over all possible side-chain conformations, or by first computing each marginal of the density function $D_i$ as
\[
d_{i_j}(x_{i_j}) = \int_{x_{i_1} \in \mathbb{R}^3} \cdots \int_{x_{i_{j-1}} \in \mathbb{R}^3} \int_{x_{i_{j+1}} \in \mathbb{R}^3} \cdots \int_{x_{i_k} \in \mathbb{R}^3} D_i(x_{i_1}, \ldots, x_{i_k}) \, dx_{i_1} \cdots dx_{i_{j-1}} \, dx_{i_{j+1}} \cdots dx_{i_k}.
\]
Then, the average density of body $i$ (normalized to be a probability density) is
\[
d_i^0(x) = \frac{1}{i_k - i_1 + 1} \sum_{i_j = i_1}^{i_k} d_{i_j}(x).
\]
The overlap of bodies in the chain is illustrated in Fig. 4.6. Therefore, if body $i$ is moved by rigid-body motion $g_i$, and likewise for body $j$, we can compute an estimate of their overlap (averaged over all deformations of the bodies) as
\[
w_{ij}(g_i, g_j) = \int_{\mathbb{R}^3} d_i^0\left( g_i^{-1} \cdot x \right) d_j^0\left( g_j^{-1} \cdot x \right) dx.
\]
A general property of integration over all of three-dimensional space is that it is invariant under rigid-body motions. Therefore, if we make the change of variables $y = g_i^{-1} \cdot x$, then we find that
\[
w_{ij}(g_i, g_j) = w_{ij}\left( e, g_i^{-1} \circ g_j \right) = w_{ij}\left( g_j^{-1} \circ g_i, e \right).
\]
Clearly, when the two bodies do not overlap, $w_{ij} = 0$. Otherwise, it will have some positive value. One can imagine evaluating $w_{ij}$ as the argument of a "sigmoid function," which sharply ramps up from zero to one, where it
Figure 4.6 Conformations to be removed from the phantom chain ensemble: (left) local overlaps; (right) nonlocal overlaps.
then plateaus at higher values. The resulting $W_{ij}(g_i^{-1} \circ g_j) = 1 - \exp\left( -\left( w_{ij}(g_i, g_j) \right)^2 / 2\sigma^2 \right)$ (for some small value of $\sigma$) would effectively window out all values of the rigid-body motions $g_i$ and $g_j$ that contribute to nonphysical overlaps. Then, the original $f(g_1, g_2, \ldots, g_n)$ in Eq. (4.15) could be replaced with one of the form
\[
f_{ex}(g_1, g_2, \ldots, g_n) = C f(g_1, g_2, \ldots, g_n) \prod_{i<j} \left[ 1 - W_{ij}\left( g_i^{-1} \circ g_j \right) \right], \tag{4.19}
\]
where $C$ is the normalization required to make $f_{ex}$ a pdf. Note that the product in this expression is not only over sequentially local pairs of bodies, but rather over all pairs, where the condition $i < j$ simply avoids double counting. In this way, a phantom polymer model that generates $f(g_1, g_2, \ldots, g_n)$ can be viewed as the starting point for a more realistic model that includes steric constraints.
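To make the windowing construction concrete, the sketch below models each averaged body density $d_i^0$ as an isotropic 3D Gaussian, for which the overlap integral $w_{ij}$ has a closed form (a Gaussian in the displacement between body centers), and applies the window $W_{ij} = 1 - \exp(-w_{ij}^2/2s^2)$. The densities and constants are invented for illustration and are not from the chapter.

```python
import numpy as np

# Sketch of the steric window of Eq. (4.19). Each averaged body density
# d_i^0 is modeled as an isotropic 3D Gaussian with variance `var`, so the
# overlap integral w_ij reduces to a Gaussian in the displacement between
# body centers; W_ij = 1 - exp(-w_ij^2 / (2 s^2)) then windows out
# overlapping placements. All densities and constants are invented.

def overlap(mu_i, mu_j, var=0.25):
    """w_ij for two isotropic Gaussian body densities centered at mu_i, mu_j."""
    d2 = np.sum((np.asarray(mu_i, float) - np.asarray(mu_j, float))**2)
    return np.exp(-d2 / (4.0 * var)) / (4.0 * np.pi * var)**1.5

def window(mu_i, mu_j, s=0.01):
    return 1.0 - np.exp(-overlap(mu_i, mu_j)**2 / (2.0 * s**2))

print(round(window([0, 0, 0], [0.1, 0, 0]), 3))   # ~1: overlapping, windowed out
print(round(window([0, 0, 0], [5.0, 0, 0]), 3))   # ~0: well separated, kept
```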
2.3. Bounding Cartesian conformational entropy

Practically speaking, computing such high-dimensional integrals as in Eq. (4.17) or (4.19) can pose a computational problem, except when simple closed-form expressions such as Gaussians are used. If we seek an upper bound on Cartesian conformational entropy, marginals can be computed and information-theoretic bounds can be employed. Performing such marginalization, one finds
\[
\rho_i(x_i) = \int_G f_{0,i}(g_i) D_i\left( g_i^{-1} \cdot x_{i_1}, \ldots, g_i^{-1} \cdot x_{i_k} \right) dg_i. \tag{4.20}
\]
In the case when one representative point is chosen per residue (e.g., the C$\alpha$ atom, which is where the reference frame for the residue is usually attached), we have $k = 1$. Then $i = i_1$, and since $x_i^0 = 0$ due to the way the reference frame is attached, we can write
\[
\rho_i(x_i) = \int_G f_{0,i}(g) \, \delta\left( g^{-1} \cdot x_i \right) dg.
\]
If $g = (r, R)$, then $\delta(g^{-1} \cdot x_i) = \delta(R^T(x_i - r)) = \delta(x_i - r)$, and so we can get the positional distribution of the $i$th C$\alpha$ atom by marginalizing the full pose distribution over orientations as
\[
\rho_i(x_i) = \int_{SO(3)} \int_{\mathbb{R}^3} f_{0,i}(r, R) \, \delta(x_i - r) \, dr \, dR = \int_{SO(3)} f_{0,i}(x_i, R) \, dR. \tag{4.21}
\]
The conformational entropy of the backbone represented by C$\alpha$ atoms is then bounded from below by the entropy of individual marginals (with the tightest lower bound resulting from the maximum of these). The loop entropy will be bounded from above by the sum of entropies from all of the marginals. Therefore,
\[
\max_i S_i \leq S_x \leq \sum_{i=1}^{n} S_i, \quad \text{where} \quad S_i = -k_B \int_{\mathbb{R}^3} \rho_i(x_i) \log \rho_i(x_i) \, dx_i \tag{4.22}
\]
and $x = \left( x_1^T, x_2^T, \ldots, x_n^T \right)^T$.
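The bounds in Eq. (4.22) can be checked in closed form for a Gaussian, for which the differential entropy is $\frac{1}{2}\log\{(2\pi e)^n |\Sigma|\}$ (with $k_B = 1$, in nats). The two-dimensional example below uses an arbitrary correlated covariance matrix:

```python
import numpy as np

# Check of the marginal-entropy bounds of Eq. (4.22) for a correlated 2D
# Gaussian, using the closed form S = 0.5*log((2 pi e)^n det(Sigma)) for
# Gaussian differential entropy (kB = 1, nats). The covariance matrix is
# an arbitrary illustrative choice.

Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])

def gauss_entropy(S):
    S = np.atleast_2d(S)
    n = S.shape[0]
    return 0.5 * np.log((2.0 * np.pi * np.e)**n * np.linalg.det(S))

S_joint = gauss_entropy(Sigma)     # entropy of the joint distribution
S1 = gauss_entropy(Sigma[0, 0])    # marginal entropies
S2 = gauss_entropy(Sigma[1, 1])

print(max(S1, S2) <= S_joint <= S1 + S2)   # True
```

The right-hand inequality is tight only when the marginals are independent; the correlation (0.8 here) is exactly what makes the sum a strict upper bound.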
3. Approximating Entropy of the Loops in the Folded Ensemble

The native ensemble of a protein is characterized by a relatively high degree of order. However, the native form is not completely rigid. In particular, loop/coil regions connecting secondary structures can exhibit large motions. Here, we model the ends of these loops as being fixed at specific positions and orientations, as illustrated in Fig. 4.7. Bounds on the contribution of loop motions to overall entropy are discussed here. If $f(g_1, \ldots, g_n)$ is the conformational distribution function describing the positions and orientations of all bodies in the system with respect to the proximal end of the chain, then if we fix the distal end at a specific pose, $g_{end}$, the resulting distribution will be the conditional density
Figure 4.7 Using density information to determine probabilities of conformations that obey end constraints.
$$ f^{\mathrm{fix}}(g_1, g_2, \ldots, g_{n-1}; g_{\mathrm{end}}) = f(g_1, \ldots, g_{n-1} \mid g_n = g_{\mathrm{end}}) = f(g_1, \ldots, g_{n-1}, g_{\mathrm{end}})/f_{0,n}(g_{\mathrm{end}}). \qquad (4.23) $$
The entropy of this distribution can, in some cases, be computed directly, or each of the marginals can be computed as
$$ f^{\mathrm{fix}}_{0,i}(g_i; g_{\mathrm{end}}) = f_{0,i}(g_i)\,f_{i,n}\!\left(g_i^{-1} \circ g_{\mathrm{end}}\right)/f_{0,n}(g_{\mathrm{end}}). \qquad (4.24) $$
The reason for this is that in the definition of $f(g_1, \ldots, g_n)$, the variable $g_i$ appears in only the two multiplied terms $f_{i-1,i}(g_{i-1}^{-1} \circ g_i)\,f_{i,i+1}(g_i^{-1} \circ g_{i+1})$. Marginalizing over $g_1$ through $g_{i-1}$ results in $f_{0,i}(g_i)$. If $g_i$ were the identity element, then marginalizing over $g_{i+1}$ through $g_{n-1}$ would yield $f_{i,n}(g_{\mathrm{end}})$. However, since in general $g_i \ne e$, this result is shifted by $g_i$ to yield $f_{i,n}(g_i^{-1} \circ g_{\mathrm{end}})$. Division by $f_{0,n}(g_{\mathrm{end}})$ is the normalization required to make the result a pdf (since integration of the numerator in Eq. (4.24) over $g_i$ is a convolution). This denominator is carried along from Eq. (4.23), which is a statement of Bayes' rule. Intuitively, the entropy of a chain with fixed ends must be smaller than that of a chain with freely moving ends. This can be quantified using Eqs. (4.23) and (4.24), as will be shown in the examples in the next section.
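The structure of Eq. (4.24) can be illustrated with a one-dimensional Gaussian-chain analogue (a sketch under simplifying assumptions, not the chapter's SE(3) computation): the density of an interior point given a fixed distal end is the product of a "reach from the base" propagator and a "reach to the end" propagator, normalized by the end-to-end density.

```python
import numpy as np

def W(x, k, l2=1.0):
    """Gaussian propagator for k links with mean-square link length l2."""
    return np.exp(-x**2 / (2.0 * k * l2)) / np.sqrt(2.0 * np.pi * k * l2)

n, i, x_end = 10, 4, 2.0          # illustrative chain length, interior index, end constraint
x = np.linspace(-15.0, 15.0, 4001)

# 1-D analogue of Eq. (4.24): f_fix(x_i) = f_{0,i}(x_i) f_{i,n}(x_end - x_i) / f_{0,n}(x_end)
f_fix = W(x, i) * W(x_end - x, n - i) / W(x_end, n)

# Bayes' rule does the normalization: the Gaussian convolution identity gives
# integral of W_i(x) * W_{n-i}(x_end - x) over x equal to W_n(x_end).
mass = np.sum(f_fix) * (x[1] - x[0])
assert abs(mass - 1.0) < 1e-6
```

The resulting conditional density peaks at the interpolated position $(i/n)\,x_{\mathrm{end}}$, exactly the "shifted by $g_i$" behavior described above, and integrates to one without any extra normalization.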
4. Examples

In this section, examples are used to illustrate the formulation presented earlier in this work. In both of these examples, a piece of flexible loop/coil connects relatively rigid structures. In the first example, the loop is considered to be a long phantom chain, whereas in the second, it is considered to be a semiflexible polymer. The reduction in conformational entropy associated with constraining the ends in both cases is examined.
4.1. Model 1: Long loops modeled as Gaussian chains

Perhaps the most common model for the end-to-end vector distribution in polymer theory is the Gaussian distribution:
$$ f(g) = W(\mathbf{r}) = \left(\frac{3}{2\pi\langle r^2\rangle}\right)^{3/2} \exp\!\left(-\frac{3r^2}{2\langle r^2\rangle}\right) = \frac{1}{(2\pi)^{3/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}\mathbf{r}^T \Sigma^{-1} \mathbf{r}\right), \qquad (4.25) $$
where $g = (\mathbf{r}, R) \in SE(3)$, the chain is so flexible that the orientational part of the distribution is constant, and $\Sigma = (\langle r^2\rangle/3)I$, where $I$ is the $3 \times 3$ identity matrix. This distribution is spherically symmetric (and hence depends only on $r = |\mathbf{r}|$). It is normalized so that it is a pdf,
$$ \int_{\mathbb{R}^3} W(\mathbf{r})\,dV = 4\pi \int_0^{\infty} W(r)\,r^2\,dr = 1, $$
satisfying
$$ \int_{\mathbb{R}^3} W(\mathbf{r})\,|\mathbf{r}|^2\,dV = 4\pi \int_0^{\infty} W(r)\,r^4\,dr = \langle r^2\rangle. $$
The pdf for a freely jointed chain with $n$ links, each of length $l$, can be approximated as a Gaussian random walk with $\langle r^2\rangle = nl^2$. If we denote by $W_n(\mathbf{r})$ the function for $n$ links, then it is clear that in this simple model $W_{n_1} * W_{n_2} = W_{n_1+n_2}$. Therefore, in this simple model, the conformational entropy is bounded as
$$ \frac{3}{2}\log\!\left(2\pi e\, n l^2/3\right) \;\le\; S_{\mathbf{r}} \;\le\; \frac{3}{2}\sum_{k=1}^{n} \log\!\left(2\pi e\, k l^2/3\right) $$
using Eq. (4.22), where $\mathbf{r}$ takes the place of $\mathbf{x}$. The conformational entropy of the phantom chain that gives rise to this distribution is bounded from below by the entropy of the pdf of the location of the terminal end, and from above by the sum of the entropies of the positions of each link from base to end, since these are marginals of the total conformational distribution $f(g_1, \ldots, g_n)$, which for this case is the function $W(\mathbf{r}_1, \ldots, \mathbf{r}_n) = \prod_{i=0}^{n-1} W_1(\mathbf{r}_{i+1} - \mathbf{r}_i)$ with $\mathbf{r}_0 = \mathbf{0}$, where $W_1$ is the effective one-bond Gaussian distribution with covariance matrix $\Sigma_1 = (l^2/3)I$. Therefore, it is possible to compute the Cartesian conformational entropy in this model exactly in closed form as
$$ S_{\mathbf{r}} = -\int_{\mathbf{r}_1}\!\cdots\!\int_{\mathbf{r}_n} W(\mathbf{r}_1, \ldots, \mathbf{r}_n)\,\log W(\mathbf{r}_1, \ldots, \mathbf{r}_n)\,d\mathbf{r}_1 \cdots d\mathbf{r}_n. $$
Since the chain is assumed to be uniform and the product of Gaussians is a Gaussian, we can use the closed-form formula for the entropy of a Gaussian in terms of its covariance matrix, together with the fact that in this example
$$ \Sigma^{-1} = \frac{3}{l^2} \begin{pmatrix} 2I & -I & 0 & \cdots & \cdots & 0 \\ -I & 2I & -I & \ddots & & \vdots \\ 0 & -I & 2I & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & -I & 0 \\ \vdots & & \ddots & -I & 2I & -I \\ 0 & \cdots & \cdots & 0 & -I & I \end{pmatrix}. \qquad (4.26) $$
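As a concrete sanity check (a sketch, not part of the chapter; the values of $n$ and $l$ are arbitrary illustrative choices), one can build the per-coordinate version of Eq. (4.26), evaluate the exact entropy from its log-determinant, and confirm that it falls between the bounds stated above:

```python
import numpy as np

n, l = 5, 1.0  # number of links and link length (illustrative values)

# Per-coordinate inverse covariance of Eq. (4.26): (3/l^2) * tridiag(-1, 2, -1),
# with the last diagonal entry equal to 1 instead of 2.
T = np.zeros((n, n))
np.fill_diagonal(T, 2.0)
T[-1, -1] = 1.0
idx = np.arange(n - 1)
T[idx, idx + 1] = T[idx + 1, idx] = -1.0
Sigma_inv = (3.0 / l**2) * T

# Exact Cartesian entropy: three identical coordinate blocks, so
# S_r = (3/2) * (n*log(2*pi*e) - log det(per-coordinate Sigma_inv)).
sign, logdet_inv = np.linalg.slogdet(Sigma_inv)
S_exact = 1.5 * (n * np.log(2 * np.pi * np.e) - logdet_inv)

# Bounds from Eq. (4.22) specialized to the Gaussian chain.
S_lower = 1.5 * np.log(2 * np.pi * np.e * n * l**2 / 3)
S_upper = sum(1.5 * np.log(2 * np.pi * np.e * k * l**2 / 3) for k in range(1, n + 1))

assert S_lower <= S_exact <= S_upper
```

Incidentally, the tridiagonal matrix above has determinant one, so the exact entropy reduces to $(3n/2)\log(2\pi e\,l^2/3)$, i.e., $n$ times the one-bond entropy, as expected for independent bond vectors.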
In principle, the entropy of a Gaussian chain with end position constraints can be computed using Eqs. (4.23) and (4.24). In practice, some details need attention; these are discussed in Section 4.3. Whereas here a Gaussian chain in which orientations diffuse rapidly was considered, the opposite extreme of a stiff chain is considered in the following section.
4.2. Model 2: Short loops modeled as semiflexible polymers

Suppose that we have a semiflexible loop, that is, one that has local resistance to bending and twisting that reflects sequentially local steric constraints. Then each $g_i$ will deviate only a relatively small amount from a constant reference pose $h_i$, and so we write $g_i = h_i \exp \hat{\mathbf{x}}_i$ where $\|\mathbf{x}_i\| \ll 1$. This sort of assumption is consistent with findings in the literature. For example, Zhou (2001) validated the use of semiflexible polymer models to describe protein loop motions. The relative motion between adjacent reference frames will be
$$ g_i^{-1} \circ g_{i+1} = \exp(-\hat{\mathbf{x}}_i) \circ h_i^{-1} \circ h_{i+1} \circ \exp \hat{\mathbf{x}}_{i+1}. $$
If the probability density $f_{i,i+1}$ is a Gaussian with mean at $h_i^{-1} \circ h_{i+1}$, then it will be of the form
$$ f_{i,i+1}(g) = F_{\Sigma_{i,i+1}}\!\left(\left(h_i^{-1} \circ h_{i+1}\right)^{-1} \circ g\right), \quad \text{where} \quad F_{\Sigma_{i,i+1}}(\exp \hat{\mathbf{x}}) = \frac{1}{(2\pi)^{3}\,|\Sigma_{i,i+1}|^{1/2}} \exp\!\left(-\frac{1}{2}\mathbf{x}^T \Sigma_{i,i+1}^{-1} \mathbf{x}\right). $$
Therefore,
$$ f_{i,i+1}\!\left(g_i^{-1} \circ g_{i+1}\right) = F_{\Sigma_{i,i+1}}\!\left(\left(h_i^{-1} \circ h_{i+1}\right)^{-1} \circ \exp(-\hat{\mathbf{x}}_i) \circ h_i^{-1} \circ h_{i+1} \circ \exp \hat{\mathbf{x}}_{i+1}\right). $$
For small motions between adjacent bodies, the approximation
$$ \log^{\vee}\!\left[\left(h_i^{-1} \circ h_{i+1}\right)^{-1} \circ \exp(-\hat{\mathbf{x}}_i) \circ h_i^{-1} \circ h_{i+1} \circ \exp \hat{\mathbf{x}}_{i+1}\right] \;\approx\; \mathbf{x}_{i+1} - \mathrm{Ad}^{-1}\!\left(h_i^{-1} \circ h_{i+1}\right)\mathbf{x}_i $$
has been proved to be accurate (Wang and Chirikjian, 2008). If, as shorthand, we define $A_{i,i+1} = \mathrm{Ad}\!\left(h_i^{-1} \circ h_{i+1}\right)$, then
$$ f_{i,i+1}\!\left(g_i^{-1} \circ g_{i+1}\right) = \frac{1}{(2\pi)^{3}\,|\Sigma_{i,i+1}|^{1/2}} \exp\!\left(-\frac{1}{2} \begin{pmatrix}\mathbf{x}_i \\ \mathbf{x}_{i+1}\end{pmatrix}^{\!T} \begin{pmatrix} A_{i,i+1}^{-T}\Sigma_{i,i+1}^{-1}A_{i,i+1}^{-1} & -A_{i,i+1}^{-T}\Sigma_{i,i+1}^{-1} \\ -\Sigma_{i,i+1}^{-1}A_{i,i+1}^{-1} & \Sigma_{i,i+1}^{-1} \end{pmatrix} \begin{pmatrix}\mathbf{x}_i \\ \mathbf{x}_{i+1}\end{pmatrix}\right). \qquad (4.27) $$
This, together with the product in Eq. (4.15), leads $f(g_1, g_2, \ldots, g_n)$ to be a Gaussian distribution in the variable $\mathbf{x} = \left[\mathbf{x}_1^T, \ldots, \mathbf{x}_n^T\right]^T$ with an inverse covariance of the block-tridiagonal form
$$ \Sigma^{-1} = \begin{pmatrix} \Sigma_{0,1}^{-1} + \Sigma_{1,2}'^{-1} & -A_{1,2}^{-T}\Sigma_{1,2}^{-1} & 0 & \cdots & 0 \\ -\Sigma_{1,2}^{-1}A_{1,2}^{-1} & \Sigma_{1,2}^{-1} + \Sigma_{2,3}'^{-1} & -A_{2,3}^{-T}\Sigma_{2,3}^{-1} & \ddots & \vdots \\ 0 & -\Sigma_{2,3}^{-1}A_{2,3}^{-1} & \ddots & \ddots & 0 \\ \vdots & \ddots & \ddots & \Sigma_{n-2,n-1}^{-1} + \Sigma_{n-1,n}'^{-1} & -A_{n-1,n}^{-T}\Sigma_{n-1,n}^{-1} \\ 0 & \cdots & 0 & -\Sigma_{n-1,n}^{-1}A_{n-1,n}^{-1} & \Sigma_{n-1,n}^{-1} \end{pmatrix}, \qquad (4.28) $$
where $\Sigma_{i,i+1}' = A_{i,i+1}\,\Sigma_{i,i+1}\,A_{i,i+1}^T$. The entropy $S_g$ for the case with free ends is then given by formula (4.18), which is relatively efficient to compute due to the block-tridiagonal form of $\Sigma^{-1}$. This will be discussed further in Section 4.3. Obtaining the entropy for the case when both ends are fixed in position and orientation is also possible within this model. Using the fact that the adjoint is a homomorphism, that is, $\mathrm{Ad}(g_1 \circ g_2) = \mathrm{Ad}(g_1)\mathrm{Ad}(g_2)$ and $\mathrm{Ad}(g^{-1}) = \mathrm{Ad}^{-1}(g)$, this generalizes to the concatenation of $n$ reference frames that vary around values $h_i$ as
$$ \Sigma_n' = \sum_{k=0}^{n} \mathrm{Ad}^{-1}\!\left(h_{k,n}\right)\,\Sigma_k\,\mathrm{Ad}^{-T}\!\left(h_{k,n}\right). \qquad (4.29) $$
To compute $S_g$ for the case when the distal end of the chain is fixed at $g_n = g_{\mathrm{end}}$, we would use Eq. (4.23) with the covariance of $f_{0,n}(g)$ being given by Eq. (4.29). Conditioning of Gaussians by Gaussians yields Gaussians, the
entropy of which can be computed in closed form in principle. However, there are some subtle issues that need to be addressed, as discussed below.
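Before turning to those details, note that the block-tridiagonal matrix of Eq. (4.28) is straightforward to assemble numerically. The sketch below is an illustration, not library code; it assumes the per-joint covariances $\Sigma_{i,i+1}$ and adjoints $A_{i,i+1}$ have already been computed, and fills in the diagonal and coupling blocks:

```python
import numpy as np

def assemble_inverse_covariance(Sigmas, As):
    """Assemble the block-tridiagonal inverse covariance of Eq. (4.28).

    Sigmas: list of n (d x d) covariances Sigma_{i,i+1}, i = 0..n-1.
    As:     list of n (d x d) adjoints  A_{i,i+1} (As[0] is unused here,
            because x_0 = 0 fixes the proximal end).
    """
    n = len(Sigmas)
    d = Sigmas[0].shape[0]
    M = np.zeros((n * d, n * d))
    for i in range(n):
        blk = slice(i * d, (i + 1) * d)
        M[blk, blk] += np.linalg.inv(Sigmas[i])
        if i + 1 < n:
            # primed covariance Sigma' = A Sigma A^T contributes A^{-T} S^{-1} A^{-1}
            Ainv = np.linalg.inv(As[i + 1])
            Snext_inv = np.linalg.inv(Sigmas[i + 1])
            M[blk, blk] += Ainv.T @ Snext_inv @ Ainv
            nxt = slice((i + 1) * d, (i + 2) * d)
            coupling = -Ainv.T @ Snext_inv
            M[blk, nxt] = coupling
            M[nxt, blk] = coupling.T
    return M

# Toy check: identity covariances and adjoints reproduce the classic
# tridiagonal (2I, -I) pattern with a lone I in the last diagonal block.
M = assemble_inverse_covariance([np.eye(6)] * 4, [np.eye(6)] * 4)
assert np.allclose(M, M.T)
```

With identity adjoints the result collapses to the Gaussian-chain pattern of Eq. (4.26), which is a useful consistency check on any implementation.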
4.3. From covariance matrices to entropy

It is one thing to have bounds such as Eq. (4.22); it is another to have a closed-form expression for the actual quantity of interest. Here, Eqs. (4.26) and (4.28) are used to compute entropy. This follows from the fact that for a $d(n)$-dimensional Gaussian distribution with covariance $\Sigma$, the entropy is given as (Chirikjian, 2009; Shannon, 1948)
$$ S = \log\!\left\{(2\pi e)^{d(n)/2}\,|\Sigma|^{1/2}\right\}. \qquad (4.30) $$
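To make Eq. (4.30) concrete, the closed form can be checked against a Monte Carlo estimate of $-E[\log f]$. This is a sketch with an arbitrary random covariance, not data from the chapter:

```python
import numpy as np

rng = np.random.default_rng(2)

# Covariance of a small toy "chain" (d(n) = 4 here).
C = rng.standard_normal((4, 4))
Sigma = C @ C.T + np.eye(4)

# Closed form, Eq. (4.30): S = log{ (2*pi*e)^{d/2} |Sigma|^{1/2} }
d = Sigma.shape[0]
S_closed = 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])

# Monte Carlo estimate of -E[log f(z)] for comparison.
z = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
Sinv = np.linalg.inv(Sigma)
log_f = (-0.5 * np.einsum('ij,jk,ik->i', z, Sinv, z)
         - 0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]))
S_mc = -log_f.mean()
assert abs(S_mc - S_closed) < 0.05
```

Using `slogdet` rather than `det` keeps the computation numerically stable when the dimension $d(n)$ grows, since the determinant itself can overflow or underflow long before the log-determinant does.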
Here, we use the notation $d(n)$ to denote the dimension of the covariance matrix, which is $d(n) = 3n$ for the positional Gaussian and $d(n) = 6n$ for the semiflexible case. In other words, we can write $d(n) = d_0 n$. We will consider the case when the entropy change is due to fixing the ends. Consider a chain (either Gaussian or semiflexible), and let $x_1, \ldots, x_n$ denote the variables describing the kinematic state of the $n$ segments (i.e., $x_i = \mathbf{r}_i \in \mathbb{R}^3$ for the Gaussian chain and $x_i = \mathbf{x}_i \in \mathbb{R}^6$ for the semiflexible chain). Let us denote $\mathbf{x} = \left[x_1^T, \ldots, x_{n-1}^T\right]^T$, $\mathbf{y} = x_n \in \mathbb{R}^{d_0}$, and $\mathbf{z} = \left[\mathbf{x}^T, \mathbf{y}^T\right]^T$. Then $f(x_1, \ldots, x_n)$ can be written as
$$ f(\mathbf{z}) = \frac{1}{(2\pi)^{d_0 n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}\mathbf{z}^T \Sigma^{-1} \mathbf{z}\right). $$
The entropy of this distribution is computed simply as Eq. (4.30). However, the end-constrained case is somewhat more involved. The conditional probability density describing the ensemble of end-constrained conformations is of the form $f(\mathbf{x} \mid \mathbf{y}) = f(\mathbf{x}, \mathbf{y})/f(\mathbf{y})$, where $f(\mathbf{x}, \mathbf{y}) = f(\mathbf{z})$ (with $\mathbf{y}$ held fixed rather than being a variable) and the marginal distribution $f(\mathbf{y})$ is given by
$$ f(\mathbf{y}) = \frac{1}{(2\pi)^{d_0/2}\,|\Sigma_{yy}|^{1/2}} \exp\!\left(-\frac{1}{2}\mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right), \quad \text{where} \quad \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}. $$
The conditional distribution will then be
$$ f(\mathbf{x} \mid \mathbf{y}) = \frac{1}{(2\pi)^{d_0(n-1)/2}\,|L|^{1/2}} \exp\!\left(-\frac{1}{2}\left[\mathbf{x} - \mathbf{x}_0\right]^T L^{-1} \left[\mathbf{x} - \mathbf{x}_0\right]\right), \qquad (4.31) $$
where
$$ L = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} = \left[\left(\Sigma^{-1}\right)_{xx}\right]^{-1} \quad \text{and} \quad \mathbf{x}_0 = \Sigma_{xy}\Sigma_{yy}^{-1}\mathbf{y}. $$
In principle, we now have everything we need to compute entropy differences. However, in practice, there is an implicit assumption about polymer distribution functions that must be addressed. Namely, even though the chain length is $L$, and hence the distal end cannot reach outside a ball of radius $L$ centered at the proximal end, for the sake of convenience we will accept distributions with infinitely long tails. Another way to say this is that as long as the pdf decays rapidly enough, all integrals over a ball of radius $L$ centered at the origin can be replaced by integrals over an infinitely large ball, and things will work out fine. Such calculations include computing probabilities and entropies from probability densities. In other words, we have simplifications such as
$$ \int_{-L}^{L} e^{-x^2}\,dx \;\approx\; \int_{-\infty}^{\infty} e^{-x^2}\,dx. $$
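The Schur-complement formulas around Eq. (4.31) can be verified numerically before worrying about the finite-support issue. The sketch below (arbitrary random covariance and a hypothetical end constraint $\mathbf{y}$, not chapter data) checks that $L = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ indeed equals $[(\Sigma^{-1})_{xx}]^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random positive-definite joint covariance, partitioned as in the text:
# x is the first dx coordinates, y the last dy.
dx, dy = 4, 2
A = rng.standard_normal((dx + dy, dx + dy))
Sigma = A @ A.T + (dx + dy) * np.eye(dx + dy)

Sxx, Sxy = Sigma[:dx, :dx], Sigma[:dx, dx:]
Syx, Syy = Sigma[dx:, :dx], Sigma[dx:, dx:]

# Conditional covariance L = Sxx - Sxy Syy^{-1} Syx (the Schur complement),
# which equals [(Sigma^{-1})_xx]^{-1} as stated after Eq. (4.31).
L = Sxx - Sxy @ np.linalg.solve(Syy, Syx)
L_alt = np.linalg.inv(np.linalg.inv(Sigma)[:dx, :dx])
assert np.allclose(L, L_alt)

# Conditional mean shift x0 = Sxy Syy^{-1} y for a fixed end constraint y.
y = np.ones(dy)
x0 = Sxy @ np.linalg.solve(Syy, y)
```

Note that the conditional covariance $L$ does not depend on the constraint value $\mathbf{y}$; only the shift $\mathbf{x}_0$ does, which is exactly why the shift becomes the problematic quantity in the finite-support discussion that follows.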
While such infinite-range approximations are perfectly reasonable when the Gaussians are centered at the origin, this is no longer the case when we shift them by significant amounts. In other words, even though the value of an integral over an infinite range is invariant under shifts, this is not true of integration over finite intervals:
$$ \int_{-L}^{L} e^{-(x-L/2)^2}\,dx \;\ne\; \int_{-L}^{L} e^{-x^2}\,dx. $$
This is important in the context of the current discussion because the conditional pdf in Eq. (4.31) is shifted from the origin by a vector $\mathbf{x}_0$. In other words, if we fix the distal end of the chain at an arbitrary $\mathbf{y}$, then this distribution of interest in $d_0(n-1)$-dimensional space will not be
centered at the origin, and the infinite integral used to approximate integration over a ball of radius $L = nl$ centered at the origin, which resulted in the normalization constant $\left[(2\pi)^{d_0(n-1)/2}|L|^{1/2}\right]^{-1}$, will no longer be a valid approximation. The computation of this constant, and of the entropy, then becomes a problem when $\|\mathbf{y}\|$ (and hence $\|\mathbf{x}_0\|$) is not very small relative to the total chain length $L = nl$. However, when it is very small, the integral can still be approximated as being over all of $\mathbb{R}^{d_0(n-1)}$, because the overwhelming majority of the mass under the pdf will still be contained in the finite ball of radius $L$.

4.3.1. Entropy for the Gaussian chain
For the Gaussian chain, the end constraint $\mathbf{r}_n = \mathbf{0}$ effectively means that the chain forms a closed loop, because the vectors $\{\mathbf{r}_i\}$ in this case are absolute positions of the $i$th residue with respect to the proximal end. The entropy difference between two ensembles described by Gaussians with dimensions $d(n)$ and $d(n-1)$ in the unconstrained and end-constrained states, respectively, will be
$$ \Delta S = S_2 - S_1 = \log\!\left\{(2\pi e)^{d(n)/2}|\Sigma_2|^{1/2}\right\} - \log\!\left\{(2\pi e)^{d(n-1)/2}|\Sigma_1|^{1/2}\right\} = \log\!\left[(2\pi e)^{d_0/2}\,\frac{|\Sigma_2|^{1/2}}{|\Sigma_1|^{1/2}}\right] = \frac{1}{2}\log\!\left[(2\pi e)^{d_0}\,\frac{|\Sigma_2|}{|\Sigma_1|}\right]. \qquad (4.32) $$
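Eq. (4.32) requires only determinants, and for block-tridiagonal matrices such as those in Eqs. (4.26) and (4.28) these can be obtained in O(n) block operations via a Schur-complement (block-LU) recursion. A sketch (with a random SPD test matrix, not chapter data):

```python
import numpy as np

def blocktridiag_logdet(diag, off):
    """log|M| for a symmetric block-tridiagonal M in O(n) block steps.

    diag: list of n (d x d) diagonal blocks; off: list of n-1 sub-diagonal blocks.
    Uses det M = prod_i det D_i, where D_i are the block-LU pivots.
    """
    D = diag[0].copy()
    total = np.linalg.slogdet(D)[1]
    for A, B in zip(diag[1:], off):
        D = A - B @ np.linalg.solve(D, B.T)  # Schur-complement pivot update
        total += np.linalg.slogdet(D)[1]
    return total

# Compare against a dense O(n^3) computation on a random SPD test matrix.
rng = np.random.default_rng(1)
n, d = 6, 3
off = [rng.standard_normal((d, d)) for _ in range(n - 1)]
diag = []
for _ in range(n):
    C = rng.standard_normal((d, d))
    diag.append(10.0 * np.eye(d) + C @ C.T)  # diagonally dominant, hence SPD

M = np.zeros((n * d, n * d))
for i in range(n):
    M[i * d:(i + 1) * d, i * d:(i + 1) * d] = diag[i]
for i in range(n - 1):
    M[(i + 1) * d:(i + 2) * d, i * d:(i + 1) * d] = off[i]
    M[i * d:(i + 1) * d, (i + 1) * d:(i + 2) * d] = off[i].T
assert np.isclose(blocktridiag_logdet(diag, off), np.linalg.slogdet(M)[1])
```

Applied to the inverse covariances of Eqs. (4.26) and (4.28), the two log-determinants feed directly into the ratio $|\Sigma_2|/|\Sigma_1|$ of Eq. (4.32) without ever forming a matrix inverse.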
The final equality in Eq. (4.32) means that there is no need to invert the matrices in Eqs. (4.26) and (4.28) when computing entropy differences between the ensembles with free and fixed ends. This is useful because, in practice, one is usually interested only in entropy differences, and the determinants of block-tridiagonal matrices can be computed very efficiently (in O(n) computations for a chain of length $n$), whereas computing their inverses followed by taking the determinant can be an O(n^3) operation.

4.3.2. Entropy of a semiflexible chain
The entropy being considered is that defined in Eq. (4.18). For the semiflexible chain, the set $\{\mathbf{x}_i\}$ describes the small relative rigid-body displacements of the $i$th residue with respect to a referential configuration. Therefore, in this case, the same tools developed in this work can be used to describe the entropy differences between the free and end-constrained cases for a somewhat different scenario than the Gaussian chain model. Namely, we can compute multiple reference conformations and consider small deviations around each. The reduction in entropy due to constraining both ends of the chain is then due to eliminating motions around multiple reference conformations (each with free ends) and only allowing motions
around the one reference conformation that satisfies the required end conditions. This discussion is quantified below. Imagine sampling the relative poses between adjacent amino acids in a loop at their $K$ most populated isolated peaks. For example, $K$ might be equal to 3 if we sample at the centers of the $\alpha$, $\beta$, and O regions of the $\phi$–$\psi$ plane. If each of these peaks is isolated, and the distributions around them are modeled as $SE(3)$ Gaussian distributions with small covariances and essentially nonoverlapping tails, then the entropy associated with each of these conformational ensembles will be given by Eq. (4.30), where $\Sigma$ is defined by Eq. (4.28). If the loop has $n$ residues, then each reference conformation for $i = 1, \ldots, K^n$ will have its own entropy $S_i$ defined by these equations. If the relative weights of each of these reference conformations are given by $w_1, \ldots, w_{K^n}$, and if each conformation is disjoint and there is minimal overlap between the associated conformational distributions around each, then the total entropy in the case of free ends can be approximated as
$$ S_{\mathrm{free}} \;\approx\; -\sum_{i=1}^{K^n} w_i \log w_i \;+\; \sum_{i=1}^{K^n} w_i S_i. $$
The entropy for the case when the distal end is fixed can be approximated by adding contributions from the subset of baseline conformations that approximately satisfy the end constraints.
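Under the stated disjoint-support assumption, this mixture approximation is a one-liner. A sketch with hypothetical weights and per-conformation entropies (not values from the chapter):

```python
import numpy as np

def free_end_entropy(weights, component_entropies):
    """Entropy of a well-separated mixture: -sum(w log w) + sum(w * S_i).

    Valid only when the component distributions have essentially
    nonoverlapping support, as assumed in the text.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    S = np.asarray(component_entropies, dtype=float)
    return -np.sum(w * np.log(w)) + np.sum(w * S)

# Example: K = 3 basins per residue pair, n = 2 residues -> K^n = 9 conformations.
w = np.full(9, 1.0 / 9.0)
S_i = np.full(9, 2.5)            # hypothetical per-conformation entropies
S_free = free_end_entropy(w, S_i)
assert np.isclose(S_free, np.log(9) + 2.5)
```

The first term is the discrete entropy of choosing a basin, and the second is the average intra-basin entropy; fixing the ends removes most basins from the sum, which is the entropy reduction described above.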
5. Conclusions

This work reviewed and built on techniques from the fields of robotics, information theory, and theoretical polymer science, and applied these to model conformational entropy in protein loops. At the core of this presentation was the mathematics of rigid-body motion and associated statistical computations, as well as the use of inequalities from information theory for developing a rigorous mathematical treatment of the entropy of unfolded, partially folded, and fully folded proteins. Models of conformational statistics in these three kinds of ensembles were reviewed and developed. These models were then applied to compute entropy differences. The various concepts of entropy in statistical mechanics, computational polymer science, and information theory were reviewed. The distinction between conformational entropy computed in internal and Cartesian coordinates was made. Inequalities to bound the value of entropy from below and above were presented in cases when exact computations were judged to be intractable.
ACKNOWLEDGMENTS

This work was performed with support from the National Institutes of Health under grants R01GM075310 and R01GM075310-04S1. The author thanks Profs. G. Rose and E. Lattman for their valuable comments, Dr. W. Park for proofreading, and Dr. S. Lee and Ms. Y. Wang for creating some of the figures.
REFERENCES

Amato, N. M., and Song, G. (2002). Using motion planning to study protein folding pathways. J. Comput. Biol. 9(2), 149–168. Amato, N. M., Dill, K. A., and Song, G. (2003). Using motion planning to map protein folding landscapes and analyze folding kinetics of known native structures. J. Comput. Biol. 10(3–4), 239–255. Anfinsen, C. B. (1973). Principles that govern folding of protein chains. Science 181(4096), 223–230. Baldwin, R. L., and Rose, G. D. (1999). Is protein folding hierarchic? I. Local structure and peptide folding. Trends Biochem. Sci. 24(1), 26–33. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000). The protein data bank. Nucleic Acids Res. 28(1), 235–242. Birshtein, T. M., and Ptitsyn, O. B. (1966). Conformations of Macromolecules. Interscience, New York. Boehr, D. D., Nussinov, R., and Wright, P. E. (2009). The role of dynamic conformational ensembles in biomolecular recognition. Nat. Chem. Biol. 5, 789–796. Boyd, R. H., and Phillips, P. J. (1993). The Science of Polymer Molecules, Cambridge Solid State Science Series. Cambridge University Press, Cambridge. Bracken, C., Iakoucheva, L. M., Romero, P. R., and Dunker, A. K. (2004). Combining prediction, computation and experiment for the characterization of protein disorder. Curr. Opin. Struct. Biol. 14(5), 570–576. Bryngelson, J. D., Onuchic, J. H., Socci, N. D., and Wolynes, P. G. (1995). Funnels, pathways, and the energy landscape of protein-folding—A synthesis. Proteins: Struct. Funct. Genet. 21(3), 167–195. Canutescu, A. A., and Dunbrack, R. L. (2003). Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 12(5), 963–972. Chirikjian, G. S. (2001). Conformational statistics of macromolecules using generalized convolution. Comput. Theor. Polym. Sci. 11, 143–153. Chirikjian, G. S. (2009). Stochastic Models, Information Theory, and Lie Groups. Birkhäuser, Boston. Chirikjian, G. S.
(2010). Group theory and biomolecular conformation, I.: Mathematical and computational models. J. Phys. Condens. Matter 22, 323103. Chirikjian, G. S., and Burdick, J. W. (1992). A geometric approach to hyper-redundant manipulator obstacle avoidance. ASME J. Mech. Des. 114, 580–585. Chirikjian, G. S., and Burdick, J. W. (1994). A modal approach to hyper-redundant manipulator kinematics. IEEE Trans. Robot. Autom. 10, 343–354. Chirikjian, G. S., and Kyatkin, A. B. (2000). An operational calculus for the Euclidean motion group with applications in robotics and polymer science. J. Fourier Anal. Appl. 6 (6), 583–606. Chirikjian, G. S., and Kyatkin, A. B. (2001). Engineering Applications of Noncommutative Harmonic Analysis. CRC Press, Boca Raton.
Chirikjian, G. S., and Wang, Y. (2000). Conformational statistics of stiff macromolecules as solutions to PDEs on the rotation and motion groups. Phys. Rev. E 62(1), 880–892. Crippen, G. M. (2001). A Gaussian statistical mechanical model for the equilibrium thermodynamics of barnase folding. J. Mol. Biol. 306(3), 565–573. Crippen, G. M. (2004). Statistical mechanics of protein folding by cluster distance geometry. Biopolymers 75(3), 278–289. D’Aquino, J. A., Gomez, J., Hilser, V. J., Lee, K. H., Amzel, L. M., and Fieire, E. (1996). The magnitude of the backbone conformational entropy change in protein folding. Proteins: Struct. Funct. Genet. 25, 143–156. Das, P., Moll, M., Stamati, H., Kavraki, L. E., and Clementi, C. (2006). Low-dimensional free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction. PNAS 103(26), 9885–9890. de Gennes, P. G. (1979). Scaling Concepts in Polymer Physics. Cornell University Press, Ithaca. des Cloizeaux, J., and Jannink, G. (1990). Polymers in Solution: Their Modelling and Structure. Clarendon Press, Oxford. Dill, K. A., Fiebig, K. M., and Chan, H. S. (1993). Cooperativity in Protein-Folding Kinetics. PNAS 90(5), 1942–1946. Doi, M., and Edwards, S. F. (1986). The Theory of Polymer Dynamics. Clarendon Press, Oxford. Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P., Oh, J. S., Oldfield, C. J., Campen, A. M., Ratliff, C. R., Hipps, K. W., Ausio, J., Nissen, M. S., et al. (2001). Intrinsically disordered protein. J. Mol. Graph. Model. 19(1), 26–59. Dunker, A. K., Cortese, M. S., Romero, P., Iakoucheva, L. M., and Uversky, V. N. (2005). Flexible nets—The roles of intrinsic disorder in protein interaction networks. FEBS J. 272(20), 5129–5148. Fang, Q. J., and Shortle, D. (2005). A consistent set of statistical potentials for quantifying local side-chain and backbone interactions. Proteins: Struct. Funct. Bioinform. 60(1), 90–96. Fitzkee, N. C., and Rose, G. D. (2004). 
Reassessing random-coil statistics in unfolded proteins. PNAS 101(34), 12497–12502. Fitzkee, N. C., and Rose, G. D. (2005). Sterics and solvation winnow accessible conformational space for unfolded proteins. J. Mol. Biol. 353(4), 873–887. Flory, P. J. (1969). Statistical Mechanics of Chain Molecules. John Wiley & Sons, New York,(reprinted Hanser Publishers, Munich, 1989). Frederick, K. K., Marlow, M. S., Valentine, K. G., and Wand, A. J. (2007). Conformational entropy in molecular recognition by proteins. Nature 448, 325–330. Gel’fand, I. M., Minlos, R. A., and Shapiro, Z. Ya. (1963). Representations of the Rotation and Lorentz Groups and Their Applications. Pergamon Press, New York. Gong, H. P., and Rose, G. D. (2005). Does secondary structure determine tertiary structure in proteins? Proteins: Struct. Funct. Bioinform. 61(2), 338–343. Grosberg, A. Yu., and Khokhlov, A. R. (1994). Statistical Physics of Macromolecules. American Institute of Physics, New York. Hsu, D., Latombe, J. C., and Motwani, R. (1999). Path planning in expansive configuration spaces. Int. J. Comput. Geom. Appl. 9(4–5), 495–512. Jernigan, R. L., and Bahar, I. (1996). Structure-derived potentials and protein simulations. Curr. Opin. Struct. Biol. 6(2), 195–209. Karplus, M., and Weaver, D. L. (1976). Protein-folding dynamics. Nature 260(5550), 404–406. Kavraki, L. E., Svestka, P., Latombe, J. C., and Overmars, M. H. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 12(4), 566–580.
Kazerounian, K. (2004). From mechanisms and robotics to protein conformation and drug design. J. Mech. Des. 126(1), 40–45. Kazerounian, K., Latif, K., Rodriguez, K., and Alvarado, C. (2005). Nano-kinematics for analysis of protein molecules. J. Mech. Des. 127(4), 699–711. Kim, J. S., and Chirikjian, G. S. (2005). A unified approach to conformational statistics of classical polymer and polypeptide models. Polymer 46(25), 11904–11917. Kim, M. K., Jernigan, R. L., and Chirikjian, G. S. (2005). Rigid-cluster models of conformational transitions in macromolecular machines and assemblies. Biophys. J. 89(1), 43–55. Kortemme, T., Morozov, A. V., and Baker, D. (2003). An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein–protein complexes. J. Mol. Biol. 326(4), 1239–1259. Lavalle, S. M., Finn, P. W., Kavraki, L. E., and Latombe, J. C. (2000). A randomized kinematics-based approach to pharmacophore-constrained conformational search and database screening. J. Comput. Chem. 21(9), 731–747. Lazaridis, T., and Karplus, M. (1999). Effective energy function for proteins in solution. Proteins: Struct. Funct. Genet. 35(2), 133–152. Lee, S., and Chirikjian, G. S. (2004). Inter-helical angle and distance preferences in globular proteins. Biophys. J. 86, 1105–1117. Lee, S., and Chirikjian, G. S. (2005). Pose analysis of alpha-carbons in proteins. Int. J. Rob. Res. 24(2–3), 183–210. Levitt, M. (1983). Protein folding by restrained energy minimization and molecular-dynamics. J. Mol. Biol. 170(3), 723–764. Li, Z., Raychaudhuri, S., and Wand, J. (1996). Insights into the local residual entropy of proteins provided by NMR relaxation. Protein Science 5, 2647–2650. Liu, L., and Chen, S.-J. (2010). Computing the conformational entropy for RNA folds. J. Chem. Phys. 132, 235104. Lotan, I., Schwarzer, F., Halperin, D., and Latombe, J. C. (2004). 
Algorithm and data structures for efficient energy maintenance during Monte Carlo simulation of proteins. J. Comput. Biol. 11(5), 902–932. Manocha, D., Zhu, Y. S., and Wright, W. (1995). Conformational-analysis of molecular chains using nano-kinematics. Comput. Appl. Biosci. 11(1), 71–86. Mattice, W. L., and Suter, U. W. (1994). Conformational Theory of Large Molecules, the Rotational Isomeric State Model in Macromolecular Systems. John Wiley & Sons, New York. Mavroidis, C., Dubey, A., and Yarmush, M. L. (2004). Molecular machines. Annu. Rev. Biomed. Eng. 6, 363–395. Miller, W., Jr. (1968). Lie Theory and Special Functions. Academic Press, New York, also see Miller, W. Jr. (1964). Some applications of the representation theory of the Euclidean group in three-space. Commun. Pure Appl. Math. 17, 527–540. Moult, J. (1997). Comparison of database potentials and molecular mechanics force fields. Curr. Opin. Struct. Biol. 7(2), 194–199. Palmer, A. G., III. (1997). Probing molecular motions by NMR. Curr. Opin. Struct. Biol. 7, 732–737. Pappu, R. V., Srinivasan, R., and Rose, G. D. (2000). The Flory isolated-pair hypothesis is not valid for polypeptide chains: Implications for protein folding. PNAS 97(23), 12565–12570. Patriciu, A., Chirikjian, G. S., and Pappu, R. V. (2004). Analysis of the conformational dependence of mass-metric tensor determinants in serial polymers with constraints. J. Chem. Phys. 121(24), 12708–12720.
Radivojac, P., Obradovic, Z., Smith, D. K., Zhu, G., Vucetic, S., Brown, C. J., Lawson, J. D., and Dunker, A. K. (2004). Protein flexibility and intrinsic disorder. Protein Sci. 13(1), 71–80. Ramachandran, G. N., Ramakrishnan, C., and Sasisekharan, V. (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7(1), 95–99. Rhee, Y. M., and Pande, V. S. (2006). On the role of chemical detail in simulating protein folding kinetics. Chem. Phys. 323, 66–77. Rienstra, C. M., Tucker-Kellogg, L., Jaroniec, C. P., Hohwy, M., Reif, B., McMahon, M. T., Tidor, B., Lozano-Perez, T., and Griffin, R. G. (2002). De novo determination of peptide structure with solid-state magic-angle spinning NMR spectroscopy. PNAS 99(16), 10260–10265. Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, pp. 623–656. Shehu, A., Clementi, C., and Kavraki, L. E. (2006). Modeling protein conformational ensembles: From missing loops to equilibrium fluctuations. Proteins: Struct. Funct. Bioinform. 65(1), 164–179. Shortle, D., and Ackerman, M. S. (2001). Persistence of native-like topology in a denatured protein in 8 M urea. Science 293(5529), 487–489. Skliros, A., and Chirikjian, G. S. (2008). Positional and orientational distributions for locally self-avoiding random walks with obstacles. Polymer 49(6), 1701–1715. Sugiura, M. (1990). Unitary representations and harmonic analysis. 2nd edn. Elsevier Science Publisher, The Netherlands. Talman, J. (1968). Special Functions. W. A. Benjamin, Inc., Amsterdam. Tang, X. Y., Kirkpatrick, B., Thomas, S., Song, G., and Amato, N. M. (2005). Using motion planning to study RNA folding kinetics. J. Comput. Biol. 12(6), 862–881. Teodoro, M., Phillips, G. N., Jr., and Kavraki, L. E. (2001). Molecular docking: A problem with thousands of degrees of freedom. Proceedings of the 2001 IEEE International Conference on Robotics and Automation (ICRA 2001), pp. 960–966. IEEE Press, Seoul. 
Thomas, S., Song, G., and Amato, N. M. (2005). Protein folding by motion planning. Phys. Biol. 2(4), S148–S155. Vajda, S., Sippl, M., and Novotny, J. (1997). Empirical potentials and functions for protein folding and binding. Curr. Opin. Struct. Biol. 7(2), 222–228. Vilenkin, N. J., and Klimyk, A. U. (1991). Representation of Lie Group and Special Functions, Vols. 1–3. Kluwer Academic Publishers, The Netherlands. Vucetic, S., Obradovic, Z., Vacic, V., Radivojac, P., Peng, K., Iakoucheva, L. M., Cortese, M. S., Lawson, J. D., Brown, C. J., Sikes, J. G., Newton, C. D., and Dunker, A. K. (2005). DisProt: A database of protein disorder. Bioinformatics 21(1), 137–140. Wang, Y., and Chirikjian, G. S. (2004). Workspace generation of hyper-redundant manipulators as a diffusion process on SE(N). IEEE Trans. Robot. Autom. 20(3), 399–408. Wang, Y., and Chirikjian, G. S. (2008). Nonparametric second-order theory of error propagation on the Euclidean group. Int. J. Rob. Res. 27(1112), 1258–1273. Wang, J. Y., and Crippen, G. M. (2004). Statistical mechanics of protein folding with separable energy functions. Biopolymers 74(3), 214–220. Wang, C. S. E., Lozano-Perez, T., and Tidor, B. (1998). AmbiPack: A systematic algorithm for packing of macromolecular structures with ambiguous distance constraints. Proteins: Struct. Funct. Genet. 32(1), 26–42. Yang, D., and Kay, L. E. (1996). Contributions to conformational entropy arising from bond vector fluctuations measured from NMR-derived order parameters: Application to protein folding. J. Mol. Biol. 263, 369–382.
Zhang, M., White, R. A., Wang, L., Goldman, R., Kavraki, L. E., and Hassett, B. (2005). Improving conformational searches by geometric screening. Bioinformatics 21(5), 624–630. Zhang, J., Lin, M., Chen, R., Wang, W., and Liang, J. (2008). Discrete state model and accurate estimation of loop entropy of RNA secondary structures. J. Chem. Phys. 128, 125107. Zhou, H.-X. (2001). Loops in proteins can be modeled as worm-like chains. J. Phys. Chem. B 105, 6763–6766. Zhou, Y., and Chirikjian, G. S. (2003). Conformational statistics of bent semiflexible polymers. J. Chem. Phys. 119(9), 4962–4970. Zhou, Y., and Chirikjian, G. S. (2006). Conformational statistics of semi-flexible macromolecular chains with internal joints. Macromolecules 39(5), 1950–1960.
C H A P T E R   F I V E

Inferring Functional Relationships and Causal Network Structure from Gene Expression Profiles

Radhakrishnan Nagarajan* and Meenakshi Upreti†

Contents
1. Introduction
2. Methods
2.1. Synthetic two-gene network modeled as a first-order bivariate VAR
2.2. Diagnostic tests
3. Results
3.1. Impact of noise variance on GC analysis of a synthetic two-gene network
3.2. GC analysis of cell-cycle gene expression profiles
4. Conclusions
References
Abstract

Inferring functional relationships and network structure from observed gene expression profiles can provide novel insight into the working of the genes as a system or network, as opposed to independent entities. Such networks may also represent possible causal relationships between a given set of genes, hence can prove to be a convenient abstraction of the underlying signaling mechanism. The discovery of functional relationships from observed gene expression profiles does not rely on prior literature, hence is useful in identifying undocumented relationships between a given set of genes. Several techniques have been proposed in the literature. The present study investigates the choice of Granger causality (GC) and its extensions in modeling the network structure between a given pair of genes from their expression profiles. The impact of noise variance on GC relationships is investigated. VAR parameter estimation is proposed to obtain a finer insight into the functional relationships inferred using GC tests. The results are presented on synthetic networks generated from known vector-autoregressive (VAR) models and those from cell-cycle gene expression profiles that can be modeled as a first-order bivariate VAR.

* Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
† Division of Radiation Oncology, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA

Methods in Enzymology, Volume 487. ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87005-9. © 2011 Elsevier Inc. All rights reserved.
1. Introduction

Classical biological studies focused on understanding changes in the expression of specific genes (i.e., differential gene expression) across distinct biological states (e.g., control tissue vs. cancer tissue). Such studies, while useful, provide limited insight into the interaction between genes. Capturing the interaction or functional relationships (Gardner et al., 2003) between genes is critical, since they work in concert as a system (Kitano, 2002) and mediate specific phenotypic outcomes. Such a system-level understanding can also prove to be a convenient abstraction of the underlying signaling mechanism. Modeling the interaction by integrating results across mutually exclusive gene expression studies can be challenging. Recent studies have encouraged differential gene expression studies in conjunction with text mining of published literature from large repositories such as PUBMED/MEDLINE in order to arrive at empirical pathways and identify enriched pathways (Ashburner et al., 2000; Subramanian et al., 2005). These approaches integrate knowledge from heterogeneous data from disparate sources. However, discrepancies in findings across laboratories, protocols, tissues, cell lines, and species may impose constraints on constructing empirical pathways from a specific experimental study. Cross talk (Sachs et al., 2005), involvement of multiple pathways (Nagarajan et al., 2010), and noncanonical signaling mechanisms (Korswagen, 2002) also discourage a straightforward construction of pathways from mutually exclusive studies. The above issues are compelling enough to explore alternate approaches to constructing gene networks and pathways. The recent explosion of high-throughput assays enables simultaneous measurement of transcriptional (e.g., microarrays), translational (e.g., protein arrays), and posttranslational (e.g., kinase arrays) activities from biological systems.
Simultaneous measurement of genes is necessary, since the superposition of knowledge across single genes may not provide sufficient insight into their working as a system. Recent studies have used high-throughput assays in conjunction with suitable computational and statistical techniques (Butte et al., 2000; Eisen et al., 1998; Friedman, 2004; Gardner et al., 2003; Pe'er, 2005; Sachs et al., 2005) to infer functional relationships from observed gene expression profiles. It is important to note that computational approaches can prove to be useful initial screening tools, recommend new relationships, and confirm established relationships. However, they cannot substitute for rigorous biological validation.
Inferring Functional Relationships
Gene expression data generated from high-throughput assays such as microarrays fall under observational data. Such data sets are passive and do not necessarily lend themselves to active perturbation. This, in turn, renders the inference problem challenging. In spite of these limitations, several studies have successfully used computational and statistical approaches to infer functional relationships, especially from gene expression data, under implicit assumptions. Gene expression data obtained from high-throughput assays can be with or without temporal information. Lack of temporal information provides only a snapshot of the biological process (i.e., static data). In contrast, temporal data, or time series, capture the evolution of the gene expression dynamics as a function of time (i.e., dynamic data). Techniques such as clustering are widely used to group static as well as dynamic gene expression profiles based on some similarity or correlation metric, the underlying hypothesis being that genes sharing the same expression profile may be functionally related (Eisen et al., 1998). Such an approach may also be useful in attributing a possible function to expressed sequence tags (ESTs) whose expression profiles are similar to those of genes with known function. However, correlation metrics are symmetric and do not provide information about directionality or causation between genes. Correlation between a given pair of genes may also manifest after a finite delay, as their regulation need not be synchronized. Popular clustering techniques are oblivious to such delays and hence have clear limitations. In the case of static data sets and replicate measurements, probabilistic approaches such as Bayesian structure learning techniques prove to be ideal candidates for modeling the network structure. The justification for Bayesian approaches can be attributed to the inherent uncertainty in gene expression measurements across replicates.
Radhakrishnan Nagarajan and Meenakshi Upreti

Several Bayesian structure learning techniques have been widely used for gene network inference (Pe'er, 2005; Sachs et al., 2005). However, inferring the network structure from joint probability distributions may result in a family of networks or equivalent structures (Pe'er, 2005) as opposed to a unique network. Network inference from static data sets may also fail to accommodate possible feedback mechanisms between the genes, rendering the resulting network acyclic. On a related note, feedback mechanisms are ubiquitous in gene networks and signaling cascades and play a critical role in their control and stability. Network inference from temporal gene expression profiles, or dynamic data sets, alleviates some of the concerns faced with static data sets. More importantly, inference with temporal data is useful in modeling feedback relationships. Recent studies have used Granger causality (GC; Granger, 1969) and its extensions to model functional relationships from multivariate time series across distinct biological systems (Brovelli et al., 2004; Gregoriou et al., 2009; Seth et al., 2006; Sridharan et al., 2008), including gene expression profiles (Fujita et al., 2007; Guo et al., 2008; Mukhopadhyay and Chatterjee, 2007; Nagarajan, 2009; Nagarajan and Upreti, 2010). It is important to note that causality as defined by GC essentially represents the forecasting ability between a given pair of processes. The ability of GC to accommodate inherent delays between the processes and its ease of implementation may be possible reasons for its widespread use across several scientific domains, including gene expression analysis. GC is ideally suited to investigate relationships in a bivariate vector-autoregressive (VAR) process. A zero magnitude of an off-diagonal element of a bivariate VAR indicates the absence of a relationship in the corresponding direction. However, in practical settings, statistical tests such as the F-test (Hamilton, 1994; Lutkepohl, 2006), which depend on the ratio of the mean-squared forecast errors, are used for identifying significant GC relationships from observed time series data.
2. Methods

GC represents the forecasting ability between a given pair of processes. In order to reject the null hypothesis that the process "x_t does not Granger-cause y_t," statistical tests such as the F-test (Hamilton, 1994; Lutkepohl, 2006), which rely on the ratio of the mean-squared forecast errors γ_{x→y} = s_1/s_0, are used. Here s_1 is the mean-squared forecast error in predicting y_t from its own past alone, and s_0 is the mean-squared forecast error in predicting y_t from its own past and the past of x_t. If x_t contributes to the forecasting ability, then we expect a decrease in the mean-squared forecast error, s_0 < s_1, and hence an inflation of γ_{x→y}. Since the F-test statistic is a function of γ_{x→y}, we restrict the subsequent discussion to the ratio of the mean-squared forecast errors. In the following, we investigate the contribution of the VAR process parameters to the forecast errors.
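The restricted/unrestricted regression comparison behind this F-test can be sketched directly in NumPy (an illustrative implementation of the standard test, not code from the chapter; the function name `granger_f_test` is ours):

```python
import numpy as np
from scipy.stats import f as f_dist

def granger_f_test(x, y, p=1):
    """Test the null 'x does not Granger-cause y' with lag order p.

    Fits a restricted model (y on its own p lags) and an unrestricted
    model (y on its own p lags plus p lags of x) by least squares, then
    compares residual sums of squares with an F-test.
    """
    n = len(y)
    Y = y[p:]
    ylags = np.column_stack([y[p - i - 1:n - i - 1] for i in range(p)])
    xlags = np.column_stack([x[p - i - 1:n - i - 1] for i in range(p)])
    const = np.ones((n - p, 1))
    Xr = np.hstack([const, ylags])          # restricted regressors
    Xu = np.hstack([const, ylags, xlags])   # unrestricted regressors
    rss_r = np.sum((Y - Xr @ np.linalg.lstsq(Xr, Y, rcond=None)[0]) ** 2)
    rss_u = np.sum((Y - Xu @ np.linalg.lstsq(Xu, Y, rcond=None)[0]) ** 2)
    df_num, df_den = p, (n - p) - (2 * p + 1)
    F = ((rss_r - rss_u) / df_num) / (rss_u / df_den)
    gamma = rss_r / rss_u   # empirical ratio of mean-squared forecast errors
    return F, f_dist.sf(F, df_num, df_den), gamma
```

For a first-order bivariate VAR, `granger_f_test(x, y)` returns the F statistic, its p-value, and the empirical analogue of the ratio γ_{x→y}.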
2.1. Synthetic two-gene network modeled as a first-order bivariate VAR

Consider the first-order bivariate VAR (Hamilton, 1994; Lutkepohl, 2006), Fig. 5.1, given by

\[
\begin{pmatrix} x_t \\ y_t \end{pmatrix} =
\begin{pmatrix} a_1 & b_1 \\ b_2 & a_2 \end{pmatrix}
\begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} +
\begin{pmatrix} \varepsilon_t \\ \eta_t \end{pmatrix}
\tag{5.1}
\]
In Eq. (5.1), (x_t, y_t) represent the mRNA expression of the genes (x, y), respectively, at time t. The diagonal elements a_1, a_2 ∈ ℝ represent the autoregulatory feedback strengths, whereas the off-diagonal elements b_1, b_2 ∈ ℝ represent the transcriptional coupling strengths between the genes. The terms (ε_t, η_t) represent normally distributed, zero-mean, uncorrelated noise with finite variance, attributed to inherent uncertainties in the transcriptional mechanisms (Elowitz et al., 2002; Kaern et al., 2005) and artifacts (Okoniewski and Miller, 2006; Tu et al., 2002) ubiquitous in microarray expression studies.

Figure 5.1 A synthetic two-gene network (Eq. 5.1), where (b_1, b_2) represent the coupling strengths between the genes, (a_1, a_2) represent the autoregulatory feedback strengths, and (ε_t, η_t) represent the corresponding noise terms.

In order to estimate whether x_t Granger-causes y_t (i.e., x GC y), we proceed as follows. From Eq. (5.1), we have

\[
x_t = a_1 x_{t-1} + b_1 y_{t-1} + \varepsilon_t,
\]

so that

\[
x_t = \frac{1}{1 - a_1 B}\bigl(b_1 y_{t-1} + \varepsilon_t\bigr)
\tag{5.2}
\]
where B is the backshift operator such that B x_t = x_{t-1}. The power-series expansion in Eq. (5.2) converges provided |a_1| < 1. From Eq. (5.1), we also have

\[
y_t = b_2 x_{t-1} + a_2 y_{t-1} + \eta_t
\tag{5.3}
\]

Substituting for x_{t-1} from Eq. (5.2) in Eq. (5.3), we get

\[
y_t = \frac{b_2}{1 - a_1 B}\bigl(b_1 y_{t-2} + \varepsilon_{t-1}\bigr) + a_2 y_{t-1} + \eta_t
    = \frac{b_1 b_2}{1 - a_1 B}\, y_{t-2} + a_2 y_{t-1} + \frac{b_2}{1 - a_1 B}\,\varepsilon_{t-1} + \eta_t
\tag{5.4}
\]
From Eq. (5.3), forecasting y_t from the past of y_t as well as x_t results in

\[
\hat{y}_t^{\,f_0} = b_2 x_{t-1} + a_2 y_{t-1}
\tag{5.5}
\]

Therefore, the mean-squared forecast error on forecasting y_t from the past of y_t as well as x_t, from Eqs. (5.3) and (5.5), is

\[
s_0 = \sigma_\eta^2
\tag{5.6}
\]
From Eq. (5.4), forecasting y_t from the past of y_t alone results in

\[
\hat{y}_t^{\,f_1} = \frac{b_1 b_2}{1 - a_1 B}\, y_{t-2} + a_2 y_{t-1}
\tag{5.7}
\]
Therefore, the error in forecasting y_t from its past, from Eqs. (5.4) and (5.7), is

\[
y_t - \hat{y}_t^{\,f_1}
= \frac{b_2}{1 - a_1 B}\,\varepsilon_{t-1} + \eta_t
= b_2\bigl(1 + a_1 B + a_1^2 B^2 + \cdots\bigr)\varepsilon_{t-1} + \eta_t
= b_2\bigl(\varepsilon_{t-1} + a_1 \varepsilon_{t-2} + a_1^2 \varepsilon_{t-3} + \cdots\bigr) + \eta_t
\tag{5.8}
\]
From Eq. (5.8), the mean-squared forecast error on forecasting y_t from its past is

\[
s_1 = \frac{b_2^2}{1 - a_1^2}\,\sigma_\varepsilon^2 + \sigma_\eta^2
\tag{5.9}
\]
From Eqs. (5.6) and (5.9), the ratio of the mean-squared forecast errors corresponding to x_t Granger-causing y_t is given by

\[
\gamma_{x \to y} = \frac{s_1}{s_0}
= \frac{b_2^2}{1 - a_1^2}\,\frac{\sigma_\varepsilon^2}{\sigma_\eta^2} + 1
\tag{5.10}
\]
The ratio of the mean-squared forecast errors corresponding to y_t Granger-causing x_t follows from a similar approach and is given by

\[
\gamma_{y \to x}
= \frac{b_1^2}{1 - a_2^2}\,\frac{\sigma_\eta^2}{\sigma_\varepsilon^2} + 1,
\quad \text{where } |a_2| < 1
\tag{5.11}
\]
Several important observations can be made based on the functional forms of the mean-squared forecast error ratios (Eqs. 5.10 and 5.11).
These ratios (Eqs. 5.10 and 5.11) have significant contributions from the autoregulatory feedback strengths (a_1, a_2) and the ratio of the noise variances (σ_ε², σ_η²), in addition to the coupling strengths (b_1, b_2). The ratios are also immune to the signs of the bivariate VAR process parameters. As expected, the larger the coupling strength, the larger the corresponding ratio of the mean-squared forecast errors (e.g., larger b_2 implies larger γ_{x→y}). Therefore, the gene with the larger coupling strength has a greater tendency to act as the cause compared to the gene with the smaller coupling strength in the bivariate VAR representation. From Eqs. (5.10) and (5.11), it should also be noted that a larger value of |a_i| < 1, i = 1, 2, implies a larger value of the corresponding ratio. Consider the case where the coupling strengths and noise variances are held constant across the genes, that is, b_1 = b_2 = b and σ_η² = σ_ε², but only one of the genes has an autoregulatory feedback, that is, a_1 ≠ 0, a_2 = 0. For these choices of the parameters, Eqs. (5.10) and (5.11) reduce to

\[
\gamma_{x \to y} = \frac{b^2}{1 - a_1^2} + 1,
\qquad
\gamma_{y \to x} = b^2 + 1
\]

Since |a_1| < 1, γ_{x→y} is always greater than γ_{y→x} for identical coupling strengths and noise variances between the genes. In essence, the gene with an autoregulatory feedback has a greater tendency to act as the cause as opposed to the gene with no autoregulatory feedback for the same choice of parameters.
Unlike the autoregulatory feedback and coupling strengths, the magnitude of the noise variance affects the ratio of the mean-squared forecast errors along either direction; that is, changes in the magnitude of the noise variance of a given gene affect not only γ_{x→y} but also γ_{y→x}, and genes with larger noise variance have a greater tendency to act as the cause. Consider the expressions (Eqs. 5.10 and 5.11) where the coupling strengths and autoregulatory feedbacks are held constant across the genes, that is, b_1 = b_2 = b and a_1 = a_2 = a. Let us assume that the noise variances are markedly different, such that σ_ε = kσ_η, k > 1. For these choices of the parameters, we have

\[
\gamma_{x \to y} = \frac{b^2 k^2}{1 - a^2} + 1,
\qquad
\gamma_{y \to x} = \frac{b^2}{(1 - a^2)\,k^2} + 1
\]

Therefore,

\[
\frac{\gamma_{x \to y}}{\gamma_{y \to x}} \propto k^4 = \frac{\sigma_\varepsilon^4}{\sigma_\eta^4}
\tag{5.12}
\]
It is important to recall that noise can be an outcome of stochastic mechanisms underlying transcription or of artifacts prevalent in high-throughput assays such as microarrays. These artifacts can be an outcome of the sensitivity of the probes, nonspecific binding, cross-hybridization, and measurement noise inherent in biological assays (Okoniewski and Miller, 2006; Tu et al., 2002).
2.2. Diagnostic tests

VAR parameter estimation and statistical tests for identifying significant GC relationships are valid only under certain implicit assumptions (Hamilton, 1994; Lutkepohl, 2006) and can lead to spurious conclusions when these assumptions are violated. For instance, the F-test widely used for GC inference implicitly assumes that the gene expression profiles of interest are normally distributed. In the present study, we use a series of diagnostic tests as sanity checks prior to GC analysis of the expression profiles. Since we focus on gene expression profiles that can be modeled as a first-order bivariate VAR, the optimal lag (t_opt) between any given pair of gene expression profiles was confirmed to be one (t_opt = 1) using the Schwarz criterion (Lutkepohl, 2006). Subsequently, the stability of the bivariate VAR was investigated, and diagnostic tests were used to check the validity of the normality assumptions and the absence of serial correlation in the residuals (Lutkepohl, 2006). VAR processes whose reverse characteristic polynomials had roots outside the unit circle were deemed stable. Multivariate normality assumptions were investigated using the Jarque-Bera test. Serial correlation in the residuals up to lag 10 was investigated using the adjusted portmanteau test (Lutkepohl, 2006). Pairs that passed the above diagnostic tests were subjected to tests for GC (F-test) as well as instantaneous GC (IGC, χ² test) (Lutkepohl, 2006). Pairs that exhibited significant IGC were not considered for further investigation. The significance level for all the GC tests (F-tests) was fixed at α = 0.01. A Monte-Carlo approach (Nagarajan, 2009; Nagarajan and Upreti, 2010) was subsequently used to estimate the statistical power of GC relationships generated from synthetic networks where the functional form is known.
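A rough sketch of this diagnostic battery is given below (our own simplified re-implementation, not the authors' code: univariate Jarque-Bera tests on each residual component stand in for the multivariate normality test, and the adjusted multivariate portmanteau statistic follows the textbook form; the function name `var1_diagnostics` is hypothetical):

```python
import numpy as np
from scipy.stats import jarque_bera, chi2

def var1_diagnostics(data, h=10):
    """Sanity checks before GC analysis of an (n x m) multivariate series:
    fit a VAR(1) by least squares, then report (i) stability, (ii) residual
    normality (univariate Jarque-Bera p-value per component, a
    simplification of the multivariate test), and (iii) residual serial
    correlation up to lag h (adjusted portmanteau p-value)."""
    n, m = data.shape
    Y = data[1:]
    Z = np.column_stack([np.ones(n - 1), data[:-1]])
    B = np.linalg.lstsq(Z, Y, rcond=None)[0]   # (1+m) x m coefficients
    A = B[1:].T                                # VAR(1) coefficient matrix
    resid = Y - Z @ B
    # Stable iff all eigenvalues of A lie inside the unit circle
    # (equivalently, roots of the reverse characteristic polynomial outside it).
    stable = bool(np.all(np.abs(np.linalg.eigvals(A)) < 1.0))
    jb_pvalues = [jarque_bera(resid[:, j])[1] for j in range(m)]
    # Adjusted (small-sample) multivariate portmanteau statistic.
    T = resid.shape[0]
    C0inv = np.linalg.inv(resid.T @ resid / T)
    Q = 0.0
    for k in range(1, h + 1):
        Ck = resid[k:].T @ resid[:-k] / T
        Q += np.trace(Ck.T @ C0inv @ Ck @ C0inv) / (T - k)
    Q *= T * T
    port_pvalue = chi2.sf(Q, df=m * m * (h - 1))   # df for a fitted VAR(1)
    return stable, jb_pvalues, port_pvalue
```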
The statistical power is estimated as the fraction of successful rejections across N = 100 independent realizations of the bivariate time series generated from the synthetic network, modeled as a bivariate VAR with nonzero coupling strengths. In the case of the cell-cycle gene expression profiles, a similar approach was used to determine what is termed the proportion (C). Since the functional form of the bivariate VAR is unknown in the case of the cell-cycle gene expression profiles, we first estimate the first-order bivariate VAR process parameters from the given gene expression profiles. Subsequently, the fraction of rejections across N = 100 independent realizations of the bivariate time series generated from the estimated VAR parameters represents the proportion (C). Similar to the statistical power, the proportion
of rejections is bounded in [0, 1], with a large value indicating the possible presence of a significant GC relationship between the given pair of processes in that specific direction, whereas a small value implies the possible absence of a relationship.
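The Monte-Carlo estimate of statistical power described above can be sketched as follows (a simplified re-implementation under the chapter's stated settings, with unit noise variances; not the authors' code, and the function name is ours):

```python
import numpy as np
from scipy.stats import f as f_dist

def rejection_proportion(a1, a2, b1, b2, n=48, N=100, alpha=0.01, seed=2):
    """Fraction of N independent realizations of the bivariate VAR
    (Eq. 5.1, unit noise variances) in which the null 'x does not
    Granger-cause y' is rejected by a lag-1 F-test at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(N):
        x = np.zeros(n)
        y = np.zeros(n)
        for t in range(1, n):
            x[t] = a1 * x[t - 1] + b1 * y[t - 1] + rng.standard_normal()
            y[t] = b2 * x[t - 1] + a2 * y[t - 1] + rng.standard_normal()
        Y, const = y[1:], np.ones(n - 1)
        Xr = np.column_stack([const, y[:-1]])           # own past only
        Xu = np.column_stack([const, y[:-1], x[:-1]])   # plus past of x
        rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
        F = (rss(Xr) - rss(Xu)) / (rss(Xu) / (n - 4))
        rejections += f_dist.sf(F, 1, n - 4) < alpha
    return rejections / N
```

With a strong coupling b_2 the proportion approaches 1, and with no coupling it stays near the significance level.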
3. Results

3.1. Impact of noise variance on GC analysis of a synthetic two-gene network

Consider the first-order bivariate VAR inspired by the example from Granger's seminal article (Granger, 1969, Section 5), given by

\[
\begin{pmatrix} x_t \\ y_t \end{pmatrix} =
\begin{pmatrix} 0 & b_1 \\ b_2 & 0 \end{pmatrix}
\begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} +
\begin{pmatrix} \varepsilon_t \\ \eta_t \end{pmatrix}
\tag{5.13}
\]

Following the frequency-domain approach outlined in Section 5 of Granger (1969), the causality-coherence along either direction is given by

\[
C_{x \to y}(\omega) = \frac{b_2^2\, \sigma_\varepsilon^4}
{\bigl(\sigma_\varepsilon^2 + b_1^2 \sigma_\eta^2\bigr)\bigl(\sigma_\eta^2 + b_2^2 \sigma_\varepsilon^2\bigr)},
\qquad
C_{y \to x}(\omega) = \frac{b_1^2\, \sigma_\eta^4}
{\bigl(\sigma_\varepsilon^2 + b_1^2 \sigma_\eta^2\bigr)\bigl(\sigma_\eta^2 + b_2^2 \sigma_\varepsilon^2\bigr)}
\]

For equal coupling strengths (b_1 = b_2 = b), the ratio of the causality-coherences takes the form

\[
\frac{C_{x \to y}(\omega)}{C_{y \to x}(\omega)} = \frac{\sigma_\varepsilon^4}{\sigma_\eta^4}
\tag{5.14}
\]

Of interest is the similarity between the expression in the frequency domain (Eq. 5.14) and that in the time domain (Eq. 5.12). To illustrate the impact of noise on the statistical conclusions, we consider the first-order bivariate VAR (Eq. 5.13) with identical coupling strengths (b_1 = b_2 = b, where b is a real number), zero-mean Gaussian noise (ε_t, η_t), and absence of autoregulatory feedback (a_1 = a_2 = 0):

\[
\begin{pmatrix} x_t \\ y_t \end{pmatrix} =
\begin{pmatrix} 0 & b \\ b & 0 \end{pmatrix}
\begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} +
\begin{pmatrix} \varepsilon_t \\ \eta_t \end{pmatrix}
\tag{5.15}
\]
Stability of the bivariate VAR (Eq. 5.15) is guaranteed provided the roots of the reverse characteristic polynomial

\[
\det\!\left[
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
- \lambda
\begin{pmatrix} 0 & b \\ b & 0 \end{pmatrix}
\right] = 0
\]

lie outside the unit circle (i.e., λ_{1,2} = ±1/b with |λ_{1,2}| > 1); that is, the coupling strength should satisfy |b| < 1. The number of samples was fixed to be the same as that of the cell-cycle gene expression profiles (n = 48). The noise variance was fixed at σ_η = 1 for one of the processes (y_t), whereas for the other (x_t) it was scaled by a noise factor k such that σ_ε = kσ_η. The statistical power (α = 0.01, F-test) along either direction, estimated from 100 independent realizations of Eq. (5.15) as a function of the noise factor (k = 1.0, 2.0, 3.0, 4.0) and coupling strength (b = −0.8, 0.8, 0.4, 0.2), is shown in Fig. 5.2. For equal noise variances (k = 1), the statistical power corresponding to x GC y and y GC x is similar irrespective of the coupling strength (b = −0.8, 0.8, 0.4, 0.2) (Fig. 5.2A–D). This is to be expected: for these choices of the VAR process parameters, the ratios of the mean-squared forecast errors coincide (γ_{x→y} = γ_{y→x}) from Eqs. (5.10) and (5.11), and hence the statistical power is unchanged across x GC y and y GC x. However, as k increases, a considerable discrepancy in the statistical power is observed between the two directions. For the larger coupling strengths (b = −0.8, 0.8), irrespective of the sign of the coupling strength, the discrepancy in power is not pronounced for the various choices of k (Fig. 5.2A and B). However, a marked discrepancy in the statistical power as a function of increasing noise factor (k = 3, 4) is observed at the smaller coupling strengths (b = 0.4, 0.2; Fig. 5.2C and D). This is to be expected, since an increase in k inflates γ_{x→y} while deflating γ_{y→x}. Thus, the GC relationship in which the noisier gene acts as the cause has the higher statistical power. From the above case study, it is clear that a bidirectional network with equal coupling strengths (b) can be rendered unidirectional provided there is sufficient discrepancy in the noise variances of the genes. Understanding the contribution of the noise may therefore be critical in avoiding spurious conclusions about functional relationships between a given pair of genes using GC analysis.

Figure 5.2 Variation in statistical power as a function of the noise factor (k) for the synthetic two-gene network (Eq. 5.15) at coupling strengths b = −0.8, 0.8, 0.4, 0.2 (panels A–D). The solid and dotted lines correspond to the statistical power for y GC x and x GC y, respectively.
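The behavior summarized in Fig. 5.2 can be reproduced in outline with a seeded simulation of Eq. (5.15); the helper `power_pair` below is hypothetical (our own sketch, not the study's code):

```python
import numpy as np
from scipy.stats import f as f_dist

def power_pair(b, k, n=48, N=100, alpha=0.01, seed=3):
    """Rejection proportions (x GC y, y GC x) for Eq. (5.15) with equal
    couplings b and noise factor k scaling the noise on x."""
    rng = np.random.default_rng(seed)

    def reject(cause, effect):
        e1, e0, c0 = effect[1:], effect[:-1], cause[:-1]
        const = np.ones_like(e0)
        rss = lambda X: np.sum((e1 - X @ np.linalg.lstsq(X, e1, rcond=None)[0]) ** 2)
        r = rss(np.column_stack([const, e0]))        # restricted
        u = rss(np.column_stack([const, e0, c0]))    # unrestricted
        F = (r - u) / (u / (n - 4))
        return f_dist.sf(F, 1, n - 4) < alpha

    hits = np.zeros(2)
    for _ in range(N):
        x = np.zeros(n)
        y = np.zeros(n)
        for t in range(1, n):
            x[t] = b * y[t - 1] + k * rng.standard_normal()
            y[t] = b * x[t - 1] + rng.standard_normal()
        hits += [reject(x, y), reject(y, x)]
    return hits / N
```

At k = 1 the two directions behave alike; at k = 4 the direction with the noisier gene as the cause dominates, mirroring the unidirectional rendering described above.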
3.2. GC analysis of cell-cycle gene expression profiles

In this section, we identify functional relationships from cell-cycle gene expression profiles (Whitfield et al., 2002) using GC analysis. We also investigate the impact of the noise variance on the GC conclusions. Only those gene expression profiles that pass the diagnostic tests (Section 2.2) and can be modeled as a first-order bivariate VAR are considered. The proportions (C) (Section 2.2) are estimated from the given gene expression profiles as well as from those generated after constraining the noise variances to be equal (σ_ε = σ_η = 1, k = 1), sampled from zero-mean Gaussian noise.

UBE2C, TLOC1: UBE2C represents the ubiquitin-conjugating enzyme E2C and is involved in the G2 phase of the cell cycle. TLOC1 represents the translocation protein 1 and is involved in the G1/S phase of the cell cycle (Whitfield et al., 2002). This gene pair satisfied the diagnostic tests and could be modeled as a first-order bivariate VAR (Section 2.2). The F-test (α = 0.01) revealed a statistically significant GC relationship along either direction, UBE2C → TLOC1 as well as TLOC1 → UBE2C. The proportion of rejections across 100 independent realizations (Section 2.2) along either direction was C_{UBE2C→TLOC1} ≈ 0.95 and C_{TLOC1→UBE2C} ≈ 0.89, indicative of a significant GC relationship along either direction, that is, a bidirectional network. However, VAR parameter estimation revealed a considerable discrepancy in the noise variance between these genes, with σ_UBE2C ≫ σ_TLOC1 and k ≈ 4.3 ≫ 1. On constraining the noise variances to be equal (k = 1, σ_UBE2C = σ_TLOC1 = 1), the proportions of rejections along either direction changed to C_{UBE2C→TLOC1} ≈ 0.54 and C_{TLOC1→UBE2C} ≈ 1.0. These results raise concern as to whether the high proportion estimate of UBE2C → TLOC1 obtained from the given
expression profile was solely due to the discrepancy in noise variance, σ_UBE2C ≫ σ_TLOC1, with the noisier gene having a tendency to act as the cause. This case study elucidates the nontrivial impact of the noise variance on functional relationships inferred using GC analysis. From the above analysis, the relationship TLOC1 → UBE2C, as opposed to UBE2C → TLOC1, may be of interest.

RGS3, RNPS1: RGS3 represents the regulator of G protein signaling 3 and is involved in the G2 phase of the cell cycle. RNPS1 represents the RNA-binding protein S1 and is involved in the G2/M phase of the cell cycle (Whitfield et al., 2002). This pair satisfied the diagnostic tests and had an optimal delay t_opt = 1, and hence can be modeled as a first-order bivariate VAR. GC analysis revealed a significant GC relationship only in the direction RGS3 → RNPS1 (α = 0.01, F-test), rendering the relationship acyclic. The proportions of statistically significant GC relationships across 100 independent realizations, estimated using the Monte-Carlo approach, were C_{RGS3→RNPS1} = 0.88 and C_{RNPS1→RGS3} = 0.25. However, the noise factor from VAR parameter estimation was k ≈ 4.3, with one of the genes exhibiting a higher noise variance than the other, that is, σ_RGS3 ≫ σ_RNPS1. Also, the gene with the higher noise variance (i.e., RGS3) acted as the cause in the resulting acyclic approximation, RGS3 → RNPS1. Constraining the noise variances to be equal (σ_ε = σ_η = 1, k = 1) completely reversed the conclusions, with proportions C_{RGS3→RNPS1} = 0.19 and C_{RNPS1→RGS3} = 0.94, respectively. Therefore, the functional relationship RGS3 → RNPS1 obtained on GC analysis of the given gene expression profiles should be regarded with caution.

IKKα, NFκB: A recent study (Guo et al., 2008) reported functional relationships between these genes involved in NF-κB signaling (Fig. 3 in Guo et al., 2008) using pair-wise GC analysis.
We reinvestigated this pair (IKKα, NFκB), which had nonzero expression values across the 48 time points, passed the VAR diagnostic tests, and could be modeled as a first-order bivariate VAR. However, our analysis failed to reveal significant GC in any direction (α = 0.01) between these genes, in contrast to the findings by Guo et al. (2008). The proportions estimated by the Monte-Carlo approach across 100 independent realizations (Section 2.2) were low, as expected: C_{IKKα→NFκB} = 0.21 and C_{NFκB→IKKα} = 0.23. VAR estimation revealed a discrepancy in the noise variance, with noise factor k ≈ 1.5. Constraining the noise variances to be equal (σ_ε = σ_η = 1, k = 1) did not appreciably change the conclusions.
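The noise-equalization counterfactual used in these case studies can be sketched roughly as follows (our own simplified re-implementation, with hypothetical names: estimate the VAR(1) coefficients by least squares, regenerate surrogates with both noise variances forced to 1, and recompute the proportion of rejections of "x does not Granger-cause y"):

```python
import numpy as np
from scipy.stats import f as f_dist

def equalized_noise_proportion(x, y, N=100, alpha=0.01, seed=4):
    """Estimate VAR(1) parameters from an observed pair (x, y), then
    regenerate N surrogate realizations with equal unit noise variances
    (sigma_e = sigma_h = 1) and return the fraction of rejections of
    'x does not Granger-cause y' (lag-1 F-test)."""
    n = len(x)
    Z = np.column_stack([x[:-1], y[:-1]])
    ax, bx = np.linalg.lstsq(Z, x[1:], rcond=None)[0]  # x_t = ax*x_{t-1} + bx*y_{t-1}
    by, ay = np.linalg.lstsq(Z, y[1:], rcond=None)[0]  # y_t = by*x_{t-1} + ay*y_{t-1}
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(N):
        xs = np.zeros(n)
        ys = np.zeros(n)
        for t in range(1, n):   # surrogate with equalized noise variances
            xs[t] = ax * xs[t - 1] + bx * ys[t - 1] + rng.standard_normal()
            ys[t] = by * xs[t - 1] + ay * ys[t - 1] + rng.standard_normal()
        e1, e0, c0 = ys[1:], ys[:-1], xs[:-1]
        const = np.ones_like(e0)
        rss = lambda X: np.sum((e1 - X @ np.linalg.lstsq(X, e1, rcond=None)[0]) ** 2)
        r = rss(np.column_stack([const, e0]))
        u = rss(np.column_stack([const, e0, c0]))
        F = (r - u) / (u / (n - 4))
        hits += f_dist.sf(F, 1, n - 4) < alpha
    return hits / N
```

Comparing this proportion against the one computed from surrogates with the originally estimated noise variances separates coupling-driven causality from noise-driven causality.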
4. Conclusions

Understanding the working of genes as a system is critical, as phenotypic outcomes are mediated by the concerted working of genes. Several techniques have been proposed in the literature that infer functional
relationships between genes from their observed temporal gene expression profiles obtained from high-throughput assays. In this chapter, we briefly investigated the choice of GC analysis in inferring functional relationships between the genes that can be modeled as first-order bivariate VAR. Subsequently, the impact of the noise variance on GC results was elucidated. VAR parameter estimation was proposed to obtain a finer interpretation. The results were demonstrated on a synthetic gene network with known functional form and cell-cycle temporal gene expression profiles.
REFERENCES

Ashburner, M., et al. (2000). Gene ontology: Tool for the unification of biology. Nat. Genet. 25(1), 25–29.
Brovelli, A., et al. (2004). Beta oscillations in a large-scale sensorimotor cortical network: Directional influences revealed by Granger causality. Proc. Natl. Acad. Sci. USA 101, 9849–9854.
Butte, A. J., et al. (2000). Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl. Acad. Sci. USA 97(22), 12182–12186.
Eisen, M. B., et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95(25), 14863–14868.
Elowitz, M. B., et al. (2002). Stochastic gene expression in a single cell. Science 297(5584), 1183–1186.
Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science 303, 799–805.
Fujita, A., et al. (2007). Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method. Bioinformatics 23(13), 1623–1630.
Gardner, T. S., et al. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105.
Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37(3), 424–438.
Gregoriou, G. G., et al. (2009). High-frequency, long-range coupling between prefrontal and visual cortex during attention. Science 324, 1207–1210.
Guo, S., et al. (2008). Uncovering interactions in the frequency domain. PLoS Comput. Biol. 4(5), e1000087.
Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press, Princeton, NJ.
Kaern, M., et al. (2005). Stochasticity in gene expression: From theories to phenotypes. Nat. Rev. Genet. 6, 451–464.
Kitano, H. (2002). Systems biology: A brief overview. Science 295, 1662–1664.
Korswagen, H. C. (2002). Canonical and non-canonical Wnt signaling pathways in Caenorhabditis elegans: Variations on a common signaling theme. Bioessays 24(9), 801–810.
Lutkepohl, H. (2006). New Introduction to Multiple Time Series Analysis. Springer-Verlag, Heidelberg.
Mukhopadhyay, N. D., and Chatterjee, S. (2007). Causality and pathway search in microarray time series experiment. Bioinformatics 23(4), 442–449.
Nagarajan, R. (2009). A note on inferring acyclic network structures using Granger causality tests. Int. J. Biostat. 5, 1.
Nagarajan, R., and Upreti, M. (2010). Granger causality analysis of human cell-cycle gene expression profiles. Stat. Appl. Genet. Mol. Biol. 9(1), 31.
Nagarajan, R., et al. (2010). Functional relationships between genes associated with differentiation potential of aged myogenic progenitors. Front. Physiol. 1, 21.
Okoniewski, M. J., and Miller, C. J. (2006). Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics 7, 276.
Pe'er, D. (2005). Bayesian network analysis of signaling networks: A primer. Sci. STKE 281, pl4.
Sachs, K., et al. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721), 523–529.
Seth, A. K., et al. (2006). Theories and measures of consciousness: An extended framework. Proc. Natl. Acad. Sci. USA 103, 10799–10804.
Sridharan, D., et al. (2008). A critical role for the right fronto-insular cortex in switching between central-executive and default-mode networks. Proc. Natl. Acad. Sci. USA 105(34), 12569–12574.
Subramanian, A., et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102(43), 15545–15550.
Tu, Y., et al. (2002). Quantitative noise analysis for gene expression microarray experiments. Proc. Natl. Acad. Sci. USA 99(22), 14031–14036.
Whitfield, M. L., et al. (2002). Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell 13(6), 1977–2000.
CHAPTER SIX

Numerical Solution of the Chemical Master Equation: Uniqueness and Stability of the Stationary Distribution for Chemical Networks, and mRNA Bursting in a Gene Network with Negative Feedback Regulation

E. S. Zeron* and M. Santillán†,‡

Contents
1. Introduction
2. The Chemical Master Equation
3. Irreducible Chemical Reaction Systems
4. Stability of the Chemical Master Equation Stationary Probability Distribution
5. Two Different Algorithms to Calculate Stationary Probability Distributions for the Chemical Master Equation
6. Gene Expression with Negative Feedback Regulation
7. Concluding Remarks
References
Abstract

In this work, we introduce two algorithms to compute the stationary probability distribution for the chemical master equation (CME) of arbitrary chemical networks. We further find the conditions that guarantee the algorithms' convergence and the uniqueness and stability of the stationary distribution. Next, we employ these algorithms to study the mRNA and protein probability distributions in a gene regulatory network subject to negative feedback regulation. In particular, we analyze the influence of the promoter activation/deactivation speed on the shape of such distributions. We find that a reduction of the promoter activation/deactivation speed modifies the shape of those distributions in a way consistent with the phenomenon known as mRNA (or transcription) bursting.

* Centro de Investigación y de Estudios Avanzados del IPN, Departamento de Matemáticas, Av. Instituto Politécnico Nacional 2508, México DF, México
† Centro de Investigación y Estudios Avanzados del IPN, Unidad Monterrey, Parque de Investigación e Innovación Tecnológica, Apodaca, NL, México
‡ Centre for Applied Mathematics in Bioscience and Medicine, 3655 Promenade Sir William Osler, McIntyre Medical Building, Montreal, Canada

Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87006-0
© 2011 Elsevier Inc. All rights reserved.
1. Introduction

The emergence of disciplines like systems and synthetic biology has ushered in a renewed interest in the development of mathematical models to study the processes taking place at the intracellular level (Breitling, 2010; Robeva, 2010). Numerous modeling approaches have been proposed to study the dynamic behavior of such systems. Nonetheless, all of them rely on the conceptualization of the cell as a complex reactor, where a huge number of interlinked chemical reactions take place. Another feature common to all modeling approaches is the assumption that the global chemical-reaction network in a cell can be theoretically decomposed into a set of relatively independently functioning subsystems, each of which has a network structure of its own. Hence, one of the most frequently employed strategies consists in picking a given subsystem and constructing the corresponding mathematical model from the underlying chemical-reaction network (Robeva, 2010). Most of the models developed to analyze cellular systems can be classified into one of two categories: (A) those that employ ordinary differential equations (ODEs) to deterministically model the dynamics of the system's chemical-species concentrations, and (B) those that utilize the chemical master or the chemical Langevin equation to model the time evolution of the system probability distribution function (Simpson et al., 2009). In general, deterministic models can be described as follows: Let X(t) be a vector denoting the system state; that is, the entries of X correspond to the concentrations of all the chemical species in the network to be modeled. The dynamics of X are then governed by the ODE system

\[
\frac{dX}{dt} = F(X),
\]

where the vectorial function F(X) is determined by the rates of all the chemical reactions taking place within the system, as well as by the corresponding stoichiometric coefficients.
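As a concrete illustration of this deterministic formulation, consider a hypothetical two-species transcription-translation network (our example, not one analyzed in this chapter); its F(X) follows from mass-action kinetics, and the system can be integrated numerically:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical rate constants: mRNA M produced at constant rate k_m and
# degraded at rate g_m * M; protein P translated at rate k_p * M and
# degraded at rate g_p * P.
k_m, g_m, k_p, g_p = 2.0, 0.1, 5.0, 0.05

def F(t, X):
    """Right-hand side dX/dt = F(X), derived from mass-action kinetics."""
    M, P = X
    return [k_m - g_m * M, k_p * M - g_p * P]

sol = solve_ivp(F, (0.0, 400.0), [0.0, 0.0], rtol=1e-8, atol=1e-8)
M_ss, P_ss = sol.y[:, -1]   # approaches the steady state (k_m/g_m, k_p*M/g_p)
```

For these parameter values the trajectory relaxes to the fixed point M* = k_m/g_m = 20 and P* = k_p M*/g_p = 2000, the kind of deterministic prediction discussed in the text.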
As a matter of fact, the specific form of F(X ) for a given network can be derived by means of the law of mass action of chemical kinetics. The resulting ODE systems are often nonlinear and impossible to solve analytically, except for a few exceptional cases. However, the available numerical algorithms provide accurate and
computationally cheap solutions in most situations. Moreover, nonlinear dynamics and bifurcation analysis are powerful analytical tools to study the dependence of the system behavior on the parameter values. In other words, there are several numerical and analytical techniques to analyze deterministic ODE models, and this makes them an effective means to study the dynamics of chemical networks. On the other hand, deterministic models have one serious flaw. As we have mentioned, this modeling approach relies on the law of mass action, which is only strictly valid when the count of molecules involved in all chemical reactions is of the order of Avogadro's number. Unfortunately, this is not the case in many intracellular processes. In particular, the number of protein molecules involved in gene regulation can be as low as a few dozen. In consequence, the predictions from deterministic models fail to reproduce the fluctuations in molecule numbers (also known as biochemical noise) originated by the stochastic nature of individual chemical reactions (Arkin et al., 1998; Elowitz et al., 2002; Paulsson, 2004; Pedraza and van Oudenaarden, 2005; Shahrezaei and Swain, 2008; Swain et al., 2002; Thattai and van Oudenaarden, 2001). In fact, the solution X(t) of a given ODE model corresponds in general to the average over an ensemble of identical chemical reactors (Simpson et al., 2009). Biochemical noise has become the object of numerous theoretical and experimental studies aimed at answering questions such as: How does noise affect cell functioning? How has the cell chemical circuitry evolved to minimize noise when it is detrimental? How is noise important for the functioning of some cellular networks?
Most of these questions remain unanswered to this day, and their solution will necessarily involve an accurate modeling framework (Arkin et al., 1998; Elowitz et al., 2002; Paulsson, 2004; Pedraza and van Oudenaarden, 2005; Shahrezaei and Swain, 2008; Simpson et al., 2009; Swain et al., 2002; Thattai and van Oudenaarden, 2001). The most fundamental theoretical treatment for studying biochemical noise is the chemical master equation (CME). In it, the state of the system is defined by the vector x(t), whose entries correspond to the number of molecules of each chemical species at time t. A detailed description of the CME shall be presented in the next section, but let us say for now that it is a linear differential equation governing the time evolution of the probability P(x, t) that the system is in state x at time t. Despite its linearity, the CME is impossible to solve analytically, except for a few special cases. Therefore, one has to recur to numerical solutions and/or make use of simplifying assumptions (Simpson et al., 2009). For instance, while x(t) is a discrete variable, it can be treated as a continuous one when the mean values of the molecular counts in the system are large, and so the CME can be approximated by the so-called chemical Langevin equation (Gillespie, 2000, 2002). Another common approach
150
E. S. Zeron and M. Santilla´n
consists in developing stochastic simulation algorithms (SSA) that are equivalent to Monte Carlo simulations of the CME. Most of the available algorithms are derivations of the celebrated algorithm proposed by Gillespie (1977). While the Gillespie algorithm is exact in the sense that it generates statistically correct trajectories, it happens to be too computationally expensive in most situations. Much work has been aimed at improving stochastic algorithms by means of sound approximations (Burrage et al., 2004; Cao et al., 2005a,b; Liu and Vanden-Eijnden, 2005; Zhou et al., 2008). Although approximation techniques show exceptional promise to accelerate stochastic simulations, understanding when to use these algorithms as well as the magnitude of the error they introduce is an area of active research. Still another strategy consists in numerically solving the CME. The advantage of this approach is that it directly returns the system probability distribution function that, in many cases, is more adequate than stochastic simulations to analyze the system dynamic behavior. Munsky and Khammash (2006, 2007) have proposed the so-called finite state projection algorithm to solve the CME, while Macnamara et al. (2008) have worked out improvements for this algorithm. In a previous work (Zeron and Santilla´n, 2010), we introduced a different algorithm, based on the Jacobi method, to numerically solve the CME. We tested such algorithm by studying the influence of negative feedback regulation on intrinsic noise in a simple gene regulatory network. Comparison with equivalent results from stochastic simulations validated the proposed algorithm. However, we were unable to provide at that time a formal demonstration for the algorithm convergence. In the present work, we expand the results in Zeron and Santilla´n (2010) by introducing a couple of algorithms to calculate the steady-state solution of the CME and analyzing their convergence. 
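To make the SSA concrete, here is a minimal sketch of Gillespie's direct method in Python (the language used for the implementations later in this chapter). The birth-death reaction pair, its rates, and all function names are illustrative choices, not part of any cited reference:

```python
import random

def gillespie_ssa(x0, propensities, stoich, t_max, rng):
    """Gillespie's direct method: samples one statistically exact CME trajectory."""
    t, x = 0.0, list(x0)
    trajectory = [(t, tuple(x))]
    while t < t_max:
        a = [f(x) for f in propensities]      # propensities a(k, x)
        a0 = sum(a)
        if a0 == 0.0:                         # absorbing state: nothing can fire
            break
        t += rng.expovariate(a0)              # waiting time ~ Exponential(a0)
        r, k, acc = rng.random() * a0, 0, a[0]
        while acc < r:                        # choose reaction k with prob a_k / a0
            k += 1
            acc += a[k]
        x = [xi + vi for xi, vi in zip(x, stoich[k])]
        trajectory.append((t, tuple(x)))
    return trajectory

# Hypothetical birth-death pair: 0 -> X at rate kb, X -> 0 at rate kd * n
kb, kd = 5.0, 1.0
props = [lambda x: kb, lambda x: kd * x[0]]
stoich = [(+1,), (-1,)]
traj = gillespie_ssa((0,), props, stoich, 1000.0, random.Random(1))
```

With these rates, the time-averaged molecule count along a long trajectory approaches the deterministic steady state kb/kd = 5, while individual trajectories fluctuate around it; this fluctuation is exactly the biochemical noise discussed above.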
The chapter is organized as follows: In Section 2, we introduce the CME. In Section 3, we define what irreducible chemical systems are and analyze them. In Section 4, we study the existence, uniqueness, and (asymptotic) stability of the stationary distribution for irreducible chemical systems. In Section 5, we introduce two different algorithms to find the stationary probability distribution function of arbitrary chemical systems, and prove their convergence.
2. The Chemical Master Equation

Consider a chemical system (or reactor) in which n different chemical species are involved in m different chemical reactions. As previously mentioned, the state of this system can be represented by a vector of natural numbers $x(t) \in \mathbb{N}^n$, such that the molecule count at time t of the ℓth chemical species is represented by the ℓth entry of the vector x(t) (for which we use the notation $x_\ell(t)$). Furthermore, each one of the chemical reactions taking place in this reactor can be represented as usual:

$$x \xrightarrow{a(k,x)} x + v[k], \qquad k = 1, 2, \ldots, m, \tag{6.1}$$

where $a(k,x) \ge 0$ and $v[k] \in \mathbb{Z}^n$ denote the propensity and the stoichiometric vector of the kth reaction, respectively. The product $a(k,x)\,dt$ is the probability that the kth reaction takes place during the time interval $[t, t+dt]$ ($dt \ge 0$), while each of the entries of $v[k] \in \mathbb{Z}^n$ determines the change produced in the molecule count of the corresponding species by the occurrence of the kth chemical reaction. Here and hereafter, we assume that no propensity $a(k,x)$ depends explicitly on time. In accordance with the fact that negative molecule counts are impossible, $a(k,x)$ vanishes whenever any entry of x or $x + v[k]$ is negative. Similarly, we assume the existence of a natural number $s \ge 1$ such that $a(k,x)$ vanishes whenever any entry of x or $x + v[k]$ is larger than s. The above constraints can be restated as follows: $a(k,x)$ vanishes whenever x and/or $x + v[k]$ are not in $S^n$, with

$$S := \{0, 1, 2, \ldots, s\}. \tag{6.2}$$

Let $P(x,t) \ge 0$ be the probability that the system state at time t is x. In agreement with the constraints discussed in the previous paragraph, we assume that $P(x,t)$ vanishes whenever $x \notin S^n$. From the previous definitions, the time evolution of $P(x,t)$ is governed by the so-called CME (or forward Kolmogorov equation):

$$\frac{\partial P(x,t)}{\partial t} = \sum_{k=1}^{m} \left[ a(k, x - v[k])\, P(x - v[k], t) - a(k,x)\, P(x,t) \right]. \tag{6.3}$$

For a deeper discussion, see Gillespie (1976, 1977, 1992) or Higham (2008). The normalization condition for the probability distribution function $P(x,t)$ implies that

$$\sum_{x \in S^n} P(x,t) = 1.$$

It further follows from this that

$$\sum_{x \in S^n} \frac{\partial P(x,t)}{\partial t} = 0.$$

The last result is in agreement with the fact that

$$\sum_{x \in S^n} a(k, x - v[k])\, P(x - v[k], t) = \sum_{x \in S^n} a(k,x)\, P(x,t).$$

Finally, by definition, any stationary distribution $P^*(x)$ must satisfy $\partial P^*(x)/\partial t = 0$, and so

$$P^*(x) \sum_{j=1}^{m} a(j,x) = \sum_{k=1}^{m} a(k, x - v[k])\, P^*(x - v[k]). \tag{6.4}$$
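Before moving on, the stationarity condition (6.4) can be checked on a small example. For a hypothetical truncated birth-death system (a single species with zeroth-order production and first-order degradation), the truncated Poisson distribution is the exact stationary solution; the sketch below verifies Eq. (6.4) term by term. All names and parameter values are illustrative:

```python
from math import factorial

# Hypothetical truncated birth-death system on S = {0, ..., s}:
#   reaction k=0:  x -> x + 1, propensity kb      (vanishes when x + 1 > s)
#   reaction k=1:  x -> x - 1, propensity kd * x  (vanishes when x - 1 < 0)
kb, kd, s = 5.0, 1.0, 30
v = [+1, -1]

def a(k, x):
    # propensities vanish whenever x or x + v[k] falls outside S
    if not (0 <= x <= s and 0 <= x + v[k] <= s):
        return 0.0
    return kb if k == 0 else kd * x

def dP_dt(P, x):
    """Right-hand side of the CME, Eq. (6.3), at state x."""
    gain = sum(a(k, x - v[k]) * P[x - v[k]]
               for k in range(2) if 0 <= x - v[k] <= s)
    loss = sum(a(k, x) for k in range(2)) * P[x]
    return gain - loss

# For this system the truncated Poisson distribution with lam = kb / kd is
# stationary, so every dP_dt term vanishes, i.e., Eq. (6.4) holds.
lam = kb / kd
P = [lam ** x / factorial(x) for x in range(s + 1)]
Z = sum(P)
P = [p / Z for p in P]
residuals = [dP_dt(P, x) for x in range(s + 1)]
```

Note that for an arbitrary (non-stationary) distribution the individual terms do not vanish, but their sum still does, which is the probability-conservation statement derived above.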
3. Irreducible Chemical Reaction Systems

A chemical reaction system is said to be irreducible if and only if, given any arbitrary pair of states x and y in $S^n$, there exists a finite set of reactions of the form

$$x_j \xrightarrow{a(k_j, x_j)} x_{j+1} := x_j + v[k_j], \qquad j = 1, 2, \ldots, q,$$

such that $a(k_j, x_j) > 0$ for all j, with $x_1 = x$ and $x_{q+1} = y$. In other words,

$$x = x_1 \xrightarrow{a(k_1, x_1)} x_2 \xrightarrow{a(k_2, x_2)} x_3 \xrightarrow{a(k_3, x_3)} \cdots \xrightarrow{a(k_q, x_q)} x_{q+1} = y. \tag{6.5}$$

The adjective irreducible originates from the fact that the matrix representing the set of chemical reactions in Eq. (6.1) is irreducible for these systems. To see this, let us sort the elements of the finite set $S^n$ using the lexicographic order. Then, the stationary distribution $P^*(x)$ can be seen as a vector, where each element $x \in S^n$ is an index of $P^*$. The chemical reactions represented in Eq. (6.1) can be similarly codified into a square matrix whose entries $A[x,y]$ are given by

$$A[x,y] := \begin{cases} a(k,x), & \text{if } y = x + v[k] \text{ for some } k, \\ 0, & \text{otherwise.} \end{cases} \tag{6.6}$$

Notice that a given entry $A[x,y]$ is strictly positive if and only if there is a chemical reaction in Eq. (6.1) that takes the system from state x into state y and has a strictly positive propensity $a(k,x) = A[x,y]$. It is straightforward to see that the system of chemical reactions in Eq. (6.1) is irreducible if and only if the matrix A in Eq. (6.6) is irreducible according to the characterization given in Bapat and Raghavan (1997, p. 4) or Berman and Plemmons (1979, p. 30); the reason being that the associated graph G(A) is strongly connected via the paths (Eq. (6.5)) transforming the system from any arbitrary state x into any other arbitrary state y. We shall use later the fact that the transpose matrix A′ is irreducible if and only if A is irreducible; see, for example, Bapat and Raghavan (1997, p. 4) or Berman and Plemmons (1979, p. 27).

The irreducibility of a matrix depends only on which entries are zero and which are strictly positive, and so it is invariant under the multiplication of every entry by a strictly positive number. Consider, for instance, the invertible diagonal matrix

$$D[x,y] := \begin{cases} d(x), & \text{if } y = x, \\ 0, & \text{otherwise,} \end{cases}$$

where $d(x) > 0$ is a collection of positive numbers indexed by $x \in S^n$. Then, the entries of the product DA are given by $(DA)[x,y] = d(x)\, A[x,y]$. Since an entry $(DA)[x,y]$ vanishes if and only if $A[x,y]$ vanishes, it is easy to prove that the matrix A is irreducible if and only if DA is irreducible. The importance of analyzing irreducible chemical reaction systems arises from the fact that, as we shall see later, these systems have a unique stationary distribution.
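Since irreducibility is equivalent to strong connectivity of the graph G(A), it can be checked mechanically. The sketch below does a breadth-first search from every state; the matrices and examples are hypothetical toys, not taken from the text:

```python
from collections import deque

def is_irreducible(A):
    """A nonnegative matrix is irreducible iff its directed graph G(A) is
    strongly connected; check this with a breadth-first search from every node."""
    n = len(A)
    adj = [[y for y in range(n) if A[x][y] > 0] for x in range(n)]
    for start in range(n):
        seen = {start}
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        if len(seen) < n:      # some state is unreachable from `start`
            return False
    return True

# Toy examples: a 3-state cycle is irreducible, a one-way chain is not.
cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
# Scaling the rows by positive numbers d(x) (the product DA of the text)
# leaves irreducibility unchanged.
scaled = [[(x + 1) * cycle[x][y] for y in range(3)] for x in range(3)]
```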
4. Stability of the Chemical Master Equation Stationary Probability Distribution

We begin by regarding the CME as a linear differential equation in matrix form. Let us endow the set $S^n$ with the lexicographic order and define a pair of square matrices D and B (D being diagonal) as follows:

$$B[\hat x, \hat y] := \begin{cases} a(k, \hat y), & \text{if } \hat y = \hat x - v[k] \text{ for some } k, \\ 0, & \text{otherwise,} \end{cases} \tag{6.7}$$

and

$$D[\hat x, \hat y] := \begin{cases} \displaystyle\sum_{j=1}^{m} a(j, \hat x), & \text{if } \hat x = \hat y, \\ 0, & \text{otherwise.} \end{cases} \tag{6.8}$$

Notice that the entries $B[\hat x, \hat y]$ and $D[\hat x, \hat y]$ are indexed by the vectors $\hat x$ and $\hat y$ in $S^n$. The diagonal matrix D is invertible because for every state $\hat x \in S^n$ there exists at least one strictly positive propensity $a(j, \hat x) > 0$ for some index j. Furthermore, observe that the matrix A in Eq. (6.6) is nothing else but the transpose of B in Eq. (6.7), since

$$B'[x,y] = B[y,x] = A[x,y] = \begin{cases} a(k,x), & \text{if } y = x + v[k], \\ 0, & \text{otherwise.} \end{cases} \tag{6.9}$$

Finally, it follows from the definitions of D and B that the CME can be rewritten as

$$\frac{\partial P(x,t)}{\partial t} = \sum_{k=1}^{m} \left[ a(k, x - v[k])\, P(x - v[k], t) - a(k,x)\, P(x,t) \right] = \sum_{y \in S^n} \left[ B[x,y] - D[x,y] \right] P(y,t). \tag{6.10}$$

The analysis below relies strongly on the fact that $D^{-1}A$ is a stochastic matrix according to the definitions in Meyer (2000, p. 687) and Seber (2008, p. 212). That is, all the entries of this matrix are nonnegative and the sum of all elements in a given row equals one:

$$\sum_{y \in S^n} (D^{-1}A)[x,y] = \left[ \sum_{j=1}^{m} a(j,x) \right]^{-1} \left[ \sum_{k=1}^{m} a(k,x) \right] = 1. \tag{6.11}$$
As a consequence of the above fact, each eigenvalue w of $D^{-1}A$ has absolute value smaller than or equal to one; see Seber (2008, p. 212), and notice that $|w| \le |w - p| + p \le 1$, with p the minimum value in the main diagonal of $D^{-1}A$. On the other hand, $w_0 = 1$ is a semisimple eigenvalue of $D^{-1}A$; see Meyer (2000, p. 696). That is, the Jordan blocks associated with $w_0 = 1$ are all trivial; in other words, there is an invertible matrix M satisfying

$$D^{-1}A = M^{-1} \begin{pmatrix} \mathrm{Id} & 0 \\ 0 & S \end{pmatrix} M, \tag{6.12}$$

where Id is the identity matrix and S is an upper triangular matrix with no element in its main diagonal equal to one; see Elaydi (1999, p. 143), Meyer (2000, p. 510), or Seber (2008, p. 92). Finally, given that $P^*(x)$ is a solution of Eq. (6.4), and so satisfies $BP^* = DP^*$, every stationary distribution $P^*(x) \ge 0$ lies in the kernel of B − D. In the same way, each nonzero eigenvector $V(x) \ge 0$ in the kernel of B − D yields a stationary distribution after renormalization:

$$P^*(x) = \frac{V(x)}{\sum_{y \in S^n} V(y)} \ge 0, \qquad \text{so that} \qquad \sum_{x \in S^n} P^*(x) = 1. \tag{6.13}$$
We have seen that a stationary probability function $P^*(x)$ can be constructed from every nonzero vector $V(x) \ge 0$ in the kernel of B − D. In the following theorem, we prove the existence of at least one such vector.

Theorem 6.1 The inequality $|w| \le 1$ holds for every eigenvalue w of the matrix product $BD^{-1}$ defined through Eqs. (6.7) and (6.8). Each nonzero eigenvalue $\lambda \ne 0$ of the difference B − D has a negative real part: ℛ(λ) < 0. There exists at least one nonzero nonnegative vector $V(x) \ge 0$ in the kernel of B − D. Finally, $\lambda_1 = 0$ is always a semisimple eigenvalue of B − D, and it is a simple eigenvalue when the system of chemical reactions (Eq. (6.1)) associated with B − D is irreducible according to the definition given in Section 3.

Proof We begin by showing that every nonzero eigenvalue $\lambda \ne 0$ of B − D has a negative real part: ℛ(λ) < 0. For the sake of simplicity, we analyze the transpose B′ − D′ = A − D instead of B − D, and use the fact that both have the same eigenvalues. Notice that D′ = D since this matrix is diagonal, and recall that A′ = B according to Eq. (6.9). Let λ be any eigenvalue of A − D and let U ≠ 0 be its associated eigenvector. Hence, AU = λU + DU. Choose $\hat x \in S^n$ such that $|U(\hat x)| \ge |U(x)|$ holds for every index $x \in S^n$ (observe that $U(\hat x) \ne 0$ because U ≠ 0). Therefore,

$$\left| \lambda + \sum_{k=1}^{m} a(k, \hat x) \right| |U(\hat x)| = \left| \sum_{y \in S^n} A[\hat x, y]\, U(y) \right| = \left| \sum_{k=1}^{m} a(k, \hat x)\, U(\hat x + v[k]) \right| \le \left[ \sum_{k=1}^{m} a(k, \hat x) \right] |U(\hat x)|.$$

Since $a(k, \hat x) \ge 0$ for all k and $U(\hat x) \ne 0$, the inequality in the previous equation may only hold if λ = 0 or ℛ(λ) < 0.

We showed in Eq. (6.11) that $D^{-1}A$ is a stochastic matrix (i.e., it is nonnegative and the sum of all the elements in a given row equals one). We also have that $BD^{-1}$ and its transpose $D^{-1}A$ have exactly the same eigenvalues and Jordan forms. That is, $w_0 = 1$ is a semisimple eigenvalue of $BD^{-1}$, and $|w| \le 1$ for every eigenvalue w of $BD^{-1}$; see Meyer (2000, p. 696) and Seber (2008, p. 212). Since $BD^{-1}$ is also nonnegative, the Perron–Frobenius theorem grants the existence of a nonzero eigenvector $U(x) \ge 0$ such that (see Bapat and Raghavan, 1997, p. 35 or Serre, 2002, p. 81)

$$BD^{-1} U = U \qquad \text{and} \qquad (B - D) D^{-1} U = 0. \tag{6.14}$$

The last equation further implies that the vector $V = D^{-1}U$ lies in the kernel of B − D. Furthermore, V is nonzero and nonnegative, given that, according to Eq. (6.8), D is a diagonal matrix with strictly positive entries in the main diagonal. Since $w_0 = 1$ is a semisimple eigenvalue of $BD^{-1}$, the Jordan blocks associated with $w_0 = 1$ are all trivial according to Eq. (6.12) (see also Elaydi, 1999, p. 143), and so there exists a square invertible matrix M satisfying

$$BD^{-1} = M^{-1} \begin{pmatrix} \mathrm{Id} & 0 \\ 0 & T \end{pmatrix} M,$$

where Id is the identity matrix and T is a lower triangular matrix with no element in its main diagonal equal to one. This result further implies that

$$B - D = M^{-1} \begin{pmatrix} 0 & 0 \\ 0 & T - \mathrm{Id} \end{pmatrix} M D.$$

Notice that MD is invertible and that T − Id is lower triangular with no zero element in its main diagonal. Therefore, the number of zero eigenvalues $\lambda_1 = 0$ of the matrix B − D is equal to the dimension of its kernel, and so $\lambda_1 = 0$ is a semisimple eigenvalue of B − D; see Elaydi (1999, p. 143), Meyer (2000, p. 510), or Seber (2008, p. 92).

Finally, the matrix A in Eq. (6.6), the product $D^{-1}A$, and its transpose $BD^{-1}$ are all irreducible whenever the system of chemical reactions (Eq. (6.1)) is irreducible according to the definition given in Section 3. It then follows from the Perron–Frobenius theorem (see Bapat and Raghavan, 1997, p. 17; Berman and Plemmons, 1979, p. 27; or Serre, 2002, p. 82) that the eigenvalue $w_0 = 1$ of $BD^{-1}$ and the eigenvalue $\lambda_1 = 0$ of B − D are both simple. Recall that the number of eigenvalues $\lambda_1 = 0$ of the matrix B − D, the dimension of its kernel, and the number of eigenvalues $w_0 = 1$ of $BD^{-1}$ are all equal. □

To end this section, the foregoing theorem gives sufficient conditions for the existence, uniqueness, and stability of the stationary distributions $P^*(x)$ of the system in Eqs. (6.3)–(6.10). We wish to emphasize that this is the main result of the present section.
Theorem 6.2 The system (Eqs. (6.3)–(6.10)) is stable and has at least one nonzero stationary distribution $P^*(x) \ge 0$ satisfying Eq. (6.4). Furthermore, the stationary distribution $P^*(x)$ is unique and asymptotically stable whenever the system of chemical reactions (Eq. (6.1)) is irreducible in the sense of Section 3.

Proof Theorem 6.1 implies that the real part ℛ(λ) of any nonzero eigenvalue λ of the square matrix B − D defined in Eqs. (6.7) and (6.8) is strictly negative. Moreover, $\lambda_1 = 0$ is a semisimple eigenvalue of B − D. Then, the results in Amann (1990, p. 201) or Brugnano and Trigiante (1998, p. 4) imply that any stationary distribution of the system (Eqs. (6.3)–(6.10)) is stable. Theorem 6.1 also grants the existence of at least one nonzero nonnegative vector $V(x) \ge 0$ that lies in the kernel of B − D. A stationary distribution $P^*(x)$ can then be constructed by normalizing V(x); see Eq. (6.13). On the other hand, the eigenvalue $\lambda_1 = 0$ of B − D is simple when the system of chemical reactions (Eq. (6.1)) is irreducible according to the definition given in Section 3. This finally implies that the distribution $P^*(x)$ is unique and asymptotically stable; the uniqueness follows from the fact that the kernel of B − D has real dimension one. □
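Theorem 6.2 can be illustrated numerically. The sketch below integrates $dP/dt = (B - D)P$ for a hypothetical three-state cyclic system with an explicit Euler scheme; both the system and the integration scheme are illustrative choices, not part of the text. Since the system is irreducible, every initial condition relaxes to the unique stationary distribution, here (1/3, 1/3, 1/3):

```python
# Three-state cyclic system: 0 -> 1 -> 2 -> 0, all propensities equal to 1.
n = 3
B = [[0.0] * n for _ in range(n)]
for y in range(n):
    B[(y + 1) % n][y] = 1.0        # inflow into state y+1 comes from state y
D = [1.0] * n                      # D[x,x] = total propensity out of state x

def euler_step(P, dt):
    """Explicit Euler step for dP/dt = (B - D) P."""
    return [P[x] + dt * (sum(B[x][y] * P[y] for y in range(n)) - D[x] * P[x])
            for x in range(n)]

P = [1.0, 0.0, 0.0]                # start far from the stationary distribution
for _ in range(20000):
    P = euler_step(P, 0.01)
```

Because every column of B − D sums to zero, the Euler iteration conserves total probability, and the asymptotic stability guaranteed by Theorem 6.2 shows up as convergence of P to the kernel vector of B − D.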
5. Two Different Algorithms to Calculate Stationary Probability Distributions for the Chemical Master Equation

We introduce now an algorithm, based on the Jacobi method, to calculate the steady-state probability distribution $P^*(x)$ of an irreducible system. Let $\hat p_0(x) \ge 0$ be an arbitrary initial distribution and define the sequence of nonnegative functions $\hat p_q(x) \ge 0$ via the following iterative process:

$$\hat p_{q+1}(x) := \frac{\sum_{y \in S^n} B[x,y]\, \hat p_q(y)}{D[x,x]} = \sum_{k=1}^{m} \frac{a(k, x - v[k])}{D[x,x]}\, \hat p_q(x - v[k]), \tag{6.15}$$

where B and D are given by Eqs. (6.7) and (6.8), respectively. Then,

$$\hat p_{q+1} = D^{-1} B\, \hat p_q, \qquad \text{with} \quad D[x,x] = \sum_{j=1}^{m} a(j,x) > 0. \tag{6.16}$$
It follows from Eq. (6.16) that $\hat p_q = (D^{-1}B)^q\, \hat p_0$ for every integer $q \ge 0$. It is important to emphasize that the distributions $\hat p_q(x) \ge 0$ do not need to be normalized, because the 1-norms $\|D\hat p_{q+1}\|$ and $\|D\hat p_q\|$ coincide. That is,

$$\sum_{x \in S^n} D[x,x]\, \hat p_{q+1}(x) = \sum_{k=1}^{m} \sum_{y \in S^n} a(k,y)\, \hat p_q(y) = \sum_{y \in S^n} D[y,y]\, \hat p_q(y).$$
In the following theorem, we study sufficient conditions for the convergence of $\hat p_q(x)$, as $q \to \infty$, to a positive multiple of a stationary distribution $P^*(x)$ satisfying Eq. (6.4).

Theorem 6.3 Let $\hat p_0(x) \ge 0$ be an arbitrary initial distribution. The sequence of nonnegative functions $\hat p_q(x) \ge 0$ defined via the iterative process (Eqs. (6.15)–(6.16)) converges, as $q \to \infty$, to a positive multiple of a stationary solution $P^*(x) \ge 0$ satisfying Eq. (6.4), provided the following two conditions hold:

A. The vector $D\hat p_0 \ge 0$ has a nontrivial projection onto the eigenspace associated with the eigenvalue $w_0 = 1$ of $BD^{-1}$.
B. $w_0 = 1$ is the only eigenvalue w of $BD^{-1}$ (counting multiplicity) with absolute value $|w| = 1$.

The fulfillment of condition A is immediate because, given any initial distribution $\hat p_0(x)$, the projection of $D\hat p_0$ onto the eigenspace associated with the eigenvalue $w_0 = 1$ is nontrivial with probability one. Moreover, Theorem 6.1 and the existence of a nonzero eigenvector $U \ge 0$ satisfying Eq. (6.14) together imply that $w_0 = 1$ is an eigenvalue of $BD^{-1}$.

Proof Notice that $D\hat p_q = (BD^{-1})^q D\hat p_0$ for every integer $q \ge 1$. Theorem 6.1 and condition B automatically imply that the eigenvalues $w_j$ of $BD^{-1}$ can be organized in decreasing order as follows:

$$w_0 = 1 > |w_1| \ge |w_2| \ge |w_3| \ge \cdots$$

The power method (see Groetsch and King, 1988, p. 274 or Golub and Van Loan, 1996, pp. 330–332) then implies that $\hat p_q(x) \ge 0$ converges to a nonzero vector $V(x) \ge 0$ as $q \to \infty$. In consequence, DV is an eigenvector associated with the maximal eigenvalue $w_0 = 1$ of $BD^{-1}$. To obtain the stationary solution $P^*(x)$, we only need to normalize V(x) using Eq. (6.13), the reason being that

$$BD^{-1}(DV) = DV \quad \text{if and only if} \quad (B - D)V = 0. \qquad \square$$

In order to ensure that the sequence $\hat p_q$ converges as $q \to \infty$, it is essential that the matrix $BD^{-1}$ have a unique eigenvalue w with absolute value $|w| = 1$. To see this necessity, consider the following example. Let r be an arbitrary real number in the interval [0,1], and take the matrices B and D and the initial distribution $\hat p_0$ given by

$$B := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad D := \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \hat p_0 := \begin{pmatrix} 1-r \\ r \end{pmatrix}.$$

It is straightforward to prove that the eigenvalues of $BD^{-1}$ are $\pm 1$, and that the iteratively calculated functions $\hat p_q$ oscillate between
$$\begin{pmatrix} 1-r \\ r \end{pmatrix} \qquad \text{and} \qquad \begin{pmatrix} r \\ 1-r \end{pmatrix}.$$

To finish this section, we introduce a second algorithm (also based on the Jacobi method) to calculate the stationary distribution $P^*(x)$. The advantage of this second algorithm is that it does not need condition B of Theorem 6.3. Let $\hat p_0(x) \ge 0$ be an arbitrary initial distribution and let $p_0(x) \equiv 0$. Define the function sequences $\hat p_q(x) \ge 0$ and $p_q(x) \ge 0$ via the iterative processes

$$p_{q+1}(x) := p_q(x) + \hat p_q(x), \qquad \hat p_{q+1}(x) := \sum_{k=1}^{m} \frac{a(k, x - v[k])}{D[x,x]}\, \hat p_q(x - v[k]), \tag{6.17}$$

where $D[x,x] = \sum_{j=1}^{m} a(j,x) > 0$ is as given by Eq. (6.8) or Eq. (6.16). In the next theorem, we prove that $p_q(x)/q$ always converges, as $q \to \infty$, to a positive multiple of a stationary distribution $P^*(x)$ satisfying Eq. (6.4).

Theorem 6.4 Let $\hat p_0(x) \ge 0$ be an arbitrary distribution and let $p_0(x) \equiv 0$. The sequence of nonnegative functions $p_q(x)/q \ge 0$ defined via the iterative process (Eq. (6.17)) converges, as $q \to \infty$, to a positive multiple of a stationary solution $P^*(x) \ge 0$ satisfying Eq. (6.4), provided the following condition holds:

A. The vector $D\hat p_0 \ge 0$ has a nontrivial projection onto the eigenspace associated with the eigenvalue $w_0 = 1$ of $BD^{-1}$.

The fulfillment of condition A is immediate because, given any initial distribution $\hat p_0(x) \ge 0$, the projection of $D\hat p_0$ onto the eigenspace associated with the eigenvalue $w_0 = 1$ is nontrivial with probability one.
Proof Consider the matrices A, B, and D defined in Eqs. (6.6)–(6.8). We showed in Eq. (6.11) that the product $D^{-1}A$ is a stochastic matrix, because every entry of $D^{-1}A$ is nonnegative and the sum of all elements in a given row is equal to one (see Bapat and Raghavan, 1997, p. 45 or Meyer, 2000, p. 687). Furthermore, Theorem 1.9.7 in Bapat and Raghavan (1997, p. 50) (see also Meyer, 2000, p. 697) implies the existence of a square matrix Q such that

$$Q = \lim_{q \to \infty} \frac{1}{q} \sum_{k=0}^{q-1} (D^{-1}A)^k \qquad \text{and} \qquad Q\, D^{-1}A = Q. \tag{6.18}$$

Since D is diagonal and A′ = B (according to Eq. (6.9)), the transpose Q′ is equal to $BD^{-1}Q'$. Let us now define the square matrix $Y := D^{-1} Q' D$. Then,

$$Y = \lim_{q \to \infty} D^{-1} \left[ \frac{1}{q} \sum_{k=0}^{q-1} (BD^{-1})^k \right] D = \lim_{q \to \infty} \frac{1}{q} \sum_{k=0}^{q-1} (D^{-1}B)^k.$$

On the other hand, it is straightforward to prove that

$$p_q = \sum_{k=0}^{q-1} (D^{-1}B)^k\, \hat p_0 \ge 0.$$

Finally, the two previous equations imply that $p_q/q$ converges to the nonnegative vector $V := Y\hat p_0$ as $q \to \infty$. To obtain the stationary distribution $P^*(x)$, we only need to normalize $V(x) \ge 0$ using Eq. (6.13). The rationale behind this assertion is that

$$DV = DY\hat p_0 = Q'D\hat p_0 = BD^{-1}Q'D\hat p_0 = BY\hat p_0 = BV.$$

Recall that $Y = D^{-1}Q'D$ and $Q' = BD^{-1}Q'$. □
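Both the failure mode of the first algorithm and the robustness of the second can be seen on the two-state counterexample of Theorem 6.3. In the sketch below, one Jacobi step is a plain swap of the two entries (since B is the exchange matrix and D the identity), while the Cesàro accumulation of Eq. (6.17) averages the oscillation away; function names and the value of r are illustrative:

```python
def swap_step(p):
    """D^{-1}B for the two-state counterexample: B = [[0,1],[1,0]], D = Id,
    so one Jacobi step (Eq. (6.15)) simply swaps the two entries."""
    return [p[1], p[0]]

def cesaro_stationary(p_hat, step, q_max):
    """Second algorithm, Eq. (6.17): accumulate p_{q+1} = p_q + phat_q,
    then return p_{q_max} / q_max."""
    p = [0.0] * len(p_hat)
    for _ in range(q_max):
        p = [pi + hi for pi, hi in zip(p, p_hat)]
        p_hat = step(p_hat)
    return [pi / q_max for pi in p]

r = 0.3
p1 = swap_step([1 - r, r])      # plain Jacobi iterate: entries swapped
p2 = swap_step(p1)              # and swapped back: the iteration oscillates forever
avg = cesaro_stationary([1 - r, r], swap_step, 1000)
# avg approaches (1/2, 1/2), the unique stationary distribution of this system
```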
6. Gene Expression with Negative Feedback Regulation

In this section, we introduce a simple gene regulatory network with negative feedback regulation at the transcriptional level, and analyze its stochastic dynamic behavior by means of the previously introduced techniques. A schematic representation of this system is given in Fig. 6.1. We assume that the promoter can be either active (Da) or inactive (Di); that transcription can only be initiated at, and so mRNA molecules (M) can only be synthesized from, an active promoter; that proteins (P), translated from mRNA molecules, increase the number of metabolite molecules (R), either by catalyzing their production or by taking them up from the environment; and that this metabolite interacts with the transcription initiation complex, turning the promoter inactive. The B12 riboswitch is a good example of the type of regulation illustrated in Fig. 6.1. In this case, one of the genes under regulation of the corresponding promoter codes for a protein involved in the uptake of vitamin B12. This vitamin further interacts with the transcription initiation complex, prematurely terminating transcription and preventing the production of mature mRNA (Santillán and Mackey, 2005).
[Figure 6.1 diagrams the network: the promoter switches between active (Da) and inactive (Di) states at rates kd+ and kd−; mRNA (M) is transcribed at rate km and degraded at rate gm; proteins (P) are translated at rate kp and degraded at rate gp; metabolites (R) are produced via proteins at rate kr and degraded at rate gr.]

Figure 6.1 Schematic representation of a simple gene network with negative feedback regulation. See the main text for more details.
As for the parameters in Fig. 6.1, $k_m$ is the probability per unit time of transcription, $k_p$ the probability per unit time of translation, $g_m$ the probability per unit time of mRNA degradation, $g_p$ the probability per unit time of protein degradation, $k_r$ the probability per unit time that a protein P catalyzes the production of a molecule R or takes it up from the environment, $g_r$ the probability per unit time of metabolite degradation, $k_d^+$ the probability per unit time that a molecule R binds the active promoter, and $k_d^-$ the probability per unit time that the R:Da complex dissociates.

Of all the reactions depicted in Fig. 6.1, the synthesis and degradation of metabolites are much faster than transcription, translation, mRNA degradation, protein degradation, and the promoter flipping between states Da and Di. If we take into account that the metabolite molecule count is of the order of tens of thousands, its number does not change noticeably when one of them binds Da. From these considerations, we can perform a separation of time scales and make an adiabatic approximation, following Zeron and Santillán (2010), as follows. Notice that only the metabolite count $n_R$ is affected by the fast reactions. For the sake of convenience, we introduce the variable $y = n_R$. Let us denote the number of active promoters by $n_D$, the mRNA count by $n_M$, and the protein count by $n_P$. Then, the vector $x = (n_D, n_M, n_P)$ represents the state of all the chemical species affected by the slow reactions. According to Zeron and Santillán (2010), the dynamics of this system are governed by the following reduced CME:

$$\frac{\partial P(x,t)}{\partial t} = \sum_{k=1}^{m'} \left[ a(k, x - v[k])\, P(x - v[k], t) - a(k,x)\, P(x,t) \right], \tag{6.19}$$

where $m' = 6$ is the number of slow reactions, while $a(k,x)$ is the effective propensity of the kth reaction, given the state x of the slow subsystem. The effective propensities are in turn given by

$$a(k,x) = \sum_{y} b(k,x,y)\, P(y \mid x, \infty), \tag{6.20}$$

with $b(k,x,y)$ the propensity of the kth reaction and $P(y \mid x, \infty)$ the steady-state probability distribution for the fast variable y, given the state x of the slow subsystem. The propensity for the metabolite-synthesis reaction is $k_r n_P$, while that for the metabolite-degradation reaction is $g_r n_R$. Therefore, these processes are equivalent to those of mRNA synthesis and degradation in a constitutive gene, and so the stationary distribution for $n_R$ (given the protein count $n_P$) is the following Poisson distribution (Shahrezaei and Swain, 2008; Thattai and van Oudenaarden, 2001):
$$P(n_R \mid n_P, \infty) = \frac{\lambda^{n_R} e^{-\lambda}}{n_R!},$$

where $\lambda = k_r n_P / g_r$. With this, the effective-propensity vector for the slow reaction subsystem comes out to be

$$a(x) = \begin{pmatrix} k_d^- (1 - n_D) \\ k_d^+ (k_r n_P / g_r)\, n_D \\ k_m n_D \\ g_m n_M \\ k_p n_M \\ g_p n_P \end{pmatrix}. \tag{6.21}$$

The rows in this propensity vector respectively correspond to the following reactions: promoter activation, promoter inactivation, mRNA synthesis, mRNA degradation, protein synthesis, and protein degradation. In the derivation of the above propensities, we have used the fact that the mean of the variable $n_R$ is given by

$$\sum_{n_R = 0}^{\infty} n_R \frac{\lambda^{n_R} e^{-\lambda}}{n_R!} = \lambda = \frac{k_r n_P}{g_r}.$$

On the other hand, the stoichiometric matrix corresponding to the slow reaction subsystem is

$$v = \begin{pmatrix} 1 & 0 & 0 \\ -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & -1 \end{pmatrix}. \tag{6.22}$$

The rows in this matrix correspond to the different chemical reactions in the system, in the same order as in the propensity vector (Eq. (6.21)), while each column determines the change produced by the reactions on the molecule count of the following chemical species: active promoter, mRNA, and protein. To perform the forthcoming analysis, we consider the parameter values estimated in Zeron and Santillán (2010) (where a very similar system is studied) and tabulated in Table 6.1.
Table 6.1 Parameter values for the gene regulatory network analyzed here

  km       1.0 min⁻¹
  kp       10.0 min⁻¹
  kr/gr    100 molecules
  gm       1.0 min⁻¹
  gp       0.1 min⁻¹

Table 6.2 Combinations of kd− and kd+ values employed in this work

  kd− (min⁻¹):               500, 50, 5, 0.5, 0.05, 0.005
  kd+ (min⁻¹ molecules⁻¹):   1, 0.1, 0.01, 0.001, 0.0001, 0.00001
As for the parameters $k_d^+$ and $k_d^-$, Zeron and Santillán (2010) estimated that

$$K_D = \frac{k_d^-}{k_d^+} \in [1, 10^6] \ \text{molecules},$$

where $K_D = 1$ molecule corresponds to a very strong feedback loop, while $K_D = 10^6$ molecules corresponds to a very weak feedback loop. Here, we consider $K_D = 5000$ molecules, which represents an intermediately strong negative feedback regulation. Zeron and Santillán (2010) implicitly assumed that $k_d^-, k_d^+ \gg 1$, and so regarded the promoter activation and deactivation reactions as fast while reducing the system by means of an adiabatic approximation. In the present work, we drop that assumption and consider the combinations of $k_d^-$ and $k_d^+$ tabulated in Table 6.2, all of which satisfy $k_d^- / k_d^+ = 5000$ molecules. Since $K_D$ remains constant, the probability that the promoter is active at a given time is the same in all cases. However, the activation and deactivation processes become slower as $k_d^-$ decreases. Therefore, by considering all the $k_d^-$ values in Table 6.2, we can investigate the influence of slowing down the promoter flipping between Da and Di on the mRNA and protein probability distributions. We implemented both of the algorithms introduced in Section 5 in Python to find the stationary probability distribution functions for the system given by Eqs. (6.19), (6.21), and (6.22), with the parameter values tabulated in Tables 6.1 and 6.2. Since both algorithms rendered exactly the same results, we plot only one curve per parameter set in Fig. 6.2.
[Figure 6.2 plots probability versus mRNA count (panel A, counts 0–5) and probability versus protein count (panel B, counts 0–200).]

Figure 6.2 Stationary probability distribution functions for the mRNA (A) and protein (B) counts, corresponding to the system represented in Fig. 6.1, with the parameter values given in Tables 6.1 and 6.2. The plot colors correspond to the different $k_d^-$ values as follows: $k_d^- = 500$ min⁻¹, black; $k_d^- = 50$ min⁻¹, blue; $k_d^- = 5$ min⁻¹, red; $k_d^- = 0.5$ min⁻¹, green; $k_d^- = 0.05$ min⁻¹, magenta; and $k_d^- = 0.005$ min⁻¹, brown. Finally, the distributions plotted in yellow were calculated with the model proposed by Zeron and Santillán (2010), in which the promoter activation and deactivation processes are regarded as fast while an adiabatic approximation is made to simplify the model; see the main text for details.
Note that the distributions corresponding to $k_d^- = 500$ min⁻¹ and $k_d^- = 50$ min⁻¹ overlap, and both of them agree with the results from the model analyzed in Zeron and Santillán (2010), in which promoter activation and deactivation are regarded as fast processes. Nonetheless, as $k_d^-$ and $k_d^+$ decrease (keeping the ratio $K_D = k_d^- / k_d^+$ constant), the promoter activation and deactivation processes become slower, and both
the mRNA and the protein distribution functions change. The probability of finding three or more mRNA molecules increases, as does that of finding no molecules at all. In contrast, the probabilities of finding one or two mRNA molecules decrease as the promoter activation and deactivation processes slow down. The protein distribution functions widen as kd- and kd+ decrease: the probabilities of very low and very high protein counts increase as kd- and kd+ decrease. In particular, the protein probability distribution function becomes bimodal for very low kd- and kd+ values. This phenomenon can be explained by considering that the promoter still flips between the active and inactive states, but stays for long periods of time in each of them. When the promoter is active, many transcription events are initiated, and so the mRNA and protein counts increase greatly. When the promoter is inactive, no transcription is initiated for a long period of time, and so the mRNA and protein counts decrease because of degradation. The behavior described above is consistent with the phenomenon known as mRNA (or transcription) bursting (Golding et al., 2005). New experimental techniques allowing the investigation of gene expression at the single-molecule level have demonstrated that mRNA synthesis takes place in short periods of high activity, followed by long inactive intervals. Statistically, mRNA bursting widens the expected mRNA probability distribution function: the observed standard deviation is larger than expected, given the average mRNA count. mRNA bursting has been observed in both eukaryotic and prokaryotic cells. In eukaryotic cells, chromatin remodeling has been invoked as the cause of transcription bursting. However, no satisfactory explanation exists for this phenomenon in prokaryotic cells. In agreement with the possible causes speculated upon by Golding et al. (2005), our results suggest that mRNA bursting can be explained by very slow promoter activation and deactivation processes. The possible causes of this slowing down of the promoter flipping between the active and inactive states deserve further investigation. For instance, Choi et al. (2008) argue that, in the lac operon, it can be due to the simultaneous interaction of a repressor molecule with two separate operator sites along the DNA chain. It is thus possible that other instances of cooperativity, like that observed between two different repressors in the phage lambda switch (Dodd et al., 2005) or in the tryptophan operon (Grillo et al., 1999), can also lead to mRNA bursting. Finally, one interesting prediction of our results is that, when promoter activation and deactivation become too slow, the stationary probability distribution function for proteins becomes bimodal. Should it prove possible to find in nature, or to engineer, promoters showing this bimodal behavior, its biological implications would be worth studying.
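The slow-switching mechanism invoked above can be illustrated with a minimal two-state promoter model simulated with the Gillespie algorithm. This is a generic sketch, not the negative-feedback network analyzed in this chapter; the rate constants (kon, koff, km, gm) and the event-based sampling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(kon, koff, km=10.0, gm=1.0, T=2000.0):
    # Gillespie SSA for: promoter off <-> on (rates kon, koff),
    # transcription at rate km while on, mRNA degradation at rate gm*m.
    g, m, t = 0, 0, 0.0
    samples = []
    while t < T:
        a = np.array([kon * (1 - g), koff * g, km * g, gm * m])
        a0 = a.sum()
        t += rng.exponential(1.0 / a0)
        r = rng.choice(4, p=a / a0)
        if r == 0:
            g = 1            # promoter activation
        elif r == 1:
            g = 0            # promoter deactivation
        elif r == 2:
            m += 1           # transcription
        else:
            m -= 1           # mRNA degradation
        samples.append(m)
    return np.array(samples)

# Same mean promoter occupancy (kon = koff), very different switching speeds.
fast = simulate(kon=10.0, koff=10.0)
slow = simulate(kon=0.05, koff=0.05)
print(fast.std(), slow.std())   # slow switching gives the wider distribution
```

With these rates both regimes have the same mean mRNA level, so the widening associated with bursting shows up as a larger standard deviation in the slow-switching case.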
Numerical Solution of the Chemical Master Equation
7. Concluding Remarks

Summarizing, we have introduced in this work a couple of algorithms to find the stationary probability distribution for the CME of arbitrary chemical networks. Furthermore, we found the conditions that guarantee the algorithms' convergence and the uniqueness and (asymptotic) stability of the stationary distribution. The most popular approach employed in the study of so-called biochemical noise in chemical-reaction networks involves generating individual trajectories by means of the Monte Carlo method. However, this procedure becomes impractical for calculating stationary probability distributions, since a large number of individual simulations are required for that purpose. In that sense, the approach introduced here provides an efficient and more economical way to directly find the system's stationary probability distribution. We further employed the proposed algorithms to study the mRNA and protein probability distributions in a gene regulatory network subject to negative feedback regulation. In particular, we analyzed the influence of the promoter activation/deactivation speed on the shape of such distributions. We found that a reduction of the promoter activation/deactivation speed modifies the shape of those distributions in a way consistent with the phenomenon known as mRNA (or transcription) bursting. This phenomenon has so far only been experimentally observed in constitutively expressed prokaryotic genes, but our results predict that it should also be observed in gene networks with negative feedback regulation at the transcriptional level. One final prediction of our results is that, if the promoter switching between the active and the inactive states becomes too slow, the protein probability distribution may become bimodal.
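As a toy illustration of computing a stationary CME distribution directly rather than by Monte Carlo sampling (a generic uniformization/power-iteration sketch on a truncated state space, not the specific algorithms introduced in this chapter; the rates kt and gm and the truncation N are illustrative), consider a linear birth-death process for mRNA:

```python
import numpy as np

# Truncated CME generator for a birth-death mRNA process:
# dp_n/dt = kt*p_{n-1} + gm*(n+1)*p_{n+1} - (kt + gm*n)*p_n,  n = 0..N.
kt, gm, N = 5.0, 1.0, 60
A = np.zeros((N + 1, N + 1))
for n in range(N + 1):
    if n < N:
        A[n + 1, n] += kt        # synthesis, n -> n+1
        A[n, n] -= kt
    if n > 0:
        A[n - 1, n] += gm * n    # degradation, n -> n-1
        A[n, n] -= gm * n

# Uniformization: P = I + A/Lam is column-stochastic; its fixed point
# is the stationary distribution, reached by repeated application.
Lam = 1.1 * np.max(-np.diag(A))
P = np.eye(N + 1) + A / Lam
p = np.full(N + 1, 1.0 / (N + 1))
for _ in range(20000):
    p = P @ p
p /= p.sum()

mean = float(np.arange(N + 1) @ p)
print(mean)   # exact stationary law is Poisson(kt/gm), so mean -> 5
```

Because this example is a simple birth-death chain, the result can be checked against the exact Poisson stationary distribution; for a network with feedback the same matrix construction applies, but no closed form is available.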
REFERENCES

Amann, H. (1990). Ordinary Differential Equations: An Introduction to Nonlinear Analysis, Vol. 13. de Gruyter, Berlin.
Arkin, A., Ross, J., and McAdams, H. H. (1998). Stochastic kinetic analysis of developmental pathway bifurcation in phage lambda-infected Escherichia coli cells. Genetics 149(4), 1633–1648.
Bapat, R. B., and Raghavan, T. E. S. (1997). Nonnegative Matrices and Applications, Vol. 64. Cambridge University Press, Cambridge, UK.
Berman, A., and Plemmons, R. J. (1979). Nonnegative Matrices in the Mathematical Sciences. Computer Science and Applied Mathematics. Academic Press, New York.
Breitling, R. (2010). What is systems biology? Front. Syst. Biol. 1(9). 10.3389/fphys.2010.00009.
Brugnano, L., and Trigiante, D. (1998). Solving Differential Problems by Multistep Initial and Boundary Value Methods. Stability and Control: Theory, Methods and Applications, Vol. 6. Gordon and Breach Science Publishers, Amsterdam.
Burrage, K., Tian, T., and Burrage, P. (2004). A multi-scaled approach for simulating chemical reaction systems. Prog. Biophys. Mol. Biol. 85, 217–234. 10.1016/j.pbiomolbio.2004.01.014.
Cao, Y., Gillespie, D. T., and Petzold, L. R. (2005a). The slow-scale stochastic simulation algorithm. J. Chem. Phys. 122, 014116. 10.1063/1.1824902.
Cao, Y., Gillespie, D. T., and Petzold, L. R. (2005b). Multiscale stochastic simulation algorithm with stochastic partial equilibrium assumption for chemically reacting systems. J. Comput. Phys. 206, 395–411. 10.1016/j.jcp.2004.12.014.
Choi, P. J., Cai, L., Frieda, K., and Xie, X. S. (2008). A stochastic single-molecule event triggers phenotype switching of a bacterial cell. Science 322(5900), 442–446. 10.1126/science.1161427.
Dodd, I. B., Shearwin, K. E., and Egan, J. B. (2005). Revisited gene regulation in bacteriophage lambda. Curr. Opin. Genet. Dev. 15(2), 145–152. 10.1016/j.gde.2005.02.001.
Elaydi, S. (1999). An Introduction to Difference Equations. 2nd ed. Springer, New York.
Elowitz, M. B., Levine, A. J., Siggia, E. D., and Swain, P. S. (2002). Stochastic gene expression in a single cell. Science 297(5584), 1183–1186. 10.1126/science.1070919.
Gillespie, D. T. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comput. Phys. 22(4), 403–434. 10.1016/0021-9991(76)90041-3.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361. 10.1021/j100540a008.
Gillespie, D. (1992). A rigorous derivation of the chemical master equation. Physica A 188(1–3), 404–425.
Gillespie, D. (2000). The chemical Langevin equation. J. Chem. Phys. 113(1), 297–306.
Gillespie, D. (2002). The chemical Langevin and Fokker–Planck equations for the reversible isomerization reaction. J. Phys. Chem. 106(20), 5063–5071. 10.1021/jp0128832.
Golding, I., Paulsson, J., Zawilski, S. M., and Cox, E. C. (2005). Real-time kinetics of gene activity in individual bacteria. Cell 123(6), 1025–1036. 10.1016/j.cell.2005.09.031.
Golub, G. H., and Van Loan, C. F. (1996). Matrix Computations. 3rd ed. Johns Hopkins University Press, Baltimore.
Grillo, A. O., Brown, M. P., and Royer, C. A. (1999). Probing the physical basis for trp repressor-operator recognition. J. Mol. Biol. 287(3), 539–554. 10.1006/jmbi.1999.2625.
Groetsch, C. W., and King, J. T. (1988). Matrix Methods and Applications. Prentice Hall, Englewood Cliffs, NJ.
Higham, D. J. (2008). Modeling and simulating chemical reactions. SIAM Rev. 50(2), 347–368. 10.1137/060666457.
E, W., Liu, D., and Vanden-Eijnden, E. (2005). Nested stochastic simulation algorithm for chemical kinetic systems with disparate rates. J. Chem. Phys. 123(19), 194107.
Macnamara, S., Bersani, A. M., Burrage, K., and Sidje, R. B. (2008). Stochastic chemical kinetics and the total quasi-steady-state assumption: Application to the stochastic simulation algorithm and chemical master equation. J. Chem. Phys. 129(9), 095105. 10.1063/1.2971036.
Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia.
Munsky, B., and Khammash, M. (2006). The finite state projection algorithm for the solution of the chemical master equation. J. Chem. Phys. 124(4), 044104. 10.1063/1.2145882.
Munsky, B., and Khammash, M. (2007). A multiple time interval finite state projection algorithm for the solution to the chemical master equation. J. Comput. Phys. 226(1), 818–835. 10.1016/j.jcp.2007.05.016.
Paulsson, J. (2004). Summing up the noise in gene networks. Nature 427(6973), 415–418. 10.1038/nature02257.
Pedraza, J. M., and van Oudenaarden, A. (2005). Noise propagation in gene networks. Science 307(5717), 1965–1969. 10.1126/science.1109090.
Robeva, R. (2010). Systems biology: Old concepts, new science, new challenges. Front. Syst. Biol. 1. 10.3389/fpsyt.2010.00001.
Santillán, M., and Mackey, M. C. (2005). Dynamic behaviour of the B12 riboswitch. Phys. Biol. 2(1). 10.1088/1478-3967/2/1/004.
Seber, G. A. F. (2008). A Matrix Handbook for Statisticians. Wiley-Interscience, Hoboken, NJ.
Serre, D. (2002). Matrices: Theory and Applications, Vol. 216. Springer, New York.
Shahrezaei, V., and Swain, P. S. (2008). The stochastic nature of biochemical networks. Curr. Opin. Biotechnol. 19(4), 369–374. 10.1016/j.copbio.2008.06.011.
Simpson, M. L., Cox, C. D., Allen, M. S., McCollum, J. M., Dar, R. D., Karig, D. K., and Cooke, J. F. (2009). Noise in biological circuits. Wiley Interdiscip. Rev. Nanomed. Nanobiotechnol. 1(2), 214–225. 10.1002/wnan.22.
Swain, P. S., Elowitz, M. B., and Siggia, E. D. (2002). Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc. Natl. Acad. Sci. USA 99(20), 12795–12800. 10.1073/pnas.162041399.
Thattai, M., and van Oudenaarden, A. (2001). Intrinsic noise in gene regulatory networks. Proc. Natl. Acad. Sci. USA 98(15), 8614–8619. 10.1073/pnas.151588598.
Zeron, E. S., and Santillán, M. (2010). Distributions for negative-feedback-regulated stochastic gene expression: Dimension reduction and numerical solution of the chemical master equation. J. Theor. Biol. 264(2), 377–385. 10.1016/j.jtbi.2010.02.004.
Zhou, W., Peng, X., Yan, Z., and Wang, Y. (2008). Accelerated stochastic simulation algorithm for coupled chemical reactions with delays. Comput. Biol. Chem. 32, 240–242. 10.1016/j.compbiolchem.2008.03.007.
CHAPTER SEVEN

How Molecular Should Your Molecular Model Be? On the Level of Molecular Detail Required to Simulate Biological Networks in Systems and Synthetic Biology

Didier Gonze,* Wassim Abou-Jaoudé,† Djomangan Adama Ouattara,‡,§ and José Halloy¶

Contents
1. Introduction
2. Michaelis–Menten Kinetics Revisited
 2.1. Michaelis–Menten equation for an isolated reaction
 2.2. Embedded Michaelis–Menten kinetics
 2.3. Stochastic simulation of the Michaelis–Menten system
3. Use of the Hill Kinetics for Transcription Rate
4. Repressilator
 4.1. Original version: Hill-based model
 4.2. Developed version: Cooperative binding sites
 4.3. Deterministic simulation of the Repressilator
 4.4. Stochastic simulation of the Repressilator
5. Toggle Switch
 5.1. Original version: Hill-based model
 5.2. Developed version: Cooperative binding sites
 5.3. Deterministic simulation of the Toggle Switch model
 5.4. Stochastic simulation of the Toggle Switch model
6. Discussion
 6.1. Qualitative models are useful to understand dynamical properties of gene regulatory networks
 6.2. Michaelis–Menten and Hill assumptions are often reasonable
* Laboratoire de Bioinformatique des Génomes et des Réseaux, Université Libre de Bruxelles, Bruxelles, Belgium
† INRIA Sophia-Antipolis, Sophia-Antipolis, France
‡ INERIS, Parc Technologique Alata, Verneuil-en-Halatte, France
§ UMR-CNRS 6600, Université de Technologie de Compiègne, France
¶ Service d'Ecologie Sociale, Université Libre de Bruxelles, Bruxelles, Belgium
Methods in Enzymology, Volume 487. ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87007-2.
© 2011 Elsevier Inc. All rights reserved.
 6.3. The level of details depends on the question asked
 6.4. The notion of elementary reaction steps in biology is elusive
 6.5. Stochastic simulations can be performed using compact kinetics
 6.6. Spatial effects are important and may imply complex kinetics
7. Conclusion
Acknowledgments
References
Abstract

The recent advance of genetic studies and the rapid accumulation of molecular data, together with the increasing performance of computers, have led researchers to design increasingly detailed mathematical models of biological systems. Many modeling approaches rely on ordinary differential equations (ODE) based on standard enzyme kinetics. Michaelis–Menten and Hill functions are indeed commonly used in dynamical models in systems and synthetic biology because they provide the nonlinearity needed to make the dynamics nontrivial (i.e., limit-cycle oscillations or multistability). For most of the systems modeled, the actual molecular mechanism is unknown, and the enzyme equations should be regarded as phenomenological. In this chapter, we discuss the validity and accuracy of these approximations. In particular, we focus on the validity of the Michaelis–Menten function for open systems and on the use of Hill kinetics to describe transcription rates of regulated genes. Our discussion is illustrated by numerical simulations of prototype systems, including the Repressilator (a genetic oscillator) and the Toggle Switch model (a bistable system). We systematically compare the results obtained with the compact version (based on Michaelis–Menten and Hill functions) and with the corresponding developed versions (based on "elementary" reaction steps and mass action laws). We also discuss the use of compact approaches to perform stochastic simulations (Gillespie algorithm). On the basis of these results, we argue that compact models are suitable for qualitatively modeling biological systems.
1. Introduction

The recent advance of genetic studies and the rapid accumulation of molecular data have led biologists to draw detailed pictures of complex regulatory networks, including the cell cycle, the circadian clock, signaling pathways, and developmental processes. These biological networks often involve many components (genes, proteins, and small molecules) coupled through multiple regulatory interactions. Recurrent motifs displayed by such gene regulatory networks are positive and negative feedback loops, which are necessary to provide systems with rich dynamical properties, like self-sustained oscillations and multistability (Thomas and D'Ari, 1990).
Understanding the dynamics of such large networks by sheer intuition is often difficult. To fully grasp the role of the multiple feedback loops in such complex networks, it is necessary to resort to mathematical modeling. These models tend to be molecular: they describe the time evolution of the concentrations of identified species and they capture as many molecular details as possible. Dynamical models for metabolic pathways and genetic regulatory networks proposed in systems and in synthetic biology are generally based on classical enzyme kinetics, including Michaelis–Menten and Hill rate functions (Alon, 2007; Cornish-Bowden, 1995; Kaern et al., 2003; Segel, 1976). These equations have been shown to be good approximations of more detailed reaction schemes under the assumption that some reactions are fast compared to others. A further motivation to use these functions in mathematical models is that they provide the necessary nonlinearity to make the dynamics nontrivial. Nonlinearity is indeed required to observe behaviors like limit-cycle oscillations or multistability (Goldbeter, 1996; Murray, 2003). The use of Michaelis–Menten and Hill rate equations raises several related questions. First, they have been derived for isolated enzyme reactions and describe accurately the asymptotic behavior of the system (Cornish-Bowden, 1995). We may wonder whether these kinetic equations are still valid if the enzymatic reaction is embedded in a larger reaction network in which the substrate and the product of the reaction are themselves involved in other biochemical reactions. Furthermore, in natural conditions, the systems do not remain constantly at steady state. They must respond adequately to transient perturbations or to oscillatory inputs. It is thus necessary to check whether the Michaelis–Menten and Hill descriptions, established for an isolated system at steady state, still hold for dynamical systems undergoing temporary or periodic changes in time. 
Second, Michaelis–Menten and Hill rate equations have been extensively studied in the context of enzymatic reactions, but they are also often used to describe processes like mRNA synthesis (gene transcription). Simple models of gene regulation show that Michaelis–Menten-like kinetics can indeed describe transcription rates (Keller, 1995). In reality, however, gene transcription involves elaborate molecular machineries, allowing the proper processing of transcription and translation (Lewin, 2010; Ptashne, 2004). Since the kinetic details of the multiple steps involved in these processes are usually not known, the question of the validity of the Michaelis–Menten and Hill approximations remains open. Classical enzyme kinetic functions can only be considered as phenomenological equations, hopefully capturing the essential dynamical properties of such processes, at least in a qualitative manner. The question of how to use this phenomenological approach is of interest for the modeling of any biomolecular network. Can the enzyme kinetic functions be used as such, or should they be developed
into a detailed reaction scheme in which each reaction is described by the mass action law? Clearly, the developed version of these functions would more explicitly show the assumptions underlying the compact version (i.e., constraints on the parameter values). Moreover, developed models would also allow working in a region of parameter space where the underlying assumptions are lifted. Nevertheless, given that molecular details are often not available, we may question the need for such a decomposition. Third, due to the low number of macromolecules in cellular processes, the dynamics of biological systems is likely to be affected by molecular noise (Barkai and Leibler, 2000; Elowitz et al., 2002; McAdams and Arkin, 1997, 1999; Raj and van Oudenaarden, 2008; Raser and O’Shea, 2005). Stochastic equations are then necessary to assess the potentially important effect of such molecular noise. A common approach to simulate stochastic systems is the Gillespie algorithm (Gillespie, 1977). In this approach, a propensity is associated with each reaction step, and the reactions occur probabilistically according to their propensities (see Appendix). This approach is considered to be valid for elementary reaction steps (Gardiner, 2004; van Kampen, 2007). However, several stochastic models compute the reaction propensities with the Michaelis–Menten or Hill functions (Gonze et al., 2003; Kraus et al., 1992; Ouattara et al., 2010; Song et al., 2007). The question remains whether the compact form of kinetic equations or the developed version is the more appropriate choice for the system at hand. Can we use compact, nonlinear functions as propensities of the reactions? Do such compact stochastic models give the same results as their developed version? In the present chapter, we address these questions by systematically comparing the compact and developed versions of various models, and analyzing their deterministic and stochastic behaviors. 
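To make the "compact propensity" idea concrete, here is a minimal Gillespie simulation in which the conversion step uses a Michaelis–Menten propensity directly. This is a sketch only: the open scheme (inflow, conversion, outflow), the identification of molecule counts with concentrations (unit volume), and all parameter values are illustrative assumptions, not a model taken from the works cited above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Reactions: inflow -> S (propensity vs), S -> P with Michaelis-Menten
# propensity vm*S/(KM + S) (the "compact" choice), and outflow P ->
# (propensity vd*P). Counts stand in for concentrations (unit volume).
vs, vm, KM, vd = 10.0, 100.0, 200.0, 0.1
S, P, t = 0, 0, 0.0
while t < 100.0:
    a = np.array([vs, vm * S / (KM + S), vd * P])
    a0 = a.sum()
    t += rng.exponential(1.0 / a0)      # waiting time to next reaction
    r = rng.choice(3, p=a / a0)         # which reaction fires
    if r == 0:
        S += 1
    elif r == 1:
        S, P = S - 1, P + 1
    else:
        P -= 1
print(S, P)
```

At stationarity the inflow must balance the conversion and the outflow, so S fluctuates around KM·vs/(vm − vs) and P around vs/vd; whether such compact propensities are legitimate is precisely the question examined in the chapter.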
We first recall the derivation of the Michaelis–Menten equation for an isolated system, highlighting the underlying hypotheses. We then slightly modify this system to examine the conditions of validity of the Michaelis–Menten equation when the system is open (i.e., when the substrate and product are themselves involved in other reactions). We will also examine the cases where the synthesis rate of the substrate undergoes temporary or periodic changes in time. In the second part, we briefly discuss one possible derivation of the Hill function for the transcription rate. We then compare the compact and developed versions of the Repressilator (Elowitz and Leibler, 2000) and the Toggle Switch (Gardner et al., 2000), two prototypical genetic models, displaying oscillations and bistability, respectively. For all these models, we compare the compact and developed versions using both deterministic and stochastic simulations. Note that these models should not be regarded as models for precise molecular systems, but rather as generic and representative models.
Many papers already discuss in detail some of the aspects mentioned above. In the present paper, our goal is not to present a detailed theoretical study of enzyme and gene kinetics, but rather to recall and highlight some hypotheses implicitly made when using these standard kinetics in modeling biological systems. In the discussion, we also give some reasons why we believe that models based on Michaelis–Menten and Hill functions are a good approach to analyze the deterministic and stochastic dynamics of biological systems.
2. Michaelis–Menten Kinetics Revisited

We start with the analysis of the Michaelis–Menten function commonly used to describe the kinetics of enzymatic reactions (Cornish-Bowden, 1995; Segel, 1976, 1988). Enzymatic reactions characterize not only most metabolic reactions but also posttranslational modifications such as phosphorylation–dephosphorylation of proteins by kinases and phosphatases. By regulating the activity of proteins, these transformations can play crucial roles in the dynamics of genetic networks. In the mammalian circadian clock, for example, the Clock protein needs to be phosphorylated to enter the nucleus, where it acts as a transcriptional inhibitor (Leloup, 2009; Tamanini et al., 2005). Phosphorylation (among other posttranslational processes) affects both the activity and the stability of the p53 protein, a key regulator of the cell cycle and apoptosis (Brooks and Gu, 2003). When modeling metabolic pathways (Brooks and Storey, 1992; Pettersson, 1993) or genetic networks such as circadian clocks (Goldbeter, 1995; Leloup and Goldbeter, 2003) or the p53–Mdm2 network (Abou-Jaoudé et al., 2009), Michaelis–Menten functions are often taken by default as standard kinetic equations. It should be stressed, however, that in these systems the precise molecular mechanism of the reactions is often not known, and these equations should be regarded as a phenomenological description of saturation kinetics. The use of such equations implies hypotheses about the molecular mechanism. These hypotheses are rarely discussed, nor shown to be satisfied in the systems under study. We briefly recall here the derivation of the Michaelis–Menten equation for an isolated enzymatic reaction. We then discuss its validity in open systems, that is, systems in which the levels of substrate and product are dynamically regulated by "external" reactions.
To this end, for each model, we compare the compact model (based on Michaelis–Menten equations) and its corresponding developed model (i.e., detailed molecular mechanism). We also compare the stochastic time series generated by the stochastic simulation of the two versions of the model.
2.1. Michaelis–Menten equation for an isolated reaction

We assume that an enzyme E catalyzes the transformation of a substrate S into a product P (where S and P represent the dephosphorylated and phosphorylated forms of a protein and E a kinase):

$$S \xrightarrow{E} P \qquad (7.1)$$

For an isolated system, the rate of production of P is equal to the rate of consumption of S, and the time evolution of the concentrations S and P is thus often written as

$$v = \frac{dP}{dt} = -\frac{dS}{dt} = v_m \frac{S}{K_M + S}, \qquad (7.2)$$
where $v_m$ is the maximum reaction rate (reached when S is very large) and $K_M$ is the Michaelis constant. Note that in this compact formulation, the concentration of the enzyme, E, does not appear explicitly. The classical reaction scheme proposed to describe the molecular mechanism by which the enzyme E catalyzes the conversion of S into P is the following:

$$E + S \underset{k_2}{\overset{k_1}{\rightleftharpoons}} C \xrightarrow{k_3} E + P \qquad (7.3)$$

The evolution equations for the concentrations of the different species follow the mass action law:

$$\begin{aligned}
\frac{dS}{dt} &= -k_1 E S + k_2 C,\\
\frac{dE}{dt} &= -\frac{dC}{dt} = -k_1 E S + k_2 C + k_3 C,\\
\frac{dP}{dt} &= v = k_3 C.
\end{aligned} \qquad (7.4)$$
The quasi-steady-state assumption (QSSA) stipulates that $dC/dt = dE/dt = 0$ (Briggs and Haldane, 1925; Cornish-Bowden, 1995). This assumption derives from singular perturbation theory (Bowen et al., 1962; Segel and Slemrod, 1989), which shows how the different time scales that characterize different processes can be exploited to reduce the dimensionality (i.e., the number of variables) of a model. Under the QSSA, assuming that the total enzyme concentration is constant, $E_{tot} = E + C$, and much smaller than the substrate concentration S (allowing one to consider $S = S_0 + C \approx S_0$, where $S_0$ is the concentration of the free substrate), it can be shown that the rate of synthesis of the product is well described by Eq. (7.2) provided that

$$v_m = k_3 E_{tot}, \qquad K_M = \frac{k_2 + k_3}{k_1}, \qquad E_{tot} \ll S. \qquad (7.5)$$
In fact, the derivation of the Michaelian function from the QSSA is due to Briggs and Haldane (Briggs and Haldane, 1925; Cornish-Bowden, 1995) and slightly differs from the original derivation by Michaelis and Menten, which was based on the rapid equilibrium between E + S and C (i.e., $k_1, k_2 \gg k_3$). The latter leads to the same general equation, with $K_M$ replaced by $K_S = k_2/k_1 \approx K_M$ (Michaelis and Menten, 1913). Since the precise molecular mechanism of the biochemical reactions modeled in systems biology is rarely known, this distinction is almost never made. Another point to notice is that when Eq. (7.2) is used in a large model, S implicitly represents the total concentration of the substrate. However, when Eq. (7.2) is derived from Eq. (7.4), the quantity S appearing in the denominator is the free form of the substrate, the total substrate concentration being S + C. If the condition $S \gg E_{tot}$ holds, then the fraction of substrate bound to the enzyme is small and the approximation is valid. Different extensions of the Michaelis–Menten equation to the case of high enzyme concentration have been proposed by other authors (Cha, 1970; Schnell and Maini, 2000). Variants of the QSSA, namely those based on the so-called total quasi-steady-state assumption (tQSSA), were also shown to describe more accurately systems in which the levels of substrate and enzyme are within the same range (Borghans et al., 1996; Ciliberto et al., 2007; Macnamara et al., 2008). In Fig. 7.1, we compare the time evolution of S and P for the compact model (Eq. (7.2), dashed curves) and the developed model (Eq. (7.4), solid curves) for the cases where the conditions (Eq. (7.5)) are satisfied (Etot low, panels A and B) or not (Etot high, panels C and D). In both cases, parameter values have been chosen such that $v_m = k_3 E_{tot}$ and $K_M = (k_2 + k_3)/k_1$. Clearly, in the latter case, the compact model is not a good approximation of the developed reaction scheme.
In the developed model, when Etot is high, a large quantity of the substrate rapidly binds the enzyme to form the complex C, and the catalytic constant k3 is small so that the product is released in a slower fashion. The initial reaction rate is thus small because the complex C rather tends to dissociate into S and E than to produce the product P (k2 high, k3 small) (panel D, solid curve) Conversely, in the compact model, as soon as the substrate S is present, it is converted into the product P at a high rate (panel D, dashed curve).
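The comparison of Fig. 7.1A and B can be sketched numerically by integrating both versions side by side. This is a simple fixed-step Euler integration, not the authors' code; the step size and the use of the Fig. 7.1A-B parameter set are illustrative choices.

```python
# Euler integration of the compact model (Eq. (7.2)) and the developed
# mass-action model (Eq. (7.4)), with parameters satisfying
# vm = k3*Etot and KM = (k2 + k3)/k1 and Etot << S (Fig. 7.1A-B regime).
vm, KM = 1.0, 2.0
k1, k2, k3, Etot = 55.0, 100.0, 10.0, 0.1

dt, T = 1e-4, 15.0
S_c, P_c = 1.0, 0.0            # compact model state
S, C, P = 1.0, 0.0, 0.0        # developed model state; free enzyme E = Etot - C
for _ in range(int(T / dt)):
    v = vm * S_c / (KM + S_c)
    S_c, P_c = S_c - v * dt, P_c + v * dt
    E = Etot - C
    dS = -k1 * E * S + k2 * C
    dC = k1 * E * S - (k2 + k3) * C
    S, C, P = S + dS * dt, C + dC * dt, P + k3 * C * dt
print(P_c, P)   # nearly identical in this regime (curves superimposed)
```

In the high-enzyme regime of Fig. 7.1C and D (k1 = 0.55, k2 = 1, k3 = 0.1, Etot = 10), the same loop reproduces the discrepancy discussed above.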
[Figure 7.1] Michaelis–Menten kinetics for an isolated reaction: comparison of the kinetics obtained with the compact model (Eq. (7.2), dashed curves) and with the corresponding developed model (Eq. (7.4), solid curves). Parameter values: (A, B) vm = 1 nM min⁻¹, KM = 2 nM, k1 = 55 nM⁻¹ min⁻¹, k2 = 100 min⁻¹, k3 = 10 min⁻¹, and Etot = 0.1 nM. (C, D) vm = 1 nM min⁻¹, KM = 2 nM, k1 = 0.55 nM⁻¹ min⁻¹, k2 = 1 min⁻¹, k3 = 0.1 min⁻¹, and Etot = 10 nM. (A) and (C) show the time evolution of the concentrations of the substrate, S, and of the product, P. (B) and (D) plot the product production rate, v = dP/dt. Note that in (A) and (B) a perfect agreement between the two descriptions is obtained: the dashed and solid curves are superimposed. The units of the parameters here and in the subsequent figures are arbitrary.
2.2. Embedded Michaelis–Menten kinetics

Unless studied in vitro, a biochemical reaction such as Eq. (7.1) is not expected to take place in a cell without being connected to other reactions. In vivo, enzymatic reaction (7.1) is likely embedded in a larger biochemical network (metabolic pathway, signaling cascade, circadian clock, etc.). In such systems, the substrate S is generally the product of an upstream reaction, and the product P serves as a substrate for downstream reactions. Under these conditions, since the substrate is continuously supplied, the system can evolve
towards a nontrivial steady state. In the present section, we modify the previous model by incorporating an inflow of the substrate S and an outflow of the product P. We first check whether, in such an open system, the compact version (based on the Michaelis–Menten function) is still in agreement with the corresponding developed reaction scheme. In particular, since the substrate S varies with time (and can reach low levels, possibly smaller than the enzyme concentration), it is important to check the validity of the Michaelis–Menten equation when the concentration of the enzyme is not smaller than the concentration of the substrate. We also perform stochastic simulations using the compact and the developed models and compare both descriptions. We modified model (7.1) by considering that the substrate S is produced at some rate vs and that the product P is consumed in another reaction or degraded, and thus disappears at a rate vd (which is assumed here to be constant):

$$\xrightarrow{v_s} S, \qquad S \xrightarrow{[E;\ v_m,\ K_M]} P, \qquad P \xrightarrow{v_d} \qquad (7.6)$$

Assuming that the QSSA holds, the time evolution of S and P is governed by the following evolution equations:

$$\frac{dS}{dt} = v_s - v_m \frac{S}{K_M + S}, \qquad \frac{dP}{dt} = v_m \frac{S}{K_M + S} - v_d P. \qquad (7.7)$$

Decomposing the enzymatic reaction into elementary steps according to Eq. (7.3), we obtain the following reaction scheme:

$$\xrightarrow{v_s} S, \qquad E + S \underset{k_2}{\overset{k_1}{\rightleftharpoons}} C \xrightarrow{k_3} E + P, \qquad P \xrightarrow{v_d} \qquad (7.8)$$

For this developed version of the model, the time evolution of the species follows the mass action law:

$$\begin{aligned}
\frac{dS}{dt} &= v_s - k_1 E S + k_2 C,\\
\frac{dE}{dt} &= -\frac{dC}{dt} = -k_1 E S + k_2 C + k_3 C,\\
\frac{dP}{dt} &= k_3 C - v_d P.
\end{aligned} \qquad (7.9)$$
In Fig. 7.2, we compare the time evolution of S obtained with the compact model (Eq. (7.7), dashed curves) and with the developed model (Eq. (7.9), solid curves). Panels A and D display the time evolution of S and P. In panel A, the total enzyme concentration, Etot, is five times larger than the concentration of the substrate at steady state, while in panel D, the concentration of the enzyme is about 50 times larger than the concentration of the substrate. In these simulations, we again imposed the constraints vm = k3 Etot and KM = (k2 + k3)/k1 (cf. Eq. (7.5)). In both cases, the steady state reached in the developed version is the same as the steady state reached in the compact model, but the transients are somewhat longer when the enzyme concentration is high (panel D). In the developed version of the model, the additional steps create a delay in the response of the system. This delay is smaller, and the match between the two descriptions better, if k1, k2, and k3 are increased, provided that vm = k3 Etot and KM = (k2 + k3)/k1 (panel A). The condition that the enzyme concentration is smaller than the substrate concentration is thus not required. If we are interested in the asymptotic properties of the system (i.e., at steady state), this delay effect has no impact. In natural conditions, however, most biological systems do not operate at steady state. They must respond adequately to external stimuli (e.g., the biosynthesis of amino acids must be regulated depending on the requirement and the availability of amino acids) or to periodic forcing (biological rhythms regulate numerous output pathways and signaling cascades). In panels B and E, we study the effect of a transient synthesis of the substrate (i.e., a pulse of vs). As expected from the above observations, the detailed model does not respond as fast as the compact model.
Consequently, the peaks reached by S and P in the developed version of the model are slightly delayed and of smaller amplitude than in the compact version. This delay is very small if the enzyme concentration is low (panel B), but becomes important when the enzyme concentration is high (panel E). In panels C and F, we study the effect of a periodic forcing of the synthesis rate of the substrate S:

$$v_s(t) = \frac{A}{2}\left(\sin\frac{2\pi t}{\tau} + 1\right), \qquad (7.10)$$

where A and τ are the amplitude and period of the oscillations, respectively. The amplitudes of the oscillations observed in the concentration of the substrate, S, and of the product, P, are slightly reduced in the developed model compared to the compact model. This reduction is more significant when the enzyme concentration is high (panel F). We also noticed that increasing A leads to a better agreement between the two versions, while changing τ has little effect (not shown).
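The agreement between the two descriptions at steady state can be checked with a short numerical integration. The sketch below, assuming the parameter values quoted for Fig. 7.2A (which satisfy vm = k3Etot and KM = (k2 + k3)/k1), integrates both systems with SciPy and compares their asymptotic states:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Parameter values quoted for Fig. 7.2A (assumed here); they satisfy
# vm = k3*Etot and KM = (k2 + k3)/k1
vs, vd = 0.1, 0.1
vm, KM = 1.0, 2.0
k1, k2, k3, Etot = 5.5, 10.0, 1.0, 1.0

def compact(t, y):
    S, P = y
    v = vm * S / (KM + S)          # Michaelis-Menten rate
    return [vs - v, v - vd * P]

def developed(t, y):
    S, C, P = y
    E = Etot - C                   # free enzyme, from E + C = Etot
    return [vs - k1 * E * S + k2 * C,
            k1 * E * S - (k2 + k3) * C,
            k3 * C - vd * P]

sol_c = solve_ivp(compact, (0.0, 300.0), [0.0, 0.0], rtol=1e-8, atol=1e-10)
sol_d = solve_ivp(developed, (0.0, 300.0), [0.0, 0.0, 0.0], rtol=1e-8, atol=1e-10)

S_c, P_c = sol_c.y[0, -1], sol_c.y[1, -1]
S_d, P_d = sol_d.y[0, -1], sol_d.y[2, -1]
print(S_c, S_d, P_c, P_d)
```

With these values both versions settle at S* = vsKM/(vm − vs) = 2/9 nM and P* = vs/vd = 1 nM; only the transients differ.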
How Molecular Should Your Molecular Model Be?
[Figure 7.2 appears here: six panels (A–F) showing the concentration of S and P versus time.]
Figure 7.2 Deterministic simulation of the simple, embedded, one-reaction Michaelis–Menten-based model. (A, D) Evolution of S and P toward the steady state obtained with the compact model (Eq. (7.7), dashed curves) and with the developed model (Eq. (7.9), solid curves), starting with S(0) = P(0) = 0 nM. (B, E) Response of the system to a pulse of vs: vs increases from 0 to 0.1 nM min⁻¹ between t = 5 and t = 10 min. (C, F) Entrainment of the system by a periodic forcing of vs: vs undergoes sine oscillations, as defined by Eq. (7.10) with A = 1 and τ = 15. Parameter values: (A–C) vm = 1 nM min⁻¹, KM = 2 nM, k1 = 5.5 nM⁻¹ min⁻¹, k2 = 10 min⁻¹, k3 = 1 min⁻¹, Etot = 1 nM, vs = 0.1 nM min⁻¹, and vd = 0.1 min⁻¹. (D–F) Idem except k1 = 0.55 nM⁻¹ min⁻¹, k2 = 1 min⁻¹, k3 = 0.1 min⁻¹, and Etot = 10 nM.
In all the above cases, a better agreement between the two descriptions can easily be achieved by increasing the rate constants k1, k2, and k3 (and decreasing Etot accordingly to ensure that vm = k3Etot). We also noted that the condition that the enzyme concentration be smaller than the substrate concentration is not absolutely required to obtain a qualitative agreement between the developed versions and the Michaelis–Menten-based models. A quantitative agreement is obtained when the kinetic constants k1, k2, and k3 are sufficiently large (which also implies that Etot is small). A more detailed and quantitative analysis of the validity of the Michaelis–Menten approximation for open systems was published by Stoleriu et al. (2004, 2005), and their conclusions agree with the present observations. More elaborate Michaelis–Menten-based models, including linear chains of reactions and models incorporating feedback, are currently under investigation.
2.3. Stochastic simulation of the Michaelis–Menten system

The second question was whether compact models can be used to perform stochastic simulations. More precisely, it is still a matter of debate whether the propensities of the reactions in the Gillespie algorithm (see Appendix) can be computed with nonlinear rate equations (e.g., a Michaelis–Menten function; Barik et al., 2008; Goutsias, 2005; Grima, 2009; Macnamara et al., 2008; Rao and Arkin, 2003). First, note that resorting to such a description reduces the number of variables as well as the number of reactions (and thus the number of propensities to be computed). In addition, if the rate constants k1, k2, and k3 are high, the corresponding reactions occur very often, which is CPU-consuming. For large biomolecular networks, this intensive computation may be prohibitive. It is therefore tempting to use compact models to perform stochastic simulations. Stochastic versions of the compact and of the developed models are given in Tables 7.1 and 7.2.

In Fig. 7.3, we present the results of the simulations of the stochastic versions of the models. Parameter values and initial conditions correspond to the deterministic situation shown in Fig. 7.2D. The system size was set to Ω = 100, which yields a number of enzyme molecules Etot = 1000, that is, 50 times larger than the number of substrate molecules at steady state. Panels A and C show the time series of S and P generated by the stochastic simulation of the compact and developed models, respectively. Panels B and D display the corresponding histograms of S and P at steady state. The means and standard deviations of both variables are similar for the two versions. The amplitude of the fluctuations is also similar in both models. The frequency of the fluctuations in S is higher in the developed model because of the fast S–E binding/unbinding.
Table 7.1 Stochastic version of the compact version of the simple Michaelis–Menten-based model

No. | Reaction | Propensity             | Parameter values
1   | → S      | w1 = vsΩ               | vs = 0.1 nM min⁻¹
2   | S → P    | w2 = vmΩ S/(KMΩ + S)   | vm = 1 nM min⁻¹, KM = 2 nM
3   | P →      | w3 = vdP               | vd = 0.1 min⁻¹

Default parameter values are given in the right column (parameter values are the same as in Fig. 7.2).
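Table 7.1 translates directly into a Gillespie simulation. The sketch below is a minimal implementation, assuming the parameter values of the table and a system size Ω = 100; the function name `gillespie_compact` is ours. The Michaelian propensity of reaction 2 is evaluated in molecule numbers, as in the table:

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameter values from Table 7.1; the system size Omega converts
# concentrations (nM) into numbers of molecules
vs, vm, KM, vd = 0.1, 1.0, 2.0, 0.1
Omega = 100

def gillespie_compact(t_end):
    """Gillespie simulation of the three reactions of Table 7.1."""
    t, S, P = 0.0, 0, 0
    while t < t_end:
        w1 = vs * Omega                         # -> S
        w2 = vm * Omega * S / (KM * Omega + S)  # S -> P, Michaelian propensity
        w3 = vd * P                             # P ->
        wtot = w1 + w2 + w3
        t += rng.exponential(1.0 / wtot)        # time to next reaction
        r = rng.uniform(0.0, wtot)              # which reaction fires
        if r < w1:
            S += 1
        elif r < w1 + w2:
            S -= 1
            P += 1
        else:
            P -= 1
    return S, P

S, P = gillespie_compact(500.0)
print(S, P)
```

The snapshot at t = 500 fluctuates around the deterministic values, roughly 22 molecules of S and 100 molecules of P at Ω = 100.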
Table 7.2 Stochastic version of the developed version of the simple Michaelis–Menten-based model

No. | Reaction   | Propensity      | Parameter values
1   | → S        | w1 = vsΩ        | vs = 0.1 nM min⁻¹
2   | S + E → C  | w2 = k1SE/Ω     | k1 = 0.55 nM⁻¹ min⁻¹
3   | C → S + E  | w3 = k2C        | k2 = 1 min⁻¹
4   | C → P      | w4 = k3C        | k3 = 0.1 min⁻¹
5   | P →        | w5 = vdP        | vd = 0.1 min⁻¹

The total enzyme concentration is Etot = E + C = 0.1Ω.
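For comparison, the five reactions of Table 7.2 can be simulated the same way; the enzyme is now explicit, and the conservation Etot = E + C holds at every step. A sketch, assuming the parameter values of Fig. 7.2D as used for Fig. 7.3 (Etot = 10 nM, i.e., 1000 molecules at Ω = 100):

```python
import numpy as np

rng = np.random.default_rng(2)

# Parameter values from Table 7.2 / Fig. 7.2D (assumed); with Omega = 100
# the enzyme pool Etot = 10 nM corresponds to 1000 molecules, as in Fig. 7.3
vs, k1, k2, k3, vd = 0.1, 0.55, 1.0, 0.1, 0.1
Omega = 100

t, S, E, C, P = 0.0, 0, 10 * Omega, 0, 0
Etot = E + C                       # conserved quantity
while t < 500.0:
    w = (vs * Omega,               # 1: -> S
         k1 * S * E / Omega,       # 2: S + E -> C (bimolecular, scaled by Omega)
         k2 * C,                   # 3: C -> S + E
         k3 * C,                   # 4: C -> P
         vd * P)                   # 5: P ->
    wtot = sum(w)
    t += rng.exponential(1.0 / wtot)
    r = rng.uniform(0.0, wtot)
    if r < w[0]:
        S += 1
    elif r < w[0] + w[1]:
        S -= 1; E -= 1; C += 1
    elif r < w[0] + w[1] + w[2]:
        S += 1; E += 1; C -= 1
    elif r < w[0] + w[1] + w[2] + w[3]:
        E += 1; C -= 1; P += 1
    else:
        P -= 1

print(S, P, E + C)   # E + C stays equal to Etot
```

All propensities are now linear or bimolecular; the price is roughly one binding/unbinding event per catalytic step, which is where the extra CPU cost of the developed description comes from.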
One goal of stochastic simulations is to assess the robustness of the behavior when the number of molecules is reduced. In the compact version of the model, the enzyme concentration is implicit and does not contribute to the stochasticity. In the developed version, the number of enzyme molecules appears explicitly, and one may wonder whether a low number of enzyme molecules affects the results of the stochastic simulations. To check whether the above conclusion still holds when E is small, we performed stochastic simulations with small Etot and correspondingly large k1, k2, and k3, so as to satisfy the constraints vm = k3Etot and KM = (k2 + k3)/k1 (Fig. 7.4). For Etot = 10 (panels A–C) and Etot = 1 (panels D–F), with a system size Ω = 100, the results are highly similar to those shown in Fig. 7.3. The very low level of enzyme does not lead to higher stochasticity because it is compensated by a very high turnover rate, leading to a barcode-like pattern of the enzyme level (the enzyme, present as a single molecule, frequently switches between the free and the complex form; panel E). A similar behavior was reported in a stochastic study of a model for circadian rhythms (Gonze et al., 2004). If the system size Ω is further decreased, as in Fig. 7.4G–I, the number of enzyme molecules that theoretically corresponds to the deterministic value becomes smaller than 1. Since the number of enzyme molecules must be an integer, we have two choices: either the number of enzyme molecules is set to 0 and no reaction occurs, or it is set to 1, which makes the product k3Etot larger than vm.
[Figure 7.3 appears here: time series and steady-state histograms of S and P. Statistics printed above the histograms: Mean(P) = 101.12, Std(P) = 7.98; Mean(M) = 21.6, Std(M) = 4.58; Mean(P) = 96.96, Std(P) = 8.25; Mean(M) = 23.7, Std(M) = 5.45.]
Figure 7.3 Stochastic simulation of the “embedded” Michaelis–Menten-based model. (A) Evolution of S and P for the compact model (obtained with the Gillespie algorithm for the model given in Table 7.1). (C) Evolution of S and P for the developed model (Table 7.2). (B, D) Histograms of S and P at the steady state for the corresponding time evolution. For each model, the mean and standard deviation are indicated above the histograms. Parameter values and initial conditions correspond to the deterministic values given in Fig. 7.2D (cf. Tables 7.1 and 7.2, with system size Ω = 100).
In the latter case (panels G–I), the discrepancy between the developed and the compact model is not due to the noise but to the fact that the hypothesis underlying the Michaelis–Menten equations (vm = k3Etot) is no longer satisfied. These simulations show that the compact and developed versions of the model give very similar results, provided that the conditions underlying the Michaelian kinetics are satisfied. A good agreement is observed even for very low enzyme numbers, as long as the low level is compensated by high kinetic rates. These observations thus justify the use of Michaelian functions as propensities of enzymatic biochemical reactions in stochastic simulations.
[Figure 7.4 appears here: nine panels (A–I) showing time series of S, P, and the free enzyme E, together with steady-state histograms of S and P.]
Figure 7.4 Stochastic simulation of the “embedded” Michaelis–Menten-based model. (A–C) Ω = 100, k1 = 55, k2 = 100, k3 = 10, Etot = 0.1Ω = 10; (D–F) Ω = 100, k1 = 550, k2 = 1000, k3 = 100, Etot = 0.01Ω = 1; and (G–I) Ω = 10, k1 = 550, k2 = 1000, k3 = 100, Etot = 0.1Ω = 1. Other parameter values are as in Table 7.2. Panels A, D, and G show the time evolution of S and P. Panels B, E, and H show the time evolution of the free form of the enzyme, E. Panels C, F, and I show the histograms of S and P at steady state.
3. Use of the Hill Kinetics for Transcription Rate

Hill functions are commonly used to describe the kinetics of enzymatic reactions in which an enzyme has several cooperative binding sites (Cornish-Bowden, 1995; Segel, 1976). Sigmoidal functions have also been applied to describe transcriptional regulation (Alon, 2007; Keller, 1995), and to model mRNA synthesis in circadian genetic clocks (Goldbeter, 1995; Leloup and Goldbeter, 2003), in the p53–Mdm2 network (Abou-Jaoudé et al., 2009), and in various synthetic genetic networks (Elowitz and Leibler, 2000; Gardner et al., 2000). Various molecular mechanisms have been invoked as possible explanations for the occurrence of such sigmoidal gene transcriptional activity. They include cooperativity of binding sites (Keller, 1995), multimer formation (Alon, 2007; Yang et al., 2007), competition between a repressor and an activator (Rossi et al., 2000), and DNA looping (Narang, 2007). The purpose of the present section is to present one possible molecular mechanism that leads to Hill kinetics for gene expression and to compare the compact and the corresponding developed version of this Hill kinetics. In the next sections, we will decompose the Repressilator (a minimal genetic oscillator; Elowitz and Leibler, 2000) and the Toggle Switch (a bistable genetic system; Gardner et al., 2000) according to this mechanism. We consider here the case of transcriptional inhibition. Let us assume that a gene G can be active (i.e., can be transcribed) or inactive. The transition from the active to the inactive form is induced by a repressor I. Only the gene in its active form can be transcribed to produce mRNA (M) at a rate vs.
$$G(\mathrm{active}) \overset{I}{\rightleftharpoons} G(\mathrm{inactive}), \qquad [G(\mathrm{active})] \xrightarrow{v_s} M \qquad (7.11)$$
The square brackets denote the fact that the gene G is not consumed in the reaction. In most models of biological systems, the gene is not considered explicitly. The rate of synthesis of mRNA is often modeled by the following sigmoidal function:

$$v = \frac{dM}{dt} = v_s \frac{K_I^n}{K_I^n + I^n}, \qquad (7.12)$$
where I is the concentration of the inhibitor. The parameter vs is the maximum transcription rate (reached in the absence of inhibitor), KI is referred to as the “inhibitory constant,” and n is the Hill coefficient, which quantifies the steepness of the function. The term K_I^n/(K_I^n + I^n) thus represents the fraction of active G. Here, we focus on one of the many possible mechanisms, the one based on cooperative binding of the repressor to several binding sites in the gene promoter. More specifically, we will assume that n molecules of inhibitor I cooperatively bind to the promoter of the gene and that only the free form of G is active. For the case of two cooperative binding sites, the molecular mechanism can be decomposed as follows:
$$\begin{aligned}
G + I &\underset{k_{d1}}{\overset{k_{a1}}{\rightleftharpoons}} GI,\\
GI + I &\underset{k_{d2}}{\overset{k_{a2}}{\rightleftharpoons}} GI_2,\\
[G] &\xrightarrow{v_s} M
\end{aligned} \qquad (7.13)$$

The kinetic equations for this reaction scheme are

$$\begin{aligned}
\frac{dG}{dt} &= -k_{a1}\, G\, I + k_{d1}\, GI,\\
\frac{d\,GI}{dt} &= k_{a1}\, G\, I - k_{d1}\, GI - k_{a2}\, GI\, I + k_{d2}\, GI_2,\\
\frac{d\,GI_2}{dt} &= k_{a2}\, GI\, I - k_{d2}\, GI_2,\\
\frac{dI}{dt} &= -k_{a1}\, G\, I + k_{d1}\, GI - k_{a2}\, GI\, I + k_{d2}\, GI_2.
\end{aligned} \qquad (7.14)$$

In these equations, I represents the concentration of the free inhibitor. G represents the concentration of free gene, while GI and GI2 are the concentrations of gene bound to one repressor and to two repressors, respectively (averaged over a large cell population). The variable G (expressed in concentration units) can be interpreted as the gene activity, or the probability of the gene being transcribed (Alon, 2007). We assume that the total gene “concentration” is constant:

$$G + GI + GI_2 = G_{tot} = \mathrm{constant}. \qquad (7.15)$$
Here, we consider the “extreme” case where only the free form, G, of the gene is transcribed (at a rate vs), whereas GI and GI2 are inactive (not transcribed). Of course, we could consider the more general case where GI has some transcriptional activity, but, as we will see later, in the case of high cooperativity this form is present in small amounts, so that its contribution to the transcription rate can be neglected. Thus, the transcription rate is given by

$$v = \frac{dM}{dt} = v_s G. \qquad (7.16)$$

Applying the quasi-steady-state approximation for G and I (implying that dG/dt = d GI/dt = d GI2/dt = dI/dt = 0), we can express G as a function of Gtot and find

$$v = \frac{dM}{dt} = v_s G_{tot} \frac{K_1 K_2}{I^2 + K_2 I + K_1 K_2}, \qquad (7.17)$$
where K1 = kd1/ka1 and K2 = kd2/ka2 are the dissociation constants. Assuming that Gtot = 1, the correspondence between Eqs. (7.17) and (7.12) is obtained if the term K2I can be neglected, that is, if K2 ≪ K1 (a rigorous development will be presented in a subsequent paper), by setting n = 2 and KI = √(K1K2). The relation K2 ≪ K1 defines cooperativity: the binding of j molecules to the promoter of the gene favors the binding of the (j + 1)th molecule. It can be shown that if there is no cooperativity (K2 ≫ K1), the term K2I is equal to or greater than I², and this leads to kinetics of the type described by Eq. (7.12) with n = 1. In practice, the cooperativity is never maximal, so that 1 ≤ n ≤ number of binding sites. Saying that the Hill coefficient n is equal to the number of binding sites thus implicitly assumes a very high cooperativity. It should also be stressed that the variable I in the developed model (7.14) represents the concentration of the free inhibitor, whereas the variable I in the compact model (7.12) is often implicitly taken as the total concentration of inhibitor. Thus, a good agreement between the two descriptions is obtained when the concentration of inhibitor is large enough (compared to Gtot) that the fraction of inhibitor bound to the gene promoter is negligible.

In Fig. 7.5, we compare the Hill function defined by Eq. (7.12) for the case n = 2 with the developed model described by Eq. (7.14). In panel A, we plotted the Hill kinetics as a function of the inhibitor I for the compact model (7.12) (thin dashed line) as well as the steady regime of v = vsG obtained by numerical integration of Eq. (7.14) (black dots). Parameters have been chosen to account for a high cooperativity (K1 = 10 and K2 = 0.1), so that a good agreement between the two versions is obtained. The main difference between the two situations pertains to the transients, which appear in the developed model (Fig. 7.5B, solid lines).
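The closeness of Eqs. (7.12) and (7.17) in the high-cooperativity regime is easy to verify numerically. The sketch below, assuming the Fig. 7.5 values K1 = 10 and K2 = 0.1 (so that K2 ≪ K1 and KI = √(K1K2) = 1), evaluates both expressions over a range of inhibitor concentrations:

```python
import numpy as np

# High-cooperativity dissociation constants, as in Fig. 7.5 (assumed values)
vs, Gtot = 1.0, 1.0
K1, K2 = 10.0, 0.1
n = 2
KI = np.sqrt(K1 * K2)   # correspondence rule: KI = sqrt(K1*K2) = 1

I = np.linspace(0.0, 5.0, 501)
v_hill = vs * KI**n / (KI**n + I**n)                      # compact, Eq. (7.12)
v_qssa = vs * Gtot * K1 * K2 / (I**2 + K2 * I + K1 * K2)  # developed, Eq. (7.17)

max_diff = np.max(np.abs(v_hill - v_qssa))
print(max_diff)   # small: the K2*I term is negligible when K2 << K1
```

The residual discrepancy comes entirely from the neglected K2I term and shrinks further as the ratio K2/K1 decreases.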
A consequence of these transients is that the system does not adapt instantaneously to a pulse of inhibitor I (Fig. 7.5C). This delay effect also affects the response of the system to an external periodic forcing. Let us now assume that the synthesis of I is periodic. The time evolution of I is described by

$$\frac{dI}{dt} = k_s - k_r I \qquad (7.18)$$

with

$$k_s = \frac{A}{2}\left(\sin\frac{2\pi t}{\tau} + 1\right). \qquad (7.19)$$

In the developed model, the equation for dI/dt in Eq. (7.14) is then replaced by

$$\frac{dI}{dt} = k_s - k_r I - k_{a1}\, G\, I + k_{d1}\, GI - k_{a2}\, GI\, I + k_{d2}\, GI_2. \qquad (7.20)$$
[Figure 7.5 appears here: four panels (A–D) showing the transcription rate v as a function of the inhibitor concentration (A) and as a function of time (B–D).]
Figure 7.5 Comparison of the Hill kinetics and the corresponding developed molecular mechanism based on cooperative binding sites. (A) Transcription rate v for the compact model (Eq. (7.12), thin dashed curve) and for the developed model (numerical integration of Eq. (7.14), dots) as a function of the concentration of the inhibitor I. (B) Transcription rate v (thin dashed lines, Eq. (7.12)) or time evolution of v (solid curves, Eq. (7.14)) for different concentrations of the inhibitor I (from top to bottom: I = 0.5, 1, 1.5, 2, 2.5, and 3 nM). (C) Variation of the rate v for the compact model (thin dashed curve) and for the developed model (solid curve) observed when the concentration of inhibitor I transiently increases from 0 to 3 nM between t = 5 and t = 12 min. (D) Oscillations of v when the synthesis of I undergoes oscillations. Parameter values: vs = 1 nM min⁻¹, KI = 1 nM, n = 2, ka1 = 1 min⁻¹ nM⁻¹, kd1 = 10 min⁻¹, ka2 = 10 min⁻¹ nM⁻¹, and kd2 = 1 min⁻¹. In panel D, the evolution equation for I is given by Eqs. (7.19) and (7.20) with A = 3, τ = 10, and kr = 1.
The results presented in Fig. 7.5D show that the oscillations of v obtained with the developed model have a slightly reduced amplitude compared to the oscillations obtained with the compact model. As for the Michaelis–Menten kinetics (see previous section), increasing the binding and unbinding rate constants ka1, kd1, ka2, and kd2 (while respecting the constraint KI = √(K1K2)) reduces the transients, hence leading to a better agreement between the compact and the developed version of the model (not shown).

A good correspondence between the developed and the compact versions requires several assumptions and constraints on the parameter values. Namely, we need cooperativity, a sufficiently high level of inhibitor, and high binding/unbinding constants. The latter condition is necessary to reduce the transients (not shown). The molecular details being often unknown, it is usually impossible to check whether these requirements are satisfied. The Hill function should thus be regarded as a phenomenological equation, which is probably realistic in many situations. Our conclusions can be extended to systems with a higher Hill coefficient (n > 2) if we consider multiple cooperative binding sites. Alternative molecular schemes, for example, based on inhibitor dimer (or multimer) formation, can also lead to the occurrence of Hill kinetics at the level of transcription (Alon, 2007).
4. Repressilator

The Repressilator consists of three genes, cyclically regulated in such a way that each gene codes for a repressor protein which inhibits the transcription of the following gene in the cycle (Elowitz and Leibler, 2000) (Fig. 7.6). A theoretical model was used to design and to guide the experimental construction of a synthetic gene circuit exhibiting self-sustained oscillations in Escherichia coli (Elowitz and Leibler, 2000). In the present section, we develop the Repressilator model into “elementary” reaction steps, according to the molecular mechanism described in the previous section. We then compare the deterministic and stochastic dynamics of the original and developed versions of the model.
4.1. Original version: Hill-based model

The original model of the Repressilator comprises six evolution equations governing the time evolution of each mRNA and protein level (Elowitz and Leibler, 2000). In the original formulation of the model, variables and time were rendered dimensionless. For our present purpose, we have rewritten the equations in terms of concentrations:

$$\begin{aligned}
\frac{dM_1}{dt} &= a\,\frac{K^n}{K^n + P_3^n} - d\,M_1,\\
\frac{dM_2}{dt} &= a\,\frac{K^n}{K^n + P_1^n} - d\,M_2,\\
\frac{dM_3}{dt} &= a\,\frac{K^n}{K^n + P_2^n} - d\,M_3,\\
\frac{dP_1}{dt} &= b\,M_1 - g\,P_1,\\
\frac{dP_2}{dt} &= b\,M_2 - g\,P_2,\\
\frac{dP_3}{dt} &= b\,M_3 - g\,P_3.
\end{aligned} \qquad (7.21)$$

[Figure 7.6 appears here: the three genes G1, G2, G3 with their mRNAs M1, M2, M3 and proteins P1, P2, P3 arranged in a repression cycle.]

Figure 7.6 Scheme of the Repressilator.

The variables Mi and Pi (with i = 1, 2, 3) represent the concentrations of mRNA and protein of the three components of the Repressilator. The parameters a, d, b, and g are the maximum transcription rate, mRNA degradation rate, protein synthesis rate, and protein degradation rate, respectively. The repression of transcription is described by Hill kinetics with an inhibition threshold K and a cooperativity degree n. For simplicity, each kinetic parameter was set to an identical value for each gene and protein. Elowitz and Leibler (2000) showed that such a symmetry favors the occurrence of oscillations. The stochastic version of this model (given in Table 7.3) consists of 12 reaction steps. Here, we computed the propensities of mRNA synthesis with the Hill kinetics.
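The limit-cycle oscillations of the compact model can be reproduced directly from Eq. (7.21). The sketch below, assuming the parameter values quoted for Fig. 7.7A (a = 40, d = 1, b = 5, g = 5, K = 1, n = 2), integrates the six equations and estimates the oscillation period from the peaks of M1:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Parameter values quoted for Fig. 7.7A (assumed here)
a, d, b, g, K, n = 40.0, 1.0, 5.0, 5.0, 1.0, 2

def repressilator(t, y):
    M1, M2, M3, P1, P2, P3 = y
    def hill(P):
        return a * K**n / (K**n + P**n)   # repressive Hill kinetics
    return [hill(P3) - d * M1,
            hill(P1) - d * M2,
            hill(P2) - d * M3,
            b * M1 - g * P1,
            b * M2 - g * P2,
            b * M3 - g * P3]

# Slightly asymmetric initial condition to leave the unstable symmetric state
y0 = [1.0, 1.2, 1.4, 1.0, 1.0, 1.0]
t_eval = np.linspace(100.0, 200.0, 10001)  # discard the transient (t < 100)
sol = solve_ivp(repressilator, (0.0, 200.0), y0, t_eval=t_eval,
                rtol=1e-8, atol=1e-10)

M1 = sol.y[0]
peaks = [i for i in range(1, len(M1) - 1)
         if M1[i] > M1[i - 1] and M1[i] >= M1[i + 1]]
period = float(np.mean(np.diff(t_eval[peaks])))
amplitude = float(M1.max() - M1.min())
print(period, amplitude)
```

The estimated period should be close to the value of about 7.3 min reported in the text for these parameters.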
4.2. Developed version: Cooperative binding sites

As shown in the previous section, the Hill functions used in Eq. (7.21) can be obtained, for example, if we assume that two binding sites are present in the promoter of each gene and that the binding of a repressor molecule is cooperative. We further assume that only the free gene can be transcribed, at a maximum rate a, while the gene bound to one or two repressor proteins is not transcribed. A similar assumption was considered when developing a model for circadian rhythms (Gonze et al., 2002a). We thus obtain the following developed model:
Table 7.3 Stochastic version of the original version of the Repressilator model

No. | Reaction   | Propensity                       | Parameter values
1   | [G1] → M1  | w1 = aΩ (KΩ)ⁿ/((KΩ)ⁿ + P3ⁿ)      | a = 40 nM min⁻¹, K = 1 nM, n = 2
2   | [G2] → M2  | w2 = aΩ (KΩ)ⁿ/((KΩ)ⁿ + P1ⁿ)      |
3   | [G3] → M3  | w3 = aΩ (KΩ)ⁿ/((KΩ)ⁿ + P2ⁿ)      |
4   | M1 →       | w4 = dM1                         | d = 5 min⁻¹
5   | M2 →       | w5 = dM2                         |
6   | M3 →       | w6 = dM3                         |
7   | [M1] → P1  | w7 = bM1                         | b = 5 min⁻¹
8   | [M2] → P2  | w8 = bM2                         |
9   | [M3] → P3  | w9 = bM3                         |
10  | P1 →       | w10 = gP1                        | g = 5 min⁻¹
11  | P2 →       | w11 = gP2                        |
12  | P3 →       | w12 = gP3                        |

Variables in brackets indicate species which are necessary for the reaction but which are not consumed in the reaction.
$$\begin{aligned}
\frac{dG_1}{dt} &= -k_{a1} P_3 G_1 + k_{d1}\, G_{1P},\\
\frac{dG_{1P}}{dt} &= k_{a1} P_3 G_1 - k_{d1}\, G_{1P} - k_{a2} P_3 G_{1P} + k_{d2}\, G_{1PP},\\
\frac{dG_2}{dt} &= -k_{a1} P_1 G_2 + k_{d1}\, G_{2P},\\
\frac{dG_{2P}}{dt} &= k_{a1} P_1 G_2 - k_{d1}\, G_{2P} - k_{a2} P_1 G_{2P} + k_{d2}\, G_{2PP},\\
\frac{dG_3}{dt} &= -k_{a1} P_2 G_3 + k_{d1}\, G_{3P},\\
\frac{dG_{3P}}{dt} &= k_{a1} P_2 G_3 - k_{d1}\, G_{3P} - k_{a2} P_2 G_{3P} + k_{d2}\, G_{3PP},\\
\frac{dM_1}{dt} &= a\,G_1 - d\,M_1,\\
\frac{dM_2}{dt} &= a\,G_2 - d\,M_2,\\
\frac{dM_3}{dt} &= a\,G_3 - d\,M_3,\\
\frac{dP_1}{dt} &= b\,M_1 - g\,P_1 - k_{a1} P_1 G_2 + k_{d1}\, G_{2P} - k_{a2} P_1 G_{2P} + k_{d2}\, G_{2PP},\\
\frac{dP_2}{dt} &= b\,M_2 - g\,P_2 - k_{a1} P_2 G_3 + k_{d1}\, G_{3P} - k_{a2} P_2 G_{3P} + k_{d2}\, G_{3PP},\\
\frac{dP_3}{dt} &= b\,M_3 - g\,P_3 - k_{a1} P_3 G_1 + k_{d1}\, G_{1P} - k_{a2} P_3 G_{1P} + k_{d2}\, G_{1PP}.
\end{aligned} \qquad (7.22)$$

In these equations, Gi, GiP, and GiPP represent the concentrations of the free gene, the gene bound to one repressor protein, and the gene bound to two repressor proteins (averaged over a large cell population). We assume that the total gene amount is constant:

$$\begin{aligned}
G_{1T} &= G_1 + G_{1P} + G_{1PP},\\
G_{2T} &= G_2 + G_{2P} + G_{2PP},\\
G_{3T} &= G_3 + G_{3P} + G_{3PP}.
\end{aligned} \qquad (7.23)$$
Parameters kai and kdi are the rates of binding and unbinding of the repressor to the gene promoter. It is reasonable to assume that these kinetic rates are large compared to the other kinetic rates, because binding/unbinding is fast compared to transcription/translation and degradation processes. In addition, we impose the constraint K = √((kd1/ka1)(kd2/ka2)) as well as high cooperativity, that is, kd2/ka2 ≪ kd1/ka1, to make the developed model consistent with the original Hill-based model with n = 2 (cf. Eq. (7.21)). The stochastic version of this developed model is given in Table 7.4. Note that the number of reaction steps is doubled compared to the compact model. We should stress that this model is taken here as a toy example. We do not aim at developing a precise model for the biological system itself. In the Repressilator constructed experimentally, there is no indication that the repressors cooperatively bind several sites in the promoter of their target genes. The sigmoidal shape of the transcription function may also arise from the dimerization of the protein prior to binding. Other developments of the Repressilator into “elementary” reaction steps, for example, based on dimerization of the repressors prior to their binding, could also have been considered (as done, for example, by Chen et al., 2004 and by Loinger and Biham, 2007).
4.3. Deterministic simulation of the Repressilator

The results of deterministic simulations of the original, compact model are given in Fig. 7.7A and B. Figure 7.7A shows the deterministic limit-cycle oscillations obtained by numerical integration of Eq. (7.21). Because of the symmetry in the model and in the parameter values, each mRNA
Table 7.4 Stochastic version of the developed version of the Repressilator model

No. | Reaction          | Propensity          | Parameter values
1   | G1 + P3 → G1P     | w1 = ka1P3G1/Ω      | ka1 = 10 nM⁻¹ min⁻¹
2   | G1P → G1 + P3     | w2 = kd1G1P         | kd1 = 100 min⁻¹
3   | G1P + P3 → G1PP   | w3 = ka2P3G1P/Ω     | ka2 = 100 nM⁻¹ min⁻¹
4   | G1PP → G1P + P3   | w4 = kd2G1PP        | kd2 = 10 min⁻¹
5   | G2 + P1 → G2P     | w5 = ka1P1G2/Ω      |
6   | G2P → G2 + P1     | w6 = kd1G2P         |
7   | G2P + P1 → G2PP   | w7 = ka2P1G2P/Ω     |
8   | G2PP → G2P + P1   | w8 = kd2G2PP        |
9   | G3 + P2 → G3P     | w9 = ka1P2G3/Ω      |
10  | G3P → G3 + P2     | w10 = kd1G3P        |
11  | G3P + P2 → G3PP   | w11 = ka2P2G3P/Ω    |
12  | G3PP → G3P + P2   | w12 = kd2G3PP       |
13  | [G1] → M1         | w13 = aG1           | a = 40 min⁻¹
14  | [G2] → M2         | w14 = aG2           |
15  | [G3] → M3         | w15 = aG3           |
16  | M1 →              | w16 = dM1           | d = 1 min⁻¹
17  | M2 →              | w17 = dM2           |
18  | M3 →              | w18 = dM3           |
19  | [M1] → P1         | w19 = bM1           | b = 5 min⁻¹
20  | [M2] → P2         | w20 = bM2           |
21  | [M3] → P3         | w21 = bM3           |
22  | P1 →              | w22 = gP1           | g = 5 min⁻¹
23  | P2 →              | w23 = gP2           |
24  | P3 →              | w24 = gP3           |

Variables in brackets indicate the species which are necessary for the reaction but which are not consumed in the reaction. The total gene concentration is G1T = G2T = G3T = 1Ω.
concentration oscillates with the same amplitude, but the oscillations are 120° out of phase. For each gene, the protein level directly follows the mRNA level (not shown). For the parameter values chosen, the period of the oscillations is about 7.3 min. Figure 7.7B presents the bifurcation diagram as a function of the maximum transcription rate, a (taken identical for each gene). A Hopf bifurcation occurs at a = 7.5 nM/min. Limit-cycle oscillations occur above this threshold value, and their amplitude increases roughly linearly with a. The results of deterministic simulations of the developed model are given in Fig. 7.7C and D. Parameter values are the same as for the original model. To reduce delay effects, the values of the binding and unbinding rates are 10 times higher than those used in Fig. 7.5. The oscillations produced by this developed model are very similar to the ones produced by the original model. The Hopf bifurcation is located at a = 5.4 nM/min
[Figure 7.7 appears here: four panels. (A, C) Oscillations of M1, M2, M3 versus time; (B, D) bifurcation diagrams of M1 = M2 = M3 versus the transcription rate a, with Hopf bifurcations (HB) at a = 7.5 and a = 5.4, respectively.]
Figure 7.7 Deterministic simulation of the Repressilator. (A, B) Original model. (C, D) Developed model. (A) Deterministic oscillations obtained by numerical integration of Eq. (7.21). Parameter values: a = 40 nM min⁻¹, d = 1 min⁻¹, b = 5 min⁻¹, g = 5 min⁻¹, n = 2, K = 1 nM, and G1T = G2T = G3T = 1 nM. The period is 7.3. (B) Bifurcation diagram as a function of the transcription rate, a (kept identical for all mRNAs). HB denotes a Hopf bifurcation. To the left of this point, the solid curve gives the stable steady state; to the right of this point, the dashed curve corresponds to the unstable steady state, and the two solid curves are the maximum and the minimum of the mRNA oscillations. (C) Deterministic oscillations obtained by numerical integration of Eq. (7.22). Parameter values: ka1 = 10 min⁻¹ nM⁻¹, kd1 = 100 min⁻¹, ka2 = 100 min⁻¹ nM⁻¹, and kd2 = 10 min⁻¹; other parameters are as in panel A. The period is 7.97. (D) Bifurcation diagram as a function of the transcription rate, a.
(Fig. 7.7D), a value slightly smaller than for the compact model. Nevertheless, the shape of the envelope of the oscillations is very similar in both versions. If the binding and unbinding rates are given smaller values (i.e., ka1 = 1, kd1 = 10, ka2 = 10, kd2 = 1, as in Fig. 7.5), the oscillations present a slightly larger amplitude and period (period = 10.4 min) than in the original model, but the structure of the bifurcation diagram is preserved. The comparison between Fig. 7.7A and B and Fig. 7.7C and D shows that the original and the developed model share the same qualitative features (limit-cycle oscillations, bifurcation structure). They differ only slightly in some quantitative aspects. Of course, the best agreement between the two versions of the model is obtained when the parameter values of the developed model are consistent with the compact model (fast binding/unbinding rates, high cooperativity). The developed model offers the possibility of exploring the behavior of the system when these conditions are not fulfilled (slow binding/unbinding rates, absence of cooperativity). If, for example, we choose ka1 = kd1 = ka2 = kd2 = 1, the oscillations are preserved and even the structure of the bifurcation diagram is kept (HB at a = 7.2 nM/min) (not shown). If we choose ka1 = 10, kd1 = 1, ka2 = 1, kd2 = 10, then, for the default parameter values, the oscillations are damped (no limit cycle) (not shown). They can, however, be restored at very high values of a (a > 1700 nM/min). Thus, cooperativity and fast binding/unbinding rates are not required to generate limit-cycle oscillations.
4.4. Stochastic simulation of the Repressilator

The results of stochastic simulations of the original model are given in Fig. 7.8A and B. Figure 7.8A shows the stochastic oscillations obtained with the Gillespie algorithm for the stochastic version of the model presented in Table 7.3. Note that in this version, the propensities of the transcriptional processes are computed using the Hill functions. In the case illustrated in the figure, the system size has been fixed to Ω = 100, which yields a few thousand mRNA and protein molecules. For this number of molecules, the oscillations are clearly observable, but their amplitude and period undergo some variability. The effect of noise may be quantified by computing the distribution of the periods and by measuring its standard deviation (Fig. 7.8B). For the compact version, the mean period is 7.33 and the standard deviation is 0.25 (coefficient of variation CV = standard deviation/mean = 3.41%). The results of stochastic simulations of the developed model are given in Fig. 7.8C and D. As for the original model, the system size was set to Ω = 100. The number of mRNA and protein molecules is thus comparable to that of the original model, and the number of gene copies for each species is equal to G1T = G2T = G3T = Ω. Comparing the histogram of the period distribution (Fig. 7.8D) with the histogram obtained for the original model (Fig. 7.8B), we notice that the effect of noise is roughly similar in both cases (mean period = 7.98, standard deviation = 0.33, CV = 4.13%). The fact that the period is slightly increased is consistent with the prediction from the deterministic model (see legend of Fig. 7.7A and C). In panels C and D, the total gene concentration was scaled by the system size Ω. However, we may want to set the number of gene copies equal to 1. To rescale the variables Mi and Pi while keeping G1T = G2T = G3T = 1, we do the following: we multiply all the binding and unbinding rates by
How Molecular Should Your Molecular Model Be?
[Figure 7.8, panels A–F: time series of the number of mRNA molecules (M1, M2, M3) and histograms of the period distribution. Panel B: Mean = 7.33, Std = 0.26; panel D: Mean = 7.99, Std = 0.33; panel F: Mean = 7.38, Std = 0.6.]
Figure 7.8 Stochastic simulation of the Repressilator. (A, B) Stochastic oscillations and period distribution obtained for the original model (Table 7.3). (C, D) Stochastic oscillations and period distribution obtained for the developed model (Table 7.4) for the parameter values given in Table 7.4. (E, F) Stochastic oscillations and period distribution obtained for the developed model (Table 7.4) when Gtot = 1 and the binding/unbinding rates (ka1, ka2, kd1, and kd2) as well as the transcription rate a are multiplied by Ω. For all these simulations, the system size has been set to Ω = 100 and the period distribution is computed from a time series of 5000 min.
Ω and we multiply the maximum transcription rate, a, by Ω (see Table 7.4, parameter set 2). Stochastic simulations of this model yield the time series shown in Fig. 7.8E. The corresponding period distribution is given in Fig. 7.8F. This version of the model is somewhat less robust to molecular noise, but the oscillations are not destroyed.
Didier Gonze et al.
In summary, developing the Repressilator in "elementary reaction steps" does not significantly affect the robustness of the oscillations to molecular noise. However, if we want to set the number of gene copies equal to 1, some adaptations are needed to preserve the consistency of the model with its deterministic counterpart, and these adaptations may have some influence on the effect of noise.
5. Toggle Switch

The Toggle Switch model consists of two genes that inhibit each other (Gardner et al., 2000) (Fig. 7.9). For appropriate parameter values and sufficient nonlinearity, this model produces bistability (coexistence of two stable steady states). It was used to design and to guide the experimental construction of a synthetic gene circuit exhibiting bistability in E. coli (Gardner et al., 2000).
5.1. Original version: Hill-based model

In the original version of the Toggle Switch model, mRNA and protein were not distinguished, so the model contained only two variables. We consider here a variant in which we distinguish the mRNA (M1 and M2) from the protein (P1 and P2):

\[
\begin{aligned}
\frac{dM_1}{dt} &= a_1 \frac{K^n}{K^n + P_2^n} - d M_1, \\
\frac{dM_2}{dt} &= a_2 \frac{K^n}{K^n + P_1^n} - d M_2, \\
\frac{dP_1}{dt} &= b M_1 - g P_1, \\
\frac{dP_2}{dt} &= b M_2 - g P_2.
\end{aligned}
\tag{7.24}
\]
Figure 7.9 Scheme of the Toggle Switch model.
These equations are built in a similar way to those of the Repressilator. The variables Mi and Pi (with i = 1, 2) represent the concentrations of the mRNA and protein encoded by the two genes of the Toggle Switch system. The definition of the parameters is the same as for the Repressilator. In the experiments, the switch between the two steady states is triggered by an inducer (e.g., IPTG) which binds to one of the repressor proteins and thereby prevents the protein from binding the gene promoter. The equations can be modified to account explicitly for the inducer (Gardner et al., 2000). For the sake of simplicity, we consider here that the switch is induced by a transient increase (pulse) of the transcription rate of one of the genes (a1). Again, our goal is not to model the experimental Toggle Switch system, but rather to study a minimal model exhibiting bistability. The stochastic version of this model, which comprises eight reaction steps, is given in Table 7.5. Here again, we computed the propensities of mRNA synthesis with the Hill kinetics.
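To make the pulse-induced switch concrete, Eq. (7.24) can be integrated numerically. The sketch below is our illustration, not code from the original study: it uses the parameter values of the legend of Fig. 7.10A, an explicit Euler scheme, and an illustrative initial condition on the M2-high branch; a pulse of a1 (4 → 6.8 nM min⁻¹ between t = 20 and t = 40 min) drives the system to the M1-high steady state:

```python
# Deterministic simulation of the compact Toggle Switch, Eq. (7.24).
# Parameter values follow the legend of Fig. 7.10A; the integration
# scheme (explicit Euler) and initial condition are illustrative choices.

def toggle_rhs(state, a1, a2=4.0, d=2.0, b=2.0, g=1.0, K=1.0, n=2):
    """Right-hand side of Eq. (7.24)."""
    M1, M2, P1, P2 = state
    return (a1 * K**n / (K**n + P2**n) - d * M1,
            a2 * K**n / (K**n + P1**n) - d * M2,
            b * M1 - g * P1,
            b * M2 - g * P2)

def simulate_toggle(t_end=100.0, dt=1e-3):
    state = [0.0, 2.0, 0.0, 4.0]                 # start on the M2-high branch
    t = 0.0
    while t < t_end:
        a1 = 6.8 if 20.0 <= t < 40.0 else 4.0    # transient pulse of a1
        derivs = toggle_rhs(state, a1)
        state = [x + dt * dx for x, dx in zip(state, derivs)]
        t += dt
    return state                                 # [M1, M2, P1, P2] at t_end
```

A pulse of smaller amplitude or shorter duration leaves the system on its initial branch, in line with the amplitude/duration condition summarized in Fig. 7.10F.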
5.2. Developed version: Cooperative binding sites

We developed the Toggle Switch model in a similar way as the Repressilator. We assume that two binding sites are present in the promoter of each gene and that the binding of the repressor proteins is cooperative. We thus obtain the following eight-variable ODE model:

\[
\begin{aligned}
\frac{dG_1}{dt} &= -k_{a1} P_2 G_1 + k_{d1} G_{1P}, \\
\frac{dG_{1P}}{dt} &= k_{a1} P_2 G_1 - k_{d1} G_{1P} - k_{a2} P_2 G_{1P} + k_{d2} G_{1PP}, \\
\frac{dG_2}{dt} &= -k_{a1} P_1 G_2 + k_{d1} G_{2P}, \\
\frac{dG_{2P}}{dt} &= k_{a1} P_1 G_2 - k_{d1} G_{2P} - k_{a2} P_1 G_{2P} + k_{d2} G_{2PP}, \\
\frac{dM_1}{dt} &= a_1 G_1 - d M_1, \\
\frac{dM_2}{dt} &= a_2 G_2 - d M_2, \\
\frac{dP_1}{dt} &= b M_1 - g P_1 - k_{a1} P_1 G_2 + k_{d1} G_{2P} - k_{a2} P_1 G_{2P} + k_{d2} G_{2PP}, \\
\frac{dP_2}{dt} &= b M_2 - g P_2 - k_{a1} P_2 G_1 + k_{d1} G_{1P} - k_{a2} P_2 G_{1P} + k_{d2} G_{1PP},
\end{aligned}
\tag{7.25}
\]
Table 7.5 Stochastic version of the original version of the Toggle Switch model

No.  Reaction       Propensity                             Parameter values
1    [G1] → M1      w1 = a1 Ω (KΩ)^n / ((KΩ)^n + P2^n)     a1 = 4 nM min⁻¹, K = 1 nM, n = 2
2    [G2] → M2      w2 = a2 Ω (KΩ)^n / ((KΩ)^n + P1^n)     a2 = 4 nM min⁻¹
3    M1 →           w3 = d M1                              d = 1 min⁻¹
4    M2 →           w4 = d M2
5    [M1] → P1      w5 = b M1                              b = 2 min⁻¹
6    [M2] → P2      w6 = b M2
7    P1 →           w7 = g P1                              g = 2 min⁻¹
8    P2 →           w8 = g P2

Variables in brackets indicate species that are necessary for the reaction but not consumed in the reaction.
with the conservation relations:

\[
G_{1T} = G_1 + G_{1P} + G_{1PP}, \qquad
G_{2T} = G_2 + G_{2P} + G_{2PP}.
\tag{7.26}
\]
The definition of the variables and parameters is the same as for the Repressilator. The stochastic version of this developed Toggle Switch model is given in Table 7.6. As for the Repressilator, the number of reaction steps is doubled compared to the compact version, and the parameter values have been chosen for consistency between the two versions of the model (high cooperativity, fast binding/unbinding constants, and \( K = \sqrt{(k_{d1}/k_{a1})(k_{d2}/k_{a2})} \)).
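As a quick sanity check (ours, not part of the original text), the consistency relation can be verified with the binding/unbinding constants of Table 7.6, which indeed reproduce the Hill constant K = 1 nM of the compact model:

```python
import math

# Binding/unbinding constants of the developed model (Table 7.6)
ka1, kd1 = 10.0, 100.0    # nM^-1 min^-1 and min^-1
ka2, kd2 = 100.0, 10.0

# Effective Hill constant of the compact model
K = math.sqrt((kd1 / ka1) * (kd2 / ka2))
print(K)   # 1.0, that is, K = 1 nM as in Table 7.5
```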
5.3. Deterministic simulation of the Toggle Switch model

We first simulate the deterministic Toggle Switch model (Fig. 7.10). For appropriate parameter values, both the compact and the developed versions of the model show bistable behavior. In panel A, we show how a pulse of a1 can induce a switch from one steady state to the other in the original model. Panel B shows the bifurcation diagram as a function of a1 for the original model. Two saddle-node bifurcations (denoted LP on the figure), located at a1 = 3.2 nM/min and a1 = 5.8 nM/min, delimit the region of bistability. Panels C and D give the corresponding time series and bifurcation diagram for the developed model. The bifurcation picture is very similar to that of the compact model, with a bistable region located between a1 = 3.3 nM/min and a1 = 5.5 nM/min. The bistability region is thus slightly reduced compared to the compact model.
Table 7.6 Stochastic version of the developed version of the Toggle Switch model

No.  Reaction            Propensity            Parameter values
1    G1 + P2 → G1P       w1 = ka1 P2 G1 / Ω    ka1 = 10 nM⁻¹ min⁻¹
2    G1P → G1 + P2       w2 = kd1 G1P          kd1 = 100 min⁻¹
3    G1P + P2 → G1PP     w3 = ka2 P2 G1P / Ω   ka2 = 100 nM⁻¹ min⁻¹
4    G1PP → G1P + P2     w4 = kd2 G1PP         kd2 = 10 min⁻¹
5    G2 + P1 → G2P       w5 = ka1 P1 G2 / Ω
6    G2P → G2 + P1       w6 = kd1 G2P
7    G2P + P1 → G2PP     w7 = ka2 P1 G2P / Ω
8    G2PP → G2P + P1     w8 = kd2 G2PP
9    [G1] → M1           w9 = a1 G1            a1 = 4 min⁻¹
10   [G2] → M2           w10 = a2 G2           a2 = 4 min⁻¹
11   M1 →                w11 = d M1            d = 1 min⁻¹
12   M2 →                w12 = d M2
13   [M1] → P1           w13 = b M1            b = 2 min⁻¹
14   [M2] → P2           w14 = b M2
15   P1 →                w15 = g P1            g = 2 min⁻¹
16   P2 →                w16 = g P2

Variables in brackets indicate the species that are necessary for the reaction but not consumed in the reaction.
Figure 7.10E shows the stability diagram as a function of the transcription rates a1 and a2. Clearly, the domain of bistability defined by the developed model for n = 2 is very close to the bistability domain of the compact model (for n = 2). For n = 3, this domain is significantly enlarged. Thus, replacing the developed model with its compact counterpart affects the results much less than some parametric changes. To induce the switch from one steady state to the other, the pulse of a1 must have a sufficient amplitude and duration. In Fig. 7.10F, we plot the minimum amplitude needed to induce the switch as a function of the pulse duration for both versions of the model. Here again, the compact and developed versions lead to similar results. In summary, we obtain a good qualitative agreement between the compact and the developed versions of the Toggle Switch model. The difference between these two versions mainly concerns some minor quantitative changes, which are smaller than the effect induced by a variation of some parameters such as n. However, when modeling biological systems, the values of these parameters are often unknown.
5.4. Stochastic simulation of the Toggle Switch model

We now discuss the stochastic analysis of both versions of the Toggle Switch model. Figure 7.11A displays the stochastic time series of 10 independent runs of the original model. In all cases, a pulse of a1 induces a switch from
Figure 7.10 Deterministic simulation of the Toggle Switch model. (A, B) Original model. (C, D) Developed model. (A) Switch induced by a pulse of a1, where a1 is increased from 4 to 6.8 nM min⁻¹ between t = 20 and t = 40 min. These time series have been obtained by numerical integration of Eq. (7.24). Parameter values: a2 = 4 nM min⁻¹, d = 2 min⁻¹, b = 2 min⁻¹, g = 1 min⁻¹, K = 1 nM, n = 2, and G1T = G2T = 1 nM. (B) Bifurcation diagram as a function of a1 (a2 = 4). LP1 and LP2 denote the saddle-node bifurcations. Between these two points, the system displays bistability. (C) Same as in panel A for the developed model. Parameter values: ka1 = 10 nM⁻¹ min⁻¹, kd1 = 100 min⁻¹, ka2 = 100 nM⁻¹ min⁻¹, and kd2 = 10 min⁻¹; other parameter values are as in panel A. (D) Same as in panel B for the developed model. (E) Stability diagram as a function of parameters a1 and a2 for the compact model (for n = 2 and n = 3, solid curves) and for the developed model (n = 2, dashed curve). Bistability is observed in the upper right part of the diagram. (F) Conditions on the amplitude and duration of the a1 pulse to induce the switch, for the compact model (solid curve) and the developed model (dashed curve).
[Figure 7.11, panels A–F: time series of the number of molecules (M1, M2) for 10 independent runs and histograms of the switch-time distribution. Panel B (original model): Mean = 57.75, St dev = 5.25; panels D and F (developed model and its variant): Mean = 51.02, St dev = 2.47 and Mean = 48.74, St dev = 2.19.]
Figure 7.11 Stochastic simulation of the Toggle Switch model. (A, B) Time series of 10 independent runs and distribution of the switch times (500 runs) obtained for the original model (Table 7.5). (C, D) Time series of 10 independent runs and distribution of the switch times (500 runs) obtained for the developed model (Table 7.6). (E, F) Time series of 10 independent runs and distribution of the switch times (500 runs) obtained for the developed model (Table 7.6) when Gtot = 1 and the binding/unbinding rates (ka1, ka2, kd1, and kd2) as well as the transcription rate a are multiplied by Ω. For all these simulations, the system size has been set to Ω = 100. The switch time is defined as the time at which M1 becomes larger than M2.
one steady state to the other one, but, due to the stochastic nature of the model, the system does not switch exactly at the same time in each run. As a measure of the effect of noise on the system, we computed the distribution
of the switch times, defined as the time at which M1 becomes larger than M2 (Fig. 7.11B). In Fig. 7.11C and D, we show the corresponding results for the developed version of the Toggle Switch model. The main difference is that the distribution is shifted towards smaller switch times and its standard deviation is reduced compared to the one obtained for the original model (Fig. 7.11B). This quantitative difference may be attributed to the fact that the Hill function and its development are not strictly equivalent (see above) and that the phase-space properties (vector flows and the attractivity of the steady states) are not identical. As described for the Repressilator, we also consider a variant of the developed model in which the number of gene copies is maintained at G1T = G2T = 1. The results obtained for this variant are given in Fig. 7.11E and F. The distribution of the switch times is very similar to that of the developed version in which the total gene concentration is scaled by Ω. These stochastic simulations show that the distribution of the switch times obtained for the compact model is slightly shifted toward larger values and somewhat broader than for the developed version of the model. Developing the Toggle Switch in detailed reaction steps thus does not significantly affect the robustness of the bistable behavior to molecular noise.
6. Discussion

6.1. Qualitative models are useful to understand dynamical properties of gene regulatory networks

Models in biology are useful to understand the relation between the structure of a regulatory network and its dynamical properties. Qualitative models capturing the core regulatory architecture of the system are suitable to unravel these relations. Early models in biology made it possible to associate negative circuits with oscillations and positive circuits with multistability (Thomas and D'Ari, 1990). Models for circadian clocks can be used to explore the advantage of interlocked feedback loops (Becker-Weimann et al., 2004). Models for the p53–Mdm2 system can be used to understand the frequency tuning of the oscillations (Abou-Jaoudé et al., 2009). Qualitative models are usually relatively simple, rely on phenomenological kinetic equations, and reveal generic properties of biological systems. In synthetic biology, models for the Repressilator and for the Toggle Switch are essential to design the genetic circuits, but they do not say quantitatively what the synthesis or degradation rates of the mRNA and proteins should be. Rather, they show how these different rates should be balanced to obtain the desired behavior and predict the behavior of these systems when the balance is impaired.

In most biological systems studied, the values of the kinetic parameters are unknown. Many reactions are known to be enzymatic, but the binding/unbinding constants and turnover rates usually remain unknown. In most cases, even the molecular mechanism has not been elucidated. For these systems, it is impossible to develop detailed quantitative models that reproduce precisely experimental time series or predict quantitatively the behavior of the system for various well-defined sets of parameters. Detailed models are, however, not absolutely required to fit experimental data. Hill functions for transcription have been measured experimentally (Rosenfeld et al., 2005) without the molecular mechanism responsible for such a threshold being elucidated. The mechanism based on cooperative binding sites discussed here is only one possible way to generate a sigmoidal function; other mechanisms, based on multimerization of the inhibitor, on competition of transcription factors for a given binding site (Rossi et al., 2000), or on positive feedback loops (Ferrell and Machleder, 1998), may also explain the occurrence of Hill-like kinetics. Phenomenological models based on measured kinetics are thus suitable for quantitative modeling. For instance, in a study of bistability in the lactose system in bacteria, Ozbudak et al. (2004) resorted to a compact three-variable model (built upon Hill functions) to fit experimental data to the nonlinear functions.
6.2. Michaelis–Menten and Hill assumptions are often reasonable

When modeling a biological system, Michaelis–Menten and Hill functions are often taken as standard equations. Michaelis–Menten functions are used to describe not only the rates of enzyme reactions in metabolism but also posttranslational processes (such as protein phosphorylation) that play crucial roles in the dynamics of genetic networks. Hill functions are employed to describe the kinetics of multisite enzymes as well as gene transcription. The rationale behind this choice is that biochemical reactions are often enzymatic and that saturation kinetics and threshold responses are often observed experimentally. This choice is also motivated by the fact that these functions provide the nonlinearity necessary for the system to generate nontrivial behaviors, such as self-sustained oscillations and multistability. The assumptions that make the Michaelis–Menten equations valid are, however, rarely discussed. Many textbooks present the derivation of the Michaelis–Menten kinetics for isolated reactions (S → P) and highlight the underlying quasi-steady-state assumption (QSSA). Only a few works discuss the validity of these assumptions in open systems (Flach and Schnell, 2006; Stoleriu et al., 2004, 2005). We have shown here that Michaelis–Menten kinetics remain valid when the reaction of interest is embedded in a larger reaction scheme (→ S → P →), but have reported a few observations to keep in mind. First, the enzyme does not need to be in small amount compared to the substrate. Second, the E–S binding rates must be fast compared to the
reactions producing the substrate and degrading the product. If this is not the case, the relaxation to the steady state induces quantitative changes in the system, which may affect the response of the system to both transient perturbations and periodic forcing. Upon appropriate tuning of some parameter values, it is usually possible to make the developed model consistent with its compact version. The same question arises for the Hill function. We have presented one possible molecular mechanism that can explain the occurrence of Hill functions at the level of gene transcription. When the transcription module is embedded in a larger network, a good agreement between the compact and the developed versions of the model is obtained if the binding/unbinding rates are fast compared to the other processes (such as transcription or translation). This assumption is generally valid because processes like transcription and translation, which require the recruitment of many transcription factors and the polymerization of RNA and protein molecules, are much slower than the binding of one molecule of a transcription factor to the gene promoter. When these conditions are fulfilled, it is possible to reproduce the behavior predicted by the compact model with a developed model consisting of elementary reaction steps. This was illustrated here with two simple gene models, the Toggle Switch (Gardner et al., 2000) and the Repressilator (Elowitz and Leibler, 2000). An agreement between two different levels of detail of a model was also obtained for the circadian clock (Gonze et al., 2002a,b). In all these systems, the deterministic behavior and the bifurcation properties are largely conserved.
Compact models are thus qualitatively equivalent to their corresponding developed models with respect to global dynamical properties and bifurcation structures, provided they incorporate the same regulatory network in terms of feedbacks and delays and fulfill the necessary conditions.
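This consistency can be illustrated numerically. The following sketch is our own illustration, with arbitrarily chosen rate constants (not from the chapter): it simulates an open system → S → P → twice, once with the explicit enzyme mechanism S + E ⇌ C → E + P and once with its Michaelis–Menten reduction, and shows that both relax to the same steady state when E–S binding is fast:

```python
# Compare the developed (mass-action) and compact (Michaelis-Menten)
# descriptions of an open enzymatic system. Rate constants are illustrative.

def simulate_open(developed, t_end=20.0, dt=1e-4):
    """Open system: -> S (rate k0), S -> P (enzyme-catalyzed), P -> (rate kp)."""
    k0, kp = 1.0, 1.0                          # substrate input, product decay
    k1, km1, k2, ET = 100.0, 100.0, 2.0, 1.0   # fast E-S binding/unbinding
    Km, Vmax = (km1 + k2) / k1, k2 * ET        # Michaelis-Menten parameters
    S, E, C, P = 0.0, ET, 0.0, 0.0
    t = 0.0
    while t < t_end:
        if developed:                          # explicit binding steps
            v_bind = k1 * S * E - km1 * C      # net S + E -> C flux
            v_cat = k2 * C                     # catalytic step C -> E + P
            dS, dE = k0 - v_bind, -v_bind + v_cat
            dC, dP = v_bind - v_cat, v_cat - kp * P
            S += dt * dS; E += dt * dE; C += dt * dC; P += dt * dP
        else:                                  # compact Michaelis-Menten rate
            v = Vmax * S / (Km + S)
            S += dt * (k0 - v)
            P += dt * (v - kp * P)
        t += dt
    return S, P

S_dev, P_dev = simulate_open(True)
S_com, P_com = simulate_open(False)
# Both versions settle near S = Km*k0/(Vmax - k0) and P = k0/kp
```

With slow binding (small k1, km1), the two trajectories would differ transiently even though they share the same steady state, which is the point made in the text.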
6.3. The level of detail depends on the question asked

The main advantage of using compact models is to reduce the number of variables and parameters, making the core structure of the network more visible and facilitating the numerical analysis of the equations. Exploring the properties and robustness of the system in parameter space is much easier if the number of key parameters is reduced. Such models, although approximate, have been successfully applied to study the dynamical properties of genetic networks both in systems and in synthetic biology. It is important to stress here that by "developing a model," we mean decomposing the Michaelis–Menten and Hill functions into a detailed reaction scheme. This consists of adding intermediary reaction steps and compounds. Apart from quantitative changes, these additional steps do not change the global dynamics of the system. Developing a model is justified when we want to access a particular parameter or process that
is not explicitly taken into account in the compact model and to examine its effect on the dynamics. Of course, incorporating new regulations in a network may change the behavior of the system. In cell cycle modeling, for example, models have been developed mainly to add new regulations as they are discovered (Chen et al., 2004; Gérard and Goldbeter, 2009). Such detailed models make it possible to assess the relative importance of the regulations. The level of detail thus depends on the question asked. Note that it might not be straightforward to explain nonstandard phenomenological functions by a detailed molecular scheme (Sabouri-Ghomi et al., 2008). In this chapter, the question of how to treat the "gene" in the developed versions of the Repressilator and the Toggle Switch illustrates the difficulties that can arise when developing a model. Furthermore, biomolecular mechanisms are never "fully" developed: it is always possible to further decompose each molecular step. Indeed, a fully detailed scheme of transcription would require taking into account the recruitment of each protein that makes up and activates the RNA polymerase complex, as well as each polymerization step. Such a detailed model was designed to study the lysis/lysogeny switch in the λ phage (Arkin et al., 1998). Where should we fix the limit? In theory, each reaction step can always be further detailed until we reach the level of "elementary reaction steps." But what does this mean?
6.4. The notion of elementary reaction steps in biology is elusive

In physical chemistry, the van't Hoff law of mass action is a phenomenological model that is empirically valid even for nonelementary chemical reactions (Berry et al., 2001; McQuarrie and Simon, 1999). Even in simpler inorganic reaction schemes, there is an ongoing discussion on the level of detail required to model these systems. Rigorous mathematical demonstration of the validity of such phenomenological kinetic laws is based on many assumptions, such as ideal solution, homogeneity of the system, a Maxwell velocity distribution, thermal equilibrium, and constant temperature in time and space. From a mathematical point of view, one may rapidly conclude that none of these assumptions is respected in real, experimental chemical systems and, a fortiori, in biochemical and cellular systems. However, we do know that these phenomenological kinetic equations are good approximations of the dynamics of experimental systems. One should not confuse the mathematical proof of phenomenological laws based on stringent hypotheses with the empirical validity of such laws: these are two related but nevertheless separate issues. Elementary steps in experimental kinetics are very seldom known, even in inorganic chemical reactions. Physical chemistry shows that phenomenological kinetic laws are very robust and valid in many cases that go beyond
the assumptions made to perform rigorous mathematical demonstrations. From a pragmatic, experimental point of view, chemical kinetic laws are valid over a much larger range than their corresponding rigorous mathematical theorems based on, for example, ideal solutions or perfect gases (Berry et al., 2001; McQuarrie and Simon, 1999). This general consideration also holds for biological models (Cornish-Bowden, 1995). We thus believe that the emphasis on the need to write a biological model in terms of elementary steps is, for the majority of the systems studied, not justified. Writing models in elementary steps is an inaccessible idealization for the vast majority of chemical and, a fortiori, biochemical processes.
6.5. Stochastic simulations can be performed using compact kinetics

To assess the impact of the molecular noise resulting from low numbers of molecules, it is necessary to resort to stochastic simulations. A commonly used method to simulate stochastic models is the Gillespie algorithm (Gillespie, 1977). To run the Gillespie algorithm (or any of its variants), a propensity must be defined for each reaction (see Appendix). These propensities are directly related to the rate constants. The Gillespie algorithm was shown to be mathematically correct for "elementary reaction steps." However, the master equation describing stochastic processes may include propensities not based on the mass action law. Because running this Monte Carlo-type simulation on large systems is CPU-intensive, especially for systems with fast reactions, it is highly tempting to use Michaelis–Menten or Hill functions to compute these propensities rather than developing models into multiple reaction steps. Several studies apply compact models for stochastic simulations (Gonze et al., 2003; Kraus et al., 1992; Longo et al., 2009; Ouattara et al., 2010; Song et al., 2007), but the accuracy of these models is sometimes questioned. For this reason, many authors prefer to run stochastic simulations on developed models (Cai and Yuan, 2009; Forger and Peskin, 2005; Gonze et al., 2002a; Kar et al., 2009). Here, stochastic simulations of the compact and developed versions of the Repressilator and of the Toggle Switch models show a good qualitative correspondence. In agreement with the observations of the present study, a developed model for the circadian clock gives similar results to its compact version (Gonze et al., 2002b). Interestingly, the results of a detailed stochastic model for the p53–Mdm2 system (Cai and Yuan, 2009) were comparable to those of a compact model of the same system (Ouattara et al., 2010), although the two models were established independently by different groups.
We believe that the accuracy of these approaches should be discussed in a broader context. First, as explained in the previous section, the need for “elementary reaction steps” is overemphasized. Second, the Gillespie
algorithm (as well as its ODE counterpart) implicitly assumes that the molecules are homogeneously distributed and well stirred; it neglects spatial organization and specific diffusion processes. The validity of these simplifications is much more questionable than the use of a compact instead of a developed model. Finally, the purpose of these simulations is often to highlight the link between the network structure and the impact of noise, to identify robustness factors, or to determine how noise may qualitatively affect the dynamical behaviors. Whether developed or compact, this kind of model would have to be fitted to data to make quantitative predictions regarding the exact level of noise. In addition, for most biological systems, the absolute number of molecules is unknown; quantitative predictions should therefore be based on available experimental data and may not require a lower-level description of the system (Elowitz et al., 2002; Ozbudak et al., 2002). Nevertheless, compact models are good at estimating the relative effect of noise on the system: although the quantitative level of noise may not be correct, its relative impact on the different variables of the system can be analyzed. For example, the role of network regulation in damping or amplifying the relative amplitude of molecular noise can be analyzed by comparing different versions of compact models incorporating different regulatory feedbacks (Cagatay et al., 2009).
6.6. Spatial effects are important and may imply complex kinetics

Biomolecular systems work with a complex spatial organization. Diffusion and transport processes play an important role both in the regulation and in the timing of biological processes. Components do not move randomly in the cell (Agutter and Wheatley, 2000; Agutter et al., 1995, 2000; Florescu and Joyeux, 2010). The molecular mechanisms of such processes are complex (compartmentalization, facilitated diffusion, DNA sliding, DNA looping, actin-guided protein transport, etc.; Klenin et al., 2006; Lomholt et al., 2009; Minton, 2006; von Hippel and Berg, 1989). The mathematical foundation of chemical kinetics assumes perfect stirring, ideal solution, etc. Although some adaptations of enzyme kinetics have been suggested (through the concept of fractal kinetics; Kopelman, 1988; Savageau, 1995), and models based on partial differential equations (PDEs), which include the spatial dimension, have been used, there is currently no satisfactory approach to model spatial effects accurately. Nevertheless, pragmatic kinetic approaches make use of purely phenomenological kinetic equations. These phenomenological equations can be the result of complicated biochemical mechanisms involving large molecular assemblies and elaborate transport mechanisms. When Hill kinetics are measured in vivo, they include spatial aspects (Rosenfeld et al., 2005).
We can thus assume that Michaelis–Menten or Hill-like functions in phenomenological models encapsulate this high molecular and spatial complexity, provided that hyperbolic or threshold kinetics are experimentally observed. Developing such a compact, high-level description into a detailed lower-level description can become a daunting task, and it has to be tackled on experimental grounds, that is, measured data should be accessible.
7. Conclusion

One of the major goals of modeling gene regulatory networks is to relate network architectures (generally described in terms of feedback and feedforward loops) to dynamical properties (e.g., multistability, oscillations). The question of how much molecular detail must be taken into account is often debated. Our results and the considerations discussed above provide arguments that compact models constitute a suitable approach for both deterministic and stochastic modeling. Using developed models may be appropriate for specific questions pertaining to the effect of some particular variables or to explore specific parameter effects. Provided that a compact model captures all the important regulations of a system, it is suitable for analyzing the global dynamical properties. Adding intermediary steps to such a model raises the issue of obtaining experimental data about the new variables and parameters, although this development does not qualitatively affect its dynamical properties. Moreover, decomposing a compact model into lower-level processes is neither straightforward nor univocal, as many different mechanisms can correspond to the same higher-level description. To make appropriate choices of model development, it is thus important to keep in mind the biological question at stake as well as the available experimental information. Here, we have illustrated this point by showing that adding molecular details to standard compact models does not lead to important qualitative changes in global dynamical properties such as multistability or oscillations. However, qualitative models can give quantitative insights in terms of relative values of parameters or variables, as shown by the Toggle Switch and the Repressilator, and this is why they are useful for designing synthetic systems (Elowitz and Leibler, 2000; Gardner et al., 2000). From a quantitative perspective, experimental data can be fitted by compact models (Ozbudak et al., 2004).
The latter automatically encapsulate the effects of underlying molecular mechanisms and spatial aspects. In conclusion, from a phenomenological point of view, the main qualitative behaviors can be obtained by compact versions of deterministic or stochastic models. In modeling, there is high value in simplicity (May, 2004).
How Molecular Should Your Molecular Model Be?
211
ACKNOWLEDGMENTS

We thank M. Kaufman for fruitful discussions as well as K. Faust and L. De Mot for helpful comments on the chapter. This work was supported by grant #3.4636.04 from the Fonds de la Recherche Scientifique Médicale (F.R.S.M., Belgium), by the European Union through the Network of Excellence BioSim (Contract No. LSHB-CT-2004-005137), and by the Belgian Program on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office, project P6/25 (BioMaGNet).
Appendix: The Gillespie Algorithm

The Gillespie method is a standard, rigorous algorithm for simulating a stochastic system (Gillespie, 1977). It associates a propensity with each reaction (from which a probability is easily obtained); at each time step, the algorithm stochastically determines which reaction takes place, according to its probability, as well as the time interval to the next reaction. At each step, the numbers of molecules of the reacting species and the reaction propensities are updated. In this approach, a parameter referred to as the system size, denoted Ω, is used to convert concentrations into numbers of molecules. This parameter, which has the unit of a volume, thus permits us to control the number of molecules, and hence the level of noise, in the system. Although the original Gillespie algorithm is rigorous, running such stochastic simulations can be CPU-intensive (even for relatively small systems, millions of reactions may occur). This is why alternative approximate or hybrid approaches have been proposed (for a review of these methods, see Pahle, 2009). As discussed in the main text, another way to limit memory and CPU requirements is to use compact, nonlinear functions (i.e., Michaelis–Menten or Hill functions) to compute the reaction propensities.
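To make the appendix concrete, here is a minimal, illustrative Python implementation of the direct method for a simple birth–death model (the function names, the data structures, and the example parameters are ours, not part of the original chapter):

```python
import math
import random

def gillespie(propensities, stoichiometry, x0, t_end, seed=1):
    """Direct-method SSA (Gillespie, 1977): at each step, draw an
    exponential waiting time from the total propensity, then pick the next
    reaction with probability proportional to its individual propensity."""
    rng = random.Random(seed)
    t, x = 0.0, list(x0)
    times, states = [t], [tuple(x)]
    while t < t_end:
        a = [f(x) for f in propensities]    # current reaction propensities
        a0 = sum(a)
        if a0 == 0:                         # no reaction can fire anymore
            break
        t += -math.log(rng.random()) / a0   # exponential waiting time
        r, target, acc = 0, rng.random() * a0, a[0]
        while acc < target:                 # choose which reaction fires
            r += 1
            acc += a[r]
        for i, change in stoichiometry[r].items():
            x[i] += change                  # update molecule numbers
        times.append(t)
        states.append(tuple(x))
    return times, states

# Birth-death model of one species X: 0 -> X (rate k1), X -> 0 (rate k2*X).
k1, k2 = 10.0, 0.1
times, states = gillespie(
    propensities=[lambda x: k1, lambda x: k2 * x[0]],
    stoichiometry=[{0: +1}, {0: -1}],
    x0=[0], t_end=50.0)
```

For this model, the molecule count fluctuates around the deterministic steady state k1/k2 = 100; shrinking the system size (fewer molecules at the same concentration) would increase the relative noise, as discussed above.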
REFERENCES

Abou-Jaoudé, W., Ouattara, D. A., and Kaufman, M. (2009). From structure to dynamics: Frequency tuning in the p53–Mdm2 network I. Logical approach. J. Theor. Biol. 258, 561–577. Agutter, P. S., and Wheatley, D. N. (2000). Random walks and cell size. BioEssays 22, 1018–1023. Agutter, P. S., Malone, P. C., and Wheatley, D. N. (1995). Intracellular transport mechanisms: A critique of diffusion theory. J. Theor. Biol. 176, 261–272. Agutter, P. S., Malone, P. C., and Wheatley, D. N. (2000). Diffusion theory in biology: A relic of mechanistic materialism. J. Hist. Biol. 33, 71–111. Alon, U. (2007). An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC, Boca Raton.
212
Didier Gonze et al.
Arkin, A., Ross, J., and McAdams, H. H. (1998). Stochastic kinetic analysis of developmental pathway bifurcation in phage lambda-infected Escherichia coli cells. Genetics 149, 1633–1648. Barik, D., Paul, M. R., Baumann, W. T., Cao, Y., and Tyson, J. J. (2008). Stochastic simulation of enzyme-catalyzed reactions with disparate timescales. Biophys. J. 95, 3563–3574. Barkai, N., and Leibler, S. (2000). Circadian clocks limited by noise. Nature 403, 267–268. Becker-Weimann, S., Wolf, J., Herzel, H., and Kramer, A. (2004). Modeling feedback loops of the mammalian circadian oscillator. Biophys. J. 87, 3023–3034. Berry, R. S., Rice, S. A., and Ross, J. (2001). Physical Chemistry (Topics in Physical Chemistry). Oxford University Press, Oxford. Borghans, J. A. M., De Boer, R. J., and Segel, L. A. (1996). Extending the quasi-steady state approximation by changing variables. Bull. Math. Biol. 58, 43–63. Bowen, J. R., Acrivos, A., and Oppenheim, A. K. (1962). Singular perturbation refinement to quasi-steady state approximation in chemical kinetics. Chem. Eng. Sci. 18, 177–188. Briggs, G. E., and Haldane, J. B. (1925). A note on the kinetics of enzyme action. Biochem. J. 19, 338–339. Brooks, C. L., and Gu, W. (2003). Ubiquitination, phosphorylation and acetylation: The molecular basis for p53 regulation. Curr. Opin. Cell Biol. 15, 164–171. Brooks, S. P., and Storey, K. B. (1992). A kinetic description of sequential, reversible, Michaelis–Menten reactions: Practical application of theory to metabolic pathways. Mol. Cell. Biochem. 115, 43–48. Cagatay, T., Turcotte, M., Elowitz, M. B., Garcia-Ojalvo, J., and Suel, G. M. (2009). Architecture-dependent noise discriminates functionally analogous differentiation circuits. Cell 139, 512–522. Cai, X., and Yuan, Z. M. (2009). Stochastic modeling and simulation of the p53–MDM2/ MDMX loop. J. Comput. Biol. 16, 917–933. Cha, S. (1970). Kinetic behavior at high enzyme concentrations. J. Biol. Chem. 245, 4814–4818. 
Chen, L., Wang, R., Kobayashi, T. J., and Aihara, K. (2004). Dynamics of gene regulatory networks with cell division cycle. Phys. Rev. E 70, 011909. Ciliberto, A., Capuani, F., and Tyson, J. J. (2007). Modeling networks of coupled enzymatic reactions using the total quasi-steady state approximation. PLoS Comput. Biol. 3, e45. Cornish-Bowden, A. (1995). Fundamentals of Enzyme Kinetics. Portland Press, London. Elowitz, M. B., and Leibler, S. (2000). A synthetic oscillatory network of transcriptional regulators. Nature 403, 335–338. Elowitz, M. B., Levine, A. J., Siggia, E. D., and Swain, P. S. (2002). Stochastic gene expression in a single cell. Science 297, 1183–1186. Ferrell, J. E., and Machleder, E. M. (1998). The biochemical basis of an all-or-none cell fate switch in Xenopus oocytes. Science 280, 895–898. Flach, E. H., and Schnell, S. (2006). Use and abuse of the quasi-steady-state approximation. Syst. Biol. (Stevenage) 153, 187–191. Florescu, A. M., and Joyeux, M. (2010). Comparison of kinetic and dynamical models of DNA–protein interaction and facilitated diffusion. J. Phys. Chem. A 114, 9662–9672. Forger, D. B., and Peskin, C. S. (2005). Stochastic simulation of the mammalian circadian clock. Proc. Natl. Acad. Sci. USA 102, 321–324. Gardiner, C. W. (2004). Handbook of Stochastic Methods. Springer, Berlin. Gardner, T. S., Cantor, C. R., and Collins, J. J. (2000). Construction of a genetic toggle switch in Escherichia coli. Nature 403, 339–342. Gérard, C., and Goldbeter, A. (2009). Temporal self-organization of the cyclin/Cdk network driving the mammalian cell cycle. Proc. Natl. Acad. Sci. USA 106, 21643–21648.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361. Goldbeter, A. (1995). A model for circadian oscillations in the Drosophila period protein (PER). Proc. Biol. Sci. 261, 319–324. Goldbeter, A. (1996). Biochemical Oscillations and Cellular Rhythms. Cambridge University Press, Cambridge, UK. Gonze, D., Halloy, J., and Goldbeter, A. (2002a). Robustness of circadian rhythms with respect to molecular noise. Proc. Natl. Acad. Sci. USA 99, 673–678. Gonze, D., Halloy, J., and Goldbeter, A. (2002b). Deterministic versus stochastic models for circadian rhythms. J. Biol. Phys. 28, 637–653. Gonze, D., Halloy, J., Leloup, J.-C., and Goldbeter, A. (2003). Stochastic models for circadian rhythms: Effect of molecular noise on periodic and chaotic behavior. C. R. Biol. 326, 189–203. Gonze, D., Halloy, J., and Goldbeter, A. (2004). Emergence of coherent oscillations in stochastic models for circadian rhythms. Physica A 342, 221–233. Goutsias, J. (2005). Quasiequilibrium approximation of fast reaction kinetics in stochastic biochemical systems. J. Chem. Phys. 122, 184102. Grima, R. (2009). Noise-induced breakdown of the Michaelis–Menten equation in steady-state conditions. Phys. Rev. Lett. 102, 218103. Kaern, M., Blake, W. J., and Collins, J. J. (2003). The engineering of gene regulatory networks. Annu. Rev. Biomed. Eng. 5, 179–206. Kar, S., Baumann, W. T., Paul, M. R., and Tyson, J. J. (2009). Exploring the roles of noise in the eukaryotic cell cycle. Proc. Natl. Acad. Sci. USA 106, 6471–6476. Keller, A. D. (1995). Model genetic circuits encoding autoregulatory transcription factors. J. Theor. Biol. 172, 169–185. Klenin, K. V., Merlitz, H., Langowski, J., and Wu, C. X. (2006). Facilitated diffusion of DNA-binding proteins. Phys. Rev. Lett. 96, 018104. Kopelman, R. (1988). Fractal reaction kinetics. Science 241, 1620–1626. Kraus, M., Lais, P., and Wolf, B. (1992).
Structured biological modelling: A method for the analysis and simulation of biological systems applied to oscillatory intracellular calcium waves. BioSystems 27, 145–169. Leloup, J. C. (2009). Circadian clocks and phosphorylation: Insights from computational modeling. Cent. Eur. J. Biol. 4, 290–303. Leloup, J. C., and Goldbeter, A. (2003). Toward a detailed computational model for the mammalian circadian clock. Proc. Natl. Acad. Sci. USA 100, 7051–7056. Lewin, B. (2010). Genes IX. 9th edn. Jones and Bartlett, Sudbury. Loinger, A., and Biham, O. (2007). Stochastic simulations of the repressilator circuit. Phys. Rev. E 76, 051917. Lomholt, M. A., van den Broek, B., Kalisch, S. M., Wuite, G. J., and Metzler, R. (2009). Facilitated diffusion with DNA coiling. Proc. Natl. Acad. Sci. USA 106, 8204–8208. Longo, D. M., Hoffmann, A., Tsimring, L. S., and Hasty, J. (2009). Coherent activation of a synthetic mammalian gene network. Syst. Synth. Biol. 4, 15–23. Macnamara, S., Bersani, A. M., Burrage, K., and Sidje, R. B. (2008). Stochastic chemical kinetics and the total quasi-steady-state assumption: Application to the stochastic simulation algorithm and chemical master equation. J. Chem. Phys. 129, 095105. May, R. (2004). Uses and abuses of mathematics in biology. Science 303, 790–793. McAdams, H. H., and Arkin, A. (1997). Stochastic mechanisms in gene expression. Proc. Natl. Acad. Sci. USA 94, 814–819. McAdams, H. H., and Arkin, A. (1999). It’s a noisy business! Genetic regulation at the nanomolar scale. Trends Genet. 15, 65–69. McQuarrie, D. A., and Simon, J. D. (1999). Physical Chemistry: A Molecular Approach. University Science Books, Sausalito.
Michaelis, L., and Menten, M. L. (1913). Kinetik der Invertinwirkung. Biochem. Z. 49, 333–369. Minton, A. P. (2006). Macromolecular crowding. Curr. Biol. 16, R269–R271. Murray, J. D. (2003). Mathematical Biology. Springer, Berlin. Narang, A. (2007). Effect of DNA looping on the induction kinetics of the lac operon. J. Theor. Biol. 247, 695–712. Ouattara, D. A., Abou-Jaoudé, W., and Kaufman, M. (2010). From structure to dynamics: Frequency tuning in the p53–Mdm2 network. II: Differential and stochastic approaches. J. Theor. Biol. 264, 1177–1189. Ozbudak, E. M., Thattai, M., Kurtser, I., Grossman, A. D., and van Oudenaarden, A. (2002). Regulation of noise in the expression of a single gene. Nat. Genet. 31, 69–73. Ozbudak, E. M., Thattai, M., Lim, H. N., Shraiman, B. I., and Van Oudenaarden, A. (2004). Multistability in the lactose utilization network of Escherichia coli. Nature 427, 737–740. Pahle, J. (2009). Biochemical simulations: Stochastic, approximate stochastic and hybrid approaches. Brief. Bioinform. 10, 53–64. Pettersson, G. (1993). Optimal kinetic design of enzymes in a linear metabolic pathway. Biochim. Biophys. Acta 1164, 1–7. Ptashne, M. (2004). A Genetic Switch: Phage Lambda Revisited. Cold Spring Harbor Laboratory Press, Cold Spring Harbor. Raj, A., and van Oudenaarden, A. (2008). Nature, nurture, or chance: Stochastic gene expression and its consequences. Cell 135, 216–226. Rao, C. V., and Arkin, A. P. (2003). Stochastic chemical kinetics and the quasi-steady-state assumption: Application to the Gillespie algorithm. J. Chem. Phys. 118, 4999–5010. Raser, J. M., and O’Shea, E. K. (2005). Noise in gene expression: Origins, consequences, and control. Science 309, 2010–2013. Rosenfeld, N., Young, J. W., Alon, U., Swain, P. S., and Elowitz, M. B. (2005). Gene regulation at the single-cell level. Science 307, 1962–1965. Rossi, F. M., Kringstein, A. M., Spicher, A., Guicherit, O. M., and Blau, H. M. (2000).
Transcriptional control: Rheostat converted to on/off switch. Mol. Cell 6, 723–728. Sabouri-Ghomi, M., Ciliberto, A., Kar, S., Novak, B., and Tyson, J. J. (2008). Antagonism and bistability in protein interaction networks. J. Theor. Biol. 250, 209–218. Savageau, M. A. (1995). Michaelis–Menten mechanism reconsidered: Implications of fractal kinetics. J. Theor. Biol. 176, 115–124. Schnell, S., and Maini, P. K. (2000). Enzyme kinetics at high enzyme concentration. Bull. Math. Biol. 62, 483–499. Segel, I. H. (1976). Biochemical Calculations: How to Solve Mathematical Problems in General Biochemistry. Wiley, New York. Segel, L. (1988). On the validity of the steady state assumption of enzyme kinetics. Bull. Math. Biol. 50, 579–593. Segel, L., and Slemrod, M. (1989). The quasi-steady state assumption: A case study in perturbation. SIAM Rev. 31, 446–477. Song, H., Smolen, P., Av-Ron, E., Baxter, D. A., and Byrne, J. H. (2007). Dynamics of a minimal model of interlocked positive and negative feedback loops of transcriptional regulation by cAMP-response element binding proteins. Biophys. J. 92, 3407–3424. Stoleriu, I., Davidson, F. A., and Liu, J. L. (2004). Quasi-steady state assumptions for nonisolated enzyme-catalysed reactions. J. Math. Biol. 48, 82–104. Stoleriu, I., Davidson, F. A., and Liu, J. L. (2005). Effects of periodic input on the quasi-steady state assumptions for enzyme-catalysed reactions. J. Math. Biol. 50, 115–132. Tamanini, F., Yagita, K., Okamura, H., and van der Horst, G. T. (2005). Nucleocytoplasmic shuttling of clock proteins. Methods Enzymol. 393, 418–435. Thomas, R., and D’Ari, R. (1990). Biological Feedback. CRC Press, Boca Raton.
van Kampen, N. G. (2007). Stochastic Processes in Physics and Chemistry. Elsevier, Amsterdam. von Hippel, P. H., and Berg, O. G. (1989). Facilitated target location in biological systems. J. Biol. Chem. 264, 675–678. Yang, H. T., Hsu, C. P., and Hwang, M. J. (2007). An analytical rate expression for the kinetics of gene transcription mediated by dimeric transcription factors. J. Biochem. 142, 135–144.
CHAPTER EIGHT
Computational Modeling of Biological Pathways by Executable Biology

Maria Luisa Guerriero* and John K. Heath†

Contents
1. Introduction
2. Executable Modeling Languages for Biology
  2.1. Petri nets
  2.2. Rewriting systems
  2.3. Process algebra
3. Intuitive Representation of Formal Models
  3.1. The Narrative Language: A high-level executable textual language for biology
4. Case Studies
  4.1. The JAK/STAT pathway
  4.2. Circadian clocks
5. Conclusions and Perspectives
Acknowledgments
References
Abstract

“In silico” experiments (i.e., computer simulations) are an aid to traditional biological research: they allow biologists to execute efficient simulations that take into consideration the data obtained in wet experiments and to generate new hypotheses, which can later be verified in additional wet experiments. In addition to being much cheaper and faster than wet experiments, computer simulation has other advantages: it allows us to run experiments in which several species are monitored at the same time, to explore various conditions quickly by varying species and parameters in different runs, and, in some cases, to observe the behavior of the system at a greater level of detail than that permitted by experimental techniques.

* Centre for Systems Biology at Edinburgh, University of Edinburgh, Edinburgh, United Kingdom
† School of Biosciences and Centre for Systems Biology, University of Birmingham, Edgbaston, Birmingham, United Kingdom
Methods in Enzymology, Volume 487 ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87008-4
© 2011 Elsevier Inc. All rights reserved.
217
218
Maria Luisa Guerriero and John K. Heath
In the past few years, there has been a considerable effort in the computer science community to develop computational languages and software tools for modeling and analyzing biochemical systems. Among the challenges that must be addressed in this context are the definition of languages powerful enough to express all the relevant features of biochemical systems, the development of efficient algorithms to analyze models and interpret the results, and the implementation of modeling platforms which are usable by nonprogrammers. In this chapter, we focus on the application of computational modeling to the analysis of biochemical systems. Computational modeling, in conjunction with formal yet intuitive modeling languages, enables biologists to define models using a notation very similar to the informal descriptions they commonly use, but one that is formal and, hence, automatically executable. We describe the main features of the existing textual computational languages and the tool support available for model development and analysis.
1. Introduction

There are multiple incentives for the generation and analysis of robust models of biological processes (Kwiatkowska and Heath, 2009). At the simplest level, models provide a convenient way of storing and sharing knowledge of a particular system. Models potentially become more powerful when they offer the means to analyze processes without having to conduct laboratory experiments. If the model is an accurate representation of the natural process, dynamic analysis of model behavior allows the investigator to reason about the process in question with the objective of identifying logical flaws or areas of ignorance, testing specific hypotheses or scenarios, and revealing behaviors that might not be predicted by biological intuition. Models can also be used to simulate experimental outcomes with the purpose of prioritizing experiments when resources are limited or identifying optimal experimental conditions to study particular outcomes in the laboratory. If models are a sufficiently faithful emulation of the natural process, they can be used to make confident predictions of future behavior under specific constraints, which would have great practical value in, for example, predicting optimal therapeutic interventions in a disease state. The most common approach to modeling biological processes is the use of mathematical formalisms in which the relationships between quantities of reactants that change over time are articulated in the form of equations which, for example, relate the values of an input and an output. Mathematical models can be studied either by analytical methods or by numerical simulation, depending on their complexity. Mathematical models can be constructed from a priori descriptions of the biological pathway (Aldridge et al., 2006) or by fitting models to observed biological data (Janes and Yaffe, 2006).
Computational Modeling of Biological Pathways by Executable Biology
219
In recent times, stimulated by the suggestions of Regev and Shapiro (2002), biological processes have been modeled using the formalisms of computer science. In this approach, the behavior of the system is described by algorithms, that is, rules which specify the behavior of the system in the form of transitions between states. This approach has been termed “Executable” or “Algorithmic” biology (Fisher and Henzinger, 2007; Priami, 2009) as, in effect, an algorithm represents an executable computer program which can be analyzed with the large body of analytical and simulation tools developed by computer scientists. The algorithmic approach is especially useful for analyzing biological processes which are nondeterministic (i.e., execution of a process may lead to different outcomes), concurrent (i.e., multiple processes execute at the same time and may exhibit interdependencies), and/or stochastic (i.e., system evolution depends on the likelihood of different reactions occurring). The emphasis in this approach to modeling is on the logic, or the temporal and causal relationship, between events in the biological process; as discussed below, this makes it particularly suited to in silico interrogation of biological models by changing parameter values or by formally verifying the satisfaction of specific system properties. At present, executable methods are based on a priori models built upon prevailing biological knowledge, although approaches to the automated construction of executable models from biological data have been suggested (Heath, 2009; Vyshemirsky and Girolami, 2007). As a consequence of these developments, in the past few years there has been a considerable effort in the computer science community to develop computational languages and software tools for modeling and analyzing biochemical systems.
Several challenges need to be addressed in this context, such as the definition of languages powerful enough to express all the relevant features of biochemical systems, the development of efficient algorithms to analyze models and interpret the results, and the implementation of modeling platforms which are usable by nonprogrammers. A compromise between language expressiveness and ease of use is needed. Tools traditionally used for biochemical modeling allow users to develop models as complex and detailed as desired, but they require users to write mathematical formulas, a process which can be complex and error-prone. A number of tools for modeling and simulation have now been developed to hide as much as possible of the mathematical detail from users. Some of these tools provide easy-to-use graphical user interfaces which free the end-user from knowing the details of the underlying modeling language and of the simulation methods: systems can be modeled using the standard notation of biochemical reactions,

reactant1 + ... + reactantn → product1 + ... + productm

and reaction rates can be selected from a set of predefined rate laws. Examples of such tools are the SimBiology toolbox for MATLAB and COPASI.
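To illustrate what a tool must do with this standard reaction notation, the following Python sketch (a toy helper of our own, not taken from any of the tools mentioned above) converts a reaction string into stoichiometry tables:

```python
def parse_reaction(text):
    """Parse a reaction such as 'A + 2 B -> C' into a pair of dicts
    mapping each species name to its stoichiometric coefficient."""
    def side(expr):
        stoich = {}
        for term in expr.split("+"):
            parts = term.split()
            if not parts:
                continue
            if parts[0].isdigit():           # explicit coefficient, e.g. '2 B'
                coeff, species = int(parts[0]), " ".join(parts[1:])
            else:                            # implicit coefficient of 1
                coeff, species = 1, " ".join(parts)
            stoich[species] = stoich.get(species, 0) + coeff
        return stoich
    lhs, rhs = text.split("->")
    return side(lhs), side(rhs)

reactants, products = parse_reaction("E + S -> ES")
print(reactants)  # {'E': 1, 'S': 1}
print(products)   # {'ES': 1}
```

From such stoichiometry tables, a tool can assemble either the ODE system (via a chosen rate law) or the state-update rules of a stochastic simulator, which is exactly the translation these platforms hide from the user.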
An alternative approach, on which we focus in this chapter, is the use of intuitive textual notations, that is, “narrative-style” modeling languages which aim to be as close as possible to the informal descriptions biologists often use to describe models (such as “Protein A binds to B to form a complex which can then phosphorylate. The phosphorylated complex becomes active.”). Examples of these languages are the Narrative Language (NL; Guerriero et al., 2009) and the high-level textual interface of the SPiM tool (Kahramanoğulları et al., 2009). In this chapter, we focus on the existing high-level languages and in particular on the textual approach; with the help of two examples, we describe the main language features and the tool support available for model development and execution.
2. Executable Modeling Languages for Biology

Constructing models to represent complex biological systems involves defining techniques to develop and simulate models and to analyze the obtained results. In addition to simulation, formal models enable modelers to analyze systems: for example, formal properties of models can be verified to uncover causal relations between events, reachability of specific states, or equivalences between different systems. Therefore, techniques for verifying such properties should also be developed to fully benefit from the formal modeling of biological systems. Several formal methods have been proposed to model, analyze, and simulate biological systems, such as Petri nets (Reisig, 1985), rewriting systems (e.g., membrane systems, Păun, 2002; and Kappa, Danos and Laneve, 2004), and process algebras (e.g., the biochemical stochastic π-calculus, Priami et al., 2001; BioAmbients, Regev et al., 2004; Beta-binders, Priami and Quaglia, 2005; Bio-PEPA, Ciocchetta and Hillston, 2009). Since each of these approaches has different attributes, each can be more suitable for a particular class of problems or a particular level of abstraction. The model analysis techniques that are available differ among the various formalisms. Most of them are equipped with a discrete stochastic semantics, and some also allow for a continuous deterministic interpretation. For some of these languages, it is also possible to employ formal verification techniques such as model-checking to formally prove model correctness or identify existing errors: see Clarke et al. (1999) for an introduction to model-checking and Kwiatkowska et al. (2002) for a description of the well-known stochastic model-checker PRISM (available from PRISM). These kinds of computational models can either be analyzed statically via techniques which work at the level of model structure, or can be
dynamically executed via stochastic simulation techniques such as, for instance, Gillespie’s stochastic simulation algorithm (Gillespie, 1977), which produce time-course observations of the amounts of the participating species. For languages which have a deterministic interpretation, numerical solution of the associated set of ordinary differential equations (ODEs) can also be performed, together with the various mathematical methods available for the analysis of ODE systems (e.g., bistability, bifurcation, and continuation analysis). Existing simulation tools generally also allow modelers to perform model experimentation in a simple way (parameter sensitivity analysis, component knockdown, dose-response experiments, etc.).

Stochasticity has been shown to be very relevant in modeling biological systems such as genetic networks, which involve small numbers of molecules and can consequently exhibit significant variability due to noise. In addition, analysis via stochastic simulation can provide modelers with an interesting parallel between simulation results and experimental data: the results of a single run of stochastic simulation can be considered the modeling counterpart of single-cell data, while results obtained by averaging over multiple runs are equivalent to cell-population data. These considerations are particularly interesting for systems exhibiting periodic/oscillatory behavior, where the robustness of the system over a population of cells might be affected by phase asynchrony.

In this section, we give a brief overview of the mentioned formalisms, presenting their main features, advantages, and disadvantages. In the rest of the chapter, we focus on process algebras, and in particular on Bio-PEPA and Beta-binders, but the concepts we describe can be similarly applied to the other formal languages mentioned here.
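Before turning to the individual formalisms, the single-cell versus cell-population parallel drawn above can be made concrete with a toy example. For pure first-order decay, each molecule independently survives to time t with probability exp(-k*t), so exact stochastic trajectories can be sampled directly (the code and names below are ours):

```python
import math
import random

def decay_run(n0, k, t, rng):
    """One exact stochastic realization of pure degradation X -> 0:
    each of the n0 molecules independently survives to time t
    with probability exp(-k*t)."""
    return sum(1 for _ in range(n0) if rng.random() < math.exp(-k * t))

rng = random.Random(42)
n0, k, t = 100, 0.5, 1.0
runs = [decay_run(n0, k, t, rng) for _ in range(500)]

single_cell = runs[0]                  # one noisy "cell"
population = sum(runs) / len(runs)     # average over many "cells"
deterministic = n0 * math.exp(-k * t)  # ODE solution of dX/dt = -k*X
```

A single run scatters around the deterministic value (the single-cell analogue), while the average over 500 runs lies very close to it (the population analogue); for oscillatory systems, the same averaging can wash out rhythms when individual runs drift out of phase.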
2.1. Petri nets

A Petri net model of a biochemical system is an automaton whose states represent molecular species and whose transitions represent reactions transforming reactants into products. Graphically, the automaton is a graph in which nodes representing states are connected by arcs to nodes representing transitions (Reisig, 1985). Petri nets are able to represent generic biochemical systems, and several variants (e.g., timed, stochastic, continuous, hybrid) have been defined in order to better address specific types of systems or to perform specific types of analysis. Petri nets rely on well-established mathematical foundations which enable both transient and steady-state solutions of models to be obtained. Moreover, static analysis of Petri net models can be used to uncover modeling errors: for instance, reachability analysis can identify parts of the model which are not connected, boundedness analysis can ensure that uncontrolled growth of molecules is not possible, and invariant analysis can identify violations of the law of conservation of mass. See Peleg et al. (2002)
and Heiner et al. (2008) for more details and examples. Petri nets are among the best-known formalisms for biological modeling, and owe their popularity to their intuitive graphical representation, which makes models easily understandable even for nonmodelers. This strength, however, becomes a limitation for large or highly interconnected models, which are too intricate to represent graphically. Several tools for the development and analysis of Petri net models are available from Petri Nets World.
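The place/transition semantics can be sketched in a few lines of Python (toy classes of our own, not one of the tools mentioned above): places hold token counts standing for molecule numbers, and a transition fires only when every input place holds enough tokens.

```python
class PetriNet:
    """Minimal place/transition net: a marking maps each place to its
    token count; a transition consumes tokens from its input places and
    produces tokens on its output places."""
    def __init__(self, marking):
        self.marking = dict(marking)      # place -> token count
        self.transitions = {}             # name -> (inputs, outputs)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking[p] >= n for p, n in inputs.items())

    def fire(self, name):
        inputs, outputs = self.transitions[name]
        if not self.enabled(name):
            raise ValueError(f"{name} is not enabled")
        for p, n in inputs.items():
            self.marking[p] -= n
        for p, n in outputs.items():
            self.marking[p] = self.marking.get(p, 0) + n

# The enzymatic binding step E + S -> ES as a single transition.
net = PetriNet({"E": 1, "S": 3, "ES": 0})
net.add_transition("bind", inputs={"E": 1, "S": 1}, outputs={"ES": 1})
net.fire("bind")
print(net.marking)  # {'E': 0, 'S': 2, 'ES': 1}
```

Boundedness and invariant analysis, mentioned above, operate on exactly this structure: for instance, the quantity E + ES is conserved by the "bind" transition, a simple place invariant.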
2.2. Rewriting systems

Rewriting systems consist of a set of objects and a set of relations over them which specify how to transform one set of objects into another. In the case of biochemical systems, the objects represent molecular species and the relations represent the biochemical reactions. The dynamic evolution of a system is obtained by applying the relations starting from the set of objects initially present. Rewriting relations can be deterministic, nondeterministic, or stochastic. Though there are a number of notably different formalisms which can be grouped into the category of rewriting systems, the syntax of most of them is similar, and it is also close to the biochemical reaction equations traditionally used to describe biochemical systems. We mention here two of the best-known rewriting-system formalisms used for biochemical modeling.

2.2.1. Membrane systems
Membrane systems, or P systems (Păun, 2002), are computational models inspired by the notion of cellular membranes and the observation that complex biological systems are composed of independent computing processes separated by, and communicating through, membranes. The evolution of a membrane system starting from an initial configuration of membranes and objects is obtained by repeated application of evolution rules, representing biochemical reactions and the movement of objects across membranes. Membrane systems are a powerful method for describing and simulating multicompartment models involving mainly intracompartment reactions and limited intercompartment exchange of information. Several simulators for membrane systems are available from PSystems.

2.2.2. Kappa
Kappa (Danos and Laneve, 2004) is a rule-based language for molecular biology which focuses on the description of protein interaction and complexation.
Agents representing biochemical species can have explicit interfaces (i.e., interaction and modification sites), and a set of rewriting rules describes the evolution of molecules in terms of activation and complexation of their interfaces. Each rule, representing a biochemical reaction,
specifies how agents’ interfaces change following its occurrence, and the constraints that agents need to satisfy in order to be involved in it. An interesting feature of Kappa is that rules can include agents with partially defined interfaces (called patterns), which allows modelers to define generic rules by omitting the parts of agents’ interfaces which are not relevant to their occurrence. This partial form of rules makes Kappa particularly well suited to representing systems exhibiting a high degree of symmetry: in this case, the number of pattern rules is potentially much smaller than the number of reactions, leading to a huge gain in the compactness of the system description. A set of modeling and analysis tools for Kappa, available from RuleBase, can be used to perform stochastic simulation and additional analysis, such as the identification of stories in model evolution, which represent causality chains.
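The pattern idea can be mimicked in a few lines of Python (a toy illustration only; real Kappa has its own syntax and semantics for agents, sites, and bonds): a rule constrains only the sites it mentions, so one rule covers every agent configuration of the unmentioned sites.

```python
def apply_rule(agents, pattern, update):
    """Apply a pattern rule to the first matching agent. Sites absent
    from the pattern are ignored, so a single rule stands for many
    concrete reactions."""
    for agent in agents:
        if all(agent.get(site) == value for site, value in pattern.items()):
            agent.update(update)
            return True
    return False

# Agents with two sites; the phosphorylation rule only constrains site 'y',
# so it applies whether or not the agent is bound.
agents = [{"y": "u", "bound": True}, {"y": "u", "bound": False}]
rule_pattern = {"y": "u"}   # match any agent with unphosphorylated y
rule_update = {"y": "p"}    # ...and phosphorylate it

apply_rule(agents, rule_pattern, rule_update)
print(agents[0])  # {'y': 'p', 'bound': True}
```

Writing the same model as plain reactions would require one reaction per combination of the unconstrained sites, which is exactly the combinatorial blow-up that pattern rules avoid.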
2.3. Process algebra Process algebras were originally developed to give semantics to concurrent programs, and were used to model computer networks and mobile communication systems, to specify communication and security protocols, and to verify their correctness (Milner, 1989). Recently, following Regev and Shapiro’s (2002) landmark paper, process algebras have also proved to be powerful tools for quantitative dynamical modeling of biological systems. The basic components of a process algebra model are concurrent processes which can interact via communication and exchange of information. Processes have internal states and interaction capabilities. When a process receives an input, its behavior depends on its internal state and on the content of the input. A direct consequence of an interaction can be the modification of the internal state and of the interaction capabilities of the process. Complex entities can be described hierarchically as the composition of smaller entities. The abstraction of biochemical systems proposed in Regev et al. (2001) consists of representing the molecules composing biological systems (e.g., proteins, genes, etc.) by computational processes which can interact by means of the standard primitives of process algebras. The occurrence of a reaction is represented by an interaction between processes representing its reactants and products, whose effect is a change in the state of the involved processes to represent the transformation of reactants into products (Table 8.1).

Table 8.1 The “molecules as processes” abstraction of process algebra

| Molecule | Interaction capability | Interaction   | Modification                |
|----------|------------------------|---------------|-----------------------------|
| Process  | Channel                | Communication | State and/or channel change |
| Cell     | Interaction capability | Signal        | State change                |
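Models built from this abstraction are typically executed by Gillespie-style stochastic simulation. The following Python sketch of Gillespie's direct method (a generic illustration, not the algorithm of any specific process algebra tool) shows how discrete molecule counts evolve reaction by reaction:

```python
import random

def gillespie(counts, reactions, t_end, seed=1):
    """Minimal Gillespie direct-method SSA. Each reaction is a tuple
    (rate, reactants, products), with reactants/products given as
    dicts mapping species name -> stoichiometric coefficient."""
    rng = random.Random(seed)
    t, trace = 0.0, [(0.0, dict(counts))]
    while True:
        # Propensity of each reaction from current molecule counts
        props = []
        for rate, reactants, _ in reactions:
            a = rate
            for sp, n in reactants.items():
                for i in range(n):
                    a *= max(counts[sp] - i, 0)
            props.append(a)
        total = sum(props)
        if total == 0:
            break                       # no reaction can fire
        dt = rng.expovariate(total)     # time to next reaction
        if t + dt > t_end:
            break
        t += dt
        pick = rng.uniform(0, total)    # which reaction fires
        for (rate, reactants, products), a in zip(reactions, props):
            if pick < a:
                for sp, n in reactants.items():
                    counts[sp] -= n
                for sp, n in products.items():
                    counts[sp] += n
                break
            pick -= a
        trace.append((t, dict(counts)))
    return trace

# Toy bimolecular binding A + B -> AB with an illustrative rate
trace = gillespie({"A": 100, "B": 100, "AB": 0},
                  [(0.01, {"A": 1, "B": 1}, {"AB": 1})], t_end=5.0)
```

Note that the bookkeeping is entirely in terms of molecule counts, which is the discrete, stochastic semantics discussed below.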
Maria Luisa Guerriero and John K. Heath
Process algebras are inherently discrete (i.e., the amounts of biochemical species are given in terms of molecule counts rather than concentrations as in differential equations), and are generally equipped with a stochastic semantics (i.e., reactions are associated with a probability of occurrence rather than a rate). The behavior of models can be obtained by stochastic simulation or by other kinds of formal analysis: for instance, reachability and causality analysis enable modelers to identify interesting behaviors in a model. Moreover, process algebras come equipped with well-assessed equivalence relations which could be powerful tools for biology: for example, the equivalence of the same functional unit in different organisms could be used as a measure of behavioral and structural similarity. Several process algebras have been developed and applied to the modeling of biological systems. Though the underlying concepts are the same for all of them, there are some noteworthy differences in the considered level of abstraction and in the language operators. We mention here the process algebras which have been most widely used for biochemical modeling. 2.3.1. Biochemical p-calculus The biochemical stochastic p-calculus (Priami et al., 2001) is an extension of the p-calculus (Milner, 1999) specifically designed for biochemical modeling, in which interactions are represented as input/output communications between pairs of processes and the “molecules as processes” abstraction can be applied straightforwardly. The biochemical p-calculus can rely on the strong formal foundations of the p-calculus, for which causality, locality, and equivalence relations have been defined. The p-calculus style of modeling biochemical interactions and complex formation is elegant, but it can sometimes cause models to be too abstract and hard to understand. 
For instance, the use of binary communications does not allow modelers to represent in a straightforward way interactions involving three or more processes, or reactions with arbitrary stoichiometric coefficients. Successive extensions of the biochemical p-calculus focused on the definition of new operators to make the description of specific features of systems easier (e.g., compartments and membranes, protein binding, or allosteric sites). The two most common simulation tools for the biochemical p-calculus are SPiM and BioSPI. 2.3.2. BioAmbients BioAmbients (Regev et al., 2004) extends the biochemical p-calculus by allowing processes to nest in order to represent hierarchies of cellular and subcellular compartments and membranes. Interactions are divided into local (i.e., within one compartment) and nonlocal (i.e., between one compartment and the one enclosing it), hierarchies of compartments are dynamic, and processes can move in and out of compartments by synchronizing with membranes.
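The compartment discipline can be caricatured in a few lines of Python (hypothetical names; BioAmbients itself expresses this with enter/exit/merge capabilities on ambients):

```python
# Nested compartments as a parent-pointer tree; a process may only
# move between its current compartment and one directly enclosing
# or directly enclosed by it, mirroring BioAmbients' local moves.
parent = {"cell": None, "cytoplasm": "cell", "nucleus": "cytoplasm"}
location = {"STAT3": "cytoplasm"}

def move(process, target):
    here = location[process]
    if parent.get(target) == here or parent.get(here) == target:
        location[process] = target
        return True
    return False

print(move("STAT3", "nucleus"))  # True: nucleus is directly inside cytoplasm
print(move("STAT3", "cell"))     # False: not a local move from the nucleus
```

Only the adjacency check changes when compartments rearrange dynamically, which is why the formalism copes well with movement across membranes.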
Computational Modeling of Biological Pathways by Executable Biology
Thanks to its notions of localization of processes and hierarchy of compartments, BioAmbients is ideal for describing systems whose focus is on the movement of objects across membranes and on the dynamic rearrangement of compartments. BioAmbients models can be analyzed via stochastic simulation using the BAM tool (BAM). In addition, static analysis techniques such as control flow analysis (Nielson et al., 2007) and pathway analysis (Pilegaard et al., 2008) have been defined for BioAmbients and can be used to investigate causal properties of models. 2.3.3. Beta-binders Beta-binders (Priami and Quaglia, 2005) is another extension of p-calculus, whose focus is on the description of interaction and complexation of molecules via communication between compatible protein sites. Processes representing proteins are enclosed in boxes with interaction capabilities representing their allosteric/binding sites, and interactions between two processes are allowed only if their sites have compatible types (according to a user-defined notion of compatibility). A basic notion of localization and movement of processes within a (static) compartment hierarchy is also defined. The notion of compatibility of interaction sites and the definition of primitives for making interaction sites available or not and for describing their modification make the representation of binding reactions intuitive and the description of reactions considerably more informative than with only input/output communication as in the p-calculus. Interactions, however, are still binary, and hence generic reactions involving more than two molecular species cannot be defined explicitly, thus limiting the ability of modelers to define abstract reactions. The BetaWB is a collection of tools for modeling, simulating, and analyzing models described in the BlenX programming language, which is based on Beta-binders. 2.3.4. 
Bio-PEPA In the “molecules as processes” abstraction used for the process algebras mentioned so far, a process represents a molecule, and the occurrence of a reaction represents the change of state in the reactant and product molecules involved. The abstraction underlying Bio-PEPA (Ciocchetta and Hillston, 2009), called “species as processes,” is slightly different: a process represents a molecular species, and the occurrence of a reaction represents the change in available amount of the involved molecules. Processes interact by means of shared action names representing reactions and specifying their role in the reaction (reactant, product, inhibitor, etc.) and their stoichiometric coefficient for that reaction; the effect of a reaction occurrence is to decrease the amount of reactants and increase the amount of products. Generic reaction rate laws and multiway synchronization can be used, thus allowing the modeling of generic reactions with any number of interacting processes. A basic notion of localization and movement similar to Beta-binders is defined.
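The "species as processes" bookkeeping can be sketched in Python (an illustration of the idea, not Bio-PEPA syntax): a reaction records stoichiometric coefficients for every participating species, so any number of species can take part in one multiway synchronization.

```python
# Species amounts indexed by name; a reaction records stoichiometric
# coefficients for reactants and products, so any number of species
# can take part (multiway synchronization).
amounts = {"A": 100, "B": 50, "C": 0}
reaction = {"reactants": {"A": 2, "B": 1}, "products": {"C": 1}}

def fire(amounts, reaction):
    if any(amounts[s] < n for s, n in reaction["reactants"].items()):
        return False                      # not enough molecules
    for s, n in reaction["reactants"].items():
        amounts[s] -= n
    for s, n in reaction["products"].items():
        amounts[s] += n
    return True

fire(amounts, reaction)   # 2A + B -> C
print(amounts)            # {'A': 98, 'B': 49, 'C': 1}
```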
The syntax of Bio-PEPA is very similar to that of a system of ODEs (indeed, the translation from ODEs to Bio-PEPA is mechanical) and, therefore, it is quite easy for modelers familiar with ODE models to understand the meaning of Bio-PEPA models. On the other hand, as for ODEs, models must satisfy a precise structure: this makes it hard to exploit symmetries and patterns for making models more compact, which can be an issue for large models. Species amounts in Bio-PEPA can either be molecule counts (discrete semantics) or concentrations (continuous semantics), hence allowing both stochastic analysis methods and numerical methods for differential equations. An intermediate representation in terms of discrete levels of concentration is also possible, whose purpose is to make the system smaller and bounded (in terms of the range of values the species can assume) so that model-checking techniques can be applied to Bio-PEPA models. The Bio-PEPA Eclipse Plug-in, available from Bio-PEPA, is a framework for model development and analysis, which enables modelers to perform static analysis (e.g., invariant analysis and identification of sources and sinks), dynamic time-series analysis (stochastic simulation and solution of ODEs), and to export models for analysis via other simulation, model-checking, and analysis tools.
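The bounded "levels of concentration" abstraction can be sketched as follows (a generic illustration; Bio-PEPA itself fixes a step size h and a maximum level per species):

```python
def to_level(amount, h, max_level):
    """Map a concentration onto one of 0..max_level discrete levels
    of step size h, clamping at the top so the state space is finite."""
    return min(int(amount / h), max_level)

# With step h = 100 and 4 levels, every possible amount of a species
# falls into one of only 5 states, which keeps the state space small
# and bounded enough for exhaustive model checking.
print(to_level(0, 100, 4))       # 0
print(to_level(250, 100, 4))     # 2
print(to_level(10_000, 100, 4))  # 4 (clamped)
```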
3. Intuitive Representation of Formal Models Central to the process algebra approach to biological modeling is the absolute requirement to articulate the biological process in a form that conforms to the rules of formal logic (thus allowing computational execution) and has an unambiguous syntax (to allow translation into an executable program) but, at the same time, faithfully and accurately represents the biological process under consideration. It would also be advantageous if the method of model formulation were readily understood by both biologists and computer scientists. A further advantage would be that the biological formulation is “portable,” that is, formatted in a fashion which can be employed by a variety of different computational tools. Formal computer languages are, unfortunately, hard for the nonspecialist to understand and use. Moreover, in general, the more formal the language is, the more complex it is for noncomputer scientists to write and understand models. Developing a model using process algebras, for example, requires modelers to learn the meaning of a specific set of language operators (usually not straightforward, being quite different from traditional biochemical reaction descriptions). In addition, in order to be unambiguous, formal models must comply with a precise structure and generally cannot include any sort of uncertainty in either model structure or quantitative aspects. However, formal foundations of descriptions are mandatory requirements
in order to enhance the understanding of complex biological systems and to perform automatic simulation and analysis of models. Computer science modeling is specifically designed to meet the above requirements, but it should hide as many technical details as possible from users in order to be usable by nonexpert users. The definition of intuitive notations for modeling biological systems is an area of active investigation, and several informal notations—both graphical and textual—have been proposed. The drawback of many notations classically employed in the biological literature is that they can be ambiguous in construction or syntax and conceal important features of the pathway—such as spatial or temporal features—from the naïve viewer. Hence it is not possible to have a direct translation of a model developed using these notations into a computational executable model that can be simulated on a computer. Two broad classes of biological notation have been developed to address this challenge: graphical and textual notations. In graphical notations, systems are described by arrows (representing reactions) connecting boxes (representing the involved molecules), and the localization of molecules can also be visually represented. Several graphical notations have been developed, among which are Kohn Molecular Interaction Maps (Kohn, 1999) and SBGN (SBGN Home Page), and several tools (e.g., CellDesigner, Funahashi et al., 2003, available from CellDesigner, VCell, KEGG, and EPE; Sorokin et al., 2006, available from EPE) allow some sort of graphical representation. The clear advantage of graphical notations is that they are intuitive for the biologist and hence they make it easy to build and work with models. However, graphical notations tend to be ambiguous, the interpretation of the meaning of their components is often left to the simulator rather than being standardized, and temporal and/or spatial aspects are frequently not portrayed. 
In addition, viewing large or highly interconnected models and searching through them is often cumbersome. A second graphical approach draws directly on computer science formalisms for representing processes as state machines. Statecharts (Harel, 1987) is a formal graphical method which extends state diagrams to express hierarchical, concurrent, and communicating processes and is thus particularly suited to the description of biological processes. The Statechart approach has recently been modified to explicitly encompass biological processes in the form of Biocharts (Kugler et al., 2010). Statecharts can be nested, which allows representation of a process at different hierarchical levels (for example, in the biological context, the behavior of pathways within and between cells). They have proved an effective tool for representing biological processes for computational execution, for both pathways (Fisher et al., 2005) and multiscale modeling of tissues (Setty et al., 2008). The alternative approach is to employ textual representations. These are intuitively closer to a set of program instructions, can be easily modularized
and equipped with a formal syntax and grammatical structure which can be presented in a way which is biologically intuitive but retains the ability to describe concurrent, hierarchical, or temporal features. Despite not being as compact and intuitive as graphical representations, text-based descriptions of biological processes generally become more readable than graphical descriptions when employed to represent large, highly interconnected systems. Systems Biology Markup Language (SBML Home Page; Hucka et al., 2003) is a machine-readable language based on XML which has become the de facto standard for sharing biological pathway models and has the benefit of a large community of users and developers. SBML has been designed as a storage/exchange language and, hence, as a machine-readable rather than human-readable language, and its syntax is too heavy for users to read and write SBML models by hand. On the other hand, most modeling frameworks nowadays allow import/export of SBML models. SBML models are also available via an online repository (BioModels Database) for community sharing. Tools for translating SBML into process algebras have been described (Ciocchetta et al., 2008; Eccher and Priami, 2006). SPiM, one of the simulation tools for the biochemical p-calculus mentioned earlier, has been equipped with a high-level interface for model development and visualization. Models can be described either graphically or textually. In the graphical notation (Phillips et al., 2006), a system is represented as a directed graph with nodes for processes and arcs for interactions; in the high-level textual representation (Kahramanoğulları et al., 2009), a system is described by a set of constrained English sentences, in the style of the NL described in the following section. In both cases, these notations are automatically translated into an SPiM program which can be simulated. 
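To give a flavor of SBML's machine-oriented XML syntax, the snippet below reads a drastically hand-simplified Level 2 fragment with Python's standard library; real SBML files also declare compartments, reactions, kinetic laws, units, and more.

```python
import xml.etree.ElementTree as ET

# A drastically simplified SBML Level 2 fragment (real models carry
# much more structure: compartments, reactions, kinetic laws, etc.).
sbml = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy">
    <listOfSpecies>
      <species id="STAT3" initialAmount="3000"/>
      <species id="SOCS3" initialAmount="0"/>
    </listOfSpecies>
  </model>
</sbml>"""

ns = {"s": "http://www.sbml.org/sbml/level2"}
root = ET.fromstring(sbml)
species = {sp.get("id"): float(sp.get("initialAmount"))
           for sp in root.findall(".//s:species", ns)}
print(species)  # {'STAT3': 3000.0, 'SOCS3': 0.0}
```

Even this tiny example shows why SBML is better suited to tool-to-tool exchange than to hand authoring.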
In the rest of this section, we describe in detail the NL, a high-level textual language which has recently been defined for biochemical modeling and from which process algebra models can be automatically generated.
3.1. The Narrative Language: A high-level executable textual language for biology A high-level textual modeling language called NL has been presented in Guerriero et al. (2009). The motivation which led to the definition of the NL is to allow modelers to formulate models in an intuitive way—as close as possible to the sort of informal description commonly employed by biologists to describe systems—but in such a way that this formulation can be easily translated into executable models. The NL is a narrative-style language, in which the sequence of events occurring in a biochemical system is described by means of a sequence of semiformal textual rules. The main focus of the language is the representation of the molecules
(e.g., proteins, genes, metabolites, enzymes) involved in cellular processes such as signaling pathways and of the dynamic changes which they undergo, in terms of state changes and of modifications of their binding/allosteric sites. Some basic geometrical constraints can also be specified by including topological information relative to the compartments in which the molecules are located. Finally, it is possible to specify different sorts of temporal/causal relationships between events (sequential, concurrent, and competing events). Each numerical value (i.e., initial amounts, kinetic parameters, and compartment sizes) can be assigned a “reliability value,” which is a number describing its reliability, ranging from 100% for precise values obtained from high-quality experiments, to 0% for values with no experimental evidence. Reliability values are optional and do not influence the behavior of the program: they are annotations to inform use of the model. The basic entities of the language are components (proteins, genes, etc.) and compartments (cellular and subcellular locations). Molecules can interact (e.g., bind/unbind), undergo biochemical modification (e.g., phosphorylation/dephosphorylation), and move between compartments. The dynamic behavior of a model is described in the form of a narrative of events involving the basic entities, which imposes a temporal sequence and defines interdependencies between events. A compartment can represent a cellular or subcellular compartment (e.g., nucleus, cytoplasm, cell membrane) or an abstract location; it is described by an identifier, a name, the size, and the number of spatial dimensions (to distinguish between 2D compartments, i.e., membranes, and 3D ones). As an example, Table 8.2 shows the compartments involved in the gp130/JAK/STAT model presented in Guerriero et al. (2009). 
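Internally, such compartment records amount to simple structured data. A hypothetical Python rendering (field names follow Table 8.2; the sizes are the published values as reconstructed here, and the optional reliability annotation is left unset):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Compartment:
    id: int
    name: str
    size: float
    unit: str
    dimensions: int                       # 2 for membranes, 3 for volumes
    reliability: Optional[float] = None   # optional annotation, 0-100%

compartments = [
    Compartment(1, "Exosol",        9.91e-12, "l",   3),
    Compartment(2, "Cell_membrane", 1.26e-7,  "dm2", 2),
    Compartment(3, "Cytoplasm",     2.09e-12, "l",   3),
    Compartment(4, "Nucleus",       0.25e-12, "l",   3),
]
by_id = {c.id: c for c in compartments}
print(by_id[3].name)  # Cytoplasm
```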
A component represents a molecular species and it is identified by a name, an informal description, a list of interaction sites, a list of states, a list of locations, and the initial quantity of molecules present. Table 8.3 shows an example of model components, a subset of the components of the gp130/JAK/STAT model presented in Guerriero et al. (2009). Since the effect of biochemical reactions is essentially the modification of protein sites, components are essentially seen as a list of interaction sites defined by a name and a state (e.g., active, bound, phosphorylated). States can also be associated with components to represent modifications occurring at

Table 8.2 Example of model compartments

| Id | Name          | Size          | Unit of measure | Dimensions |
|----|---------------|---------------|-----------------|------------|
| 1  | Exosol        | 9.91 x 10^-12 | l               | 3          |
| 2  | Cell_membrane | 1.26 x 10^-7  | dm2             | 2          |
| 3  | Cytoplasm     | 2.09 x 10^-12 | l               | 3          |
| 4  | Nucleus       | 0.25 x 10^-12 | l               | 3          |
Table 8.3 Example of model components

| Name  | Descr     | Site              | Site_state                   | Site_act            | State | State_act | Comp               | Comp_act    | Init_amount | Rel (%) |
|-------|-----------|-------------------|------------------------------|---------------------|-------|-----------|--------------------|-------------|-------------|---------|
| OSM   | Ligand    |                   |                              |                     | Bound | False     | Exosol             | True        | 3000        | 100     |
| gp130 | Receptor  | OSM, SOCS3        | Bound, Bound                 | False, False        | Bound | False     | Cell_membrane      | True        | 1000        | 50      |
| OSMR  | Receptor  | OSM, SOCS3        | Bound, Bound                 | False, False        | Bound | False     | Cell_membrane      | True        | 1000        | 50      |
| STAT3 | Effector  | Y705, gp130, OSMR | Phosphorylated, Bound, Bound | False, False, False | Dimer | False     | Cytoplasm, Nucleus | True, False | 3000        | 30      |
| SOCS3 | Inhibitor |                   |                              |                     | Bound | False     | Cytoplasm          | True        | 0           | 100     |
unknown interaction sites. A label associated with each state and site specifies the status of the component at system initialization. For instance, in Table 8.3, STAT3 has a phosphorylation site, Y705, which is initially not phosphorylated, and an unknown homodimerization site represented by a state called dimer. If the location of the protein is relevant to the model, the compartments in which it can be located during the evolution of the system are specified, as references to their identifiers in the table of compartments (or their names, if unique). A label associated with each compartment defines where the component is located initially. For instance, STAT3 can be either in the nucleus or in the cytoplasm, and is initially assumed to be in the cytoplasm. Finally, the initial quantity of the component is defined in terms of molecule counts. A reaction specifies the details of a biochemical modification or binding reaction; it is described by an identifier, an informal description, the reaction type (e.g., binding, unbinding, dimerization, phosphorylation, relocation, etc.), the reaction rate (i.e., the kinetic constant), and the reaction volume (i.e., the volume in which the reaction occurs, which can be the name of a compartment, or a different value in case the reaction is known to be localized). Table 8.4 contains an example of the definition of reaction details. A narrative of events describes the evolution of the system as a sequence of events, which can be optionally grouped into processes. An event is a constrained textual description of a biochemical reaction involving at most two components: it is described by an identifier, an informal description, a semiformal description, the identifier of the reaction associated with the event (in the table of reactions), and optionally an identifier of alternative events. 
The semiformal description, which can be prefixed by conditions on the state of components/sites or on their position, specifies the occurring reaction (e.g., phosphorylates, relocates to, binds, unbinds) and the involved component(s). The underlying assumption is that each event involves an interaction/modification of one site of the involved components: if no site is specified, the interaction/modification is assumed to involve one of the

Table 8.4 Example of model reactions

| Id | Type              | Rate       | Unit        | Rel (%) | React_vol | Unit | Rel (%) |
|----|-------------------|------------|-------------|---------|-----------|------|---------|
| 15 | Binding           | 4.8 x 10^8 | M^-1 min^-1 | 20      | Cytoplasm | l    | 50      |
| 16 | Unbinding         | 0.06       | min^-1      | 30      | Cytoplasm | l    | 50      |
| 17 | Phosphorylation   | 0.2        | min^-1      | 80      | Cytoplasm | l    | 50      |
| 18 | Dephosphorylation | 0          | min^-1      | 0       | Cytoplasm | l    | 50      |
| 19 | Unbinding         | inf        | min^-1      | 10      | Cytoplasm | l    | 50      |
| 20 | Homodimerization  | inf        | min^-1      | 50      | Cytoplasm | l    | 50      |
| 21 | Relocation        | 1 (t1/2)   | min         | 10      | Cytoplasm | l    | 50      |
| 25 | Synthesis         | 0.01       | min^-1      | 50      | Nucleus   | l    | 50      |
Table 8.5 Example of model events

| Id | Description | React | Alt |
|----|-------------|-------|-----|
| 1  | if gp130.LIF is not bound and LIF is not bound and gp130.typeI is not dimer and gp130.typeII is not dimer then LIF binds gp130 on LIF | 1 | 5 |
| 5  | if gp130.OSM is not bound and OSM is not bound and gp130.typeI is not dimer and gp130.typeII is not dimer then OSM binds gp130 on OSM | 5 | 1 |
| 39 | if STAT3.OSMR is bound then STAT3 phosphorylates on Y705 | 17 | |
| 40 | if STAT3 is in cytoplasm then STAT3 dephosphorylates on Y705 | 18 | |
| 45 | if STAT3 is in cytoplasm and STAT3 is dimer and STAT3.gp130 is not bound and STAT3.LIFR is not bound and STAT3.OSMR is not bound then STAT3 relocates to nucleus | 21 | |
| 49 | if STAT3 is in nucleus and STAT3 is dimer and STAT3.Y705 is phosphorylated and STAT3.PIAS3 is not bound then STAT3 synthesises SOCS3 | 25 | |
| 50 | SOCS3 degrades | 30 | |
component’s generic states; if a list of sites is specified, the event represents simultaneous reactions involving different sites. Conditions on events are used to enforce the ordering of sequential events (e.g., the phosphorylation of a site of a protein is allowed only after it is bound to another protein); mutually exclusive events (e.g., competitive binding) are handled by specifying which events are alternatives to each other; events that are not explicitly declared either alternative or sequential are considered independent and are treated as concurrent events (e.g., independent events involving different proteins). Table 8.5 reports some examples of event definitions. 3.1.1. Tool support for model execution Since each entity and each description of event in the NL must satisfy rigorous language constraints, models written in the NL are formal and unambiguous. Consequently, NL models can be translated in a mechanical way into executable models (in process algebras or other formal languages), enabling modelers to easily simulate models developed in the NL. Translation of NL models into formal languages obviously requires knowledge of the target language, which means that the ability of users to perform analysis of NL models using a particular computational tool depends on their knowledge of the tool and of its underlying modeling language. Hence, to exploit the full power of the NL, automatic translation into formal languages is needed, so that users can potentially execute models without any specific knowledge of the particular target language used by different computational tools.
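Each NL event thus boils down to a guard over component states plus an update. A minimal Python encoding of event 39 of Table 8.5 ("if STAT3.OSMR is bound then STAT3 phosphorylates on Y705"), with the component represented as a hypothetical dict of sites, might be:

```python
# One component as a dict of site states; the event fires only when
# its guard holds (encoding "if STAT3.OSMR is bound then STAT3
# phosphorylates on Y705" from Table 8.5).
stat3 = {"OSMR": "bound", "Y705": "unphosphorylated"}

def guard(component):
    return component["OSMR"] == "bound"

def action(component):
    component["Y705"] = "phosphorylated"

if guard(stat3):          # events whose guards fail are simply not enabled
    action(stat3)
print(stat3["Y705"])      # phosphorylated
```

Sequential ordering then falls out naturally: an event whose guard mentions the result of another event cannot fire first.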
The N2BB tool, available from N2BB, implements an automatic translation (Guerriero et al., 2007) of models specified in the NL into executable models compatible with the BetaWB tool. BlenX4Bio, described in Priami et al. (2009), is a tabular interface to the BetaWB, allowing modelers to define the entities of the NL via a user-friendly graphical interface. Using these tools, modelers who are not familiar with the BlenX language can build models simply by filling in the tables with the components, reactions, parameters, etc., and automatically obtain a model ready for execution via the BetaWB simulator, to which modelers only need to add the information on the time interval for which they want the simulation to run; once the simulation is completed, the time-course results can be either directly visualized in the BetaWB plotter or exported into other tools for further analysis.
4. Case Studies In this section, we present two case studies in order to illustrate the process algebra modeling approach and the kinds of analyses which can be performed using it. In Section 4.1, we illustrate in great detail a computational model of the JAK/STAT signaling pathway, focusing mainly on the model description in both the NL and process algebras; in Section 4.2, instead, we show some results obtained by analyzing a process algebra model of a circadian clock, which clearly show the difference between stochastic and deterministic models.
4.1. The JAK/STAT pathway As an example of modeling of signaling pathways, we describe the NL model of the gp130/JAK/STAT pathway presented in Guerriero et al. (2009). In that work, starting from the existing knowledge biologists had of the system, the NL model was developed and automatically translated into a BlenX model using the N2BB tool and simulated using the BetaWB simulator. Starting from the same NL model, in a later work (Guerriero, 2009), the model was translated into Bio-PEPA, which enabled us to obtain additional simulation and model-checking results. In this section, we briefly describe the system and report some extracts of the models and some analysis results. 4.1.1. The pathway The gp130/JAK/STAT signaling pathway is an interesting case study for computational modeling as it illustrates the role of both the compartmentalization of components and inhibitory mechanisms of different types. The pathway is also of significant biological interest due to its functions in human fertility, neuronal repair, cancer, and hematological development (Mahdavi et al., 2007; Singh et al., 2006; Swameye et al., 2003) and accordingly
it has become a focus for the development of targeted therapeutic interventions. In this respect, modeling the gp130/JAK/STAT pathway and exploring its parameter sensitivities should provide valuable insights into identifying entry points for the development of new forms of intervention. Signaling through the gp130/JAK/STAT pathway is initiated by ligands of the interleukin 6/leukemia inhibitory factor family binding to the gp130 transmembrane receptor and a second receptor which, depending on the identity of the ligand, could be a second copy of gp130 or the structurally related receptors LIFR and OSMR. The creation of transmembrane receptor dimers elicits activation of the receptor-associated JAK kinase creating a phosphorylated residue which acts as a binding site for STAT3. Phosphorylation of STAT3 results in release from the receptor complex and transport across the nuclear membrane where it is able to bind to DNA target sites. Dephosphorylation of STAT3 by the phosphatase TC-PTP results in nuclear export, where the dephosphorylated STAT3 is free to reengage with phosphorylated receptor. The pathway is thus a regulated nuclear-cytoplasmic shuttle. We also consider two forms of pathway inhibition: the action of SOCS3 proteins which are induced in response to the DNA bound form of STAT3, and the action of PIAS3. This informal description of the pathway is shown in the form of a diagram in Fig. 8.1 and encoded formally as described later in this section.
Figure 8.1 Graphical representation of the gp130/JAK/STAT pathway.
The molecular species we consider in the model are: two ligands (LIF and OSM), three membrane-bound receptors (gp130, LIFR, and OSMR), one effector (STAT3), and two inhibitors (SOCS3 and PIAS3). The receptor-associated JAK kinase and TC-PTP phosphatase are implicitly modeled. Receptors are activated by ligand binding, and active receptors dimerize to form receptor complexes (gp130:LIFR or gp130:OSMR; reaction r1 in Fig. 8.1). Once the receptor dimeric complex is formed, each receptor subunit (gp130, LIFR, and OSMR) can undergo JAK-mediated phosphorylation (r2). STAT3 can bind on receptors’ phosphorylated sites (r3), and the binding of STAT3 leads to its activation (phosphorylation; r4). Once phosphorylated, STAT3 dissociates from the receptor complex, and its phosphorylated site allows STAT3 to homodimerize (r5). When STAT3 is in dimeric form, it can translocate into the nucleus (r6) where it can carry out its specific functions (not modeled here): STAT3 binds to the DNA, thus activating the transcription of downstream gene targets. Nuclear STAT3 dimers are inactivated through TC-PTP-mediated dephosphorylation, which leads to the dimers’ dissociation (r7) and to STAT3 export to the cytoplasm (r8), where STAT3 can undergo additional cycles of activation. The two inhibition mechanisms considered are due to SOCS3 and PIAS3. SOCS3 is synthesized by STAT3 (r9) and it acts by competing with STAT3 in binding to receptors (r10). PIAS3 acts by binding to active nuclear STAT3 (r11). 4.1.2. A Narrative Language model We overview here the main features of the NL model presented in Guerriero et al. (2009). The full model is included in the supplementary material of that work. We describe here in more detail how STAT3 is modeled and the reactions in which it is involved. Additional extracts of the NL model have already been shown in Section 3.1. 
Four compartments are involved in the system: the exosol (the extracellular space, where the two ligands are located), the cell membrane (location of the receptors), the cytoplasm (initial location of STAT3), and the nucleus (in which STAT3 can translocate). The information given about the compartments is reported in Table 8.2 (note in particular the field called id, which can be used in the rest of the model to reference the compartments, and the compartment size, which is needed for stochastic simulation). Table 8.6 reports the definition of the model component representing STAT3. For our purpose, STAT3 has one phosphorylation site (Y705, initially not phosphorylated) and four binding sites for binding of the three receptors and of the PIAS3 inhibitor (all initially unbound). STAT3 can be present either as a monomer or as a homodimer, so a dimer state (initially set to false) is also present. In addition, it can be either in the cytoplasm (initially set to true) or in the nucleus (initially set to false). The initial amount of STAT3 is set to 3000 and a reliability of 30% is associated with this numerical value.
Table 8.6 Component representing STAT3

| Name  | Descr    | Site  | Site_state     | Site_act | State | State_act | Comp      | Comp_act | Init_amount | Reliab (%) |
|-------|----------|-------|----------------|----------|-------|-----------|-----------|----------|-------------|------------|
| STAT3 | Effector | Y705  | Phosphorylated | False    | Dimer | False     | Cytoplasm | True     | 3000        | 30         |
|       |          | gp130 | Bound          | False    |       |           | Nucleus   | False    |             |            |
|       |          | LIFR  | Bound          | False    |       |           |           |          |             |            |
|       |          | OSMR  | Bound          | False    |       |           |           |          |             |            |
|       |          | PIAS3 | Bound          | False    |       |           |           |          |             |            |
Receptors contain one or two ligand binding sites (OSMR has only one site for OSM, while LIFR and gp130 also have one site for LIF), one binding site for SOCS3, and a number of phosphorylation sites. Moreover, receptors can be in dimeric state (an additional site in gp130 allows us to distinguish between the two types of OSM receptors). Ligands, located in the exosol, only have a bound state to represent their binding to receptors, similarly to the two inhibitors SOCS3 and PIAS3. Table 8.7 reports some of the events involving STAT3: its activation (binding to active receptors, phosphorylation, and homodimerization), its shuttling between cytoplasm and nucleus, and the reactions related to the inhibitors (synthesis of SOCS3, and binding of PIAS3). As there are many replicated events involving the three different receptors, we report here only the ones for gp130, but analogous events are defined for LIFR and OSMR. Each event is defined by a reaction involving one or two components, prefixed by a sequence of conditions involving the reactants, and by a reference to the identifier of the reaction (referring to Table 8.8). For instance, event 37, which represents the phosphorylation of site Y705 of STAT3 following the binding of STAT3 to one of the receptors, is defined as “STAT3 phosphorylates on Y705” and it can occur only “if STAT3.gp130 is bound.” The condition for the occurrence of the binding event 31 (“gp130 binds STAT3 on gp130”) is more complex, as it is composed of a conjunction of conditions (“if gp130.Y767 is phosphorylated and STAT3 is in cytoplasm and STAT3 is not dimer,” etc.) that STAT3 and gp130 need to satisfy for the binding to occur. The type, kinetic rate parameter, and reaction volume of the reactions involving STAT3 are reported in Table 8.8. Rate parameters are either a number or the keyword “inf” (for instantaneous reactions), and their units of measure are also specified. 
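The reaction volume matters because stochastic simulators work with per-molecule-pair rates: a second-order deterministic constant must be rescaled by the compartment volume. A sketch of the standard conversion (the numbers are illustrative values read off the tables above, not calibrated data):

```python
# A standard conversion from a deterministic second-order rate
# constant (M^-1 min^-1) to the per-pair stochastic rate constant
# used by Gillespie-style simulators: c = k / (N_A * V).
# The numbers below are illustrative (binding rate from Table 8.4,
# cytoplasm volume from Table 8.2), not calibrated model parameters.
N_A = 6.022e23                          # Avogadro's number (mol^-1)

def stochastic_rate(k_det, volume_litres):
    return k_det / (N_A * volume_litres)

c = stochastic_rate(4.8e8, 2.09e-12)    # per molecule pair, per minute
```

First-order constants (min^-1), by contrast, carry over unchanged, which is why only bimolecular reactions need the enclosing volume.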
Reaction volumes in this case are simply the size of the enclosing compartments. In general, many rate parameters in biological pathways are either unknown or subject to considerable uncertainty. In initially setting up the model, therefore, it is common to use generic values, or data drawn from the literature, in which only the relative rates of the reactions matter (see, e.g., Fisher and Henzinger, 2007); the task of identifying accurate parameters that capture the quantitative aspects of model behavior is then carried out during the model analysis phase via parameter exploration.

4.1.3. The Bio-PEPA model

The described NL model was translated into a Bio-PEPA model, presented in Guerriero (2009). The translation, though done manually, is quite mechanical due to the structural similarities of the two modeling languages, and hence it could potentially be implemented in a tool. In Bio-PEPA, each form of the involved molecular species is modeled as a distinct component: different species components are defined to represent
Table 8.7 Events involving STAT3 (React refers to the reaction identifier in Table 8.8)

STAT3 activation
31 (React 15): if gp130.Y767 is phosphorylated and STAT3 is in cytoplasm and STAT3 is not dimer and gp130 is not bound and gp130.SOCS3 is not bound and STAT3.gp130 is not bound and STAT3.LIFR is not bound and STAT3.OSMR is not bound and STAT3.Y705 is not phosphorylated then gp130 binds STAT3 on gp130
37 (React 17): if STAT3.gp130 is bound then STAT3 phosphorylates on Y705

STAT3 unbinding and homodimerization
41 (React 19): if gp130.Y767 is phosphorylated and STAT3 is in cytoplasm and STAT3 is not dimer and gp130 is bound and gp130.SOCS3 is not bound and STAT3.gp130 is bound and STAT3.LIFR is not bound and STAT3.OSMR is not bound and STAT3.Y705 is phosphorylated then gp130 unbinds STAT3 on gp130
44 (React 20): if STAT3.Y705 is phosphorylated and STAT3 is not dimer and STAT3.gp130 is not bound and STAT3.LIFR is not bound and STAT3.OSMR is not bound then STAT3 homodimerizes

STAT3 shuttling
45 (React 21): if STAT3 is in cytoplasm and STAT3 is dimer and STAT3.gp130 is not bound and STAT3.LIFR is not bound and STAT3.OSMR is not bound then STAT3 relocates to nucleus
46 (React 22): if STAT3 is in nucleus and STAT3 is dimer and STAT3.PIAS3 is not bound then STAT3 dephosphorylates on Y705
47 (React 23): if STAT3 is in nucleus and STAT3 is dimer and STAT3.Y705 is not phosphorylated then STAT3 dehomodimerizes
48 (React 24): if STAT3 is in nucleus and STAT3 is not dimer and STAT3.Y705 is not phosphorylated then STAT3 relocates to cytoplasm

SOCS3 synthesis
49 (React 25): if STAT3 is in nucleus and STAT3 is dimer and STAT3.Y705 is phosphorylated and STAT3.PIAS3 is not bound then STAT3 synthesises SOCS3

PIAS3 inhibition
57 (React 28): if STAT3 is in nucleus and STAT3 is dimer and STAT3.Y705 is phosphorylated and STAT3.PIAS3 is not bound and PIAS3 is not bound then PIAS3 binds STAT3 on PIAS3
58 (React 29): if STAT3 is in nucleus and STAT3 is dimer and STAT3.Y705 is phosphorylated and STAT3.PIAS3 is bound and PIAS3 is bound then PIAS3 unbinds STAT3 on PIAS3
Table 8.8 Reactions involving STAT3

Id  Type                Rate       Unit        Reliability (%)  Reaction_volume  Unit  Reliability (%)
15  Binding             4.8 × 10⁸  M⁻¹ min⁻¹   20               Cytoplasm        l     50
17  Phosphorylation     0.2        min⁻¹       80               Cytoplasm        l     50
19  Unbinding           inf        min⁻¹       10               Cytoplasm        l     50
20  Homodimerization    inf        min⁻¹       50               Cytoplasm        l     50
21  Relocation          1          min (t1/2)  10               Cytoplasm        l     50
22  Dephosphorylation   0.04       min⁻¹       20               Nucleus          l     50
23  Dehomodimerization  inf        min⁻¹       20               Nucleus          l     50
24  Relocation          15         min (t1/2)  10               Nucleus          l     50
25  Synthesis           0.01       min⁻¹       50               Nucleus          l     50
28  Binding             1.0 × 10⁸  M⁻¹ min⁻¹   20               Nucleus          l     50
29  Unbinding           0.06       min⁻¹       30               Nucleus          l     50
Maria Luisa Guerriero and John K. Heath
the different states, locations, etc., of each component in the NL model. As an example, STAT3 is modeled by four distinct components representing, respectively, the cytoplasmic dephosphorylated monomeric form (STAT3_c), the cytoplasmic phosphorylated dimeric form (STAT3-PD_c), the nuclear phosphorylated dimeric form (STAT3-PD_n), and the nuclear dephosphorylated monomeric form (STAT3_n); further species components are defined for each state of each complex containing STAT3. The definitions of the species representing the different forms of STAT3 are reported in Table 8.9 (again, we omit some of the duplicate reactions for the different receptors). Reactions and biochemical modifications are represented by actions executed synchronously by the involved species components, in the form "reaction_name species_role," where reaction_name is a name identifying the reaction uniquely, and species_role describes the role of a species in the reaction: a down-arrow (↓) indicates a reactant (i.e., its amount is decreased by the reaction), an up-arrow (↑) indicates a product (i.e., its amount is increased by the reaction), and a plus (+) indicates a positive regulator (i.e., it acts as a catalyst and its amount is unaffected by the reaction). Stoichiometric coefficients are assumed to be 1 by default, and they can otherwise be specified with reactions of the form (reaction_name, stoichiometry_coefficient) species_role. For instance, the reaction representing r7 in Fig. 8.1 is modeled as the reaction dephospho_dedimer_STAT3, which is performed by STAT3-PD_n and STAT3_n, and whose effect is to decrease the amount of STAT3-PD_n and increase (with stoichiometric coefficient 2) the amount of STAT3_n. For each reaction, a functional rate specifying its kinetic rate law is defined. The ones used in the species definitions for STAT3-PD_n and STAT3_n are shown in Table 8.10 (lines starting with // are comments and are ignored by the simulator).
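The role-and-stoichiometry bookkeeping just described can be illustrated with a small sketch. The Python below is a hypothetical rendering of the update semantics (Bio-PEPA itself uses the ↓/↑ action syntax shown in Table 8.9), applied to reaction r7:

```python
# Sketch of how reactant/product roles update species amounts:
# reactant roles decrease a species, product roles increase it,
# each by its stoichiometric coefficient (1 by default).

def apply_reaction(amounts, reactants, products):
    """reactants/products map species name -> stoichiometric coefficient."""
    for species, coeff in reactants.items():
        amounts[species] -= coeff
    for species, coeff in products.items():
        amounts[species] += coeff
    return amounts

# Reaction r7 (dephospho_dedimer_STAT3): one STAT3-PD_n dimer is consumed,
# two STAT3_n monomers are produced (stoichiometric coefficient 2).
amounts = {"STAT3-PD_n": 100, "STAT3_n": 0}
apply_reaction(amounts, reactants={"STAT3-PD_n": 1}, products={"STAT3_n": 2})
print(amounts)  # {'STAT3-PD_n': 99, 'STAT3_n': 2}
```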
Finally, the initial amounts for all the species must be specified in the form shown in Table 8.11. Note that, analogously to the NL, the initial amounts are given in terms of molecule counts rather than concentrations in order to allow models to be analyzed via stochastic simulation. As mentioned above, Bio-PEPA allows us to interpret the system either as continuous-deterministic or as discrete-stochastic, and the Bio-PEPA Eclipse plug-in (Duguid et al., 2009) enables us to obtain time-series results via both stochastic simulation and ODE solvers. In addition to this, Bio-PEPA models can be exported for analysis via different tools, in particular, PRISM for model-checking. In Guerriero (2009), some stochastic simulation results have been presented, and the satisfaction of a number of expected/desired properties of the system is verified using PRISM. Some of the results described in that work were qualitative (or semiquantitative) analyses: though this kind of analysis does not provide exact numerical values, it can often be interesting, due to its greater efficiency compared to exact quantitative methods, and to the fact that the experimental data
Table 8.9 Bio-PEPA model extract: Species definitions for STAT3

STAT3_c = binding_gp130-P-OSM-OSMR_STAT3↓ + unbinding_gp130-P-OSM-OSMR_STAT3↑
  + binding_gp130-OSM-OSMR-P_STAT3↓ + unbinding_gp130-OSM-OSMR-P_STAT3↑
  + binding_gp130-P-OSM-OSMR-P_STAT3↓ + unbinding_gp130-P-OSM-OSMR-P_STAT3↑
  + binding_STAT3-gp130-P-OSM-OSMR-P_STAT3↓ + unbinding_STAT3-gp130-P-OSM-OSMR-P_STAT3↑
  + binding_gp130-P-OSM-OSMR-P-STAT3_STAT3↓ + unbinding_gp130-P-OSM-OSMR-P-STAT3_STAT3↑
  + ... // analogous set of reactions for bindings to other receptor complexes
  + relocation_STAT3_n_c↑

STAT3-PD_c = unbinding_STAT3-PD_c_gp130-LIF-LIFR↑ + unbinding_STAT3-PD_c_gp130-OSM-LIFR↑
  + unbinding_STAT3-PD_c_gp130-OSM-LIFR↑ + relocation_STAT3_c_n↓

STAT3-PD_n = relocation_STAT3_c_n↑ + dephospho_dedimer_STAT3↓ + synthesis_SOCS3 (+)
  + binding_PIAS3_STAT3↓ + unbinding_PIAS3_STAT3↑

STAT3_n = (dephospho_dedimer_STAT3, 2)↑ + relocation_STAT3_n_c↓
Table 8.10 Bio-PEPA model extract: Kinetic rates

relocation_STAT3_c_n = [0.693/k58 * STAT3-PD_c]                  // STAT3-PD_c relocation cytoplasm -> nucleus
dephospho_dedimer_STAT3 = [k59 * STAT3-PD_n]                     // STAT3-PD_n dephosphorylation & dedimerization
relocation_STAT3_n_c = [0.693/k60 * STAT3_n]                     // STAT3-PD_n relocation nucleus -> cytoplasm
synthesis_SOCS3 = [k61 * STAT3-PD_n]                             // SOCS3 synthesis by STAT3-PD_n
binding_PIAS3_STAT3 = [k80/(nucleus * NA) * PIAS3 * STAT3-PD_n]  // PIAS3/STAT3-PD_n binding
unbinding_PIAS3_STAT3 = [k-80 * PIAS3:STAT3-PD_n]                // PIAS3/STAT3-PD_n unbinding
Table 8.11 Bio-PEPA model extract: Initial amounts

LIF [3000], OSM [3000], LIFR [1000], OSMR [1000], gp130 [1000], STAT3_c [3000], STAT3_PD_c [0], STAT3_PD_n [0], STAT3_n [0], SOCS3 [0], gp130:LIF:LIFR [0], gp130:OSM:LIFR [0], gp130:OSM:OSMR [0], gp130:OSM:OSMR_P:STAT3 [0]
used for validation of models are often qualitative themselves. In Clark et al. (2010), additional verification techniques have been applied to this system, with the main aim of identifying modeling errors (both syntactical and conceptual). Syntactic analysis, which can be performed statically (i.e., without performing any kind of simulation), can identify errors which invalidate some essential implicit hypotheses (e.g., conservation of mass, no unbounded growth of cells, etc.). Dynamic analysis (i.e., performed on simulation results) can provide additional knowledge about the system (the range in which the amount of each molecular species is expected to lie, the number of times each reaction is expected to occur on average, etc.). Figure 8.2 shows the localization of STAT3 as a ratio between cytoplasmic and nuclear STAT3. Figure 8.2A is obtained from experimental data (Guerriero et al., 2009), and Fig. 8.2B is obtained from the average behavior of 10 stochastic simulation runs of the Bio-PEPA model. Figure 8.3 is a graphical representation which is automatically generated from the Bio-PEPA model, and it clearly shows the limit of graphical notations for big systems: this model is quite big, but not huge (it is
(Figure 8.2: panels (A) and (B) plot cytoplasmic and nuclear STAT (%) against time (min), over roughly 0–600 min.)
Figure 8.2 Localization of STAT3 molecules: comparison of (A) experimental and (B) simulation results. Panel (A) is published in the supplementary material of Guerriero et al. (2009).
composed of 63 molecular species and 118 reactions), but this size is enough for the reaction arrows to intersect so much that they become very hard to follow.

4.1.4. The BlenX model

In Guerriero et al. (2009), we presented several results obtained by simulating the BlenX model automatically generated from the NL model described above, using the BetaWB simulation tool. Figure 8.4 is a screenshot of BetaWB showing an extract of the BlenX model (at the bottom) and its visual representation (at the top): from this short extract it is clear that the syntax of the BlenX model is not particularly intuitive and the model is not as readable as the original NL model. The modeler, however, does not generally need to care about the generated model and can instead consider it
Figure 8.3 Petri-net style graphical representation automatically generated from the Bio-PEPA model.
Figure 8.4 Automatically generated BlenX model imported in BetaWB.
as a black-box: the model is executed and the results are reported automatically by the simulation tool. A particular emphasis in Guerriero et al. (2009) was on in silico experimentation for parameter exploration: the sensitivity of the model behavior to component removal and to changes in quantitative parameters was investigated. These results showed that the rates of dephosphorylation of nuclear STAT3 and of STAT3 nuclear export were the most influential on the system dynamics, while most of the other parameters did not have a great impact. As an example, in Fig. 8.5, we report a comparison of the
(Figure 8.5: panel (A), "TC-PTP-mediated dephosphorylation", plots species amount against time (min) for dephosphorylation rates 0, 0.004, 0.04, 0.4, and 4 min⁻¹; panel (B), "JAK-mediated phosphorylation", for JAK phosphorylation rates 12, 1.2, 0.2, 0.06, and 0.02 min⁻¹.)
Figure 8.5 Parameter sensitivity analysis: rate of (A) cytoplasmic phosphorylation and (B) nuclear dephosphorylation of STAT3. These results are published in Guerriero et al. (2009).
effect of changing the rate of phosphorylation of cytoplasmic STAT3 and that of dephosphorylation of nuclear STAT3.
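The kind of one-at-a-time parameter scan behind these sensitivity results can be sketched on a toy phosphorylation/dephosphorylation cycle, dP/dt = k_phos (S_total − P) − k_dephos P, integrated with a fixed-step Euler scheme while one rate is varied over orders of magnitude. All names and values below are illustrative, not the actual model parameters:

```python
# Hedged sketch of a one-parameter-at-a-time sensitivity scan on a toy
# phospho/dephospho cycle. The scanned values mirror the order-of-magnitude
# spacing used in Fig. 8.5 but are otherwise arbitrary.

def steady_phospho_fraction(k_phos, k_dephos, s_total=3000.0,
                            dt=0.01, t_end=500.0):
    """Euler-integrate dP/dt = k_phos*(s_total - P) - k_dephos*P and
    return the final phosphorylated fraction P/s_total."""
    p = 0.0
    for _ in range(int(t_end / dt)):
        p += dt * (k_phos * (s_total - p) - k_dephos * p)
    return p / s_total

for k_dephos in (0.004, 0.04, 0.4, 4.0):   # scan one parameter at a time
    frac = steady_phospho_fraction(k_phos=0.2, k_dephos=k_dephos)
    print(f"k_dephos={k_dephos:g}/min -> phosphorylated fraction {frac:.2f}")
```

At steady state the fraction equals k_phos/(k_phos + k_dephos), so the scan shows directly how strongly the output depends on the varied rate.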
4.2. Circadian clocks

Circadian clocks are a classic example of systems for which deterministic and stochastic models exhibit significantly different behavior. They are genetic networks which involve a very small number of proteins and genes, present in low copy numbers. Consequently, they can exhibit very noisy behavior, which means that the behavior of each cell in a population can be very different from that of the others. As mentioned earlier, the solution of a system of ODEs simulates the average population behavior, whereas a simulation of a stochastic system simulates the behavior of an individual cell. Because the noise is small, for big systems such as the signaling pathway described in the previous section the deterministic population behavior is generally equivalent to the mean behavior of a large number of individual stochastic behaviors. For small systems such as circadian clocks and other oscillatory networks, instead, the noise can have such a big effect that the mean stochastic and the mean deterministic behaviors can be very different. Figure 8.6 shows a comparison of the results obtained from the model of the Neurospora crassa circadian clock presented in Akman et al. (2009, 2010b). The two panels refer to the same model settings with the exception of one altered parameter. The deterministic behavior (solid black line) is totally different for the two settings (in one case dampened oscillations, in the other persistent regular oscillations). The stochastic model, instead, is more robust and the oscillations are preserved in both settings: due to the stochastic noise, however, the oscillations observed in a single cell (red and
(Figure 8.6: panels (A) and (B) plot FRQ protein amount against time (h), 0–360 h, for the ODE solution, the SSA population mean, and two single-cell SSA runs.)
Figure 8.6 Individual versus population behavior: (A) and (B) refer to different parameter settings. The solid black line shows the deterministic behavior, the dashed blue line is the mean stochastic behavior, and the red and green dots are two individual stochastic behaviors.
green dots show two independent stochastic simulation runs) are irregular, both in amplitude and in phase, and consequently the observed mean behavior (the dashed blue line is the average over 1000 stochastic runs) shows dampening oscillations for both settings. Additional details of the stochastic model and additional analysis results can be found in Akman et al. (2009). Other works on models of oscillating systems which exhibit differences between stochastic and deterministic behaviors are Akman et al. (2010a), Ballarini and Guerriero (2010), and Tymchyshyn and Kwiatkowska (2008).
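The damping of the population mean can be reproduced with a minimal numerical sketch, unrelated to the actual Akman et al. model: average many cosine "cells" whose periods are randomly jittered, and the mean flattens as the cells drift out of phase.

```python
# Minimal illustration of why averaging many noisy single-cell oscillators
# yields damped population oscillations: each "cell" oscillates with a
# slightly different period, so the population dephases over time.
# The ~22 h period and 1 h jitter are illustrative values only.

import math
import random

random.seed(1)

def cell_trace(period, t_points):
    return [math.cos(2 * math.pi * t / period) for t in t_points]

t_points = [float(t) for t in range(0, 361)]                   # hours
periods = [22.0 + random.gauss(0, 1.0) for _ in range(1000)]   # jittered clocks
traces = [cell_trace(p, t_points) for p in periods]

# Population mean at each time point.
mean_trace = [sum(col) / len(col) for col in zip(*traces)]

early_amp = max(abs(x) for x in mean_trace[:48])    # first two days
late_amp = max(abs(x) for x in mean_trace[-48:])    # last two days
print(early_amp, late_amp)   # late amplitude is much smaller
```

Each individual trace keeps full amplitude throughout; only the mean damps, mirroring the single-cell versus population contrast in Fig. 8.6.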
5. Conclusions and Perspectives

Here we have reviewed some of the tools and techniques employed in the analysis of biological pathways as executable computer programmes, and shown case studies of how this approach can be applied in practice to reasoning about the dynamic behavior and structure of biological pathways. The major requirement in this approach is the ability to formulate the pathway of interest in a very precise and logical fashion to enable computational execution. Although this imposes constraints on the biologist, which at first sight might be discouraging, the process of developing a rigorous pathway description acts to uncover areas of ignorance or ambiguity in biological knowledge. Although the diversity of computational tools is rapidly expanding, the development of common frameworks for pathway formulation, such as NL and related approaches, should enable a much closer dialogue between biologists and computer scientists.
ACKNOWLEDGMENTS

The authors thank Jane Hillston for helpful discussions. M. L. G. is funded by the Centre for Systems Biology at Edinburgh. "The Centre for Systems Biology at Edinburgh is a Centre for Integrative Systems Biology (CISB) funded by BBSRC and EPSRC, reference BB/D019621/1." J. K. H. is funded by Cancer Research UK.
REFERENCES

Akman, O. E., Ciocchetta, F., Degasperi, A., and Guerriero, M. L. (2009). Modelling Biological Clocks with Bio-PEPA: Stochasticity and Robustness for the Neurospora crassa Circadian Network, Proc. of CMSB'09. Springer, Berlin, pp. 52–67. Akman, O. E., Guerriero, M. L., Loewe, L., and Troein, C. (2010a). Complementary approaches to understanding the plant circadian clock, Proc. of FBTC'10. Cyprus, pp. 1–19. Akman, O. E., Rand, D. A., Brown, P. E., and Millar, A. J. (2010b). Robustness from flexibility in the fungal circadian clock. BMC Syst. Biol. 4, 88.
Aldridge, B. B., Burke, J. M., Lauffenburger, D. A., and Sorger, P. K. (2006). Physicochemical modelling of cell signalling pathways. Nat. Cell Biol. 8, 1195–1203. Ballarini, P., and Guerriero, M. L. (2010). Query-based verification of qualitative trends and oscillations in biochemical systems. Theor. Comput. Sci. 411, 2019–2036. BetaWB Home Page. http://www.cosbi.eu/index.php/research/prototypes/beta-wb. BioModels Database. http://www.ebi.ac.uk/biomodels. Bio-PEPA Home Page. http://www.biopepa.org/. BioSPI Home Page. http://www.wisdom.weizmann.ac.il/ biopsi. Cell Designer Home Page. http://celldesigner.org. Ciocchetta, F., and Hillston, J. (2009). Bio-PEPA: A framework for the modelling and analysis of biological systems. Theor. Comput. Sci. 410, 3065–3084. Ciocchetta, F., Priami, C., and Quaglia, P. (2008). An automatic translation of SBML into Beta-binders. IEEE/ACM Trans. Comput. Biol. Bioinform. 5, 80–90. Clark, A., Gilmore, S., Guerriero, M. L., and Kemper, P. (2010). On Verifying Bio-PEPA Models, Proc. of CMSB'10. ACM, pp. 23–32. Clarke, E. M., Grumberg, O., and Peled, D. (1999). Model Checking. MIT Press, Cambridge, MA. COPASI Home Page. http://www.copasi.org. Danos, V., and Laneve, C. (2004). Formal molecular biology. Theor. Comput. Sci. 325, 69–110. Duguid, A., Gilmore, S., Guerriero, M. L., Hillston, J., and Loewe, L. (2009). Design and Development of Software Tools for Bio-PEPA, Proc. of WSC'09. IEEE Press, pp. 956–967. Eccher, C., and Priami, C. (2006). Design and implementation of a tool for translating SBML into the biochemical stochastic π-calculus. Bioinformatics 22, 3075–3081. Edinburgh Pathway Editor Home Page. http://www.bioinformatics.ed.ac.uk/epe. Fisher, J., and Henzinger, T. A. (2007). Executable cell biology. Nat. Biotechnol. 25, 1239–1249. Fisher, J., Piterman, N., Hubbard, E. J., Stern, M. J., and Harel, D. (2005). Computational insights into Caenorhabditis elegans vulval development. Proc. Natl. Acad. Sci. USA 102, 1951–1956. Funahashi, A., Morohashi, M., Kitano, H., and Tanimura, N. (2003). CellDesigner: A process diagram editor for gene-regulatory and biochemical networks. BIOSILICO 1, 159–162. Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361. Guerriero, M. L. (2009). Qualitative and quantitative analysis of a Bio-PEPA model of the Gp130/JAK/STAT signalling pathway. Trans. Comput. Syst. Biol. XI 5750, 90–115. Guerriero, M. L., Heath, J. K., and Priami, C. (2007). An Automated Translation from a Narrative Language for Biological Modelling into Process Algebra, Proceedings of Computational Methods in Systems Biology (CMSB'07). Springer, pp. 136–151. Guerriero, M. L., Dudka, A., Underhill-Day, N., Heath, J. K., and Priami, C. (2009). Narrative-based computational modelling of the Gp130/JAK/STAT signalling pathway. BMC Syst. Biol. 3, 40. Harel, D. (1987). Statecharts: A visual formalism for complex systems. Sci. Comput. Program. 8, 231–274. Heath, J. K. (2009). The Equivalence Between Biology and Computation, Proc. of CMSB'09. Springer, Berlin, pp. 18–25. Heiner, M., Gilbert, D., and Donaldson, R. (2008). Petri Nets for Systems and Synthetic Biology, SFM'08. Springer, Berlin, pp. 215–264. Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., Arkin, A. P., Bornstein, B. J., Bray, D., Cornish-Bowden, A., Cuellar, A. A., Dronov, S., et al. (2003).
The Systems Biology Markup Language (SBML): A medium for representation and exchange of biochemical network models. Bioinformatics 19, 524–531. Janes, K. A., and Yaffe, M. B. (2006). Data-driven modelling of signal-transduction networks. Nat. Rev. Mol. Cell Biol. 7, 820–828. Kahramanoğulları, O., Cardelli, L., and Caron, E. (2009). An intuitive automated modelling interface for systems biology. Proc. of DCM'09, pp. 73–86. KEGG. http://www.genome.jp/kegg/. Kohn, K. W. (1999). Molecular interaction map of the mammalian cell cycle control and DNA repair systems. Mol. Biol. Cell 10, 2703–2734. Kugler, H., Larjo, A., and Harel, D. (2010). Biocharts: A visual formalism for complex biological systems. J. R. Soc. Interface 7, 1015–1024. Kwiatkowska, M. Z., and Heath, J. K. (2009). Biological pathways as communicating computer systems. J. Cell Sci. 122, 2793–2800. Kwiatkowska, M., Norman, G., and Parker, D. (2002). PRISM: Probabilistic symbolic model checker. Proc. of Conference on Modelling Tools and Techniques for Computer and Communication Systems Performance Evaluation, pp. 200–204. Mahdavi, A., Davey, R. E., Bhola, P., Yin, T., and Zandstra, P. W. (2007). Sensitivity analysis of intracellular signaling pathway kinetics predicts targets for stem cell fate control. PLoS Comput. Biol. 3, 1257–1267. MATLAB. http://www.mathworks.com. Milner, R. (1989). Communication and Concurrency. Prentice-Hall. Milner, R. (1999). Communicating and Mobile Systems: The π-Calculus. Cambridge University Press. N2BB Home Page. http://homepages.inf.ed.ac.uk/mguerrie/sw/N2BB/. Nielson, F., Nielson, H., Priami, C., and Rosa, D. (2007). Control flow analysis for BioAmbients. ENTCS 180, 65–79. Păun, G. (2002). Membrane Computing: An Introduction. Springer-Verlag, New York. Peleg, M., Yeh, I., and Altman, R. (2002). Modeling biological processes using workflow and Petri net models. Bioinformatics 18, 825–837. Petri Nets World. http://www.informatik.uni-hamburg.de/TGI/PetriNets/. Phillips, A., Cardelli, L., and Castagna, G. (2006). A graphical representation for biological processes in the stochastic pi-calculus. Trans. Comput. Syst. Biol. 4230, 123–152. Pilegaard, H., Nielson, F., and Nielson, H. R. (2008). Pathway analysis for BioAmbients. J. Log. Algebraic Program. 77, 92–130. Priami, C. (2009). Algorithmic systems biology: An opportunity for computer science. Commun. ACM 52, 80–88. Priami, C., and Quaglia, P. (2005). Operational patterns in Beta-binders. Trans. Comput. Syst. Biol. 1, 50–65. Priami, C., Regev, A., Silverman, W., and Shapiro, E. (2001). Application of a stochastic name-passing calculus to representation and simulation of molecular processes. Inf. Process. Lett. 80, 25–31. Priami, C., Ballarini, P., and Quaglia, P. (2009). BlenX4Bio—BlenX for Biologists, Proc. of CMSB'09. Springer, Berlin, pp. 26–51. PRISM Home Page. http://www.prismmodelchecker.org. Regev, A., and Shapiro, E. (2002). Cells as computation. Nature 419, 343. Regev, A., Silverman, W., and Shapiro, E. (2001). Representation and simulation of biochemical processes using the π-calculus process algebra. Proceedings of Pacific Symposium on Biocomputing (PSB'01), pp. 459–470. Regev, A., Panina, E. M., Silverman, W., Cardelli, L., and Shapiro, E. Y. (2004). BioAmbients: An abstraction for biological compartments. Theor. Comput. Sci. 325, 141–167. Reisig, W. (1985). Petri Nets: An Introduction. Springer-Verlag, New York. RuleBase. http://www.rulebase.org/.
SBGN Home Page. http://www.sbgn.org. SBML Home Page. http://www.sbml.org. Setty, Y., Cohen, I. R., Dor, Y., and Harel, D. (2008). Four-dimensional realistic modeling of pancreatic organogenesis. Proc. Natl. Acad. Sci. USA 105, 20374–20379. Singh, A., Jayaraman, A., and Hahn, J. (2006). Modeling regulatory mechanisms in IL-6 transduction in hepatocytes. Biotechnol. Bioeng. 95, 850–862. Sorokin, A., Paliy, K., Selkov, A., Demin, O. V., Dronov, S., Ghazal, P., and Goryanin, I. (2006). The Pathway Editor: A tool for managing complex biological networks. IBM J. Res. Dev. 50, 561–576. Stochastic Pi Machine Home Page. http://research.microsoft.com/en-us/projects/spim/. Swameye, I., Müller, T. G., Timmer, J., Sandra, O., and Klingmüller, U. (2003). Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by data-based modeling. PNAS 100, 1028–1033. The BioAmbients Machine Home Page. http://aesop.doc.ic.ac.uk/tools/bam/. The P Systems Webpage. http://ppage.psystems.eu/. Tymchyshyn, O., and Kwiatkowska, M. (2008). Combining Intra- and Inter-cellular Dynamics to Investigate Intestinal Homeostasis, Formal Methods in Systems Biology. Springer, Berlin, pp. 63–76. VirtualCell Home Page. http://vcell.org/. Vyshemirsky, V., and Girolami, M. A. (2007). Bayesian ranking of biochemical system models. Bioinformatics 24, 833–839.
CHAPTER NINE

Computing Molecular Fluctuations in Biochemical Reaction Systems Based on a Mechanistic, Statistical Theory of Irreversible Processes

Don Kulasiri

Contents
1. Introduction
2. Theoretical Developments
3. Elementary Chemical Reactions
4. An Example of Chemical Reaction
5. Activation of Transcriptional Factors
6. Binding and Unbinding TF to E-boxes
7. Binding and Unbinding of Activated TF to E-Boxes
8. Conclusions
Acknowledgments
References
Abstract

We discuss the quantification of molecular fluctuations in biochemical reaction systems within the context of intracellular processes associated with gene expression. We take the molecular reactions pertaining to circadian rhythms to develop models of molecular fluctuations in this chapter. There are a significant number of studies on stochastic fluctuations in intracellular genetic regulatory networks based on single-cell-level experiments. In order to understand the fluctuations associated with gene expression in circadian rhythm networks, it is important to model the interactions of transcriptional factors with the E-boxes in the promoter regions of some of the genes. The pertinent aspects of a near-equilibrium theory that integrates the thermodynamic and particle-dynamic characteristics of intracellular molecular fluctuations are discussed, and the theory is extended by using the theory of stochastic differential equations. We then model the

Department of Molecular Biosciences, Centre for Advanced Computational Solutions (C-fACS), Lincoln University, Lincoln, Christchurch, New Zealand
Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87009-6
© 2011 Elsevier Inc. All rights reserved.
fluctuations associated with the promoter regions using general mathematical settings. We implemented the ubiquitous Gillespie algorithm, which is used to simulate stochasticity in biochemical networks, for each of the motifs. Both the theory and the Gillespie simulations gave the same results in terms of the time evolution of means and variances of molecular numbers. As biochemical reactions occur far away from equilibrium (hence the use of the Gillespie algorithm), these results suggest that the near-equilibrium theory should be a good approximation for some of the biochemical reactions.
1. Introduction

Molecular fluctuations in biochemical reaction systems are important within many fields, but we focus on the subsystems associated with gene expression in this chapter. Variations in phenotypic characteristics have been observed in the genetically identical cells of organisms ranging in complexity from bacteria to mammals, and are hypothesized to be an important factor in evolution as well as in physiological development (Kirschner and Gehart, 2005). Only in recent years have new experimental techniques in molecular biology, such as fluorescent reporters, allowed stochastic gene expression to be quantified in vivo (e.g., see Austin et al., 2006; Blake et al., 2003; Elowitz et al., 2002; Kaern et al., 2005; Ozbudak et al., 2002; Pedraza and van Oudenaarden, 2005; Raser and O'Shea, 2004). These elegant experiments, along with the associated mathematical studies, have greatly facilitated our understanding of the sources and consequences of such stochasticity in genetic regulatory networks. Such stochasticity is divided into intrinsic noise and extrinsic noise; the former is associated with, and dependent on, the gene itself, whereas extrinsic noise is due to the surrounding biochemical reactions and diffusion processes within the cell and is therefore independent of the gene. While it is important to understand the "noise" associated with the above-mentioned molecular processes in terms of extrinsic and intrinsic components, we consider these noises to be molecular fluctuations having mechanistic and thermodynamic characteristics. Some of these processes are thermodynamically irreversible (e.g., mRNAs are translated into proteins, but we have not seen proteins becoming mRNAs), and the fluctuations are intimately connected to the well-established theory of fluctuations and dissipation (Keizer, 1987).
Mathematical models and the associated methods are essential parts of gene expression research, and some of the studies discussed above incorporate probabilistic mathematical models based on the master equation approach (Kampen, 2001), the formulation of which is based on the transition probabilities of molecules from one form to another. Some other studies take a purely numerical simulation approach based on the most
popular Gillespie algorithm (Gillespie, 1977) for reaction kinetics, or one of its many variants, because it works well when molecule numbers are low and is based on sound gas-kinetics laws. Gillespie's algorithm is valid for biochemical reaction systems far away from thermodynamic equilibrium as well as near equilibrium. We have chosen molecular reaction subsystems associated with the circadian rhythms of Drosophila to compute the molecular fluctuations associated with gene expression. The biochemical dynamics of the motifs of these subsystems can be represented by chemical reactions which can be used in other applications. We hope that the elucidations within such a data-rich mechanism will enliven the applications of the theories. Our discussion here is based on a deterministic mathematical model (Xie and Kulasiri, 2007) developed to represent the transcriptional regulatory networks essential for circadian rhythmicity in Drosophila. The model incorporates the transcriptional feedback loops revealed so far in the networks of the circadian clock (the PER/TIM and VRI/PDP1 loops, where PER, TIM, VRI, and PDP1 are key proteins in the system). The model simulates sustained circadian oscillations in mRNA and protein concentrations in constant darkness, in agreement with experimental observations, and is robust over a wide range of parameter variations. The model simulates entrainment by light–dark cycles and phase response curves resembling the experimental results. The simulated per01, tim01, clkJrk, and E-box mutations are similar to those observed in the experiments (Xie and Kulasiri, 2007). (E-boxes are CACGTG enhancers in the promoter regions of genes; transcriptional factors (TFs) bind to them to activate or repress the transcription of a gene.)
One of the main differences in this model is that conventional Hill functions are not assumed to describe the regulation of genes; instead, the explicit binding and unbinding of transcription factors to E-boxes in the promoters were modeled as reactions. As the activity around any promoter region strongly influences the entire gene network, it is often important to investigate possible molecular fluctuations around this region. Modeling the binding and unbinding of the TFs to E-boxes as elementary reactions allows us to investigate the molecular fluctuations associated with E-box-mediated gene expression. The purpose of this chapter is to explore the molecular fluctuations in these promoter motifs in general mathematical settings using a near-equilibrium theory of molecular fluctuations based on irreversible thermodynamics, and to compare the results with those from the well-established Gillespie algorithm. We make use of the theory of stochastic differential equations (SDEs) to extend the near-equilibrium theoretical solutions, and the comparisons with the results from the Gillespie algorithm indicate whether we can use the near-equilibrium theory as an approximation for these reactions, which are normally considered to occur far from equilibrium. The advantage of a positive comparison, that is,
Don Kulasiri
the time evolutions of means and variances being similar in both cases, is that it would allow us to use the theory as an approximation when investigating molecular fluctuations in the network, which would reduce the computational cost significantly.
2. Theoretical Developments

When two types of molecules react and/or combine to produce another type of molecule, two physical processes must happen: the molecules must physically move through the solution, and they must collide. The driving forces acting on molecules translate into kinetic energy, and the medium acts as the dissipater of that kinetic energy; any such energy dissipation associated with small molecules generates fluctuations. A physical ensemble of these molecules exhibits behaviors that can be captured by using appropriate extensive variables. (Extensive variables depend on the extent of the system of molecules, e.g., the number of molecules, concentrations, kinetic energy, etc., whereas intensive variables do not change with the size of the system, e.g., pressure, temperature, etc.) These measurable quantities at the macroscopic level have origins at the microscopic level. Therefore, we can anticipate that a molecular-level description would justify the operational models that we develop at an ensemble level. Naturally, one could expect that the statistical moments of the variables of an ensemble would lead to meaningful models of the process we would like to observe. The total differential of the entropy of an idealized system of molecules can be written as

dS = (1/T) dU + (P/T) dV_T − (μ/T) dN,    (9.1)

where U is the internal energy; V_T the volume of the system; N the number of molecules; P the pressure; μ the chemical potential; and T the absolute temperature. Equation (9.1) is a statement for a system of molecules with well-defined physical boundaries through which mass and heat transfer can occur. The momentum of the particles is included in the internal energy term; by treating the momentum (M) explicitly, the total energy is E_T = U + M²/2m, where m is the mass of a particle. We can write Eq. (9.1) in the following form after including the momentum as a thermodynamic variable:

dS = (1/T) dE_T − (v/T)·dM + (P/T) dV_T − (μ/T) dN.    (9.2)
In Eq. (9.2), v is the row vector of particle velocities and M is the column vector of particle momenta. The total differential of entropy (dS) can be expressed in terms of partial derivatives:

dS = (∂S/∂E_T) dE_T + (∂S/∂M)·dM + (∂S/∂V_T) dV_T + (∂S/∂N) dN,    (9.3)

where ∂S/∂M denotes the row vector of ∂S/∂M_i; the derivatives ∂S/∂E_T, ∂S/∂M, ∂S/∂V_T, and ∂S/∂N are thermodynamically conjugate to the respective variables in Eq. (9.3), namely E_T, M, V_T, and N.

The Onsager principle for the linear laws of irreversible processes states that the rate of change of an extensive variable is linearly related to the difference of the corresponding thermodynamically conjugate intensive variable from its value at the thermodynamic equilibrium (Keizer, 1987). To simplify the notation in the linear laws, we introduce a_i(t) = E[x_i(t)]₀ − E[x_i(t_e)]₀ to denote the average of the difference between an extensive variable of our choice and its value at the thermodynamic equilibrium, conditional on the initial values. Then the Onsager linear laws can be written in terms of the relaxation matrix H governing the return of the mean values of the extensive variables to equilibrium:

da/dt = −Ha,    (9.4)

which has the solution

a(t) = e^(−Ht) a₀,    (9.5)
where a₀ is the initial value of the selected process. The matrix H must be positive semidefinite, as the entropy increases on average during the relaxation process. Therefore, from a thermodynamic point of view, a near-equilibrium fluctuation of an extensive variable is an exponentially decaying process from its initial value, and this fact should also be reflected in any other type of model of fluctuations. Using Eq. (9.5), we can deduce the two-time covariance function C(t₁, t₂):

C(t₁, t₂) = E[a(t₁) aᵀ(t₂)] = E[(e^(−Ht₁) a₀)(e^(−Ht₂) a₀)ᵀ] = E[a₀ a₀ᵀ] e^(−(Ht₁ + Hᵀt₂)).    (9.6)

If t₂ = t₁ + τ and the process is close to equilibrium, it is stationary, and
C(τ) = E[a₀ aᵀ(τ)] = E[a₀ a₀ᵀ] e^(−Hᵀτ).    (9.7)
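As a quick numerical illustration of the relaxation law in Eq. (9.5) and the exponentially decaying covariance in Eq. (9.7), the following Python sketch evaluates both for a scalar extensive variable. The values of H, a₀, and C(0) are illustrative choices, not quantities from the chapter.

```python
import math

# Scalar Onsager relaxation, Eq. (9.4): da/dt = -H*a, with the
# solution a(t) = exp(-H*t)*a0 of Eq. (9.5).
H = 0.5      # illustrative relaxation coefficient (1/time)
a0 = 10.0    # illustrative initial deviation from equilibrium

def a(t):
    """Mean deviation from equilibrium at time t."""
    return math.exp(-H * t) * a0

# Stationary two-time covariance, Eq. (9.7): C(tau) = C(0)*exp(-H*tau).
C0 = 4.0     # illustrative equal-time covariance

def C(tau):
    return C0 * math.exp(-H * tau)

print(a(2.0))           # the deviation has relaxed toward zero
print(C(1.0) / C(0.0))  # covariance ratio equals exp(-H)
```

For a vector of extensive variables the same structure holds with matrix exponentials in place of the scalar ones.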
As we can see, from a thermodynamic point of view, Eqs. (9.6) and (9.7) state that the covariances have an exponentially decaying character. As discussed before, any dissipative process has fluctuations in its extensive variables. Let us define the fluctuations with reference to the expected value conditioned on the initial value as

δx_i(t) = x_i(t) − E[x_i(t)]₀.    (9.8)
Then it can be shown (Keizer, 1987) that, near equilibrium, the Onsager regression hypothesis for fluctuations can be written as

d(δx)/dt = −H δx + f̃,    (9.9)
where f̃ is a random vector. This equation is the Onsager regression for fluctuations. The hypothesis is based on thermodynamic arguments, not on the behavior of the particles in a physical ensemble. To complete the Onsager picture of random fluctuations in Eq. (9.9), we need to consider it as an SDE; the f̃ term can then be defined in terms of the Wiener process to develop the simplest form of an SDE. While the Onsager regression hypothesis is based on the entropy and the coefficients that form the relaxation matrix, the Boltzmann equation, on the other hand, depends entirely on the molecular dynamics of collisions and the resulting fluctuations. It can be shown that the linearized Boltzmann equation is a special case of Onsager theory (Keizer, 1987). In the derivation of the Boltzmann equation, we have a six-dimensional space in which the position, r, and velocity, v, of the center of mass of a single molecule are defined. We call this six-dimensional space the μ-space, or molecular phase space. We can divide the six-dimensional space into small cellular volumes, and each volume element is assigned an index i = 1, 2, 3, . . . as a unique identifier. The number of molecules in volume element i, N_i(t), is the macroscopic Boltzmann variable associated with that element, and we choose the volume elements to be large enough that N_i(t) is a large number. It is assumed that binary collisions occur in the μ-space between the molecules of only two volume elements, located at (r, v) and (r, v₁). Each of these volume elements loses one molecule, and the volume elements located at (r, v′) and (r, v′₁) each gain one molecule, at the end of each collision. (The primes denote the center-of-mass velocities after the collisions.) We can define the extensive property of the number density in μ-space,
ρ(r, v, t), so that ρ(r, v, t) dr dv is the number of molecules with center-of-mass position and velocity in the ranges [r, r + dr] and [v, v + dv]. Then the Boltzmann equation gives

∂ρ/∂t = −v·∇_r ρ − F·∇_v ρ + σ̂_T ∫ g [ρ′ρ′₁ − ρρ₁] dv₁,    (9.10)

where ρ(r, v′, t) = ρ′, ρ(r, v₁, t) = ρ₁, and ρ(r, v′₁, t) = ρ′₁; F is an external force field acting in the μ-space, and ∇_r and ∇_v are the gradients with respect to r and v, respectively. The third term on the right-hand side of Eq. (9.10) is the dissipative effect of collisions; σ̂_T is a linear operator and g is the magnitude of the relative velocity (Keizer, 1987). In the absence of an external force, Eq. (9.10) can be written as

∂ρ/∂t = −v·∇_r ρ + δσ,    (9.11)
where δσ lumps the dissipation due to collisions. Unlike the Onsager linear laws, which hold only near thermodynamic equilibrium, the Boltzmann equation is valid both in the vicinity of equilibrium and away from it. However, near equilibrium the two pictures are similar, and while the Boltzmann equation is valid, strictly speaking, only for dilute gases, the Onsager linear laws are valid for any ensemble. In the vicinity of equilibrium, we can write

ρ(r, v, t) = ρᵉ(v) + Δρ(r, v, t),    (9.12)

where Δρ(r, v, t) is a small change in the μ-space density and the superscript “e” denotes equilibrium values. Equation (9.12) is a crucial assumption which makes the entire theory valid only near the thermodynamic equilibrium. By substituting Eq. (9.12) into Eq. (9.11) and ignoring the higher-order terms of Δρ, we obtain

∂Δρ/∂t = −v·∇_r Δρ + C[Δρ],    (9.13)
with C[Δρ] replacing the dissipation integral as a linear functional. It can be shown that, by adopting the Onsager hypothesis (Keizer, 1987),

∂Δρ/∂t = L[X] + f̃ᵉ(r, v, t),    (9.14)

where X = k_B ln(ρ/ρᵉ) is the local thermodynamic force in μ-space;
L[X] = −(vρᵉ/k_B)·∇_r X + ∫ L_S(v, v₁) X₁ dv₁,    (9.15)

where L_S is a linear operator (Keizer, 1987) and f̃ᵉ is a random term. The random term can now be defined by E[f̃ᵉ(r, v, t)] = 0 and

E[f̃ᵉ(r, v, t) f̃ᵉ(r′, v′, t′)] = 2k_B L_S(v, v₁) δ(r − r′) δ(t − t′).    (9.16)

In Eq. (9.14), the μ-space density increments are expressed in terms of the thermodynamic forces (X). From the form of the random term f̃ᵉ in Eq. (9.16), we see that it is a zero-mean stochastic process in the μ-space, δ-correlated in r and t, but influenced by the center-of-mass velocity through a linear operator, L_S, derived from the dissipation term, δσ, in the Boltzmann equation. This analysis shows that the Boltzmann and Onsager pictures are united near the equilibrium. Equally importantly, Eq. (9.16) justifies using a δ-correlated stochastic process to model the fluctuations. Moving away from the μ-space, we now describe the fluctuations and dissipation using the theory of stochastic processes in an effort to develop operational models of molecular fluctuations.
3. Elementary Chemical Reactions

As mentioned before, a direct collision between two molecules with velocities in the ranges v to v + dv and v₁ to v₁ + dv₁ in the spatial volume element dr located at r is a random event which changes the number of molecules in the volume element dr dv, N(r, v, t) = ρ(r, v, t) dr dv. (The symbols in bold are vectors or matrices, to distinguish the chemical reaction setting from the general theory.) If a collision occurs, both N(r, v, t) and N(r, v₁, t) decrease by one, and both N(r, v′, t) and N(r, v′₁, t) increase by one; therefore, each collision produces a deterministic effect on N(r, v, t). We define any molecular process which causes a deterministic change in an extensive variable as an elementary process. A direct collision and the reverse of the same collision can be denoted by (v, v₁) ⇌ (v′, v′₁), and this can be considered a single elementary process. However, to determine how elementary processes change the extensive variable(s), we need to evaluate the probability associated with each elementary process occurring within dt. The Boltzmann equation provides the average effect of collisions on the occupancy number, and the rate of change of N(r, v, t) due to the elementary process (v, v₁) ⇌ (v′, v′₁) is given by
R_k = σ̂_T g [ρ′ρ′₁ − ρρ₁] dv₁ dv dr,

which can be written as R_k = V_k⁺ − V_k⁻, where V_k⁺ and V_k⁻ are the transition rates. Let us define n(t) as the column vector consisting of the occupancy numbers of the volume elements. Labeling each cell by an index i, we assume that each cell is large enough for N_i(t) to be a macroscopic variable. n(t) is a stochastic process governed by the probabilities of the elementary processes. Let ω denote the vector of changes: ω_k⁺ for the forward part of the elementary process k and ω_k⁻ for the reverse part. Note that ω_k = ω_k⁻ − ω_k⁺, and that the reverse process restores the effect of the direct collision. In addition, n(t) can be changed by the streaming motion in μ-space. Thus, the probability of the total change of n by dn + dn_s can be defined according to the ansatz of Keizer (1987):

P₂(n′ = n + dn_s + dn, t + dt | n, t) =
  V_k^± dt + O(dt),                     if dn = ω_k^±;
  1 − Σ_k (V_k⁺ + V_k⁻) dt + O(dt),    if dn = 0;
  0,                                    otherwise.    (9.17)

It can be shown that this ansatz gives rise to a stochastic process with the characteristics of a stochastic diffusion process after nontrivial scaling arguments. Keizer (1987) developed a Fokker–Planck-type equation for the fluctuation of n(t) around its mean n̄(t), δn(t) ≡ n(t) − n̄(t). The resulting Fokker–Planck equation for P₂(n̄ + δn, t | n₁, t₁) is, using the summation convention on repeated indices,

∂P₂/∂t = −∂[H_ij(n̄) δn_j P₂]/∂δn_i + (1/2) ∂²[g_ij(n̄) P₂]/∂δn_i ∂δn_j,    (9.18)
with

H_ij(n̄) = ∂[Σ_k ω_ki (V̄_k⁺ − V̄_k⁻) + S_i] / ∂n̄_j,    (9.19)

and
g_ij(n̄) ≡ Σ_k ω_ki (V̄_k⁺ + V̄_k⁻) ω_kj,    (9.20)

when n̄_i(n₁, t) solves the Boltzmann equation

dn̄_i/dt = Σ_k ω_ki (V̄_k⁺ − V̄_k⁻) + S_i,    (9.21)
where S_i denotes the stream effects. (Notation: the bar indicates mean values.) The Fokker–Planck equation (Eq. (9.18)) is very similar to that of an Ornstein–Uhlenbeck process, but in Eq. (9.18), H_ij and g_ij are time dependent; therefore, Eq. (9.18) can be considered a generalization of an Ornstein–Uhlenbeck process. A diffusion process can always be expressed as an Ito SDE (Klebaner, 1998); therefore, the equivalent Ito SDE for Eq. (9.18) is

d(δn) = H(n̄) δn dt + g^(1/2)(n̄) dw,    (9.22)
where w is a vector-valued Wiener process and g^(1/2) is the square root of the matrix g, defined by (g^(1/2))² = g. As H and g are independent of δn, Eq. (9.22) is a linear SDE. Therefore, we can write

d(δn)/dt = H(n̄(n₁, t)) δn + f̃(t),    (9.23)

with E[f̃(t)] = 0 and

E[f̃_i(t) f̃_j(t′)] = g_ij(n̄(n₁, t)) δ(t − t′).    (9.24)
We see that Eq. (9.23) has the same form as Onsager's regression hypothesis, with H driven by the elementary processes, g being the two-time correlation function depending on both t and t − t′, and f̃(t) being a colored noise. The point of difference, however, is that the H matrix is not quite a relaxation matrix, as we see in the example given below.
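The scalar analogue of Eq. (9.23) with constant H and g is an Ornstein–Uhlenbeck process whose stationary variance is g/(2H). A hedged Euler–Maruyama sketch in Python (H, g, and the discretization parameters are illustrative values, not quantities from the chapter) checks this numerically:

```python
import math
import random

# Euler-Maruyama integration of d(dn) = -H*dn*dt + sqrt(g)*dW,
# the constant-coefficient scalar version of Eq. (9.23).
H, g = 1.0, 2.0
dt, n_steps, n_paths = 0.01, 1500, 300
rng = random.Random(1)

final = []
for _ in range(n_paths):
    x = 0.0
    for _ in range(n_steps):
        x += -H * x * dt + math.sqrt(g * dt) * rng.gauss(0.0, 1.0)
    final.append(x)

# Sampled variance after many relaxation times vs. the stationary
# value g/(2H) = 1.0 for these illustrative constants.
var = sum(v * v for v in final) / n_paths
print(var, g / (2 * H))
```

The same scheme extends to the vector case of Eq. (9.22) with the matrix square root g^(1/2) in place of sqrt(g).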
4. An Example of a Chemical Reaction

The following example, given by Keizer (1987), is summarized here to illustrate the application of the theory. Let us consider the general bimolecular chemical reaction, which occupies the entire system:

A + B ⇌ C + D.    (9.25)
The thermodynamic extensive variables are the internal energy, the volume, and the molecular numbers n_A, n_B, n_C, and n_D. In this reaction, one molecule each of A and B is involved in the forward reaction, whereas none of the molecules of A and B is involved in the reverse reaction. The number of C and D molecules involved in the forward reaction is zero, whereas in the reverse reaction one each of C and D is involved. We can denote these numbers as follows:

n_A⁺ = n_B⁺ = 1,  n_A⁻ = n_B⁻ = 0,  and  n_C⁺ = n_D⁺ = 0,  n_C⁻ = n_D⁻ = 1.

In this notation, the superscript “+” denotes the forward reaction and the superscript “−” denotes the reverse reaction. Using these symbols, we can write reaction (9.25) as

(n_A⁺, n_B⁺, n_C⁺, n_D⁺) ⇌ (n_A⁻, n_B⁻, n_C⁻, n_D⁻), that is, (1, 1, 0, 0) ⇌ (0, 0, 1, 1).

If we define the change in the number of molecules of species i as ω_i ≡ n_i⁻ − n_i⁺, then ω_A = ω_B = −1 and ω_C = ω_D = +1. We can apply the mass-action law to define the rates of the elementary reactions using the activities of each species:

V⁺ = V_k⁺ a_A a_B  and  V⁻ = V_k⁻ a_C a_D,

where a_i is the activity of species i and V_k^± are the rate constants. The chemical potential of a chemical species, μ_i, is a very useful intensive property related to the entropy as discussed previously (Eq. (9.2)); that is, if F_i is the intensive variable thermodynamically conjugate to the molecular number n_i, then

F_i = (∂S/∂n_i)_{E,V,n} = −μ_i/T.    (9.26)

The chemical potential is related to the thermodynamic equilibrium through the change in Gibbs free energy (ΔGᵉ) for reaction (9.25):

ΔGᵉ = μ_Cᵉ + μ_Dᵉ − μ_Aᵉ − μ_Bᵉ = 0.

Therefore, by incorporating the chemical potential through the activity a_i, we can give a thermodynamic interpretation to the matrices H and g in
Eq. (9.23). The relationship between the activity a_i and the chemical potential μ_i is given by

a_i = exp[(μ_i − μ_i⁰)/k_B T] = exp(−μ_i⁰/k_B T) exp(−F_i/k_B),

where μ_i⁰ is the standard-state chemical potential, k_B is the Boltzmann constant, and T is the temperature. (“exp” denotes the exponential function.) Thus, the transition rates can be expressed as

V⁺ = V_k⁺ exp[−(μ_A⁰ + μ_B⁰)/k_B T] exp[−(F_A + F_B)/k_B],

and

V⁻ = V_k⁻ exp[−(μ_C⁰ + μ_D⁰)/k_B T] exp[−(F_C + F_D)/k_B].

We can define

Ω⁺ = V_k⁺ exp[−(μ_A⁰ + μ_B⁰)/k_B T]  and  Ω⁻ = V_k⁻ exp[−(μ_C⁰ + μ_D⁰)/k_B T],

because both the rate constants and the standard-state chemical potentials are independent of composition. It can also be shown that Ω⁺ = Ω⁻ ≡ Ω, which is called microscopic reversibility (Keizer, 1987). Then the transition rates for this reaction can be written as

V⁺ = Ω exp[−(F_A + F_B)/k_B],    (9.27)

and

V⁻ = Ω exp[−(F_C + F_D)/k_B].    (9.28)
However, we can see that

F_A + F_B = Σ_j n_j⁺ F_j  and  F_C + F_D = Σ_j n_j⁻ F_j.

Therefore, we can write Eqs. (9.27) and (9.28) together as

V^± = Ω exp[−Σ_j n_j^± F_j / k_B].    (9.29)
By using this general form of the transition rates, we can write the basic postulates of the theory as

dn̄_i/dt = Σ_k ω_ki Ω_k {exp[Σ_j n_kj⁺ μ_j(n̄)/k_B T] − exp[Σ_j n_kj⁻ μ_j(n̄)/k_B T]} ≡ R_i(n̄, t) + S_i(n̄, t),    (9.30)

d(δn_i)/dt = H_ij(n̄, t) δn_j + f̃_i,    (9.31)

where

δn_i(t) ≡ n_i(t) − n̄_i(t)  and  H_ij(n̄, t) ≡ ∂R_i(n̄, t)/∂n̄_j,    (9.32)

with

E[f̃(t)] = 0,    (9.33)

E[f̃_i(t) f̃_j(t′)] = g_ij(n̄) δ(t − t′),    (9.34)

and

g_ij(n̄) = Σ_k ω_ki Ω_k ω_kj {exp[Σ_j n_kj⁺ μ_j(n̄)/k_B T] + exp[Σ_j n_kj⁻ μ_j(n̄)/k_B T]}.    (9.35)

To summarize, Eqs. (9.30)–(9.35) provide a mechanistic statistical theoretical framework for investigating the molecular fluctuations around the mean. In this theory, the H matrix plays an important role: it is the gradient of the rate of change of the mean molecular numbers with respect to the mean numbers of molecules. The theory outlined above can now be used to investigate the molecular fluctuations.
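Because H_ij = ∂R_i/∂n̄_j, the H matrix can be evaluated numerically for any mass-action network by finite differences. The following is a hedged Python sketch for the bimolecular example A + B ⇌ C + D; the rate constants and the mean state are illustrative values, not quantities from the chapter.

```python
# Finite-difference H matrix, H_ij = dR_i/dn_j, for A + B <-> C + D
# with mass-action transition rates V+ = kp*nA*nB and V- = km*nC*nD.
kp, km = 0.01, 0.02                 # illustrative rate constants
omega = [-1.0, -1.0, 1.0, 1.0]      # omega_i = n_i^- - n_i^+

def R(n):
    """Mean rate of change of each species, R_i = omega_i*(V+ - V-)."""
    nA, nB, nC, nD = n
    r = kp * nA * nB - km * nC * nD
    return [w * r for w in omega]

def H(n, h=1e-6):
    """Central-difference Jacobian of R with respect to n."""
    m = len(n)
    out = [[0.0] * m for _ in range(m)]
    for j in range(m):
        up = list(n); up[j] += h
        dn = list(n); dn[j] -= h
        Ru, Rd = R(up), R(dn)
        for i in range(m):
            out[i][j] = (Ru[i] - Rd[i]) / (2 * h)
    return out

n_bar = [100.0, 80.0, 50.0, 40.0]   # illustrative mean state
Hm = H(n_bar)
print(Hm[0][0])   # dR_A/dn_A = -kp*nB = -0.8
```

Each column of H is the stoichiometric vector ω scaled by the derivative of V⁺ − V⁻ with respect to the corresponding species number.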
5. Activation of Transcription Factors

Before binding to E-boxes, TFs are activated in a number of ways, including phosphorylation and dimerization. This process can be generalized as an elementary unimolecular reaction, A ⇌ B, where B represents the activated form of A. We assume that the total number of activated and nonactivated TFs, n, is constant (n = n_A + n_B). This reaction can be written as (n_A⁺, n_B⁺) ⇌ (n_A⁻, n_B⁻), that is, (1, 0) ⇌ (0, 1); therefore, ω_A = −1 and ω_B = 1. Equation (9.30) of the theory now takes the form

dn̄_B/dt = −dn̄_A/dt = Ω {exp[μ_A(n̄_A)/k_B T] − exp[μ_B(n̄_B)/k_B T]}.    (9.36)

The chemical potential can now be written as

μ_i(n̄) = μ_i⁰(T) + k_B T ln(ρ̄_i),

where ρ̄_i is the number density of the ith species, ρ̄_i = n̄_i/V. By defining the rate constants

k⁺ = (Ω/V) exp(μ_A⁰/k_B T)  and  k⁻ = (Ω/V) exp(μ_B⁰/k_B T),
we can rewrite Eq. (9.36) as

dn̄_B/dt = −dn̄_A/dt = k⁺ n̄_A − k⁻ n̄_B.    (9.37)
At equilibrium, dn̄_A/dt = dn̄_B/dt = 0; therefore n̄_Aᵉ/n̄_Bᵉ = k⁻/k⁺, where the superscript “e” denotes values at equilibrium. Given that n = n̄_Aᵉ + n̄_Bᵉ, we have n̄_Bᵉ = k⁺n/(k⁺ + k⁻). Thus, Eq. (9.37) can be solved:

n̄_B(t) = n̄_Bᵉ + exp(−λt) (n̄_B⁰ − n̄_Bᵉ),
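The closed-form mean n̄_B(t) = n̄_Bᵉ + (n̄_B⁰ − n̄_Bᵉ)e^(−λt) can be checked against a direct numerical integration of Eq. (9.37). A Python sketch using the rate constants and initial values quoted for the Fig. 9.1 run:

```python
import math

# Parameters of the Fig. 9.1 run: nA0 = 150, nB0 = 0, k+ = 0.4, k- = 0.2.
kp, km, n = 0.4, 0.2, 150.0
lam = kp + km
nBe = kp * n / lam                  # equilibrium value k+ n/(k+ + k-) = 100

def nB_exact(t, nB0=0.0):
    """Closed-form solution of Eq. (9.37)."""
    return nBe + (nB0 - nBe) * math.exp(-lam * t)

def nB_euler(t, nB0=0.0, dt=1e-4):
    """Forward-Euler integration of d(nB)/dt = k+ nA - k- nB, nA = n - nB."""
    nB = nB0
    for _ in range(int(t / dt)):
        nB += (kp * (n - nB) - km * nB) * dt
    return nB

print(nB_exact(5.0), nB_euler(5.0))   # both close to 95.0
```

The two trajectories agree to well within the Euler discretization error, as expected for a linear relaxation.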
where n̄_B⁰ is the initial value of n̄_B and λ = k⁺ + k⁻. The SDE governing the molecular fluctuation δn_B is

d(δn_B)/dt = −λ δn_B + f̃_B,

where the random term is defined by

E[f̃_i(t) f̃_j(t′)] = g(n̄_B(t)) δ(t − t′),

and g can be derived from Eq. (9.35):

g = (k⁻ − k⁺) n̄_B(t) + k⁺ n.

We solve this model for δn_B and for the variance of δn_B, s(n̄_B⁰, t), conditioned on the initial conditions, using Ito SDE theory (Klebaner, 1998); here we briefly illustrate that Ito integration leads to the required results. As mentioned before, the SDE can be written as

d(δn_B) = −λ δn_B dt + g^(1/2)(n̄_B(t)) dw.    (9.38)

g can now be expressed as a function of t:

g(t) = k⁺n + (k⁻ − k⁺) [n̄_Bᵉ + (n̄_B⁰ − n̄_Bᵉ) exp(−λt)].    (9.39)
Therefore,

δn_B = ∫₀ᵗ (−λ δn_B) ds + ∫₀ᵗ g^(1/2)(s) dw(s),    (9.40)

which is a linear SDE having a unique solution for the well-defined function g(t) (see Klebaner, 1998). The solution of the linear SDE based on Ito integration is (see Klebaner, 1998)

δn_B = exp(−λt) [δn_B(0) + ∫₀ᵗ g^(1/2)(s) exp(λs) dw(s)].    (9.41)

The integral should be interpreted as an Ito integral. However, δn_B(0) = n_B(0) − n̄_B(0) = 0; therefore,

δn_B = exp(−λt) ∫₀ᵗ g^(1/2)(s) exp(λs) dw(s).    (9.42)
Further, from the mean-zero property of Ito integrals, E[δn_B] = 0. Let s(t) = E[δn_B(t) δn_B(t)]; then

s(t) = E[(exp(−λt) ∫₀ᵗ g^(1/2)(s) exp(λs) dw(s))²] = exp(−2λt) E[(∫₀ᵗ g^(1/2)(s) exp(λs) dw(s))²].

From the isometry property of Ito integrals,

s(t) = exp(−2λt) ∫₀ᵗ E[g(s)] exp(2λs) ds = exp(−2λt) ∫₀ᵗ g(s) exp(2λs) ds.

Using Eq. (9.39) for g(t), after algebraic manipulation we can show that the variance is

s(t) = (k⁻ n̄_Bᵉ/λ) (1 − exp(−2λt)) + ((k⁻ − k⁺)(n̄_B⁰ − n̄_Bᵉ)/λ) (exp(−λt) − exp(−2λt)).
As we can see from the derivations above, the molecular fluctuation of B can be characterized fully using the initial conditions and the rate constants k⁻ and k⁺. These results are valid away from the initial conditions, and one can observe that as t → ∞, s(t) → k⁻n̄_Bᵉ/λ; that is, s(∞) = k⁻n̄_Bᵉ/λ = n̄_Aᵉ n̄_Bᵉ/n. Near the equilibrium, we obtain the variance s(t) = sᵉ(1 − exp(−2λt)). The fluctuations and the variance of the fluctuations decay exponentially, and the rate of decay depends on the parameters of the system; any temperature changes in the system are reflected through the rate constants. We have implemented the Gillespie Monte Carlo simulation algorithm (Gillespie, 1977) for the reaction A ⇌ B (for brevity, we do not provide the implementation) and compared it with the theoretical results. Figure 9.1 shows the comparisons between the two methods for B; the system was simulated for 20 days, and the means and variances were computed using 10,000 realizations. The initial values are n_A⁰ = 150 and n_B⁰ = 0, and the rate coefficients are k⁺ = 0.4 and k⁻ = 0.2. The agreement between the two methods for the means and variances is excellent, and this agreement holds for other initial values and rate constants.
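The Gillespie implementation that the text omits for brevity can be sketched as follows. This is a hedged reconstruction in Python using the Fig. 9.1 parameters; to keep the run fast it uses far fewer realizations than the 10,000 used in the text, so the sampled moments only approximate the theoretical ones.

```python
import random

# Gillespie stochastic simulation of A <-> B (k+ = 0.4, k- = 0.2,
# nA0 = 150, nB0 = 0), compared with the theoretical long-time limits
# nB^e = k+ n/(k+ + k-) and s(inf) = nA^e nB^e / n.
kp, km = 0.4, 0.2
n_total, t_end, n_runs = 150, 10.0, 400
rng = random.Random(7)

def sample_nB(t_end):
    """One Gillespie realization; returns nB at time t_end."""
    nA, nB, t = n_total, 0, 0.0
    while True:
        a1, a2 = kp * nA, km * nB        # propensities of A->B and B->A
        t += rng.expovariate(a1 + a2)    # exponential waiting time
        if t > t_end:
            return nB
        if rng.random() * (a1 + a2) < a1:
            nA, nB = nA - 1, nB + 1      # A -> B
        else:
            nA, nB = nA + 1, nB - 1      # B -> A

samples = [sample_nB(t_end) for _ in range(n_runs)]
mean_nB = sum(samples) / n_runs
var_nB = sum((s - mean_nB) ** 2 for s in samples) / n_runs

lam = kp + km
nBe = kp * n_total / lam                 # theoretical mean, 100
se = (n_total - nBe) * nBe / n_total     # nA^e nB^e / n, about 33.3
print(mean_nB, nBe)
print(var_nB, se)
```

Note that the total propensity a1 + a2 is always positive here, since nA + nB = 150 is conserved.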
Figure 9.1 (A) Mean number of particles for B using the theory and the Gillespie method; (B) comparisons for variances.
6. Binding and Unbinding of TFs to E-Boxes

As we discussed before with reference to circadian rhythms (Xie and Kulasiri, 2007), the region upstream of a promoter (the promoter region) in a structural gene contains a number of E-boxes. These E-boxes are
where the proteins (TFs) bind either to activate or to repress the transcription of the gene. Assuming independent binding at each E-box, we can model this situation as the reaction

TF + EB ⇌ TEB,    (9.43)

where TF is the protein involved, EB is the E-box segment, and TEB is the protein-bound E-box. As this situation occurs in a large enough number of cells, we treat the dynamics of the interaction macroscopically using the theory; even within a single cell, there may be a large enough number of E-boxes in a gene to justify this approach. As the total number of E-boxes in a promoter region is fixed, we can assume n_EB + n_TEB = n = const., where n_EB is the number of free E-boxes and n_TEB is the number of E-boxes bound to the transcription factor TF. The number of TF molecules (n_TF) is assumed to be unknown, as is often the case in a cell. We can write the chemical reaction as

(n_EB⁺, n_TF⁺, n_TEB⁺) ⇌ (n_EB⁻, n_TF⁻, n_TEB⁻),

where

n_EB⁺ = 1, n_EB⁻ = 0;  n_TF⁺ = 1, n_TF⁻ = 0;  n_TEB⁺ = 0, n_TEB⁻ = 1.

Therefore, reaction (9.43) can be written as (1, 1, 0) ⇌ (0, 0, 1), and ω_EB = ω_TF = −1, ω_TEB = 1. The forward and backward rate constants, in the usual notation, can be derived as

k⁺ = (Ω/V²) exp(μ_EB⁰/k_B T) exp(μ_TF⁰/k_B T),

and

k⁻ = (Ω/V) exp(μ_TEB⁰/k_B T).
With the restriction n_EB + n_TEB = n, the following equations are obtained (the bar indicates mean values):

dn̄_EB/dt = −k⁺ n̄_EB n̄_TF + k⁻ n̄_TEB,    (9.44)

dn̄_TF/dt = dn̄_EB/dt = −dn̄_TEB/dt,    (9.45)
and

n̄_TF(t) = n̄_EB(t) + (n̄_TF⁰ − n̄_EB⁰),    (9.46)

where the superscript “0” indicates the initial values of the variables. We solve Eq. (9.45) analytically; the only valid solution is

n̄_EB(t) = [φ D₀ exp(−√D t) + n̄_EBᵉ] / [1 − D₀ exp(−√D t)],    (9.47)

where

φ = −[k⁺(n̄_TF⁰ − n̄_EB⁰) + k⁻ + √D] / (2k⁺),

D = [k⁺(n̄_TF⁰ − n̄_EB⁰) + k⁻]² + 4k⁺k⁻n,

D₀ = [k⁺n̄_EB⁰ + k⁺n̄_TF⁰ + k⁻ − √D] / [k⁺n̄_EB⁰ + k⁺n̄_TF⁰ + k⁻ + √D],

and

n̄_EBᵉ = equilibrium value = [−k⁺(n̄_TF⁰ − n̄_EB⁰) − k⁻ + √D] / (2k⁺).

We see that n̄_EBᵉ − φ = √D/k⁺. We can see that n̄_EB is an exponential-type function approaching its limiting value, as one would expect. The rate of movement toward equilibrium is a function of the initial values of TF and EB, the rate constants, and the total number of E-boxes. As in the previous example, the molecule numbers are influenced by the equilibrium molecule numbers of the species, that is, by the thermodynamics of the reactions; the equilibrium molecule number is, in turn, a function of the initial values. Even if n̄_TF⁰ ≫ n̄_EB⁰, it can be shown that n̄_EBᵉ is always positive. We can define the fluctuation from the mean of EB as δn_EB(t) = n_EB(t) − n̄_EB(t). The SDE governing the behavior of δn_EB is given by
d(δn_EB) = j δn_EB dt + g^(1/2)(n̄_EB) dw,    (9.48)

where j is obtained by differentiating the right-hand side of Eq. (9.44) with respect to n̄_EB:

j = −2k⁺n̄_EB − k⁺(n̄_TF⁰ − n̄_EB⁰) − k⁻,

and g is given, according to Eq. (9.35), by

g = k⁺n̄_EB² + [k⁺(n̄_TF⁰ − n̄_EB⁰) − k⁻] n̄_EB + k⁻n.

Equation (9.48) can be solved by first transforming g and j into functions of t using Eq. (9.47) and then solving the resulting linear SDE. Some realizations of the solution for EB are depicted in Fig. 9.2 for a specific set of initial conditions and reaction rates. The time evolution of the variance of δn_EB can be obtained from

s(δn_EB) = E[(δn_EB)²] = E[(∫₀ᵗ j(s) δn_EB ds + ∫₀ᵗ g^(1/2)(s) dw)²],

which can be evaluated both numerically and analytically using the isometry property of Ito integrals. As before, we used the Gillespie algorithm to simulate reaction (9.43), and the comparative results are given in Fig. 9.3. We show an extreme situation (n_EB + n_TEB = n = 50; n_TF⁰ = 150; n_EB⁰ = 40, hence n_TEB⁰ = 10; and k⁺ = 0.1, k⁻ = 4) to highlight the short transient period. The system was simulated for 20 days; the mean and variance of TEB computed from 10,000 realizations are given in Fig. 9.3A and B, respectively. Even in this case, the theory and the Gillespie algorithm produce essentially the same results.
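Equation (9.44), together with the conservation relations n_EB + n_TEB = n and n_TF − n_EB = const., closes the mean dynamics and can be integrated directly. A hedged forward-Euler sketch in Python with the Fig. 9.3 rate constants (the initial n_TEB is taken as n − n_EB⁰ so the constraint holds), checking that the trajectory reaches the flux-balance equilibrium:

```python
# Forward-Euler integration of d(nEB)/dt = -k+ nEB nTF + k- nTEB with
# nTEB = n - nEB and nTF = nEB + (nTF0 - nEB0).  Parameters follow
# the Fig. 9.3 run (k+ = 0.1, k- = 4, n = 50, nTF0 = 150, nEB0 = 40).
kp, km, n = 0.1, 4.0, 50.0
nEB0, nTF0 = 40.0, 150.0
c = nTF0 - nEB0                     # conserved difference nTF - nEB

def integrate(t_end, dt=1e-5):
    nEB = nEB0
    for _ in range(int(t_end / dt)):
        nEB += (-kp * nEB * (nEB + c) + km * (n - nEB)) * dt
    return nEB

nEB_eq = integrate(2.0)             # long enough to relax (rates ~15/day)
# At equilibrium the forward and backward fluxes balance.
flux_gap = kp * nEB_eq * (nEB_eq + c) - km * (n - nEB_eq)
print(nEB_eq, flux_gap)             # nEB_eq near 12.3, flux_gap near 0
```

The small time step is used because the effective relaxation rate for these constants is of order 15 per unit time.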
Figure 9.2 Three realizations of the EB fluctuation. Both the forward and backward reaction rates are 0.4, the total number of E-boxes is 10, and the initial values of TFA and TEB are 0 and 10, respectively. Negative values indicate that current values are less than mean values.
Figure 9.3 (A) Mean number of particles for TEB using the theory and the Gillespie method; (B) comparisons for variances.
7. Binding and Unbinding of Activated TF to E-Boxes

In many situations, the TF is activated prior to binding to the E-boxes. This situation can be modeled as two elementary reactions, where TFA is the activated TF and TAB is the TFA-bound E-box:

TF ⇌ TFA,
EB + TFA ⇌ TAB.    (9.49)
By following a similar procedure, we obtain the following vectors and matrices for the theoretical equations:

N = (n_TF, n_TFA, n_TAB)ᵀ,

and

H = [ −k₁⁺,  k₁⁻,                0                ]
    [  k₁⁺,  −k₁⁻ − k₂⁺n̄_EB,    k₂⁺n̄_TFA + k₂⁻  ]
    [  0,     k₂⁺n̄_EB,          −k₂⁺n̄_TFA − k₂⁻ ].

We also have the restriction n_EB(t) + n_TAB(t) = n = const. In Eq. (9.49),

k₁⁺ = (Ω₁/V) exp(μ_TF⁰/k_B T),
k₁⁻ = (Ω₁/V) exp(μ_TFA⁰/k_B T),
k₂⁺ = (Ω₂/V²) exp(μ_EB⁰/k_B T) exp(μ_TFA⁰/k_B T),

and

k₂⁻ = (Ω₂/V) exp(μ_TAB⁰/k_B T).
Note that the Ω values differ between the two reactions. The fluctuations of N relative to the mean can now be obtained from the SDE

d(δN) = H δN dt + g^(1/2) dw,    (9.50)

where w is a column vector of three independent Wiener processes and g is given by

g_ij = Σ_k ω_ki (V̄_k⁺ + V̄_k⁻) ω_kj,

where, for the two elementary reactions,

ω₁ = (−1, 1, 0)ᵀ
and

ω₂ = (0, −1, 1)ᵀ.

We know that

V̄₁⁺ = k₁⁺ n̄_TF,  V̄₁⁻ = k₁⁻ n̄_TFA,  V̄₂⁺ = k₂⁺ n̄_EB n̄_TFA,  V̄₂⁻ = k₂⁻ n̄_TAB.

Therefore,

g₁₁ = k₁⁺ n̄_TF + k₁⁻ n̄_TFA,
g₃₃ = k₂⁺ n̄_EB n̄_TFA + k₂⁻ n̄_TAB,
g₂₂ = g₁₁ + g₃₃ = k₁⁺ n̄_TF + k₁⁻ n̄_TFA + k₂⁺ n̄_EB n̄_TFA + k₂⁻ n̄_TAB,
g₁₂ = g₂₁ = −g₁₁ = −k₁⁺ n̄_TF − k₁⁻ n̄_TFA,
g₂₃ = g₃₂ = −g₃₃ = −k₂⁺ n̄_EB n̄_TFA − k₂⁻ n̄_TAB,
Figure 9.4 Mean molecular numbers of TF along with three realizations of instantaneous values. The forward and backward reaction rates for the first reaction are 0.4; the forward and backward reaction rates for the second reaction are 0.2; the initial conditions for TF, TFA, and TAB are 40, 0, and 10; and n is 10.
and

g₁₃ = g₃₁ = 0.

As g is positive semidefinite and symmetric, it has a well-defined square root. Figure 9.4 shows the time evolution of the mean of TF, with three realizations of TF superimposed. We implemented the Gillespie algorithm for this reaction system, and the comparisons of the means and variances are given in Fig. 9.5 for n_EB + n_TAB = n = 50; n_TF⁰ = 150; n_TFA⁰ = 0; n_EB⁰ = 40, hence n_TAB⁰ = 10; and k₁⁺ = 0.4, k₁⁻ = 0.2, k₂⁺ = 0.1, k₂⁻ = 4. As before, the comparisons
Figure 9.5 (A) Mean number of particles for TFA using the theory and the Gillespie method; (B) comparisons for variances.
are based on 10,000 realizations. We chose to show only TFA because it couples the two reactions, and the results show excellent agreement.
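The structure of the g matrix derived above can be verified numerically by assembling it from the stoichiometric vectors ω₁, ω₂ and the transition rates. A Python sketch with the Fig. 9.5 rate constants and an illustrative mean state (the state values themselves are not from the chapter):

```python
# Assemble g_ij = sum_k omega_ki (V_k+ + V_k-) omega_kj for the
# two-reaction system TF <-> TFA, EB + TFA <-> TAB, and check the
# identities stated in the text: g is symmetric, g22 = g11 + g33,
# g12 = -g11, g23 = -g33, and g13 = 0.
k1p, k1m, k2p, k2m = 0.4, 0.2, 0.1, 4.0   # Fig. 9.5 rate constants
n = 50.0
nTF, nTFA, nTAB = 60.0, 30.0, 20.0        # illustrative mean state
nEB = n - nTAB

V_sum = [k1p * nTF + k1m * nTFA,          # reaction 1: V1+ + V1-
         k2p * nEB * nTFA + k2m * nTAB]   # reaction 2: V2+ + V2-
omega = [[-1.0, 1.0, 0.0],                # TF <-> TFA
         [0.0, -1.0, 1.0]]                # EB + TFA <-> TAB

g = [[sum(omega[k][i] * V_sum[k] * omega[k][j] for k in range(2))
      for j in range(3)] for i in range(3)]

print(g[0][0], g[2][2])   # g11 = 30.0, g33 = 170.0 for this state
print(g[1][1])            # g22 = g11 + g33 = 200.0
print(g[0][2])            # g13 = 0.0
```

Because g is symmetric and positive semidefinite by construction, its square root in Eq. (9.50) is always well defined.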
8. Conclusions

We have reviewed the pertinent aspects of a mechanistic statistical theory which integrates the thermodynamic and particle-dynamic aspects of intracellular molecular fluctuations, and we have extended the theory using the theory of SDEs. The theory was used to model the fluctuations in the binding and unbinding of TFs to E-boxes, a common motif in the transcriptional regulation of circadian rhythms. The stochastic variations in biochemical reactions are ubiquitously simulated by the Gillespie algorithm; we compared the theoretical results with those from the Gillespie algorithm for these reaction systems and found them to be almost identical. Even though the theory is a near-equilibrium theory, once extended with SDEs it can be used for biochemical reactions in the motifs described here, which are usually considered to occur away from equilibrium.
ACKNOWLEDGMENTS

Funding was provided by Lincoln University LUREST grants. Yao He's assistance is appreciated.
REFERENCES

Austin, D. W., Allen, M. S., McCollum, J. M., Dar, R. D., Wilgus, R. J. R., Sayler, G. S., Samatova, N. F., Cox, C. D., and Simpson, M. L. (2006). Gene network shaping of inherent noise spectra. Nature 439(7076), 608–611.
Blake, W. J., Kaern, M., Cantor, C. R., and Collins, J. J. (2003). Noise in eukaryotic gene expression. Nature 422(6932), 633–637.
Elowitz, M. B., Levine, A. J., Siggia, E. D., and Swain, P. S. (2002). Stochastic gene expression in a single cell. Science 297(5584), 1183–1186.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361.
Kaern, M., Elston, T. C., Blake, W. J., and Collins, J. J. (2005). Stochasticity in gene expression: From theories to phenotypes. Nat. Rev. Genet. 6(6), 451–464.
van Kampen, N. G. (2001). Stochastic Processes in Physics and Chemistry. Elsevier Science, Amsterdam, Netherlands.
Keizer, J. (1987). Statistical Thermodynamics of Nonequilibrium Processes. Springer-Verlag, New York.
Kirschner, M. W., and Gerhart, J. C. (2005). The Plausibility of Life: Resolving Darwin's Dilemma. Yale University Press, New Haven, CT, USA.
Klebaner, F. C. (1998). Introduction to Stochastic Differential Equations. Academic Press, London.
Ozbudak, E. M., Thattai, M., Kurtser, I., Grossman, A. D., and van Oudenaarden, A. (2002). Regulation of noise in the expression of a single gene. Nat. Genet. 31(1), 69–73.
Pedraza, J. M., and van Oudenaarden, A. (2005). Noise propagation in gene networks. Science 307(5717), 1965–1969.
Raser, J. M., and O'Shea, E. K. (2004). Control of stochasticity in eukaryotic gene expression. Science 304(5678), 1811–1814.
Xie, Z., and Kulasiri, D. (2007). Modelling of circadian rhythms in Drosophila incorporating the interlocked PER/TIM and VRI/PDP1 feedback loops. J. Theor. Biol. 245, 290–304.
CHAPTER TEN

Probing the Input–Output Behavior of Biochemical and Genetic Systems: System Identification Methods from Control Theory

Jordan Ang,*,† Brian Ingalls,‡ and David McMillen*,†

Contents
1. Introduction
2. System Identification Applied to a G-Protein Pathway
2.1. Construction of the frequency response
2.2. Interpretation of the frequency response
3. System Identification
3.1. Transfer function models
3.2. Applying the procedure to the G-protein pathway
3.3. Examples of experimental implementation
4. Conclusion
References
Abstract

A key aspect of the behavior of any system is the timescale on which it operates: when inputs change, do responses take milliseconds, seconds, minutes, hours, days, months? Does the system respond preferentially to inputs at certain timescales? These questions are well addressed by the methods of frequency response analysis. In this review, we introduce these methods and outline a procedure for applying this analysis directly to experimental data. This procedure, known as system identification, is a well-established tool in engineering systems and control theory and allows the construction of a predictive dynamic model of a biological system in the absence of any mechanistic details. When studying biochemical and genetic systems, the required experiments are not standard laboratory practice, but with advances in both our ability to measure system outputs (e.g., using fluorescent reporters) and our ability to generate

* Department of Chemical and Physical Sciences, University of Toronto Mississauga, Mississauga, Ontario, Canada
† Institute for Optical Sciences, University of Toronto Mississauga, Mississauga, Ontario, Canada
‡ Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, Canada
Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87010-2. © 2011 Elsevier Inc. All rights reserved.
precise inputs (with microfluidic chambers capable of changing cells’ environments rapidly and under fine control), these frequency response methods are now experimentally practical for a wide range of biological systems, as evidenced by a number of successful recent applications of these techniques. We use a yeast G-protein signaling cascade as a running example, illustrating both theoretical concepts and practical considerations while keeping mathematical details to a minimum. The review aims to provide the reader with the tools required to design frequency response experiments for their own biological system and the background required to analyze and interpret the resulting data.
1. Introduction

Recent advances in experimental techniques have allowed ever-increasing scrutiny of the dynamic behavior of cellular mechanisms. In many cases, such as development and response to environmental changes, the dynamic nature of a biological process is the key to its function. Standard tools of biological data analysis are well equipped to address steady-state behavior (e.g., dose–response curves), but to unravel the nature of dynamic processes (e.g., transient and oscillatory responses), new analytic methods are required.

A fundamental aspect of a dynamic process is the timescale on which it acts. Just as biological systems span a wide range of spatial scales (from nanometer-wide proteins to kilometer-wide ecosystems), biological processes span a wide range of temporal scales, from ligand association (seconds) to protein expression (minutes to hours) to organismal development (months to years) to ecosystem alteration (years to millennia). Investigation of any given process often reveals a network of simultaneous events occurring over a range of timescales. The resulting dynamic network behavior is often not readily apparent from the nature of the individual components.

The engineering community has long been dealing with (and exploiting) the dynamic behavior of mechanical and electrical systems (Åström and Murray, 2008; Haykin and Van Veen, 2005). One of the basic notions the engineers have derived for addressing time-varying processes is the frequency response, which provides a concise characterization of the manner in which a system responds to perturbations at various timescales. The frequency response relies on frequency domain analysis, which applies directly only to linear systems. A system is linear if it acts additively: the response to two simultaneous perturbations is equivalent to the sum of the responses to the individual perturbations. Linearity is a helpful conceit; no real systems are perfectly linear.
(In particular, biological systems typically involve saturation, rather than additivity, of response.) Nevertheless, systems often exhibit approximately linear behavior, and in fact all systems behave approximately linearly when exposed to perturbations that are
sufficiently small. In practice, many systems exhibit behavior that is close to linear. This is particularly true for self-regulating systems that tend to operate around a specific nominal condition; this applies both to engineered automatic feedback systems and to homeostatic biological mechanisms.

The frequency response of a system can be assessed by observing the response of the system to specific inputs. This analysis is simplest when those inputs are sinusoidal oscillations at various frequencies; the corresponding responses indicate the behavior of the system over a range of timescales. In some contexts (e.g., electrical circuits), the generation of oscillatory signals is straightforward. In contrast, such signals can be difficult to produce in a biological setting, especially in the case of chemical signals (though Block et al., 1983 is an early example of work applying oscillatory signals to analyze a biochemical system). Recent advances in microfluidic technologies have placed the production of such inputs, and of other oscillatory inputs such as square waves, within broader reach (Beebe et al., 2002; Bennett and Hasty, 2009; Bennett et al., 2008; Hersen et al., 2008; Mettetal et al., 2008; Shimizu et al., 2010).

In engineering applications, the frequency response can be efficiently assessed in a single experiment using an input that "excites" the system at multiple frequencies simultaneously (Ljung, 1999). (Standard examples are white noise, steps, or approximations of the "impulse function," an infinitely tall, infinitely short pulse.) The low signal-to-noise ratios (SNRs) inherent in molecular biology make these experiments less useful in this setting (although see Block et al., 1982 for a successful implementation of a chemical impulse). The frequency response provides valuable insight into a system's dynamic behavior. This includes a characterization of the bandwidth of the system: the fastest timescale on which the system can act.
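To make the idea of frequency response estimation concrete, here is a sketch that drives a generic first-order system (dy/dt = −ay + bu, not any system from this chapter) with a sum of sinusoids and reads off the gain at each excited frequency from the ratio of output to input Fourier amplitudes; all parameter values are arbitrary.

```python
import numpy as np

# Illustrative first-order test system dy/dt = -a*y + b*u.
a, b = 1.0, 2.0
dt, t_total = 0.001, 200.0
t = np.arange(0.0, t_total, dt)

freqs = np.array([0.05, 0.5, 5.0])                       # excitation frequencies (Hz)
u = np.sin(2 * np.pi * freqs[:, None] * t).sum(axis=0)   # multisine input

# Forward-Euler simulation of the system response.
y = np.zeros_like(t)
for n in range(len(t) - 1):
    y[n + 1] = y[n] + dt * (-a * y[n] + b * u[n])

# Discard the transient, then compare Fourier amplitudes of output and input.
n0 = len(t) // 2                          # keep the last 100 s (steady state)
U, Y = np.fft.rfft(u[n0:]), np.fft.rfft(y[n0:])
bins = np.fft.rfftfreq(len(t) - n0, dt)
for f in freqs:
    k = np.argmin(np.abs(bins - f))       # FFT bin closest to this frequency
    gain_est = np.abs(Y[k]) / np.abs(U[k])
    gain_true = b / np.hypot(a, 2 * np.pi * f)   # analytic first-order gain
    print(f"{f} Hz: estimated {gain_est:.3f}, analytic {gain_true:.3f}")
```

With a noise-free simulation, one run recovers the gain at every excited frequency at once; the SNR caveat in the text is precisely that real biological measurements rarely permit this.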
Moreover, the frequency response can be used to generate a transfer function model of the system’s input–response behavior. A transfer function model allows prediction of the response of the system to arbitrary inputs (provided linearity of behavior is adequately maintained). A transfer function model does not address the specific mechanisms (e.g., biochemical or genetic) underlying the input–response behavior; it is a “black box” model. The term system identification refers to the process of constructing such a model from observation of dynamic responses. In this review, we will introduce and illustrate the system identification process and highlight cases in which the method has been successfully applied experimentally. In Section 2, we present a running application—a yeast G-protein signaling cascade—and use a mathematical model of this system to illustrate how the frequency response can be assessed from idealized experimental observation of the system’s response to sinusoidal inputs, as well as how the frequency response provides insight into system behavior. In Section 3, we describe how a frequency response could be generated from real experimental observations, and then extend that analysis to the construction of a transfer function model. We close Section 3 with a brief
discussion of examples of experimental applications of frequency response methods to biochemical systems. Finally, in Section 4, we review successful biological applications of systems identification to experimental data, and conclude with a discussion of the role that the methods we describe here may play in future efforts to understand biochemical and genetic systems.
2. System Identification Applied to a G-Protein Pathway

Heterotrimeric G-protein signaling systems are a common component of eukaryotic signal transduction pathways, and are of acute clinical interest as common drug targets (McCudden et al., 2005; Oldham and Hamm, 2008; Yi et al., 2003). The G-protein component of the pheromone response pathway in the budding yeast Saccharomyces cerevisiae is a well-characterized example of this family of pathways; both the kinetic details and the dynamic behavior of the pathway have been studied (Yi et al., 2003). We will use this pathway, shown schematically in Fig. 10.1, to illustrate system identification techniques. In this section, we will examine the behavior of the mathematical model constructed by Yi et al. (2003) as an idealization of an experimental analysis. This will allow a clean presentation of the concepts underlying the frequency response.
Figure 10.1 G-protein signaling pathway in yeast (Yi et al., 2003). The pathway input is the level of extracellular ligand, L. The ligand binds to receptor R to form the ligand– receptor complex RL. This complex catalyzes the association of GTP with the Ga subunit of the heterotrimeric G-protein complex, and the concomitant dissociation of the Ga and Gbg subunits. Ga-GTP activates downstream activity until the associated GTP is dephosphorylated to GDP, after which the G-protein complex reforms.
2.1. Construction of the frequency response

As described in Section 1, the frequency response can be assessed from time-series observation of the system's response to oscillatory input signals. In the case of the pheromone response G-protein pathway, the input (extracellular ligand) is the level of alpha factor pheromone (assuming the target cells are mating type a). Figure 10.2 shows a model-generated dose–response curve indicating the abundance of the pathway output, active Ga-GTP, as a function of extracellular alpha factor concentration (L in Fig. 10.1). This curve, which matches the experimental results of Yi et al. (2003), indicates that over its active range, the signal transduction pathway is characterized by a near-linear response centered at an alpha factor level of about 1 nM. Consequently, we choose 1 nM as a nominal input level, and address the behavior of the system as the ligand input varies around this nominal value. From the curve, an alpha factor concentration of 1 nM corresponds to a Ga-GTP abundance of 510 molecules per cell.

Linear systems and sinusoidal inputs share a special relationship that underlies all frequency domain analysis: the steady-state response of a linear system to a sinusoidal input is a sinusoidal output of the same frequency. This does not hold for nonlinear systems, or for other classes of inputs.
Figure 10.2 Model-generated G-protein pathway dose–response. The steady-state abundance (in molecules per cell) of Ga-GTP (which we take as the system’s output) is plotted against the concentration of the extracellular alpha factor ligand (the system’s input).
Figure 10.3 Simulated G-protein pathway transient response. The system input (concentration of the extracellular alpha factor ligand) is varied sinusoidally. The system’s output passes through a brief transient period, after which it settles into a sinusoidal pattern of its own. The frequency of the output matches the input frequency, but the phase (the location of the peaks and troughs) is shifted and the amplitude is different.
Figure 10.3 shows a model simulation representing the time-series measurements of an experiment in which cells are exposed to a sinusoidal oscillation in ligand level. The system response follows a short transient before settling to a steady oscillatory behavior. The frequency of oscillation of the response matches that of the input, but the output oscillations do not have the same amplitude as the input, and the two signals are out-of-phase (i.e., the peaks and troughs are not aligned). The ratio of the amplitude of response to the amplitude of input is called the system gain; the difference in the phase is called the system phase shift (or just system phase). These measures of system behavior are dependent on the frequency of the input, but (as a consequence of linearity) do not depend on its amplitude. By probing the response of the system to sinusoidal inputs at various frequencies, we reveal how these two measures (gain and phase) depend on the timescale of the input. In the subsequent dynamic analysis, we will address the response of the system in terms of the deviation of the Ga-GTP level from 510 molecules per cell, to an input described by the deviation of the alpha factor level from 1 nM. Since we are simulating model behavior, we will address the response to pure sinusoids. In many cases, sinusoidal inputs cannot be experimentally achieved, and so other oscillatory inputs (e.g., square waves) must be used.
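Gain and phase can be estimated from a measured steady-state cycle by projecting both signals onto a sine/cosine basis at the known input frequency. The sketch below uses a made-up first-order system in place of real pathway data; the function name and all parameter values are illustrative.

```python
import numpy as np

def sinusoid_fit(t, x, f):
    """Least-squares fit x(t) ~ A*sin(2*pi*f*t + phi) + offset; returns (A, phi)."""
    w = 2 * np.pi * f
    basis = np.column_stack([np.sin(w * t), np.cos(w * t), np.ones_like(t)])
    cs, cc, _ = np.linalg.lstsq(basis, x, rcond=None)[0]
    return np.hypot(cs, cc), np.arctan2(cc, cs)

# Illustrative first-order system dy/dt = -a*y + b*u driven at f = 0.2 Hz.
a, b, f = 1.0, 2.0, 0.2
dt = 0.001
t = np.arange(0.0, 60.0, dt)
u = np.sin(2 * np.pi * f * t)
y = np.zeros_like(t)
for n in range(len(t) - 1):
    y[n + 1] = y[n] + dt * (-a * y[n] + b * u[n])

keep = t >= 30.0                      # discard the transient, keep steady state
amp_u, ph_u = sinusoid_fit(t[keep], u[keep], f)
amp_y, ph_y = sinusoid_fit(t[keep], y[keep], f)
gain = amp_y / amp_u                  # amplitude ratio = system gain at f
phase_deg = np.degrees(ph_y - ph_u)   # phase shift at f
print(f"gain = {gain:.3f}, phase shift = {phase_deg:.1f} deg")
```

Because the fit uses the whole steady-state record, it is considerably more robust to measurement noise than reading peak heights and peak positions off the traces directly.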
We will defer the treatment of these more general inputs (which requires a discussion of Fourier analysis) to Section 3.

Figure 10.4 shows three simulated steady-state input–response pairs. In each case, the gain and phase can be assessed by comparing the two signals. Comparing these three responses (note the change in scale on the time axis), we see that the amplitude of the response (the deviation from nominal Ga-GTP levels) changes with input frequency, although the amplitude of the input remains the same. We see too that the phase behavior changes: at low frequency the response lies (nearly) in phase with the input; at higher frequencies the response falls significantly out-of-phase.

When this experiment is repeated over a range of frequencies, an overall picture of the system's frequency dependence emerges. This information is typically displayed in a pair of plots: graphs of the system gain and system phase as functions of frequency. Figure 10.5 shows the corresponding curves for this system, with the data from Fig. 10.4 labeled. Following standard practice, frequency is plotted on a logarithmic scale, as is gain, while phase shift is plotted in degrees. These are known as gain and phase Bode plots, named after the American engineer Hendrik Bode (1905–1982), who was the first to apply this analysis. (Engineers typically scale the gain in a Bode plot by a prefactor of 20 and report the result in decibels; they also generally report frequency in radians per second rather than hertz.) In this case, the Bode plot was generated directly from the mathematical model. In application, these curves would be fit to experimental observations. We illustrate this fitting process in Section 3.
2.2. Interpretation of the frequency response

The Bode plots provide direct insight into the dynamic behavior of the system. The phase plot contains information about the way in which the system acts when connected to other systems (in cascade or in feedback). We will not address such issues here, but in Section 3 we will find the phase plot useful as an additional constraint on the fitting of a transfer function to frequency response data. The gain plot indicates the response of the system to perturbations at various frequencies (i.e., timescales). In particular, it reveals the (frequency) filtering properties of the system. The notion of frequency filtering follows from the fact that any signal can be expressed as a combination of pure sinusoids, via Fourier decomposition.¹ As illustrated above, the gain plot shows how a system amplifies or
¹ A periodic signal can be written as a sum of sinusoids over a discrete range of frequencies, via a Fourier series. A nonperiodic signal can be expressed as an integral of sinusoids over a continuum of frequencies, via the Fourier transform.
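The footnote's claim can be checked numerically: a square wave, for instance, is recovered from its Fourier sine series term by term. A minimal sketch (the period and number of harmonics are arbitrary choices):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1000, endpoint=False)
square = np.sign(np.sin(2 * np.pi * t))          # ideal square wave, period 1 s

# Partial Fourier series of the square wave: odd harmonics with amplitude 4/(pi*n).
n_terms = 25
approx = sum(4 / (np.pi * n) * np.sin(2 * np.pi * n * t)
             for n in range(1, 2 * n_terms, 2))

err = np.max(np.abs(square - approx))
print(f"max pointwise error with {n_terms} odd harmonics: {err:.3f}")
```

Away from the discontinuities the partial sum converges quickly; the persistent ripple near the jumps is the Gibbs phenomenon.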
Figure 10.4 Simulated G-protein pathway steady-state response to sinusoidal inputs. Note the different scales on the time axes for the different plots. The input functions (light curves) have frequencies of (A) 2.5 × 10⁻⁵ Hz; (B) 10⁻³ Hz; and (C) 2.5 × 10⁻² Hz; the resulting outputs are shown as dark curves. The change in amplitude and phase is summarized as a function of frequency in Fig. 10.5. (Note that since the input and output are measured in different units, the gain does not correspond to the ratio of the amplitudes of the traces in the graph.)
[Figure 10.5 plot annotations: low-frequency gain 200; corner frequency 0.005 Hz; roll-off (slope) −2; points 4A, 4B, and 4C marked on both panels.]
Figure 10.5 Frequency response of the G-protein pathway model. These Bode plots show (A) the system gain and (B) the system phase shift, as a function of the frequency of the input signal. The gains and phases of the signals with frequencies shown in Fig. 10.4 are marked here as points 4A, 4B, and 4C, corresponding to Fig. 10.4's three panels.
Figure 10.6 Gain Bode plot for a resonant system. The system responds strongly at a frequency of 0.5 Hz, with the gain spiking upward by orders of magnitude at that point and dropping off sharply at lower and higher frequencies.
attenuates a sinusoidal input at a given frequency. When the linear system is exposed to a general input, each of its frequency components is passed to the output according to the corresponding gain, and thus this graph gives a powerful map of the system’s input–output behavior. As a first example, consider Fig. 10.6, which shows the gain plot for a system with a resonant frequency of 0.5 Hz. This plot suggests that this system will respond strongly to sinusoids at this resonant frequency, and will
Figure 10.7 Input–output pair for the resonant system whose Bode gain plot was shown in Fig. 10.6. (A) The input signal: Gaussian white noise, a random signal containing a wide range of frequencies. (B) The corresponding output signal: the resonant frequency at 0.5 Hz has been picked out of the noisy input and preferentially amplified, producing a near-sinusoidal output at the resonant frequency.
attenuate oscillatory signals at all other frequencies. Referring to Fig. 10.7, we see an example of that behavior. The input is a Gaussian white noise signal, which is composed of oscillations over a wide range of frequencies. The output, in contrast, shows only the influence of the input at the resonant frequency: the remaining components of the signal have been so attenuated that they are no longer visible on the plot. This sort of resonant system is useful in technology² and may appear in some biological oscillators as well (see, e.g., Westermark et al., 2009), but is not representative of the typical behavior of biochemical and genetic networks; we include it here as a direct illustration of filtering.

Figure 10.8 illustrates some of the filtering behaviors more commonly exhibited by cellular processes. A single input function, a low-frequency (0.16 Hz) sinusoid corrupted by higher frequency noise (Fig. 10.8A), has been passed through three separate systems. In each case, the nature of the resulting output signal can be predicted from the system's frequency response, as follows.

In the first case (Fig. 10.8B and C), the system acts as a low-pass filter. The gain plot is approximately piece-wise linear: a constant value at low frequencies, then a corner, followed by a linear descent with a slope of −1 on the log–log plot. The frequency at which the corner occurs is known as the system's corner (or cut-off, or break) frequency, and indicates the system's bandwidth. Below this frequency, the system allows all components of the
² For instance, Fig. 10.7 illustrates the phenomenon that allows a distinct radio signal to be captured from the mixture of signals that we hear as radio “static.” The tuning dial allows the listener to change the resonant frequency, thus selecting which signal will be “lifted” out of the noise.
Figure 10.8 Filtering behavior. (A) The input signal is a sinusoid with frequency 0.16 Hz, corrupted with high-frequency noise. This input has been passed through each of the systems in the second row, resulting in the output signals in the third row. (B and C) The gain plot for B shows that this system is a low-pass filter with a corner frequency of 1 Hz and a roll-off of 1. The output C shows that the high-frequency noise has been reduced, leaving the underlying sinusoidal signal more visible. (D and E) The gain Bode plot D is again a low-pass filter, this time with a lower corner frequency (10⁻² Hz). The output signal has its high-frequency components attenuated more than in panel C, resulting in a smoother sinusoid. (F and G) This system's gain Bode plot F shows that it is a band-pass filter, with a peak at 10² Hz. The frequency of the sinusoidal oscillation in the input signal (panel A) lies well below this peak, so the filter obscures the sinusoidal nature of the original signal, leaving only the higher frequency noise components. The resulting output G is relatively flat, and somewhat less noisy than the original because the highest frequency components of the noise have also been attenuated by the band-pass filter.
input to “pass” through to the output signal. High-frequency components of the input, in contrast, are significantly attenuated. Comparing the input and output signals in this case, we see that the low-frequency oscillations are retained in the output, as is the noise around that oscillation, but the noise has been smoothed: the high-frequency components have been stripped away, leaving a signal with less “wild” changes in direction.
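The smoothing action of a low-pass filter can be sketched with a discrete first-order filter; the corner frequency, noise level, and signal frequency below are illustrative stand-ins for the filters of Fig. 10.8, not their actual parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.001
t = np.arange(0.0, 30.0, dt)
signal = np.sin(2 * np.pi * 0.16 * t)                 # slow component (0.16 Hz)
noisy = signal + 0.5 * rng.standard_normal(len(t))    # plus broadband noise

# First-order low-pass filter with corner frequency fc = 1 Hz:
#   dy/dt = 2*pi*fc * (u - y), discretized with forward Euler.
fc = 1.0
alpha = 2 * np.pi * fc * dt
y = np.zeros_like(t)
for n in range(len(t) - 1):
    y[n + 1] = y[n] + alpha * (noisy[n] - y[n])

# The filter passes the 0.16 Hz sinusoid (below fc) and strips the fast noise.
print("residual noise std, input vs output:",
      np.std(noisy - signal).round(3), np.std(y - signal).round(3))
```

The output tracks the slow sinusoid with a small gain loss and phase lag while the high-frequency noise is strongly attenuated, mirroring panels B and C of Fig. 10.8.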
More specifically, the gain plot for this low-pass filter is characterized by three parameters: the gain at low frequency, the frequency at which the “corner” occurs, and the slope after the corner (the so-called “roll-off”). All three are informative about the system's behavior:

Low-frequency (DC) gain: The gain at low frequency is the degree to which constant and slowly varying inputs are amplified or attenuated. (Electrical engineers refer to this as the “DC gain,” since direct current (DC) provides a constant input.) This low-frequency gain describes, for instance, the steady-state response of the system to a step input, and is often referred to as the (local, steady-state) sensitivity of the system to the input (Ingalls, 2004).

Corner frequency: As mentioned, the corner frequency indicates the system's bandwidth: the frequency above which inputs are attenuated. For cellular processes, the bandwidth is closely related to the timescales on which component processes act (e.g., association/dissociation rates, half-lives of proteins, metabolic consumption rates). When exposed to disturbances that act on timescales considerably faster than their inherent timescales, these processes are unable to “keep up,” and instead react only to the temporal average of the perturbation. In the case of sinusoidal inputs centered around a nominal level, the signal averages to the nominal input, and so the system exhibits no dynamic response to such high-frequency signals.

Roll-off: The drop beyond the corner frequency eventually reaches a straight line with a downward slope of N decades of gain per decade of frequency on this log–log plot. The slope N is called the relative degree of the system, and is a measure of how quickly the output responds to changes in the input (the larger the relative degree, the more sluggish the response). The filters considered here have a slope of one decade per decade, and so have relative degree 1.

The effect of a change in the bandwidth is illustrated by the second system shown in Fig. 10.8D and E. This system is also a low-pass filter, but has a smaller bandwidth, and so allows fewer high-frequency components of an input signal to pass to the output. The response in Fig. 10.8E illustrates this behavior: the low-frequency sinusoid is clearly seen in the output, but this filter has removed even more of the high-frequency noise.

The gain plot of our third example (Fig. 10.8F and G) differs significantly from the low-pass filters: it shows attenuation at both low and high frequencies. This is referred to as a band-pass filter, and is characterized by the range (or band) of frequencies that it allows to pass from input to output. (The resonant system in Fig. 10.6 is also a band-pass filter, with a very tight band.) Referring to the corresponding output signal (Fig. 10.8G), we see a behavior that is complementary to the low-pass filters: the low-frequency sinusoid has been filtered out since its frequency does not match the
frequencies “passed” by the band-pass filter; only the higher frequency noise has been retained (and has been somewhat smoothed by the blocking of its highest frequencies). In technology, band-pass filters are implemented to allow downstream processes to access specific frequency ranges within input signals. Biological processes that exhibit band-pass behavior may have been selected to perform similar roles. In any case, the range of frequencies over which a biological band-pass filter acts is likely indicative of the kinds of signals to which it is attuned, and so speaks directly to its function and environment.

A third category, the high-pass filter, has essentially the opposite form and effect as a low-pass filter: the frequency response is small for low frequencies, and rises to some maximum for high frequencies. Such filters can, much like low-pass filters, be characterized by the slope on their log-scale Bode plot and the corner frequency (here representing the frequency below which the response is attenuated).

Returning to our G-protein pathway model (Fig. 10.5), we see that it displays primarily low-pass behavior, with a mild band-pass around 10⁻³ Hz. The low-frequency (DC) gain is 200 (a ratio of molecules per cell to nM), the bandwidth is 5 × 10⁻³ Hz, and the roll-off indicates a relative degree of 2. These values reveal the types of signals to which the pathway responds most strongly, and may provide insight into its evolution or biological function.

As we have seen, the gain plot provides a useful nonparametric model for the system, and could be fit using standard nonlinear regression techniques (Seber and Wild, 2003). However, the phase shift data sheds light on the construction of a more useful representation of system behavior: an input–output map known as the system transfer function, which will allow prediction of the system response to arbitrary signals.
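Given a sampled gain curve, the three descriptors can be extracted numerically. The sketch below manufactures the curve from a made-up second-order transfer function; the values are chosen to mimic, but not reproduce, the reported DC gain and corner frequency of the pathway.

```python
import numpy as np

# Illustrative second-order low-pass: G(s) = K / ((1 + s/w1)(1 + s/w2)).
K, w1, w2 = 200.0, 2 * np.pi * 0.005, 2 * np.pi * 0.05
f = np.logspace(-5, 2, 1000)                    # frequency grid (Hz)
s = 2j * np.pi * f
gain = np.abs(K / ((1 + s / w1) * (1 + s / w2)))

dc_gain = gain[0]                               # plateau value at low frequency
corner = f[np.argmin(np.abs(gain - dc_gain / np.sqrt(2)))]   # -3 dB point
# Roll-off: slope of log(gain) vs log(f) at the high-frequency end.
rolloff = np.polyfit(np.log10(f[-100:]), np.log10(gain[-100:]), 1)[0]

print(f"DC gain {dc_gain:.0f}, corner {corner:.4f} Hz, roll-off {rolloff:.2f}")
```

The recovered roll-off of about −2 corresponds to a relative degree of 2, matching the two-pole form assumed for the example.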
3. System Identification

In this section, we present the method of system identification from data, using simulated data from a stochastic simulation of the G-protein pathway model to illustrate the method. Before discussing the details of implementation, we introduce the notion of the transfer function.
3.1. Transfer function models

Most dynamic mathematical models in systems biology are based on knowledge of the kinetics of underlying chemical and genetic mechanisms (Hasty et al., 2001; Ideker et al., 2001; Kitano, 2002; Tyson et al., 2001). Here we take a complementary approach: using only our collection of
input–response data (and assuming no knowledge of mechanism), we fit a “black box” model that accurately describes the observed input–output behavior. Such a model does not directly serve to provide insight into the mechanisms of the system, but rather provides a concise, predictive representation of its input–output behavior. The model we seek to fit is called a transfer function and takes the form of a rational function: a ratio of polynomials of a complex variable (Åström and Murray, 2008). The transfer function formalism allows for a simple representation of dynamic input–output relationships. Such dynamic relationships are often captured by differential equations. For example, suppose that for a given system, the response y(t) is related to the input u(t) according to the differential equation

d²y(t)/dt² + a₁ dy(t)/dt + a₀ y(t) = b₁ du(t)/dt + b₀ u(t).   (10.1)
To avoid working with dynamic equations of this sort, we make use of the Laplace transform L, which is an operator that maps functions of time to functions of a complex variable s, so that L[y(t)] = Y(s) (Boyce and DiPrima, 2008). The Laplace transform has two key properties that we will use. The first is that it converts derivatives to products: L[(dⁿ/dtⁿ)y(t)] = sⁿY(s) (provided the signal, y, is zeroed and resting at its nominal value at t = 0). This means that the differential equation relating y(t) to u(t) can be converted to a simpler, nondynamic, equation relating their Laplace transforms. With L[u(t)] = U(s), we have

s²Y(s) + a₁sY(s) + a₀Y(s) = b₁sU(s) + b₀U(s),   (10.2)

which can be simplified to

Y(s) = [(b₁s + b₀) / (s² + a₁s + a₀)] U(s).   (10.3)
Letting G(s) = (b₁s + b₀) / (s² + a₁s + a₀), we call G(s) the transfer function for the system. The transfer function provides a simple relationship between the Laplace transforms, but will not be useful unless we can recover information about the original signals. Here the second key property of Laplace transforms can be employed: when evaluated at s = iω (where i is the square root of negative one),³ the Laplace transform yields the Fourier transform, which, as
³ Note: most control engineers follow the electrical engineering notation j = √−1.
we have discussed above, is a measure of the frequency content of a signal at frequency ω. Thus the equation

Y(iω) = G(iω)U(iω)   (10.4)
relates the content of the input, u, and the output, y, at frequency ω. This is precisely the role of the frequency response, and in fact G(iω), the value of the transfer function at s = iω, is precisely the frequency response of the system (at frequency ω). To compare with a Bode plot, we recognize that G(iω) is a complex-valued function of ω: the gain at ω corresponds to the modulus of this complex number, while the phase shift corresponds to its argument.⁴
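The gain and phase at a given frequency can thus be read off directly from the complex number G(iω). As a minimal Python sketch, using illustrative coefficient values (not those of the G-protein model):

```python
import cmath

# Transfer function G(s) = (b1*s + b0) / (s**2 + a1*s + a0), as in Eq. (10.3);
# the coefficient values below are illustrative only.
b1, b0 = 1.0, 2.0
a1, a0 = 3.0, 1.0

def freq_response(omega):
    """Return (gain, phase_rad) of G evaluated at s = i*omega."""
    s = 1j * omega
    G = (b1 * s + b0) / (s**2 + a1 * s + a0)
    return abs(G), cmath.phase(G)

gain_dc, phase_dc = freq_response(1e-9)   # low-frequency (DC) gain -> b0/a0 = 2
gain_hi, phase_hi = freq_response(1e3)    # high frequency -> ~b1/omega (roll-off)
print(gain_dc)
```

Sweeping omega over a logarithmic grid and plotting gain and phase would reproduce a Bode plot for this model.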
3.2. Applying the procedure to the G-protein pathway

For this illustrative example, we have generated simulated “data” from a stochastic version of the G-protein pathway model from Yi et al. (2003) (details in Appendix). Our system identification procedure seeks to recover aspects of the model behavior directly from the simulated data. In the following sections, we will present our implementation of the method shown schematically in Fig. 10.9.

3.2.1. Preliminary investigations
As a first step, it is often worth performing preliminary investigations using input profiles that are standard in the lab, and therefore relatively easy to produce. These investigations represent a simple but informative “first pass,” allowing us to focus on the appropriate amplitude and frequency ranges of the input signal before beginning frequency response analysis, where the necessary inputs are more challenging to produce experimentally. We begin our analysis of the system by determining the dose–response behavior for our chosen input and output. Figure 10.10 shows the results of exposing the system to a range of input values and determining the corresponding steady-state output value. As in Section 2, the input is the concentration of the alpha factor ligand and the output is the cellular abundance of the species Gα-GTP. From this graph, we can identify a roughly linear range of behavior between saturation at low and high input values. We chose to center our linear analysis around the input level 1 nM, as indicated in the figure. The corresponding nominal output value is a concentration of Gα-GTP of about 450 molecules per cell. Outside of the surrounding linear range, nonlinear effects become appreciable; these effects are also exposed in the frequency response analysis detailed below. We will

⁴ For a complex number x = a + bi, the modulus is |x| = √(a² + b²), while the argument is arg(x) = arctan(b / a). The argument of a complex number is sometimes referred to as its phase.
[Figure 10.9 flowchart: Preliminary investigations → Probe system with oscillatory input and collect response data → Preprocess response data → Fourier filter response data at input frequency to determine system gain and phase shift (dashed box: repeat for various input frequencies) → Create experimental Bode plot → Choose linear transfer function model structure → Fit transfer function to Bode plot → Good fit? (if no, choose a different model structure) → Determine the form of a nonlinear rectifier (if necessary) → Model validation → Model ok? (if no, revisit the experimental design).]

Figure 10.9 Schematic flow chart of the sequence of steps for system identification via frequency response. The dashed box indicates steps that should be repeated over a range of frequencies in order to capture the principal features of the Bode plot. The feedback loops on the left side illustrate the iterative nature of this method. If the estimated model is not satisfactory, one should first look to different (higher order) transfer model structures. If a satisfactory model still cannot be found, then the overall experimental design may need to be modified in order to capture more detailed or more accurate information.
[Figure 10.10: dose–response curve. x-axis: alpha factor (nM), 10⁻³ to 10³; y-axis: abundance of active Gα-GTP (molecules per cell), 0 to 900; the 1 nM nominal input is marked.]

Figure 10.10 Simulated G-protein pathway dose–response curve. To avoid regimes of saturation, we use 1 nM as the nominal dosage level about which to estimate a linear transfer function model.
discuss the linearity of the outputs after we have examined the system’s varying-frequency behavior. We then begin an exploration of the dynamic behavior of the system by exposing it to an input rectangular “pulse” (i.e., a step up followed by a step down). Using the knowledge gained from the dose–response analysis, we center our input values about the chosen nominal input. We observe an exponential rise when the input increases, and then exponential relaxation when the input is removed (Fig. 10.11). This exponential behavior is typical of biochemical and genetic networks. Some networks might display significant delay in the output response (indicating a high relative degree, see Section 2.2), or even damped oscillatory “ringing” rather than a monotonic rise and fall. (Such ringing is indicative of a strongly resonant frequency, as in Fig. 10.6, and would suggest that careful attention be paid to the frequency response near the frequency of the ringing.) In the case of exponential rise and fall, we are able to extract information about the timescale on which the system acts from the rise time (how long the system takes to reach its steady-state value after the step input is applied) and the decay time (how long the system takes to return to its initial level once the step input jumps back down to zero). In Fig. 10.11, the rise time shows that the pathway can respond to changes in the input signal on timescales on the order of tens of seconds, while the decay time is on the order of hundreds of seconds. Thus, to probe the system effectively, we must use oscillatory signals with periods that span this range, and ideally
[Figure 10.11: top panel, alpha factor (nM) versus time (s); bottom panel, active Gα-GTP (molecules per cell) versus time (s); time axis 0 to 4000 s.]

Figure 10.11 A rectangular pulse input (top plot) and the stochastically simulated output response (bottom plot, sampled points) for the G-protein pathway. The rise and fall times in the response provide a general idea of the system’s timescale.
extend to significantly shorter and longer timescales as well. In the analysis to follow, we will use input signals with periods between 15 and 32,000 s (frequencies of 3.125 × 10⁻⁵ to 0.066 Hz). Of course, in a real implementation, experimental limitations will likely provide constraints on the range of timescales that can be effectively probed.

3.2.2. Experimental design, data preprocessing, and the experimental oscillatory response
We exposed the G-protein pathway model to sinusoidal oscillatory input signals over a range of frequencies (3.125 × 10⁻⁵ to 0.066 Hz), and observed the output response function at each frequency. In experimental implementations, square waves produced by rapid switching of computer-controlled valves are often used as inputs (Hersen et al., 2008; McClean
et al., 2009; Mettetal et al., 2008), though sinusoids have also been implemented (Bennett et al., 2008; Block et al., 1983; Shimizu et al., 2010). Using sinusoidal inputs allows us to mirror the analysis in Section 2. We discuss the case of general oscillatory inputs in Section 3.2.8. The choice of input amplitude can depend on a number of factors. The signal-to-noise ratio (SNR) of observations improves with increased amplitude, but increased amplitudes will also more strongly excite nonlinear effects; the G-protein model system will exhibit linear behavior only over the range of input values indicated in Fig. 10.10. The experimental implementation may also constrain the range of input amplitudes that can be achieved. Although the frequency response technique discussed here is used to estimate a linear model, it may not be possible to avoid exciting nonlinearities at experimentally realizable input amplitudes; such nonlinearities may be addressed by incorporating a nonlinear rectifier into the model, as presented in Section 3.2.6. We chose to probe our model system with sinusoids of amplitude half the nominal input value (i.e., alpha factor concentrations oscillating between 0.5 and 1.5 nM). When generating a time-series from the output response, note that the sampling frequency should be larger than twice the probing frequency, thus satisfying the Nyquist limit (Haykin and Van Veen, 2005) in order to avoid aliasing effects.⁵ Noise and nonlinearities will introduce content in the response at frequencies other than the driving frequency, but as long as the SNR is adequate, any aliasing resulting from higher frequency components should be negligible.
We sampled the output response of our model system with 100 time points over the period of the input signal for the lowest frequency inputs (periods of 500 to 32,000 s), 50 time points over the input period for midrange frequencies (62.5 to 250 s), and 10 time points over the input period for high frequencies (15.625 to 31.25 s). A few more comments on data preprocessing are called for. As discussed in Section 2, for frequency response analysis, it is important to consider only the steady-state portion of the response, and not its initial transient. Once steady-state behavior is attained, it is beneficial to sample from multiple periods, to improve the SNR (Lipan and Wong, 2005); in essence, each period acts as a replicate. In our example, we waited for the system to reach the desired steady-state behavior (where the nature of the output oscillations is consistent and well established) and then sampled time-series whose length ranged from a single complete period for the lowest frequency responses (reflecting the difficulties of running long time-series experiments) to 10 complete periods for the highest frequencies. For convenience, the start time of our sampled steady-state data corresponds to the beginning of a sinusoidal cycle in the input signal.

⁵ Aliasing occurs when there are not enough data points during a period of a signal to faithfully represent that signal’s shape. For example, consider sin(t) sampled at t = 0, π, 2π, . . ., leading to the mistaken observation of a constant signal.
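The aliasing pitfall is easy to reproduce numerically. This short Python sketch samples sin(t) at the Nyquist-violating spacing t = 0, π, 2π, . . ., and at a denser spacing for comparison:

```python
import math

# Sampling sin(t) exactly at t = k*pi hits every zero crossing, so the
# sampled series looks like a constant zero signal (aliasing).
aliased = [math.sin(k * math.pi) for k in range(10)]

# Sampling well above the Nyquist rate preserves the oscillation.
dense = [math.sin(k * math.pi / 10) for k in range(20)]

print(max(abs(v) for v in aliased))  # ≈ 0: the oscillation is invisible
print(max(abs(v) for v in dense))    # ≈ 1: the oscillation is captured
```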
[Figure 10.12: twelve pairs of plots, one pair per input period (32,000; 16,000; 8000; 4000; 2000; 1000; 500; 250; 125; 62.5; 31.25; and 15.625 s). Top plots: alpha factor (nM) versus time (s); bottom plots: active Gα-GTP (molecules/cell) versus time (s).]

Figure 10.12 Sinusoidal inputs of different frequencies (top panel plots, smooth curves) and their corresponding stochastically simulated output responses (bottom panel plots, sampled points) for the G-protein pathway model.
Finally, if necessary, one should also remove any obvious trends from the output data (e.g., linear trends caused by experimental drift), as well as obvious outliers in the data that are likely to arise from momentary instrument noise rather than the actual dynamics of the system. Figure 10.12 shows the results of sampling the output from our model system.

3.2.3. Fourier filtering and the experimental frequency response
Unlike the idealized model in Section 2, the observed output responses here, like real experimental data, are not pure sinusoids, since they have been corrupted by nonlinearities and noise. We can isolate the contribution
that corresponds solely to the input (driving) frequency by determining the corresponding coefficient in the Fourier series expansion of the observed output response time-series. The Fourier coefficient, R̂(ω), of the observed output response, R(t), at the angular input frequency ω, is described by

R̂(ω) = (2 / nT) ∫₀^(nT) e^(−iωt) R(t) dt,   (10.5)
where T = 2π / ω is the period of the input signal. Here, t = 0 represents the beginning of our steady-state response data (recall that we had synced this to the beginning of a sinusoidal cycle in the input signal), and n denotes the number of full steady-state periods sampled. The result of this integration is a complex number, R̂(ω), whose modulus |R̂(ω)| is the output amplitude at frequency ω, and whose argument, arg R̂(ω), is 90 degrees below the output phase at frequency ω. Since R(t) is represented as a discrete time-series, Eq. (10.5) is evaluated by means of numerical integration. Numerical computing packages (such as MATLAB, Mathematica, and Maple) are equipped with algorithms for handling such a problem, and a simple strategy such as the midpoint method can be easily implemented in, for example, Python. For our G-protein example, we have used the MATLAB function trapz, which implements a trapezoidal numerical integration algorithm. The system gain and phase shift are then calculated as in Section 2, as the ratio of the output to input amplitude and the difference between the output and input phase, respectively. The individual data points in Fig. 10.13 show the result of applying this analysis to the G-protein pathway model. Note that this is in close agreement with the idealized results in Fig. 10.5.

3.2.4. Fitting the transfer function model
We now seek to go beyond the graphical representation provided by the Bode plots to construct a predictive model of the system’s input–output behavior. As discussed in Section 3.1, given the form of a rational function as in Eq. (10.3), we search for values for the coefficients aᵢ and bᵢ so that the modulus of G(iω) and the argument of G(iω) match the gain and phase shift observations at each frequency ω. Before any fitting can be attempted, we must specify the form of the transfer function, that is, the degree of the numerator and denominator polynomials in Eq. (10.3).
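The coefficient search just described can be sketched in Python. The example below is illustrative only: instead of a nonlinear fminsearch-style fit, it uses the classical linearization often credited to Levy, in which the residual b₁s + b₀ − G(s²+ a₁s + a₀) at each measured frequency is linear in the unknown coefficients and can be minimized by ordinary least squares. The "data" here are synthetic, noise-free frequency response samples from a known second-order system:

```python
def fit_second_order(freqs, G_data):
    """Fit G(s) = (b1*s + b0)/(s^2 + a1*s + a0) to complex frequency response
    samples by linearized least squares: minimize
    sum_k |b1*s_k + b0 - G_k*(s_k^2 + a1*s_k + a0)|^2 over (b1, b0, a1, a0)."""
    rows, rhs = [], []
    for w, G in zip(freqs, G_data):
        s = 1j * w
        # complex equation b1*s + b0 - G*a1*s - G*a0 = G*s^2, split into re/im
        coeffs = [s, 1.0, -G * s, -G]
        target = G * s * s
        rows.append([c.real for c in coeffs])
        rhs.append(target.real)
        rows.append([c.imag for c in coeffs])
        rhs.append(target.imag)
    # normal equations A^T A x = A^T y, solved by Gaussian elimination
    n = 4
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, rhs))] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x  # [b1, b0, a1, a0]

# Synthetic "observed" response from a known system: b1=1, b0=2, a1=3, a0=5.
freqs = [0.01 * 1.5 ** k for k in range(20)]
G_data = [(1j * w + 2) / ((1j * w) ** 2 + 3 * (1j * w) + 5) for w in freqs]
b1, b0, a1, a0 = fit_second_order(freqs, G_data)
```

With noisy data, the linearized fit can serve as a starting point for a nonlinear optimizer of the kind used in the chapter.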
A transfer function model of the linear behavior of an autonomous biochemical or genetic network will take the form of a rational function in which the order of the numerator polynomial is less than the order of the
[Figure 10.13: Bode plot. Top panel: gain (log scale, −1 to 3) versus frequency (Hz, 10⁻⁶ to 10⁰); bottom panel: phase (deg, −180 to 45) versus frequency (Hz, 10⁻⁶ to 10⁰).]

Figure 10.13 Bode plot for the G-protein pathway determined by Fourier filtering the stochastically simulated sinusoidal response data (discrete data points) plotted alongside the fitted curve generated by the estimated linear model (solid curve).
denominator polynomial. The difference in the orders of these two polynomials is called the relative degree of the model, and corresponds, as noted earlier, to the slope of the roll-off on the gain plot. The order of the denominator polynomial is referred to as the order of the system; it represents, in some sense, a measure of the complexity of the system, and corresponds roughly to the number of corners or “kinks” in the Bode plots. (Referring back to the filters in Fig. 10.8, the two low-pass filters are first order systems—one corner—while the band-pass filter is a second order system—two corners.) It is often the case that a higher order system is well approximated by a lower order model, so while the number of visible corners may not correspond directly to the order of the system, it gives a good indication of the minimal order required to provide a decent fit to the frequency response. The gain Bode plot for our G-protein example (Fig. 10.13) has a roll-off slope of approximately −2, indicating that we should choose a transfer function of relative degree 2. A simple transfer function of this sort has a numerator of degree 1 and a denominator of degree 3:
G(s) = (b₁s + b₀) / (s³ + a₂s² + a₁s + a₀).   (10.6)
The solid curve in Fig. 10.13 shows a fit of this model to the frequency response data, with parameter values of: a₂ = 2.0870 × 10⁻¹, a₁ = 2.7659 × 10⁻³, a₀ = 3.0898 × 10⁻⁶, b₁ = 1.0914, b₀ = 7.6371 × 10⁻⁴. (We used the MATLAB function fminsearch to perform the fit, with a least-squares error criterion. Other nonlinear optimization routines, surveyed, e.g., by Moles et al. (2003), could also be used.) We also fitted a fourth order model, but found little improvement in the fit. (Note that the model used to generate our “data” is in fact fourth order. Our analysis shows that a third order model provides a good approximation to the system.)

3.2.5. Simulating model trajectories
The system response (beginning at the nominal steady state) can be simulated by evaluating an integral involving the system’s unit impulse response and the input function. The unit impulse response, g(t), is the system’s transient output in response to a unit impulse input at t = 0, and can be found by applying the inverse Fourier transform to the frequency response, G(iω). Specifically, the response y(t) can be calculated as the convolution of g(t) with an input function u(t), defined by

y(t) = ∫₀^t u(τ) g(t − τ) dτ.   (10.7)
In practice, g(t) can be evaluated numerically from G(iω) via the Inverse Fast Fourier Transform (IFFT). The IFFT routine, along with the Fast Fourier Transform (FFT), are standard components of numerical packages like MATLAB (as are routines for evaluating convolution integrals). Alternatively, one can apply an inverse Laplace transform to the transfer function (by replacing the algebraic s variable with the time-derivative operator d/dt; compare Eqs. (10.2) and (10.3) to Eq. (10.1)). Output trajectories can then be determined by numerically solving the resulting differential equation. The resulting trajectories correspond to the linear model and so describe displacements from the nominal steady state. The nominal input and output values must be accounted for when comparing output model trajectories to measured data. Figure 10.14 shows the sampled oscillatory outputs alongside the oscillatory behavior predicted by our estimated model. The plots are in agreement, implying that our linear transfer function adequately represents the original oscillatory input–output data set.
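As a sketch of the differential-equation route, the fitted third-order transfer function can be rewritten in controllable canonical (state-space) form and integrated numerically. The Python code below uses the parameter values quoted above with a hand-rolled fourth-order Runge–Kutta integrator; it is an illustrative reimplementation, not the chapter's MATLAB code:

```python
# Fitted transfer function G(s) = (b1*s + b0)/(s^3 + a2*s^2 + a1*s + a0),
# realized in controllable canonical form: x' = A x + B u, y = b0*x1 + b1*x2.
a2, a1, a0 = 2.0870e-1, 2.7659e-3, 3.0898e-6
b1, b0 = 1.0914, 7.6371e-4

def deriv(x, u):
    x1, x2, x3 = x
    return (x2, x3, u - a0 * x1 - a1 * x2 - a2 * x3)

def simulate(u_of_t, t_end, dt=0.5):
    """Integrate the state equations with classical RK4; y is the displacement
    of the output (Galpha-GTP, molecules/cell) from its nominal steady state."""
    x = (0.0, 0.0, 0.0)
    t = 0.0
    while t < t_end:
        k1 = deriv(x, u_of_t(t))
        k2 = deriv(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k1)),
                   u_of_t(t + 0.5 * dt))
        k3 = deriv(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k2)),
                   u_of_t(t + 0.5 * dt))
        k4 = deriv(tuple(xi + dt * ki for xi, ki in zip(x, k3)),
                   u_of_t(t + dt))
        x = tuple(xi + dt / 6.0 * (k1i + 2 * k2i + 2 * k3i + k4i)
                  for xi, k1i, k2i, k3i, k4i in zip(x, k1, k2, k3, k4))
        t += dt
    return b0 * x[0] + b1 * x[1]

# Unit step in the input (1 nM above nominal): the response should settle
# at the DC gain, b0/a0 ≈ 247 molecules/cell per nM.
y_final = simulate(lambda t: 1.0, t_end=20000.0)
```

A convolution with g(t), per Eq. (10.7), would give the same trajectories; the state-space route simply avoids computing g(t) explicitly.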
[Figure 10.14: twelve plots of active Gα-GTP (molecules/cell) versus time (s), one per input period (32,000; 16,000; 8000; 4000; 2000; 1000; 500; 250; 125; 62.5; 31.25; and 15.625 s).]

Figure 10.14 Sinusoidal response outputs from the stochastically simulated G-protein pathway for different input frequencies (sampled points) plotted alongside the output response generated by the estimated linear model (solid curves).
3.2.6. Applying a nonlinear rectifier
In many circumstances, stronger nonlinear effects will cause discrepancies between the linearly modeled and measured data sets. If such is the case, it may be possible to address such effects after the fact, by appending a nonlinear element to the linear transfer function model. The method we present here attempts to correct for nonlinear effects by passing the linear model output through a static nonlinear rectifier. This linear–nonlinear cascade structure is referred to as a Wiener model, and is shown in block diagram form in Fig. 10.15, where the rectifier is represented mathematically by the static function fNL.⁶ In order to determine a form for fNL, we construct a scatter diagram that plots the linear model outputs against observations. A best fit trendline will define the function fNL. If the scatter plot indicates perfect agreement between model prediction and data, then there is no need for a nonlinear element.

⁶ An alternative structure that passes the original input signal through a static nonlinear element before it reaches the linear block is referred to as a Hammerstein model.
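Determining fNL from the scatter diagram amounts to a one-dimensional curve fit. A minimal Python sketch, fitting a straight-line rectifier by ordinary least squares to hypothetical (linear model, observed) output pairs:

```python
def fit_linear_rectifier(y_lin, y_obs):
    """Least-squares fit of y_obs = slope * y_lin + offset, defining a
    straight-line rectifier f_NL (a higher order polynomial could be fit
    the same way if the scatter shows curvature)."""
    n = len(y_lin)
    mx = sum(y_lin) / n
    my = sum(y_obs) / n
    sxx = sum((x - mx) ** 2 for x in y_lin)
    sxy = sum((x - mx) * (y - my) for x, y in zip(y_lin, y_obs))
    slope = sxy / sxx
    offset = my - slope * mx
    return slope, offset

# Hypothetical scatter data: observations offset slightly from the model.
y_lin = [300.0, 400.0, 500.0, 600.0, 700.0]
y_obs = [x * 1.02 - 18.0 for x in y_lin]

slope, offset = fit_linear_rectifier(y_lin, y_obs)
f_NL = lambda y: slope * y + offset   # the fitted static rectifier
```

A slope near 1 and an offset near 0, as in the G-protein example below, indicates that no meaningful nonlinear correction is required.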
[Figure 10.15 block diagram: input u(t) → dynamic linear block → y_lin(t) → static nonlinear rectifier → output y(t) = fNL(y_lin(t)).]

Figure 10.15 Block diagram representing the structure of the Wiener model, which places a static nonlinear block in series following the dynamic linear block (the transfer function model in our case).
The reader may be concerned that the addition of this nonlinear element will alter the model’s frequency response, painstakingly determined from the data. This is not the case. While this nonlinear element will cause lower frequencies to mix into higher frequencies, the driving (lowest) frequency component of an oscillatory input signal will propagate through unchanged. Therefore, measurements at the driving frequency will not be affected (Westwick and Kearney, 2003). The scatter plot for our G-protein example is shown in Fig. 10.16. This plot is populated with data from the oscillatory response simulations. The linear trendline—with a slope nearly equal to 1 and a very small offset value of −18 molecules per cell—indicates that, over the range of inputs that were used, the estimated linear model sufficiently reproduces the true oscillatory output. Therefore, no nonlinear correction appears to be necessary in this case. In general, the nonlinearity fNL may be more complex than a linear, or even higher order polynomial, correction. For example, in the work of Mettetal et al. (2008), a piecewise-defined function was chosen to compensate for both an offset and the fact that the linear model was providing predictions of negative responses within the input range of interest.

3.2.7. Model validation
The final step in the system identification procedure is to compare outputs predicted by the estimated model to experimentally measured outputs for an independent set of data (that is, data that was not used for the purposes of fitting the model). Separating these data will ensure that the actual system has been modeled, rather than the specific output represented in the estimation data set. Comparisons can be done in both the time domain and the frequency domain. In Fig. 10.17, we compare the model and stochastically simulated responses of the G-protein pathway to a rectangular pulse input in the time domain.
Their good agreement supports the conclusion that our model accurately captures the general dynamics of the G-protein system. Note that it is also a common engineering practice to perform a variety of statistical tests (particularly residual analysis) on the prediction errors of the estimated model as part of the model validation procedure (Ljung, 1999).
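As an example of such a residual check, one simple test asks whether the prediction errors look like uncorrelated noise; if the residuals retain structure, the model has missed some dynamics. A sketch in Python, computing the mean and lag-1 autocorrelation of hypothetical residuals:

```python
def residual_stats(residuals):
    """Return (mean, lag-1 autocorrelation) of a residual series; values near
    zero are consistent with white (structureless) prediction errors."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    var = sum(c * c for c in centered) / n
    lag1 = sum(centered[k] * centered[k + 1] for k in range(n - 1)) / (n * var)
    return mean, lag1

# Hypothetical residuals: an alternating, zero-mean series; its strong
# negative lag-1 autocorrelation flags structure that white noise would lack.
residuals = [(-1.0) ** k for k in range(1000)]
mean, lag1 = residual_stats(residuals)
```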
[Figure 10.16: scatter plot of simulated experimental output versus linear model output (active Gα-GTP, molecules per cell), with trendline y = 1.0169x − 18.3626.]

Figure 10.16 Scatter plot for the linear model output versus the stochastically simulated output for the G-protein pathway. The plot is populated with data that was sampled during the sinusoidal response simulations. The trendline represents the nonlinear rectifier: in this case, a linear fit with a slope ≈ 1 and a very small offset of −18.36 molecules per cell indicates no obvious nonlinearities in the oscillatory response data.
3.2.8. Nonsinusoidal inputs
In some cases, the generation of purely sinusoidal inputs will not be feasible, and so another oscillatory waveform must be used to excite the system. The system identification procedure can still be followed in this case, with some minor modifications. To illustrate how nonsinusoidal inputs may be used, we have repeated the frequency response analysis of the G-protein pathway model using square wave inputs. The driving frequency of the square wave represents the input frequency. Sampled points from the resulting output are shown in Fig. 10.18. When using nonsinusoidal oscillatory inputs, the amplitude of the input contribution at the driving frequency is not equivalent to the amplitude of
[Figure 10.17: top panel, alpha factor (nM) versus time (s); bottom panel, active Gα-GTP (molecules per cell) versus time (s); time axis 0 to 4000 s.]

Figure 10.17 Model validation. A rectangular pulse input (top panel) and the corresponding output response (bottom panel) for the G-protein pathway generated by the estimated linear model (solid curve) and from stochastic simulation (sampled points). The linear model faithfully reproduces the output response.
the waveform itself. Since the input corresponds to a combination of sinusoids at different frequencies, Eq. (10.5) can be used to determine the amplitude of the contribution at the driving frequency. In the case of square waves, this calculation indicates that the amplitude of the overall wave must be scaled by a factor of 4/π in order to recover the contribution of the component at the driving frequency. Using the stochastically simulated square wave data, we performed a new fit to a transfer function model. The result is essentially the same as that obtained from the sinusoidal inputs; Fig. 10.18 shows the behavior predicted by the estimated transfer function model.
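The 4/π factor can be verified numerically by applying the integral of Eq. (10.5) to a unit-amplitude square wave. The Python sketch below uses trapezoidal integration over one period of an idealized, densely sampled square wave (an illustration only; with measured data the samples would come from the experiment):

```python
import cmath
import math

T = 100.0                 # square wave period in seconds (illustrative choice)
omega = 2 * math.pi / T   # driving (fundamental) angular frequency
N = 20000                 # number of trapezoid panels over one full period
dt = T / N

def square(t):
    """Unit-amplitude square wave: +1 for the first half-period, -1 for the second."""
    return 1.0 if (t % T) < T / 2 else -1.0

# Trapezoidal estimate of (2/T) * integral over one period of exp(-i*omega*t)*R(t),
# i.e., Eq. (10.5) with n = 1 and R(t) the square wave itself.
total = 0j
for k in range(N):
    t0, t1 = k * dt, (k + 1) * dt
    total += 0.5 * dt * (cmath.exp(-1j * omega * t0) * square(t0)
                         + cmath.exp(-1j * omega * t1) * square(t1))
fundamental_amplitude = abs(2 * total / T)
print(fundamental_amplitude)  # ≈ 4/pi
```

The component of the square wave at the driving frequency is thus 4/π times the amplitude of the wave itself, as stated above.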
3.3. Examples of experimental implementation

There is a long tradition of studying biological systems using the techniques of control theory (Bayliss, 1966; Iglesias and Ingalls, 2009; Ingalls et al., 2006; Khoo, 2000; Wiener, 1965). Techniques from system identification,
[Figure 10.18: four pairs of plots, one pair per square wave input period (16,000; 2000; 250; and 31.25 s). Top plots: alpha factor (nM) versus time (s); bottom plots: active Gα-GTP (molecules/cell) versus time (s).]

Figure 10.18 Square wave inputs of different driving frequencies (top plots for each period, smooth curves) and their corresponding stochastically simulated output responses (bottom plots for each period, sampled points) as well as the predicted responses from the linear model (bottom plots for each period, smooth curves) for the G-protein pathway.
including frequency response methods, have long been experimentally applied in physiology (Westwick and Kearney, 2003) and in bacterial chemotaxis (Block et al., 1982, 1983; Shimizu et al., 2010), and have
more recently been applied to signaling cascades in yeast (Bennett et al., 2008; Hersen et al., 2008; Mettetal et al., 2008).

3.3.1. Bacterial chemotaxis
Bacteria such as Escherichia coli respond to gradients of attractant or repellent in their environment by altering the proportion of time they spend randomly tumbling versus swimming in a straight line; this shift is implemented by changing the direction of rotation of their flagella, and has the effect of producing a random walk that is biased in the direction of increasing concentrations of nutrients. Elegant work from Howard Berg’s group has examined the frequency response characteristics of bacterial chemotaxis for decades, initially at the whole-cell level (Block et al., 1982, 1983; Segall et al., 1986) and later targeting specific elements of the intracellular response system (Shimizu et al., 2010; Sourjik et al., 2007); we will touch here on only a small portion of the large body of work on this subject. In whole-cell investigations (Block et al., 1982, 1983; Segall et al., 1986), the experimental protocol consisted of affixing cells to a cover slip by a single flagellum, using a procedure described by Berg and Tedesco (1975). When pinned in this manner, the cells’ flagellar motors caused the entire cell to spin either clockwise (CW; corresponding to the random “tumble” mode of motion in a free-moving cell) or counterclockwise (CCW; corresponding to the linear “run” motion seen in free-moving cells) in response to the ambient chemical environment. This motion could be observed and classified (with particular attention to the timing of transitions between CW and CCW rotation) through phase-contrast microscopy.
The intracellular measurement experiments (Shimizu et al., 2010) used Förster resonance energy transfer (FRET; also known as fluorescence resonance energy transfer) to generate a fluorescence signal proportional to the physical proximity of the labeled proteins; by attaching fluorescent proteins to key players in the chemotaxis signaling system, FRET studies provided an output measurement corresponding to the level of activity of this system (Sourjik et al., 2007). Impulse inputs. In the work of Block et al. (1982), the environment was changed by introducing very brief changes in the concentration of either an attractant or a repellent, through a micropipette: the pulse of chemical stimulation passed across the population of tethered cells as a diffusive wave, providing each cell with a very brief burst of stimulation. This input approximates a mathematical “impulse”: a sharp “spike” of input (the mathematical ideal is infinitely tall but infinitely short; real impulses, such as the chemical pulses used in the work described here, serve as an approximation to this mathematical abstraction). Characterizing the response to an impulse input illustrates an alternative approach to the frequency response method we have discussed above. In engineering, impulse and step responses (Åström and Murray, 2008; Cluett and Wang, 1991) are used as an alternative to explicitly sweeping the
frequency of a series of input signals. The basis for this approach is that these signals, by virtue of their sharpness, excite a wide range of frequencies simultaneously: a sharp signal, when broken down into a sum of signals of different frequencies, requires many frequencies in order to represent it accurately. This approach succeeded in Block et al. (1982) in yielding frequency response plots without the experimental complications of generating inputs of varying frequencies; the plots showed a band-pass filtering behavior in the bacterial response to the chemical attractants and repellents, with a peak response near 0.25 Hz. More generally, it is not clear that impulse or step responses will always be sufficient to characterize other biochemical systems, especially in terms of gene expression and signal transduction. Using specific input frequencies improves the experimental SNR, allowing one to distinguish responses from noise, and to use the persistent, periodic input to average the response over multiple periods, gathering more data and thus improving SNR. Lipan and Wong (2005) argued that periodic inputs offer significant advantages over step inputs in particular: in addition to noting the SNR improvements, they carried out a theoretical analysis indicating that to obtain a given Fourier component with a desired level of confidence, significantly more experimental replicates of step inputs would be required than of periodic inputs. Further, the theoretical analysis presented by Saeki and Saito (2002) suggested that step inputs are limited in their ability to generate frequency response information, even in systems less complex than biological networks. The successful application of impulse responses by Block et al. (1982) to derive a full frequency response plot is impressive, and demonstrates that this approach is at least sometimes feasible in a biochemical response context. The “digital” nature of the output signal employed by Block et al.
(1982) (only one of two outcomes is observed for any given cell, CW or CCW rotation, rather than a continuum of possible values) may have helped reduce potential problems with the SNR, since noise inside the system would be seen in the output only if it was large enough to flip the response from one rotational state to another. Exponentiated sine inputs. In the work of Block et al. (1983), computer-controlled pumps were used to generate varying-frequency input signals, similar to the sinusoidal inputs discussed in Sections 2 and 3.2, but in this case implemented as exponential functions of the form e^(sin ωt), superimposing a periodic input onto an exponential ramp. The same work (Block et al., 1983) showed that an exponential ramp input caused the switching rate of the flagellar motor to reach and remain at a steady state over time; this motivated the choice of an exponentiated sine wave as the frequency-varying input for this system. Using such signals, they obtained a Bode plot spanning frequencies from 10^-3 to 0.05 Hz, showing high-pass filtering behavior with a slope on the log–log plot of +2.
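The claim that a sharp input excites a broad band of frequencies at once, so that a single impulse response can yield a full frequency response plot, can be sketched numerically. The fragment below uses a toy first-order low-pass system (not the chemotaxis pathway); all names and numbers are illustrative assumptions.

```python
import numpy as np

# Sketch: a sharp (impulse-like) input excites all frequencies at once, so
# the Fourier transform of one sampled impulse response approximates the
# full frequency response. Toy first-order low-pass system with
# G(s) = 1/(tau*s + 1); parameters chosen for illustration only.
tau = 2.0                              # time constant (s)
dt = 0.01                              # sampling interval (s)
t = np.arange(0.0, 200.0, dt)
h = (1.0 / tau) * np.exp(-t / tau)     # impulse response of G(s)

# FFT of the impulse response, scaled by dt, approximates G(i*omega)
G_est = np.fft.rfft(h) * dt
freqs = np.fft.rfftfreq(len(t), d=dt)  # Hz
omega = 2.0 * np.pi * freqs            # rad/s

# Compare against the analytic gain |G(i*omega)| = 1/sqrt(1 + (tau*omega)^2)
gain_est = np.abs(G_est)
gain_true = 1.0 / np.sqrt(1.0 + (tau * omega) ** 2)
```

A single simulated "experiment" (the impulse response) thus yields the gain at every frequency bin simultaneously, which is the practical appeal of the impulse approach.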
Frequency Response Methods in Biology
Another study (Shimizu et al., 2010) again employed input signals of the exponentiated sine form, and used FRET (Sourjik et al., 2007) to monitor the intracellular interactions of proteins CheY-YFP and CheZ-CFP. Phosphorylated CheY binds to the flagellar motor to affect its rotational behavior, while CheZ dephosphorylates CheY; the FRET signal, by reporting how many CheY–CheZ pairs were close to one another, thus provided a measurement of the output of the intracellular signaling system underlying the bacterial chemotactic response. The Bode plot constructed by sweeping the frequencies of the exponentiated sine inputs showed high-pass filtering behavior, qualitatively similar to that seen by Block et al. (1983) and well predicted by a mathematical model (Shimizu et al., 2010; Tu et al., 2008). A Bode plot relating the time derivative of the input to the output showed low-pass filtering behavior; the bacterial chemotaxis system appears to function as a high-pass filter of the signal itself, and a low-pass filter of its derivative, as previously predicted theoretically (Tu et al., 2008). 3.3.2. Signaling cascades in yeast Three papers using frequency response methods in yeast appeared in 2008. Two effectively simultaneous publications dealt with the high-osmolarity glycerol (HOG) osmo-adaptation pathway (Hersen et al., 2008; Mettetal et al., 2008), while the third (Bennett et al., 2008) addressed the galactose utilization response. Experimental setup. All three papers made use of microfluidic chambers to contain the cells and expose them to varying inputs: changing concentrations of extracellular sorbitol (Hersen et al., 2008), sodium chloride (Mettetal et al., 2008), or glucose (Bennett et al., 2008). In the studies of Hersen et al. (2008) and Mettetal et al. 
(2008), the flow was driven by pumps, and varying concentrations were obtained by computer-controlled switching of valves connecting the microfluidic chamber to different fluid reservoirs; the input signals generated were approximately square waves. In the work of Bennett et al. (2008), the flow was driven by gravity-induced pressure differences and the input concentration was regulated by computer-controlled actuators that varied the height of two fluid reservoirs (Bennett and Hasty, 2009; Bennett et al., 2008); the input signals used were sinusoidal in shape. In all three cases, the output signals were observed through fluorescence microscopy: fluorescence of Hog1-YFP, using colocalization with the strictly nuclear protein Nrd1-RFP to isolate and measure only the Hog1-YFP present in the nucleus (Mettetal et al., 2008); fluorescence of Hog1-GFP, using colocalization with Htb2-mCherry to isolate the nuclear Hog1-GFP (Hersen et al., 2008); and whole-cell fluorescence of Gal1-yECFP fusion proteins (Bennett et al., 2008). Time-lapse fluorescence images, with image intensities quantified through various image analysis routines, generated time-series for the output signals in each case; combining these with the known input signal time-series yielded a
complete input–output response profile for each frequency, and this procedure was repeated across a range of input frequencies.
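The per-frequency procedure just described, drive the system periodically, record the output, and extract one point of the response profile per input frequency, can be sketched as follows. This is a generic illustration on a simulated first-order system with a noisy readout, not a reconstruction of any of the cited experiments; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 2.0   # time constant of the toy system (arbitrary units)

def gain_at(omega, dt=0.005, n_periods=20, noise_sd=0.1):
    """Drive dx/dt = (u - x)/tau with u = sin(omega*t), add measurement
    noise, and estimate the gain at omega by projecting onto sin/cos."""
    T = 2.0 * np.pi / omega
    t = np.arange(0.0, n_periods * T, dt)
    u = np.sin(omega * t)
    x = np.zeros_like(t)
    for i in range(1, len(t)):                 # forward Euler integration
        x[i] = x[i - 1] + dt * (u[i - 1] - x[i - 1]) / tau
    y = x + rng.normal(0.0, noise_sd, len(t))  # noisy readout
    keep = t >= 5 * T                          # discard the initial transient
    a = 2.0 * np.mean(y[keep] * np.sin(omega * t[keep]))   # in-phase part
    b = 2.0 * np.mean(y[keep] * np.cos(omega * t[keep]))   # quadrature part
    return np.hypot(a, b)

omegas = np.array([0.2, 0.5, 1.0, 2.0])        # input frequencies (rad/unit)
gains = np.array([gain_at(w) for w in omegas])
gain_true = 1.0 / np.sqrt(1.0 + (tau * omegas) ** 2)   # analytic |G(i*omega)|
```

Averaging the projection over many periods of the persistent input is what suppresses the measurement noise; the four (frequency, gain) pairs are exactly the points that would populate a Bode gain plot.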
[Figure 10.19 appears here: (A) Bode gain A(ω) versus ω (rad/min) on log–log axes, for wild-type and low-Pbs2 strains; (B) nuclear Hog1-GFP fluctuations (a.u.) versus frequency (Hz) on a linear vertical axis, with a marked frequency of ω = 0.0046 Hz. See caption below.]
3.3.2.1. Osmo-adaptation Yeast sense and respond to osmotic pressure through a multistage signaling cascade that terminates in glycerol production, activated when the protein Hog1 is phosphorylated and enters the nucleus. Mettetal et al. (2008) and Hersen et al. (2008) studied the frequency response of this system. The Bode gain plots for the HOG pathway’s response to changing osmotic pressure in the two papers are shown in Fig. 10.19A and B, respectively. At first sight, the plots appear to present different conclusions about the same pathway, showing band-pass filtering behavior in one case, and low-pass filtering in the other. Note, however, the different frequency ranges considered in the studies: Mettetal et al. (2008) considered periods ranging from 2 to 128 min, corresponding to frequencies of 1.3 x 10^-4 to 8.3 x 10^-3 Hz (shown in radians per minute in Fig. 10.19A: 0.05–3.14 rad/min), while Hersen et al. (2008) considered higher frequencies, ranging from 10^-3 to 1 Hz; the two ranges overlap only slightly. The band-pass response seen in the work of Mettetal et al. (2008) (panel A) does not appear in the Hersen et al. (2008) data (panel B) because
Figure 10.19 The gain portions of two experimental Bode plots for the HOG osmoadaptation system in yeast. (A) Experimental results from Mettetal et al. (2008). The input signals were square waves of extracellular sodium chloride concentrations at varying frequencies, and the output was the fluorescence intensity of Hog1-YFP in the nucleus. “Low Pbs2” refers to a yeast strain that expresses lower (than wild type) levels of the Pbs2 protein (an element in the signaling cascade). [From Mettetal et al. (2008). Reprinted with permission from AAAS.] (B) Experimental results from Hersen et al. (2008). The input signals were square waves of extracellular sorbitol at varying frequencies, and the output was the fluorescence intensity of Hog1-GFP in the nucleus. [Copyright 2008, National Academy of Sciences (USA), reprinted with permission.] Note that the results in panel (B) are shown on a linear vertical scale, rather than the log scale used in panel (A) and in the Bode gain plots elsewhere in this review (Figs. 10.5, 10.8, and 10.13). The asymptotic approach to zero is a curve on the linear scale, rather than the straight line seen on a log scale.
the latter considers frequencies higher than those at which the initial increase in the amplitude response was observed; at the same time, the significantly higher frequency regime examined by Hersen et al. (2008) provides valuable information, showing very clearly that the amplitude response does continue to decline with frequency in a classic low-pass filter pattern. Combined, these two studies provide a complete and consistent picture of the frequency behavior of the HOG pathway: an initial rise in the amplitude response, to a frequency of approximately 10^-3 Hz, followed by a steady decline for higher frequencies. The slope of the decline of the amplitude in the log–log plot shown in Fig. 10.19A suggests a relative degree of 1. The authors of Mettetal et al. (2008) chose a transfer function model with this relative degree. The parameters of this model were fit using experimental data, then the same parameters were used to predict the system’s response to a step input, a form of input not used in the fitting. The result was a good match between the model predictions and the experimentally observed response (Mettetal et al., 2008), validating the model and providing a nice illustration of the predictive power of transfer function models. 3.3.2.2. Carbon source utilization When its preferred carbon source, glucose, is not available, yeast can activate genes to utilize galactose instead. Bennett et al. (2008) examined the frequency response of this system. The resulting frequency response showed low-pass behavior. The authors went on to formulate a detailed biochemical kinetic model of the system, and attempted to use it to reproduce the experimentally observed frequency response. They found, however, that their initial model did not provide a good match to the experimental frequency response behavior.
Modifying the model to allow the degradation rates of two key components (the mRNA of genes GAL1 and GAL3) to be a function of glucose concentration yielded a much more accurate model, and subsequent experiments confirmed that these mRNA half-lives were in fact glucose-dependent, revealing a level of posttranscriptional regulation that had not previously been known. The ability to compare with a full set of frequency response experiments, rather than, for example, steady-state experiments at a series of fixed glucose concentrations, provided the important ability to isolate which portion of the model needed to be adjusted, and in this case yielded an important new insight into a well-studied biological system.
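The transfer-function workflow described for the osmo-adaptation study, fit a low-order transfer function of relative degree 1 to frequency response data and then predict the response to a step input from the same parameters, can be sketched generically. The transfer function form, parameter values, and "data" below are all invented for illustration; they are not the published HOG-pathway fit.

```python
import numpy as np
from scipy import optimize, signal

# Sketch: (1) posit a transfer function of relative degree 1, (2) fit its
# parameters to Bode gain measurements, (3) predict the step response
# (an input type not used in the fit) from the SAME parameters.
def gain(w, k, z, p1, p2):
    """|G(iw)| for G(s) = k*(s + z) / ((s + p1)*(s + p2)), relative degree 1."""
    s = 1j * w
    return np.abs(k * (s + z) / ((s + p1) * (s + p2)))

# Synthetic "experimental" gain data: known parameters plus 3% noise
rng = np.random.default_rng(1)
true_params = (2.0, 0.05, 0.3, 1.5)
w = np.logspace(-2, 1, 30)                       # rad/min
data = gain(w, *true_params) * (1 + rng.normal(0, 0.03, w.size))

popt, _ = optimize.curve_fit(gain, w, data, p0=[1.0, 0.1, 0.5, 1.0],
                             bounds=(0, np.inf))
k, z, p1, p2 = popt

# Step-response prediction from the fitted transfer function; y_step should
# settle at the DC gain G(0) = k*z/(p1*p2)
sys = signal.TransferFunction([k, k * z], np.polymul([1.0, p1], [1.0, p2]))
t_step, y_step = signal.step(sys, T=np.linspace(0, 60, 600))
```

In the experimental setting, `data` would be the measured gains and the predicted `y_step` would be compared against a separately measured step response, which is exactly the validation performed by Mettetal et al. (2008).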
4. Conclusion The engineering concepts described in this review can serve the systems and synthetic biology communities in a number of ways.
Some experimental projects may have a frequency response input– output characterization as their end goal. Using the transfer function models described above, such projects can result in the ability to predict a system’s response to arbitrary inputs. That capability is often valuable in itself, and is of particular interest in the engineering-oriented field of synthetic biology (Andrianantoandro et al., 2006; Hasty et al., 2002; Kærn and Weiss, 2006; Kærn et al., 2003; Khalil and Collins, 2010; Purnick and Weiss, 2009), where the design and construction of novel cellular regulatory systems can be guided and informed by the use of frequency response methods. For systems biology purposes, obtaining a frequency response profile using the methods described here will typically be part of a larger effort to understand biochemical and genetic systems at a level beyond their input– output profile. These methods have several things to offer this endeavor. Most basically, the frequency response profile provides insight into the timescales that are most relevant to the system being studied: high or low gains at particular frequency ranges will indicate which timescales the system responds to most strongly, and whether such responses are tuned to any particular frequencies. This information places constraints on the biochemical or genetic mechanisms that underlie the observed response, and thus can guide experimental investigations. When comparing computational or theoretical models to experimental results, a full frequency response profile provides a richer basis for comparison than does steady-state or step-response data. We noted above, for example, that Bennett et al. (2008) used their frequency response data to pin down the source of an experimental mismatch in their modeling work. 
This is not simply a question of there being more data available to compare: the data is organized in a qualitatively different manner when experimental frequency response information has been obtained, and it allows access to a set of analytical tools that would not apply in the absence of such information. In the long term, researchers may aim to use “black box” models (where inputs elicit outputs by an unknown internal process) provided by frequency response methods as a step toward “white box” models (where the internal mechanisms of the system are known; biochemical kinetic models (Conrad and Tyson, 2006; Hasty et al., 2001) are an example), possibly through intermediate “gray box” models (where the black boxes are used to partially populate a model with limited internal details) (Lipan, 2008). Such an approach was adopted by Mettetal et al. (2008), using the structure of the frequency response to guide the construction of a partially filled-in kinetic model of their system. Experimentally, the generation of stable, repeatable, rapidly varying oscillatory inputs remains a challenge, but the barrier to entry is dropping quickly as microfluidic technologies become less expensive and more widely available. To date, identification of frequency response from
experimental data in biochemical and genetic systems has been carried out using extracellular signals that are conveyed to the cells by changing the concentration of some chemical in the surrounding medium (Bennett et al., 2008; Hersen et al., 2008; Mettetal et al., 2008; Shimizu et al., 2010). Inducing oscillations in internal states that are not coupled to an extracellular signal will be a further challenge, though as Lipan (2008) notes, one path might be to use intracellular regulatory elements that can be induced by the application of photons (Shimizu-Sato et al., 2002), perhaps coupled to other synthetic tunable systems (Grilly et al., 2007). Our ability to apply oscillatory inputs to cells will only grow more sophisticated over time, increasing the number of systems to which frequency response methods can be profitably applied. Finally, we address again an apparently severe limitation of this approach: its applicability only to linear systems (ones for which the response to a sum of two inputs is the sum of the responses to each individual input). Every real physical system is nonlinear, displaying at least saturation behavior, and biological systems are particularly subject to nonlinearities. Engineers must often deal with systems that are highly nonlinear, but this has not prevented them from designing much of the modern world using techniques founded on linear approximations. They make use of the fact that systems are always approximately linear in the vicinity of their steady states (where self-regulating systems spend the bulk of their time); they use nonlinear correction blocks of the type described in Section 3.2.5; and they use the information derived from linear approaches as a starting point for understanding fully nonlinear behavior. In fact, linear approximations are at the core of the study of nonlinear systems (Glendinning, 1994): the first thing generally done when analyzing a nonlinear system is to consider its locally linear behavior. 
Biologists are faced with the study of the most complex nonlinear dynamic systems ever examined, and they too can benefit from the insight and foundation for future study provided by linear analysis. There can be no doubt that the methods of dynamic analysis we review here will become a valuable part of the biological toolbox.
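The "first thing generally done" in the paragraph above, computing the locally linear behavior of a nonlinear system at its steady state, can be sketched numerically. The two-variable negative-feedback model below is a toy invented for illustration; the finite-difference Jacobian approach is generic.

```python
import numpy as np
from scipy import optimize

# Sketch of local linearization at a steady state, on a toy two-variable
# negative-feedback model (invented for illustration):
#   dm/dt = 1/(1 + p**2) - m      (mRNA, repressed by its protein)
#   dp/dt = m - 0.5*p             (protein)
def f(x):
    m, p = x
    return np.array([1.0 / (1.0 + p ** 2) - m, m - 0.5 * p])

x_ss = optimize.fsolve(f, [0.5, 0.5])      # steady state: f(x_ss) = 0

def jacobian(fun, x, h=1e-6):
    """Central finite-difference Jacobian: A[i, j] = d fun_i / d x_j."""
    n = len(x)
    A = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        A[:, j] = (fun(x + e) - fun(x - e)) / (2.0 * h)
    return A

A = jacobian(f, x_ss)
# Eigenvalues with negative real parts indicate a locally stable steady
# state; near x_ss the nonlinear system behaves like dx/dt = A*x.
eigvals = np.linalg.eigvals(A)
```

The linear system defined by `A` is exactly the object to which the frequency response machinery of this review applies, which is why linearization around a steady state is the standard entry point for nonlinear analysis.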
Appendix We use a model of the G-protein pathway as developed in Yi et al. (2003) as a running example throughout the above discussion. The model has four independent states: free receptor, R; ligand-bound receptor, RL; inactive heterotrimeric G-protein, G; and active Ga-GTP, Ga. In addition, there are three dependent states: free ligand, L; free Gbg, Gbg; and inactive
Ga-GDP, Gd. The species concentrations are measured in molecules per cell, while ligand concentrations are measured in nanomoles per liter (nM).
A.1. Deterministic model
The dynamics are

d[R(t)]/dt = -kRL [L(t)][R(t)] + kRLm [RL(t)] - kRd0 [R(t)] + kRs
d[RL(t)]/dt = kRL [L(t)][R(t)] - kRLm [RL(t)] - kRd1 [RL(t)]
d[G(t)]/dt = -kGa [RL(t)][G(t)] + kG1 [Gd(t)][Gbg(t)]
d[Ga(t)]/dt = kGa [RL(t)][G(t)] - kGd [Ga(t)]

where the G-protein components are constrained by

[Gbg(t)] = [Gtotal] - [G(t)]
[Gd(t)] = [Gtotal] - [G(t)] - [Ga(t)].

The free ligand concentration L(t) is the input to the system. It is assumed that ligand uptake by the cells is fast, and has a negligible effect on the overall concentration. The parameter values used are

kRL = 2 x 10^6 M^-1 s^-1; kRLm = 1 x 10^-2 s^-1; kRd0 = 4 x 10^-4 s^-1
kRs = 4 (molecules per cell) s^-1; kRd1 = 4 x 10^-3 s^-1
kGa = 1 x 10^-5 (molecules per cell)^-1 s^-1; kG1 = 1 (molecules per cell)^-1 s^-1
kGd = 4 x 10^-3 s^-1
L(t): variable input, nominal value 1 nM
Gtotal = 10,000 (molecules per cell)
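As an illustration, the deterministic model can be integrated numerically. The rate constants below are the values listed in the text; the constant 1 nM ligand input, the ligand-free initial condition, and the 600 s horizon are choices made for this sketch.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Rate constants as listed in the text (species in molecules per cell,
# ligand concentration in M)
kRL, kRLm = 2e6, 1e-2        # M^-1 s^-1, s^-1
kRd0, kRs = 4e-4, 4.0        # s^-1, (molecules per cell) s^-1
kRd1 = 4e-3                  # s^-1
kGa = 1e-5                   # (molecules per cell)^-1 s^-1
kG1, kGd = 1.0, 4e-3         # (molecules per cell)^-1 s^-1, s^-1
Gtotal = 10000.0             # molecules per cell
L = 1e-9                     # ligand held constant at 1 nM (illustrative)

def rhs(t, x):
    R, RL, G, Ga = x
    Gbg = Gtotal - G                     # conservation relations
    Gd = Gtotal - G - Ga
    return [-kRL * L * R + kRLm * RL - kRd0 * R + kRs,
            kRL * L * R - kRLm * RL - kRd1 * RL,
            -kGa * RL * G + kG1 * Gd * Gbg,
            kGa * RL * G - kGd * Ga]

# Ligand-free steady state: R = kRs/kRd0, all G-protein inactive
x0 = [kRs / kRd0, 0.0, Gtotal, 0.0]
sol = solve_ivp(rhs, (0.0, 600.0), x0, method="LSODA", rtol=1e-8, atol=1e-6)
Ga_final = sol.y[3, -1]      # active Ga-GTP after 600 s of 1 nM ligand
```

The stiff recycling step (kG1 is large relative to the other rates) is why an implicit-capable solver such as LSODA is used here.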
A.2. Stochastic model Since all processes in the deterministic model are individual chemical events, it can be easily converted to a stochastic model. The states are the same, with abundance in molecules (and the system size being a single cell). Again, it is assumed that ligand uptake by the cells is fast, and has a negligible effect on the overall concentration. Thus we retain a molar measure of ligand concentration, and treat its effect as a variation of the effectively first order parameter kRL[L(t)], which has units of s^-1.
The set of reactions representing the model from Yi et al. (2003) is

r1, r2: L + R <=> RL (forward rate constant kRL, reverse kRLm)
r3, r4: 0 <=> R (forward kRs, reverse kRd0)
r5: RL + G -> Ga + Gbg + RL (kGa)
r6: RL -> 0 (kRd1)
r7: Ga -> Gd (kGd)
r8: Gd + Gbg -> G (kG1)

Numbering the reactions r1 through r8 (including reverse reactions), we have corresponding propensities

a1 = kRL L(t) R(t); a2 = kRLm RL(t); a3 = kRs; a4 = kRd0 R(t);
a5 = kGa RL(t) G(t); a6 = kRd1 RL(t); a7 = kGd Ga(t); a8 = kG1 Gd(t) Gbg(t)

Numbering the states (s1, s2, s3, s4) = (R, RL, G, Ga), the stoichiometry matrix is given as

        r1  r2  r3  r4  r5  r6  r7  r8
R     [ -1   1   1  -1   0   0   0   0 ]
RL    [  1  -1   0   0   0  -1   0   0 ]
G     [  0   0   0   0  -1   0   0   1 ]
Ga    [  0   0   0   0   1   0  -1   0 ]

The species Gbg and Gd are updated according to the conservations

Gbg(t) = Gtotal - G(t)
Gd(t) = Gtotal - G(t) - Ga(t).

The model was simulated using Gillespie’s direct method (Gibson and Bruck, 2000; Gillespie, 1977). The resulting numerical output was sampled at uniform time intervals. To mimic nonsystematic measurement noise, we added to each value a white noise term with a coefficient of variation (CV, the standard deviation divided by the mean) of 0.2.
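A minimal sketch of this simulation procedure follows: Gillespie's direct method on the eight reactions above, with the ligand fixed at 1 nM so that kRL[L] = 2 x 10^-3 s^-1. The 60 s horizon and the random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Rate constants as in the deterministic model; kRLL folds in the fixed
# 1 nM ligand (kRL * L = 2e-3 s^-1)
kRLL, kRLm, kRs, kRd0 = 2e-3, 1e-2, 4.0, 4e-4
kGa, kRd1, kGd, kG1 = 1e-5, 4e-3, 4e-3, 1.0
Gtotal = 10000

# Stoichiometry: rows are (R, RL, G, Ga), columns are reactions r1..r8
S = np.array([[-1,  1,  1, -1,  0,  0,  0,  0],
              [ 1, -1,  0,  0,  0, -1,  0,  0],
              [ 0,  0,  0,  0, -1,  0,  0,  1],
              [ 0,  0,  0,  0,  1,  0, -1,  0]])

def propensities(x):
    R, RL, G, Ga = x
    Gbg = Gtotal - G                 # conservation relations
    Gd = Gtotal - G - Ga
    return np.array([kRLL * R, kRLm * RL, kRs, kRd0 * R,
                     kGa * RL * G, kRd1 * RL, kGd * Ga, kG1 * Gd * Gbg])

x = np.array([10000, 0, 10000, 0])   # ligand-free steady state
t, t_end = 0.0, 60.0
times, traj = [t], [x.copy()]
while t < t_end:
    a = propensities(x)
    a0 = a.sum()
    t += rng.exponential(1.0 / a0)           # time to the next reaction
    j = rng.choice(8, p=a / a0)              # which reaction fires
    x = x + S[:, j]
    times.append(t)
    traj.append(x.copy())

traj = np.array(traj)
# Mimic nonsystematic measurement noise with CV = 0.2 on the Ga readout
Ga_noisy = traj[:, 3] * (1 + rng.normal(0.0, 0.2, len(traj)))
```

Sampling `traj` at uniform time intervals (rather than at the irregular reaction times) would complete the measurement model described in the text.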
REFERENCES Andrianantoandro, E., Basu, S., Karig, D. K., and Weiss, R. (2006). Synthetic biology: New engineering rules for an emerging discipline. Mol. Syst. Biol. 2, 2006–2028. Åström, K. J., and Murray, R. M. (2008). Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, Princeton, NJ. Bayliss, L. E. (1966). Living Control Systems. W. H. Freeman, San Francisco, CA. Beebe, D. J., Mensing, G. A., and Walker, G. M. (2002). Physics and applications of microfluidics in biology. Annu. Rev. Biomed. Eng. 4, 261–286. Bennett, M. R., and Hasty, J. (2009). Microfluidic devices for measuring gene network dynamics in single cells. Nat. Rev. Genet. 10, 628–638. Bennett, M. R., Pang, W. L., Ostroff, N. A., Baumgartner, B. L., Nayak, S., Tsimring, L. S., and Hasty, J. (2008). Metabolic gene regulation in a dynamically changing environment. Nature 454, 1119–1122. Berg, H. C., and Tedesco, P. M. (1975). Transient response to chemotactic stimuli in Escherichia coli. Proc. Natl. Acad. Sci. USA 72, 3235–3239. Block, S. M., Segall, J. E., and Berg, H. C. (1982). Impulse responses in bacterial chemotaxis. Cell 31, 215–226. Block, S. M., Segall, J. E., and Berg, H. C. (1983). Adaptation kinetics in bacterial chemotaxis. J. Bacteriol. 154, 312–323. Boyce, W. E., and DiPrima, R. C. (2008). Elementary Differential Equations. Wiley, New York, NY. Cluett, W. R., and Wang, L. (1991). Modelling and robust controller design using step response data. Chem. Eng. Sci. 46, 2065–2077. Conrad, E. D., and Tyson, J. J. (2006). Modeling molecular interaction networks with nonlinear ordinary differential equations. In “System Modeling in Cellular Biology,” (Z. Szallasi, V. Periwal, and S. Jorg, eds.), pp. 97–123. MIT Press, Cambridge, MA. Gibson, M. A., and Bruck, J. (2000). Efficient exact stochastic simulation of chemical systems with many species and many channels. J. Phys. Chem. A 104, 1876–1889. Gillespie, D. (1977).
Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361. Glendinning, P. (1994). Stability, Instability, and Chaos: An Introduction to the Theory of Nonlinear Differential Equations. Cambridge University Press, Cambridge, UK. Grilly, C., Strickler, J., Pang, W. L., Bennett, M. R., and Hasty, J. (2007). A synthetic gene network for tuning protein degradation in Saccharomyces cerevisiae. Mol. Syst. Biol. 3, 127. Hasty, J., McMillen, D. R., Isaacs, F., and Collins, J. J. (2001). Computational studies of gene regulatory networks: In numero molecular biology. Nat. Rev. Genet. 2, 268–279. Hasty, J., McMillen, D. R., and Collins, J. J. (2002). Engineered gene circuits. Nature 420, 224–230. Haykin, S., and Van Veen, B. (2005). Signals and Systems. Wiley, New York, NY. Hersen, P., McClean, M. N., Mahadevan, L., and Ramanathan, S. (2008). Signal processing by the HOG MAP kinase pathway. Proc. Natl. Acad. Sci. USA 105, 7165–7170. Ideker, T., Galitski, T., and Hood, L. (2001). A new approach to decoding life: Systems biology. Annu. Rev. Genomics Hum. Genet. 2, 343–372. Iglesias, P. A., and Ingalls, B. P. (eds.), (2009). Control Theory and Systems Biology, MIT Press, Cambridge, MA. Ingalls, B. P. (2004). A frequency domain approach to sensitivity analysis of biochemical networks. J. Phys. Chem. B 108, 1143–1152. Ingalls, B. P., Yi, T.-M., and Iglesias, P. A. (2006). Using control theory to study biology. In “System Modeling in Cellular Biology,” (Z. Szallasi, V. Periwal, and S. Jorg, eds.), pp. 243–267. MIT Press, Cambridge, MA.
Kærn, M., and Weiss, R. (2006). Synthetic gene regulatory systems. In “System Modeling in Cellular Biology,” (Z. Szallasi, J. Stelling, and V. Periwal, eds.), pp. 269–295. MIT Press, Cambridge. Kærn, M., Blake, W., and Collins, J. J. (2003). The engineering of gene regulatory networks. Annu. Rev. Biomed. Eng. 5, 179–206. Khalil, A. S., and Collins, J. J. (2010). Synthetic biology: Applications come of age. Nat. Rev. Genet. 11, 367–379. Khoo, M. C. K. (2000). Physiological Control Systems: Analysis, Simulation, and Estimation. IEEE Press, New York, NY. Kitano, H. (2002). Computational systems biology. Nature 420, 206–210. Lipan, O. (2008). Enlightening rhythms. Science 319, 417–418. Lipan, O., and Wong, W. H. (2005). The use of oscillatory signals in the study of genetic networks. Proc. Natl. Acad. Sci. USA 102, 7063–7068. Ljung, L. (1999). System Identification: Theory for the User. Prentice Hall, Saddle River, NJ. McClean, M. N., Hersen, P., and Ramanathan, S. (2009). In vivo measurement of signaling cascade dynamics. Cell Cycle 8, 373–376. McCudden, C. R., Hains, M. D., Kimple, R. J., Siderovski, D. P., and Willard, F. S. (2005). G-protein signaling: Back to the future. Cell. Mol. Life Sci. 62, 551–577. Mettetal, J. T., Muzzey, D., Gómez-Uribe, C., and van Oudenaarden, A. (2008). The frequency dependence of osmo-adaptation in Saccharomyces cerevisiae. Science 319, 482–484. Moles, C. G., Mendes, P., and Banga, J. R. (2003). Parameter estimation in biochemical pathways: A comparison of global optimization methods. Genome Res. 13, 2467–2474. Oldham, W. M., and Hamm, H. E. (2008). Heterotrimeric G protein activation by G-protein-coupled receptors. Nat. Rev. Mol. Cell Biol. 9, 60–71. Purnick, P. E. M., and Weiss, R. (2009). The second wave of synthetic biology: From modules to systems. Nat. Rev. Mol. Cell Biol. 10, 410–422. Saeki, M., and Saito, K. (2002). Estimation of frequency response set from interval data of step response.
41st Conference of the Society of Instrument and Control Engineers (SICE), Vol. 2, Osaka, Japan, pp. 1233–1236. Seber, G. A. F., and Wild, C. J. (2003). Nonlinear Regression. Wiley, New York, NY. Segall, J. E., Block, S. M., and Berg, H. C. (1986). Temporal comparisons in bacterial chemotaxis. Proc. Natl. Acad. Sci. USA 83, 8987–8991. Shimizu, T. S., Tu, Y., and Berg, H. C. (2010). A modular gradient-sensing network for chemotaxis in Escherichia coli revealed by responses to time-varying stimuli. Mol. Syst. Biol. 6, 382. Shimizu-Sato, S., Huq, E., Tepperman, J. M., and Quail, P. H. (2002). A light-switchable gene promoter system. Nat. Biotechnol. 20, 1041–1044. Sourjik, V., Vaknin, A., Shimizu, T. S., and Berg, H. C. (2007). In vivo measurement by FRET of pathway activity in bacterial chemotaxis. Methods Enzymol. 423, 365–391. Tu, Y., Shimizu, T. S., and Berg, H. C. (2008). Modeling the chemotactic response of Escherichia coli to time-varying stimuli. Proc. Natl. Acad. Sci. USA 105, 14855–14860. Tyson, J. J., Chen, K., and Novak, B. (2001). Network dynamics and cell physiology. Nat. Rev. Mol. Cell Biol. 2, 908–916. Westermark, P. O., Welsh, D. K., Okamura, H., and Herzel, H. (2009). Quantification of circadian rhythms in single cells. PLoS Comput. Biol. 5, e1000580. Westwick, D. T., and Kearney, R. E. (2003). Identification of Nonlinear Physiological Systems. IEEE Press, Wiley, Piscataway, NJ. Wiener, N. (1965). Cybernetics: Or the Control and Communication in the Animal and the Machine. The MIT Press, Cambridge, MA. Yi, T.-M., Kitano, H., and Simon, M. I. (2003). A quantitative characterization of the yeast heterotrimeric G protein cycle. Proc. Natl. Acad. Sci. USA 100, 10764–10769.
CHAPTER ELEVEN
Biochemical Pathway Modeling Tools for Drug Target Detection in Cancer and Other Complex Diseases

Alberto Marin-Sanguino,* Shailendra K. Gupta,†,‡ Eberhard O. Voit,§,¶ and Julio Vera†

Contents
1. Introduction and Overview
2. Biomedical Knowledge and Data Retrieval: Constructing a Conceptual Map of a Biochemical Network
3. Mathematical Modeling of Biochemical Networks: Translating Knowledge into Mathematical Equations
   3.1. Network reconstruction
   3.2. Flux balance analysis
   3.3. Dynamic models
4. Model Calibration: Matching the Mathematical Model to Quantitative Experimental Data
5. Predictive Model Simulations as a Tool for Drug Discovery
6. Model Sensitivity Analysis as a Tool for Detecting Critical Processes in Biochemical Networks
7. Drug Target Detection Through Model Optimization
8. One Step Further: Combining Mathematical Modeling with Drug Screening via Protein Docking-Based Techniques
9. Final Remarks
Acknowledgments
References
Abstract
In the near future, computational tools and methods based on the mathematical modeling of biomedically relevant networks and pathways will be necessary for the design of therapeutic strategies that fight complex multifactorial diseases.

* Department of Membrane Biochemistry, Max Planck Institute of Biochemistry, Martinsried, Germany
† Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
‡ Indian Institute of Toxicology Research (CSIR), Lucknow, India
§ Integrative BioSystems Institute, Georgia Institute of Technology, Atlanta, Georgia, USA
¶ The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA

Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87011-4
© 2011 Elsevier Inc. All rights reserved.
Beyond the use of pharmacokinetic and pharmacodynamic approaches, we propose here the use of dynamic modeling as a tool for describing and analyzing the structure and responses of signaling, genetic and metabolic networks involved in such diseases. Specifically, we discuss the design and construction of meaningful models of biochemical networks, as well as tools, concepts, and strategies for using these models in the search of potential drug targets. We describe three different families of computational tools: predictive model simulations as tools for designing optimal drug profiles and doses; sensitivity analysis as a method to detect key interactions that affect critical outcomes and other characteristics of the network; and other tools integrating mathematical modeling with advanced computation and optimization for detecting potential drug targets. Furthermore, we show how potential drug targets detected with these approaches can be used in a computer-aided context to design or select new drug molecules. All concepts are illustrated with simplified examples and with actual case studies extracted from the recent literature.
1. Introduction and Overview Today’s drug discovery process starts with a chemical substance and aims to identify its potential therapeutic effects. This task very often requires the screening of thousands if not millions of molecules in “blind,” undirected biological tests. Once a positive therapeutic effect is detected and the side effects are assessed to be tolerable, the compound is allowed to enter clinical trials and, if successful, is subsequently launched for treatment in the general population (Paul et al., 2010). The specific mechanisms responsible for the therapeutic effects are sometimes only identified a posteriori, possibly after a long time, as was the case of aspirin (Miner and Hoffhines, 2007). This traditional paradigm for drug discovery has been challenged in the last decade because of three significant problems. Firstly, the demanding criteria for proofs of efficacy and safety with which new drugs must comply, together with the complexity of the targeted diseases, make the drug discovery process extremely long, expensive, and, in the vast majority of cases, poorly cost-effective. To launch a new drug very often requires decades of research and roughly a billion dollars in up-front investments. The second element to be considered is the incredible progress made in quantitative techniques for molecular biology in recent years. A decade ago experimentalists struggled to produce enough data for testing a hypothesis. Now, the emergence of “omics” technologies has almost inverted the situation: Data retrieval, processing, and analysis have become the critical steps in every modern biomedical research lab. The good news is that a plethora of computational techniques, including mathematical modeling, are becoming powerful enough to help with the management and analysis
of these enormous amounts of biological data. This computational support will continue to become stronger and increase our ability to analyze biochemical systems relevant for many diseases, thereby boosting the discovery of drugs and therapies. In addition to this, recent developments in genetic engineering and synthetic chemistry and biology promise the design of drugs a la carte in the foreseeable future. These new methods of modern allosteric synthesis and synthetic biology will enable the creation of specifically designed biomolecules that have the desired biological effect, while minimizing side effects as well as cost. The ensemble of biological and computational elements will make it possible to formulate a completely new paradigm for drug discovery in the near future, in which the traditional sequence of steps will almost be reversed. In this emerging era of computer-aided pharmacology, drug discovery will be initiated by stating the desired effect to be achieved and by subsequently modulating biochemical systems related to the target disease, first with computational and then with experimental methods. The experimental methods will primarily search for a feasible and implementable mechanism, which may restore the proper functioning of the biochemical system. The alterations of the system could possibly involve a range of actions, from diet design to finding protein drug targets, but the overriding rationale expected is the modulation of physiological processes that alleviate the pathological condition associated with the disease. This last step will involve the design of a therapy, including the chemical design of drugs that effectively interact with the intended targets. This vision of the future purports that mathematical modeling is a mandatory component of drug development, and it is necessary to back up this claim. 
For instance, one could think that sufficiently many good data, expert knowledge, and intuition should eventually be sufficient for solving biomedical problems with traditional and future experimental methods. The reality, however, points in a different direction. First, the investigation of complex and highly interconnected biochemical networks and the integration of high-throughput data are no longer feasible without the support of computational and mathematical modeling techniques. The enormous power of biological intuition will always be required but becomes insufficient in the era of large biological interaction networks of proteins, genes, and metabolites. Perhaps even more importantly, the dynamic behavior of biological networks with regulatory structures like feedback loops can be counterintuitive and change qualitatively in response to sometimes rather small quantitative perturbations of their components or of the environment. Figure 11.1 illustrates this issue with a simple but relevant example, in which distinctly different behaviors in p53 signaling arise, depending on the strength of the inhibition exerted by the protein MDM2 within a single feedback loop. In cases of this nature, mathematical
322
Alberto Marin-Sanguino et al.
[Figure 11.1: time courses of p53 and MDM2 levels (time in hours) following a stress signal (DNA damage).]
Figure 11.1 A negative feedback loop is a regulatory structure in which the activation of a signaling event regulates a process upstream in the system. The feedback often ensures homeostasis in the system, that is, the ability to keep an internal state that is robust toward small fluctuations in the input signal. Feedbacks also facilitate fast signal termination and permit sustained oscillations. These concepts are illustrated here with the case of a feedback loop involved with the tumor suppressor p53 and its inhibitor MDM2. Low levels of DNA damage trigger the synthesis and activation of p53, which is terminated, with delay, upon synthesis of its inhibitor MDM2, due to the negative feedback loop structure of the system (solid black line). By contrast, extreme levels of DNA damage can cause the same feedback loop to trigger sustained oscillations in p53 values (solid red line), which have been experimentally verified (Lahav et al., 2004).
modeling is a "must" because nonlinear properties evade our intuition and can be understood only in mathematical terms (for introductions to nonlinear biochemical systems, see Voit, 2000; Tyson et al., 2003; Vera and Wolkenhauer, 2008). Therefore, a systemic perspective of structurally and dynamically complex biochemical pathways is needed. The mathematical framework that provides this kind of perspective is dynamical systems theory, and the language that implements this idea and permits actual exploration and analysis is mathematical modeling. Applied to biology, this framework is the core of systems biology. Systems biology addresses biomedical questions by integrating experiments in iterative cycles with mathematical modeling, computational simulation, and model analysis (www.erasysbio.net). In future applications of systems biology to drug development, mathematical modeling techniques and other computational approaches are positioned to play an essential role, especially in the integration and analysis of quantitative and often heterogeneous experimental data. But modeling will also aid in the detection of key interactions in biochemical networks, which may be corrupted by disease, and in the detailed interpretation of information regarding potential biomarkers and the development of new drugs.
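The counterintuitive dynamics of such feedback loops are easy to reproduce in silico. The sketch below is a deliberately minimal, hypothetical negative-feedback model in the spirit of the p53–MDM2 loop of Fig. 11.1; the equations and all parameter values are invented for illustration and are not those of Lahav et al. (2004).

```python
# Hypothetical minimal negative-feedback loop (all parameters invented):
#   dp/dt = s - a*m*p   p53: constant stress-driven synthesis,
#                            degradation promoted by MDM2
#   dm/dt = b*p - c*m   MDM2: induced by p53, first-order decay
def simulate(s=1.0, a=2.0, b=1.0, c=0.5, dt=0.01, t_end=50.0):
    p = m = 0.0
    p_trace = []
    for _ in range(int(t_end / dt)):
        # explicit Euler step; both derivatives use the old state
        p, m = p + dt * (s - a * m * p), m + dt * (b * p - c * m)
        p_trace.append(p)
    return p_trace

trace = simulate()
peak, final = max(trace), trace[-1]
# p53 first overshoots, then the delayed rise of MDM2 pulls it back
# toward its homeostatic level (0.5 for these parameters).
```

Even this two-variable caricature shows the signature behavior described in the text: a transient p53 pulse that is terminated, with delay, by the accumulating inhibitor.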
Although a definitive standard workflow for drug target detection using a systems biology approach has not yet emerged, we can distinguish a number of steps, which are visualized in Fig. 11.2. The procedure can be divided into three
[Figure 11.2: workflow diagram. Biological/biomedical knowledge, experimental data, and hypotheses feed into mathematical modeling and model calibration; iterative model refinement yields a validated/predictive model, which supports predictive simulations and the analysis of regulation; computational methods for drug target detection include drug dose profiling, sensitivity analysis, optimization-based analysis, and unsupervised and supervised drug target detection.]
Figure 11.2 Conceptual scheme of a drug target discovery workflow that includes data-based mathematical modeling as a core component.
main conceptual blocks, namely: (1) retrieval and processing of biomedical knowledge; (2) mathematical modeling; and (3) application of computational methods for model analysis and prediction. The latter uses the constructed model as a predictive tool to identify potential drug targets and may include procedures to select, design, and test small molecules with a desired therapeutic effect.

In this anticipated workflow, the first step is to retrieve, select, and organize the relevant biomedical information describing the investigated biochemical processes and their physiological context. In some circumstances, this state of the art of biomedical knowledge about the system might be complemented with biological hypotheses about the structure and dynamics of the investigated biological network. Once processed, this biomedical knowledge is converted into a conceptual map of the investigated biochemical network. The map provides a first level of abstraction and specifies important information that is to be transferred into the mathematical model.

Depending on the available information and data, as well as the characteristics of the investigated biochemical system, different modeling frameworks can be used (Voit et al., 2008). But no matter what model is chosen, the information in the map must be complemented with quantitative data that are to be generated in specific wet lab experiments to design and characterize the model, in a process sometimes called model calibration. The desired output of this process, which may require several iterations between modeling and experimentation, is the proper characterization of the mathematical model in terms of precise, reliable values for the parameters in the model equations.
Since the aim of drug target detection is the ability to make reliable, and possibly quantitative, predictions, the model must be subjected to an external validation, which may be accomplished in different ways, such as comparisons with additional data sets or tests of model predictions regarding previously untested scenarios. These validation techniques can be used iteratively in a process of model refinement that leads to increasingly accurate models. While validation is very important, the ultimate goal of modeling-based drug discovery is not to create a predictive mathematical model, but to use it in the detection of potential drug targets. Toward this end, the following sections define, explain, and illustrate the use of a number of computational and mathematical modeling tools that allow: (a) the detection of critical biochemical processes, whose modulation may overcome the pathological conditions under investigation; (b) the design of clinically feasible strategies to modulate them (Fitzgerald et al., 2006); and (c) the identification of potential new biomarkers (Wang et al., 2010). The main emphasis will be on biochemical networks, which are pertinent to several aspects of drug target detection. In order to make the text accessible to a wide audience, we will proceed as follows: every technique is illustrated with a simple mathematical "toy" model, which makes the concepts easy to understand and visualize.
At the end of each section we include and discuss one or more published cases in which these techniques have been used. This discussion will allow the reader to see how the techniques are applied in real life. For the sake of simplicity, we have removed most of the mathematical equations and technical concepts from the text, and even the few remaining technical details are not truly required to understand the applications. Readers who are not interested in mathematical detail can therefore skip Sections 3.1–3.3 and 4, where the construction of mathematical models is explained in some detail. The equations of the mathematical model that is used throughout as an example can be found in the Appendix.
2. Biomedical Knowledge and Data Retrieval: Constructing a Conceptual Map of a Biochemical Network

The first step in a procedure for model-based drug target detection is to retrieve, select, and organize the relevant biomedical information describing the investigated biochemical processes and their physiological context. The aim here is to list the critical molecules, usually genes, proteins, metabolites, small molecules, and various complexes, and the biochemical interactions involved in the investigated system. Good resources for this initial step are: (a) relevant biomedical and biological publications selected manually, with the help of PubMed, or automatically retrieved using text mining tools; (b) databases containing relevant information about biochemical interactions or protein–protein interactions, such as KEGG, MetaCyc, and Brenda; and (c) the expert advice of biomedical researchers, clinicians, or biologists who have acquired deep knowledge of the investigated biological phenomenon. The Appendix presents a list of useful tools and databases for data and information retrieval.

In some circumstances, the retrieved and processed biomedical knowledge of the system can be complemented with a set of biological hypotheses about the structure and dynamics of the investigated biological network that are relevant for drug target detection. Once the necessary biomedical knowledge is acquired and processed, it is converted into an annotated map of the investigated processes using standard protocols (Le Novère et al., 2005, 2009; Voit, 2000). Information concerning the critical ontologies associated with each interacting partner of the network can be retrieved from the Gene Ontology (GO) and used to complement the constructed annotated map of the biochemical network under analysis. A visual version of the map can be produced using software tools for pathway representation and annotation.
The Appendix lists useful software tools for construction and annotation of conceptual maps.
The map thus constructed is itself a model in the most general sense. It formalizes the knowledge of the system, enumerates the relevant variables, and identifies interactions among them. By providing a certain degree of abstraction and using a language that is accessible to everyone, these concept maps are an excellent tool for discussion between the members of interdisciplinary teams (experimentalists and modelers). However, such a map contains only a "static view" of the system; the fine details of regulation and the dynamics of the biochemical system under different experimental conditions can only be understood through developing and using a mathematical model (Goel et al., 2006; Vera and Wolkenhauer, 2008) (Fig. 11.3).

[Figure 11.3: high-throughput expression data, protein–protein interaction databases, and text mining of the biomedical literature establish the network structure; quantitative time series data characterize the network dynamics.]

Figure 11.3 Model construction as a two-step process. The structure of the network (the set of metabolites, proteins, and small molecules together with the interactions between them) is first established based on the biomedical knowledge that is documented in relevant publications and databases. We call this representation the "annotated map" of the biochemical system. To characterize the network dynamics, which consists of changes in the concentrations of the compounds over time, quantitative data are required and must be integrated into a dynamic model.

Figure 11.4 demonstrates the process with a toy example, where two metabolites M1 and M2 are combined to synthesize a product called M3. A later transformation of M3 is catalyzed by an allosteric enzyme P, whose activation (P → P*) is regulated by a kinase I, whose action is related to a signaling event.

[Figure 11.4: network diagram with fluxes V1–V8 connecting M1, M2, and M3 and the P/P* cycle regulated by I.]

Figure 11.4 Conceptual map of a fictitious biochemical network used here as a simple case study. Solid black arrows account for transformation processes, while dashed arrows account for regulatory processes (dashed black arrow: positive regulation; dashed red line: negative regulation). An example of a pathway with such a structure is the synthesis of amino acids, where M2 represents ammonia and M1 is either a common precursor for the family of amino acids through V4 or serves as substrate for the synthesis of other compounds through V2.

The system contains a feedback loop in which M3 regulates the input flux of M2. In healthy conditions, this feedback loop maintains homeostasis in M3 for a wide range of values of the input fluxes of M1 and M2. By contrast, an alleged pathological condition causes an overexpression of enzyme P. The heightened activity level leads to an imbalance in the network fluxes, which causes an accumulation of M2 to toxic levels. The toy example could be a simplified model of a biosynthetic pathway in which two precursors M1 and M2 are combined to produce a final product, as is the case with the synthesis of amino acids. In this latter case, M2 would play the role of ammonia, entering the cell through a reversible exchange V5/V6, and M1 would be the common precursor of the family that can be converted into an amino acid through V4 or continue to feed the synthesis of the remaining amino acids in the family through V2.
3. Mathematical Modeling of Biochemical Networks: Translating Knowledge into Mathematical Equations A map of the network such as in Fig. 11.4 contains quite a bit of information. It displays the variables that are considered relevant, the processes that transform the components into one another, as well as regulatory
Figure 11.5 Dynamics of a simple biochemical system, represented in a so-called phase space. The pathway has only two metabolites and all its states can be described as points in the plane, where the coordinates are the concentrations of M1 and M2. Given a hypothetical initial condition (red dot; initial concentrations for M1 and M2), the system evolves through different states (red line) until it reaches a steady state configuration, in which concentrations of compounds do not change anymore. In this case, the process begins with M2 decreasing very quickly, while M1 is essentially unchanged. Subsequently, both M1 and M2 increase, oscillate, and then come to rest at the steady state.
signals. The next step is to move from this qualitative description to one that is quantitative and dynamic. In other words, the map needs to be translated into a mathematical model that is able to predict responses to changes in the biochemical pathway in terms of metabolite and protein concentrations, signals, or metabolic fluxes. Within the framework of a mathematical model, every configuration of the biochemical network is defined as a state of the system. Thus, the state of a biochemical network is characterized by the magnitudes of all metabolites, interacting proteins, external signals, and fluxes. If a stimulus is applied to the biochemical network, it "moves" from one state to another, which typically means that the concentrations of the compounds will change over time. Given the state of the biochemical system at a certain instant, the model is able to predict how it will evolve over time and what its configurations will be in the future (Fig. 11.5).
The most frequently used modeling tools for dynamic systems are sets of ordinary differential equations (ODEs). In these models, the change over time in the concentration of a metabolite or interacting protein Xi in the biochemical pathway or network can be written with an equation like Eq. (11.1):

$$\frac{dX_i}{dt} = s_{i,1}V_1 + s_{i,2}V_2 + \cdots + s_{i,p}V_p = \sum_{j=1}^{p} s_{i,j}V_j \qquad (11.1)$$
The left-hand side of the equation can be read as the variation of Xi per unit time, while the right-hand side is the sum of all the processes leading to an increase (uptake, synthesis, etc.) or a decrease (export, degradation, etc.) in Xi. The contribution of each process to the overall change in Xi has two factors: Vj is the reaction or transport rate with which the process takes place, and si,j is the stoichiometric coefficient of the reaction. These coefficients are positive for processes that increase Xi and negative for processes that decrease Xi. This kind of mathematical model can be used to simulate the dynamics of a biochemical pathway over time. Using appropriate software and quantifying the initial conditions of the system (concentrations of the metabolites and proteins), as well as any stimulus that is applied, the integration of these equations predicts the fate of the biochemical pathway after stimulation or any kind of perturbation, in the sense that the model exhibits how much every metabolite and protein concentration changes over time. Sections 3.1–3.3 describe the details of different kinds of mathematical models used for investigating biochemical networks in drug targeting.
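As a minimal illustration of Eq. (11.1), the following sketch integrates a hypothetical two-metabolite chain (constant uptake, conversion, export; all rate constants invented, and not the network of Fig. 11.4) with the explicit Euler method. In practice any ODE solver would be used instead.

```python
# Explicit-Euler integration of Eq. (11.1) for a hypothetical chain:
# v1 = constant uptake, v2 = conversion X1 -> X2, v3 = export of X2.
S = [[1, -1,  0],   # X1: produced by v1, consumed by v2
     [0,  1, -1]]   # X2: produced by v2, consumed by v3

def rates(x):
    k_in, a, b = 1.0, 1.0, 2.0            # invented rate constants
    return [k_in, a * x[0], b * x[1]]     # v1, v2, v3

def simulate(x0, dt=0.01, t_end=20.0):
    x = list(x0)
    for _ in range(int(t_end / dt)):
        v = rates(x)
        # dXi/dt = sum_j s_ij * v_j   (Eq. 11.1)
        x = [xi + dt * sum(s_ij * v_j for s_ij, v_j in zip(S[i], v))
             for i, xi in enumerate(x)]
    return x

x_ss = simulate([0.0, 0.0])   # approaches the steady state [1.0, 0.5]
```

The simulated concentrations settle where production and consumption balance for every variable, which is exactly the steady-state condition discussed in the next sections.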
3.1. Network reconstruction

The coefficients si,j in Eq. (11.1) can be deduced from the map. They contain all information about the connectivity of the network and prescribe how some metabolites and protein states are transformed into others. It is helpful to collect all coefficients in a table, as shown in Fig. 11.6. The rows of the table correspond to the different variables of the system and the columns to the various processes. Every entry in the table relates a variable to a process. If a process does not involve a certain variable, its entry in the table is zero. In mathematical terminology, the table is called the stoichiometric matrix (S). Setting up the stoichiometric matrix is one of the milestones in the construction of a model and integrates much biological information. The process is also referred to as network reconstruction because it translates the connectivity of the map into mathematical terms, thereby enabling its computational analysis.
Stoichiometry

        v1   v2   v3   v4   v5   v6   v7   v8
M1       1   -1   -1    0    0    0    0    0
M2       0    0   -1    0    1   -1    0    0
M3       0    0    1   -1    0    0    0    0
P        0    0    0    0    0    0   -1    1
P*       0    0    0    0    0    0    1   -1
I        0    0    0    0    0    0    0    0

Kinetics, qualitative

        v1   v2   v3   v4   v5   v6   v7   v8
M1       0    +    +    0    0    0    0    0
M2       0    0    +    0    0    +    0    0
M3       0    0    0    +    -    0    0    0
P        0    0    0    +    0    0    +    0
P*       0    0    0    0    0    0    0    +
I        0    0    0    0    0    0    +    0

Kinetics, semi-quantitative

        v1   v2   v3   v4     v5      v6   v7     v8
M1       0  0.5  0.5    0      0       0    0      0
M2       0    0  0.5    0      0     0.5    0      0
M3       0    0    0  0.5  [0,-2]      0    0      0
P        0    0    0    1      0       0    1      0
P*       0    0    0    0      0       0    0      1
I        0    0    0    0      0       0  [0,1]    0
Figure 11.6 Information regarding the structure of a biochemical pathway can be condensed into a set of tables or matrices, namely the stoichiometric matrix and a table specifying the effect of every compound in any of the biochemical processes. This table can also be used to assign qualitative values to the kinetic orders (see next section).
The information on which the network reconstruction is based can come from different sources, ranging from common biochemical knowledge to annotated genomes or databases. There are, however, some ambiguities that often arise in the process and must be resolved in order to obtain a reliable and unambiguous network representation. For instance, the functions of most genes in a genome are identified by sequence similarity and therefore have limited specificity. Some examples of these limitations are the specificity of transporters and cofactors. For this reason, it is often necessary to complement the information found in the genome with additional information, such as whether a certain reaction uses NADH or NADPH as an electron donor. The overall stoichiometry of aggregated processes also poses a difficulty in network reconstruction. The electron transport chain, complex biosynthetic processes, and even biomass production are often lumped into a single reaction for modeling purposes, and the overall stoichiometry of such a reaction cannot normally be determined by
genomic information alone. In such cases, experimental information and educated guesses must be used to assign realistic values to the unknown coefficients. The analysis of the stoichiometric matrix provides some insight into important characteristics of the system. Mathematical properties of the matrix can be associated with conserved pools. For instance, the sum of the concentrations of all the states of a single protein in a signaling pathway is constant, as is the case for P and P* in the illustrative example. However, the stoichiometric matrix becomes most useful for analyzing possible steady states. Since all the concentrations remain constant in the steady state, the rates of the processes must balance one another in a way that is determined by the configuration of the network and, therefore, the stoichiometric matrix. An algebraic property of the stoichiometric matrix, called its rank, determines which reaction rates or fluxes must be measured in order to completely characterize a certain steady state. In the example, for instance, no steady state can be reached unless V3 = V4. It is therefore enough to measure only one of the two. In most cases, the number of reaction rates or fluxes to be measured is too large for such an approach to be practical, which has prompted the development of techniques for interpreting incomplete information or even for making predictions regarding the distribution of all fluxes without measuring most of them. The most popular of such techniques is flux balance analysis (FBA) (Palsson, 2006), whose basic concepts and applications to drug discovery we discuss in the coming paragraphs.
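The conserved-pool argument can be checked mechanically on the stoichiometric matrix. The sketch below encodes the matrix of the toy network (as reconstructed in Fig. 11.6, rows as Python lists ordered v1..v8) and verifies that the P and P* rows cancel, so d(P + P*)/dt = 0 for every possible flux vector.

```python
# Stoichiometric matrix of the toy network (Fig. 11.6):
# one row per variable, one column per process v1..v8.
S = {
    "M1": [1, -1, -1,  0, 0,  0,  0,  0],
    "M2": [0,  0, -1,  0, 1, -1,  0,  0],
    "M3": [0,  0,  1, -1, 0,  0,  0,  0],
    "P":  [0,  0,  0,  0, 0,  0, -1,  1],
    "P*": [0,  0,  0,  0, 0,  0,  1, -1],
    "I":  [0,  0,  0,  0, 0,  0,  0,  0],
}

# A conserved pool shows up as a combination of rows that cancels exactly:
# the P and P* rows sum to the zero vector, so the total amount of the
# enzyme (P + P*) cannot change, whatever the fluxes are.
pool_row = [a + b for a, b in zip(S["P"], S["P*"])]
p_pool_is_conserved = all(c == 0 for c in pool_row)

# Likewise, the M3 row (+1 for v3, -1 for v4, zero elsewhere) shows that
# any steady state requires V3 = V4, as stated in the text.
```

The same kind of row bookkeeping, done systematically via the rank of S, is what determines how many fluxes must be measured to pin down a steady state.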
3.2. Flux balance analysis

FBA is concerned mainly with the stoichiometry of the system and the information that can be gained from it concerning the steady states of the biochemical network. At the steady state, both concentrations and reaction rates remain constant, and Eq. (11.1) becomes Eq. (11.2):

$$\frac{d\mathbf{X}}{dt} = 0 \;\Rightarrow\; S\,\mathbf{v} = 0 \qquad (11.2)$$
This equation must be fulfilled in every steady state. It is of note that the left-hand side of the equation is expressed in terms of concentration variables, while the right-hand side is formulated in terms of fluxes. Thus, the same system permits alternative (dual) representations (Marin-Sanguino et al., 2010). Therefore, of all the possible states in a phase space like that of Fig. 11.5, there is a region defined by Eq. (11.2) where steady states may be found. This feasible region can be further limited by complementing the steady state condition
with additional constraints. These constraints can in principle be of any nature. For mathematical convenience, they are usually represented as linear functions of the fluxes of the system. Thus, the feasible region is normally defined by:

(a) Steady state condition. Since all the rates must be balanced for every node of the network, there is an equation for each node that must add up to zero. In the illustration example, for instance, the balancing at M3 implies V3 − V4 = 0 and that at M1 requires V1 − V2 − V3 = 0.

(b) Physiological bounds. Maximum and minimum values constrain the variables and rates in the system. Upper limits can be due to hard physical and chemical constraints (e.g., the diffusion limit for oxygen) or normalizations (e.g., an arbitrary limit of 100 for glucose uptake to normalize all the fluxes with respect to it).

(c) Reaction reversibility. Rates of irreversible reactions can be defined as strictly positive. These constraints are extremely important, since failing to identify irreversible reactions can lead to unrealistic scenarios like a futile cycle running backward to generate ATP "for free."

(d) Experimental data. When rates can be measured, their values can be used as constraints in order to accept only those scenarios that are consistent with the observed behavior of the system.

Ideally, the feasible region for the parameters of a system would reduce to a single point if enough experimental information were available. In practice, this is only the case for very simple systems, and additional techniques must be used to explore relevant steady states within the feasible region. One way of moving toward physiological solutions is to define biologically meaningful goals that can be attained by the biochemical network. Such goals depend on the nature of the network, but examples include the production of energy or redox equivalents, or the synthesis of specific biomolecules.
The definition of such goals enables the computation of how each goal can be optimally achieved within constraints that are physiologically meaningful. To compute optimal states, an optimization problem must be solved. For instance, the task could be the following: "Find a state in the feasible region such that ATP production is maximal." Optimization problems generally pose hard mathematical challenges, unless they belong to one of several subclasses that can be solved efficiently. Thanks to the linear nature of the constraints described above, any objective that can be formulated as a linear combination of the fluxes leads to a class of optimization problems called linear programming, which can be solved for very large numbers of variables. The solution of the optimization problem provides a steady state value for each of the fluxes of the system. This technique enables the exploration of the capabilities of the network in terms of energy production, maximal conversion yields, or other features.
We can think of our particular example as a biosynthetic pathway that has V4 as its main output. The inputs of M1 and M2 are limited by some maximum value, and V2 routes flux to other metabolic processes and should therefore be kept above a certain minimum. Due to the architecture of the network, any steady state requires that V1 = V2 + V3, V4 = V3, and V5 − V6 = V3. These constraints enable us to represent every possible steady state in terms of only three fluxes: V1, V2, and V4. Our optimization problem can thus be stated as:

maximize V4
subject to:
  V1 = V2 + V4
  V1 < M1 max input
  V2 > V2 min
  V4 < M2 max input

All other fluxes can be calculated from these three as shown above. The feasible region for this problem is shown in Fig. 11.7. In this case, all steady states are contained in a plane; in general, the solution space is multidimensional. The red polygon encloses the feasible area, where all constraints are satisfied. The solution to this problem is not unique: all states lying on the dashed line in Fig. 11.7 share the same value for the objective (V4) and differ only in the value of the alternate branch (V2). This is called a degenerate solution, and it often occurs in FBA, especially for large networks. Degenerate solutions can be the result of under-constrained models, which implies that additional biological constraints should be formulated. However, they may also occur when there are redundant pathways in a network that provide robustness or reroute overflows. They furthermore arise in cases such as our example, when the network can meet two different goals at once. In our case, the biosynthetic pathway can work at full capacity (maximal V4) while still providing flux through the alternate branch (V2) up to a certain limit. This situation would completely disappear if the maximum input of M1 were less than that for M2. In this latter case, our FBA problem would have a unique solution.
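Although real FBA problems are handed to a linear programming solver, the toy problem above is small enough to solve by brute force, which also makes the degeneracy visible. The numerical bounds below (M1 max input = 100, V2 minimum = 20, M2 max input = 60) are invented for illustration only.

```python
# Brute-force scan of the toy FBA problem over an integer grid
# (a stand-in for a real LP solver; all bounds are invented).
M1_MAX, V2_MIN, M2_MAX = 100, 20, 60

best_v4, optima = -1, []
for v2 in range(M1_MAX + 1):
    for v4 in range(M2_MAX + 1):
        v1 = v2 + v4                       # steady-state balance: V1 = V2 + V4
        if v1 <= M1_MAX and v2 >= V2_MIN:  # physiological bounds
            if v4 > best_v4:
                best_v4, optima = v4, [(v1, v2, v4)]
            elif v4 == best_v4:
                optima.append((v1, v2, v4))

# best_v4 is 60, but 21 grid points attain it (V2 anywhere from 20 to 40):
# this is the degenerate solution set described in the text.
```

The multiple optima correspond to the dashed line of Fig. 11.7: the objective V4 is pinned at its maximum while the alternate branch V2 remains free within its bounds.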
FBA focuses on steady states and does not include regulatory information, such as the effects of feedback regulation. For that reason, not every state predicted by FBA is achievable by the cell. However, FBA flux distributions can be interpreted as theoretical maxima, since states that are outside the feasible region are impossible to attain. For this reason, FBA can be used to check genome annotations and has been especially successful in predicting the impact of knocked-out genes on the overall performance of biochemical and metabolic systems. An example is the use of FBA to predict
Figure 11.7 Feasible region for the example problem. The blue plane represents constraints among the fluxes, while the red area shows the domain that satisfies all constraints.
lethal mutations in the protozoan pathogen Leishmania major (Chavali et al., 2008). In this work, the authors reconstructed the metabolic network of this parasite of mammalian organisms and, using FBA, simulated a comprehensive set of lethal single and double gene deletions in search of therapeutically relevant knockouts in the microbe. By solving the optimization problem with maximum growth as the objective, and repeating the optimization with one or two genes removed at a time, the comparison of the results provided an estimate of the importance of the gene(s) for growth. Comparisons of FBA results with experimental data have shown that most microorganisms are extremely robust to single mutations and can often reroute their metabolism to circumvent the missing step without a significant loss of growth performance. This result agrees with the above-mentioned existence of redundancies in biological networks that lead to equivalent solutions. What the analysis of Chavali et al. (2008) demonstrates is that FBA offers a useful tool for identifying critical steps for which no redundancy exists. These steps are interesting candidate targets for antimicrobial drugs. Furthermore, comparing FBA results between different types of metabolic networks seems to be a promising approach to improve the selectivity of drug action.

Powerful as it is, FBA has the disadvantage that it does not use information about regulation. Therefore, it is unable to predict which steady state is finally reached by the cell in a given situation, or how and when it switches from one state to another. To obtain this information, a dynamic model is needed.
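The knockout logic used by Chavali et al. can be mimicked on the toy LP: force one flux to zero at a time and re-optimize. This is only a schematic sketch with invented bounds, not the genome-scale L. major reconstruction; an infeasible problem after a knockout plays the role of a predicted lethal deletion.

```python
# Schematic in silico knockout screen on the toy FBA problem
# (invented bounds: M1 max input 100, V2 minimum 20, M2 max input 60).
def max_v4(knockout=None, m1_max=100, v2_min=20, m2_max=60):
    """Maximize V4 by brute force; `knockout` forces one flux to zero."""
    best = None
    for v2 in range(m1_max + 1):
        for v4 in range(m2_max + 1):
            v1 = v2 + v4
            if knockout == "v1" and v1 != 0:
                continue
            if knockout == "v2" and v2 != 0:
                continue
            if knockout == "v4" and v4 != 0:
                continue
            if v1 <= m1_max and v2 >= v2_min:
                best = v4 if best is None else max(best, v4)
    return best  # None signals an infeasible ("lethal") knockout

wild_type = max_v4()            # 60
lethal = max_v4(knockout="v2")  # None: no feasible state without v2
```

Comparing the objective with and without each reaction flags the steps for which no redundant route exists, the candidate targets discussed in the text.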
3.3. Dynamic models

In order to obtain a complete description of the model dynamics, the dependence of the reactions on the variables of the system must be made explicit in Eq. (11.1). This means that the equation must be expanded to take into account how every reaction rate depends on the metabolites, proteins, and small molecules involved in it. A general formulation of this task is given in Eq. (11.3):

$$\frac{dX_i}{dt} = \sum_{j=1}^{p} s_{i,j}\, V_j(\mathbf{x}, I) \qquad (11.3)$$
Here x is the vector of time-dependent variables, such as metabolites and proteins, and I accounts for the external stimulus. Vj is now a mathematical function whose arguments are the variables x and I. Obtaining kinetic information to characterize this dependence is much more difficult than characterizing the stoichiometry of the biochemical network. First, the kinetics of each reaction depends on many different parameters. And second, the parameter values obtained in vitro will not necessarily reproduce the behavior of the reaction in vivo. A critical decision that must be made at this point is which kind of function or "rate law" to use for mathematically representing the processes. Traditional rate laws used in enzymology were defined to characterize the mechanism and molecular properties of an enzyme, not necessarily to be used for modeling. As a result, most rate laws have many more parameters than are required to reproduce the behavior of the enzyme with reasonable accuracy. The larger number of parameters is a significant issue, because it increases the amount of data required to fit the model. It also introduces too many degrees of freedom, which leads to overfitting: the model becomes so complex that it is unreliable. For this reason, the use of simplified rate laws is steadily gaining popularity. Examples are power-laws (Savageau, 1969; Vera et al., 2007a, 2010a; Voit, 2000), linlog models (Nikerel et al., 2006), and the so-called Saturable-Cooperative formalism (Sorribas et al., 2007). For our discussion in this chapter, we have chosen power-law models (Savageau, 1969; Torres and Voit, 2002; Voit, 2000), which are by far the most prevalent models for this purpose. Power laws have the general form given in Eq. (11.4):

$$V_i = k_i\, X_1^{f_{i,1}} X_2^{f_{i,2}} \cdots X_n^{f_{i,n}} = k_i \prod_{j=1}^{n} X_j^{f_{i,j}} \qquad (11.4)$$
Due to their similarity with mass action kinetics, the parameters k_i are called rate constants and the f_{i,j} are called kinetic orders, which here, however, may be any real numbers. The most remarkable property of power-law
Alberto Marin-Sanguino et al.
models is that the same power-law equation can represent different biochemically relevant dynamics by keeping its mathematical structure intact and changing only the numerical values of the kinetic orders. Specifically, (a) a negative value for a kinetic order represents inhibition; (b) a value of zero for a kinetic order indicates that the variable does not affect the modeled process; (c) a kinetic order equaling one means that the reaction behaves as in linear kinetics; (d) values between zero and one account for saturation-like behaviors in certain concentration ranges and are default settings for approximating Michaelis–Menten kinetics; (e) values higher than one often model cooperative processes or processes occurring on surfaces or in channels (Savageau, 1998; Vera et al., 2007a; Voit, 2000). The homogeneous structure of power-law representations makes it easy to construct complex mathematical models even if knowledge about the system is sparse, and also to test hypotheses about the structure and dynamics of the biochemical network (Akutsu et al., 2000; Veflingstad et al., 2004; Vera et al., 2010a; Vilela et al., 2009). It is convenient to collect all kinetic orders in a matrix that accompanies the matrix of stoichiometric coefficients (Fig. 11.6, bottom-right). Once the power-law rate law has been chosen as the mathematical representation, the task of formulating a model is reduced to that of finding numerical values for the parameters k and f. This task can be accomplished in different ways. One option, which is especially useful for purely metabolic networks, consists of interpreting each rate constant as the parameter that relates the steady state values of the metabolites to the rates in the stoichiometric system, as shown in Eq. (11.5):

k_i = \frac{V_i^0}{\prod_{j=1}^{n} \left( X_j^0 \right)^{f_{i,j}}} \qquad (11.5)
Therefore, if the steady state is known, k_i is fully determined by the values of the kinetic orders. This is one of the reasons why performing FBA is a helpful step toward a dynamic model. When good estimates for the metabolite concentrations are available, they can be fed into Eq. (11.5) together with a flux distribution obtained from FBA. In such a case, only the kinetic orders must be estimated. Fortunately, not every metabolite has a direct effect on every reaction rate. As a result, most of the entries in the matrix of kinetic orders will be zero, which significantly reduces the complexity of the parameter estimation. Furthermore, not all kinetic orders have the same impact on the dynamics of the model, and educated guesses for some of them are often sufficient to obtain a realistic model, at least as a first default that subsequently undergoes further refinement. Educated guesses are often used for the kinetic orders of substrates and products of each reaction (Voit, 2000, Chapter 5). For example, the classical biochemical assumption that the Michaelis–Menten constant of an enzyme is often close to the concentration of its substrate is analogous to setting the kinetic order to 0.5.
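This rule of thumb can be checked numerically: the effective kinetic order of a rate law at a given operating point is its log-log slope, and for Michaelis–Menten kinetics that slope is Km/(Km + S), which equals exactly 0.5 when S = Km. A minimal sketch (the Vmax and Km values are arbitrary illustrative choices, not taken from the chapter):

```python
import numpy as np

# The "kinetic order 0.5" rule of thumb: the effective kinetic order of a
# rate law at an operating point is its log-log slope. For Michaelis-Menten
# kinetics V = Vmax*S/(Km + S), that slope is Km/(Km + S), which is exactly
# 0.5 when the substrate concentration S equals Km.
Vmax, Km = 1.0, 2.0   # arbitrary illustrative values

def v(S):
    return Vmax * S / (Km + S)

def kinetic_order(S, h=1e-6):
    # numerical log-log slope d ln V / d ln S at the operating point S
    return (np.log(v(S * (1 + h))) - np.log(v(S))) / np.log(1 + h)

print(kinetic_order(Km))       # ~0.5: half-saturated enzyme
print(kinetic_order(9 * Km))   # ~0.1: nearly saturated enzyme
```

As the enzyme approaches saturation, the effective kinetic order falls toward zero, consistent with point (d) above.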
Biochemical Pathway Modeling Tools for Drug Target Detection
337
In the general case, the remaining unassigned values of rate constants and kinetic orders may be estimated by integrating quantitative experimental data in the way described in the following section.
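As a sketch of how Eqs. (11.3)–(11.5) fit together, the following code assembles a small power-law model from a stoichiometric matrix and a kinetic-order matrix, computes the rate constants from an assumed steady state as in Eq. (11.5), and integrates the resulting ODEs. The two-metabolite network and all numerical values are hypothetical placeholders, not the chapter's case study:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical two-metabolite chain with three fluxes (input -> X1 -> X2 ->).
S = np.array([[ 1, -1,  0],    # X1: produced by V1, consumed by V2
              [ 0,  1, -1]])   # X2: produced by V2, consumed by V3
F = np.array([[0.0, 0.0],      # V1: constant input flux
              [0.5, 0.0],      # V2 depends on X1 (saturation-like order 0.5)
              [0.0, 0.8]])     # V3 depends on X2

# Eq. (11.5): rate constants from an assumed steady state, e.g. measured
# concentrations X0 and an FBA-derived flux distribution V0.
X0 = np.array([2.0, 1.0])
V0 = np.array([4.0, 4.0, 4.0])
k = V0 / np.prod(X0 ** F, axis=1)

def rates(X):
    """Power-law rates, Eq. (11.4): V_j = k_j * prod_i X_i^{f_{j,i}}."""
    return k * np.prod(X ** F, axis=1)

def dXdt(t, X):
    """Eq. (11.3): dX/dt = S @ V(X)."""
    return S @ rates(X)

print(dXdt(0.0, X0))     # ~[0, 0]: derivatives vanish at the steady state

# A perturbed initial condition relaxes back toward the steady state.
sol = solve_ivp(dXdt, (0, 20), [3.0, 1.0])
print(sol.y[:, -1])      # close to X0
```

Because the rate constants were computed from the assumed steady state via Eq. (11.5), the derivatives vanish there by construction, and only the kinetic orders remain as genuinely free parameters.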
4. Model Calibration: Matching the Mathematical Model to Quantitative Experimental Data

Up to this point in our discussion, the model has been an abstract construct that encodes in mathematical terms the structure and some other qualitative features of the investigated biochemical network. In most cases, the structure alone is not sufficient, and the mathematical model must be matched against quantitative data that describe the dynamics of the system and its steady state configuration. Only if this matching can be accomplished in a satisfactory fashion does the model have the potential of becoming a reliable predictor of the outcomes of untested scenarios. In this process of model calibration, we follow a computational approach that ultimately assigns representative values to the parameters that characterize the model equations (rate constants, kinetic orders, etc.) in such a manner that the model reproduces the behavior of the system, as it is characterized by available experimental data. The underlying strategy is to determine those values for the model parameters that make the differences between the actual measurements and the values produced by the corresponding model simulation as small as possible (Fig. 11.8). This strategy can be represented by the mathematical statement in Eq. (11.6), which calls for minimizing the residual error between model and data:

\min F_{error} = \frac{1}{n_{exp} \, n_{var} \, n_{tp}} \sum_{k=1}^{n_{exp}} \sum_{j=1}^{n_{var}} \sum_{i=1}^{n_{tp}} \left( X_{k,j}^{exp}(t_i) - X_{k,j}(t_i) \right)^2 \qquad (11.6)
In this equation, n_exp is the number of experiments, n_var is the number of measured quantities (observables), and n_tp is the number of time points at which each observable was measured. X_{k,j}(t_i) is the value of the jth observable at the ith time point obtained by numerical simulation of the model for the kth experiment, and X_{k,j}^{exp}(t_i) is the corresponding value of the jth observable at the ith time point measured in the kth experiment. Experimental data used for model calibration can be of distinctly different types. Traditionally, the most dominant data, especially in metabolic systems, were kinetic properties and steady state configurations of the pathway for a given number of biological conditions. Newer data consist of time series measurements of some
Figure 11.8 Generic illustration of the idea of parameter estimation. An algorithm chooses parameter values that minimize the distance (solid grey arrow) between the simulated model predictions (M3sim) and the available experimental data (M3exp) for the considered experimental conditions.
or all variables considered in the mathematical model for a set of biological scenarios. Figures 11.9 and 11.10 illustrate both types of experimental data. Most actual parameter estimation efforts today combine a bibliographic search, in which information is extracted from publications where similar experimental conditions were investigated; manual tuning of the parameter set, including the fine-tuning of values until the model is able to simulate the experimental data with sufficient accuracy; and algorithmic data fitting. In the latter case, which can be more or less automated, the experimental measurements are fitted by using computational techniques based on regression, the mathematical principle of maximum likelihood, or a number of other techniques. These techniques very often assign values to the model parameters in an iterative procedure that continues until the difference between the experimental data and the model simulations for the investigated biological scenarios is minimized. A variety of optimization algorithms have been specifically designed or adapted for this purpose, and they use local search methods, evolutionary algorithms, global search methods, or a number of specialized techniques. Overviews of this topic include the recent papers by Ashyraliyev et al. (2009), Chou and Voit (2009), and Srinath and Gunawan (2010). In addition, a variety of software tools are available for calibrating ODE models (see Appendix). In order to obtain reliable results, the model calibration requires a sufficient amount of quantitative, high-quality data from specifically designed experiments. The parameter estimation procedure can and should be complemented with an identifiability analysis, which assesses whether
Figure 11.9 Data used for parameter estimation: Time series data. Illustration of a hypothetical time series experiment that is used to match the model parameters. The system was stimulated with some amount of input signal I (indicated with the "injection" symbol), and the time-profile of I (top-right panel) and several of the states defined in the model were "measured" at different time points after stimulation of the system (M1, M2, and M3, indicated with the "counter" symbol). Symbols (triangles, squares, and dots) in the bottom panels represent the artificial data, and lines indicate the model simulation results for optimal parameter values (left) and inadequate values (right). All units are arbitrary.
the parameter values have been estimated with sufficient certainty, or if other sets of values may be equally able to reproduce the available data, a situation that reduces the predictive abilities of the model. The literature contains some papers with excellent discussions on how to integrate both analyses (e.g., Balsa-Canto et al., 2010; Srinath and Gunawan, 2010).
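The calibration loop described above — simulate the model, evaluate the residual error of Eq. (11.6), and let an optimizer update the parameters — can be illustrated in a few lines. The one-variable model, noise level, starting guess, and choice of a Nelder–Mead optimizer are illustrative assumptions, not the chapter's case study:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

# Hypothetical one-variable model dX/dt = k1 - k2*X. The "experimental"
# data are simulated with known parameters and corrupted with noise,
# mirroring the chapter's synthetic calibration experiment.
t_obs = np.linspace(0, 5, 11)
k_true = (2.0, 1.0)

def simulate(k, X0=0.5):
    sol = solve_ivp(lambda t, X: [k[0] - k[1] * X[0]], (0, t_obs[-1]), [X0],
                    t_eval=t_obs, rtol=1e-8)
    return sol.y[0]

rng = np.random.default_rng(0)
X_exp = simulate(k_true) + rng.normal(0.0, 0.02, t_obs.size)

def f_error(k):
    """Eq. (11.6): mean squared residual over experiments, variables, and
    time points (here n_exp = n_var = 1, so only the time sum remains)."""
    return np.mean((X_exp - simulate(k)) ** 2)

fit = minimize(f_error, x0=[1.0, 0.5], method="Nelder-Mead")
print(fit.x)   # close to k_true = (2.0, 1.0)
```

Because these "data" were generated from known parameters, the quality of the fit can be judged directly; with real data, the identifiability analysis discussed above should accompany the fit.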
Figure 11.10 Data used for parameter estimation: Steady state data. Illustration of the quantification of steady state configurations of the pathway for different flux values for V1. The concentrations of metabolites M1, M2, and M3 were "measured" at a sufficiently long time after stimulation (modulation of V1) to allow the system to reach the steady state. Green squares represent the artificial data, and blue solid lines with symbols show the model simulation results for the estimated parameters.
For our case study, we designed two synthetic experiments to illustrate these concepts. The first computational experiment (Fig. 11.9) resulted in an artificial time series, where the concentrations of the metabolites M1, M2, M3 were “measured” at different time points after a transient stimulation of the system with an input signal I. In fact, these “measurements” are simulation results, which are artificially corrupted with noise. For the model calibration, we assume not to know the parameter values used in the simulation. Optimization methods like the ones described before are used to tune the value of the model parameters until model simulations achieve a satisfactory reproduction of the data. The left-bottom panel of Fig. 11.9 shows the comparison between experimental data and the ultimate model simulation for the chosen sets of parameters, while the bottom-right panel
of Fig. 11.9 compares data with a simulation obtained with an unsatisfactory set of parameter values. In the second experiment (Fig. 11.10) the steady state values of M1, M2, and M3 were “measured” for different values of the input flux V1 to the system. These types of data can be used as well to tune the model parameters. The bottom panel of Fig. 11.10 shows the comparison of experimental data (squares) and model simulations (line with solid dots) for the different values of V1 considered in the interval [0.25, 4]. A sufficiently validated and refined mathematical model can be used for model-based analysis of the biochemical network under investigation. Ideally, computer simulations of different biological scenarios have predictive abilities, thereby shortcutting the way to a better understanding of the system and also to detection of new drug targets. Different approaches for using mathematical models in drug target identification and drug discovery are discussed in the coming sections.
5. Predictive Model Simulations as a Tool for Drug Discovery

A mathematical model that has been sufficiently calibrated and validated has predictive abilities regarding not yet explored biological scenarios. Expressed differently, model simulations can be used for investigating the features of biological scenarios that have not yet been experimentally analyzed, thereby saving at least a portion of the costs of expensive and time-consuming experiments. In the context of basic research, this property of mathematical models has been employed for testing hypotheses concerning features associated with the dynamics and structure of the network, and also with the subcellular compartmentalization of interacting biomolecules (for good examples, see Alvarez-Vasquez et al., 2005; Rehm et al., 2006; or Vera et al., 2010c). In this sense, predictive simulations are already useful in drug discovery by pointing to critical properties of the network under investigation. Outside dynamical modeling, the use of mathematical model simulations is not new in the context of drug discovery. For decades, pharmacokinetics has been devoted to the model-based analysis and simulation of the mechanisms by which a drug enters the body, distributes among the organs and tissues, and is finally metabolized and eliminated (Rowland and Tozer, 2007). All these aspects are essential for determining effective and safe drug doses. In more recent times, the use of models in pharmacology has been extended toward pharmacodynamics, which uses simulation techniques to study the primary physiological effects of the drug and to fine-tune the drug dose with respect to the desired physiological effect (Rowland and Tozer, 2007).
With the emergence of systems biology, the use of mathematical modeling in drug discovery can advance one step further. Namely, it is becoming feasible to expand the modeling approach of pharmacokinetics and pharmacodynamics by including detailed dynamic models that describe the intracellular pathways that are targeted or affected by the drug of interest and characterize issues related to their regulation in the cell. For this expansion, an integrative approach should include: (a) modeling of drug administration and degradation dynamics; (b) modeling of the dynamics of drug modulation of the targeted biomolecules, which are usually proteins or membrane-bound protein receptors; and (c) modeling of the dynamics of the intracellular networks (signaling, genomic, or metabolic networks) that are affected by the targeted biomolecule (Fig. 11.11). Simulations with these types of extended models would allow the fine-tuning of drug doses by tracing not only the metabolic fate of the drug itself, but also the effects of drug administration on critical intracellular processes that are altered by it. For our case study, we assume that a drug X is able to inhibit the activation of P through the input signal I. In this way, the administration of an adequate dose (adequate in timing and amount) of drug X can counteract the effect of the overexpression of P in a hypothesized
Figure 11.11 Integration of different predictive simulation approaches in drug discovery and drug dose determination.
pathological condition of the system. It is easy to perform model simulations to assess the effects of different types of drug administration on the biochemical network. For instance, the center-left panel of Fig. 11.12 shows the simulated dose profile of drug X, where we assume two doses per day (every 12 h) and a drug half-life of approximately 6 h. The center-right panel shows how model simulations can be used to investigate the effects of drug administration on the concentrations of the different components of the system (M1, M2, M3, and P*). The black solid horizontal line represents the alleged physiologically tolerable threshold for the concentration of M2 (here set to 150% of the concentration in healthy subjects). In the simulated drug dosing experiment, the values of M2 always remain below the threshold. The bottom-left panel of Fig. 11.12 exhibits predicted model values for the network compounds in healthy subjects (HS), the pathological state (PS), as well as average values after the administration of the drug as described before. Finally, the bottom-right panel of Fig. 11.12 shows the predicted average values of M2 for different doses (twice a day) of the drug compared with HS and PS. These simple results indicate how model simulations can be used to establish adequate or minimally sufficient drug doses, and they also show the effects of the drug administration on the other network components. In recent years, this promising approach for testing drug efficacy has already been applied in realistic situations. For instance, Lai et al. (2009) developed a mathematical model combining experimental data for different aspects of murine erythropoiesis. This process of red blood cell generation and differentiation is closely regulated by the blood levels of the hormone erythropoietin, Epo.
The model included three kinds of biological events: (a) subcellular events associated with the Epo-stimulated JAK2-STAT5 signaling activation (Vera et al., 2008); (b) Epo-regulated dynamics of erythrocyte differentiation; and (c) hypoxia-mediated regulation of Epo blood levels. In addition, the model included equations accounting for the dose-dependent injection of exogenous Epo and the later dynamics of the hormone, which was used in the study as a drug to fight a certain kind of anemia. With the help of their quantitative model, the authors performed several predictive simulations to establish the appropriate doses of exogenous Epo injection that were required to compensate for the pathological downregulation of the crucial signaling proteins. Another example is a series of articles by Qi and collaborators that were geared toward a better understanding of neurodegenerative diseases like Parkinson’s (Qi et al., 2008a) and schizophrenia (Qi et al., 2008b, 2010a). The most prominent common link between the two diseases is the neurotransmitter dopamine, which becomes depleted in the former and is found to be excessive in the latter. Thus, the authors began by setting up a complex dynamic model of dopamine metabolism in the brain, which accounted for synthesis, compartmentalization, degradation, and the known regulatory mechanisms. While the structure of the dopamine pathway system was fairly well known, the identification of parameter values was a significant challenge,
Figure 11.12 Drug dose analysis using model simulations. Center-left panel: simulated time-profile of drug X; center-right: simulation of metabolites and protein concentrations induced by the administration of the drug. Bottom-left panel: concentrations of metabolites and activated enzyme P* in healthy subjects (HS), under pathological (PS) conditions, and with drug administration (wD); bottom-right: predictions using model simulations for the value of M2 in response to different doses of drug; Do represents the situation simulated in the center-left panel.
because metabolite levels were needed in certain cells within relatively small, defined areas of the brain. Interestingly, the authors were able to estimate parameters based on semiqualitative expert opinion and generic experience with models of this type, and the resulting model turned out to be quite predictive, as judged by comparisons with a variety of biological and clinical findings (Qi et al., 2008b). The model was subsequently used to identify the key determinants of the dopamine system, which were seen as prime candidates for drug targets. This identification consisted essentially of a global sensitivity analysis (see next section) and was performed through large-scale Monte Carlo simulations that demonstrated which steps, and combinations of steps, had the greatest influence on the main metabolite, dopamine, and also on toxic by-products (Qi et al., 2009). The results reliably revealed enzymatic steps that are targeted by current drug treatments and suggested combination treatments, which were predicted to incur less severe side effects. Because dopamine metabolism only controls the neurotransmitter, but not the signal transduction process itself, the authors also studied how normal and perturbed dopamine signals are interpreted by receptor neurons (Qi et al., 2010b). Another promising application of predictive systems models in drug discovery is chronotherapy. The idea here is that mathematical modeling combined with quantitative data can be used to establish the optimal timing of drug delivery in such a way that the treatment occurs in positive synergy with the internal rhythms of the organism, thereby reducing toxicity and increasing the desired therapeutic effect. In line with this goal, Altinok et al. (2009) recently explored the idea of applied chronobiology in cancer therapy, where the efficacy of cytotoxic agents is critically affected by the regulation of cell-cycle-related processes.
The authors developed a mathematical model to simulate the distribution of cell cycle phases and their interactions with the circadian clock molecules and used this model to investigate the optimal circadian patterns of administration of two anticancer drugs, 5-fluorouracil (5-FU) and oxaliplatin (l-OHP). In their analysis they compared various patterns of drug administration that differed in the timing of maximum drug delivery and looked for those settings displaying minimum cytotoxicity for the population of normal cells but high cytotoxicity toward a second cell population of tumor cells.
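The dosing schedule assumed earlier in the case study — one dose every 12 h and a drug half-life of about 6 h — corresponds to repeated boluses with first-order elimination, which is straightforward to simulate. A minimal sketch (the dose amount and units are arbitrary assumptions):

```python
import numpy as np

# Repeated-bolus profile with first-order elimination: each dose decays
# exponentially with the rate constant implied by the 6-h half-life, and
# doses given every 12 h are superimposed.
half_life = 6.0                        # hours
k_elim = np.log(2.0) / half_life       # elimination rate constant
dose, interval = 1.0, 12.0             # arbitrary units; hours between doses

t = np.arange(0.0, 72.0, 0.1)          # three days, 0.1-h resolution
D = np.zeros_like(t)
for t_dose in np.arange(0.0, t[-1], interval):
    on = t >= t_dose
    D[on] += dose * np.exp(-k_elim * (t[on] - t_dose))

# With the interval equal to 2 half-lives, the profile approaches a
# periodic steady state with peak = dose / (1 - 2**-2) = 4/3 of one dose.
peak_ss = dose / (1.0 - 2.0 ** (-interval / half_life))
print(D.max(), peak_ss)
```

A profile of this kind can then drive the inhibition of P activation in the pathway model, coupling the pharmacokinetic input to the intracellular network as in Fig. 11.12.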
6. Model Sensitivity Analysis as a Tool for Detecting Critical Processes in Biochemical Networks

Several computational and analytical tools can be used to analyze a biochemical network in search of potential drug targets. The underlying idea is to employ the predictive abilities of the mathematical model to find the set of biochemical processes within the network whose modulation is
expected to affect the dynamics of the network most significantly. These biochemical processes later become priority candidates for drug targeting or other therapeutic approaches. Toward this end, the computation and analysis of model sensitivities can be an extremely useful tool. Sensitivity analysis is the study of how the variation in the critical outcomes of a given biochemical system can be categorized and assigned, qualitatively or quantitatively, to different sources of variation in the system (Saltelli et al., 2000). In the context of a mathematical model, we typically associate this feature with changes in the values of model parameters and look for those model parameters for which a numerical variation significantly affects critical responses of the system. These critical (i.e., especially sensitive) parameters are associated with processes in the network that are therefore considered particularly promising as potential candidates for drug-mediated modulation (Nikolov et al., 2010). We can distinguish two approaches to sensitivity analysis. In local sensitivity analysis we search for the parameters for which small variations around their nominal value have a significant influence on the critical responses of the biochemical pathway. The analysis is "local" because it is performed in the mathematical vicinity of a preferential or nominal configuration of the system, and also because only one parameter is perturbed at a time. In the conventional methods for estimating local sensitivities, the value of a given parameter is slightly modified (e.g., by 1–5%, up or down) and the model is used to compute the resultant change in the critical response of interest. The sensitivity is calculated with the equation

S_k^X = \frac{\Delta X}{\Delta k} = \frac{X(k + \Delta k) - X(k)}{\Delta k}
where X represents a response variable and k the modified parameter; Δ represents a small change. In the case of power-law models, sensitivities of the model that are related to steady state configurations of the system can be calculated using simple algebraic equations, which renders sensitivity analysis quick and computationally simple for these models (Voit, 2000). In global sensitivity analysis the search for the critical parameters considers larger regions of feasible values for the parameters rather than preferential values. In addition, several, many, or all parameters may be varied simultaneously (for a comprehensive discussion, see Saltelli et al., 2000). This approach provides an initial impression of those biochemical processes whose modulation critically affects the system (see, e.g., Qi et al., 2009). However, highly nonlinear biochemical systems may have such a structure that the parameter influence, or the criticality of the process, differs substantially among large regions of parameter values. Thus, one may find distinctive regions with rather different response features. In this case, a unique global sensitivity value for each parameter may not reflect
biologically significant criticality and additional computational techniques must be used to dissect the relevant region of the parameter space into subspaces with similar meaningful sensitivities. Sensitivity analysis can be used for the detection of potential drug targets, because insensitive parameters would not have much effect, even if they could be altered therapeutically. This feature of sensitivities suggests a ranking of biochemical processes, along with the model parameters representing them, in terms of how effective their modulation might be for altering the input–output relationships of the system. The precise effect of any such modulation must be later investigated with additional simulations. Expressed differently, sensitivity analysis alone can point out where the most sensitive parameters are found in the model, but not necessarily what the precise effect of their modulation is; this latter aspect must be analyzed with additional simulations. As an illustration, we performed a sensitivity analysis of the mathematical model for our case study with the goal of identifying how variations in model parameters would affect the values predicted for the metabolites and P* (Fig. 11.13, top panel). The sensitivities were calculated for the pathological condition (PS) and the primary aim was to analyze the effect of parameter modulation on the value of M2, which is the metabolite critically accumulating in the pathological configuration of the system. The sign of the sensitivities is of great relevance, because it indicates whether an increase or a decrease in a parameter value has a positive or negative effect. A positive sensitivity value predicts that an increase in the parameter values raises the targeted response variable, while a negative sensitivity predicts a reduction. By inspecting the calculated results we see that critical parameters for affecting the value of M2 are k1, k4, k7, and k8. 
The latter two correspond to the “trivial solution” of the problem, because they suggest that, by inhibiting P activation or enhancing P* deactivation, we will affect the level of M2, which is quite obvious. However, our analysis also suggests two other means of modulating the concentration of M2, which are associated with the modification of the input flux of M1, which is governed by k1, and/or the catalytic transformation of M3, which is governed by k4. The sensitivity for k1 has a negative sign, which suggests that if we want to reduce the value of M2 we have to increase the value of the input flux of M1. This could be achieved by overexpressing the enzyme regulating the flux V1, which however is not realistically feasible, or by increasing the uptake of this metabolite, which could be accomplished with dietary restrictions. The sensitivity with respect to k4 is positive, which indicates that reduction of the flux V4 is expected to induce the downregulation of M2. This reduction of V4 could be implemented, for example, through competitive inhibition of P* or inhibition of any concurrent cofactor. The local sensitivity analysis alone does not allow us to judge whether the modulation of the sensitive biochemical interaction suffices to restore
Figure 11.13 Sensitivity analysis is used to detect critical processes within the biochemical system. Top panel: local model parameter sensitivities calculated for the settings of the pathological condition (PS). The results focus on the steady state values of the metabolites M1, M2, and M3 and demonstrate how the active fraction P* of the protein is affected by modulation of the parameters k1–k8. Bottom-left panel: simulated values of M2 when the critical flux V1 is upregulated; V1o is the nominal value (V1o = 16). Bottom-right panel: simulated values of M2 when the catalytic transformation of M3 is inhibited by changing the value of k4 (k4, 0.5k4, 0.25k4).
acceptable values of M2, and further computational simulations are required (Fig. 11.13, bottom panel). Such simulations suggest that doubling the incoming flux V1 of M1 can restore acceptable values of M2 (Fig. 11.13, bottom-left panel), but also that a 50% inhibition of the flux V4 can reduce M2 values to suitable levels. von Kriegsheim et al. (2009) used this technique to identify critical interactions regulating the timing and peaking of ERK signaling, a network fundamental to the regulation of important cellular events like cell proliferation, which appears to be corrupted in a number of cancers. In a nutshell, these authors set up a rather large computational model using quantitative
proteomic data and other experimental techniques. With this model, they functionally characterized the set of ERK interacting partners and their regulation of the fine tuning of ERK signaling features, such as signal duration, intensity, and localization. The model was later analyzed using sensitivity analysis, where the aim was to detect the critical model parameters and processes, whose modulation may significantly change the features of ERK signaling. Rather than finding a unique critical interaction controlling the network performance, their analysis revealed a large set of important parameters. The authors proposed that these interactions may exert synergistic regulation of the system and suggested that coordinated modulation of several processes (e.g., via multifactorial drug treatments) may be an effective way to control ERK signaling in pathological conditions. In a recent research study very close to drug discovery, Schoeberl et al. (2009) used mathematical modeling and sensitivity analysis to therapeutically target the signaling system of ErbB receptors and the PI3K signaling pathway, which is crucial for the development of anticancer drugs. Their team, which consisted of modelers and experimentalists from a pharmaceutical company, designed a mathematical model and calibrated it with quantitative experimental data describing the ErbB/PI3K signaling network. Sensitivity analysis was used to identify the critical receptors in the family related to the activation of AKT, an important protein downstream of the signaling system. They found that ErbB3, a member of the ErbB receptor family, has a dominant role in this activation process and is a promising target for drug inhibition. They later successfully tested a new potential drug that targets the receptor by inhibiting its phosphorylation (Fig. 11.14). 
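The one-parameter-at-a-time finite-difference computation used in studies like these can be sketched generically. The closed-form steady-state expression below is a hypothetical stand-in for a full model simulation (e.g., the steady-state value of M2 as a function of two rate constants), and the sensitivities are reported in relative (logarithmic) form so that their signs and magnitudes can be compared across parameters:

```python
import numpy as np

# Hypothetical stand-in for a model simulation: the steady-state value of
# a response variable (say, M2) as a closed-form function of two parameters.
def steady_state_M2(k):
    k1, k4 = k
    return k1 ** 0.8 / k4

def local_sensitivities(f, k_nom, rel_step=0.01):
    """One-at-a-time finite differences, as in the sensitivity equation
    above: perturb each parameter by 1% around its nominal value and
    return the relative sensitivities (dX/dk) * (k/X) = d ln X / d ln k."""
    X0 = f(k_nom)
    S = np.empty(len(k_nom))
    for i in range(len(k_nom)):
        k = np.array(k_nom, dtype=float)
        dk = rel_step * k[i]
        k[i] += dk
        S[i] = (f(k) - X0) / dk * (k_nom[i] / X0)
    return S

S = local_sensitivities(steady_state_M2, [2.0, 1.0])
print(S)   # ~[0.8, -1.0]: raising k1 raises M2, raising k4 lowers it
```

The signs carry the same meaning as in the case study above: a positive sensitivity predicts that increasing the parameter raises the response, a negative one predicts a reduction.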
For complex biochemical networks, sensitivity analysis alone may not be sufficient, because nonlinear phenomena such as sustained oscillations or multistability can emerge or disappear. Such phenomena are associated with the passage of a parameter across a critical value, called a bifurcation point. These points may be treacherous for sensitivity analysis, because not only does the sensitivity value depend on the particular value of the parameter, but the overall dynamics of the network may also be altered; for example, an oscillation may disappear entirely. Nikolov et al. (2010) proposed combining sensitivity and bifurcation analysis in order to detect the critical processes regulating the main outcomes of a biochemical network, as well as changes that qualitatively alter the dynamics of the system. They used this concept to detect critical interactions in a model describing the differentiation of red blood cells controlled by the Epo-JAK2-STAT5 signaling pathway and to analyze whether modulation of their representative model parameters could induce or eliminate sustained oscillations.
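The local sensitivity computation underlying these approaches can be sketched in a few lines of code. The two-metabolite toy model below is invented for illustration (it is not the ERK or Epo-JAK2-STAT5 model discussed above): steady states are found by crude numerical integration, and each parameter is perturbed by 1% to obtain normalized (relative) sensitivities of a chosen output.

```python
# Sketch of local sensitivity analysis for an ODE pathway model.
# The model is a hypothetical two-metabolite cascade; all parameter
# names and values are invented for illustration.

def rates(state, p):
    """dM/dt for a toy cascade: constant input, two first-order steps."""
    m1, m2 = state
    dm1 = p["v_in"] - p["k1"] * m1
    dm2 = p["k1"] * m1 - p["k2"] * m2
    return dm1, dm2

def steady_state(p, dt=0.01, steps=20000):
    """Crude steady state by forward-Euler integration to long times."""
    m1, m2 = 0.0, 0.0
    for _ in range(steps):
        dm1, dm2 = rates((m1, m2), p)
        m1 += dt * dm1
        m2 += dt * dm2
    return m1, m2

def relative_sensitivities(p, output=1, delta=0.01):
    """S_k = (dX/X)/(dp/p): relative change of the steady-state output
    per relative change in each parameter (finite differences)."""
    base = steady_state(p)[output]
    sens = {}
    for name in p:
        perturbed = dict(p)
        perturbed[name] *= 1.0 + delta
        x = steady_state(perturbed)[output]
        sens[name] = ((x - base) / base) / delta
    return sens

params = {"v_in": 2.0, "k1": 1.0, "k2": 0.5}
print(relative_sensitivities(params))
```

For this toy chain, the steady-state output M2 equals v_in/k2, so the scan reports a relative sensitivity near +1 for v_in, near -1 for k2, and near 0 for k1; parameters with large-magnitude sensitivities are the candidate drug targets. Near a bifurcation point, as discussed above, such finite-difference values become unreliable and should be complemented by bifurcation analysis.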
350
Alberto Marin-Sanguino et al.
Figure 11.14 Sensitivity analysis as a tool for detecting drug targets in ErbB/AKT signaling. Schoeberl et al. (2009) developed a data-driven model to describe the activation of the PI3K/AKT signaling system through the ErbB receptor family. Once the model was characterized, the sensitivities of (active) phospho-AKT (pAKT) to changes in the various model parameters were computed (top-right panel). The authors found that the model parameter representing the total amount of ErbB3 that can be activated has the highest positive sensitivity, suggesting it as a promising drug target for inhibition. Guided by this finding, the authors later developed and tested a specific inhibitor that targets the receptor ErbB3. The bottom-right panel shows the comparison of the model simulations (black bars) and experimental data (gray bars) for different doses of the tested inhibitor (figures are pictorial representations; for precise data values, see Schoeberl et al., 2009).
7. Drug Target Detection Through Model Optimization

The notion of using mathematical modeling and analysis of biochemical networks as a tool for identifying potential drug targets can be expanded further with more sophisticated approaches. These approaches often require more information about the biochemical origins of the investigated disease. For instance, it is useful to know which specific biochemical process appears to be altered and which metabolites or proteins are affected by this alteration. As an example of how these more advanced approaches can be derived, we discuss here a method proposed by Vera et al. (2007b, 2010b), in which mathematical modeling of biochemical networks was combined
with optimization techniques to detect potential drug targets in metabolic diseases. The underlying idea is simple: in a complex metabolic network that is misregulated by a disease, it may be possible to use drugs specific for a pertinent enzyme to redirect some critical fluxes in a manner that restores all critical fluxes and metabolites in the network to values similar to those found in healthy subjects (HS). As a simplified example, Fig. 11.15 shows M1 and M2 as critical metabolites whose concentrations are misregulated in the investigated disease. A clinical intervention with drugs and other therapeutic strategies should try to affect the network such that the actual pathological values for M1 and M2 (PS) return to values similar to those in HS. While conceptually simple, the actual implementation of such a treatment strategy is difficult for highly complex, interconnected, and regulated biochemical networks, and mathematical models and computational techniques become indispensable. The strategy proposed by Vera et al. (2003, 2007b, 2010b) works in the following manner:
Step 1. A mathematical model describing the dynamics of the metabolic network under analysis is set up and calibrated with experimental data.
Step 2. Pertinent biomedical knowledge about the origin and effects of the disease is retrieved and processed. Critical fluxes and metabolites that are unbalanced in the disease condition are identified, and their values in HS are assessed.
Step 3. Model simulations are used to investigate which biochemical processes in the network must be modulated and how this might be accomplished, for
Figure 11.15 Conceptual scheme of model optimization-based treatment techniques: some critical biochemical processes are modulated by enzyme-specific drug inhibition or another therapeutic approach, affecting the network in such a manner that the values of critical fluxes and metabolites are restored to values similar to those found in healthy subjects (HS).
instance, through inhibition or activation, in order to move the system from the current values of critical metabolites and fluxes toward health.
Step 4. Simulations may consider therapeutic modulations of one reaction at a time or multiple modulations simultaneously. The simulation results offer insights guiding possible strategies for developing potential multifactorial treatments.
This method requires considerable computational effort if the disease system is complex. An elegant manner of dealing with this challenge is to convert the task into an optimization problem over the critical parameters and processes. In the Optimisation Drug Discovery Program suggested in Vera et al. (2007b, 2010b), the search for these critical (drug-targeted) reactions is represented as an optimization program containing the following elements:
1. A mathematical description of the functional origin of the disease.
2. A mathematical formulation of the notion of moving the current pathological values of the critical metabolites and fluxes toward the desired healthy state. For example, one may execute the following minimization:

$$ \min \; \sum_{j=1}^{p} \lambda_j \left| \frac{X_j - X_j^{HS}}{X_j^{HS}} \right| + \sum_{i=1}^{l} \lambda_i \left| \frac{J_i - J_i^{HS}}{J_i^{HS}} \right| $$
where X_j and X_j^{HS} denote the current and healthy values of the metabolites, and J_i and J_i^{HS} are the corresponding values for the fluxes, respectively. The values of the positive weights λ_j and λ_i are proportional to the relative importance of each key metabolite and flux. The objective function itself formalizes the minimization of the differences between the values of all critical metabolites and fluxes computed by the model in the pathological (X and J) and the healthy scenarios (X^{HS} and J^{HS}). Every term in the equation is scaled by its value in the healthy condition so that all contributions have comparable weights.
3. A number of conditions ensuring that the network configuration obtained in the optimization indeed represents a physiologically feasible and stable steady state. Additional conditions guarantee that the proposed interventions themselves are physiologically feasible.
A more detailed mathematical description of the Optimisation Drug Discovery Program can be found in the Appendix. This optimization program can be solved using standard methodologies for nonlinear optimization (Banga, 2008) or via linear optimization techniques that benefit from the computational advantages of power-law models (Marin-Sanguino and Torres, 2003; Marin-Sanguino et al., 2007; Torres and Voit, 2002; Voit, 1992). The method can be applied in an iterative manner: at
each step we allow for the modulation of a single process (single enzyme target) in the network and compute the solution. This solution consists of a set of computationally predicted values for metabolites, initial substrates, and the modulation level of the targeted enzyme. One selects and stores only those solutions for which the computed values of critical metabolites and fluxes are sufficiently close to the ones in the healthy condition (HS). The method may also be used to analyze combined treatments that consider parallel targets consisting of two or more processes in the network, although one should keep in mind that multifactorial treatments may increase or decrease the potential for adverse side effects. In our case study, we try to identify which biochemical interactions in the network may be targeted by drug inhibitors in such a way that the network is returned to its healthy condition. Specifically, we look for means of inhibiting network interactions that primarily restore the value of M2, which accumulates to toxic levels in the pathological scenario. A secondary goal is to restore the values of the other metabolites (M1 and M3) as much as possible. These goals may be translated into mathematical terms, for instance, with the following objective:

$$ \min \; 1\cdot\left|\frac{M_2 - M_2^{HS}}{M_2^{HS}}\right| + 0.05\left|\frac{M_1 - M_1^{HS}}{M_1^{HS}}\right| + 0.05\left|\frac{M_3 - M_3^{HS}}{M_3^{HS}}\right| $$

This minimization task forces the network to move to a configuration that restores the healthy values of M1, M2, and M3 as much as possible. We have assigned a much higher weighting factor to M2 (1 vs. 0.05) because it is the critical metabolite showing toxic concentrations in the PS. We first consider the eight possible single-target strategies and proceed as explained before, computing one solution per simulation. The results (Fig. 11.16) clearly demonstrate that only some of the network interactions are promising targets for drug inhibition.
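The single-target scan just described can be sketched as follows. The linear three-metabolite pathway, its rate constants, and the "disease" (a 50% loss of activity in the reaction clearing M2) are all hypothetical stand-ins for the case-study model; the weighted objective, however, follows the 1 vs. 0.05 weighting given above.

```python
# Sketch of a single-target inhibition scan on a hypothetical linear
# pathway (k1: constant input to M1, k2: M1->M2, k3: M2->M3,
# k4: M3 degradation). All values and the "disease" are invented.

HEALTHY = {"k1": 1.0, "k2": 1.0, "k3": 1.0, "k4": 1.0}
DISEASE = dict(HEALTHY, k3=0.5)          # impaired clearance of M2

def steady_state(k):
    """Analytic steady state of the linear chain: every flux equals
    the input k1, so each pool is input over its consumption rate."""
    return {"M1": k["k1"] / k["k2"],
            "M2": k["k1"] / k["k3"],
            "M3": k["k1"] / k["k4"]}

HS = steady_state(HEALTHY)               # healthy reference values

def objective(state, weights={"M1": 0.05, "M2": 1.0, "M3": 0.05}):
    """Weighted relative deviation from the healthy state; M2 dominates."""
    return sum(w * abs(state[m] - HS[m]) / HS[m] for m, w in weights.items())

def scan_single_targets(grid=20):
    """Inhibit each reaction alone over a grid of inhibition factors in
    (0, 1]; return the best (target, factor, objective) triple."""
    best = None
    for target in DISEASE:
        for i in range(1, grid + 1):
            factor = i / grid
            k = dict(DISEASE)
            k[target] *= factor
            score = objective(steady_state(k))
            if best is None or score < best[2]:
                best = (target, factor, score)
    return best

print(scan_single_targets())
```

In this toy setting the scan selects inhibition of the input reaction (k1 scaled to 0.5), which restores M2 exactly at the cost of small deviations in M1 and M3; extending the same loop to pairs of targets gives a multifactorial scan analogous to Fig. 11.17.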
Strategies inhibiting V7 (trivial solution), V4, and V5 can entirely restore the value of M2 and of the other two metabolites. In addition, inhibition of the interaction V2 produces a partial and possibly acceptable recovery of the values of M2, which, however, doubles the target value of M1. We also considered multifactorial strategies that combine the drug-mediated inhibition of two interactions. The results of the set of 8 × 8 combinations (half of them redundant) appear in Fig. 11.17. The results in terms of M2 do not vary in comparison with the single-inhibition solutions, because in this particular case only combinations of the single-inhibitor solutions (V2, V4, V5, V7) with other potential targets restore healthy M2 levels. Different solutions show different patterns regarding the concentrations of M1 and M3, which suggests potential combination therapies with lower negative effects on M1 and M3 levels. One must note that these results
Figure 11.16 Solutions of the drug target detection optimization program when single inhibition strategies are considered. Each set of bars accounts for the drug-mediated inhibition of one biochemical step (e.g., V1).
Figure 11.17 Solutions of the drug target detection optimization program where double inhibition strategies are explored. Each square reflects the combined inhibition of the network interactions indicated by the row and column. Values in the solution for M1 (left), M2 (center), and M3 (right).
only consider drug-mediated inhibition. If we also considered positive regulation of some or all of the processes, the set of promising solutions would grow. For instance, one could consider solutions that increase the value of V1, as suggested by the earlier sensitivity analysis. Vera et al. (2007b) used this methodology to identify potential enzyme drug targets for diseases associated with human purine metabolism and, in particular, hyperuricemia (Fig. 11.18). Hyperuricemia is a usually nonlethal
Figure 11.18 Combining mathematical modeling of biochemical networks and optimization to detect drug targets. Vera et al. (2007b) combined mathematical modeling and optimization to identify drug targets in hyperuricemia, a disease related to human purine metabolism. Toward this end, a mathematical model describing purine metabolism (left panel) was analyzed with the methodology described in the text. The authors looked for drug targets whose inhibition could reduce uric acid (UA) to healthy levels. A number of potential targets were detected and are indicated in the biochemical diagram with a red “injection” symbol. The panel at the top-right compares the levels of uric acid in the healthy state (HS) and the pathological condition associated with hyperuricemia (PS). Some of the solutions suggest a change in “diet”; “Vxd+diet” represents the combination of diet and inhibition of the enzyme xanthine oxidase, which is similar to the conventional treatment based on the drug allopurinol; and “Vampd” suggests the inhibition of AMP deaminase. The analysis also allowed the investigation of combinations of drug targets whose simultaneous inhibition reduces levels of UA to the HS condition. For instance, inhibition of xanthine oxidase and AMP deaminase can be combined to reduce UA levels with intermediate levels of inhibition of both enzymes (bottom-right panel; the black area displays combinations of inhibition that reduce levels of UA to the healthy state).
disease that is in some cases associated with a functional defect in the enzymatic activity of phosphoribosyl pyrophosphate synthetase, an enzyme that regulates the de novo synthesis of purines. As a result of this defect, the purines are unbalanced and the level of uric acid, which is here considered the critical metabolite, increases. This increase, in turn, triggers acute episodes of arthritic pain and nephropathy. An earlier mathematical model of purine metabolism (Curto et al., 1997, 1998a,b) was used to detect critical enzymes whose inhibition with specific drugs could restore the healthy level
of uric acid without compromising the physiological levels of the other metabolites in the network. The method detected six potential single-enzyme targets, one of them coinciding with the conventional clinical treatment using the drug allopurinol, while two of them were entirely unexpected. When considering potential multifactorial treatments, numerous possible solutions involving the parallel inhibition of two enzymes were detected.
8. One Step Further: Combining Mathematical Modeling with Drug Screening via Protein Docking-Based Techniques

As we mentioned at the beginning of this chapter, the traditional paradigm of drug discovery is becoming increasingly problematic due to the demanding criteria that new drugs must satisfy and the complexity of many diseases. We are suggesting here that mathematical modeling combined with virtual screening and followed by in vitro and in vivo experiments is a promising approach for drug discovery. In this approach, the potential drug targets identified by the mathematical modeling techniques are chosen preferentially as the most promising candidates in a computer-aided strategy that relies on the analysis of the three-dimensional (3D) structure of the targeted protein for the design or selection of new drug molecules. The procedure can be summarized as follows:
Structural analysis of the target and target site identification. The 3D structure of the targeted protein is generated based on X-ray crystallography and NMR data or homology modeling. Next, an analysis of ligand binding sites on the surface of the target protein is performed to find critical docking sites. Ideally, the target site should be a pocket or protuberance on the protein surface surrounded by amino acid residues with hydrogen bond donors, hydrogen bond acceptors, and/or the capability of hydrophobic interactions. Complex structures stored in the Protein Data Bank (PDB) can be invaluable for this target site prediction if initial small-molecule inhibitors are available. The output of this analysis is a number of protein sites where docking of a molecule may inhibit the activity of the targeted protein.
Drug design. After detailed structural analysis and binding site prediction for the target protein, several computational methods may be used to select active inhibitory molecules. Focused screening inspects known molecules that bind to the site of interest.
These small molecules are modified to become inhibitors based on maximizing complementary interactions with the target site (Chan et al., 2001). In virtual screening, large libraries of small compounds are computationally docked onto the site of interest on the target molecule in order to identify the molecules with highest
binding affinities. In a process called de novo generation, small-molecule fragments are first retrieved from fragment libraries and positioned in the binding site, and the fragments are then linked to build up a ligand molecule. Most docking analysis protocols have two major components: (1) docking methods that generate optimal configurations of the ligand in the binding cavity of the protein; and (2) scoring functions, generally used to estimate the binding affinity of a ligand based on the candidate ligand pose geometry and its docking with the target protein structure. The output of this process is a number of potential drug molecules with predicted inhibitory activity against the investigated protein.
Lead optimization and prediction of pharmacokinetic properties. The leads discovered by drug design methods need to be optimized in order to produce a list of candidates with low toxicity and improved bioavailability. Various in silico prediction tools are used to evaluate drug-likeness by analyzing the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties of the molecule. These prediction methods usually try to build a functional relationship between a set of molecular features and given ADMET properties using Quantitative Structure-Activity Relationships (QSARs). Recently, substructure pattern fingerprint recognition techniques have been combined with support vector machines (SVMs) for the estimation of ADMET properties (Shen et al., 2010). ADMET properties that are often evaluated include human intestinal absorption, intestinal permeability, aqueous solubility, oral bioavailability, blood–brain barrier penetration, cytochrome P450 inhibition, hepatotoxicity, plasma protein binding, metabolic stability, active transport, and toxicity. The final output of the process is a selection of compounds for experimental testing and the prediction of potential toxicity and side effects (Fig. 11.19).
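A minimal, self-contained flavor of the screening step can be given with fingerprint similarity ranking, one of the simplest ligand-based filters used alongside docking. The bit-set "fingerprints" below are invented stand-ins; real substructure-pattern fingerprints would come from a cheminformatics toolkit.

```python
# Sketch of fingerprint-based similarity ranking for virtual screening.
# Each "fingerprint" is the set of indices of substructure patterns
# present in a molecule; all molecules and patterns here are invented.

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two substructure bit sets:
    |A & B| / |A | B|, in [0, 1]."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

# Hypothetical fingerprints for a known active and a small library.
known_active = {1, 4, 7, 9, 12}
library = {
    "cand_A": {1, 4, 7, 9, 13},   # shares most patterns with the active
    "cand_B": {2, 5, 8, 11, 14},  # shares none
    "cand_C": {1, 4, 9, 20, 21},  # partial overlap
}

# Rank library members by similarity to the known active, best first.
ranked = sorted(library,
                key=lambda name: tanimoto(library[name], known_active),
                reverse=True)
print(ranked)  # most similar candidate first
```

Candidates most similar to a known active are prioritized for docking and ADMET evaluation; bit vectors of this kind are also the typical input features for the SVM-based ADMET models mentioned above.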
In vitro and in vivo laboratory testing of the compounds completes the procedure, in which quantitative data-based mathematical modeling is used to detect potential drug targets, and docking-based techniques are subsequently used to select a number of lead molecules with inhibitory activity against the targeted proteins. Akhoon et al. (2010) combined these techniques to perform a docking-based drug design and optimization of an anti-breast-cancer oligopeptide targeting HER-2 (also known as ErbB2). HER-2 overexpression is associated with increased tumor growth and metastasis, and therapeutic approaches are being developed to block these effects in breast cancer. In line with this goal, trastuzumab, a monoclonal antibody (mAb) against HER-2, was recently approved for treatment of breast cancer. Another mAb, pertuzumab, mediates antibody-dependent cytotoxic effects just like trastuzumab; in addition to these cytotoxic effects, however, pertuzumab binding directly
Figure 11.19 Schematic diagram for computer-aided drug design, integrating systems biology and structural biology. The combined approach identifies a disease-specific biochemical pathway and predicts implementable therapeutic targets through a structure-based drug design strategy.
inhibits ErbB2 association with its partner receptors, thereby blocking the signaling cascade in its initial steps. Akhoon et al. (2010) used the structure of the HER-2/pertuzumab complex as a starting point to design other oligopeptides with analogous activity. An energy minimization model was used for the identification of the active site and the contact surface between the extracellular domain of HER-2 and pertuzumab; this information was used as the target spot for the computational peptide design system. Docking analysis indicated that amino acids VAL286, SER288, LEU295, and HIS296 of HER-2 were strongly involved in binding with pertuzumab. These residues, which interact with the active site with high affinity, were selected for the design of a new sequential oligopeptide, LPRAEDTVS. The structures of potential 9-mer oligopeptides were modeled, and docking simulation analysis was performed in the binding cavity of the HER-2 extracellular domain. In order to increase the binding affinity of the designed oligopeptide, several derivatives were randomly generated by altering the positions of amino acid residues in the 9-mer oligopeptide, followed by control docking in the target binding cavity. The result was the creation of a new oligopeptide (RASPADREV), which possesses high binding affinity with a highly populated cluster of docked poses of reasonable energy; the structure is shown in Fig. 11.20. The authors later computationally analyzed various ADMET properties of the oligopeptide, such as oral bioavailability in humans, solubility in pure water, pH-dependent aqueous solubility, distribution, physicochemical properties, the octanol–water partition coefficient, pKa and ion fraction values, as well as the pH-dependent distribution coefficient. The designed oligopeptide with anti-cancer activity against HER-2 was furthermore predicted to show other desirable properties such as good solubility and stability.
9. Final Remarks

The landscape and perspectives of drug discovery have changed drastically over the past decade. On one hand, the traditional approach to drug discovery, consisting primarily of mass screening of millions of potential biomolecules, is becoming more and more problematic as we must deal with complex, multifactorial diseases that are often chronic and almost epidemic. On the other hand, pharmacology and molecular biology have moved from a lack of sufficiently many and sufficiently good experimental data to a situation where vast amounts of quantitative experimental data pose the challenging problems of information retrieval, integration, interpretation, and understanding. Our working hypothesis in this chapter was that tools and methods based on the mathematical modeling of
Figure 11.20 The oligopeptide RASPADREV binds to the extracellular domain of HER-2. HER-2 receptor overexpression is associated with increased tumor growth and metastasis, and several drugs are being developed to block the effects of this overexpression. Akhoon et al. (2010) used the computational methods for virtual screening of drug molecules described here to design a new oligopeptide that docks in the target binding cavity of HER-2. The result was the design of new oligopeptides, including RASPADREV, with high binding affinity and sufficient bioavailability. Figure obtained from Akhoon et al. (2010) with the agreement of the publisher.
biochemical networks and pathways are becoming essential to gain new insights into the origin and root causes of complex diseases that are needed to design efficacious therapeutic strategies. The methodology described here complements the approaches of pharmacokinetics and pharmacodynamics by expanding the level of detail in models to the full dynamics of the
signaling, genetic, proteomic, and metabolic networks that are often involved in disease. These dynamic models, along with a number of computational techniques and strategies that we discussed here, can be powerful tools in the search for potential drug targets. While we exemplified them with specific case studies, the concepts behind these techniques are disease independent, and the same basic methods can be applied with small variations to biochemical networks related to different kinds of pathologies, ranging from cancer to neurodegenerative and metabolic diseases. The implementation of these techniques and their application to actual diseases are not trivial and require an interdisciplinary approach, with teams composed of experimental biologists and clinicians, mathematical modelers, computational biologists, and bioinformaticians working in close collaboration. Ultimately, this will lead to the development of systems biology-based medicine, which will involve the integration of quantitative high-throughput data, mathematical modeling, computing, and analysis (Voit and Brigham, 2008; Wolkenhauer et al., 2010). This new paradigm is expected to boost the development of new drugs and to form the basis of the medicine of the future proposed by E. Zerhouni, the former director of the U.S. National Institutes of Health, which will be predictive, personalized, preventive, and participatory.
Appendix
Complete mathematical model description
ODE equations:

$$\frac{dM_1}{dt} = V_1 - V_2 - V_3$$
$$\frac{dM_2}{dt} = V_5 - V_6 - V_3$$
$$\frac{dM_3}{dt} = V_3 - V_4$$
$$\frac{dP}{dt} = V_7 - V_8$$
$$\frac{dP'}{dt} = V_8 - V_7$$
Rate equations:

$$V_1 = k_1; \quad V_2 = k_2 M_1^{a}$$
$$V_3 = k_3 M_1^{b} M_2^{c}; \quad V_4 = k_4 M_3^{e} P^{d}$$
$$V_5 = k_5 M_3^{f}; \quad V_6 = k_6 M_2^{g}$$
$$V_7 = k_7 P^{h} I^{i}; \quad V_8 = k_8 P^{j}$$

Parameter values:
Kinetic orders: a = 1; b = 0.5; c = 0.9; d = 1; e = 0.5; f = 0.9; g = 0.5; h = 1; i = 2; j = 1.
Rate constants: k1 = 10; k2 = 1.2; k3 = 0.665; k4 = 16; k5 = 9; k6 = 2.887; k7 = 133.3; k8 = 400.
Equations for the Optimisation Drug Discovery Program in the case study:

$$\min \; \sum_{j=1}^{p} \lambda_j \left|\frac{X_j - X_j^{HS}}{X_j^{HS}}\right| + \sum_{i=1}^{l} \lambda_i \left|\frac{J_i - J_i^{HS}}{J_i^{HS}}\right|$$
Subject to:

Mathematical description of the functional origin of the disease:
$$Enz_k = Enz_k^{PS}$$

Stable steady state of the disease-related metabolic network:
$$\frac{dX_i}{dt} = 0$$

Constraints guaranteeing physiological feasibility of any interventions:
$$X_i^{LB} \leq X_i \leq X_i^{UB}, \quad i = 1, \ldots, n_D$$
$$J_k^{LB} \leq J_k \leq J_k^{UB}, \quad k = 1, \ldots, n_{CF}$$
Table A.1 Databases and software tools used for biomedical knowledge retrieval and conceptual map construction
HPRD: Online database containing information on protein–protein interactions for biochemical networks in humans (www.hprd.org).
iHOP: Webtool providing up-to-date information on biological molecules and interactions retrieved by automatically extracting key sentences from PubMed documents (www.ihop-net.org).
KEGG pathway: Online database containing a collection of pathways and network maps representing updated knowledge on biochemical reaction networks (www.genome.jp/kegg/pathway).
CellDesigner: Software tool editor for drawing gene-regulatory and biochemical networks. Networks are drawn based on the graphical notation system proposed by Kitano and are stored using a computational standard format (www.celldesigner.org).
Cytoscape: Software platform for visualizing biological pathways and networks, integrating data concerning annotations, gene expression profiles, and other state data (www.cytoscape.org).
GO: Online database that provides a controlled vocabulary of terms for describing gene product characteristics, gene product annotation data, cellular components, and molecular functions (www.geneontology.org).
MetaCyc: Online database of nonredundant, experimentally elucidated metabolic pathways curated from the scientific experimental literature. MetaCyc contains pathways involved in both primary and secondary metabolism, as well as associated compounds, enzymes, and genes (www.metacyc.org).
Table A.2 Software for mathematical modeling and simulation of biochemical systems
SBtoolbox: www.sbtoolbox2.org
Potter's Wheel: www.potterswheel.de
COPASI: www.copasi.org
AMIGO: available upon request at [email protected]
MADONNA: www.berkeleymadonna.com
Table A.3 Important databases used for the screening of potential drug targets
Drug-target databases

PDTD (Potential Drug Target Database): A comprehensive, web-accessible database of drug targets that focuses on targets with known 3D structures. PDTD currently contains 1207 entries covering 841 known and potential drug targets with structures from the Protein Data Bank (PDB) (www.dddc.ac.cn/pdtd).
DrugBank: Contains detailed chemical, pharmacological, and pharmaceutical data, with comprehensive drug target information. The database contains nearly 4800 drugs and more than 2500 nonredundant protein drug targets (www.drugbank.ca).
PharmGKB (Pharmacogenomics Knowledge Base): Central repository for genetic, genomic, molecular, and cellular phenotype data and clinical information from various pharmacogenomics research studies (www.pharmgkb.org).
ddTargets (Drug Targets/Disease Target Database): Comprehensive database on therapeutic targets, containing around 4100 targets classified according to drug/disease type (www.sciclips.com/sciclips/drug-targets-main.do).
TTD (Therapeutic Target Database): Information about known and exploratory therapeutic protein and nucleic acid targets, diseases, pathways, and corresponding drugs directed at each of these targets. The database currently contains information on 1894 targets and 5126 drugs (http://bidd.nus.edu.sg/group/cjttd/TTD_HOME.asp).
TRMPD (Therapeutically Relevant Multiple Pathways Database): Currently contains 11 entries of multiple pathways, 97 entries of individual pathways, and 120 targets covering 72 disease conditions, along with 120 sets of drugs directed at each of these targets (http://xin.cz3.nus.edu.sg/group/trmp/trmp.asp).
STITCH (Search Tool for Interactions of Chemicals): Resource to explore known and predicted interactions of chemicals and proteins. Chemicals are linked to other chemicals and proteins by evidence derived from experiments, databases, and the literature. STITCH contains interactions for over 74,000 small molecules and over 2.5 million proteins in 630 organisms (http://stitch.embl.de).
MutationView/KMcancerDB: A database of disease-causing genetic mutations and an index of cancer-causing mutations (http://mutview.dmb.med.keio.ac.jp).
OMIM (Online Mendelian Inheritance in Man): A comprehensive compendium of human genes and genetic phenotypes, containing information on all known Mendelian disorders and over 12,000 genes, with a focus on relationships between phenotype and genotype (www.ncbi.nlm.nih.gov/omim).
METAGENE (Metabolic and Genetic Information Centre): A knowledgebase for inborn errors of metabolism providing information about each disease, its genetic cause, treatment, and the characteristic metabolite concentrations or clinical tests that may be used to diagnose or monitor the condition. It has data on 432 genetic diseases (www.metagene.de).
HGMD (Human Gene Mutation Database): A database of mutations in human genes searchable by gene and associated disease (www.hgmd.cf.ac.uk).
DEPD (Differentially Expressed Protein Database): Systematically compares global protein expression profiles in disease and normal cellular states, focusing on quantitative changes that occur as a function of disease, treatment, or environment (http://protchem.hunnu.edu.cn/depd).
Table A.4 List of some prominent computer-assisted drug discovery software
Protein–protein docking software:
- 3D-Dock Suite (FTDock, RPScore, and MultiDock): Performs rigid-body docking of biomolecules (www.bmm.icnet.uk/docking)
- ClusPro: Fully automated protein–protein docking. Performs rigid docking by DOT, ZDOCK, PIPER; clustering of complexes (http://structure.bu.edu/Projects/PPDocking/cluspro.html)
- DOT: Rigid docking of macromolecules including protein, DNA, RNA (www.sdsc.edu/CCMS/DOT)
- FireDock: Refinement and re-scoring of rigid-body protein–protein docking solutions (http://bioinfo3d.cs.tau.ac.il/FireDock)
- HEX: Protein docking and molecular superimposition program (www.loria.fr/ritchied/hex)
- HADDOCK: High Ambiguity Driven biomolecular DOCKing (www.nmr.chem.uu.nl/haddock)
- ZDock: Fast Fourier Transform based protein docking program (http://zdock.bu.edu)
- Autodock: Rigid and flexible docking. Docking of the ligand to a set of grids describing the target protein (http://autodock.scripps.edu)
- DOCK: Docks small molecules or fragments, allowing flexible ligand docking, using a shape-based algorithm (http://dock.compbio.ucsf.edu)
- FlexX: Incremental construction approach for ligands; allows flexible protein–ligand docking (www.biosolveit.de/FlexX)
- GLIDE: Complete suite for protein–ligand interactions based on systematic search techniques (www.schrodinger.com)
- GOLD: Uses genetic algorithms for docking (www.ccdc.cam.ac.uk/products/life_sciences/gold)
- LigPlot: Automatically plots protein–ligand interactions (www.biochem.ucl.ac.uk/bsm/ligplot/ligplot.html)
- VEGA: Ranks ligands on the basis of receptor–ligand interaction energy (http://users.unimi.it/ddl)
- Discovery Studio: Flexible docking, virtual high-throughput screening, de novo ligand generation (http://www.accelrys.com)
Biochemical Pathway Modeling Tools for Drug Target Detection
ACKNOWLEDGMENTS J. V. is funded by the German Federal Ministry of Education and Research (BMBF) as part of the project CALSYS-FORSYS under contract 0315264 (www.sbi.uni-rostock.de/calsys). The work was partially supported by NIH project P01-ES016731 and NSF project MCB0946595.
CHAPTER TWELVE
Deterministic and Stochastic Simulation and Analysis of Biochemical Reaction Networks: The Lactose Operon Example

Necmettin Yildirim* and Caner Kazanci†

Contents
1. Introduction
2. Mathematical Modeling of Biochemical Reaction Networks and Law of Mass Action
2.1. Simple enzymatic reactions and Michaelis–Menten equation
2.2. Higher order kinetics and Hill equations
2.3. Steady state and linear stability analysis in one-dimensional models
2.4. Modeling coupled reactions and bistability
3. Stochastic Simulations
3.1. Cases where stochasticity matters
3.2. Stochastic simulation algorithms
4. An Example: Lactose Operon in E. coli
5. Conclusions and Discussion
Acknowledgment
References
Abstract

A brief introduction to mathematical modeling of biochemical regulatory reaction networks is presented. Both deterministic and stochastic modeling techniques are covered, with examples from enzyme kinetics and coupled reaction networks with oscillatory dynamics and bistability. The Yildirim–Mackey model for the lactose operon is used as an example to discuss and show how deterministic and stochastic methods can be used to investigate various aspects of this bacterial circuit.

* Division of Natural Sciences, New College of Florida, Sarasota, Florida, USA
† Department of Mathematics/Faculty of Engineering, University of Georgia, Athens, Georgia, USA
Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87012-6. © 2011 Elsevier Inc. All rights reserved.
1. Introduction

Recent advances in experimental techniques in biology and medicine enable high-throughput experiments that acquire data more easily, cheaply, and accurately. As more data become available, the need for the right mathematical and computational tools for analysis and interpretation has become clear. Until recently, reductionist methods and statistics were the main tools used to study biological systems. "Simpler diseases" were understood and treated this way, but the remaining complicated ones, such as cancer, AIDS, sepsis, and heart disease, require new, sophisticated techniques that can cope with their complexity. The structure of biological organisms enables researchers to use various mathematical and computational approaches for modeling, simulation, and analysis. At the cellular level, there are distinct biochemical mechanisms responsible for specific jobs, such as energy production, protein synthesis, motor functions, defense, and signaling. We present a brief introduction to deterministic and stochastic modeling of biochemical networks. The outline of this chapter is as follows: In Section 2, we describe the basics of mathematical modeling of biochemical reactions and their steady-state and stability analysis using the law of mass action. Section 3 summarizes stochastic modeling techniques and describes a basic stochastic algorithm. We focus on the Yildirim–Mackey model for the lac operon in Section 4 and discuss how bistability arises in this network. We then simulate the lac operon model using both deterministic and stochastic methods with experimentally estimated parameter values to show that this system indeed displays bistable behavior for physiologically reasonable parameter values. The chapter ends with Section 5, which includes conclusions and discussion.
2. Mathematical Modeling of Biochemical Reaction Networks and Law of Mass Action

There are different approaches and methodologies for studying biochemical reactions. Mass-action kinetics, which results in a system of differential equations, is commonly used to describe the dynamics of biochemical reaction networks. This approach is fully deterministic, and it is appropriate when the system under consideration has a large number of molecules and these molecules are spatially homogeneous. In this section, we briefly describe how to construct differential equation models that describe the dynamics of a reaction network under these two assumptions. Suppose that A, B, and C are three proteins, and that when molecules of A collide with molecules of B, they may react and form C. Assume that this reaction is associated with a positive rate constant k1, quantifying how likely
it is for such a collision to result in a reaction. We also assume that C can break into A and B, and let the backward rate constant for this reaction be k2. We use the notation in Eq. (12.1) to denote this chemical reaction system:

$$A + B \underset{k_2}{\overset{k_1}{\rightleftharpoons}} C. \qquad (12.1)$$
Now suppose that there exist various amounts of these proteins in a constantly well-stirred pot, so that its contents remain spatially homogeneous. Here, we are interested in the temporal evolution of the concentrations of these protein molecules. Let us denote the concentration of a species X by [X]. We would like to construct a system of differential equations that governs the temporal evolution of [A], [B], and [C]. Naturally, we can think of the reaction given in Eq. (12.1) as two separate reactions: A + B → C with rate constant k1, and C → A + B with rate constant k2. According to mass-action kinetics, the time derivative of the concentration of protein A is the difference between the sum of the gain terms (input chemical fluxes) that cause the concentration to increase and the sum of the loss terms (output chemical fluxes) that act to decrease it:

$$\frac{d[A]}{dt} = \sum \text{input fluxes} - \sum \text{output fluxes}. \qquad (12.2)$$

For the reaction given in Eq. (12.1), the mathematical model is

$$\frac{d[A]}{dt} = \frac{d[B]}{dt} = -k_1[A][B] + k_2[C], \qquad \frac{d[C]}{dt} = k_1[A][B] - k_2[C]. \qquad (12.3)$$
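As a concrete illustration, the system in Eq. (12.3) can be integrated numerically; a minimal forward-Euler sketch, in which the rate constants and initial concentrations are illustrative assumptions:

```python
def mass_action_step(state, k1, k2, dt):
    """One forward-Euler step of Eq. (12.3) for the reaction A + B <-> C."""
    A, B, C = state
    flux = k1 * A * B - k2 * C        # net forward flux of A + B -> C
    return (A - flux * dt, B - flux * dt, C + flux * dt)

def simulate(state, k1, k2, dt=1e-3, steps=10_000):
    """Integrate the mass-action ODEs from t = 0 to t = steps * dt."""
    for _ in range(steps):
        state = mass_action_step(state, k1, k2, dt)
    return state

# Illustrative values: k1 = 2, k2 = 1, [A]0 = 1.0, [B]0 = 0.8, [C]0 = 0
A, B, C = simulate((1.0, 0.8, 0.0), k1=2.0, k2=1.0)
# The totals [A] + [C] and [B] + [C] are conserved, and the trajectory
# relaxes toward the equilibrium where k1[A][B] = k2[C].
```

Note the conservation relations [A] + [C] and [B] + [C], which follow directly from Eq. (12.3) since d[A]/dt = d[B]/dt = −d[C]/dt.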
This system of differential equations can be solved to simulate the temporal evolution of [A], [B], and [C] after assigning initial concentrations of A, B, and C at t = 0. Although mass-action kinetics is extremely useful for modeling chemical reactions, biological systems benefit greatly from enzymatic kinetics. Most chemical reactions in biological organisms rely on enzymes, special molecules that enable certain reactions to occur. In general, enzymes are fast-acting molecules existing in low concentrations. These properties enable the derivation of simpler equations for enzymatic reactions using certain approximations. In the following two sections, we give two examples from enzyme kinetics. In the first example, the rate of product formation is a hyperbolic function of the substrate
concentration. In the second example, the rate of product formation is a sigmoidal function of the substrate concentration. We then discuss the importance of these types of functional relationships.
2.1. Simple enzymatic reactions and Michaelis–Menten equation

Consider the enzyme-catalyzed reaction given in Eq. (12.4). An enzyme E binds to a substrate S and forms an enzyme–substrate complex ES with rate constant k1. We assume this step is fully reversible; that is, ES can break down into E and S, and the associated rate constant for this backward reaction is k2. We also assume that ES can break down to release the product P and the free enzyme E; the rate constant for this final step is k3. In this simple system, there are four time-dependent variables: [E], [S], [ES], and [P].

$$E + S \underset{k_2}{\overset{k_1}{\rightleftharpoons}} ES \overset{k_3}{\longrightarrow} P + E. \qquad (12.4)$$
We assume that the total concentrations of the enzyme and the substrate stay constant over time for this system. That gives us the following two equations:

$$E_0 = [E] + [ES], \qquad (12.5)$$

$$S_0 = [S] + [ES] + [P], \qquad (12.6)$$
where E0 and S0 are the initial concentrations of the enzyme and the substrate, respectively. These two equations reduce the number of free variables from four to two. Now, we can write two differential equations that describe the dynamics of the concentrations of ES and P:

$$\frac{d[ES]}{dt} = k_1[E][S] - (k_2 + k_3)[ES], \qquad (12.7)$$

$$\frac{d[P]}{dt} = k_3[ES]. \qquad (12.8)$$
These equations describe the dynamics of the single enzyme–substrate reaction in Eq. (12.4). However, we can simplify this model further using additional assumptions. Not all variables in a dynamic system change on the same time scale; it is often the case that some variables change significantly faster than others. If we assume [ES] is a fast variable and reaches a steady state much earlier than [P], then we get d[ES]/dt ≈ 0 and hence
$$[ES] = \frac{k_1}{k_2 + k_3}[E][S]. \qquad (12.9)$$
This is called the quasi-steady-state assumption on [ES]. If we substitute the [ES] given in Eq. (12.9) into Eq. (12.5) and solve the resulting equation for [E], we get

$$[E] = \frac{E_0}{1 + \frac{k_1}{k_2 + k_3}[S]}. \qquad (12.10)$$
Plugging Eq. (12.9) into Eq. (12.8), after replacing [E] in Eq. (12.9) by Eq. (12.10), gives us

$$\frac{d[P]}{dt} = \frac{V_{max}[S]}{K_m + [S]}, \qquad (12.11)$$
where Vmax = k3E0 and Km = (k2 + k3)/k1. This equation is well known as the Michaelis–Menten equation in enzyme kinetics. In Eq. (12.11), the parameter Vmax is the maximum rate at which this reaction can occur, and the parameter Km is defined as the value of [S] that gives half of Vmax. In other words, when [S] = Km, product formation (d[P]/dt ≈ −d[S]/dt) occurs at half of its maximum rate (Vmax). A graphical representation of the Michaelis–Menten equation in Eq. (12.11) for various values of Km is depicted in Fig. 12.1. As seen in this graph, all the curves approach the maximum value Vmax as [S] increases. For larger Km values, the curves shift toward the right. All the curves are concave downward, and their concavities do not change as [S] increases.
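The half-saturation property of Eq. (12.11) is easy to check numerically; a small sketch in which the parameter values are arbitrary:

```python
def michaelis_menten(S, Vmax, Km):
    """Product formation rate d[P]/dt from Eq. (12.11)."""
    return Vmax * S / (Km + S)

# When [S] = Km the rate is exactly Vmax / 2, and the rate
# saturates toward Vmax as [S] grows large.
v_half = michaelis_menten(2.0, Vmax=10.0, Km=2.0)   # Vmax / 2 = 5.0
v_sat = michaelis_menten(1e6, Vmax=10.0, Km=2.0)    # close to Vmax
```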
2.2. Higher order kinetics and Hill equations

Consider the reaction system given in Eqs. (12.12) and (12.13). Here, n molecules of a substrate S bind to an enzyme E and form a complex ESn, with forward rate constant k1 and reverse rate constant k2. The enzyme is then released and a product P is formed with rate constant k3:

$$E + nS \underset{k_2}{\overset{k_1}{\rightleftharpoons}} ES_n, \qquad (12.12)$$

$$ES_n \overset{k_3}{\longrightarrow} P + E. \qquad (12.13)$$
Figure 12.1 A graphical representation of the Michaelis–Menten equation in Eq. (12.11) and hyperbolic kinetics for various values of Km, with the maximum rate Vmax kept fixed. As Km increases, the curves move to the right; all curves are concave downward.
Let us also assume that at any time throughout the course of these reactions, the first reaction (Eq. (12.12)) is much faster than the second one (Eq. (12.13)). Therefore, the first reaction reaches equilibrium (forward and backward reaction rates become equal) before P starts to be produced. This allows us to write

$$\frac{d[ES_n]}{dt} = k_1[E][S]^n - k_2[ES_n] = 0.$$

This is called the equilibrium assumption. Then, we can write

$$[E] = \frac{k_2[ES_n]}{k_1[S]^n}. \qquad (12.14)$$
Assuming that the total amount of enzyme is conserved, we can write

$$E_{tot} = [E] + [ES_n]. \qquad (12.15)$$
Substituting Eq. (12.14) into Eq. (12.15) and then solving for [ESn], we obtain

$$[ES_n] = \frac{E_{tot}[S]^n}{K_{eq} + [S]^n}, \qquad K_{eq} = \frac{k_2}{k_1}.$$
Figure 12.2 A graphical representation of the Hill equation in Eq. (12.16) and sigmoidal kinetics for various values of Keq, with the maximum rate Vmax kept fixed and n = 4. As with the Michaelis–Menten curves, the curves move to the right as Keq increases. Unlike the Michaelis–Menten curves, the Hill-function curves are concave upward for smaller values of [S] and become concave downward for larger values of [S].
According to the mass-action law, the rate of change of the concentration of P is proportional to [ESn], with proportionality constant k3. Hence, d[P]/dt takes the form

$$\frac{d[P]}{dt} = \frac{V_{max}[S]^n}{K_{eq} + [S]^n}, \qquad (12.16)$$
where Vmax = k3Etot and Keq = k2/k1. A graphical representation of the Hill equation in Eq. (12.16) for various values of Keq with n = 4 is shown in Fig. 12.2. As seen in this figure, all the curves approach the maximum value Vmax as [S] increases. For larger Keq values, the curves shift toward the right. One of the important characteristics of Hill-equation curves is that they are concave upward and then become concave downward after a threshold value of [S], no matter what the value of Keq is. This important feature leads to bistability, the ability of a system to rest in either of two stable steady states, as will be discussed in the following sections.
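The change of concavity that distinguishes Eq. (12.16) from Eq. (12.11) can be verified with a finite-difference check; a sketch with illustrative parameter values:

```python
def hill(S, Vmax, Keq, n):
    """Product formation rate d[P]/dt from the Hill equation (12.16)."""
    return Vmax * S**n / (Keq + S**n)

def second_difference(f, S, h=1e-3):
    """Finite-difference estimate proportional to the second derivative."""
    return f(S + h) - 2.0 * f(S) + f(S - h)

f = lambda S: hill(S, Vmax=1.0, Keq=1.0, n=4)
curv_low = second_difference(f, 0.3)   # positive: concave upward
curv_high = second_difference(f, 2.0)  # negative: concave downward
```

With Keq = 1 and n = 4 the half-maximal rate occurs at [S] = 1, and the curvature changes sign between the two test points, which is exactly the sigmoidal shape shown in Fig. 12.2.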
2.3. Steady state and linear stability analysis in one-dimensional models

A one-dimensional mathematical model has the following general form:

$$\frac{d[A]}{dt} = f([A]). \qquad (12.17)$$
Figure 12.3 Graphical approach to the steady state and stability analysis of the one-dimensional model given in Eq. (12.17).
We say a point [A*] is a steady state if the time derivative at that point is zero. The steady states can be computed by solving the equation f([A*]) = 0, since d[A]/dt = 0 at [A] = [A*]. When Eq. (12.17) is plotted as a function with [A] on the x-axis and d[A]/dt on the y-axis, the x-intercepts give the steady state values (see Fig. 12.3). Now, we can think of Eq. (12.17) as a model that describes the movement of an imaginary particle along the [A]-axis, with d[A]/dt as the velocity of that particle. Since d[A]/dt = 0 at a steady state value, there is no change in [A] there. If d[A]/dt < 0 for a value of [A], the arrows point to the left; otherwise they point to the right. As can be seen in Fig. 12.3, there are two types of steady states. The filled dot represents a stable steady state, since the flow is toward it. The open circle represents an unstable steady state, since the flow is away from it. We can conclude from Fig. 12.3 that a steady state is stable if the slope of f, df/d[A], is negative at that steady state value, and unstable if the slope is positive there.
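The graphical procedure above translates directly into code: scan f for sign changes to locate steady states, then use the sign of the slope f′([A*]) to classify each one. A sketch, in which the example f is a made-up illustration:

```python
def steady_states(f, a, b, n=10000):
    """Locate zeros of f on [a, b] by scanning for sign changes
    and refining each bracket with bisection."""
    step = (b - a) / n
    roots = []
    for i in range(n):
        lo, hi = a + i * step, a + (i + 1) * step
        if f(lo) == 0.0:
            roots.append(lo)
        elif f(lo) * f(hi) < 0.0:
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if f(lo) * f(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
    return roots

def is_stable(f, x, h=1e-6):
    """A steady state is stable iff the slope f'(x*) is negative."""
    return (f(x + h) - f(x - h)) / (2.0 * h) < 0.0

# Illustrative model: d[A]/dt = [A](1 - [A]), steady states at 0 and 1
f = lambda A: A * (1.0 - A)
ss = steady_states(f, -0.5, 1.5)
# 0 is unstable (slope +1), 1 is stable (slope -1)
```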
2.4. Modeling coupled reactions and bistability

In this section, we give an example with a positive feedback loop, one of the important regulatory mechanisms in biological systems. It is capable of producing two stable steady states separated by an unstable steady state, a so-called "bistable system." Bistability provides a true discontinuous switching between stable steady states, and a bistable system often involves a positive feedback loop. Positive feedback loops are ubiquitous control mechanisms in gene networks; the lactose operon and the arabinose operon of Escherichia coli are two examples of this type of regulatory control network (Lewin, 2008; Schleif, 2000). Consider the hypothetical system with a positive feedback loop in Fig. 12.4. This reaction network has two proteins, A and B. We use Eqs. (12.18) and (12.19) to model the dynamics of this
Figure 12.4 A cartoon for a reaction network with a positive feedback loop.
toy reaction network with positive feedback. The dynamics of the concentration of A is given by Eq. (12.18). In this network, we assume A is produced at a constant rate a1 and degraded at a rate proportional to its concentration, with proportionality constant b1. The second term in this equation accounts for the increase in the production rate of A due to the positive feedback, and we assume that this relationship has a Hill-function form with n = 2. Equation (12.19) models the dynamics of [B]. It is assumed that A is required for the production of B, so the production rate of B is proportional to the concentration of A with proportionality constant a2. We assume that B decays with rate constant b2.

$$\frac{d[A]}{dt} = a_1 + \frac{V_m[B]^2}{K_m + [B]^2} - b_1[A]. \qquad (12.18)$$

$$\frac{d[B]}{dt} = a_2[A] - b_2[B]. \qquad (12.19)$$
This system of differential equations has two time-dependent variables, [A] and [B], and six positive parameters: a1, a2, b1, b2, Vm, and Km.

2.4.1. Steady state and stability analysis

Suppose that the system given in Eqs. (12.18) and (12.19) has a steady state ([A*], [B*]). At this steady state, d[A]/dt = d[B]/dt = 0 has to hold simultaneously. Therefore, we can write

$$a_1 + \frac{V_m[B^*]^2}{K_m + [B^*]^2} - b_1[A^*] = 0, \qquad (12.20)$$

$$a_2[A^*] - b_2[B^*] = 0. \qquad (12.21)$$
After solving Eq. (12.21) for [A*] and plugging it back into Eq. (12.20), we get a nonlinear equation in [B*]:

$$a_1 + \frac{V_m[B^*]^2}{K_m + [B^*]^2} = b_1\frac{b_2}{a_2}[B^*]. \qquad (12.22)$$
When we view the right-hand side and left-hand side of Eq. (12.22) as two different functions of [B], the right-hand side is a linear function of [B] with positive slope m = b1(b2/a2). The left-hand side is a Hill function that equals a1 when [B] = 0 and approaches a maximum value of a1 + Vm as [B] → ∞ (Fig. 12.5). When both functions are plotted in the same plane with [B] on the x-axis, their intersections give the steady state value or values. It is not hard to see that the two functions intersect at only one point for small and for large values of m, but at three points for intermediate values of m. Because the left-hand side of Eq. (12.22) is a Hill function that is concave upward for small values of [B], its increase is relatively small at low concentrations of B; there is then a sharp increase at intermediate concentrations of B, after which the curve changes concavity, becomes concave downward, and finally levels off. This feature of the curve allows the possibility of multiple steady states for intermediate values of m. Figure 12.5 shows how one, two, or three steady states can arise in this model for different values of m. The local stability of the model given in Eqs. (12.18) and (12.19) can be studied mathematically by linearizing the system of differential equations around a given steady state and examining the eigenvalues of the Jacobian matrix. For the sake of simplicity, let us assume [B] is a fast variable in this system, so that d[B]/dt ≈ 0 in Eq. (12.19). After solving d[B]/dt = 0 in Eq. (12.19) for [B] and substituting it into Eq. (12.18), the two-dimensional model reduces to the one-dimensional model:
Figure 12.5 A diagrammatic representation showing how one, two, or three steady states may arise in the model given by Eqs. (12.18) and (12.19). The solid line represents the left-hand side of Eq. (12.22). The dash-dotted lines are the right-hand side of Eq. (12.22) for three different values of m = b1b2/a2. As seen in this plot, there is only one steady state for either small or large values of m. However, there is a range of m in which it is possible to have three coexisting steady states.
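The intersection counting illustrated in Fig. 12.5 can be checked numerically. The sketch below uses parameter values invented for demonstration (a1 = 0.05, Vm = 1.0, Km = 100; they are not the chapter's values) and counts sign changes of the difference between the Hill-shaped left-hand side and the line m[B*] for a steep, an intermediate, and a shallow slope.

```python
# Count intersections of the Hill-shaped left-hand side of Eq. (12.22)
# with the straight line m*[B*]. All parameter values here are invented
# for illustration; they are not the chapter's parameters.

def lhs(B, a1=0.05, Vm=1.0, Km=100.0):
    """Left-hand side of Eq. (12.22): a1 + Vm*B^2/(Km + B^2)."""
    return a1 + Vm * B**2 / (Km + B**2)

def count_intersections(m, B_max=300.0, n=30001):
    """Count sign changes of lhs(B) - m*B on a uniform grid over [0, B_max]."""
    count = 0
    prev = lhs(0.0)                   # at B = 0 the line contributes nothing
    for i in range(1, n):
        B = B_max * i / (n - 1)
        cur = lhs(B) - m * B
        if prev * cur < 0:            # sign change: one crossing in this cell
            count += 1
        if cur != 0.0:                # skip exact zeros so they count once
            prev = cur
    return count

# Steep, intermediate, and shallow slopes m = b1*b2/a2:
counts = [count_intersections(m) for m in (0.2, 0.05, 0.005)]
print(counts)  # [1, 3, 1]: one, three, and one steady state
```

The intermediate slope crosses the sigmoid three times, reproducing the multiplicity argument in the text.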
Deterministic and Stochastic Simulation of Biochemical Reaction Networks
Figure 12.6 A diagrammatic representation of stability analysis of the bistable system modeled by Eq. (12.23). In this plot we see that when there is only one steady state (A and C), this steady state is stable (the flow is toward this steady state). When there are three steady states (B), the middle one is unstable (the flow is away from this steady state) and the other two are stable.
\[ \frac{d[A]}{dt} = a_1 + \frac{V_m [A]^2}{K_m (b_2/a_2)^2 + [A]^2} - b_1 [A]. \tag{12.23} \]
In Fig. 12.6, we plot d[A]/dt versus [A] for small, medium, and large values of m. As seen in this figure, there is only one steady state for small and for large values of m, and these steady states are always stable (panels A and C in Fig. 12.6). When there exist three steady states (panel B in Fig. 12.6), the middle steady state is always unstable, while the lowest and the highest steady states are always stable.
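The graphical stability test in Fig. 12.6 can be automated: bracket the roots of f(A) = d[A]/dt on a grid, refine each by bisection, and classify stability by the sign of f through each root. The parameter values below are invented to put a model of the form of Eq. (12.23) in its three-steady-state regime; they are illustrative only, not the chapter's values.

```python
# Locate the steady states of a reduced 1-D model of the form of Eq. (12.23),
# d[A]/dt = f(A), and classify their stability from the slope of f at each
# root. Parameters are invented to give bistability (illustrative only).

def f(A, a1=0.05, Vm=1.0, K=100.0, b1=0.05):
    """Right-hand side of the reduced positive-feedback model."""
    return a1 + Vm * A**2 / (K + A**2) - b1 * A

def bisect(lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi] by bisection (f must change sign)."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if flo * f(mid) <= 0:
            hi = mid
        else:
            lo, flo = mid, f(mid)
    return 0.5 * (lo + hi)

# Bracket sign changes on a coarse grid, then refine each root.
grid = [i * 0.01 for i in range(3001)]   # A in [0, 30]
roots = []
for a, b in zip(grid, grid[1:]):
    if f(a) == 0.0:
        roots.append(a)
    elif f(a) * f(b) < 0:
        roots.append(bisect(a, b))

def stability(A, h=1e-6):
    """Stable if f decreases through the root (the flow converges)."""
    return "stable" if f(A + h) - f(A - h) < 0 else "unstable"

for A in roots:
    print(f"A* = {A:.3f}  ({stability(A)})")
```

With these parameters the model has steady states near 1.37, 5, and 14.6, classified stable, unstable, and stable, matching the pattern described for panel B of Fig. 12.6.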
3. Stochastic Simulations

Ordinary differential equation (ODE) models are widely used to simulate biochemical reaction systems. However, they are by no means perfect at capturing every aspect of the molecular reactions that occur in real life. The continuous variables used in ODEs are not appropriate for representing the dynamics of molecular species that exist in low quantities in a system. Another major shortcoming of ODEs shows up in systems capable of multiple steady states. The deterministic solution of the ODE representation of such a system will always converge to a single stable steady state and stay there. In real life, however, constant switching behavior among steady states may be observed: due to inherent fluctuations within the system, the state may be "pushed" from one steady state to another. Such issues may be accommodated by stochastic simulations or stochastic differential equations (SDEs).
3.1. Cases where stochasticity matters

Molecules in a medium move and collide with each other. When two molecules of the same species collide, their velocities change. However, when two molecules with the capability to react collide, they may react with some probability p and form a new chemical species. This probability p is somewhat analogous to the reaction rate k in mass-action kinetics. This inherently probabilistic behavior cannot be captured by deterministic differential equations. In general, the effects of this probabilistic behavior may be ignored without much penalty. In some cases, however, a significant difference is observed between the stochastic and deterministic equations representing the same system. Here, we give two such example systems. For biochemical reaction systems, an ODE solution is an approximation to a stochastic phenomenon. For example, at equilibrium, the time course plot of the ODE solution is a straight horizontal line, while continuous "noisy" activity is observed in the SDE solution. In reality, unlike the ODE simulation suggests, the activity in a biochemical reaction system never stops: at equilibrium, molecules keep colliding and the reactions keep occurring, but at balanced rates, so that the concentrations of the molecular species stay the same on average. The difference between the stochastic and deterministic simulations becomes significant as the number of molecules decreases, in which case the noise takes over the dynamics. We demonstrate this effect of probabilistic behavior on the following reaction network (Scott, 1991):
\[ A + X \xrightarrow{k_1} 2X, \qquad X + Y \xrightarrow{k_2} 2Y, \qquad Y \xrightarrow{k_3} B. \tag{12.24} \]
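Under mass-action kinetics, scheme (12.24) gives the rate equations dX/dt = k1[A]X − k2XY and dY/dt = k2XY − k3Y. The sketch below integrates them with a fixed-step RK4 scheme and, for comparison, an Euler–Maruyama version in which each reaction channel contributes Gaussian noise with variance equal to its propensity (a chemical-Langevin-style construction; the step sizes, seed, and noise treatment are our own illustrative choices, not the chapter's MATLAB code).

```python
import math, random

# Deterministic (RK4) and Langevin-type (Euler-Maruyama) simulation of the
# network in Eq. (12.24), with the rates and initial counts quoted in the
# text: k1[A] = 2, k2 = 0.01, k3 = 2, X(0) = Y(0) = 300.
K1A, K2, K3 = 2.0, 0.01, 2.0

def drift(x, y):
    a1, a2, a3 = K1A * x, K2 * x * y, K3 * y   # reaction propensities
    return a1 - a2, a2 - a3, (a1, a2, a3)

def rk4_step(x, y, dt):
    def f(x, y):
        dx, dy, _ = drift(x, y)
        return dx, dy
    k1x, k1y = f(x, y)
    k2x, k2y = f(x + 0.5*dt*k1x, y + 0.5*dt*k1y)
    k3x, k3y = f(x + 0.5*dt*k2x, y + 0.5*dt*k2y)
    k4x, k4y = f(x + dt*k3x, y + dt*k3y)
    return (x + dt*(k1x + 2*k2x + 2*k3x + k4x)/6,
            y + dt*(k1y + 2*k2y + 2*k3y + k4y)/6)

def em_step(x, y, dt, rng):
    # Each channel gets an independent Wiener increment scaled by the
    # square root of its propensity (Langevin-style noise construction).
    dx, dy, (a1, a2, a3) = drift(x, y)
    w1, w2, w3 = (rng.gauss(0.0, math.sqrt(dt)) for _ in range(3))
    x = x + dx*dt + math.sqrt(a1)*w1 - math.sqrt(a2)*w2
    y = y + dy*dt + math.sqrt(a2)*w2 - math.sqrt(a3)*w3
    return max(x, 0.0), max(y, 0.0)   # crude guard against negativity

rng = random.Random(42)
dt = 0.001
ode = sde = (300.0, 300.0)
for _ in range(5000):                  # integrate to t = 5
    ode = rk4_step(*ode, dt)
    sde = em_step(*sde, dt, rng)

print("ODE:", ode)
print("SDE:", sde)
```

The deterministic trajectory cycles around the fixed point at X = k3/k2 = 200, Y = k1[A]/k2 = 200, while the noisy trajectory drifts away from it in amplitude and phase, as in Fig. 12.7.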
In this system, a constant supply of A is assumed. B is a product, so its concentration does not affect the system. The only changing quantities in the two-dimensional ODE are the concentrations of X and Y. The initial conditions are [X](0) = [Y](0) = 300, and the reaction rates are k1[A] = 2, k2 = 0.01, k3 = 2. This reaction system is chosen because the concentrations of X and Y oscillate, clearly demonstrating how the inherent probabilistic behavior of molecular reactions may perturb the dynamics predicted by the ODE simulation. In Fig. 12.7, we compare the ODE and SDE simulations for this reaction system. Significant variation is observed between the two methods. This difference is expected to be more apparent for smaller systems, where a small number of molecules are present in the medium. In extreme cases, the dynamic behavior may be totally lost and dominated by noise. Even if the number of molecules remains high in the environment at all times, the ODE simulation may still convey a significantly different
Figure 12.7 The difference between the ODE and SDE simulations for the same reaction system (Eq. (12.24)) is demonstrated. The time course of the concentration of B is shown to compare the two methods. The ODE solution for [B] converges to a limit cycle over time and stays there. Although the SDE solution initially shows similar behavior, it quickly diverges and displays significant variation in both amplitude and phase.
behavior than a stochastic simulation. The positive feedback mechanism given in Fig. 12.4 demonstrates such behavior. Although this system consists of two molecular species, it is possible to derive a one-dimensional ODE using the quasi-steady-state assumption for B, as in Eq. (12.23) under the assumption given by Eq. (12.21). Note that this function is capable of having multiple steady states. For the following choice of parameters,

\[ a_1 = 0.42, \quad b_1 = 0.004, \quad a_2 = 0.36, \quad b_2 = 32, \quad V_m = 0.93, \quad K_m = 1.8, \]

the system has two stable steady states at 13 and 160 and an unstable steady state at 70. Depending on the initial condition, the ODE solution converges to one of the stable steady states (Fig. 12.8):

\[ \lim_{t\to\infty} [A](t) = \begin{cases} 13, & 0 \le [A](0) < 70, \\ 70, & [A](0) = 70, \\ 160, & [A](0) > 70. \end{cases} \]

Stochastic simulation of this bistable system conveys an interesting behavior: it switches back and forth between the two stable steady states. The switching occurs when the inherent perturbations around a steady state are large enough to push the solution to the other side of the unstable steady
Figure 12.8 A stochastic solution of the positive feedback system in Eq. (12.23) with initial condition at [A](0) ¼ 70 is plotted, which conveys a “switching” behavior between the stable steady states. Steady states are shown with dashed lines.
state. This phenomenon repeats, preventing the system from settling at one steady state. This is the case even if A is present at abundant concentrations. What happens here is therefore much more than a stochastic solution showing noisy behavior: if a system has multiple steady states, the stochastic solution may show significantly different results than the ODE solution. Stochastic methods may be a necessity for such systems.
3.2. Stochastic simulation algorithms

As with numerical ODE solutions, various methods exist for stochastic simulations. Unlike for numerical ODE solutions, however, there are many other factors to consider in choosing the correct stochastic method. Note that it is not possible to obtain a stochastic solution simply by adding some Gaussian noise to an ODE solution at each iteration; the result of such an approach would be noisy and wrong. Similar but correct approaches exist, such as the chemical Langevin equation (Gillespie, 2000), in which the correct noise term is computed and added to the deterministic increment. This method was used in Fig. 12.7. Computing the correct noise term is essential, and it is more complicated than computing the deterministic part of the solution. The chemical Langevin equation is a first-order method. Higher-order methods exist, but for them the computation of the noise term gets extremely complicated. Another issue with this approach occurs when some molecular species exist in extremely low concentrations, in which case the solution may go negative, indicating negative
concentrations. This implausible result can occur because the noise term may exceed an extremely small deterministic component of the solution. A different well-known methodology was proposed by Gillespie (1977), in which each individual molecule and reaction is taken into account. It therefore works very well even if some molecular species exist in low concentrations. However, if some molecular species exist in extremely high concentrations, Gillespie's stochastic algorithm may run extremely slowly. Many recent developments aim to eliminate these shortcomings of stochastic solutions. Various modifications of Gillespie's stochastic method that run faster (Gillespie, 2001; Gillespie and Petzold, 2003), and chemical Langevin equations that preserve positivity (Wilkie and Wong, 2008), are being developed, though such enhancements generally come at the cost of another compromise such as accuracy, complexity, or efficiency. We will go over a basic stochastic simulation method for demonstration purposes. Although Gillespie's algorithm and the chemical Langevin equation are considered better-performing methods in general, the basic stochastic algorithm is simple and intuitive. We will use the reaction system given in Eq. (12.4) to describe the basic stochastic algorithm. The three reactions and their associated reaction rates are as follows:
\[ E + S \xrightarrow{k_1} ES, \quad \text{rate } k_1[E][S]; \qquad ES \xrightarrow{k_2} E + S, \quad \text{rate } k_2[ES]; \qquad ES \xrightarrow{k_3} P + E, \quad \text{rate } k_3[ES]. \]

The method is based on the probabilities that these reactions occur over a fixed time interval dt. The smaller this time interval, the more accurate the simulation. We compute the probability that one of these reactions occurs during a time interval of length dt by multiplying its reaction rate by dt. For example, the probability that the first reaction occurs during a time interval of length dt is k1[E][S]dt. Here, dt has to be sufficiently small that this product stays less than 1; in fact, dt should be small enough that at most one of these reactions happens during dt. The probability that none of these reactions occurs over dt is 1 − k1[E][S]dt − (k2 + k3)[ES]dt. We can then devise an iterative simulation algorithm in which we update the state of the system after each fixed time interval of length dt. We decide which reaction occurs by partitioning the interval [0, 1] into four subintervals with lengths equal to the corresponding probabilities:
- [0, k1[E][S]dt): first reaction occurs
- [k1[E][S]dt, k1[E][S]dt + k2[ES]dt): second reaction occurs
- [k1[E][S]dt + k2[ES]dt, k1[E][S]dt + (k2 + k3)[ES]dt): third reaction occurs
- [k1[E][S]dt + (k2 + k3)[ES]dt, 1]: no reaction occurs
We then choose a random value p between 0 and 1 and update the state of the system depending on which subinterval p belongs to. For example, if 0 < p < k1[E][S]dt, then the reaction E + S → ES has occurred once, so we update the system so that E and S each lose one molecule and ES gains one. If k1[E][S]dt + (k2 + k3)[ES]dt < p < 1, then no reaction has occurred, and we jump to the next iteration without changing the state. This iterative scheme can be generalized to any chemical reaction system. At each iteration:
1. Reaction probabilities are computed.
2. The interval [0, 1] is partitioned into subintervals according to these probabilities.
3. A uniform random number p is chosen from [0, 1].
4. The system state is updated depending on which subinterval p belongs to.
As with numerical ODE solvers, the accuracy of this algorithm increases as dt decreases. Likewise, if dt is extremely small, simulations take longer to compute: for small dt values, the system state will not change during many iterations, because the probability that no reaction occurs converges to 1 as dt → 0. Gillespie's stochastic algorithm remedies this issue by choosing an adaptive time step dt, as explained in his paper (Gillespie, 1977).
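The four-step iteration above can be sketched directly. The sketch below applies it to the enzymatic scheme E + S ⇌ ES → P + E; the rate constants, molecule counts, dt, number of steps, and seed are invented for illustration.

```python
import random

# A minimal sketch of the fixed-time-step stochastic algorithm described
# above, applied to E + S <-> ES -> P + E. Rate constants, molecule counts,
# dt, and the step count are invented for illustration.
def simulate(E=10, S=50, ES=0, P=0, k1=0.01, k2=0.1, k3=0.1,
             dt=0.01, steps=200000, seed=1):
    rng = random.Random(seed)
    for _ in range(steps):
        p1 = k1 * E * S * dt          # E + S -> ES
        p2 = k2 * ES * dt             # ES -> E + S
        p3 = k3 * ES * dt             # ES -> P + E
        p = rng.random()
        if p < p1:
            E, S, ES = E - 1, S - 1, ES + 1
        elif p < p1 + p2:
            E, S, ES = E + 1, S + 1, ES - 1
        elif p < p1 + p2 + p3:
            ES, P, E = ES - 1, P + 1, E + 1
        # otherwise: no reaction occurred in this interval
    return E, S, ES, P

E, S, ES, P = simulate()
print(E, S, ES, P)
```

Note that the update rules automatically conserve total enzyme (E + ES) and total substrate material (S + ES + P), which is a useful sanity check on any implementation.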
4. An Example: Lactose Operon in E. coli

We use the lactose operon (the lac operon) of E. coli and a modified version of the Yildirim–Mackey model (Mackey et al., 2004; Yildirim and Mackey, 2003; Yildirim et al., 2004) developed for this bacterial regulatory circuit to demonstrate the methods and analysis described in previous sections. The lac operon is the classical example of an inducible circuit, encoding the genes for the transport of external lactose into the cell and its conversion to glucose and galactose. A cartoon depicting the major components of this circuit is shown in Fig. 12.9. The molecular mechanism of the lac operon works as follows: the lac operon has a small promoter/operator region (P and O) and three larger structural genes, lacZ, lacY, and lacA. There is a regulatory gene lacI preceding the lac operon; lacI is responsible for producing a repressor (R) protein. In the presence of allolactose, a binary complex is formed between allolactose and the repressor that makes binding of the repressor to the operator region impossible. In that case, the RNA polymerase bound to the promoter is able to initiate transcription of the structural genes to produce mRNA (M). However, in the absence of allolactose (A), the repressor protein R binds to the operator region O and prevents the RNA polymerase from transcribing the structural
Figure 12.9 Schematic representation of the lactose operon regulatory system. See the text for details.
genes. Once the mRNA has been produced, the process of translation starts. The lacZ gene encodes the portion of the mRNA that is responsible for the production of b-galactosidase (B), and translation of the lacY gene produces the section of mRNA that is ultimately responsible for the production of the enzyme permease (P). The final portion of mRNA, produced by transcription of the lacA gene, encodes the production of thiogalactoside transacetylase, which is thought not to play a role in the regulation of the lac operon (Beckwith, 1987). This positive control system works as follows: when no glucose is available for cellular metabolism but lactose (L) is available in the medium, the lactose is transported into the cell by the permease. This intracellular lactose is then broken down into glucose, galactose, and allolactose by b-galactosidase. The allolactose is also converted to glucose and galactose by the same enzyme, b-galactosidase. The allolactose feeds back to bind with the lac repressor and enable the transcription process, which completes the positive feedback loop. Yildirim et al. (Mackey et al., 2004; Yildirim and Mackey, 2003) devised a mathematical model that takes into account the dynamics of the permease, internal lactose, b-galactosidase, the allolactose interactions with the lac repressor, and mRNA. The final model consists of five nonlinear delay differential equations, with delays due to the transcription and translation processes. In this study, we modified this model and eliminated the delay terms. This change reduced the original model to a five-dimensional system of ODEs. The equations of this model are given in Eqs. (12.25)–
(12.29). The values of the model parameters estimated from the published data are listed in Table 12.1. Details on the development of this model and the estimation of the parameters can be found in Mackey et al. (2004), Yildirim and Mackey (2003), and Yildirim et al. (2004) (Table 12.2). We studied this model using both deterministic and stochastic approaches. To see if the modified model captures the experimentally

Table 12.1 The model parameters estimated from experimental data (from Yildirim and Mackey, 2003)
n = 2                              μmax = 3.47 × 10⁻² min⁻¹
γM = 0.411 min⁻¹                   γB = 8.33 × 10⁻⁴ min⁻¹
γA = 0.52 min⁻¹                    Γ0 = 7.25 × 10⁻⁷ mM/min
K = 7200                           αM = 9.97 × 10⁻⁴ mM/min
KL1 = 1.81 mM                      αA = 1.76 × 10⁴ min⁻¹
KA = 1.95 mM                       αB = 1.66 × 10⁻² min⁻¹
γL = 0.0 min⁻¹                     βA = 2.15 × 10⁴ min⁻¹
αL = 2880 min⁻¹                    KL = 9.7 × 10⁻⁴ M
KLe = 0.26 mM                      γP = 0.65 min⁻¹
βL2 = 1.76 × 10⁴ min⁻¹             αP = 10.0 min⁻¹
K1 = 2.52 × 10⁻² (mM)⁻²            βL1 = 2.65 × 10³ min⁻¹
KL2 = 9.7 × 10⁻⁴ M
Table 12.2 The equations describing the evolution of the variables M, B, L, A, and P in the Yildirim–Mackey model for the lac operon

\[ \frac{d[M]}{dt} = \alpha_M \frac{1 + K_1 [A]^n}{K + K_1 [A]^n} + \Gamma_0 - \tilde{\gamma}_M [M], \tag{12.25} \]

\[ \frac{d[B]}{dt} = \alpha_B [M] - \tilde{\gamma}_B [B], \tag{12.26} \]

\[ \frac{d[L]}{dt} = \alpha_L \frac{[P][L_e]}{K_{Le} + [L_e]} - \beta_{L1} \frac{[P][L]}{K_{L1} + [L]} - \beta_{L2} \frac{[B][L]}{K_{L2} + [L]} - \tilde{\gamma}_L [L], \tag{12.27} \]

\[ \frac{d[A]}{dt} = \alpha_A \frac{[B][L]}{K_L + [L]} - \beta_A \frac{[B][A]}{K_A + [A]} - \tilde{\gamma}_A [A], \tag{12.28} \]

\[ \frac{d[P]}{dt} = \alpha_P [M] - \tilde{\gamma}_P [P]. \tag{12.29} \]

In this model, \(\tilde{\gamma}_i = \gamma_i + \mu\), for \(i \in \{M, B, L, A, P\}\).
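The five ODEs above can be integrated directly. The sketch below encodes Eqs. (12.25)–(12.29) with parameter values decoded as best we can from the garbled rendering of Table 12.1 (treat the numbers as illustrative; units are taken as printed, without reconciling mM versus M), and takes forward-Euler steps from small positive initial concentrations to confirm the trajectories stay positive and bounded.

```python
# Sketch of Eqs. (12.25)-(12.29). Parameter values are our best reading of
# Table 12.1 and should be treated as illustrative; units are used as
# printed (mM and min) without further reconciliation.
n, K, K1 = 2, 7200.0, 2.52e-2
aM, aB, aL, aA, aP = 9.97e-4, 1.66e-2, 2880.0, 1.76e4, 10.0
bL1, bL2, bA = 2.65e3, 1.76e4, 2.15e4
KL1, KL2, KLe, KA, KL = 1.81, 9.7e-4, 0.26, 1.95, 9.7e-4
gM, gB, gL, gA, gP = 0.411, 8.33e-4, 0.0, 0.52, 0.65
G0, mu, Le = 7.25e-7, 2.26e-2, 0.08

def rhs(M, B, L, A, P):
    # Effective decay rates include dilution by growth: g~_i = g_i + mu.
    gtM, gtB, gtL, gtA, gtP = (g + mu for g in (gM, gB, gL, gA, gP))
    dM = aM * (1 + K1 * A**n) / (K + K1 * A**n) + G0 - gtM * M
    dB = aB * M - gtB * B
    dL = (aL * P * Le / (KLe + Le) - bL1 * P * L / (KL1 + L)
          - bL2 * B * L / (KL2 + L) - gtL * L)
    dA = aA * B * L / (KL + L) - bA * B * A / (KA + A) - gtA * A
    dP = aP * M - gtP * P
    return dM, dB, dL, dA, dP

# Forward-Euler integration from small positive initial concentrations.
state = [1e-6, 1e-6, 1e-3, 1e-3, 1e-5]   # [M], [B], [L], [A], [P] in mM
dt = 0.001                               # minutes
for _ in range(10000):                   # integrate to t = 10 min
    d = rhs(*state)
    state = [s + dt * ds for s, ds in zip(state, d)]

print(state)
```

A fixed-step Euler scheme is adequate here only because the decay rates are modest on the chosen step; a production implementation would use an adaptive (and, for some parameter sets, stiff) solver.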
observed bistable behavior (Cohn and Horibata, 1959; Novick and Wiener, 1957; Ozbudak et al., 2004), we set the left-hand side of each equation in Eqs. (12.25)–(12.29) to zero and solved the resulting system of five nonlinear equations over a range of Le concentrations, keeping all the other parameters as in Table 12.1 with μ = 2.26 × 10⁻² min⁻¹. The result is shown in Fig. 12.10. Our modified model predicts that there is a physiologically meaningful range of external lactose concentration, corresponding to the S-shaped curve in this figure, in which the lac operon can have three coexisting steady states. Figure 12.11 shows how bistability manifests in the evolution of the b-galactosidase concentration in the deterministic simulation of the model. In this simulation, all parameters are kept constant as in Table 12.1 with μ = 2.26 × 10⁻² min⁻¹, and we chose [Le] = 53 × 10⁻³ mM. As shown in Fig. 12.10, there are three steady states for this particular concentration of Le; we calculated these steady state values numerically (Table 12.3). To produce this figure, the initial values of all the protein concentrations except the mRNA concentration were kept at their steady state values on the middle branch of the S-shaped curve for [Le] = 53 × 10⁻³ mM. Three initial values of the mRNA concentration were then chosen slightly below its steady state value on the middle branch, and another three were chosen slightly above its steady state
Figure 12.10 Bistability arises in the lac operon model as the external lactose (Le) concentration changes, for μ = 2.26 × 10⁻² min⁻¹. Notice that for these parameter values there exists a range of Le concentrations for which there are three coexisting steady states of the b-galactosidase concentration. Our calculations estimate this range as [0.026, 0.057] mM of [Le].
Figure 12.11 Semilog plot of the b-galactosidase concentration over time, showing the effect of the initial mRNA concentration around the middle branch of the S-shaped curve of Fig. 12.10 in the numerical simulation.
concentration on the same branch, and the model equations (12.25)–(12.29) were solved numerically for [Le] = 80 × 10⁻³ mM, which corresponds to an external lactose concentration with a steady state on the upper branch of the S-shaped curve. When the simulation is started from an initial point exactly on the middle branch of the S-shaped curve, the b-galactosidase concentration stays constant over time (the horizontal line in Fig. 12.11), as it is a steady state of the system. Since the middle branch is unstable, small perturbations around it can kick the simulation to either the lower or the upper stable branch of the S-shaped curve. All the other runs converge to the stable steady states on either the lower or the upper branch, as seen in this simulation. We observe that the runs started above the steady state mRNA concentration on the middle branch converged to the steady state on the upper branch, while the runs started below it converged to the steady state on the lower branch. Figure 12.12 shows the deterministic and stochastic simulations of the Yildirim–Mackey lac operon model. To produce this plot, we ran six simulations, choosing the steady state value on the lower branch of the S-shaped curve as the initial point for [Le] = 53 × 10⁻³ mM and μ = 2.26 × 10⁻² min⁻¹, while all other parameters were kept constant as in Table 12.1. As seen in this simulation, the average of the stochastic simulations is about the same as the solution of the differential equations. Since we pick the initial concentrations from the bistable region, there is a slow transition before reaching the steady state in both simulations.
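This initial-condition dependence can be reproduced with any bistable one-dimensional model. The sketch below integrates dA/dt = f(A) for a reduced model of the form of Eq. (12.23) with invented parameters (a1 = 0.05, Vm = 1.0, K = 100, b1 = 0.05, chosen to give bistability; these are not the lac operon parameters): starting just below the unstable steady state near A = 5, the solution relaxes to the low stable state, and starting just above it, to the high stable state.

```python
# Basin-of-attraction demonstration for a bistable 1-D model
# dA/dt = a1 + Vm*A^2/(K + A^2) - b1*A. The parameters are invented to
# put the model in its bistable regime (steady states near 1.4, 5, 14.6);
# they are unrelated to the lac operon model of this section.
A1, VM, K, B1 = 0.05, 1.0, 100.0, 0.05

def f(A):
    return A1 + VM * A**2 / (K + A**2) - B1 * A

def relax(A0, dt=0.05, steps=40000):
    """Forward-Euler integration until (effectively) steady state."""
    A = A0
    for _ in range(steps):
        A += dt * f(A)
    return A

low = relax(4.9)    # starts just below the unstable state near A = 5
high = relax(5.1)   # starts just above it
print(low, high)    # the two runs converge to different stable states
```

The escape from the neighborhood of the unstable state is slow (the flow there is weak), mirroring the long transients seen near the middle branch in Fig. 12.11.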
Table 12.3 The steady state values calculated from Eqs. (12.25)–(12.29) by setting the time derivatives to zero

Branch           [M*]          [B*]          [A*]          [L*]          [P*]
Lower branch     2.80 × 10⁻⁶   1.98 × 10⁻⁶   1.00 × 10⁻²   1.88 × 10⁻¹   4.17 × 10⁻⁵
Middle branch    5.33 × 10⁻⁶   3.78 × 10⁻⁶   2.04 × 10⁻²   2.11 × 10⁻¹   7.93 × 10⁻⁵
Upper branch     6.56 × 10⁻⁴   4.65 × 10⁻⁴   3.37 × 10⁻¹   2.46 × 10⁻¹   9.75 × 10⁻³

All the parameters are kept constant as in Table 12.1, with μ = 2.26 × 10⁻² min⁻¹ and [Le] = 53 × 10⁻³ mM, for which there exist three steady states (see Fig. 12.10).
Figure 12.12 Deterministic and stochastic simulation of the Yildirim–Mackey lac operon model given by Eqs. (12.25)–(12.29). The solid lines show the ODE solutions and the broken lines the results of the stochastic simulations. To produce this plot, we chose the steady state value on the lower branch of the S-shaped curve as the initial value for [Le] = 53 × 10⁻³ mM and μ = 2.26 × 10⁻² min⁻¹, kept all the other parameters constant as in Table 12.1, and ran six stochastic simulations for the external lactose concentration [Le] = 80 × 10⁻³ mM, which corresponds to a steady state value on the upper branch of the S-shaped curve in Fig. 12.10.
The deterministic model estimates this transition period at about 120 min. The stochastic simulations predict significant variance in this transition period and estimate that it may take up to 500 min for individual cells. We next investigate the effects of stochasticity in the bistable region. To this end, we ran the stochastic simulation eight times starting from the stable steady state on the lower branch of the S-shaped curve, and another eight times starting from the stable steady state on the upper branch, for [Le] = 53 × 10⁻³ mM. The results are shown in Fig. 12.13. In a bistable system, random fluctuations can push the system from one stable steady state to the other; the frequency of this transition is higher for systems with higher noise levels. We observe that all simulations starting from the lower branch of the S-shaped curve ended up converging to the stable steady state on the upper branch. However, simulations initialized at the upper branch never switch to the lower steady state and stay on the upper branch. This indicates that the steady state on the upper branch is more robust and more resistant to fluctuations in the protein concentrations than the steady state on the lower branch. As seen in this simulation, the time required to shift from the lower steady state to the upper steady state can change
Figure 12.13 Sixteen stochastic simulations of the Yildirim–Mackey lac operon model, with the same parameters used for Fig. 12.12. Eight simulations use the lower steady state value as the initial condition, while the others use the upper steady state. We observe that all the simulations starting from the lower branch converge to the upper steady state, and the simulations initialized from the upper branch stay on that steady state (only one of those simulations is plotted here).
significantly from one run to another. This transition can happen as early as 60 min and as late as 600 min. Another surprising result is the variance in the steady state levels of b-galactosidase and lactose. When [Le] = 53 × 10⁻³ mM, the steady state concentrations of b-galactosidase and lactose are around 50 and 23,000 mM, respectively. In general, we expected to see less variation when the concentration of a molecular species is high; in other words, relative noise is lower at high concentrations. However, our stochastic simulation results indicate that the relative noise is about the same for both of these species (results not shown). One conclusion we can draw from this simulation result concerns the sensitivity of the b-galactosidase concentration: significant changes in the concentration of b-galactosidase are not likely to have an impact on the entire system, because they will most likely be dominated by noise anyway.
5. Conclusions and Discussion

Here, we have presented a brief introduction to the mathematical modeling of regulatory biochemical reaction networks, with examples from enzyme kinetics and from coupled systems capable of displaying oscillatory dynamics and bistable behavior. We cover both deterministic and stochastic approaches and discuss bistability and its origin from a mathematical point
of view in Section 2. We give the lac operon as a real-life example and show that this system is capable of bistable behavior over physiologically meaningful parameter ranges. We compare the stochastic and deterministic simulation results for the lac operon. All numerical computations in this study were performed in MATLAB. Software packages are also freely available for deterministic and stochastic simulation of biological and ecological networks (Adalsteinsson et al., 2004; Kazancı, 2007). We studied the Yildirim–Mackey lac operon model using both deterministic and stochastic approaches and showed that the model is capable of producing three coexisting steady states, corresponding to the S-shaped curve in Fig. 12.10. The range of external lactose concentration for bistability is estimated as [0.026, 0.057] mM of [Le], which agrees well with the recent experimental results of Ozbudak et al. (2004). In the bistable region, our stochastic simulation results indicate that the stable steady state on the lower branch of the S-shaped curve is less robust against noise than the steady state on the upper branch. Furthermore, the fluctuations in the protein concentrations on the lower branch are strong enough to shift the lac operon to the stable steady state on the upper branch (Fig. 12.13). Both the deterministic and stochastic simulations predict a significant transition period from the bistable region ([Le] = 53 × 10⁻³ mM) to the fully induced state ([Le] = 80 × 10⁻³ mM). The deterministic model estimates this period at about 2 h, and the stochastic simulations predict that it may take as long as 500 min for individual cells (Fig. 12.12). To close, we would like to mention that both deterministic and stochastic methods have certain advantages and shortcomings.
Deterministic simulations describe the average behavior and are appropriate when the number of molecules in a system is large enough and the molecules are spatially homogeneous. When the number of molecules is small, stochastic methods simulate the system behavior much better. Another major shortcoming of deterministic simulation shows up in systems capable of multiple steady states: the deterministic solution of such a system always converges to a single stable steady state and stays there forever, whereas in real life constant switching behavior among steady states may happen due to inherent fluctuations within the system, as shown in Fig. 12.8. Dynamics such as this can only be captured by stochastic methods. On the other hand, deterministic methods are often computationally more efficient and easier to implement. In Section 3, we went over two systems that display significantly different behavior when simulated by deterministic and stochastic methods. Stochasticity may play a crucial role in the regulation of a dynamical system. Many other biological systems have been observed with such properties, including systems that display noise-induced stability (D'Odorico et al., 2005) or stochastic resonance (Gammaitoni et al., 1998).
ACKNOWLEDGMENT

The work was partially supported by the New College of Florida Faculty Development Funds.
REFERENCES

Adalsteinsson, D., McMillen, D., and Elston, T. C. (2004). Biochemical network stochastic simulator (BioNetS): Software for stochastic modeling of biochemical networks. BMC Bioinform. 5(24), 1–21.
Beckwith, J. (1987). The lactose operon. In "Escherichia coli and Salmonella: Cellular and Molecular Biology," (F. C. Neidhardt, J. L. Ingraham, K. B. Low, B. Magasanik, and H. E. Umbarger, eds.), Vol. 2, pp. 1444–1452. American Society for Microbiology, Washington, DC.
Cohn, M., and Horibata, K. (1959). Inhibition by glucose of the induced synthesis of the b-galactosidase-enzyme system of Escherichia coli: Analysis of maintenance. J. Bacteriol. 78, 613–623.
D'Odorico, P., Laio, F., and Ridolfi, L. (2005). Noise-induced stability in dryland plant ecosystems. Proc. Natl. Acad. Sci. USA 102(31), 10819.
Gammaitoni, L., Hanggi, P., Jung, P., and Marchesoni, F. (1998). Stochastic resonance. Rev. Mod. Phys. 70(1), 223–287.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340.
Gillespie, D. T. (2000). The chemical Langevin equation. J. Chem. Phys. 113, 297.
Gillespie, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems. J. Chem. Phys. 115, 1716.
Gillespie, D. T., and Petzold, L. R. (2003). Improved leap-size selection for accelerated stochastic simulation. J. Chem. Phys. 119, 8229.
Kazancı, C. (2007). EcoNet: A new software for ecological modeling, simulation and network analysis. Ecol. Modell. 208(1), 3–8 (Special Issue on Ecological Network Theory).
Lewin, B. (2008). Genes. 9th edn. Jones and Bartlett Publishers, Sudbury, Massachusetts.
Mackey, M., Santillán, M., and Yildirim, N. (2004). Modeling operon dynamics: The tryptophan and lactose operons as paradigms. C. R. Biol. 327(3), 211–224.
Novick, A., and Wiener, M. (1957). Enzyme induction as an all-or-none phenomenon. Proc. Natl. Acad. Sci. USA 43, 553–566.
Ozbudak, E., Thattai, M., Lim, H., Shraiman, B., and Oudenaarden, A. (2004). Multistability in the lactose utilization network of Escherichia coli. Nature 427, 737–740.
Schleif, R. (2000). Regulation of the L-arabinose operon of Escherichia coli. Trends Genet. 16(12), 559–565.
Scott, S. K. (1991). Chemical Chaos. 1st edn. Oxford University Press, Oxford and New York.
Wilkie, J., and Wong, Y. M. (2008). Positivity preserving chemical Langevin equations. Chem. Phys. 353(1–3), 132–138.
Yildirim, N., and Mackey, M. (2003). Feedback regulation in the lactose operon: A mathematical modeling study and comparison with experimental data. Biophys. J. 84(5), 2841–2851.
Yildirim, N., Santillán, M., Horike, D., and Mackey, M. (2004). Dynamics and bistability in a reduced model of the lac operon. Chaos 14(2), 279–292.
CHAPTER THIRTEEN
Multivariate Neighborhood Sample Entropy: A Method for Data Reduction and Prediction of Complex Data

Joshua S. Richman

Contents
1. Introduction
2. Current Methods and Limitations
3. k-Nearest Neighbors
4. Sample Entropy
5. Multivariate Neighborhood Sample Entropy: MN-SampEn
6. Relationship Between kNN and MN-SampEn
7. Relationship Between SampEn and MN-SampEn
8. Applying MN-SampEn to Proteomics Data
9. Algorithmic Implementation and Optimizing Tolerances
10. Results
11. Discussion
12. Limitations and Future Directions
References
Department of Medicine, Division of Preventive Medicine, University of Alabama School of Medicine, Birmingham, Alabama, USA

Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87013-8

Abstract
The analysis of large and complex databases poses many challenges. Such databases arise in health services, electronic medical records, insurance, and other commercial data sources where both the number of observations and the number of variables can be enormous. The problems are particularly acute in genomics and proteomics, where the number of variables is typically much higher than the number of observations. Extant methods seek to balance the demands of making efficient use of the data with the need to maintain the flexibility required to detect complex relationships and interactions. To overcome some limitations of current methods, a novel analytical tool, Multivariate Neighborhood Sample Entropy (MN-SampEn), is introduced. It is a generalization of Sample Entropy to multivariate data that inherits many of Sample Entropy's desirable properties. In principle, it selects significant covariates without
© 2011 Elsevier Inc. All rights reserved.
reference to an underlying model and provides predictions similar to those of k-Nearest-Neighbor methods, with fewer covariates required. However, adaptation to multivariate data requires that several additional optimization issues be addressed. Several optimization strategies are discussed and tested on a set of MALDI mass spectra. With some optimization strategies, MN-SampEn identified a reduced set of covariates and exhibited lower predictive error rates than k-Nearest Neighbors.
1. Introduction

The analysis of large and complex databases poses many challenges. Such databases arise in health services, electronic medical records, insurance, and other commercial data sources where both the number of observations and the number of variables can be enormous. The problems are particularly acute in genomics and proteomics, where the number of variables is typically much higher than the number of observations. The task is often to identify which covariates out of thousands, or even millions, are associated with a specific outcome such as a disease state. Often, the number of covariates is orders of magnitude larger than the number of observations, adding the risks of false discovery and overfitting. Along with the huge number of potential covariates is the possibility that important information is contained in complex interactions that simple methods would miss. These burgeoning challenges have been met by new and improved models and algorithms for classification and prediction. In this work, I will provide a brief overview of the strengths and limitations of these methods. I will then present a newly developed method, Multivariate Neighborhood Sample Entropy (MN-SampEn), that adapts and extends Sample Entropy (SampEn) statistics to multivariate data. Next will be an exposition of methods for optimizing MN-SampEn, and an example of its application to a proteomics dataset.
2. Current Methods and Limitations

The overview and discussion of current methods is slanted toward the analysis of mass spectrum data but is more broadly applicable. Analysis typically proceeds in two steps: data reduction/feature selection, followed by modeling or classification, though sometimes the two are done simultaneously. The goal of data reduction is to identify a reduced and more tractable set of covariates that retains most of the important information. When done as a separate step, data reduction typically proceeds without considering the outcome, using techniques like peak-picking, wavelets,
principal components, singular value decomposition, and clustering methods. The methods used for the second step can be broadly divided into statistical models and algorithmic classifiers. The statistical models begin by assuming the functional form of the relationships between the covariates and the outcome (i.e., specifying the model) and then use the study, or training, data to estimate the coefficients that best fit the model. Examples include linear, logistic, multinomial, and Poisson regression, along with spline methods and generalized additive models. The algorithmic methods are most easily described, as the name suggests, by specifying the algorithm to be iteratively applied rather than a particular model. Examples include k-Nearest Neighbors (kNN), Classification and Regression Trees (CART, or recursive partitioning), Support Vector Machines, and Neural Networks. The statistical models can typically be built from relatively small training sets. This efficient use of the data comes at the price of specifying the form of the model beforehand. Such models usually have difficulty with truly complex interactions and can be expected to be somewhat biased due to misspecification. The algorithmic methods, on the other hand, are generally more flexible and better able to accommodate complex interactions, but at the expense of requiring larger datasets for training. In general, no method is expected to be best in all circumstances (Caruana and Niculescu-Mizil, 2006). Proteomics and genomics data present a potentially difficult situation, combining complex and multivariate data with small sample size. What is needed is a method that can perform well when trained on a modest sample while remaining flexible enough to deal with complex relationships. It seems reasonable to suppose that the most flexible approach would be one that does not restrict the form of the relationships at all.
3. k-Nearest Neighbors

k-Nearest Neighbors (kNN) is such a method and, despite its simplicity, continues to perform fairly well given large training sets. It relies essentially on the most basic assumption underlying all prediction: that observations with similar characteristics will tend to have similar outcomes. Nearest Neighbor methods assign a predicted value to a new observation based on the plurality or the (sometimes weighted) mean of its k "nearest neighbors" in the training set. Given an infinite amount of data, any observation will have many "neighbors" that are arbitrarily near with respect to all measured characteristics, and the variability of their outcomes will provide as precise a prediction as is theoretically possible, barring a perfectly and completely specified model. However, given that we never have an infinite amount of data, the actual utility of this asymptotic property
is questionable, especially for modest datasets. Unfortunately, because predictions rely solely on a collection of stored observations, kNN is computationally and memory-intensive and sensitive to the curse of dimensionality. See the paper by Friedman for a more complete discussion (Friedman, 1994).
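The plurality-vote rule described above can be sketched in a few lines. This is a minimal illustration with hypothetical toy data, not the analysis pipeline used later in the chapter; a real analysis would use an optimized library implementation.

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the outcome of `query` as the plurality vote of its k nearest
    training observations under squared Euclidean distance."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two covariates, three observations per disease state.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
y = ["healthy", "healthy", "healthy", "TB", "TB", "TB"]
print(knn_predict(X, y, (0.15, 0.15), k=3))   # healthy
```

Note that the neighborhood here is defined by the count k, not by a distance threshold; this distinction is the crux of the comparison with MN-SampEn below.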
4. Sample Entropy

Another method that has been successfully applied to analyzing complex and noisy data, albeit in a different context, is SampEn. SampEn developed from techniques related to information theory and the analysis of nonlinear or "chaotic" processes (Richman, 2007). Those processes have the property of sensitive dependence on initial conditions, meaning that two trajectories will diverge over time even if their initial conditions are arbitrarily close. As a consequence, such series cannot, even in principle, be modeled precisely enough for long-range forecasting. Because of this limitation, research efforts turned toward measuring and estimating characteristic properties such as the correlation dimension, Lyapunov exponents, and Kolmogorov-Sinai entropy. There has been increasing interest in applying these methods to the analysis of complex physiological time series. However, these methods are ill-suited for this purpose for two reasons. First, having been developed for the analysis of mathematical and physical processes, they require series with more observations and, most importantly, much less noise than is ubiquitous in physiologic time series due to measurement error and the interaction of multiple physiologic pathways. Second, they had no established statistical properties to allow for hypothesis testing. SampEn statistics were developed to provide a measure of complexity, related to Shannon's information-theoretic entropy, that is suitable for physiologic time series, which are often short and noisy (Richman and Moorman, 2000). In essence, SampEn estimates the probability that if two epochs of a signal are similar to one another, they will continue to be similar when incremented forward. This can be understood as a measure of the rate of information production. In a highly predictable signal, two epochs following similar trajectories are very likely to remain similar.
On the other hand, given a completely random signal, two trajectories will only be similar purely due to chance and are no more likely to remain similar than any two random points. Between these two extremes are a range of complex processes with variable short-term predictability. SampEn statistics have since been successfully applied to many different types of physiologic time series ranging from heart rate variability to endocrinology to studies of the human gait.
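A minimal SampEn estimator, following the template-matching description above (Chebyshev distance between epochs, self-matches excluded; the toy signal is illustrative):

```python
import math

def sample_entropy(x, m=2, r=0.2):
    """SampEn(m, r) = -ln(A/B): B counts pairs of length-m epochs within
    tolerance r (Chebyshev distance); A counts those pairs that remain
    within r when each epoch is incremented forward by one point."""
    n = len(x)
    B = A = 0
    # Restrict to epochs whose (m+1)-point extension exists, so that
    # B and A are computed over the same set of template pairs.
    for i in range(n - m):
        for j in range(i + 1, n - m):
            if max(abs(x[i + k] - x[j + k]) for k in range(m)) <= r:
                B += 1
                if abs(x[i + m] - x[j + m]) <= r:
                    A += 1
    return -math.log(A / B)

# A strictly periodic signal is perfectly predictable: epochs that match
# always continue to match, so A = B and SampEn = -ln(1) = 0.
periodic = [0.0, 1.0] * 20
print(abs(sample_entropy(periodic, m=2, r=0.2)))   # 0.0
```

In practice, r is usually expressed as a fraction of the series' standard deviation, just as the tolerances in the proteomics analysis later in this chapter are expressed as fractions of each peak's standard deviation.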
5. Multivariate Neighborhood Sample Entropy: MN-SampEn

A crucial observation is that SampEn measures the probability that a pair of observations that are similar for a set of observations (a vector of sequential observations in a time series) are also similar for an additional observation, or outcome. Optimizing the metric definition of being "similar" essentially defines when a pair of observations can be considered neighbors. Furthermore, nothing in this view restricts the analysis to time series. This suggests that, with suitable modifications, the basic idea of SampEn can be transplanted and adapted to the analysis of multivariate data. That generalization is presented here as MN-SampEn. It combines the model-free flexibility of Nearest Neighbor methods with an approach developed for small and noisy datasets, and thus has the potential to overcome some limitations of current methods.

The formal definition of MN-SampEn is as follows. Assume the dataset has n observations, each with t independent covariates and one dependent variable, or outcome. The jth variable of the ith observation is denoted xi,j and the outcome oi. Further assume that a metric d is assigned to each covariate and to the outcome, along with a vector r of thresholds, where rj is the threshold for the jth covariate and ro is the threshold for the outcome. Define a kernel function

    Cj(xi,j, xk,j) = 1 if d(xi,j, xk,j) ≤ rj; 0 if d(xi,j, xk,j) > rj.

The observed mean value of this kernel is pr(j) = E(Cj(xi,j, xk,j)), the estimated probability that two observations "agree" within rj on the jth variable. This extends most simply to a kernel defined on a subset S of variables:

    CS,r(xi,S, xk,S) = 1 if d(xi,j, xk,j) ≤ rj for all j ∈ S; 0 otherwise,

with the corresponding empirical mean pr(S) = E(CS,r(xi,S, xk,S)).
The quantity pr(S) can be viewed as the empirical probability that two observations "agree" on the set S of independent variables with tolerance r; that is, the number of pairs of observations for which xi,S ≈ xk,S, divided by the number of distinct pairs. In the absence of any missing data, the number of distinct pairs is n(n − 1)/2. Similarly, the quantity pr(S, o) denotes the probability of agreement on the set S of independent variables and on the outcome o. The overall measure of how well similarity on S is related to similarity on o is
    MNr(S, o) = −ln[ pr(S, o) / pr(S) ],

the negative logarithm of the conditional probability that two observations agreeing on S also agree on o. This is exactly analogous to SampEn (Richman and Moorman, 2000) and quantifies the uncertainty of o given similarity on S. An important related quantity, informally referred to as the uncertainty index of the set of covariates S, is

    UIr(S) = [MNr(o) − MNr(S, o)] / MNr(o),

which represents the proportion of the baseline uncertainty in o accounted for by S. Here MNr(o) = −ln pr(o), where pr(o) is the probability that the outcomes of a pair of observations agree within ro. The general strategy is to find a set S of predictors and a vector r of tolerances that minimizes MNr(S, o) or, equivalently, maximizes UIr(S).
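The definitions above translate directly into code. The following sketch (function names and toy data are illustrative, not from the chapter's implementation) estimates MNr(S, o) and UIr(S) for a categorical outcome, for which ro = 0:

```python
import math
from itertools import combinations

def mn_sampen(X, o, S, r, r_o=0.0):
    """MN_r(S, o) = -ln[p_r(S, o) / p_r(S)], estimated from the count of
    pairs agreeing within r on every covariate in S and the count of
    those pairs that also agree on the outcome within r_o."""
    n_S = n_So = 0
    for i, k in combinations(range(len(X)), 2):
        if all(abs(X[i][j] - X[k][j]) <= r[j] for j in S):
            n_S += 1
            n_So += abs(o[i] - o[k]) <= r_o
    return math.log(n_S / n_So)          # = -ln(n_So / n_S)

def mn_baseline(o, r_o=0.0):
    """MN_r(o) = -ln p_r(o): baseline uncertainty of the outcome alone."""
    pairs = list(combinations(range(len(o)), 2))
    agree = sum(abs(o[i] - o[k]) <= r_o for i, k in pairs)
    return math.log(len(pairs) / agree)

# Toy data: covariate 0 separates the outcome classes; covariate 1 is noise.
X = [(0.0, 0.3), (0.1, 0.9), (0.05, 0.5), (1.0, 0.2), (1.1, 0.8), (1.05, 0.6)]
o = [0, 0, 0, 1, 1, 1]
r = [0.2, 0.2]
mn = mn_sampen(X, o, S=[0], r=r)
ui = (mn_baseline(o) - mn) / mn_baseline(o)
print(mn, round(ui, 2))   # 0.0 1.0: agreement on covariate 0 removes all uncertainty in o
```

Every pair that agrees on covariate 0 within r also agrees on the outcome, so MNr(S, o) = 0 and the uncertainty index is 1.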
6. Relationship Between kNN and MN-SampEn

Both methods categorize a new observation based on the elements of the training set in its "neighborhood." However, kNN methods use neighborhoods defined by the parameter k: for each point, the size of its neighborhood is made as large as necessary to contain k elements of the training set. As a consequence, observations located in a dense region of the covariate space will have k neighbors that are relatively similar, and hence accurate predictions. Observations in a sparse region of the covariate space will acquire neighbors that are relatively distant and hence less similar, leading to less accurate predictions. Refinements of kNN methods adaptively optimize which variables are most influential in defining the neighborhood (Domeniconi et al., 2000, 2002; Friedman, 1994), at the expense of increased computational complexity. While this helps to overcome the curse of dimensionality, there is still little general guidance for optimally choosing k. In contrast, MN-SampEn identifies a single neighborhood definition that is used around all points. Thus, predictions for new observations will be based on a variable number of neighbors and a fixed neighborhood, and the precision and accuracy of those predictions will vary with the density of the observation's location in the covariate space.
7. Relationship Between SampEn and MN-SampEn

SampEn begins with a time series x, whose ith point is xi, and divides it into vectors ui of a specified length m + 1: the ith vector is ui = (xi, xi+1, . . ., xi+m). In the terminology of nonlinear dynamics, the set of subsequences is a sampled reconstructed phase space of the underlying process. SampEn measures the probability that if two vectors are within a tolerance r for the first m points, they remain within that tolerance for the (m + 1)st point.
SampEn has been demonstrated to have well-behaved statistical properties in theory (Richman, 2007) and empirically, even for small samples (Richman and Moorman, 2000). However, SampEn statistics are more complicated than MN-SampEn because the vectors used in SampEn calculations are both dynamically dependent on one another and formally dependent due to overlap. In many cases, the observations used for MN-SampEn can be considered independent. Optimizing SampEn requires finding the best values of m and r; however, since the marginal distributions of the vector components are expected to be the same, only one value of r needs to be identified. Furthermore, different choices of m can be evaluated sequentially, beginning with m = 1, until SampEn is minimized. MN-SampEn, by contrast, starts with a larger set of potential predictors, each with a potentially distinct marginal distribution, and must identify an optimal subset. Unless the outcome is categorical, its threshold must also be optimized.
8. Applying MN-SampEn to Proteomics Data

The practical issues of applying MN-SampEn are discussed using a set of MALDI mass-spectrometry data supplied by Drs. James Mobley and Senait Asmellash, the Director and co-Director of the University of Alabama at Birmingham Comprehensive Cancer Center Clinical Proteomics Core. It consists of MALDI spectra, generated by a Bruker Ultraflex III, of serum samples from 240 subjects divided by disease state: 27 (11%) were HIV+; 51 (21%) had tuberculosis (TB) and were HIV+; 55 (23%) had TB; 58 (24%) were PPD+ but without TB; and 49 (20%) were apparently healthy. Preprocessing was done with SpecAlign (Wong et al., 2005) and included baseline subtraction, total ion-current normalization, denoising, peak-picking, and alignment. A total of 276 peaks were identified. The analytical dataset consisted of a random identifier, intensities for each of the peaks, denoted pi, and an indicator for disease state. This dataset was chosen to be representative, not to be particularly advantageous for MN-SampEn analysis.
9. Algorithmic Implementation and Optimizing Tolerances

MN-SampEn was applied iteratively. The first step is to calculate MNr(pi, o) separately for each of the 276 peaks pi. For each peak, MNr(pi, o) was calculated using 10 different values of ri, expressed as fractions of the standard deviation si of the intensity of the ith peak, ranging from 0.05si to 0.5si. The covariate pi that optimizes MNr(pi, o) will be denoted by p(1), its associated tolerance by r(1), and MNr(p(1), o) will be denoted by
MNr(S1, o). In this case, the outcome o is categorical and the tolerance ro was taken to be the identity metric, so there was no need to optimize it. However, in the more general case of a continuous or ordinal outcome, a range of values would need to be explored for ro as well. The selected optimal MNr(pi, o) thus includes specified values for the thresholds r(1) and ro, which are carried forward for subsequent iterations. The next step is to calculate MNr(S1 ∪ pi, o) for all pi ∉ S1, again considering a range of values for each ri. The optimal choice will be MNr(S2, o), where S2 = p(1) ∪ p(2). The cycle repeats, choosing the optimal MNr(Sn, o) = MNr(Sn−1 ∪ p(n), o), until there is no pi such that MNr(Sn ∪ pi, o) < MNr(Sn, o). That is, it terminates when there is no additional covariate whose inclusion in the current set improves performance. It is a simple matter to choose at each step the smallest MNr(S, o), but it is not clear that this is the best choice. Consider the case where two observations are both very different from the rest of the data and very similar to one another, including having the same outcome. If, in the first step, a covariate and tolerance were chosen stringent enough that those two were the only observations within each other's neighborhoods and every other observation was alone in its neighborhood, then MNr(S1, o) = 0, indicating perfect classification. However, it would also be perfectly useless, since it would pertain only to that one pair of observations. In less extreme terms, to obtain results that are statistically robust and meaningful, the neighborhood definition must remain sufficiently expansive that a reasonable number of observations have neighbors. This general issue has also been addressed for SampEn statistics, where the point was made that a balance must be struck in the choice of tolerance r and vector length m.
As r decreases and m increases, there is a higher probability that similarities are genuine, but there are also fewer pairs of similar vectors and thus less stable estimates. For MN-SampEn the situation is more complicated but analogous. As more predictors are added, pairs of observations that match within r (i.e., pairs in the same neighborhood) are more likely to be genuinely similar, but too many observations will have no neighbors, limiting the utility. There are two main approaches to ensuring that neighborhoods are not too restrictive. The first is to mandate that, as each new predictor is added, a minimum number of pairs of observations match within r. MN-SampEn analysis was run on the test dataset with several options for setting this minimum number of matches. The first method was to set a minimum number m that was the same for all steps. Runs were done with m = 10 (very low) and with m as a range of constant multiples of the number of observations (m = j·240, for j in 1-10). Another strategy for assigning m recognizes that the number of pairwise matches is expected to decrease as more predictors are added, and uses a value of m that varies with each step. Setting m(i) to be the minimum number of
pairwise matches for the set of i predictors Si, and letting n denote the number of observations, the formula used was m(i) = n((0.05j + i)/i), where j ranged from 1 to 10. Thus m(i) begins at m(1) = n(1 + 0.05j) and trends downward, asymptotically approaching n as i increases. An alternative method is still more adaptive. Instead of mandating a single threshold m or a set of iteration-specific thresholds m(i), it mandates that each member of the training set have a minimum number of neighbors. This ensures that the neighborhood never becomes so restrictive that it is not broadly applicable. The second issue to consider is how, at each step, to choose the optimal predictor p(n). As noted above, it may not be optimal to simply choose the smallest MNr(Sn−1 ∪ p(n), o), because that choice may not rely on enough pairwise matches for further refinement, even with guidelines for the minimum number of matches. In addition to that simple approach, two other methods were explored for this analysis, based on the uncertainty index UIr(S) and the number of pairwise matches. Note that, in contrast to the sequence of MNr(Sn, o), which is monotonically decreasing, the UIr(Sn) are monotonically increasing. Denote the number of pairwise matches over the set of predictors Sn−1 ∪ pi by Σr(Sn−1 ∪ pi). The two alternate optimization methods use slightly different strategies to balance the competing demands of accuracy and number of matches. The first chooses the predictor that maximizes the product UIr(Sn)·Σr(Sn−1 ∪ pi), referred to as Count·UI. The second maximizes the product UIr(Sn)·log(Σr(Sn−1 ∪ pi)) and is referred to as log(Count)·UI. Each rewards higher UIr(Sn) and Σr(Sn−1 ∪ pi), but they differ in the relative weight given to Σr(Sn−1 ∪ pi). Predicted outcomes were assigned to be the most common outcome in the observation's neighborhood. Ties were broken at random.
When a particular iteration identified several covariates, it was often the case that a given observation had no neighbors in the neighborhood defined by all of them. In this case, the algorithm estimated the predicted probability using a subneighborhood defined by the first n covariates, where n was chosen as the largest number such that the observation had neighbors in the neighborhood defined by the first n covariates but none in the neighborhood defined by the first n + 1.
10. Results

The best results from applying MN-SampEn to the test dataset are shown in Table 13.1 and compared to results from kNN. All kNN analyses used k = 3, which was found to perform better in almost all cases than k = 1 or k = 2. Nearest Neighbor predicted values are shown both for the entire dataset and for the set of covariates used by each optimized MN-SampEn analysis. The results highlight the importance of ensuring that an
Table 13.1 Comparing optimized MN-SampEn results to k-Nearest Neighbors for classifying 240 mass spectra by disease state

Match                   Selection       No. of          MN-        Error rate (%)                kNN (k = 3) error rate (%)
threshold(a)            method          covariates(b)   SampEn(c)  Covariates(d)  Best subset(e)  Covariates  Full data
None                    UI              3               0.00       82             59              59          50
Adaptive                log(Count)·UI   11              0.02       84             48              53          50
Min. neighbors, m = 1   log(Count)·UI   7               0.77       47             44              48          50
Min. neighbors, m = 2   log(Count)·UI   6               0.81       42             42              53          50

(a) Minimum number of pairwise matches required at each step.
(b) Number of covariates selected by MN-SampEn.
(c) Value of MN-SampEn.
(d) Error rate using the covariates selected by MN-SampEn.
(e) Error rate using the best subset of covariates selected by MN-SampEn.
adequate number of matches is detected at each step when predictive accuracy is the ultimate goal. The first two rows show that MN-SampEn was minimized when no limit was imposed and the optimization method was either strict use of the uncertainty index or log(Count)·UI. In both cases, the error rate of predictions was high, 82% and 84% respectively, compared to 50% for kNN. However, when examining the best subset of covariates identified by MN-SampEn, performance is similar to kNN, with error rates of 59% and 48%. When the set of covariates used for kNN analysis was restricted to those selected by MN-SampEn, kNN error rates rose to 59% and 53%, indicating that performance was diminished but that much of the information was retained in the reduced set of covariates. The best predictive performance for MN-SampEn was achieved by the methods that mandated that each observation have a minimum of either one (m = 1) or two (m = 2) neighbors. For both cases the best optimization method was again log(Count)·UI. For m = 1 and m = 2 respectively, the error rates were 47% and 42%, both lower than the kNN rate. Again, restricting the kNN analysis to the covariates selected by MN-SampEn indicates that the reduced set retains most of the information. Within this general approach, the specific choice of optimization method (UI vs. Count·UI vs. log(Count)·UI) had a reduced impact, but the log(Count)·UI method still showed consistently better results.
11. Discussion

These results show that the general approach of MN-SampEn has potential both as a predictive algorithm and as a data-reduction method. It is clear, however, that merely minimizing MN-SampEn without regard to other factors results in high predictive error. This is likely because MN-SampEn is most reduced by identifying a set of covariates whose predictions are very good for the portion of observations that match on all of them, but quite poor for observations that match on only some of the covariates. Even within these strategies, however, it was apparent that the best subset of the selected covariates, usually the first two to four, led to predictive accuracy comparable to kNN. The most successful approach was based on mandating that each observation have a minimum number of neighbors. Predictive error rates were consistently lower than for kNN using a much-reduced set of covariates. However, it is important to note that this gain came at the expense of a considerable increase in computational complexity compared to kNN. MN-SampEn has the potential to be a useful tool for the analysis of large and complex datasets. The results presented here show that it can perform as well as or better than a thematically similar algorithm when properly
optimized. Because it derives from nonparametric entropy estimation, its covariate selection and data reduction may complement existing methods whose selection criteria are more subject to bias from model specification. Thus, the selected covariates may improve other methods even if MN-SampEn's own neighborhood-based predictions are suboptimal. Future studies will apply MN-SampEn to much larger databases to ascertain whether it continues to provide similar or improved performance relative to kNN with fewer covariates. Further development is required to maximize its utility and simplify its implementation.
12. Limitations and Future Directions

The utility of MN-SampEn was potentially limited in this study by the small number of observations in the dataset. However, this size is typical of current proteomics studies, and this factor should have affected kNN similarly. More work is needed to compare data reduction based on MN-SampEn to other methods and to compare additional predictive algorithms. The current implementation of MN-SampEn in R is very computationally intensive. This will improve as some of the multiple optimization strategies are discarded; a more efficient implementation suitable for distribution is under development.
REFERENCES

Caruana, R., and Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168.
Domeniconi, C., Peng, J., and Gunopulos, D. (2000). Locally adaptive metric nearest-neighbor classification. Technical Report UCR-CSE-00-01, pp. 1–26.
Domeniconi, C., Peng, J., and Gunopulos, D. (2002). Locally adaptive metric nearest-neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell. 24, 1281–1285.
Friedman, J. H. (1994). Flexible metric nearest neighbor classification. Technical Report, pp. 1–32.
Richman, J. S. (2007). Sample entropy statistics and testing for order in complex physiological signals. Commun. Stat. Theory Methods 36, 1005–1018.
Richman, J. S., and Moorman, J. R. (2000). Physiological time-series analysis using approximate entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 278, H2039–H2049.
Wong, J. W., Cagney, G., and Cartwright, H. M. (2005). SpecAlign: Processing and alignment of mass spectra datasets. Bioinformatics 21, 2088–2090.
CHAPTER FOURTEEN
Scaling Differences of Heartbeat Excursions Between Wake and Sleep Periods

L. Guzmán-Vargas,* I. Reyes-Ramírez,* R. Hernández-Pérez,† and F. Angulo-Brown‡

Contents
1. Introduction
2. Methods
2.1. The segmentation method
2.2. Correlations
2.3. Allan variance
3. Data Analysis
3.1. Distributions of excursions
3.2. Correlations of excursions
3.3. Excursions and simulated noise
3.4. Stability of excursions
4. Conclusions
Acknowledgments
References
* Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas, Instituto Politécnico Nacional, México D.F., México
† SATMEX, Av. de las Telecomunicaciones S/N CONTEL Edif. SGA-II, México D.F., México
‡ Departamento de Física, Escuela Superior de Física y Matemáticas, Instituto Politécnico Nacional, México D.F., México

Abstract
We study the statistical properties of excursions in heart interbeat time series. An excursion is defined as the time employed by a walker to return to its mean value. We consider the homeostatic property of heartbeat dynamics as a departing point to characterize the dynamics of excursions in beat-to-beat fluctuations. Scaling properties of excursions during wake and sleep periods are compared for two groups: 16 healthy subjects and 11 patients with congestive heart failure (CHF). We find that the cumulative distributions of excursions for both groups follow stretched exponential functions of the form g(t) ≈ exp(−a t^b), with different fitting parameters a and b leading to different
Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87014-X
© 2011 Elsevier Inc. All rights reserved.
decaying rates. Our results show that the average characteristic scale associated with the excursion distributions is greater for healthy data than for CHF patients, whereas sleep-wake transitions are more significant for healthy data. Next, we explore changes in the distributions of excursions when considering (i) a shifted mean value to define an excursion and (ii) the sum of the kth excursion successor. In addition, the presence of temporal correlations in the excursion sequences is evaluated by means of detrended fluctuation analysis. We observe the presence of long-range correlations for healthy subjects, whereas for the CHF group the correlations are described by two regimes: over short scales the fluctuations are close to uncorrelated noise, while over large scales the fluctuations reveal long-range correlations. Finally, we apply a stability analysis of excursions based on the Allan variance, which reveals that healthy dynamics is more stable than heart-failure dynamics.
1. Introduction

Heartbeat interval fluctuations have been studied by means of methodologies from different areas of science. In particular, methods from statistical physics have been applied to physiological signals, revealing important information about the dynamics of such systems. These findings have made it possible to characterize healthy heartbeat dynamics as fluctuations exhibiting 1/f behavior, long-range correlations, and a broad multifractal spectrum (Goldberger et al., 2002; Ivanov et al., 1999a; Sugihara et al., 1996). An important aspect of interbeat variability is that healthy systems have complex self-regulating mechanisms operating over multiple timescales and may generate signals with scaling properties. Recent research focused on evaluating the complexity of heart rate variability has revealed that some fractal and scaling features characterizing healthy dynamics change under pathological conditions (Amaral et al., 1998; Goldberger et al., 2002; Guzmán-Vargas and Angulo-Brown, 2003; Guzmán-Vargas et al., 2005; Schmitt and Ivanov, 2007). Another important characteristic of heartbeat time series is their nonstationarity, related to the large number of control mechanisms of the heart and to external stimuli. However, when the interbeat sequences are observed locally, one can roughly define a local mean value. From a physiological point of view, the presence of locally stationary segments can be understood as the capability of the system to preserve an approximately constant value, although only for a limited period of time. It is also argued that, according to the homeostasis principle, biological systems tend to maintain a constant output in spite of continual perturbations (Ivanov et al., 1998).
In fact, a previous study that detected locally stationary segments of heartbeat interval time series revealed that the distribution of these stationary segments follows a power-law behavior and that the scaling exponent characterizing the distribution is the same for healthy and heart failure groups (Bernaola-Galván et al.,
Scaling Differences of Heartbeat Excursions Between Wake and Sleep Periods
2001). Here we focus our attention on the scaling properties of excursions, defined as the period of time employed by a walker to return to its average value, in order to evaluate the capability of the system to preserve a mean output value and to compare differences between wake and sleep periods. The problem of the first return time has been studied in contexts such as financial indices, intermittency, seismic activity, and simulated noise (Carpena et al., 2004; Ding and Yang, 1995; Fengzhong et al., 2006; Liebovitch and Yang, 1997). There are also important results on zero-crossing probabilities for Gaussian long-term correlated data (Newell and Rosenblatt, 1962). We recently reported important scaling characteristics of excursions for healthy and pathological cardiac dynamics (Reyes-Ramírez and Guzmán-Vargas, 2010). Our results on the probability distributions and the time organization of excursions indicate changes when healthy and heart failure dynamics are compared. In this chapter, we address the question of whether there are also changes in the scaling properties and in the stability of excursions between diurnal and nocturnal periods under healthy and heart failure conditions. Our approach is based on methods from statistical physics and nonlinear dynamics: a segmentation algorithm, detrended fluctuation analysis (DFA), and the Allan variance (AVAR).
2. Methods

2.1. The segmentation method

Many physiological signals are nonstationary, meaning that their statistical properties change over time. For this type of signal, a global analysis of the statistical characteristics can lead to erroneous or misleading conclusions. Under certain conditions, a nonstationary time series can be regarded as a concatenation of stationary segments, and an important task is to identify these segments with sufficient precision and low computational cost. To detect stationary segments in heartbeat time series, we use the segmentation method proposed by Bernaola-Galván et al. (2001). The method considers a sliding pointer and, at each position, computes the statistic

t = (μ_r − μ_l) / S_D,

where μ_r and μ_l are the means of the values to the right and to the left of the pointer, respectively, and S_D is the pooled standard deviation,

S_D = [((N_l − 1)s_l² + (N_r − 1)s_r²) / (N_l + N_r − 2)]^(1/2) · [1/N_l + 1/N_r]^(1/2),
where s_l and s_r are the standard deviations of the two sets, and N_l and N_r are the numbers of points in each set. The quantity t is used to separate two
segments with statistically different means. If the significance of the maximum of t exceeds a threshold (typically set to 0.95), the series is cut into two new segments, provided that the means of the two new segments are significantly different from the means of the adjacent segments (Bernaola-Galván et al., 2001; Fukuda et al., 2004). The process is applied recursively until the significance value falls below the threshold or the length of a new segment is smaller than a minimum length ℓ₀.
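The sliding-pointer statistic and the recursive cutting can be sketched as follows. This is an illustrative sketch, not the implementation used in this chapter: a fixed threshold `t_min` stands in for the full significance test of Bernaola-Galván et al., and the comparison against the adjacent segments is omitted.

```python
import numpy as np

def t_statistic(x):
    """Pooled t-statistic at every position of the sliding pointer."""
    n = len(x)
    t = np.zeros(n)
    for p in range(2, n - 2):          # need at least 2 points on each side
        left, right = x[:p], x[p:]
        nl, nr = len(left), len(right)
        sd = np.sqrt(((nl - 1) * left.var(ddof=1) + (nr - 1) * right.var(ddof=1))
                     / (nl + nr - 2)) * np.sqrt(1.0 / nl + 1.0 / nr)
        t[p] = abs(right.mean() - left.mean()) / sd
    return t

def segment(x, t_min=4.0, l_min=20, offset=0, cuts=None):
    """Recursively cut the series at the maximum of t.

    A fixed threshold t_min replaces the significance criterion of the
    original method (a simplification for illustration)."""
    if cuts is None:
        cuts = []
    if len(x) < 2 * l_min:
        return cuts
    t = t_statistic(x)
    p = int(np.argmax(t))
    if t[p] < t_min or p < l_min or len(x) - p < l_min:
        return cuts
    cuts.append(offset + p)
    segment(x[:p], t_min, l_min, offset, cuts)
    segment(x[p:], t_min, l_min, offset + p, cuts)
    return sorted(cuts)
```

Applied to a noisy series with a single mean shift, the procedure places a cut close to the position of the shift; the pieces between consecutive cuts are the locally stationary segments used below.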
2.2. Correlations

The power spectrum is the standard method to detect autocorrelations in a time series. Consider, for example, a stationary stochastic process whose autocorrelation function follows a power law,

C(s) ~ s^(−γ),   (14.1)

where s is the lag and γ is the correlation exponent, with 0 < γ < 1. The presence of long-term correlations is related to the fact that the mean correlation time diverges for an infinite time series. According to the Wiener–Khintchine theorem, the power spectrum is the Fourier transform of the autocorrelation function C(s), and for the case described in Eq. (14.1) we have the scaling relation

S(f) ~ f^(−β),   (14.2)

where β is the spectral exponent, related to the correlation exponent by γ = 1 − β. When the power spectrum method is used to estimate correlations in real nonstationary time series, as is the case for heartbeat interval signals, it may lead to unreliable results. In past decades, alternative methods have been proposed for assessing correlations in stationary and nonstationary time series. A method well suited to both is the DFA, introduced to quantify long-range correlations in heartbeat interval time series and DNA sequences (Peng et al., 1995a,b). The DFA is briefly described as follows. First, the original time series is integrated to obtain y(k) = Σ_{i=1}^{k} [x(i) − x_ave], and the resulting series is divided into boxes of size n. For each box, a straight line y_n(k) is fitted to the points. Next, the fitted line is subtracted from the integrated series y(k) in each box. The root-mean-square fluctuation of the integrated and detrended series is calculated as

F(n) = { (1/N) Σ_{k=1}^{N} [y(k) − y_n(k)]² }^(1/2);   (14.3)
this process is repeated over several scales (box sizes) to obtain the power-law behavior F(n) ~ n^α, where the exponent α reflects the self-similarity and correlation properties of the signal. The scaling exponent α is related to the spectral exponent β by α = (β + 1)/2 (Peng et al., 1995a). It is known that α = 0.5 is associated with white noise (an uncorrelated signal), α = 1 corresponds to 1/f noise, and α = 1.5 represents Brownian motion.
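The steps above can be sketched in a minimal DFA implementation (non-overlapping boxes, linear detrending; variable names are ours):

```python
import numpy as np

def dfa(x, scales):
    """Detrended fluctuation analysis, Eq. (14.3): F(n) for each box size n."""
    y = np.cumsum(x - np.mean(x))            # integrated profile y(k)
    F = []
    for n in scales:
        n_boxes = len(y) // n
        ss = 0.0
        for b in range(n_boxes):
            seg = y[b * n:(b + 1) * n]
            k = np.arange(n)
            coef = np.polyfit(k, seg, 1)     # linear trend y_n(k) in the box
            ss += np.sum((seg - np.polyval(coef, k)) ** 2)
        F.append(np.sqrt(ss / (n_boxes * n)))
    return np.array(F)

def dfa_exponent(x, scales):
    """Scaling exponent alpha from a log-log fit of F(n) versus n."""
    F = dfa(x, scales)
    return np.polyfit(np.log(scales), np.log(F), 1)[0]
```

As a sanity check, white noise yields α ≈ 0.5 and a random walk α ≈ 1.5, in agreement with the values quoted above.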
2.3. Allan variance

In time and frequency metrology, the time series from oscillators, most commonly atomic clocks, are represented either by the phase deviation x(t) or by the normalized frequency deviation y(t), two stochastic variables related by (Allan et al., 1988):

y(t) = dx(t)/dt.   (14.4)

The AVAR is used as a standard to quantify the stability of an oscillator, where stability refers to the ability of a frequency standard to maintain its synchronization or syntonization over time (McCarthy and Seidelmann, 2009). The AVAR is defined by the expression (Allan, 1987; Allan et al., 1988):

σ_y²(T) = (1/2) ⟨(ȳ_{t+T} − ȳ_t)²⟩,   (14.5)

where T is the observation interval, the operator ⟨·⟩ denotes time averaging, and the average frequency deviation ȳ_t is defined as

ȳ_t = (1/T) ∫_t^{t+T} y(t′) dt′ = [x(t + T) − x(t)] / T.   (14.6)

In discrete time, the AVAR is evaluated from phase-deviation data with the estimator

σ̂_y²[k] = (1 / (2k²T₀²(N − 2k))) Σ_{m=0}^{N−2k−1} (x[m + 2k] − 2x[m + k] + x[m])²,   (14.7)

where N is the total number of samples, T₀ is the minimum observation time interval, and the integer k = T/T₀ represents the discrete-time observation interval, typically taking values k = 1, 2, . . ., ⌊N/3⌋ (where ⌊r⌋ is the integer part of r). Moreover, in terms of normalized frequency-deviation data {y[i]}, the estimator is (Allan, 1987)

σ̂_y²[k] = (1 / (2(M − 2k + 1))) Σ_{i=1}^{M−2k+1} (ȳ_k[i + k] − ȳ_k[i])²,   (14.8)

where k = T/T₀, M is the total number of frequency data points, and the averaged frequency values are given by

ȳ_k[i] = (1/k) Σ_{j=i}^{i+k−1} y[j].   (14.9)

The frequency deviations of frequency standards are either systematic or stochastic. The latter are often well described by power-law spectral processes, S_y(f) ~ f^β, for which the AVAR has an interesting property: it also exhibits power-law behavior,

σ_y²(T) ~ T^ν,   (14.10)

where the following relationship between the two exponents applies (Allan, 1987): ν = −β − 1 for −2 ≤ β ≤ 2. Equation (14.10) can also be written in terms of the so-called Allan deviation (ADEV),

σ_y(T) ~ T^{ν/2} ≡ T^η,   (14.11)

with −1.5 ≤ η ≤ 0.5, where η = −0.5 corresponds to white noise, η = 0 to 1/f noise, and η = 0.5 to Brownian motion.
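The phase-data estimator of Eq. (14.7) can be sketched directly (an illustrative implementation under our own naming; for white frequency noise it reproduces the expected σ_y(T) ~ T^(−1/2) behavior):

```python
import numpy as np

def avar_phase(x, k, t0=1.0):
    """Allan variance from phase-deviation data x, Eq. (14.7)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # second differences of the phase at stride k
    d2 = x[2 * k:] - 2.0 * x[k:N - k] + x[:N - 2 * k]
    return np.sum(d2 ** 2) / (2.0 * k ** 2 * t0 ** 2 * (N - 2 * k))

def adev(x, ks, t0=1.0):
    """Allan deviation sigma_y(T) for observation intervals T = k * t0."""
    return np.array([np.sqrt(avar_phase(x, k, t0)) for k in ks])
```

For phase data built by integrating white frequency noise, a log-log fit of the ADEV against k gives a slope close to η = −0.5, as stated below Eq. (14.11).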
3. Data Analysis

We analyzed two different groups of individuals: 16 healthy subjects and 11 patients with congestive heart failure (CHF; Goldberger et al., 2000). For each individual, we considered interbeat sequences of approximately 3 × 10⁴ beats, corresponding to 6 h of ECG records; we selected 6-h records from diurnal (wake) periods and 6-h records from sleep periods. In Fig. 14.1, we show representative heartbeat records from a healthy subject and a CHF patient during diurnal and nocturnal periods. A simple visual inspection reveals that the mean value and the standard deviation change across day–night transitions; that is, the mean value is higher during sleep whereas the standard deviation is larger during wake periods (Ivanov et al., 1999a). Here, we are interested in exploring changes in the homeostatic capability of healthy and diseased subjects across day–night transitions.
Figure 14.1 Representative heartbeat interval sequences for a healthy subject and a patient with CHF during wake and sleep periods.
3.1. Distributions of excursions

A representative case of the segmentation procedure is shown in Fig. 14.2A. Bernaola-Galván et al. (2001) reported that the cumulative distribution of stationary segments of length ℓ with local mean follows a power law, G(> ℓ) ~ ℓ^(−δ), with δ ≈ 2.2 for both healthy and heart failure groups. We calculate the excursion return times for each stationary segment with respect to its local mean x̄ (Reyes-Ramírez and Guzmán-Vargas, 2010). More specifically, we identify an excursion of size τ if x_i > x̄ for j < i < j + τ while x_j ≤ x̄ and x_{j+τ} ≤ x̄, or conversely if x_i < x̄ for j < i < j + τ while x_j ≥ x̄ and x_{j+τ} ≥ x̄ (see Fig. 14.2C). In a recent work (Reyes-Ramírez and Guzmán-Vargas, 2010), we reported that the distributions of excursion sequences from different stationary segments follow approximately the same functional behavior, which permits pooling the data from all segments to improve the statistics. In Fig. 14.3, we present the cumulative distributions of excursions for both groups during diurnal and nocturnal periods. Our results show that, after normalizing the excursions, both the healthy and CHF groups are consistent with distributions that follow a stretched exponential behavior given by

g(t) ~ exp(−a t^b),   (14.12)
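The excursion identification within a stationary segment can be sketched as follows (a simplified reading of the definition above: excursion lengths are the gaps between consecutive crossings of the local mean, and the incomplete stretches before the first and after the last crossing are discarded):

```python
import numpy as np

def excursions(segment):
    """Lengths (in beats) of excursions of the series around its local mean."""
    x = np.asarray(segment, dtype=float)
    above = x > x.mean()                 # walker position relative to the local mean
    # indices where the sign changes: each crossing ends one excursion
    crossings = np.flatnonzero(above[1:] != above[:-1]) + 1
    # an excursion spans the beats between two consecutive crossings
    return np.diff(crossings)
```

Pooling the outputs of this routine over all stationary segments (after normalization) gives the excursion samples whose cumulative distributions are shown in Fig. 14.3.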
Figure 14.2 Representative case of the segmentation procedure for (A) a nonstationary signal from a healthy subject. (B) As in (A) but for a magnification of the interval in (A). (C) Magnification of (B) to illustrate the excursion identification.
where a and b are constants. Specifically, for healthy subjects during diurnal hours, we found a = 1.09 ± 0.15 (mean value ± SD) and b = 0.91 ± 0.11 (Fig. 14.3A). For night periods, we observed a = 1.41 ± 0.19 and b = 0.71 ± 0.11. In Fig. 14.3B, the results for the CHF patients are presented. The best fit yields a = 1.31 ± 0.23 and b = 0.77 ± 0.13 for day periods, whereas for night periods we observed a = 1.44 ± 0.44 and b = 0.74 ± 0.19. In order to compare the values describing the distributions, we calculate the characteristic scale associated with the stretched exponential distribution, given by ⟨t⟩ = a^(−1/b) Γ(1/b) / b, where Γ is the Gamma function. Note that for b = 1 the mean value of an exponential distribution is recovered. For wake records from healthy data we obtain ⟨t⟩ = 0.95 ± 0.11, while for CHF ⟨t⟩ = 0.84 ± 0.12, revealing a faster decay under pathologic conditions; that is, the walker tends to alternate more frequently. For sleep records from the healthy group, the mean value is ⟨t⟩ = 0.77 ± 0.10, whereas for CHF data ⟨t⟩ = 0.78 ± 0.24 (note that both values are quite similar). It is important to notice that the characteristic scale of the healthy data from sleep periods is comparable with the mean value from diurnal CHF data, suggesting that healthy dynamics during minimal activity (sleep) is close to heart failure dynamics (Ivanov et al., 1998). These results are
Figure 14.3 Cumulative distributions of normalized excursions for wake and sleep periods of healthy and heart failure groups. (A, B, E, F) Double-logarithmic plot and (C, D, G, H) linear-logarithmic plot of distributions. We also show the cases of shuffled records, that is, for each segment we shuffled the interbeat time points and then the distribution of excursions was constructed by pooling the data from all segments. An exponential behavior is observed for the shuffled case. For clarity, the distributions were scaled by a factor 1/10.
Figure 14.4 Wake versus sleep average characteristic times ⟨t⟩ for the healthy and CHF groups.
summarized in Fig. 14.4, where the average of the time constants from wake periods of the healthy group differs from the mean value corresponding to CHF data, with a smaller difference observed for sleep phases. The differences between wake and sleep stages are more significant for healthy dynamics than for CHF patients. We also remark that wake phases lead to characteristic scales greater than those from sleep periods, confirming that during diurnal activity the walker tends to perform larger excursions. This result is in general agreement with previous studies reporting a weaker anticorrelated behavior for wake records (Ivanov et al., 1998). In order to evaluate the robustness of the findings described above, we consider two procedures which affect the distributions: (i) changing the local mean level used to define an excursion (see dashed lines in Fig. 14.2C); (ii) summing the kth excursion successors. Concerning (i), we tested the effect of the local mean value on the distribution parameters. To this end, we repeated the calculations by considering an excursion with respect to a shifted mean value μ ± qσ, with σ the standard deviation and q = 0.1,
0.2, 0.3, 0.4, 0.5. In Fig. 14.5, the results for the mean characteristic scale ⟨t⟩ at different values of q are presented. For healthy data from wake periods, we observe that ⟨t⟩ shows a convex behavior as the local mean value is moved upward or downward, whereas for sleep phases the timescale
Figure 14.5 Statistics of the characteristic timescale ⟨t⟩ for distributions obtained from a shifted mean value, averaged over wake and sleep periods for the healthy and CHF groups. Symbols indicate the average value of each parameter and the error bars indicate the standard error of the mean. We observe that ⟨t⟩ is a convex function as the local mean value is moved upward or downward, revealing a faster decay of the distributions. We remark that for healthy-day data ⟨t⟩ shows a nonsymmetrical behavior with respect to the shifted mean value.
exhibits only a small change as q increases or decreases. For heart failure data from wake periods, we also observe that ⟨t⟩ decreases slightly as the local mean value is moved upward or downward, whereas for sleep periods the value remains almost constant for shifted means within the interval μ ± 0.2σ. We notice that the values of ⟨t⟩ for the CHF group during night periods are close to their corresponding values from wake records, revealing that under pathologic conditions wake–sleep changes are less abrupt than in healthy cases. Regarding the second point, the number of beats between the mean-crossing level (i.e., the starting point of the excursion) and its kth successor is considered; for k = 1, we recover the original excursion definition studied above. We found that for k = 2, 3, 4, 5, the cumulative distributions of excursions with kth successor are also consistent with a stretched exponential function. We estimated a and b for several values of k to construct the characteristic mean value of the distributions. The results are presented in Fig. 14.6. We observe that for healthy data from wake periods, the mean value of the characteristic scales is larger than the values from sleep records. It is also important to note that the healthy mean values from night periods are quite similar to the values from wake CHF records.
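The characteristic scale used throughout this comparison follows directly from the fitted parameters; a one-line sketch using the standard-library Gamma function:

```python
from math import gamma

def characteristic_scale(a, b):
    """<t> = a**(-1/b) * Gamma(1/b) / b, the mean of the stretched
    exponential survival function g(t) = exp(-a * t**b)."""
    return a ** (-1.0 / b) * gamma(1.0 / b) / b
```

For b = 1 this reduces to the exponential mean 1/a; with the wake healthy fit a ≈ 1.09, b ≈ 0.91 reported above, it gives ⟨t⟩ ≈ 0.95, in line with the quoted value.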
Figure 14.6 Characteristic mean value ⟨t⟩ versus the kth excursion successor for wake and sleep periods. Symbols represent the mean value and error bars the standard error. As expected, in all cases ⟨t⟩ increases as k increases, indicating the presence of large excursions. For both the healthy and CHF groups, the diurnal mean values are greater than the sleep mean values for several values of k. We note that the healthy mean values from sleep periods are close to the CHF values from wake records.
3.2. Correlations of excursions

In order to assess whether the observed distributions of excursions are related to the presence of correlations, we study the memory in the time organization of excursions. Figure 14.7 shows representative time evolutions of excursions from one healthy subject (Fig. 14.7A) and one CHF patient (Fig. 14.7B). The two sequences look different, particularly because of the presence of clusters in the healthy data, an indication of memory. In contrast, the CHF data are characterized by many periods with small excursions and a low density of large excursions. To go further in the study of correlations in excursion sequences, we consider the conditional cumulative distribution G(t|t₀), the survival probability that an excursion within the interval t₀ is followed by an excursion larger than t. For sequences with no memory, G(t|t₀) is expected to be independent of t₀; that is, the order in the sequence of excursions is not correlated. If G(t|t₀) changes as a function of t₀, then t₀ "influences" the size of the next excursion. For shuffled data, G(t|t₀) must be exponential and independent of t₀. To test the effect of t₀ on G(t|t₀), we consider two intervals for t₀, t₀s and t₀l, which correspond to small and large values of t obtained from a six equal-size
Figure 14.7 Representative sequences of excursions from one healthy subject and one CHF patient. (A) Visual representation of excursion clustering observed in a healthy subject, suggesting the presence of memory. (B) As in (A) but for a CHF patient. We observe that the sequence is characterized by the presence of mostly small excursions, except for episodes with large consecutive excursions.
partition of the entire interval in increasing order; that is, each interval contains one-sixth of the total number of excursions. Because of the poor statistics for a single subject, the normalized excursions from all subjects are pooled into an ensemble to generate this partition. The results for G(t|t₀s) and G(t|t₀l) are presented in Fig. 14.8. For healthy subjects during wake and sleep periods (see Fig. 14.8A and C), the conditional cumulative probability for t₀s tends to be similar to that for t₀l in the range of small excursions, whereas for large excursions G(t|t₀s) < G(t|t₀l), indicating that large excursions tend to follow large excursions. In contrast, for the heart failure group during wake periods (see Fig. 14.8B), we observe that G(t|t₀s) is close to G(t|t₀l) for small excursions, whereas for large ones the probability for t₀s is slightly larger than that for t₀l, indicating that large excursions are more likely to be preceded by a small one. Moreover, for sleep periods (see Fig. 14.8D), the probability for t₀s is slightly lower than that for t₀l, indicating that small excursions are more likely to be preceded by a large one.

3.2.1. Detrended fluctuation analysis

In this section, we use the DFA method to detect changes in the correlations of excursion sequences between sleep and wake periods. As shown in Fig. 14.9, for
Figure 14.8 Conditional cumulative distribution for healthy and CHF groups during wake and sleep periods. In this plot, t₀ represents the interval obtained from a six equal-size partition of the entire interval in increasing order. For comparison, we used the smallest and largest intervals for t₀ (see text for details). The lower plots represent the distributions obtained from shuffled data. For both groups, we observe that the shuffled data can be fitted by an exponential function. We have scaled the distributions by a factor 1/10.
healthy data from wake periods the scaling behavior over at least two decades is characterized by the average exponent α = 0.64 ± 0.04, while for wake CHF data the scaling is characterized by two regimes; over short
Figure 14.9 Plots of the DFA analysis for wake and sleep excursion sequences from a healthy subject and a patient with CHF. For the healthy subject during the wake period (open circles) and sleep period (filled circles), a single scaling exponent α ≈ 0.65 is identified for timescales 10 < n < 100. In contrast, a crossover pattern is observed in the CHF patient for day (open diamonds) and night (filled diamonds) records. For short scales, the scaling exponent is close to the white-noise value (α_s ≈ 0.55), whereas for large scales, the excursion sequences display positive correlations (α_l ≈ 0.7). As a control, we also show the cases of shuffled records, which exhibit a scaling exponent close to the uncorrelated value α ≈ 0.5.
scales (4 ≤ n ≤ 10²) the average exponent is α_s = 0.55 ± 0.01, whereas for large scales (10² ≤ n ≤ 10³) the value is α_l = 0.71 ± 0.07. For sleep periods, we observe that the scaling relation for healthy dynamics is characterized by the exponent α = 0.64 ± 0.02, whereas CHF data are described by the two regimes α_s = 0.55 ± 0.03 and α_l = 0.75 ± 0.13. In all cases the exponent is larger than 0.5, indicating the presence of positive correlations. Important differences are observed when comparing α_s and α_l for CHF data (p-value < 10⁻³ by Student's t-test). We remark that over short scales, the average DFA exponent for the healthy group is slightly larger than the corresponding DFA exponent of the CHF group, confirming that excursion sequences under healthy conditions are more correlated than pathologic data. It is also important to note that the scaling exponent which characterizes diurnal excursions is almost equal to
the exponent from night periods. To better evaluate the presence of the crossover in the scaling exponent for both periods, we extracted both scaling exponents (α_s and α_l) for both groups. Figure 14.10 shows the scatter plot of α_s versus α_l from healthy and CHF subjects. For both wake and sleep periods, we observe a clear separation between the healthy and heart failure groups.
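The two scaling exponents can be extracted by fitting the log–log fluctuation curve separately on each side of the crossover; a sketch with the crossover scale fixed by hand at n ≈ 120, the value used for Fig. 14.10 (function and variable names are ours):

```python
import numpy as np

def two_regime_exponents(n, F, crossover=120):
    """Fit separate DFA exponents below and above a fixed crossover scale."""
    n, F = np.asarray(n, dtype=float), np.asarray(F, dtype=float)
    short = n <= crossover
    a_s = np.polyfit(np.log(n[short]), np.log(F[short]), 1)[0]   # short-scale slope
    a_l = np.polyfit(np.log(n[~short]), np.log(F[~short]), 1)[0] # large-scale slope
    return a_s, a_l
```

Applied to a piecewise power law F(n) ~ n^0.55 below the crossover and F(n) ~ n^0.75 above it, the routine recovers the two exponents.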
Figure 14.10 Scatter plot of α_s versus α_l for wake and sleep excursion sequences from healthy subjects and CHF patients. We estimate α_s over short scales (4 ≤ n ≤ 120) and α_l over large ones (120 ≤ n ≤ 1000). A clear separation between the two groups is observed for both periods.
3.3. Excursions and simulated noise

We next verify whether the observed stretched exponential distribution of excursions is related to the presence of correlations in the heartbeat interval time series. To this end, we generate Gaussian correlated 1/f^β noise using the Fourier filtering method with 0 ≤ β ≤ 1 (Makse et al., 1996). The case β ≈ 1 roughly corresponds to the average spectral exponents observed in healthy data (Goldberger et al., 2002). The simulated data consist of 32,000 values with zero mean and unit standard deviation. We apply the segmentation algorithm to the generated data and construct the distribution of excursions by pooling the data from all stationary segments. For the specific case β = 1, the segmented data lead to a cumulative distribution of excursions that is also consistent with a stretched exponential function, while the distribution from uncorrelated noise (β = 0) is exponential (data not shown). To test the effect of the segmentation process on the characteristic scale ⟨t⟩ and on the DFA exponent α, we computed these quantities for several values of β. Figure 14.11 shows the results of ⟨t⟩ and α for original unsegmented and
Figure 14.11 (A) Statistics of the time constants ⟨t⟩ for excursion sequences from unsegmented and segmented correlated 1/f^β noise. (B) DFA correlation exponent α versus spectral exponent β.
segmented data. For the original data (Fig. 14.11A), we notice that ⟨t⟩ increases as β increases, indicating a slower decay of the distribution due to the presence of large excursions in long-term correlated data. This behavior is consistent with theoretical results reporting that the probability of having no zero-level crossing after t steps is bounded from above by a stretched exponential (Bunde et al., 2005; Newell and Rosenblatt, 1962). We also find that the scaling correlation exponent α (Fig. 14.11B) slowly increases for β > 0.4 and is close to 0.65 for β = 1, revealing that positive correlations are present in excursion sequences from long-term correlated noise. In contrast, for segmented data neither ⟨t⟩ nor α shows significant changes for different values of β (Fig. 14.11A and B). In particular, we observe that ⟨t⟩ increases only slowly and α ≈ 0.5, indicating that the fluctuations of excursions are close to the exponential (uncorrelated) case and that the segmentation procedure almost destroys the long-range correlations in excursion sequences. From these findings, we conclude that the observed scaling values in excursion sequences from real data cannot be attributed solely to the presence of long-range correlations in the heartbeat interval series.
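The Fourier filtering method can be sketched as follows (a minimal version: filter the spectrum of Gaussian white noise so that S(f) ~ f^(−β), then normalize to zero mean and unit standard deviation):

```python
import numpy as np

def fourier_filter_noise(n, beta, rng=None):
    """Gaussian 1/f^beta noise via the Fourier filtering method (Makse et al., 1996)."""
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal(n)
    spec = np.fft.rfft(white)
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                       # avoid division by zero at the DC bin
    spec *= f ** (-beta / 2.0)        # impose S(f) ~ f^(-beta)
    x = np.fft.irfft(spec, n)
    x -= x.mean()
    return x / x.std()                # zero mean, unit standard deviation
```

For β = 0 the routine returns plain white noise; for β = 1 the periodogram of the output shows the expected slope of about −1 in a log-log plot.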
3.4. Stability of excursions

Finally, we explore the stability of excursions by means of the AVAR statistics. In the context of our study, an excursion can be roughly considered an oscillation around the local mean: the number of beats of one upward excursion followed by a downward excursion makes a cycle, that is, a "complete period" of the oscillation. We can therefore analyze the stability of this oscillation around the local mean formed by paired excursions. Stability in this case tells us whether there is a tendency to maintain a certain duration of the paired excursions across different timescales. We thus apply the AVAR method to the sequences of excursion durations from the healthy and CHF groups. The results of our calculations are shown in Fig. 14.12. For healthy subjects during wake periods, we observe η = −0.45 ± 0.02, whereas for sleep phases we obtain η = −0.42 ± 0.02. For CHF data, the best fit leads to η = −0.48 ± 0.02 during wake periods and η = −0.49 ± 0.04 for night episodes. The values of the scaling exponent η for the healthy group, for both wake and sleep records, differ from −0.5, whereas for the CHF data the values are close to −0.5, which corresponds to uncorrelated noise. These results are in general agreement with those obtained by means of the DFA method. Since the magnitudes of the scaling exponents for healthy data are lower than for the CHF group, we remark that excursion time sequences from healthy subjects are more stable than those from CHF patients, for both wake and sleep records. This is also confirmed by the ADEV, which for healthy data decays more slowly than for CHF data with increasing scale T, remaining low at large scales, a signature of higher stability.
Figure 14.12 Allan deviation statistics for excursion time sequences from healthy and CHF groups during wake and sleep periods. For clarity, we have scaled the data from sleep periods.
4. Conclusions

We have analyzed excursion sequences from healthy and heart failure groups during wake and sleep periods. Our results reveal that excursions can be characterized by stretched exponential distributions with different fitting parameters. We observed that wake–sleep differences are more significant for healthy data than under CHF conditions. Furthermore, the application of two procedures to test alterations of the fitting parameters also reveals that the distributions from wake and sleep periods are quite similar for the heart failure group, whereas important differences are observed in healthy data. In particular, our results show that for healthy data during wake periods the walker is able to perform larger excursions than during sleep phases, confirming that during periods of minimal activity (e.g., sleep), under a strong neuroautonomic control, the heartbeat dynamics exhibits fluctuations closer to uncorrelated behavior. By means of the DFA analysis, we confirm the presence of long-term correlations, expressed through a single scaling exponent, in excursion sequences from healthy data during both wake and sleep periods, whereas under pathologic conditions the correlations are described by a crossover, indicating that over short scales excursions are close to uncorrelated fluctuations. Our results are in concordance with previous studies
L. Guzmán-Vargas et al.
which report that wake heartbeat fluctuations are characterized by scaling exponents larger than the sleep exponents, the latter lying closer to the white-noise regime (α = 0.5; Ivanov et al., 1999b). Finally, our analysis also reveals that for healthy data during wake and sleep periods, excursion sequences over short scales are less anticorrelated than CHF excursions, indicating stronger neuroautonomic control under heart failure conditions. Moreover, by applying the ADEV statistics, which characterize the frequency stability of precise oscillators, to the excursions, we found that the excursion durations for healthy subjects are more stable than for CHF patients, indicating that the variability of the excursion durations is lower for healthy data over multiple timescales.
ACKNOWLEDGMENTS

We thank I. Fernandez-Rosales for fruitful comments and suggestions. This work was partially supported by the US-Mexico Foundation for Science (FUMEC), EDI-IPN, COFAA-IPN, and Consejo Nacional de Ciencia y Tecnología (CONACYT, J49128F26020), México.
REFERENCES

Allan, D. W. (1987). Time and frequency (time-domain) characterization, estimation, and prediction of precision clocks and oscillators. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 34, 647–654.
Allan, D., Kartaschoff, P., Vanier, J., Vig, J., Winkler, G. M. R., and Yannoni, N. F. (1988). Standard terminology for fundamental frequency and time metrology. Proceedings of the 42nd Annual Frequency Control Symposium, Baltimore, MD, pp. 419–425.
Amaral, L. A. N., Goldberger, A. L., Ivanov, P. C., and Stanley, H. E. (1998). Scale-independent measures and pathologic cardiac dynamics. Phys. Rev. Lett. 81, 2388–2391.
Bernaola-Galván, P., Ivanov, P. C., Amaral, L., and Stanley, H. (2001). Magnitude and sign correlations in heartbeat fluctuations. Phys. Rev. Lett. 87, 169105.
Bunde, A., Eichner, J. F., Kantelhardt, J. W., and Havlin, S. (2005). Long-term memory: A natural mechanism for the clustering of extreme events and anomalous residual times in climate records. Phys. Rev. Lett. 94(4), 048701.
Carpena, P., Bernaola-Galván, P., and Ivanov, P. C. (2004). New class of level statistics in correlated disordered chains. Phys. Rev. Lett. 93(17), 176804.
Ding, M., and Yang, W. (1995). Distribution of the first return time in fractional Brownian motion and its application to the study of on–off intermittency. Phys. Rev. E 52(1), 207–213.
Fengzhong, W., Kazuko, Y., Havlin, S. H., and Stanley, H. E. (2006). Scaling and memory of intraday volatility return intervals in stock markets. Phys. Rev. E 73(2), 026117.
Fukuda, K. H., Stanley, H. E., and Nunes Amaral, L. A. (2004). Heuristic segmentation of a nonstationary time series. Phys. Rev. E 69(2), 021108.
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P., Mark, R., Mietus, J., Moody, G., Peng, C.-K., and Stanley, H. (2000). PhysioNet, PhysioBank and PhysioToolkit. Circulation 101, e215–e220.
Goldberger, A., Amaral, L., Hausdorff, J., Ivanov, P. C., Peng, C.-K., and Stanley, H. (2002). Fractal dynamics in physiology: Alterations with disease and aging. Proc. Natl. Acad. Sci. USA 99(Suppl. 1), 2466–2472.
Guzmán-Vargas, L., and Angulo-Brown, F. (2003). Simple model of the aging effect in heart interbeat time series. Phys. Rev. E 67(5), 052901.
Guzmán-Vargas, L., Muñoz-Diosdado, A., and Angulo-Brown, F. (2005). Influence of the loss of time-constants repertoire in pathologic heartbeat dynamics. Phys. A: Stat. Mech. Appl. 348, 304–316.
Ivanov, P. C., Amaral, L. A. N., Goldberger, A. L., and Stanley, H. E. (1998). Stochastic feedback and the regulation of biological rhythms. Europhys. Lett. 43(4), 363–368.
Ivanov, P. C., Amaral, L., Goldberger, A., Havlin, S., Rosenblum, M., Struzik, Z., and Stanley, H. (1999a). Multifractality in human heartbeat dynamics. Nature 399, 461–465.
Ivanov, P. C., Bunde, A., Amaral, L., Fritsch-Yelle, J., Baevsky, R., Havlin, S., Stanley, H., and Goldberger, A. (1999b). Sleep–wake differences in scaling behavior of the human heartbeat: Analysis of terrestrial and long-term space flight data. Europhys. Lett. 48, 594–599.
Liebovitch, L. S., and Yang, W. (1997). Transition from persistent to antipersistent correlation in biological systems. Phys. Rev. E 56(4), 4557–4566.
Makse, H. A., Havlin, S., Schwartz, M., and Stanley, H. E. (1996). Method for generating long-range correlations for large systems. Phys. Rev. E 53(5), 5445–5449.
McCarthy, D. D., and Seidelmann, P. K. (2009). TIME: From Earth Rotation to Atomic Physics. Wiley-VCH Verlag GmbH & Co., Weinheim.
Newell, G. F., and Rosenblatt, M. (1962). Zero crossing probabilities for Gaussian stationary processes. Ann. Math. Stat. 33, 1306–1313.
Peng, C. K., Havlin, S., Stanley, H. E., and Goldberger, A. L. (1995a). Long-range anticorrelations and non-Gaussian behavior of the heartbeat. Phys. Rev. Lett. 70, 1343–1346.
Peng, C. K., Mietus, J., Hausdorff, J., Havlin, S., Stanley, H. E., and Goldberger, A. L. (1995b). Quantification of scaling exponents and crossover phenomena in nonstationary heartbeat time series [Proceedings of the NATO Dynamical Disease Conference, edited by L. Glass]. Chaos 5, 82–87.
Reyes-Ramírez, I., and Guzmán-Vargas, L. (2010). Scaling properties of excursions in heartbeat dynamics. Europhys. Lett. 89(3), 38008.
Schmitt, D. T., and Ivanov, P. C. (2007). Fractal scale-invariant and nonlinear properties of cardiac dynamics remain stable with advanced age: A new mechanistic picture of cardiac control in healthy elderly. Am. J. Physiol. Regul. Integr. Comp. Physiol. 293(5), R1923–R1937.
Sugihara, G., Allan, W., Sobel, D., and Allan, K. (1996). Fractal dynamics in physiology: Alterations with disease and aging. Proc. Natl. Acad. Sci. USA 93, 2608.
CHAPTER FIFTEEN
Changepoint Analysis for Single-Molecule Polarized Total Internal Reflection Fluorescence Microscopy Experiments

John F. Beausang,* Yale E. Goldman,†,‡ and Philip C. Nelson*

Contents
1. Overview
1.1. The changepoint problem
1.2. Traditional approach
1.3. Improved approach: Heuristic
1.4. Simple derivation of changepoint statistic
1.5. Why read this article?
2. Multiple Channels
2.1. Introduction to polTIRF method
2.2. Multiple-channel changepoint analysis
3. Detailed Analysis
3.1. Threshold for false positives detection
3.2. Correct for nonuniform distribution of false positives
3.3. False positives, multiple-channel case
3.4. Automated, multiple-channel changepoint detection algorithm
3.5. Critique of multiple-channel changepoint algorithm
4. Simulation Results
4.1. No-changepoint simulations
4.2. Single-changepoint simulations
4.3. Two-changepoint detection
5. Discussion
5.1. Single photon counting in single-molecule biophysics
5.2. Vista: Transient state detection
6. Conclusion
Acknowledgments
References
* Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, Pennsylvania, USA
† Pennsylvania Muscle Institute, University of Pennsylvania, Philadelphia, Pennsylvania, USA
‡ Department of Physiology, University of Pennsylvania, Philadelphia, Pennsylvania, USA
Methods in Enzymology, Volume 487 ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87015-1
© 2011 Elsevier Inc. All rights reserved.
John F. Beausang et al.
Abstract The experimental study of individual macromolecules has opened a door to determining the details of their mechanochemical operation. Motor enzymes such as the myosin family have been particularly attractive targets for such study, in part because some of them are highly processive and their “product” is spatial motion. But single-molecule resolution comes with its own costs and limitations. Often, the observations rest on single fluorescent dye molecules, which emit a limited number of photons before photobleaching and are subject to complex internal dynamics. Thus, it is important to develop methods that extract the maximum useful information from a finite set of detected photons. We have extended an experimental technique, multiple polarization illumination in total internal reflection fluorescence microscopy (polTIRF), to record the arrival time and polarization state of each individual detected photon. We also extended an analysis technique, previously applied to FRET experiments, that optimally determines times of changes in photon emission rates. Combining these improvements allows us to identify the structural dynamics of a molecular motor (myosin V) with unprecedented detail and temporal resolution.
A List of Symbols

a: discrete macromolecular state index; S(a): the state indexed by a
α: acceptable fraction of false positives (Section 3.1)
α, β: body-frame angles describing orientation relative to an actin filament
θ, φ: laboratory-frame angles describing orientation of a fluorophore
E[Lrm], σm²: expectation and variance of the uncorrected log likelihood function
i: indexes which of several changepoints is under discussion
κ: simulated photon rate
L0(r0): log likelihood for no changepoint; L0,max: its maximum over r0
L1(r, r′, τ): log likelihood for one changepoint; L1,max(τ): its maximum over r, r′
Lr(τ) or Lrm: log of the likelihood ratio; Lr*: its absolute maximum; m*: position of the maximum
L̂r and L̂rm: corrected log likelihood ratios (Section 3.2)
m: photon sequence number, from 1 to N; also m, m′: how many of the observed photons came before (resp. after) a proposed changepoint
M: number of simulation runs
μ: index labeling the np distinct polarization channels; mμ, m′μ: how many of the observed photons of type μ came before/after a proposed changepoint
q: number of proposed changepoints in an interval
r0: assumed photon rate under the assumption of no changepoint (or r0,μ in the multiple-rate case)
r, r′: assumed photon rates before and after a changepoint (or rμ, r′μ in the multiple-rate case)
ρα and ρ̂α: thresholds to reject false positives for the uncorrected and corrected log likelihood ratios
tm: arrival times of individual photons, in increasing order in the range from 0 to T
τ: proposed value of the changepoint time
dt: fictitious time slice, eventually taken → 0; k: index of time slices, from 0 to T/dt
Δt: finite bin duration in the traditional method
x: photon sequence number as a fraction of the total, x = m/N
w: ratio of photon rates before/after a changepoint
1. Overview

1.1. The changepoint problem

Many experiments in single-molecule biophysics seek to determine the time course of discrete intramolecular motions (Michalet and Weiss, 2002). For example, we may wish to know when in a mechanochemical cycle one subunit of an enzyme moves spatially relative to another, when a ligand binds, and so on. One popular method involves Förster resonance energy transfer (FRET; Weiss, 1999). Oversimplifying somewhat, FRET converts the spatial distance between two fluorescent probes attached to a macromolecule (or on two molecules) into an observable signal, a photon emission rate. A second method, and the main application to be discussed in this chapter, is polarized total internal reflection fluorescence microscopy (polTIRF; Beausang et al., 2008; Rosenberg et al., 2005). The method will be discussed in greater detail below, but again oversimplifying, it converts the spatial orientation of a fluorescent probe into a set of distinct photon emission rates. Each rate describes the probe's average number of emitted photons per time with a particular polarization, given a particular excitatory polarization and intensity.
In each of the situations just described, the experimenter hopes to observe discrete changes of internal state as sudden changes in photon emission rate(s), and to interpret those jumps as specific spatial movements. Ideally, such data will tell us the precise times of the changes, for example, so that kinetic constants may be determined accurately, and also the number of distinct states and the precise spatial distances or orientations in each state. (Different methods, based on hidden Markov modeling, have been proposed to extract kinetic parameters directly from unbinned single-photon trajectories (Andrec et al., 2003; McKinney et al., 2006). Such methods achieve high time resolution, but unlike ours, they require knowledge of the underlying kinetic scheme.) Single-molecule fluorescence measurements are limited, however, by shot noise: There are only a finite number of photons available, either because each state is short-lived, or because most fluorophores photobleach (stop fluorescing) after a finite number of excitations. Increasing the photon count via stronger illumination generally hastens the eventual bleach. Slowing the kinetic steps by any of various expedients can distort the natural functioning of the enzyme under study. For all these reasons, we would like to make optimal use of the available photons by employing a good changepoint detection scheme.

The "changepoint problem" has a long history in probability theory (Chen and Gupta, 2001). In its abstract form, we consider a time series of observations. We wish to compare the hypotheses: (H0) the observations are independent draws from a single unknown probability distribution; (H1) all the observations up to time τ are independent draws from one unknown distribution, and those made later than τ are independent draws from a different unknown distribution; ...; (Hp) there are p such sudden transitions. In this general form, the changepoint problem has many applications (e.g., in finance).
But it cannot be attacked without specifying our assumptions more completely. For example, as stated, the problem allows us to suppose that every observation is separated from the next by a changepoint—not a useful conclusion! The rest of this overview section gives a concise, self-contained tutorial on changepoint detection. The reader who wants to know what the method can do may wish to examine Figs. 15.1 and 15.2 before proceeding. Succeeding sections give more implementation details (see also Beausang, 2010). A glossary of abbreviations appears at the end.
1.2. Traditional approach For applications to single-molecule biophysics, we can formulate a more specific version of the general changepoint problem: We suppose that, in each quasi-stationary state S(a), photons are emitted in a Poisson process with
Figure 15.1 (A–C) Illustration of changepoint detection methods on simulated data with N = 200 photons and a ratio between the high and low rates of w = 3. (A) The photons are binned into 20 constant-width temporal bins. Examining the graph by eye, we may guess that there is a change in photon rate somewhere around the vertical dashed line, but neither this change time nor the initial and final rates (I1 and I2), nor even the existence of a changepoint, are clear. (B) As described in Section 1.3, the kink in the cumulative distribution of photon arrival times gives a much clearer indication of the changepoint time, and the two slopes flanking that point yield the corresponding photon rates. Because these are simulated data, we can compare the actual (triangle) and inferred (dashed line) changepoint times. This chapter describes a quantitative implementation of this simple observation. (C) The peak of the log-likelihood surface occurs at photon sequence number m = 99 (vertical dotted line), and the 95% confidence intervals at m = 98 and 102 enclose the actual changepoint (inset, vertical lines). (D–F) Illustration on real experimental data. A bifunctional fluorescent dye molecule was attached to one of the two lever arms of a myosin-V molecular motor. The motor bound to an immobilized actin filament and began its mechanochemical cycle in the presence of 10 μM ATP. The dye was excited by polTIRF in each of the several incident polarizations (see Section 2.1), and individual emitted photons were detected after passing through a polarization splitter. (Time-stamped data also arise in FRET measurements.) (D) shows the photon counts in a set of 20 time bins (total of N = 1280 photons recorded). No changepoint is visible to the eye. (E) separates the total counts into the several "flavors," or tagged subpopulations, of emitted photons. Of these, two have been selected for display as solid and dashed curves. A changepoint is visible, but its time cannot be established to greater accuracy than about two time bins. (F) shows the cumulative distribution described in Section 1.3. Each photon time series displays a sharp kink, and moreover, the two curves' kinks occur at the same time (vertical position).

some stationary mean rate r(a). Given a time series of photon arrivals, we then wish to find these rates and the times of the transitions between them. One way to address the question is to divide time into bins of size Δt large enough that every bin contains many photons. Dividing the count in each bin by Δt gives an estimate of the photon emission rate (intensity). We create a histogram of these estimated rates, identify cut points between its peaks, and declare a changepoint in the data whenever two successive estimates of the rates straddle a cut point.

Although it is straightforward, this traditional approach has several weaknesses in single-molecule work. Practically achievable photon rates may not be high, forcing us into a dilemma: We must either take Δt to be large, compromising temporal resolution, or small, giving few photons in each bin. In the former case, most changepoints will lie in the middle of a bin, smearing out the transitions; we may even miss some transitions altogether if a state is too short-lived. In the latter case, ordinary Poisson fluctuations in photon counts become large enough to obliterate some transitions between states with similar rates and, conversely, can create apparent transitions where none took place. Figure 15.1 illustrates the issues. The plot in panel (A) shows some simulated data, which could represent the estimated photon emission rate of a single-molecule photobleaching event. Clearly, there is a changepoint, but we cannot manually identify its time to very good accuracy, nor can we identify the rates themselves very well. Section 1.3 gives an improved approach to this same data, and Section 1.4 explains how to convert this observation into a useful statistic.

1.3. Improved approach: Heuristic

Clearly, the dilemma described in Section 1.2 rests with the random character of photon emission; the rates r found in each time bin are just sample average rates. Nevertheless, if we index the photons by sequence number m and plot their arrival times versus m, we should get a bumpy line that eventually shows a well-defined slope 1/r. Changepoints should then appear as kinks in that line. The curve in Fig. 15.1B shows that indeed the very same data used to obtain panel (A) now display a visible kink at a well-defined time. The difference in time resolution between Fig. 15.1A
and B arises because in the traditional method, we coarsen our data by binning, whereas in the improved method, every photon's precise arrival time is retained. Figure 15.1E and F shows the same phenomenon with real experimental data.

One could simply take a time series with many changepoints and visually identify kinks in a graph like Fig. 15.1B by laying a ruler along straight stretches in the graph. In practice, we would prefer a method that is both more automatic and more objective than that; the rest of this chapter will develop such a method. It may be tempting to modify the visual method by attempting a least-squares fit of Fig. 15.1B to a piecewise linear function, but least-squares rests on assumptions about the statistical character of data that are not met in this context. Section 1.4 and later sections will instead proceed from a more fundamental, maximum-likelihood approach.

As mentioned earlier, we would also like to generalize changepoint analysis to handle situations where several distinct streams of photons are observed, and each macromolecular state is characterized by the set of rates (r1, r2, . . ., rnp). In our application, each observed photon is tagged by its polarization and by the polarization of the incident radiation that gave rise to it. For example, the data shown in Fig. 15.1E and F include separate traces for two of the subpopulations in a particular experiment. (Other photon subpopulations were not displayed because they were not as sensitive to the particular conformational change occurring at this time.) Section 2.2 will pursue this generalization.

Despite the mathematical complexity of the discussion to follow, we wish to emphasize the underlying simplicity of the method: The almost trivial replotting of data in Fig. 15.1B already contains the heart of changepoint detection.
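The kink heuristic can be made concrete with a small simulation (the rates, change time, and seed below are invented for illustration, not taken from the chapter): the slope of arrival time versus sequence number is 1/rate, so fitting a line to each side of the rate change recovers both rates.

```python
# Hypothetical illustration of the cumulative-arrival-time heuristic.
import numpy as np

def poisson_arrivals(rng, rate, t0, t1):
    """Homogeneous Poisson arrival times on [t0, t1)."""
    n = rng.poisson(rate * (t1 - t0))
    return np.sort(rng.uniform(t0, t1, n))

rng = np.random.default_rng(1)
r_before, r_after, t_c, T = 200.0, 600.0, 0.5, 1.0
t = np.concatenate([poisson_arrivals(rng, r_before, 0.0, t_c),
                    poisson_arrivals(rng, r_after, t_c, T)])
m = np.arange(1, len(t) + 1)             # photon sequence number

# Least-squares line fit of t_m vs. m on each side of the (here known)
# change time; the reciprocal slopes estimate the two rates.
before = t < t_c
inv_r1 = np.polyfit(m[before], t[before], 1)[0]
inv_r2 = np.polyfit(m[~before], t[~before], 1)[0]
est_r1, est_r2 = 1.0 / inv_r1, 1.0 / inv_r2
```

Of course, this fit presumes we already know where the changepoint is; Section 1.4's likelihood-ratio statistic removes that assumption, and it also avoids the least-squares caveat noted above.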
1.4. Simple derivation of changepoint statistic

The single-channel case for detecting changepoints in single-photon counting (SPC) data was developed by Watkins and Yang (2005), who applied it to single-molecule FRET recordings (Watkins and Yang, 2006). This section gives a simple derivation of their key formula. We can think of successive observations by discretizing time into small slices dt. dt will be sent to zero in the following discussion; it will not enter our final formulas. It is not a time-binning parameter, because we do not lump groups of photons into batches. We then imagine recording (photon)/(no photon) in each slice. This binary random variable is supposed to be distributed as a Bernoulli trial with probability r dt to observe a photon and (in the limit dt → 0) zero probability to find more than one. If we observe over total time T, we then wish to compare the hypotheses: (H0) uniform photon emission rate r0 throughout all T/dt time slices; (H1) uniform rate r until time τ, then uniform rate r′ thereafter; etc.

Consider first hypothesis (H0) (actually a one-parameter family of hypotheses). Suppose that photons have been observed at times t1 < ⋯ < tN,
all between 0 and T. From this information, we would like to identify the best estimate of the rate r0. To do so, we calculate and maximize a "log likelihood function" L0(r0), defined as the logarithm of the probability that the observed photon times would have been observed, had the hypothesis (H0; r0) been true:

L0(r0) = ln P(t1, . . ., tN | r0) = Σ_{k=1}^{T/dt} { ln(r0 dt), if a photon in this slice; ln(1 − r0 dt), otherwise }.   (15.1)

Taking the limit dt → 0 gives L0(r0) = N ln(r0 dt) − r0 T. (Exponentiating this formula for L0, integrating over the allowed range of t's, and summing over N confirms that the corresponding probability distribution is properly normalized.) Maximizing over the rate r0 then gives the optimal choice N/T, as could have been expected. We can now see why the slope of the cumulative photon distribution (Fig. 15.1B and F) tells us a rate, and hence why the heuristic method of Section 1.3 works: The slope in any region not containing a changepoint is just T/N, the reciprocal of the optimal choice just found for the rate r0.

Turning now to hypothesis (H1), we would like to identify the best estimates of its three parameters r, r′, and τ. For any choice of τ, partition the observed photons into m that arrive prior to τ and m′ = N − m that arrive later than τ. The same steps as before now give

L1(r, r′, τ) = m ln(r dt) + m′ ln(r′ dt) − rτ − r′(T − τ).

Maximizing over r and r′ gives r = m/τ, r′ = m′/(T − τ), and so

L1,max(τ) = m ln(m/τ) + m′ ln(m′/(T − τ)) + N(ln dt − 1).

Our best estimate of the changepoint time is the value of τ that maximizes this quantity (recall that m and m′ are themselves functions of τ). We can get a more meaningful statistic by computing the ratio of likelihoods for the no- and one-changepoint hypotheses, or equivalently, the difference of log-likelihoods L1,max(τ) − L0,max, which we will simply call Lr(τ):

Lr(τ) = m ln(m/τ) + m′ ln(m′/(T − τ)) − N ln(N/T).   (15.2)

The divergent constant ln(dt) has dropped out of this expression. Watkins and Yang obtained Eq. (15.2) following different reasoning from that given here (Eq. (15.4); Watkins and Yang, 2005). Instead of expressing Lr as a function of time, we may equally well regard it as a function of the photon
sequence number of the proposed changepoint, m, and write Lrm. Let Lr* denote the absolute maximum of Lr over time (or over m).
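Eq. (15.2) is simple to evaluate numerically. The sketch below uses simulated data with invented parameters (it is not the chapter's implementation, and placing each candidate changepoint at a photon arrival time is a convention chosen here): it computes Lr at every photon time and locates the maximum.

```python
# Sketch: single-channel changepoint statistic, Eq. (15.2).
import numpy as np

def log_likelihood_ratio(t, T):
    """Lr_m = m ln(m/t_m) + m' ln(m'/(T - t_m)) - N ln(N/T),
    with the candidate changepoint placed at each photon time t_m
    (endpoints m = 0 and m = N excluded)."""
    t = np.asarray(t, dtype=float)
    N = len(t)
    m = np.arange(1, N)                  # photons at or before the candidate
    mp = N - m                           # photons after the candidate
    tm = t[:-1]
    return (m * np.log(m / tm) + mp * np.log(mp / (T - tm))
            - N * np.log(N / T))

# Simulated record: rate 140/s on [0, 0.5), then 420/s on [0.5, 1).
rng = np.random.default_rng(2)
T = 1.0
t = np.concatenate([np.sort(rng.uniform(0.0, 0.5, 70)),
                    np.sort(rng.uniform(0.5, 1.0, 210))])
Lr = log_likelihood_ratio(t, T)
m_star = np.argmax(Lr) + 1               # best changepoint, near photon 70
```

With a rate ratio of w = 3, as in Fig. 15.1, the peak of Lr lands within a few photons of the true changepoint.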
1.5. Why read this article?

To illustrate the power of this approach, Fig. 15.1C shows the log likelihood ratio function for the same dataset that was used to generate panels (A and B). (A similar plot appears when the method is applied to the experimental data in Fig. 15.1E; not shown.) The graph shows that, at least for changepoints located well away from the starting and ending times, the statistic not only precisely identifies the true changepoint but also does not identify any other (false) changepoints, provided enough photons have been collected. In fact, the uncertainty in the changepoint, determined from those photons with Lrm ≥ Lr* − 2, is ±2 photons and encloses the known location of the changepoint (Fig. 15.1C, inset). Unlike the traditional approach, no artificial time base due to binning is imposed on the data, and there is no need for user-adjustable thresholds that separate the rate trace into supposedly different regions. As a result, different rate regions of the data are determined in a model-independent way; afterward, the photon emission rates in these regions can be used to determine the orientation of each macromolecular state.

Figure 15.2 illustrates the benefit of our improved analysis in the context of polTIRF studies of myosin V; see Section 2.1. In the figure, the dots represent orientations determined by the traditional method by binning the data into 80-ms intervals (Forkey et al., 2005). Although they generally cluster around the results of our method (horizontal lines), the latter are cleaner and eliminate the spurious outlier points that the traditional analysis generates close to changepoints (Section 1.2). The simple derivation given in Section 1.4 has not yet fully addressed the question of distinguishing hypothesis (H0) from (H1). That is, assuming a single changepoint exists, we found the best estimate of its time, but we still have not answered the question of whether in fact any changepoint is present.
After all, Lrm will always have some maximum; how large a peak is enough to declare a changepoint? Section 3 will discuss this point.
2. Multiple Channels 2.1. Introduction to polTIRF method Space does not allow a full review of the polTIRF method. For our purposes, however, a simple characterization is sufficient (for details, see Beausang et al., 2008; Forkey et al., 2000, 2003, 2005; Quinlan et al., 2005; Rosenberg et al., 2005).
Figure 15.2 Application of changepoint analysis to experimental data on the motions of the molecular motor myosin V. Dots show polar (θ) and azimuthal (φ) angles of a fluorescent probe attached to one lever arm of the motor, inferred from photon rates obtained by the traditional time-binned method. The angles are defined in a system whose polar axis is the optical axis of the microscope. There are many outlier points, in part reflecting transitions that occur in the middle of a time bin. Solid lines show those same angles inferred from all the photons that lie between successive changepoints (dashed lines), indicating a clear alternating stride between well-defined values of φ. For each state, five lines are drawn to indicate the uncertainty in the fit angles, as described in Section 3.4. Generally, these lines are too close to distinguish.
Most fluorescent molecules absorb and emit light via dipole transitions. Thus, a fluorophore’s dipole moment is a director (headless vector), anchored to a body-fixed frame of reference; overall rotation of the molecule changes the dipole moment’s orientation in space, and hence its ability to be excited by various incident polarizations, and also its propensity to
emit photons of various polarizations. Classic early applications to single molecules include Ha et al. (1996, 1998) and Sase et al. (1997). TIRF excites only those fluorophores located within 100 nm or so of a chamber's boundary by setting up an evanescent wave that penetrates only that far into the chamber. This evanescent wave has a polarization related to that of the propagating wave that created it. Thus, by scanning over several incident beam directions and polarizations (typically 4 or 8), the experimenter sequentially changes the illuminating beam's character. By means of a timing signal synchronized to the switching optics, each emitted fluorescence photon can be tagged with the illuminating beam polarization that created it. Such SPC techniques have recently begun to enter molecular biophysics (e.g., in Gopich and Szabo, 2009; Hinze and Basché, 2010; Talaga, 2009; Yang and Xie, 2002). Moreover, by sending the emitted photon beam through a polarization splitter prior to detection, experimenters can further subdivide the photons, for a total of np = 8 or 16 polarization channels (photon types), each with its own emission rate. The collection of all these rates, modulo an overall rescaling, can be computed as a function of the fluorophore's spatial orientation by using Fermi's Golden Rule. Conversely, if we measure all these rates, we can use a maximum likelihood analysis to identify our best estimate of that orientation (Forkey et al., 2000, 2005).

The above discussion assumed that only one fluorophore is illuminated at a time, and also that the fluorophore is rigidly anchored. In reality, of course, everything in the nanometer world undergoes thermal motion. In fact, a fluorophore anchored to an enzyme may have differing amounts of thermal motion at different steps in the kinetic cycle, revealing changes in the mobility of the probe or of the protein to which it is attached.
Thus, the goal of polTIRF is to deduce both the mean of the fluorophore orientation and its variance (“wobble”), as functions of time, from records of photon arrivals (Forkey et al., 2000, 2005).
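The statement that the set of channel rates encodes orientation can be illustrated with a toy photoselection model. This sketch shows only the dipole-projection principle, not the full polTIRF intensity model of Forkey et al. (which accounts for the evanescent-field geometry, probe wobble, and collection optics); the vectors, angles, and function names here are invented for illustration.

```python
# Toy sketch of dipole photoselection (not the polTIRF model itself):
# for a dipole along unit vector d, the mean photon rate in a channel
# with excitation polarization e and detection analyzer a scales as
# (d . e)^2 * (d . a)^2, up to an overall constant.
import numpy as np

def dipole(theta, phi):
    """Unit director from polar angle theta and azimuth phi (radians)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def relative_rates(d, excitations, analyzers):
    """One relative rate per (excitation, analyzer) channel pair."""
    return np.array([(d @ e) ** 2 * (d @ a) ** 2
                     for e in excitations for a in analyzers])

x, y, z = np.eye(3)                          # lab-frame polarization axes
d = dipole(np.radians(60), np.radians(30))   # an example orientation
rates = relative_rates(d, [x, y, z], [x, y]) # 6 hypothetical channels
```

Even in this caricature, the pattern of relative rates across channels changes as (θ, φ) changes, which is what lets a maximum likelihood fit invert measured rates back to an orientation.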
2.2. Multiple-channel changepoint analysis

The previous section motivated the study of multiple streams of distinct photons. (FRET experiments also involve photon streams with two distinct colors. Xu et al. (2008) developed a correlation analysis for finding simultaneous changepoints in two FRET intensities, different from the one implemented here.) We thus suppose that each photon is tagged with an index μ running from 1 to np. Our experimental data then consist of pairs (t1, μ1), . . ., (tN, μN). Let the total number of photons of type μ be Nμ, so Σ_{μ=1}^{np} Nμ = N. Hypothesis (H0) now involves a set of np photon emission rates {r0,μ}, and Eq. (15.1) becomes

L0({r0,μ}) = Σ_{μ=1}^{np} [Nμ ln(r0,μ dt) − r0,μ T].

We optimize over each rate as before to obtain L0,max. Similarly, generalizing the one-changepoint log likelihood and subtracting gives the analog of Eq. (15.2):

Lr(τ) = Σ_{μ=1}^{np} [mμ ln(mμ/τ) + m′μ ln(m′μ/(T − τ)) − Nμ ln(Nμ/T)],

where mμ is the number of photons of type μ detected prior to the proposed changepoint and m′μ = Nμ − mμ; thus, Σμ mμ = m. As before, we will often regard Lr as a function of the sequence number m (not time τ) of a proposed changepoint. (Thus, each of the mμ and m′μ is a function of m.) Rearranging gives our key formula:

Lrm = Σ_{μ=1}^{np} [mμ ln(mμ/Nμ) + m′μ ln(m′μ/Nμ)] − [m ln(tm/T) + m′ ln(1 − tm/T)].   (15.3)

The peaks of Lrm identify potential changepoints in multiple-channel data. The two pieces of information recorded in polTIRF experiments can be viewed as separate contributions to Eq. (15.3): The time-stamp information reports on the overall emission rate of the fluorophore and is contained in the second term, which depends only on the arrival time of each photon and not its polarization. The polarization information, which consists of a tag μ = 1, 2, . . ., np for each photon, is contained in the first term.
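A direct transcription of Eq. (15.3) can be sketched as follows (channel labels 0..np−1, the convention 0 ln 0 = 0, and all simulation parameters are invented for illustration; this is not the chapter's implementation):

```python
# Sketch: multiple-channel changepoint statistic, Eq. (15.3).
import numpy as np

def multichannel_lr(t, chan, n_p, T):
    """Lr_m = sum_mu [m_mu ln(m_mu/N_mu) + m'_mu ln(m'_mu/N_mu)]
             - [m ln(t_m/T) + m' ln(1 - t_m/T)], with 0 ln 0 := 0."""
    N = len(t)
    N_mu = np.bincount(chan, minlength=n_p).astype(float)

    def xlogx(a, b):                     # sum of a * ln(a/b), skipping a == 0
        a = np.asarray(a, dtype=float)
        nz = a > 0
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))

    Lr = np.empty(N - 1)
    counts = np.zeros(n_p)               # per-channel prefix counts m_mu
    for i in range(N - 1):
        counts[chan[i]] += 1
        m = i + 1
        term1 = xlogx(counts, N_mu) + xlogx(N_mu - counts, N_mu)
        term2 = m * np.log(t[i] / T) + (N - m) * np.log(1.0 - t[i] / T)
        Lr[i] = term1 - term2
    return Lr

# Two channels whose rates swap at t = 0.5 while the total rate stays
# constant, so binning the summed counts would show no change (cf. Fig. 15.1D).
rng = np.random.default_rng(3)
T = 1.0
times = np.concatenate([rng.uniform(0.0, 0.5, 150), rng.uniform(0.0, 0.5, 50),
                        rng.uniform(0.5, 1.0, 50), rng.uniform(0.5, 1.0, 150)])
chans = np.repeat([0, 1, 0, 1], [150, 50, 50, 150])
order = np.argsort(times)
t, chan = times[order], chans[order]
Lr = multichannel_lr(t, chan, 2, T)
m_star = np.argmax(Lr) + 1               # near photon 200
```

For np = 1 this reduces term by term to Eq. (15.2); the example demonstrates the polarization (first) term doing all the work when the total rate is unchanged.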
3. Detailed Analysis

An earlier section raised the issue of false positives, which we now explore.
Changepoint Analysis—Single-Molecule polTIRF

3.1. Threshold for false-positive detection

Again, let Lr* denote the absolute maximum of Lr over time (or over the sequence number m). The probability that Lr* is a false positive can be determined from the distribution of Lr* values when no changepoint is present. A threshold for false positives can be defined such that, for example, 95% of the Lr* values lie below the threshold and correctly report no change. In data where the presence of a changepoint is unknown, values of Lr* that exceed the threshold are taken to be valid changepoints with 95% confidence. Letting α denote the fraction of acceptable false positives, the desired threshold r⁰α is set by requiring that

$$\mathrm{Prob}\left\{\mathcal{L}_{r,m} > r^{0}_{\alpha}\ \text{for any}\ m\right\} = \alpha. \tag{15.4}$$

Note that the threshold does not depend on the absolute rate of photon emission, but it does depend on the total number of photons in the interval, as expected, because Eq. (15.2) increases with increasing N. Remarkably, Eq. (15.4) can be computed exactly for the single-channel case by using an algorithm developed by Noé (1972). The threshold is then found by solving Eq. (15.4) for the value that yields the desired α. The dependence of the threshold on N is found by repeating the calculation over the range of photon counts that will be encountered experimentally (Owen, 1995; Watkins and Yang, 2005). The threshold is also easy to compute via simulations, which will be necessary for multiple-channel data because Noé's algorithm applies only to the one-channel case. Thresholds corresponding to α = 0.05 were simulated (see Section 4.1.2) over a wide range of N and fit to a power-law function (see Table 15.1) for use in the changepoint algorithm (see the np = 1 curve in Fig. 15.3A).
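For the single-channel case, the simulation route to the threshold can be sketched as follows (our illustration; Noé's exact algorithm is not reproduced here). Under hypothesis (H0), conditioned on N, the arrival times are uniform on (0, T):

```python
import numpy as np

def peak_loglik_null(N, rng):
    """Peak over m of the single-channel Eq. (15.2) for N photons from a
    constant-rate process; given N, arrival times are uniform on (0, 1)."""
    t = np.sort(rng.uniform(0.0, 1.0, N))
    m = np.arange(1, N)
    x = t[:-1]                                   # t_m / T with T = 1
    L = (m * np.log(m / x)
         + (N - m) * np.log((N - m) / (1.0 - x))
         - N * np.log(N))
    return L.max()

def false_positive_threshold(N, alpha=0.05, n_sim=1000, seed=0):
    """Monte Carlo threshold of Eq. (15.4): the value that the peak log
    likelihood exceeds in only a fraction alpha of null simulations."""
    rng = np.random.default_rng(seed)
    peaks = np.array([peak_loglik_null(N, rng) for _ in range(n_sim)])
    return float(np.quantile(peaks, 1.0 - alpha))
```

Under the null hypothesis the peak grows slowly with N, which is why the threshold must be tabulated as a function of the photon count.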
3.2. Correcting for the nonuniform distribution of false positives

Even though the technique outlined in Section 3.1 successfully determines the number of false positives, the location of these false positives across the interval is highly nonuniform. The probability of detecting a changepoint is about 10× higher in a region near the boundary of the interval (containing 1–5% of the total photons) than in the center of the interval. This problem has been addressed for single changepoints, and a two-step solution proposed, by Henderson (1990). Qualitatively, the phenomenon is not unexpected: For changepoints near the edge of the interval, a random fluctuation in the region with a small number of photons is easily fit with a rate that differs from the one estimated in the larger region. As a result, changepoints are more likely to be identified near the boundaries of the interval. This bias arises because photons in the middle of the interval can arrive with a relatively wide distribution of times, all centered about t/T ≈ 0.5, whereas photons near the boundaries of the interval have a relatively narrow distribution of times, either close to zero or close to T. Thus, even when we generate photons from a stationary Poisson process, the log likelihood ratio Lr is nevertheless larger on average near the boundaries than in the middle. To correct for this effect, the distribution of Lrm at each m is normalized so that it has zero mean and unit standard deviation. This is accomplished by subtracting the mean of the log likelihood ratio, E[Lrm], and dividing by its standard deviation sm at each value of m:

$$\tilde{\mathcal{L}}_{r,m} = \frac{\mathcal{L}_{r,m} - E[\mathcal{L}_{r,m}]}{s_m}. \tag{15.5}$$

Figure 15.3 Values of the threshold r for a 5% false positive rate (error fraction) for the uncorrected (A) and corrected (B) log likelihood functions, as a function of the number N of photons in the interval. The correction procedure is discussed in Section 3.2. The curves for np > 1 polarization channels are discussed in Section 3.3.
If the initial distributions of Lrm at each m were Gaussians of differing size, then this normalization would make the fraction of false positives uniform across the interval. Actually, however, the Lrm are beta-distributed random variables (Henderson, 1990); thus, renormalization alone is not sufficient.
An additional weighting function Wm = 0.5 ln(4m(N − m)/N²) is applied to further penalize the likelihoods near the edge of the interval, resulting in the final form of the corrected log likelihood function L̂rm:

$$\hat{\mathcal{L}}_{r,m} = \tilde{\mathcal{L}}_{r,m} + W_m = \frac{\mathcal{L}_{r,m} - E[\mathcal{L}_{r,m}]}{s_m} + W_m. \tag{15.6}$$
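Applying Eqs. (15.5) and (15.6) is a one-line correction once E[Lrm] and sm have been tabulated; a minimal sketch (our function name, with the correction factors passed in as arrays or scalars):

```python
import numpy as np

def corrected_loglik(L, mean_L, std_L):
    """Eqs. (15.5)-(15.6): normalize the raw log likelihood L_{r,m}
    (one value per candidate m = 1..N-1) by tabulated correction factors
    E[L_{r,m}] and s_m, then add Henderson's weighting function W_m."""
    L = np.asarray(L, dtype=float)
    N = len(L) + 1
    m = np.arange(1, N)
    # W_m = 0.5 ln(4 m (N - m) / N^2): zero at m = N/2, negative at the edges
    W = 0.5 * np.log(4.0 * m * (N - m) / N**2)
    return (L - mean_L) / std_L + W
```

Because W is symmetric about the midpoint, the penalty is the same for candidates equally close to either boundary.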
For the single-channel case, E[Lrm] and sm can be evaluated analytically (Henderson, 1990), and thus the threshold for false positives (using Eq. (15.4) with rα in place of r⁰α) can still be calculated using Noé's algorithm (Watkins and Yang, 2005). These analytic solutions, however, are not readily extended to multiple channels and so are not repeated here. As will be discussed in Section 4, E[Lrm] and sm can be obtained from simulations for any number of channels. Determining these correction factors requires numerous simulations over the desired range of photon counts N and numbers of polarization channels np (see Section 4.1.1), but they need to be performed only once; the tabulated results are then referenced by the algorithm.
3.3. False positives, multiple-channel case

As in the single-channel case, Lrm for the multiple-channel case also suffers from a nonuniform distribution of false positives, which is corrected in the same way as in Section 3.2. This time the simulations start with a fixed overall number of photons N, then partition it randomly into the counts Nμ in each channel, distribute each Nμ randomly within the time interval, evaluate the changepoint likelihood function, and repeat. The resulting correction factors and weighting function differ from the single-channel case and remove most of the bias, except for a small peak very close to the boundary. To avoid this residual bias, we estimated the width of the peak and arranged for the MCCP algorithm to accept only those changepoints that occur within the central 95% of the interval. That is, changepoints are neglected if they occur within a buffer region of 0.025N photons on either end of the interval (see, e.g., the vertical dashed lines in Fig. 15.4 for the np = 16 channel case). The procedure for locating the peak, finding its confidence interval, and testing for its significance is the same as for the single-channel case, except that a new threshold for false positives must be computed for multiple channels. As mentioned in Section 3.1, the threshold for false-positive detection depends on N, but it also depends on the number of polarization channels np among which the photons are divided. The new threshold values, with the correction factors E[Lrm] and sm, are determined from simulations similar to the single-channel case but with the photons divided
among the different polarization channels. The details of the simulations will be discussed in Section 4.1.2, but the thresholds for np = 1, 2, 8, and 16 polarization channels and α = 0.05 are shown in Fig. 15.3. We summarized the simulated values with interpolating functions of the form r⁰5% = A + B(log10 N)^C for the uncorrected likelihoods and r5% = a/(1 + b(log10 N)^c) for the corrected likelihoods; see Table 15.1 for the values of the best-fit parameters.

Figure 15.4 The solid curve shows the distribution of false positives for np = 16 polarization channels across the interval for uncorrected log likelihoods Lrm; it is strongly peaked near the edge of the interval, then decays slowly to a minimum at the center. The distribution becomes increasingly peaked as N is increased from N = 1000 (panel A) to 10,000 (panel B). The fraction of the total probability lying within the first and last 5% of each interval is about 30% and 60% (instead of 10%) for N = 1000 and 10,000, respectively. Applying the correction factors (see Eq. (15.6)) to the log likelihood and excluding 2.5% of the photons from near the edges (vertical dashed lines, see Section 4.1.1) results in a nearly uniform distribution of false positives (dotted curve). For comparison, a uniform distribution with total false positive rate 5% would look like the horizontal dashed line.
Table 15.1 Fitting parameters used to determine the 5% false-positive threshold for different numbers of polarization channels np, using the uncorrected (left columns) log likelihood, r⁰5%(N) = A + B(log10 N)^C, and the corrected (right columns) log likelihood.

np      A        B        C        a       b       c
1       85.07    87.91    0.0229   6.207   1.481   3.029
2       120.0    124.2    0.0182   5.369   1.360   3.314
8       50.25    61.09    0.0352   4.212   1.671   3.377
16      21.22    8.857    2.004    3.839   1.338   3.032

The parameters define the interpolating functions r⁰5% = A + B(log10 N)^C and r5% = a/(1 + b(log10 N)^c).
3.4. Automated, multiple-channel changepoint detection algorithm

In experimental data (Beausang et al., 2008; Forkey et al., 2003), a processive myosin V molecule is recorded over multiple steps, so multiple changes in the orientation of the attached fluorophore are contained within the data, not just the single changepoint discussed so far. One way to proceed would be to let q be the number of changepoints and test hypothesis (Hq; t1, . . ., tq; r, r′, . . .) for all possible values of its parameters. This quickly becomes impractical, however, as q grows large. Fortunately, we can apply our method iteratively to the entire data set (Watkins and Yang, 2005), even though the assumption of constant photon emission rates on either side of any given changepoint is clearly not true. After the changepoints are found in this rough manner, they are optimized one at a time in order to eliminate the influence of neighboring changepoints. More precisely:

1. For a single recording, which includes N photons, np polarization channels, and multiple changepoints, the MCCP algorithm is applied as follows: (a) Calculate Lrm for each photon m in the interval, using Eq. (15.3); (b) Apply the correction and weighting factors E[Lrm], sm, and Wm to each value of Lrm to obtain the corrected log likelihood function L̂rm for each photon in the interval; (c) Within the interval 0.025N–0.975N, find the most likely changepoint as the location m* of the likelihood peak; (d) Test the candidate changepoint's significance by comparing it with the false-positive threshold, L̂rm* > rα; (e) If the peak exceeds the threshold, record its location as a changepoint.

2. On the next iteration, only those photons occurring prior to the peak m* just found are analyzed, and the location of the largest peak above the threshold is again determined. Similarly, the largest peak in the region between m* and the end of the data set is also found. This process is repeated on each subregion of the data, creating a list of candidate changepoints, until no more peaks exceed their respective thresholds.
3. The location of each candidate changepoint is reevaluated over just the range limited by its nearest neighbors. More precisely: (a) Confidence intervals are determined for each changepoint time as those photon sequence numbers with log likelihoods greater than L̂r* − 2; (b) Each changepoint time is reevaluated using only the region that starts at the upper confidence limit of the preceding changepoint and ends at the lower confidence limit of the succeeding one. If the changepoint no longer exceeds the significance threshold over this reduced range, then the region is combined with its neighbor and that neighbor is reevaluated. Regions containing fewer than 50 photons are not expected to yield reliable rate information and so are always combined with the neighboring region.

4. After refining the location of each changepoint, the intervals between all adjacent changepoints are tested for any additional changepoints.

5. Steps 2–4 are repeated four times to optimize the location and number of changepoints.

After the changepoints are determined, the photon rates in each interval are used to estimate the maximum likelihood orientation and wobble of the fluorophore, as outlined in Section 2.1. In order to assess the sensitivity of the inferred orientation to the precise location of the changepoint, four additional sets of rates are determined for each interval by using the edges of the confidence intervals as the boundary instead of the changepoint. For example, consider two changepoints at sequence numbers m*_i and m*_{i+1}, with confidence intervals (m⁻_i, m⁺_i), etc. Five sets of orientations are then determined for the ith interval by using the five ranges

(m*_i, m*_{i+1}), (m⁻_i, m*_{i+1}), (m⁺_i, m*_{i+1}), (m*_i, m⁻_{i+1}), (m*_i, m⁺_{i+1}).

The five corresponding inferred orientations give us an estimate of the uncertainty of our determination; see Fig. 15.2.
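The iterative search of steps 1 and 2 amounts to a recursive binary segmentation of the photon stream. A schematic of that control flow (our sketch, not the published implementation; `loglik_fn` and `threshold_fn` are placeholders for the corrected likelihood of Eq. (15.6) and the tabulated thresholds):

```python
import numpy as np

def detect_changepoints(loglik_fn, threshold_fn, times, channels,
                        lo=0, hi=None, buffer_frac=0.025, min_photons=50):
    """Find the best changepoint in photons [lo, hi); if its corrected log
    likelihood exceeds the false-positive threshold, record it and recurse
    on the two subintervals.  loglik_fn(times, channels) returns the
    corrected log likelihood at each interior candidate (index i means a
    changepoint after photon i+1 of the region); threshold_fn(N) returns
    the threshold for a region of N photons."""
    if hi is None:
        hi = len(times)
    N = hi - lo
    if N < min_photons:
        return []
    L = np.asarray(loglik_fn(times[lo:hi], channels[lo:hi]))   # length N-1
    buf = max(1, int(buffer_frac * N))          # exclude the edge buffer zones
    interior = np.arange(buf, N - 1 - buf)
    if interior.size == 0:
        return []
    best = interior[np.argmax(L[interior])]
    if L[best] <= threshold_fn(N):
        return []
    cp = lo + best + 1                          # first photon after the changepoint
    return (detect_changepoints(loglik_fn, threshold_fn, times, channels,
                                lo, cp, buffer_frac, min_photons)
            + [cp]
            + detect_changepoints(loglik_fn, threshold_fn, times, channels,
                                  cp, hi, buffer_frac, min_photons))
```

The defaults mirror the text: a 2.5% buffer on each end and a 50-photon minimum region size.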
3.5. Critique of the multiple-channel changepoint algorithm

The MCCP algorithm makes several simplifying assumptions, which we elaborate here before discussing the simulations. For single-molecule experiments, the time between photons, set by the fluorophore's emission rate, places an absolute limit on the achievable time resolution. Typical count rates are 20–50 photons/ms. The statistical model that underlies the multiple-channel log likelihood function (Eq. (15.3)) assumes that photons in each polarization channel are emitted independently of one another and detected simultaneously. In practice, however, polTIRF experiments alternately illuminate the sample so that only one excitation polarization state is active at a time. Artifacts may
arise if the molecule moves on time scales comparable to the polarization switching time, but this is not typically the case for biological macromolecules and cycling frequencies above 10 kHz. The threshold for false positives is clearly a crucial parameter, as it determines the validity of a particular changepoint. An advantage of the changepoint analysis is that this threshold is not a user-defined value but is instead determined by the desired limit α on false positives. The analytic method used to calculate the single-changepoint threshold is not readily applied to multiple photon channels, but we found it easy instead to find the threshold by using computer simulations. Furthermore, the threshold is a smooth function of the number of photons in the interval (Fig. 15.3), so only a few values of N need to be calculated and the rest can be obtained from an equation fit to the simulations. In the multiple-channel case, our assumption in Section 3.3 that the photon rates were randomly chosen deserves discussion. For applications to polTIRF experiments, a better assumption might be that the photons are randomly distributed among the channels with an average that is consistent with an isotropic distribution of fluorophores. Because the log likelihood function (Eq. (15.3)) depends on the number of photons in each channel, the false-positive threshold in these two scenarios would not be the same. Distributing the photons equally among the different channels, however, results in the largest-magnitude likelihood (on average), and so the threshold determined in this way is a conservative estimate of whether or not a false positive occurred. Assuming an equal distribution of photons is also advantageous because it is independent of the model used to represent the molecule's fluorescence emission and detection.
The weighting function W and the 2.5% buffer zone used to remove the remaining bias are easy to apply to the changepoint analysis with minimal additional computation. The origin of the weighting function appears to be somewhat ad hoc (Henderson, 1990); however, it is effective (Fig. 15.4) and has been used by other groups (Watkins and Yang, 2005). A key feature of the weighting function is that although it suppresses detection of changepoints near the edge of the interval, it does not preclude them entirely; a legitimate changepoint will be detected if its likelihood is large enough. The additional buffer zone on either end of the interval, however, does preclude the detection of changepoints in this small region. Short-duration events that precede or follow a long-duration event may be missed, but in typical experiments, events longer than about 10,000 photons are not common, and the resulting dead-time equivalent of about 250 photons is near the limit of detection in our application. In cases where this trade-off is not desirable, the MCCP algorithm would be useful for identifying the long-duration dwells, which could then be subjected to a local analysis at the two ends to test for additional changepoints.
Estimating the 95% confidence intervals from the log likelihood surface is a common statistical practice (Bevington and Robinson, 2003; Edwards, 1972); however, more rigorous confidence intervals can also be defined (Watkins and Yang, 2005). For example, all photons adjacent to a changepoint for which hypothesis (H1) is at least 5% likely to be true would be included in the 95% confidence interval. In the single-channel case, Watkins and Yang (2005) found that the fraction of changepoints falling within the confidence interval depended on the magnitude of the changepoint. Simulations to determine the confidence interval in the multiple-channel case would be more expensive than those used to determine the false-positive threshold, because both the magnitude of the changepoint and the number of photons in the interval would need to be varied. Given these limitations, simply estimating the 95% confidence interval from a 2-unit offset on the log likelihood surface (i.e., all photons m with L̂rm ≥ L̂r* − 2) is a practical compromise. In single-molecule polTIRF experiments, changepoints are expected to occur when the probe changes orientation, but changepoints will also be detected when the total photon rate changes magnitude, similar to the scenario in single-channel changepoint analysis. Typically, genuine reorientations incur little change in the total photon rate, but fluctuations in the total rate do occur. For example, the changepoint algorithm easily detects the step decrease in rate when the single molecule bleaches to background, as well as the occasional double-bleach and blinking events where the fluorophore turns off and then back on again.
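The 2-unit-offset interval is simple to extract from the corrected likelihood curve; a minimal sketch (hypothetical helper name, taking the corrected likelihoods at candidates m = 1..N−1):

```python
import numpy as np

def changepoint_confidence(L_hat):
    """Confidence interval from a 2-unit offset on the corrected log
    likelihood surface: the contiguous run of candidate photons around
    the peak with L_hat >= max(L_hat) - 2."""
    L_hat = np.asarray(L_hat, dtype=float)
    peak = int(np.argmax(L_hat))
    ok = L_hat >= L_hat[peak] - 2.0
    lo = peak                       # extend contiguously from the peak
    while lo > 0 and ok[lo - 1]:
        lo -= 1
    hi = peak
    while hi < len(L_hat) - 1 and ok[hi + 1]:
        hi += 1
    return lo, peak, hi
```

Restricting to the contiguous run around the peak avoids counting distant candidates that happen to clear the offset by chance.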
4. Simulation Results

Three types of simulations were performed to test the algorithm: (1) No-changepoint simulations tested the null hypothesis (H0) and were used to determine the correction factors E[Lrm] and sm and the threshold for false positives; (2) Single-changepoint simulations assessed the false-negative rate of the algorithm over a range of changepoint magnitudes and durations; (3) Double-changepoint simulations, of a large transition followed by a short-lived state with a second transition, tested the algorithm's sensitivity to substeps within the myosin V cycle. The various photon rates for a simulation are generated either arbitrarily, to give intuition about the detection algorithm, or by calculating the polarized fluorescent photon rates that correspond to actual fluorophore orientations using a simple model of the probe (Section 2.1). Our simulations of the MCCP analysis rely on generating a specified number of interphoton arrival times from an exponential distribution. Each photon is randomly assigned to one of the independent polarization
channels (usually np = 8 or 16, which correspond to the typical number of channels in experimental data) with a probability that is weighted according to its relative rate. For example, if the probe model assigns rate k to six polarization channels and rate 2k to the remaining two, then the photon arrival times are generated with ktot = 10k, and each photon is randomly assigned to one of the six low-rate polarization channels with probability 0.1 and to one of the two high-rate channels with probability 0.2. This two-step process ensures that the total rate is constant and that each of the individual polarization channels has the proper relative rate with exponentially distributed arrival times. A changepoint is introduced after the mth photon by using one set of weights for photons 1, . . ., m and a second set of weights for photons (m + 1), . . ., N. The MCCP algorithm (Section 3.4) is then applied to the simulated data, and any statistically significant changepoints are recorded. This process is repeated, typically 500–10,000 times depending on the simulation, to minimize statistical fluctuations.
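The two-step simulation just described can be sketched as follows (our illustration; the example weights reproduce the six-channels-at-rate-k, two-channels-at-rate-2k case from the text):

```python
import numpy as np

def simulate_stream(n_before, n_after, weights_before, weights_after,
                    total_rate, rng):
    """Two-step simulation of Section 4: exponential interphoton times at
    a constant total rate, then random channel assignment with
    probabilities proportional to each channel's relative rate.  A
    changepoint occurs after photon n_before, where the weights switch."""
    N = n_before + n_after
    times = np.cumsum(rng.exponential(1.0 / total_rate, size=N))
    w1 = np.asarray(weights_before, dtype=float)
    w2 = np.asarray(weights_after, dtype=float)
    channels = np.empty(N, dtype=int)
    channels[:n_before] = rng.choice(len(w1), size=n_before, p=w1 / w1.sum())
    channels[n_before:] = rng.choice(len(w2), size=n_after, p=w2 / w2.sum())
    return times, channels

# Example weights: six channels at rate k, two at rate 2k (k_tot = 10k),
# giving per-photon channel probabilities 0.1 and 0.2, respectively.
example_weights = [1.0] * 6 + [2.0] * 2
```

Sampling the interphoton times at the fixed total rate and only then assigning channels keeps the overall rate constant while preserving the relative channel rates.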
4.1. No-changepoint simulations
The peak log likelihood L̂r* calculated from Eqs. (15.3) and (15.6) must exceed a threshold to be considered a valid changepoint (with false-positive rate α). The threshold is determined from simulations over a range of photon counts N and polarization channels np for α = 0.05.

4.1.1. Correction factors

As discussed in Section 3.2, the distribution of peak log likelihoods simulated under conditions of the null hypothesis (i.e., no changepoint) is not uniform across the interval, and this results in a bias toward detecting false-positive changepoints preferentially near the boundaries of the search interval. The distribution of Lrm can be empirically determined by repeatedly applying Eq. (15.3) to a constant-rate simulation. The mean and standard deviation of Lr at each point across the interval are then used to normalize the likelihood. The process is repeated over a range of N to generate a lookup table for the two correction factors. Values of N not in the lookup table are linearly interpolated between the two nearest values. Determining the correction factors from simulations is computationally expensive but can be performed on a PC in a few days. Moreover, it is a one-time cost; the algorithm thereafter references the tabulated results. The resulting correction factors E[Lrm] and sm for various N follow similar trends across the interval as N is increased (Fig. 15.5). In order to compare simulations with different numbers of photons on the same graph, the photon index m is normalized by the total number, x = m/N, and plotted on a logarithmic scale to emphasize the region close to the boundary of the interval. Because the correction factors are symmetric about x = 0.5, the
Figure 15.5 MCCP correction factors for (A) the expected value E[Lrx] and (B) the standard deviation sx of the log likelihood function Lr (Eq. (15.3)) for N = {50, 100, 500, 1000, 5000, 50,000} and np = 8. The horizontal axis x = m/N indicates the position of the mth photon across the interval, normalized to the total number of photons. Only half the distribution is shown; the correction factors are symmetric about x = 0.5. These functions are needed to evaluate the correction given in Eq. (15.5).
counting statistics are improved twofold by superimposing the results from the two halves of the interval. As the number of polarization channels increases from 8 to 16 (data not shown), the magnitudes of both E[Lrx] and sx increase, as expected, since the number of logarithmic terms in Eq. (15.3) doubles. When N ≳ 500, all of the correction factors show a plateau in the center of the interval that increases as the edge of the interval is approached and then falls abruptly immediately at the edge. The increase in both the mean and the standard deviation near the edge of the interval reflects the observed increase in the fraction of false positives. Unlike the correction factors for changepoints with multiple photon channels, the correction factors in the single-channel case (not shown) increase monotonically toward the boundary. The reason that the multiple-channel correction factors show a sharp decrease immediately at the boundary is that the
magnitude of Lr depends on the number of terms in Eq. (15.3) that contribute to the sum. If enough photons are present in each region (before and after a changepoint), then every term can contribute to the sum. But as the algorithm tests points that are closer to the boundary, eventually the number of photons in some of the polarization channels drops to zero, and the corresponding terms drop out and lower Lr proportionally (because 0 ln 0 = 0). The final result, including the correction factors, weighting function, and buffer, is a uniform distribution of false positives, at least in the range considered here (N = 50–50,000 and np = 8) (Fig. 15.4, dotted curves).

4.1.2. False-positive threshold

We found the threshold for false positives from a set of constant-rate simulations similar to the ones just described. Instead of recording the first and second moments of Lr, however, the peak log likelihood L̂r* and its location m* are recorded for each of the M simulations. The list of L̂r* values is sorted, and the value that separates the largest Mα of them from the remaining M(1 − α) is the desired threshold. The functional dependence on N is obtained by repeating the calculation over a range. Because actual data can have any value of N, the simulated values of r are fit to the interpolating function rα = a/(1 + b(log10 N)^c) to determine a, b, and c. Finally, the entire process is repeated for different numbers of polarization channels. The results for np = 1, 2, 8, 16 are shown in Fig. 15.3B, and the corresponding values of a, b, and c for each fit are summarized in Table 15.1. All thresholds used here correspond to a 5% false-positive rate (α = 0.05). For comparison, we used the same set of simulations to determine the threshold for the uncorrected log likelihood function Lr* (see Fig. 15.3A).
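Because the correction factors and thresholds are tabulated at only a few values of N (Section 4.1.1), intermediate values are obtained by linear interpolation. A sketch of such a lookup (hypothetical helper, ours):

```python
import numpy as np

def interp_correction(N, table_N, table_curves):
    """Linearly interpolate a tabulated correction factor (E[L] or s) in
    the photon count N.  Row i of table_curves is the factor simulated at
    table_N[i], sampled on a shared normalized grid x = m/N."""
    table_N = np.asarray(table_N, dtype=float)
    curves = np.asarray(table_curves, dtype=float)
    if N <= table_N[0]:
        return curves[0]
    if N >= table_N[-1]:
        return curves[-1]
    j = int(np.searchsorted(table_N, N))     # table_N[j-1] < N <= table_N[j]
    w = (N - table_N[j - 1]) / (table_N[j] - table_N[j - 1])
    return (1.0 - w) * curves[j - 1] + w * curves[j]
```

Clamping to the endpoints keeps the lookup defined for photon counts outside the simulated range.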
4.2. Single-changepoint simulations

4.2.1. Power to detect an arbitrary rate change

A low false-positive rate is important for confidence in the results; however, a low fraction of false negatives (i.e., high power of the test) is also crucial in order to detect a majority of the changepoints. The power of the MCCP algorithm is determined from simulations performed with np = 8 and 16 polarization channels for various total photon counts N and rate-change magnitudes w = max(r′/r, r/r′). The photon rates were not based on any assumed orientation of the probe; half of the rates changed from r to wr and the other half changed from wr to r, thus ensuring a constant total rate. The simulation placed the changepoints at the midpoint of the interval (N/2), and the MCCP was deemed successful if the L̂r* − 2 confidence interval enclosed the true location.
We ran simulations over a range of changepoint magnitudes w and photon counts N, for np = 8 (Fig. 15.6A) and np = 16 (Fig. 15.6B). Five thousand simulations for each combination of {N, w, np} were run, and the fraction of trials with a detected changepoint was recorded (solid lines), as well as the fraction of changepoints whose confidence interval includes the true location (dotted lines). As expected, simulations with large N and w resulted in a higher fraction of detected changepoints, and larger rate changes required fewer photons to identify the changepoint. At the larger w and N, nearly 100% of the changepoints are detected. Even though an interval corresponding to the 95% confidence interval was chosen, the actual accuracy of the method exceeded 98%, depending on N and w. The nonzero fraction of detected events at w = 1 indicates the false-positive error rate.
Figure 15.6 Power of the MCCP algorithm to detect changepoints of different magnitudes, as a function of the number of polarization channels, np = 8 (A and C) and np = 16 (B and D), and the number of photons in the interval. Top row, solid lines and symbols: the fraction of changepoints detected versus an arbitrary relative photon rate change w, for various N. Dotted lines: the fraction of changepoints that were detected and assigned a time lying within the L̂r* − 2 confidence interval of the true time. The high fraction meeting this condition indicates that these confidence intervals are conservative. Bottom row: the fraction of changepoints detected for an angle change corresponding to the tilting motion of a probe attached to the myosin V lever as it steps (see Table 15.2), as a function of the signal-to-background ratio (SBR) for various N.
Increasing the number of polarization channels from 8 to 16 (Fig. 15.6A and C vs. B and D, respectively) decreases the power of the test slightly, owing to the increase in photon counting noise that occurs when N photons are divided into twice as many polarization channels. The sensitivity to additional photon channels is mitigated in the arbitrary-rate model used here (Fig. 15.6A vs. B), because all of the rates contribute equally to the changepoint. This is not true when the rate change arises from probe reorientations (Fig. 15.6C vs. D), because some of the photon rates respond more strongly to a particular change than others.

4.2.2. Power to detect a myosin lever arm change

In order to determine the power of the MCCP in experiments on myosin V stepping, we performed simulations of the probe angle before and after a step using the values in Table 15.2. Instead of an arbitrary rate ratio w, the simulations were performed by assuming base rates given by a dipole model with specified angles (Table 15.2) plus a base rate representing background fluorescence. We varied the background and present results as a function of the signal-to-background ratio (SBR), defined as (rate of fluorophore + rate of background)/(rate of background). Otherwise, the simulation conditions were similar to those in Section 4.2.1. As in the arbitrary-rate case (Section 4.2.1), the power of the algorithm to detect changepoints increases with increasing SBR and number of photons (Fig. 15.6). The reduction in sensitivity when the number of polarization channels is increased from np = 8 (Fig. 15.6C) to 16 (Fig. 15.6D) is larger than in the arbitrary-rate case, because some of the additional polarization channels are not sensitive to the angle change yet still "steal" a fraction of the total number of photons from the other channels. Experiments with SBRs of 3 require 200 and 400 photons for 90% detection in the 8- and 16-channel configurations, respectively.
If the fluorophore emits photons at a rate of 30 ms⁻¹, then the corresponding time resolution in the two cases would be 7–10 ms and 13–25 ms. The shortest-duration detectable events will be tested directly in Section 4.3.

Table 15.2 Orientation and wobble (δ) used in the simulations of myosin V stepping (see Section 2.1)

State           {θ, φ}           {β, α}       δ
Prestep         {96.7, 168.8}    {20, 20}     40
Detached head   –                –            90
Poststep        {18.9, 23.3}     {80, 85}     40

The orientations are represented in polar coordinates, in the microscope (θ, φ) and actin (β, α) frames. That is, β is the polar angle of the probe with respect to the actin filament and α is the azimuthal angle around the filament, where α = 0 is parallel to the microscope stage and α = 90 is parallel to the optical axis of the microscope. All angles are in degrees.
4.3. Two-changepoint detection

Substeps in the myosin V ATPase cycle are predicted to occur in a short period of time immediately before or after a step is taken; that is, a second changepoint adjacent to the large one that accompanies the tilting motion of a step. We ran simulations to determine the sensitivity of the MCCP algorithm for detecting these short-lived states over a range of photon counts in the transient state, Nt = 1–1000, and various SBRs. The simulation consists of a long-lived state with a well-defined orientation, followed by a short-lived state with large wobble (no well-defined orientation), and ends in a long-lived state, also with a well-defined orientation. Specifically, the angles from Table 15.2 are used to represent a myosin in the (prestep)/(detached head)/(poststep) configurations. The number of photons in the pre- and poststep states is held fixed at 2000 each, and the number of photons in the transient state is varied. Each combination of Nt and SBR is simulated 500 times, and the fraction of trials resulting in single, double, and triple changepoints is recorded for both 8 (Fig. 15.7A–C) and 16 polarization channels (Fig. 15.7D–F). The simulation technique was outlined in previous sections. To find the changepoints in each trial, the algorithm is applied three times: first to the entire interval, and, if the peak log likelihood exceeds the threshold, the regions to the left and right of the peak are then interrogated for changepoints in these shorter regions. In the event that three changepoints are detected, the middle one is reevaluated on the interval between the other two and retained only if its peak exceeds the required threshold. As the number of photons in the transient state increases, the fraction of trials with single changepoints decreases (Fig. 15.7A and D), while the fraction with two changepoints increases to about 90% (Fig. 15.7B and E).
The fraction of trials with a spurious third inferred changepoint is relatively constant at approximately 10%. When there is no transient state, the fraction of trials with single changepoints is approximately 90%, indicating an approximately 10% false positive rate. If the known locations of the simulated changepoints are used to determine the accuracy of the detected changepoints, then fewer of the trials will be considered successes. For example, if the overlap between the detected and the actual interval of the transition is required to be between 90 and 110%, then two to three times more photons in the transition are required to detect the same fraction of events. If a fluorophore emits approximately 30 photons/ms, then only 300 photons will be recorded during a 10-ms transient state. If the SBR is assumed to be 3, then approximately 500 photons are required to detect 50% of the events in the 8-channel configuration, and approximately 700 photons with
Changepoint Analysis—Single-Molecule polTIRF
[Figure 15.7 appears here: panels A–F plot the fraction of trials detected versus the number of photons in the transient state (0–1000) for SBR = 4, 3, 2, and 1.5, with 8 polarization channels in the left column and 16 in the right column.]
Figure 15.7 The power of the MCCP algorithm to detect short-duration (transient) states in the myosin V ATPase cycle, specifically, the short-lived detached state after the motor head releases from actin but before it steps and rebinds, is determined from photon emission rates simulated using the angles in Table 15.2. Simulations with np = 8 (left) and np = 16 polarization channels (right) indicate one (top), two (middle), or three (bottom) detected changepoints as the number of photons in the transient state is increased from 0 to 1200 for various SBRs. Requiring that the interval be detected with at least 90% accuracy (dashed curves, middle panels) significantly increases the number of photons needed to identify the state reliably (see text).
16 polarization channels, approximately twice as many as was required in the single-changepoint simulations.
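The recursive search used in these simulations can be sketched in code. The following is a minimal, illustrative single-channel version (total photon rate only), not the multichannel MCCP implementation: the scan statistic is the standard Poisson two-rate versus one-rate log-likelihood ratio, and the threshold value is an arbitrary placeholder rather than the statistically calibrated threshold discussed in the chapter.

```python
import math

def llr_scan(times, T):
    """Best log-likelihood ratio of a two-rate vs. one-rate Poisson model
    for sorted photon arrival times on (0, T]; returns (llr, photon index)."""
    N = len(times)
    best_llr, best_k = -math.inf, None
    for k in range(1, N):                 # candidate changepoint after photon k
        t = times[k - 1]
        if t <= 0.0 or t >= T:
            continue
        llr = (k * math.log(k / t)
               + (N - k) * math.log((N - k) / (T - t))
               - N * math.log(N / T))
        if llr > best_llr:
            best_llr, best_k = llr, k
    return best_llr, best_k

def find_changepoints(times, t0, t1, threshold):
    """Recursively split (t0, t1] wherever the peak LLR exceeds threshold,
    mirroring the scan/re-scan procedure described in the text."""
    seg = [t for t in times if t0 < t <= t1]
    if len(seg) < 2:
        return []
    llr, k = llr_scan([t - t0 for t in seg], t1 - t0)
    if k is None or llr < threshold:
        return []
    cp = seg[k - 1]
    return (find_changepoints(times, t0, cp, threshold) + [cp]
            + find_changepoints(times, cp, t1, threshold))

# Deterministic demo: 1 photon/s for 100 s, then 10 photons/s for 10 s.
times = [float(i) for i in range(1, 101)] + [100.0 + 0.1 * i for i in range(1, 101)]
print(find_changepoints(times, 0.0, 110.0, threshold=10.0))  # -> [100.0]
```

A trial with two simulated changepoints (long/short/long states) is handled the same way: the first pass finds the dominant changepoint, and the flanking intervals are then re-scanned, as in the three-pass procedure above.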
5. Discussion

5.1. Single photon counting in single-molecule biophysics

The main points of our method are summarized in Section 1. Fluorescence experiments that utilize SPC technology can achieve very high time resolution by recording the arrival time of each detected photon. There is no
John F. Beausang et al.
binning of the raw data (i.e., lumping photons into groups); afterward, the experimentalist can choose any bin size for analysis. This chapter has described an alternative approach that never imposes a bin size on the data and uses the photon arrival times directly. Changepoint detection algorithms meet both of these requirements and are particularly powerful because they do not require any user-defined threshold that separates high and low rate states (Watkins and Yang, 2005). All parameters within the changepoint algorithm are statistically defined once a desired false positive error rate is chosen. The high time resolution polTIRF experiments discussed in Section 2.1 are an example of fluorescence experiments that implement SPC technology. In addition to recording photon arrival times, a polarization tag is also recorded for each detected photon. In these experiments, most changepoints do not involve any change of the overall photon rate; instead, we must find the times when the photon rates change relative to one another. Because of this distinction, we developed a new multiple-channel changepoint (MCCP) analysis to analyze high time resolution polTIRF data. The basic idea of the method is to test whether two adjacent regions of the data are better described by two different photon emission rates or by one constant rate. Because three free parameters (the changepoint location m and two rates r, r′) will always fit the data better than a single rate r0, we defined a threshold consistent with a specified false positive rate that requires the two-rate hypothesis to be significantly better than one rate. If that condition is met, the location in the interval with the largest log likelihood above this threshold is identified as a changepoint. All of the changepoints in the data can be determined by applying this test recursively to the intervals between previously determined changepoints. In recordings with only one channel, the log likelihood simplifies to just the second term of Eq.
(15.3) (Watkins and Yang, 2005); changes in the total photon rate can be located within a few photons of the actual change (see Fig. 15.1). Qualitatively, this precision is consistent with the abrupt change in slope when the arrival times are plotted versus the corresponding sequence numbers (Fig. 15.1B). In recordings with two photon channels, analogous to FRET or to a simplified polarization measurement, the location of the changepoint is often determined predominantly by the first term in Eq. (15.3), because the total photon rate is often approximately constant (see Fig. 15.1D), although the individual photon rates change abruptly (e.g., Fig. 15.1E). Despite the relative coarseness of the polarization information, changepoints can still be accurately identified (Fig. 15.1C). We noted that the uncorrected log likelihood function Eq. (15.3) has the disadvantage that its magnitude is not uniform across the interval and is higher on average near the boundary, even if no changepoint exists. The peaks in the log likelihood function (and thus the changepoints) are
therefore biased near the edge of the interval, especially for large numbers of photons (solid line, Fig. 15.4). Analytical corrections for this effect have been derived (Henderson, 1990) for the single-channel case and successfully applied to fluorescence data (Watkins and Yang, 2005). We found analogous correction factors from simulations for the multiple-channel case and used them in the MCCP algorithm. Modifying the likelihood function using the correction factors, a weighting function, and a narrow exclusion region that prohibits changepoints from occurring within the first and last 2.5% of the data nearly eliminates the bias across a wide range of photons (dotted line, Fig. 15.4). Correcting for this effect is particularly relevant for finding substeps in myosin V polTIRF experiments because the intervals of interest are adjacent to a prominent changepoint—the same region that is sensitive to a false positive. For intervals with a sufficient number of photons (N ≳ 500), the shape of the correction factors across the interval follows a consistent pattern as N is increased. Their average values are relatively constant in the center region, peak near the edge, and then drop precipitously at the boundary. The reason for this drop is that the log likelihood (Eq. (15.3)) is proportional to the number of polarization terms; if there are too few photons in a region, then some of the terms drop out and the log likelihood function decreases. This effect is clearly seen when comparing the distribution of correction factors for various numbers of polarization channels np (not shown): there is no decrease at the edge in the single-channel case, and the decrease becomes more pronounced as the number of channels increases. Simulated single changepoints with np = 8 and 16 polarization channels were accurately identified over a range of photon counts N and SBR (Fig. 15.6).
The number of photons required to detect an event was inversely proportional to the size of the transition; that is, large-magnitude changepoint events were easier to detect. Simulations that assume that each of the channels participates equally in the changepoint were used to compare the 8 and 16 polarization channel cases (Fig. 15.6A and B). For a given N and SBR, there is a small reduction in the sensitivity when the number of polarization channels is increased, but often this is a useful trade-off because the orientation of the probe is better defined with 16 polarizations. The accuracy of our method can be determined by comparing the changepoint with its known location. The confidence limits are expected to enclose the known location for 95% of the trials. The actual accuracy depended on the number of photons and SBR, but was often greater than 98% for most of the conditions (dashed lines, Fig. 15.6). By using the dipole model for the probe to determine the photon rates (Forkey et al., 2005), instead of distributing the photons according to an arbitrary change w, we assessed the sensitivity of the analysis for experimental data. Because the experiment entails a single molecule of myosin V translocating along actin, we simulated the orientation of the probe before
and after the myosin steps (see Table 15.2). The detection of events improves as the number of photons and the SBR increases (Fig. 15.6C and D); however, the sensitivity decreased when the number of polarization channels was increased from 8 to 16. An optimistic value of the SBR in polTIRF experiments is 3, indicating that approximately 200 photons are required to detect 95% of the changepoints in the 8-channel case. This number approximately doubles when the number of channels increases to 16. The reason for this is that only a few channels are sensitive to the orientation change; thus, the number of photons contributing to the changepoint can be fewer than expected based on the SBR. For example, a probe that rotates 90° from being aligned along the x-axis to the z-axis would be obvious if the polarizations were aligned along those two directions, but would be invisible to polarizations aligned at 45° to those directions. Figure 15.2 shows another way to underscore the usefulness of the method: Changepoint analysis lets us identify the widest possible bins for accumulating photon statistics, leading to more reliable estimates of orientations (in polTIRF) or distances (in FRET). In polTIRF, the improvement can be especially significant, because the inferred probe orientation is a highly nonlinear function of the photon rates, and those rates are never exactly known. Suppose that a particular set of photon rates defines an orientation uniquely (apart from the unavoidable 180° dipole ambiguity). Nevertheless, in practice, that set of rates may be near enough to a degenerate point that the unavoidable statistical fluctuations in estimating the rates create spurious jumps between two very different inferred orientations. Changepoint analysis addresses this problem by maximizing the number of photons used in each orientation determination, thus minimizing the statistical uncertainty in rate estimates and keeping them away from such degeneracies.
5.2. Vista: Transient state detection

A bigger challenge is to detect a relatively short-lived state immediately adjacent to a large changepoint. Such a pattern is expected during a step of myosin V, where the large changepoint corresponds to the tilting of the lever arm before and/or after a step and the small changepoint is the short-lived transient state of the detached head before it rebinds to actin. Our simulations emulated this scenario by modeling three states: (1) a long-lived state (with 2000 photons) corresponding to the prestep head orientation with relatively little wobble since both heads are attached to actin, (2) a variable duration transient state (0–1200 photons) of large probe wobble due to the detached head rapidly diffusing toward the next binding site, and (3) a long-lived state (also with 2000 photons) in the leading head poststep orientation with relatively little wobble.
When there is no such transient state, the algorithm detects approximately 90% of the single changepoints (Fig. 15.7A and D) representing the step, similar to Fig. 15.6C and D. As the number of photons in the transient state increases, the probability to detect it also increases (Fig. 15.7B and E) but plateaus at approximately 90% due to a relatively constant, approximately 10% probability to detect a spurious third changepoint (Fig. 15.7C and F). For large N, the 3-changepoint cases almost always involve a correct determination of the transient state plus an additional false positive somewhere else in the interval. Detecting the transient state requires more photons in the 16 polarization channel case (Fig. 15.7E) than it does with 8 channels (Fig. 15.7B), similar to the results discussed for single-changepoint detection. For SBR = 3, approximately 750 and 1100 photons are required to detect 80% of the intervals in the 8 and 16 polarization channel cases, respectively. It is important to realize that these simulations give useful estimates for the design of experiments, but the actual sensitivity may be different for different orientations. Determining the probe orientation and wobble in the interval between changepoints (Section 3.4) can be used to validate whether a particular changepoint is physically relevant or not. In polTIRF experiments, for example, a small change in the overall rate may result in a statistically significant changepoint, but if the corresponding inferred orientation does not also change, then it is not likely to be biologically relevant. The usefulness of this approach, however, is compromised by spurious changes in the orientation that arise from overfitting to photon counting noise. The MCCP algorithm minimizes this problem by ensuring that the maximum number of photons is included in each dwell, but the effect still remains for short-duration dwells.
6. Conclusion

Our conclusions were already summarized in Section 1.5; Figs. 15.1 and 15.2 apply our method to experimental data. We extended a changepoint analysis for single-channel fluorescence experiments like FRET (Watkins and Yang, 2005) to make it applicable to multiple-channel data, for example, from polTIRF. Our method dramatically improves the time resolution potential of such experiments and also their accuracy in determining orientation changes in molecular motors. We tested the method’s accuracy and power to detect changepoints over a range of photon numbers and SBRs. Our simulations indicate that approximately 700 and 1100 photons are required to detect the detached state between myosin V steps in 8- and 16-channel polTIRF configurations. With 8 polarization channels, fewer photons are required to locate the short-lived state; however, 16 channels are required to accurately identify the increase in wobble cone.
Glossary

FRET: Förster resonance energy transfer.
MCCP: Multiple-channel changepoint.
polTIRF: Polarized total internal reflection fluorescence microscopy.
SBR: Signal-to-background ratio.
SPC: Single photon counting.
ACKNOWLEDGMENTS

We thank Haw Yang for an extensive discussion and for sharing some computer code, and Xavier Michalet and Chris Wiggins for bringing references to our attention. This work was partially supported by NSF grants DGE-02-21664 (JFB), EF-0928048 (PCN), and DMR-0832802 (PCN and YEG), and NIH grant R01 GM086352 (YEG).
REFERENCES

Andrec, M., Levy, R. M., and Talaga, D. S. (2003). Direct determination of kinetic rates from single-molecule photon arrival trajectories using hidden Markov models. J. Phys. Chem. A 107(38), 7454–7464.
Beausang, J. F. (2010). Single Molecule Investigations of DNA Looping Using the Tethered Particle Method and Translocation by Acto-Myosin Using Polarized Total Internal Reflection Fluorescence Microscopy. Ph.D. Thesis. University of Pennsylvania, Philadelphia, PA.
Beausang, J. F., Sun, Y., Quinlan, M. E., Forkey, J. N., and Goldman, Y. E. (2008). Orientation and rotational motions of single molecules by polarized total internal reflection fluorescence microscopy. In “Single Molecule Techniques,” (P. R. Selvin and T. Ha, eds.), pp. 121–148. Cold Spring Harbor, NY.
Bevington, P. R., and Robinson, D. K. (2003). Data Reduction and Error Analysis for the Physical Sciences. 3rd edn. McGraw–Hill, Boston.
Chen, J., and Gupta, A. (2001). On change point detection and estimation. Commun. Stat. Simul. Comput. 30(3), 665–697.
Edwards, A. (1972). Likelihood. 1st edn. Cambridge University Press, London.
Forkey, J. N., Quinlan, M. E., and Goldman, Y. E. (2000). Protein structural dynamics by single-molecule fluorescence polarization. Prog. Biophys. Mol. Biol. 74(1–2), 1–35.
Forkey, J. N., Quinlan, M. E., Shaw, M. A., Corrie, J. E. T., and Goldman, Y. E. (2003). Three-dimensional structural dynamics of myosin V by single-molecule fluorescence polarization. Nature 422(6930), 399–404.
Forkey, J. N., Quinlan, M. E., and Goldman, Y. E. (2005). Measurement of single macromolecule orientation by total internal reflection fluorescence polarization microscopy. Biophys. J. 89, 1261–1271.
Gopich, I. V., and Szabo, A. (2009). Decoding the pattern of photon colors in single-molecule FRET. J. Phys. Chem. B 113(31), 10965–10973.
Ha, T., Enderle, T., Ogletree, D. F., Chemla, D. S., Selvin, P. R., and Weiss, S. (1996). Probing the interaction between two single molecules: Fluorescence resonance energy transfer between a single donor and a single acceptor. Proc. Natl. Acad. Sci. USA 93(13), 6264–6268.
Ha, T., Glass, J., Enderle, T., Chemla, D. S., and Weiss, S. (1998). Hindered rotational diffusion and rotational jumps of single molecules. Phys. Rev. Lett. 80(10), 2093–2096.
Henderson, R. (1990). A problem with the likelihood ratio test for a change-point hazard rate model. Biometrika 77(4), 835–843.
Hinze, G., and Basché, T. (2010). Statistical analysis of time resolved single molecule fluorescence data without time binning. J. Chem. Phys. 132(4), 044509.
McKinney, S. A., Joo, C., and Ha, T. (2006). Analysis of single-molecule FRET trajectories using hidden Markov modeling. Biophys. J. 91(5), 1941–1951.
Michalet, X., and Weiss, S. (2002). Single-molecule spectroscopy and microscopy. C.R. Phys. 3(5), 619–644.
Noé, M. (1972). The calculation of distributions of two-sided Kolmogorov–Smirnov type statistics. Ann. Math. Stat. 43(1), 58–64.
Owen, A. B. (1995). Nonparametric likelihood confidence bands for a distribution function. J. Am. Stat. Assoc. 90, 516–521.
Quinlan, M. E., Forkey, J. N., and Goldman, Y. E. (2005). Orientation of the myosin light chain region by single molecule total internal reflection fluorescence polarization microscopy. Biophys. J. 89(2), 1132–1142.
Rosenberg, S. A., Quinlan, M. E., Forkey, J. N., and Goldman, Y. E. (2005). Rotational motions of macromolecules by single-molecule fluorescence microscopy. Acc. Chem. Res. 38, 583–593.
Sase, I., Miyata, H., Ishiwata, S., and Kinosita, K. (1997). Axial rotation of sliding actin filaments revealed by single-fluorophore imaging. Proc. Natl. Acad. Sci. USA 94(11), 5646–5650.
Talaga, D. S. (2009). Information-theoretical analysis of time-correlated single-photon counting measurements of single molecules. J. Phys. Chem. A 113(17), 5251–5263.
Watkins, L. P., and Yang, H. (2005). Detection of intensity change points in time-resolved single-molecule measurements. J. Phys. Chem. B 109(1), 617–628.
Watkins, L. P., and Yang, H. (2006). Quantitative single-molecule conformational distributions: A case study with poly-(L-proline). J. Phys. Chem. A 110(15), 5191–5203.
Weiss, S. (1999). Fluorescence spectroscopy of single biomolecules. Science 283(5408), 1676–1683.
Xu, C., Kim, H., Hayden, C., and Yang, H. (2008). Joint statistical analysis of multichannel time series from single quantum dot-(Cy5)n constructs. J. Phys. Chem. B 112(19), 5917–5923.
Yang, H., and Xie, X. S. (2002). Probing single-molecule dynamics photon by photon. J. Chem. Phys. 117(24), 10965–10979.
CHAPTER SIXTEEN

Inferring Mechanisms from Dose–Response Curves

Carson C. Chow,* Karen M. Ong,* Edward J. Dougherty,† and S. Stoney Simons Jr.†

Contents
1. Introduction 466
2. General Theory 467
   2.1. Inhibitors 471
3. Application of Model to Data 472
   3.1. Inferring cofactor mechanisms by direct fitting to data 473
   3.2. Inferring mechanisms using graphical analysis 475
4. Discussion 479
Acknowledgments 482
References 482
Abstract

The steady state dose–response curve of ligand-mediated gene induction usually appears to precisely follow a first-order Hill equation (Hill coefficient equal to 1). Additionally, various cofactors/reagents can affect both the potency and the maximum activity of gene induction in a gene-specific manner. Recently, we have developed a general theory for which an unspecified sequence of steps or reactions yields a first-order Hill dose–response curve (FHDC) for plots of the final product versus initial agonist concentration. The theory requires only that individual reactions “dissociate” from the downstream reactions leading to the final product, which implies that intermediate complexes are weakly bound or exist only transiently. We show how the theory can be utilized to make predictions of previously unidentified mechanisms and the site of action of cofactors/reagents. The theory is general and can be applied to any biochemical reaction that has a FHDC.
* Laboratory of Biological Modeling, NIDDK/CEB, National Institutes of Health, Bethesda, Maryland, USA
† Steroid Hormones Section, NIDDK/CEB, National Institutes of Health, Bethesda, Maryland, USA

Methods in Enzymology, Volume 487
ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87016-3
Carson C. Chow et al.
1. Introduction
Ligand-regulated gene transcription is ubiquitous in biological systems. The dose–response curve of the amount of final gene product expressed versus the amount of ligand present is of crucial importance for development, differentiation, and homeostasis. In many cases, the dose–response curve in gene induction obeys a sigmoidal curve but not all sigmoidal curves have the same shape (Goldbeter and Koshland, 1981). For example, a dose– response curve obeying a first-order Hill equation or function (Hill coefficient equal to 1), goes from 10% to 90% of maximum activity over an 81-fold change in ligand concentration whereas the change is only ninefold in a second-order Hill function, which thus has a different shape (Fig. 16.1). (A first-order Hill function is sometimes called a Michaelis–Menten function.) The shape and position of a first-order Hill dose–response curve (FHDC) is specified by the potency (i.e., concentration required for 50% of maximal response, or EC50) and maximum activity (Amax). These two parameters completely describe the expression of the regulated gene in response to ligand concentration.
[Figure 16.1 appears here: computer-generated dose–response curves for Hill coefficients n = 0.5, 1, and 2, plotting luciferase activity as percent of maximal response with ligand (y-axis) against ligand concentration from 0.001 to 100 (x-axis).]
Figure 16.1 Shapes of different Hill plots. Computer-generated dose–response curves are shown with Hill coefficients “n” of 0.5, 1, and 2. The dashed lines show 10% and 90% of full activity, which requires a change in ligand concentration of 6561-fold for n = 0.5, 81-fold for n = 1, and 9-fold for n = 2.
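The fold changes in this caption follow directly from the Hill equation: setting y = x^n/(EC50^n + x^n) equal to 0.1 and 0.9 and dividing the two solutions for x gives a 10–90% dynamic range of 81^(1/n). A quick numerical check (function names are ours):

```python
def hill(x, ec50=1.0, n=1.0, amax=1.0):
    """Hill dose-response curve: amax * x^n / (ec50^n + x^n)."""
    return amax * x ** n / (ec50 ** n + x ** n)

def dynamic_range(n):
    """Fold change in ligand concentration between 10% and 90% of maximum."""
    return 81.0 ** (1.0 / n)

# Matches the caption: 6561 for n = 0.5, 81 for n = 1, 9 for n = 2.
print(dynamic_range(0.5), dynamic_range(1.0), dynamic_range(2.0))
```

For n = 2, for example, hill(3.0, ec50=1.0, n=2) gives 0.9 and hill(1/3, ec50=1.0, n=2) gives 0.1, a ninefold concentration range.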
Mechanisms From Dose–Response Curves
The addition of various cofactors can shift the EC50 and Amax, yet preserve the shape of the dose–response curve. These properties put strong constraints on the mechanisms of gene induction and raise two questions: how can a FHDC arise from a multistep reaction sequence, and how do cofactors modify potency? To address these questions, we recently developed a general theoretical framework for dose–response curves of biochemical reactions and showed that it is possible for an arbitrarily long sequence of complex-forming reactions to yield FHDCs, provided that a stringent but biologically achievable set of conditions is satisfied. The theory in turn provides a means to make previously unobtainable predictions about the mechanisms and site of action of cofactors that influence the dose–response curve. The FHDC also allows standard methods of enzyme kinetics to be modified for the analysis of FHDCs of arbitrarily long biochemical reaction sequences at steady state. Although mathematical models have been extensively developed for enzymes, receptor binding, trafficking, and signaling, missing information about downstream steps (such as the phosphorylated proteins and final cellular response) has previously limited mathematical development in this area (Lauffenburger and Linderman, 1993). In contrast, our theory is applicable even when only partial information is available because the constraints of a first-order Hill function and the mechanism of factors permit modeling regardless of whether their position or order in a given cascade of steps is known. The theory also avoids the “explosion of parameters” that usually confounds the search for mathematical models by “telescoping” the unknown intermediate steps to produce a simplified analytical equation with a small set of measurable parameters.
2. General Theory

The classical explanation for a Hill coefficient of one in steroid-induced gene expression has been that steroid binding to receptor is the rate-limiting step (Baxter and Tomkins, 1971). Consider the reaction R + S ⇌ RS → P, where R is the steroid receptor, S is the steroid, and P is the final protein product. If the reactions obey mass action kinetics, and the steroid–receptor binding reaction is fast compared to the formation of the product or to the time of product measurement, we can assume that it reaches equilibrium or steady state so that [RS] = q[R][S], where square brackets indicate concentration and q is the affinity or association constant. By mass conservation, [R] + [RS] = RT, where RT is the total receptor concentration. Combining the steady state and mass conservation equations results in [R] + q[R][S] = RT. Solving for [R] and substituting back into the steady state equation gives:
[RS] = qRT[S] / (1 + q[S])   (16.1)
and the measured protein product is assumed to be proportional to [RS]. In this form, the maximal activity (Amax) is RT and the effective concentration for 50% of maximum activity (EC50) is equal to the inverse of the association constant (i.e., the dissociation constant) 1/q. Hence, neither the EC50 nor the Amax can be influenced by downstream cofactors, which is contradicted by experiments (Kim et al., 2006). However, if downstream reactions are included, a FHDC for the final product with respect to the steroid concentration is not guaranteed. For example, steroid–receptor complex binding to DNA initiates a series of transcriptional and translational processes. Even two steps of such a sequence, for example, R + S ⇌ RS, RS + D ⇌ RSD, where the second reaction represents binding to DNA, would not generally lead to a FHDC (Ong et al., 2010). Strickland and Loeb provided a possible solution by examining a two-step process for a system dependent on a secondary mediator such as cyclic AMP (Loeb and Strickland, 1987; Strickland and Loeb, 1981). They showed that if a step following steroid–receptor binding forms an intermediate product Y, where Y itself is also a first-order Hill function of [RS], then the result is that [Y] is a first-order Hill function of [S] and both Amax and EC50 are modifiable by the intermediate process. Hence, if the steady state concentrations of a sequence of reactions are such that the concentration of the product of each reaction is a FHDC with respect to the prior product, then the concentration of the final product will have a FHDC with respect to any prior product (Ong et al., 2010). We utilized this observation and developed a general theoretical framework of ligand-induced gene expression that could quantitatively model the dose–response curve of the concentration of the expressed protein product as a function of the concentration of applied ligand in the presence of various cofactors or reagents.
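Both Eq. (16.1) and the Strickland–Loeb composition property can be checked numerically against the defining relations [RS] = q[R][S] and [R] + [RS] = RT; the parameter values below are arbitrary, and the function names are ours.

```python
def rs_steady_state(q, r_total, s):
    """[RS] from Eq. (16.1): q * RT * [S] / (1 + q * [S])."""
    return q * r_total * s / (1.0 + q * s)

q, r_total, s = 2.5, 10.0, 0.8           # arbitrary illustrative values
rs = rs_steady_state(q, r_total, s)
r_free = r_total - rs                     # mass conservation: [R] = RT - [RS]
assert abs(rs - q * r_free * s) < 1e-12   # steady state: [RS] = q[R][S]
# Half-maximal occupancy occurs at [S] = 1/q, i.e., EC50 = 1/q:
assert abs(rs_steady_state(q, r_total, 1.0 / q) - r_total / 2.0) < 1e-12
```

A downstream first-order Hill step applied to [RS] produces another first-order Hill function of [S], which is the observation the theory builds on.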
We modeled gene induction as a series of steps or reactions that obey mass action. These reactions include all the processes thought to be involved in gene expression, starting at steroid binding to receptor, ending in the translation of mRNA into protein product, and including, for example, processes such as transcription factor binding to DNA, phosphorylation, methylation, ubiquitination, the uncoiling or unwinding of DNA, the release of paused polymerase, and mRNA processing. We presumed that enough time has elapsed between the addition of the ligand (and any other cofactors/reagents) and the measurement of the final product so that the reactions reach a state of steady state or near steady state. The problem then reduces to solving the system of equations given by steady state and mass conservation equations for all the reactions involved.
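As a concrete illustration of solving such a system, the sketch below computes the steady state of a short chain Yi−1 + Xi ⇌ Yi from the steady-state and mass-conservation equations by damped fixed-point iteration. The chain length, parameter values, and iteration scheme are our illustrative choices, not part of the authors' method.

```python
def chain_steady_state(q, x_tot, y0, iters=5000, damp=0.5):
    """Solve [Y_i] = q_i [X_i] [Y_{i-1}] with mass conservation
    [X_i] + sum_{k >= i} [Y_k] = X_i^T, for fixed ligand [Y_0] = y0.
    q and x_tot have length n; returns the list [Y_0], ..., [Y_n]."""
    n = len(q)
    y = [y0] + [0.0] * n
    for _ in range(iters):
        new = y[:]
        for i in range(1, n + 1):
            x_free = max(x_tot[i - 1] - sum(y[i:]), 0.0)  # free activator i
            new[i] = q[i - 1] * x_free * new[i - 1]
        y = [damp * a + (1.0 - damp) * b for a, b in zip(new, y)]
        y[0] = y0  # the ligand concentration is held fixed
    return y

# Two-step chain with q = (1, 1), X^T = (1, 1), [Y0] = 1.
y = chain_steady_state([1.0, 1.0], [1.0, 1.0], 1.0)
```

For these values the iteration converges to [Y1] ≈ 0.366 and [Y2] ≈ 0.268, and both defining equations are satisfied to high precision; stiffer parameters or longer chains may need more iterations or stronger damping.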
Our theory is applied to a sequence of n binary reactions of the form Yi−1 + Xi ⇌ Yi, where i = 1, 2, . . ., n is an index for a reaction. We identify Y0 as the steroid, X1 as the receptor, and Y1 as the receptor–steroid complex. We call the subsequent X variables activating factors or activators and the Y variables products. The activators correspond to added or intrinsic agents, including coactivators, comodulators, or small molecules that facilitate an individual reaction locally. We generalize to factors that inhibit local reactions later. The products correspond to states or complexes in the transcription–translation process that are formed or induced by the cofactors and prior products. The reactions need not be reversible; they only should reach steady state. For example, reactions could also have the form Yi−1 + Xi → Yi, Yi → ∗, where the second step indicates decay or inactivation without addition of a cofactor. Under steady state conditions governed by mass action principles, the concentrations obey [Yi] = qi[Xi][Yi−1], and mass conservation implies that [Xi] + Σ_{k=i}^{n} [Yk] = Xi^T for i = 1, 2, . . ., n, where the n association constants qi and the n total concentrations Xi^T are free parameters. The dose–response curve is given by solving the concentration and mass conservation equations simultaneously to obtain [Yn] as a function of [Y0]. In general, obtaining the functional relationship will require the solution of a (2n − 1)-degree polynomial that will not yield a FHDC. In order to retain a Hill coefficient of one, each reaction must “disentangle” from the other reactions so that the dose–response curve between any two consecutive products has a Hill coefficient of one and thus any downstream product is always a first-order Hill function of any upstream product. A mechanism that could lead to this condition is that cofactors only act transiently but produce a lasting response.
Transient binding leads to an effective reduction in the average concentration of products. Reactions could then have the form Yi−1 + Xi → Yi−1 + Yi. For example, the steroid–receptor complex could bind transiently to DNA but affect the DNA state (e.g., methylation, ubiquitination, uncoiling, untwisting, etc.), facilitate the binding of another cofactor, or alter the mRNA state during translation (Kim et al., 2006). This could also be a multistep process as in enzyme–substrate reactions (Fromm, 1975; Segel, 1993). The fact that a transient binding event can lead to a delayed action implies that the effect of a factor on the dose–response curve may occur downstream from where the factor actually binds. Considerable experimental evidence has been recently advanced in support of transient binding (dubbed “hit-and-run”) of glucocorticoid receptor (GR) to endogenous genes (Nagaich et al., 2004; Stavreva et al., 2004). Finally, concentration-limiting and transient-binding or hit-and-run mechanisms could both be present at various stages of the reaction sequence. The result is that any given product is a first-order Hill function of any upstream product with the form
[Yi] = vi[Yi−1] / (1 + wi[Yi−1])   (16.2)
where vi and wi are defined in Table 16.1 (Ong et al., 2010). There is, however, one step that we call the concentration limiting step (CLS), where products after this step obey the form [Yi] = vi[Yi−1]. In other words, the CLS is the step in which the concentration of the product is much smaller than that for any other reaction. Therefore, the CLS is the equilibrium analogue of the kinetic rate-limiting step, which is that step for which the forward reaction rate is the slowest. For all steps after the CLS, the free concentration of any activator is equal to its total concentration. The location of the CLS depends on the parameters and the experimental situation. It also need not exist for every reaction sequence, in which case all reactions are considered to act before the CLS. Each of the individual first-order Hill functions (16.2) are substituted into each other sequentially to obtain the formula for the concentration of the final protein product in terms of the concentration of any previous product [Yb−1]:

[P] = G[YCLS] = G Vb^cls [Yb−1] / (1 + Wb^cls [Yb−1])   (16.3)

where G = Σ_{k=cls}^{n} ak V_{cls+1}^{k}, Vb^m = Π_{i=b}^{m} vi, and Wb^m = Σ_{i=b}^{m} wi Π_{j=b}^{i−1} vj, with the convention Π_{i=a}^{n} xi = xa xa+1 ··· xn and Π_{i=a}^{n} xi = 1 if n < a. Note that Wb^m = Wb^cls for m ≥ cls. As is evident from comparison with Eq. (16.1), Amax = Vb^m/Wb^m and EC50 = 1/Wb^m. For steps m ≥ cls, the denominators of all products [Ym] are the same. Thus, the
Table 16.1 Values of v_i and w_i for an activator, with or without an inhibitor, at position i before, at, or after the CLS

| Position | Activator | Activator with inhibitor |
| --- | --- | --- |
| Before CLS | $v_i = q_i X_i^T$; $w_i = q_i e_i$ | $v_i = \dfrac{q_i X_i^T (1 + \alpha_i \beta_i q'_i [I_i])}{1 + \gamma_i q'_i [I_i]}$; $w_i = \dfrac{q_i (e_i + \alpha_i q'_i [I_i])}{1 + \gamma_i q'_i [I_i]}$ |
| At CLS | $v_i$ same as before CLS; $w_i = q_i \sum_{k=i}^{n} e_k \prod_{j=i+1}^{k} v_j$ | $v_i$ same as before CLS; $w_i = \dfrac{q_i \left( \sum_{k=i}^{n} e_k \prod_{j=i+1}^{k} v_j + \alpha_i q'_i [I_i] \right)}{1 + \gamma_i q'_i [I_i]}$ |
| After CLS | $v_i$ same as before CLS; $w_i = 0$ | $v_i$ same as before CLS; $w_i = 0$ |
471
Mechanisms From Dose–Response Curves
sum of any number of products [Y_CLS] to [Y_n] will preserve the first-order Hill form. This implies that all reaction sequences with the same $W_b^m$ can be considered to be parallel pathways. From a Boolean logic perspective, the cofactors prior to the CLS are equivalent to an "AND" operation, in the sense that all cofactors are necessary for gene induction, and cofactors summed in parallel pathways after the CLS are equivalent to an "OR" operation, in the sense that any of the cofactors is sufficient to produce the final product. The parameters $V_b^m$ and $W_b^m$ are functions of the parameters of all the reactions leading up to the final product. Hence, a cofactor influencing any reaction in the entire sequence can affect both the Amax and the EC50. In total there are 4n − cls + 1 parameters, namely the a, e, q, and X^T parameters, and the choice of these parameters specifies a model of the data. In general, all of these parameters, and even the number of steps, will not be known. However, the theory is still useful because the number of parameters necessary to specify the model can be reduced by exploiting the property of the first-order Hill function that allows any sequence of steps to collapse or "telescope" into a single first-order function with an effective Amax and EC50 (Ong et al., 2010). Each effective reaction can itself take part in another reaction sequence, and in this way arbitrarily complex reaction sequences that maintain a FHDC can be constructed. This telescoping property will be exploited below when comparing the model predictions to the data. In addition, the reaction sequence has a "modular" structure in that reactions, or even a sequence of reactions, can be inserted or deleted without affecting the dose–response curve shape. Thus, different genes could possibly mix and match different pieces of the reaction sequence and each have a first-order response, but with a different Amax and EC50.
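The telescoping property can be checked numerically. The sketch below (plain Python; the function names are ours, not from the chapter) composes a chain of steps of the form of Eq. (16.2) and confirms that the result is again a first-order Hill function with $V = \prod_i v_i$ and $W = \sum_i w_i \prod_{j<i} v_j$:

```python
def hill_step(v, w):
    """One reaction step of Eq. (16.2): y -> v*y / (1 + w*y)."""
    return lambda y: v * y / (1.0 + w * y)

def compose(steps, y0):
    """Feed an input concentration y0 through a sequence of (v_i, w_i) steps."""
    y = y0
    for v, w in steps:
        y = hill_step(v, w)(y)
    return y

def telescoped(steps):
    """Effective parameters: V = prod(v_i), W = sum_i w_i * prod_{j<i} v_j,
    so that the whole chain collapses to y0 -> V*y0 / (1 + W*y0)."""
    V, W = 1.0, 0.0
    for v, w in steps:
        W += w * V        # w_i times the product of all upstream v_j
        V *= v
    return V, W

# Three hypothetical steps; for the collapsed chain, Amax = V/W and EC50 = 1/W.
steps = [(2.0, 0.5), (3.0, 0.2), (0.5, 1.0)]
V, W = telescoped(steps)
amax, ec50 = V / W, 1.0 / W
```

Because the composed chain is itself of the form of Eq. (16.2), it can in turn be inserted as a single effective step into a longer sequence, which is the modularity described in the text.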
2.1. Inhibitors

One property of the FHDC-preserving sequence of reactions is that any type of reaction whose output product has a FHDC with respect to its input product can be inserted into any position of the sequence. Thus, the results of enzyme kinetics theory can be adapted and applied (Fromm, 1975; Segel, 1993). For example, an inhibitor acting at step i leading to product [Y_i] is represented by the following reaction scheme:

X_i + I_i ⇌ X'_i (affinity γq'_i)
Y_{i−1} + X_i ⇌ Y_i* (affinity q_i), with Y_i* → Y_i
Y_i* + I_i ⇌ Y'_i (affinity αq'_i)
Y_{i−1} + X'_i ⇌ Y'_i (affinity (α/γ)q_i), with Y'_i → Y_i at relative efficiency β
where 0 ≤ α ≤ 1, 0 ≤ γ ≤ 1, and 0 ≤ β. The case of α = 0 (i.e., reactions Y_i* + I_i ⇌ Y'_i and Y_{i−1} + X'_i ⇌ Y'_i are missing) is called competitive inhibition. The case of γ = 0 (i.e., reactions X_i + I_i ⇌ X'_i and Y_{i−1} + X'_i ⇌ Y'_i are missing) is called uncompetitive inhibition, and the case of α = γ is called noncompetitive inhibition. We refer to the case where α and γ are both nonzero as mixed inhibition. Subclasses of each type of inhibition occur depending upon the value of β. Those situations in which β = 0 involve linear inhibition (e.g., linear competitive); when β > 0, it is called partial inhibition (e.g., partial competitive). Solving this system, and assuming that the two products [Y_i*] and [Y'_i] are summed via [Y_i] = [Y_i*] + β[Y'_i], gives a final expression of the form of Eq. (16.2) but with the different parameters listed in Table 16.1. One should note that although I is called an inhibitor, its actions need not be inhibitory or repressive on the final product. For example, if β > 0, I can be activating if it diverts the output from a lower yield product to a higher yield one. In fact, a partial uncompetitive inhibitor is a special case of an activator acting after the CLS. This is because the reaction Y_i* + I_i ⇌ Y'_i is an activation equation and [Y'_i] is proportional to [Y_i*]. This allows the two products to be directly summed as activators after the CLS.
3. Application of Model to Data

The telescoping property of reaction sequences preserving the FHDC renders application of the theory to experimental data tractable, because extraneous and unknown cofactors can be collapsed into a small number of unknown parameters that can be determined experimentally. The result is that there are at most a small number of possible models that can explain a given data set. The models can also be expanded to account for more data. For example, suppose that we are interested in the effect of an activator X_i or inhibitor I_i acting at step i on the induction of the final product. There are only three different model forms that Eq. (16.3) can take, depending on whether the step appears before, at, or after the CLS. The general form of the final product with the activator (X) and inhibitor (I) is mathematically expressed as
$$[P] = \frac{\left( D_0 (1 + q'_i \gamma [I_i]) + D_1 X_i^T (1 + q'_i \alpha\beta [I_i]) \right) [Y_0]}{1 + q'_i \gamma [I_i] + \left( D_2 (1 + q'_i \gamma [I_i]) + D_3 (D_4 + q'_i \alpha [I_i]) + D_5 X_i^T (1 + q'_i \alpha\beta [I_i]) \right) [Y_0]} \qquad (16.4)$$

where the D parameters are functions of the parameters of the "hidden" reactions (Ong et al., 2010). Although they can be derived exactly (Ong et al., 2010) in terms of the parameters of the reactions of the entire sequence, they can be treated as effective parameters that can be estimated from the data. Equation (16.4) shows how the addition of an activator or inhibitor will directly influence the Amax and EC50 of the dose–response curve.
We can then infer mechanisms using Eq. (16.4) by observing how a given cofactor influences Amax and EC50. With the addition of a cofactor, Amax and EC50 can each either increase, decrease, or not change, leading to eight different possibilities, not including the trivial case of both not changing. The possible mechanisms leading to these outcomes are listed in Table 16.2 (Ong et al., 2010).
3.1. Inferring cofactor mechanisms by direct fitting to data

A given cofactor could be an activator or one of several types of inhibitor and act before, at, or after the CLS. The theory can generate an analytical model equation for the dose–response curve for protein product with respect to steroid concentration for each of these possible mechanisms. This is also not limited to the action of single cofactors. Analytical model equations can be generated for any number of cofactors. The mechanism of a cofactor can then be inferred by determining which model best fits the data. The most direct approach, which we describe in this section, is to fit the many possible models directly to the data. In the next section, we show how graphical methods adapted from enzyme kinetics theory can be utilized to discern the mechanism. Here, we use the example of the combined action of comodulator Ubc9 and glucocorticoid receptor (GR) on the dose–response curve of dexamethasone. As previously reported (Cho et al., 2005; Kaul et al., 2002; Kim et al., 2006; Szapary et al., 2008) and displayed in Fig. 16.2, the effects of increased Ubc9 depend critically on GR concentration and are independent of its sumoylation activity (Cho et al., 2005; Kaul et al., 2002). With low levels of GR, Ubc9 robustly increases Amax while perturbing EC50 marginally. At higher GR concentrations, there is less proportional increase in Amax and a much greater decrease in EC50. We can determine possible mechanisms by examining the models that include the combined effect of steroid, receptor, and Ubc9. The constraints for the model are that steroid and receptor are in the first reaction at high GR concentrations and that exogenous Ubc9 increases the Amax and decreases the EC50 (see Fig. 16.3). Matching this behavior to one of the eight possibilities in Table 16.2 shows that this combination of responses must be due to an activator acting before or after the CLS.
We, therefore, conclude that Ubc9 is an activator that acts either before or after the CLS. (Note that an activator after the CLS is mathematically equivalent to a partial uncompetitive inhibitor.) The general model for two activators acting before or after the CLS can be written in the form (Ong et al., 2010)

$$A = \frac{a K_1 K_2 D^T R^T (K_3 + K_4 U^T) S^T}{1 + K_1 \left( 1 + K_2 R^T (1 + K_3 + K_4 U^T) \right) S^T} \qquad (16.5)$$
Table 16.2 Effect of cofactors on Amax and EC50

| Amax | EC50 | Cofactors and context | Mechanism and position |
| --- | --- | --- | --- |
| Decrease | Decrease | CBP (with GR) (Szapary et al., 1999), NCoR (with PR) (Song et al., 2001) | L and U or A after CLS |
| Decrease | Increase | GMEB2 (Kaul et al., 2000), NCoR (with GR), CPT, H8, DRB (with high GR and Ubc9) (Kim et al., 2006) | C or L before or at CLS |
| Decrease | No change | DRB, H8 (with high GR), VPA (with high GR and Ubc9) (Kim et al., 2006) | P and U before CLS or A after CLS |
| Increase | Decrease | TIF2 (with GR) (Szapary et al., 1996, 1999), Ubc9 (with high GR) (Kaul et al., 2002; Kim et al., 2006) | A before or after CLS |
| Increase | Increase | Not observed but predicted | C after CLS |
| Increase | No change | TSA, VPA, Ubc9 (with low GR) (Kaul et al., 2002; Kim et al., 2006) | A at or after CLS |
| No change | Decrease | SRC-1 (with GR) (Szapary et al., 1999) | A after CLS |
| No change | Increase | TIF2 siRNA (with GR) (Luo and Simons, 2009) | L or C anywhere |

Note: Abbreviations: A, activator; I, inhibitor; L, linear inhibitor; P, partial inhibitor; C, competitive inhibitor; U, uncompetitive inhibitor.
475
Mechanisms From Dose–Response Curves
[Figure 16.2: two panels, "0.1 ng GR ± Ubc9" and "10 ng GR ± Ubc9", plotting luciferase activity versus Dex (nM) without Ubc9 (None) or with 135 ng Ubc9.]

Figure 16.2 Varying GR concentration alters the effects of Ubc9 on the Amax and EC50. CV-1 cells were cotransfected with GREtkLUC plus the indicated amounts of GR ± Ubc9 plasmids, and the averages of triplicate luciferase activities were plotted (error bars are S.D. for triplicates). The best-fit FHDC curve was determined by Kaleidagraph (R² > 0.96). The vertical line indicates the EC50 for each dose–response curve.
where A represents the product (luciferase activity), U^T represents the total Ubc9 concentration, R^T represents the total steroid–receptor concentration, and [S] is the free steroid concentration, which we assume is approximated by the total steroid concentration S^T. We used a Bayesian Markov Chain Monte Carlo method to obtain the free parameter values, although any fitting algorithm could be used (details are given in the Appendix). The posterior statistics of the parameters are in Table 16.3. The fit of the predicted data (Fig. 16.3) is surprisingly good considering that we assumed a linear relationship between plasmids added to the culture media and their expressed protein concentrations. Parameters K_3 and K_4U^T reflect the amount of product without and with applied Ubc9, respectively. The model fit (see Appendix and Table 16.3) finds that K_3/(K_4U^T) is ∼1. This ratio has different implications depending on where Ubc9 acts. If Ubc9 acts before the CLS, then this implies that the ratio of endogenous to exogenous Ubc9 in the cell is near 1. However, qRT-PCR measurements give a ratio that differs from 1 by ∼100-fold (unpublished results), which rules out this possibility. Thus, we conclude that Ubc9 is an activator that acts after the CLS, just as was previously proposed (Kaul et al., 2002).
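Because Eq. (16.5) is first order in S^T, the implied Amax and EC50 can be read off analytically. The sketch below (plain Python; the parameter values are the posterior means of Table 16.3, and the plasmid-to-nM conversions are those given in the Appendix; the function names are ours) computes both quantities and confirms the high-GR behavior of Fig. 16.2, where added Ubc9 raises Amax and lowers EC50:

```python
def activity(ST, RT, UT, K1, K2, K3, K4, a, DT):
    """Eq. (16.5): luciferase activity for steroid ST, receptor RT, Ubc9 UT (nM)."""
    num = a * K1 * K2 * DT * RT * (K3 + K4 * UT) * ST
    den = 1.0 + K1 * (1.0 + K2 * RT * (1.0 + K3 + K4 * UT)) * ST
    return num / den

def hill_params(RT, UT, K1, K2, K3, K4, a, DT):
    """Amax and EC50 of Eq. (16.5) viewed as a first-order Hill function of ST."""
    g = 1.0 + K2 * RT * (1.0 + K3 + K4 * UT)   # ST-independent denominator factor
    return a * K2 * DT * RT * (K3 + K4 * UT) / g, 1.0 / (K1 * g)

# Posterior means from Table 16.3; 10 ng GR and 135 ng Ubc9 plasmid expressed in nM.
theta = dict(K1=0.040, K2=1430.32, K3=0.376871, K4=4.45226, a=160449.0, DT=0.047)
amax0, ec50_0 = hill_params(RT=0.00675, UT=0.0, **theta)
amax1, ec50_1 = hill_params(RT=0.00675, UT=0.120053, **theta)
# Expect amax1 > amax0 and ec50_1 < ec50_0.
```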
3.2. Inferring mechanisms using graphical analysis

Graphical methods have been extensively used in classical enzyme kinetics to extract parameters and infer mechanisms (Segel, 1993). Here we show how plots of Amax/EC50, as a function of the concentration of a cofactor, can be used to predict the mechanism of a cofactor.
[Figure 16.3: four panels of luciferase activity versus Dex (nM) for 0.1, 2, 10, and 25 ng GR plasmid, each with no Ubc9, 135 ng Ubc9, or 175 ng Ubc9.]

Figure 16.3 Dose–response curves with different amounts of GR and Ubc9. The induction by Dex of transiently transfected GREtkLUC reporter and different amounts of GR (for A, B, C, and D, respectively: 0.1, 2, 10, and 25 ng GR plasmid) ± Ubc9 plasmid in CV-1 cells was determined as in Fig. 16.2. Solid circle = 0 ng Ubc9, open square = 135 ng Ubc9, solid triangle = 175 ng Ubc9. Curves were determined by best fit to mathematical models by a Bayesian Markov Chain Monte Carlo method.
Table 16.3 Posterior statistics of parameters for Ubc9 fits

| Parameter | Mean | Median | Standard deviation | Interquartile range | Max likelihood value |
| --- | --- | --- | --- | --- | --- |
| K1 | 0.040 | 0.039 | 0.0012 | 0.00076 | 0.041 |
| K2 | 1430.32 | 1484.67 | 60.47 | 75.62 | 1435.08 |
| K3 | 0.376871 | 0.38 | 0.00730698 | 0.0071 | 0.37 |
| K4 | 4.45226 | 4.45 | 0.389866 | 0.5133 | 4.46 |
| a | 160,449 | 183,930 | 35,508.1 | 64,995.3 | 150,807 |
| DT | 0.047 | 0.041 | 0.013 | 0.025 | 0.054 |
From Eq. (16.4) we find that

$$\frac{A_{max}}{EC_{50}} = \frac{D_0 (1 + q' \gamma [I]) + D_1 X^T (1 + q' \alpha\beta [I])}{1 + q' \gamma [I]} \qquad (16.6)$$

where all the free parameters are nonnegative and D_0 = 0 if the factor acts at or before the CLS. The cofactor is either an activator or an inhibitor, with concentration X^T or [I], respectively. If the cofactor is an inhibitor, then it can be competitive, uncompetitive, noncompetitive, or mixed, and in each case it can further be partial or linear. The plot of Amax/EC50 versus cofactor can be either linear with a positive slope or nonlinear. Figure 16.4 shows a decision tree that can be followed to determine the mechanism and site of action of the cofactor based on the shape of the plot. In some cases, other plots are needed to refine the decision, such as EC50/Amax versus cofactor and 1/((Amax/EC50) − (1/G)) versus cofactor, where G is the constant that the nonlinear curve of EC50/Amax versus cofactor approaches. An example of the utility of these plots, which offer an alternative approach to the above direct fitting to data, is shown in Fig. 16.5 for Ubc9. The interpretation of the linear plot of Amax/EC50 versus different concentrations of Ubc9 at a constant amount of cotransfected GR depends upon the position of the true y-intercept. As seen from the Western blot of cell cytosols before and after transfection with Ubc9 plasmid (see insert), 150 ng of transfected Ubc9 plasmid gives at least 10-fold more Ubc9 protein than the endogenous level of Ubc9. This means that the true zero for this plot (i.e., where there is no Ubc9 in the cells) is at the dashed vertical line or
[Figure 16.4: decision tree. A linear plot of Amax/EC50 versus cofactor with y-intercept > 0 indicates an activator after the CLS (A > CLS) or a partial uncompetitive inhibitor at or before the CLS (PU ≤ CLS); a linear plot with y-intercept = 0 indicates A ≤ CLS. If the plot is nonlinear, EC50/Amax versus cofactor is examined: if linear, (C or L) ≤ CLS; if nonlinear and approaching a constant G, the plot of 1/((Amax/EC50) − 1/G) versus cofactor distinguishes C > CLS (linear) from A > CLS (nonlinear).]

Figure 16.4 Decision tree for determining factor mechanism from plots of Amax and EC50. See text for details.
[Figure 16.5: plot of Amax/(1000·EC50) versus Ubc9 plasmid (−50 to 200 ng), with an insert showing Western blots for authentic Ubc9, Flag, and Flag/Ubc9.]

Figure 16.5 Graphical analysis of Ubc9 activity with GR. The Amax and EC50 were determined from dose–response curves for GR (10 ng) induction of cotransfected reporter (GREtkLUC) in the presence of the indicated amounts of Ubc9 plasmid as in Fig. 16.2. The ratio of Amax/EC50 was plotted versus Ubc9 to give a computer-generated straight line (R² = 0.99) using Kaleidagraph. The insert shows the Western blot detection by anti-Ubc9 antibody of Ubc9 present in parallel treated cells that had been transfected with 150 ng of Flag/Ubc9 plasmid or an equimolar amount of empty Flag vector. From these Western blot data, it is calculated that the endogenous Ubc9 is the equivalent of 15 ng of Flag/Ubc9 plasmid. The dashed vertical line therefore represents the theoretical point where there is no Ubc9 in the cells.
even closer to the labeled zero position. Therefore, the plot of Amax/EC50 intersects the true x = 0 line (dashed) at y > 0, in which case Ubc9 is identified as an activator acting after the CLS. This is the same conclusion that was reached using the above method. Another example concerns the modification of GR induction of the synthetic reporter GREtkLUC by varying concentrations of the coactivator TIF2 in U2OS cells (Fig. 16.6). The Amax and EC50 were determined by exact fits of the amount of induced luciferase activity above basal level to a FHDC for three steroid concentrations (Fig. 16.6A). The linear plot of Amax/EC50 versus different concentrations of TIF2 intersects the x-axis at −20 ng of TIF2 plasmid (Fig. 16.6B). Importantly, densitometric analysis of Western blots for TIF2 from identically treated cells indicates that the total TIF2 protein after transfection with 20 ng of TIF2 plasmid is 2.56 times the endogenous protein level, in which case the endogenous TIF2 is equivalent to 13 ng of TIF2 plasmid. The lower value of endogenous TIF2, as
[Figure 16.6: (A) luciferase activity above basal versus Dex (nM) for 0, 3, 6, 12, and 20 ng TIF2 plasmid, with an insert Western blot for TIF2; (B) Amax/(100·EC50) versus TIF2 plasmid (−20 to 20 ng).]

Figure 16.6 Graphical analysis of TIF2 coactivator activity with GR. (A) Dose–response curves for GR transactivation of GREtkLUC. The program Kaleidagraph was used to fit FHDC curves to the raw luciferase data for GR induction of transiently transfected GREtkLUC reporter with the indicated amounts of cotransfected TIF2 plasmid in U2OS cells (R² > 0.998). The Amax and EC50 values generated by the curve-fitting program were then used to plot Amax/(100·EC50) in (B). The straight line represents the computer-generated best fit (R² = 0.94) using Kaleidagraph. The insert in (B) shows the Western blots with anti-TIF2 antibody of lysates from cells that had been transfected under parallel conditions with 0 or 20 ng of TIF2 plasmid.
determined by Western blot, compared to the x-axis intercept in Fig. 16.6B leads us to conclude that the plot of Amax/EC50, adjusted for total TIF2 concentration that includes the endogenous contribution, would have a y-intercept value greater than zero. In this case (as seen in Fig. 16.4) TIF2 is acting as an activator after the CLS. This conclusion confirms the widely accepted view of how TIF2 acts. In addition, these assays for Ubc9 and TIF2 indicate when each factor is acting. Such information about when a factor functions is a unique outcome of our model. Current methods (e.g., ChIP assays) reveal when (and where on the genome) a factor binds but are silent regarding when the factor exerts its biological activity.
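The diagnostic plots behind Fig. 16.4 can be generated directly from Eq. (16.6). The sketch below (plain Python, with hypothetical parameter values of our choosing) evaluates Amax/EC50 for three scenarios and classifies the resulting curves with a crude second-difference linearity test, reproducing the first branches of the decision tree:

```python
def amax_over_ec50(XT, I, D0, D1, q, alpha, beta, gamma):
    """Eq. (16.6): Amax/EC50 for activator concentration XT and inhibitor [I]."""
    return (D0 * (1.0 + q * gamma * I)
            + D1 * XT * (1.0 + q * alpha * beta * I)) / (1.0 + q * gamma * I)

def is_linear(ys, tol=1e-8):
    """Vanishing second differences (points at equally spaced cofactor amounts)."""
    scale = max(abs(y) for y in ys) or 1.0
    return all(abs(ys[i + 1] - 2.0 * ys[i] + ys[i - 1]) < tol * scale
               for i in range(1, len(ys) - 1))

doses = [10.0 * i for i in range(6)]       # cofactor amounts, arbitrary units

# Cofactor is an activator ([I] = 0): ratio = D0 + D1*XT, linear;
# y-intercept D0 > 0 places it after the CLS, D0 = 0 at or before the CLS.
act_after = [amax_over_ec50(x, 0.0, 2.0, 0.3, 1.0, 0.0, 0.0, 0.0) for x in doses]
act_before = [amax_over_ec50(x, 0.0, 0.0, 0.3, 1.0, 0.0, 0.0, 0.0) for x in doses]

# Cofactor is a linear competitive inhibitor at/before the CLS (alpha = beta = 0,
# D0 = 0): Amax/EC50 is nonlinear in [I], but its reciprocal EC50/Amax is linear.
inhib = [amax_over_ec50(1.0, x, 0.0, 0.3, 0.2, 0.0, 0.0, 1.0) for x in doses]
```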
4. Discussion

We have described methods for inferring biological mechanisms for the actions of cofactors in gene-induction reactions based on a theoretical framework for the dose–response curve of gene induction by steroid receptors. Our theory does not require a detailed description of each participating
step and/or factor. Nevertheless, it correctly predicts the shape of the dose– response curve. One important result is that all steps before and after an arbitrarily chosen step will telescope down, thereby permitting exact solutions to the equations that can be fit directly to the experimental data. Our theory was constructed from first principles, starting with the experimental observation that the dose–response curve for steroid action has a Hill coefficient of one. It is strong support for the validity of our theory that, with such minimal initial experimental constraints, the model successfully yields FHDC curves and accounts for all of the different combinations of changing Amax and EC50 that we have observed experimentally (Ong et al., 2010). The theory also explains how steps well downstream of steroid binding can alter the Amax and EC50 (Kim et al., 2006; Tao et al., 2008) and provides theoretical corroboration for our previous hypothesis of Ubc9 acting downstream of both GR and a rate-limiting step (Kaul et al., 2002), which is further support for the validity of our theory. Furthermore, this theory permits the adaptation of the graphical methods of enzyme kinetics for studying steroid hormone action, as witnessed by the success with analyzing the different mechanisms by which Ubc9 and TIF2 each increase the Amax, and decrease the EC50, of GR-mediated transactivation. With these new methods, one can now determine mechanistic details of cofactors, such as when cofactor activity is expressed, that are otherwise inaccessible. A stringent test for the application of our theory is the invariance of an 81-fold change between 0.1 and 0.9 of maximal induction under all conditions (Goldbeter and Koshland, 1981). This does not hold for a general complex-forming sequence of reactions. Thus, any reaction scheme that is not described by our theory would be unlikely to produce the 81-fold change. 
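The 81-fold criterion follows from one line of algebra: for y = S^n/(EC50^n + S^n), the dose giving fraction f of maximum is S = EC50·(f/(1−f))^(1/n), so the fold change between 0.1 and 0.9 of maximal induction is (9/(1/9))^(1/n) = 81^(1/n), which equals 81 only for a Hill coefficient of one. A minimal check (plain Python; the function names are ours):

```python
def dose_for_fraction(f, ec50, n=1.0):
    """Dose giving fraction f of maximal response for y = S**n / (EC50**n + S**n)."""
    return ec50 * (f / (1.0 - f)) ** (1.0 / n)

def fold_change_10_to_90(ec50, n=1.0):
    """Fold change in dose between 10% and 90% of maximal induction."""
    return dose_for_fraction(0.9, ec50, n) / dose_for_fraction(0.1, ec50, n)
```

The independence from EC50 is the "under all conditions" part of the test: any FHDC, whatever its Amax and EC50, spans 10–90% of maximal induction over exactly an 81-fold range of steroid.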
Conversely, the observation of a dose–response curve that requires appreciably more or less than an 81-fold change in steroid to go from 0.1 to 0.9 of maximal induction (or repression) indicates the presence of at least one step that does not follow first-order Hill kinetics. Such observations appear to be more common among endogenous genes (He and Simons, 2007) and provide clues for the unique steroid-regulated behavior of these genes. Finally, our theory provides a cautionary tale for pure bottom-up modeling and blind high-throughput data gathering. Our theory shows that essentially all the reactions in the network determine the Amax and EC50 of the dose–response curve. Thus, measuring any incomplete subset of the reaction constants cannot predict the properties of the dose–response curve. However, the theory also shows that if the average concentration of downstream products is small, then the dose–response curve is controlled by only a small number of effective parameters that can be measured directly. The implication is that detailed information about biological mechanisms
can be inferred by combining a theoretical framework with a relatively small number of directed experiments.
Appendix. Model Fit to Data and Parameter Estimates for Ubc9 and Glucocorticoid Receptor

We fit the model for two activators, glucocorticoid receptor (R^T) and Ubc9 (U^T), plus steroid (S^T; dexamethasone), given by Eq. (16.5), to experimental data. The data set consisted of luciferase activity as a function of S^T, R^T, and U^T. In all, there were 60 data points, each taken in triplicate (Ong et al., 2010). The concentrations of S^T, R^T, and U^T in Eq. (16.5) are inputs expressed in units of nM. We then fit the model to the data to obtain the K parameters and a. D^T was an extraneous free parameter that was fixed to an arbitrary number. The proportionality constant a represents the luminosity per mole of the output protein luciferase combined with the proportionality constant relating luciferase to the amount of final complex. The experiments utilize plasmids for GR and Ubc9. In order to fit the model to the data, we assumed that the amount of protein expressed is proportional to the amount of plasmid added. The actual proportionality constant is not important for the fits, but for convenience we estimated the concentrations of protein expressed based on the amount of added plasmid. Assuming 660 g/mol per base pair, a well volume of 0.00033 L, 6800 bp for the pSG5/GR plasmid, and 5163 bp for the Ubc9 plasmid, we calculated the concentration of plasmid in nM from the mass in nanograms of plasmid added. We assumed that the concentration of the proteins actually translated is directly proportional to the concentration of the plasmid. From these estimates, we equated 0.1, 2, 10, and 25 ng of GR plasmid with 6.75 × 10⁻⁵, 0.00135, 0.00675, and 0.0169 nM of GR, and 135 and 175 ng of Ubc9 plasmid with 0.120053 and 0.155625 nM. Again, these values are not important for the model fits; they only change the scale of the parameters. The only important assumption is that the amount of plasmid added is proportional to the amount of protein expressed in the cells.
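The ng-to-nM conversion described above can be reproduced as follows (plain Python; the constants are those stated in the text, and the function name is ours):

```python
BP_MASS = 660.0          # g/mol per base pair of double-stranded DNA
WELL_VOLUME_L = 0.00033  # culture well volume in litres

def plasmid_nM(mass_ng, plasmid_bp):
    """Concentration (nM) of a plasmid of the given size from the mass added per well."""
    moles = (mass_ng * 1e-9) / (plasmid_bp * BP_MASS)   # grams divided by g/mol
    return moles / WELL_VOLUME_L * 1e9                  # mol/L -> nM

gr = [plasmid_nM(m, 6800) for m in (0.1, 2.0, 10.0, 25.0)]
ubc9 = [plasmid_nM(m, 5163) for m in (135.0, 175.0)]
```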
The data were fit using a Bayesian Markov Chain Monte Carlo method, specifically a variant of the Metropolis–Hastings algorithm with parallel tempering (Gregory, 2005). Initial priors for the parameters (K1, K2, K3, K4, a, DT) were (0.04285, 1335, 0.346736, 3.549, 101084, 0.080007), bounded to the ranges ([0, 100], [0, ∞), [0, ∞), [0, ∞), [0, ∞), [0, 0.080007]), with guess ranges of (1, 10, 30, 30, 100,000, 0.1). The upper bound of DT was determined from the concentration of luciferase plasmid, and DT was forced to be below that value. These guess ranges were determined empirically by trial and error to give a reasonable acceptance rate of the
algorithm. The parallel tempering was run at different inverse "temperatures" (β) of 0.00001, 0.001, 0.1, 0.4, 0.7, and 1. The Monte Carlo algorithm was run for 100,000 iterations at each value of β, and the first half of the results were discarded, resulting in 77.2%, 75.4%, 18.2%, 16.2%, 15.9%, and 15.6% acceptance rates, respectively. χ² values at each β were calculated using the last half of the trial fits (i.e., the last 50,000 values) to allow the transients to decay. The log-likelihood versus β was integrated to obtain the true χ² of 2066.27 for this model.
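The fitting machinery itself is standard. A much-reduced sketch of Metropolis–Hastings with parallel tempering is below (plain Python; the temperature ladder, step size, toy first-order Hill model, and Gaussian likelihood are illustrative stand-ins, not the settings used above, and positivity is enforced by simple clamping, a simplification of the bounded priors):

```python
import math
import random

def model(x, theta):
    """Toy first-order Hill model standing in for Eq. (16.5)."""
    amax, ec50 = theta
    return amax * x / (ec50 + x)

def log_like(theta, data, sigma=1.0):
    """Gaussian log-likelihood against (x, y) data points."""
    return -0.5 * sum((y - model(x, theta)) ** 2 for x, y in data) / sigma ** 2

def pt_mcmc(data, betas=(0.001, 0.1, 0.4, 1.0), n_iter=4000, step=0.1, seed=1):
    """Metropolis-Hastings chains at inverse temperatures beta (tempered target
    beta * log-likelihood), with swap moves between adjacent temperatures."""
    rng = random.Random(seed)
    chains = [[1.0, 1.0] for _ in betas]           # initial guess for each chain
    lls = [log_like(c, data) for c in chains]
    samples = []
    for it in range(n_iter):
        for i, beta in enumerate(betas):
            prop = [max(1e-6, p + rng.gauss(0.0, step)) for p in chains[i]]
            ll = log_like(prop, data)
            if math.log(rng.random()) < beta * (ll - lls[i]):
                chains[i], lls[i] = prop, ll
        j = rng.randrange(len(betas) - 1)          # try swapping one adjacent pair
        if math.log(rng.random()) < (betas[j + 1] - betas[j]) * (lls[j] - lls[j + 1]):
            chains[j], chains[j + 1] = chains[j + 1], chains[j]
            lls[j], lls[j + 1] = lls[j + 1], lls[j]
        if it >= n_iter // 2:                      # discard the first half as burn-in
            samples.append(list(chains[-1]))       # keep only the beta = 1 chain
    return samples
```

The high-temperature (small β) chains flatten the likelihood surface so the sampler can escape local optima, and the swap moves let those exploratory states propagate down to the β = 1 chain that yields the posterior samples.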
ACKNOWLEDGMENTS This research was supported by the Intramural Research Program of the NIH, NIDDK.
REFERENCES

Baxter, J. D., and Tomkins, G. M. (1971). Specific cytoplasmic glucocorticoid hormone receptors in hepatoma tissue culture cells. Proc. Natl. Acad. Sci. USA 68, 932–937.
Cho, S., Kagan, B. L., Blackford, J. A., Jr., Szapary, D., and Simons, S. S., Jr. (2005). Glucocorticoid receptor ligand binding domain is sufficient for the modulation of glucocorticoid induction properties by homologous receptors, coactivator transcription intermediary factor 2, and Ubc9. Mol. Endocrinol. 19, 290–311.
Fromm, H. J. (1975). Initial Rate Enzyme Kinetics. Springer-Verlag, New York.
Goldbeter, A., and Koshland, D. E. J. (1981). An amplified sensitivity arising from covalent modification in biological systems. Proc. Natl. Acad. Sci. USA 78, 6840–6844.
Gregory, P. C. (2005). Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica Support. Cambridge University Press, pp. 312–351.
He, Y., and Simons, S. S., Jr. (2007). STAMP: A novel predicted factor assisting TIF2 actions in glucocorticoid receptor-mediated induction and repression. Mol. Cell. Biol. 27, 1467–1485.
Kaul, S., Blackford, J. A., Jr., Chen, J., Ogryzko, V. V., and Simons, S. S., Jr. (2000). Properties of the glucocorticoid modulatory element binding proteins GMEB-1 and -2: Potential new modifiers of glucocorticoid receptor transactivation and members of the family of KDWK proteins. Mol. Endocrinol. 14, 1010–1027.
Kaul, S., Blackford, J. A., Jr., Cho, S., and Simons, S. S., Jr. (2002). Ubc9 is a novel modulator of the induction properties of glucocorticoid receptors. J. Biol. Chem. 277, 12541–12549.
Kim, Y., Sun, Y., Chow, C., Pommier, Y. G., and Simons, S. S., Jr. (2006). Effects of acetylation, polymerase phosphorylation, and DNA unwinding in glucocorticoid receptor transactivation. J. Steroid Biochem. Mol. Biol. 100, 3–17.
Lauffenburger, D. A., and Linderman, J. J. (1993). Receptors: Models for Binding, Trafficking, and Signalling. Oxford University Press, 365 pp.
Loeb, J. N., and Strickland, S. (1987). Hormone binding and coupled response relationships in systems dependent on the generation of secondary mediators. Mol. Endocrinol. 1, 75–82.
Luo, M., and Simons, S. S., Jr. (2009). Modulation of glucocorticoid receptor induction properties by cofactors in peripheral blood mononuclear cells. Hum. Immunol. 70, 785–789.
Nagaich, A. K., Walker, D. A., Wolford, R., and Hager, G. L. (2004). Rapid periodic binding and displacement of the glucocorticoid receptor during chromatin remodeling. Mol. Cell 14, 163–174.
Ong, K. M., Blackford, J. A., Jr., Kagan, B. L., Simons, S. S., Jr., and Chow, C. C. (2010). A new theoretical framework for gene induction and experimental comparisons. Proc. Natl. Acad. Sci. USA 107, 7107–7112.
Segel, I. H. (1993). Enzyme Kinetics: Behavior and Analysis of Rapid Equilibrium and Steady-State Enzyme Systems. Wiley, New York.
Song, L.-N., Huse, B., Rusconi, S., and Simons, S. S., Jr. (2001). Transactivation specificity of glucocorticoid vs. progesterone receptors: Role of functionally different interactions of transcription factors with amino- and carboxyl-terminal receptor domains. J. Biol. Chem. 276, 24806–24816.
Stavreva, D. A., Muller, W. G., Hager, G. L., Smith, C. L., and McNally, J. G. (2004). Rapid glucocorticoid receptor exchange at a promoter is coupled to transcription and regulated by chaperones and proteasomes. Mol. Cell. Biol. 24, 2682–2697.
Strickland, S., and Loeb, J. N. (1981). Obligatory separation of hormone binding and biological response curves in systems dependent upon secondary mediators of hormone action. Proc. Natl. Acad. Sci. USA 78, 1366–1370.
Szapary, D., Xu, M., and Simons, S. S., Jr. (1996). Induction properties of a transiently transfected glucocorticoid-responsive gene vary with glucocorticoid receptor concentration. J. Biol. Chem. 271, 30576–30582.
Szapary, D., Huang, Y., and Simons, S. S., Jr. (1999). Opposing effects of corepressor and coactivators in determining the dose–response curve of agonists, and residual agonist activity of antagonists, for glucocorticoid receptor regulated gene expression. Mol. Endocrinol. 13, 2108–2121.
Szapary, D., Song, L.-N., He, Y., and Simons, S. S., Jr. (2008). Differential modulation of glucocorticoid and progesterone receptor transactivation. Mol. Cell. Endocrinol. 283, 114–126.
Tao, Y.-G., Xu, Y., Xu, H. E., and Simons, S. S., Jr. (2008). Mutations of glucocorticoid receptor differentially affect AF2 domain activity in a steroid-selective manner to alter the potency and efficacy of gene induction and repression. Biochemistry 47, 7648–7662.
CHAPTER SEVENTEEN

Spatial Aspects in Biological System Simulations

Haluk Resat, Michelle N. Costa, and Harish Shankaran

Contents
1. Introduction
2. Methods and Frameworks
  2.1. Construction of spatial frameworks
  2.2. Treatment of reaction kinetics
  2.3. Algorithms to model spatial aspects
3. Summary and Future Prospects
Acknowledgments
References
Abstract Mathematical models of the dynamical properties of biological systems aim to improve our understanding of the studied system with the ultimate goal of being able to predict system responses in the absence of experimentation. Despite the enormous advances that have been made in biological modeling and simulation, the inherently multiscale character of biological systems and the stochasticity of biological processes continue to present significant computational and conceptual challenges. Biological systems often consist of well-organized structural hierarchies, which inevitably lead to multiscale problems. This chapter introduces and discusses the advantages and shortcomings of several simulation methods that are being used by the scientific community to investigate the spatiotemporal properties of model biological systems. We first describe the foundations of the methods and then describe their relevance and possible application areas with illustrative examples from our own research. Possible ways to address the encountered computational difficulties are also discussed.
Pacific Northwest National Laboratory, Computational Biology and Bioinformatics Group, Richland, Washington, USA

Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87017-5. © 2011 Elsevier Inc. All rights reserved.
1. Introduction

A kinetic model is the mathematical description of the interactions that can occur among a set of reactants in a given physical construct (Resat et al., 2009). The mathematical formulation has to define the occurrence properties of the reactions and where they can occur; that is, the model identifies reaction rates, propensities, allowed physical regions, etc. The formalism used for representation (stochastic, deterministic, hybrid, agent-based) as well as the method used to simulate the system properties are the other components that define a computational kinetic study. Mathematical representations that are typically used to simulate the dynamics of biological systems cover a wide complexity spectrum. Figure 17.1 summarizes which types of mathematical approaches are most suitable as a function of process rates and system size. Nonlinear ordinary or partial differential equations (PDEs) have been the most traditional approach to model the time evolution of biological systems (Gillespie, 1992a, 2000; Resat et al., 2009). Arguably the simplest representation involves a set of coupled, first-order ordinary differential equations (ODEs) called the reaction rate equations (RREs). The RREs are phenomenological equations derived under the assumption that the system is spatially homogeneous; that is, it is so well stirred that no persistent correlations
[Figure 17.1 layout: vertical axis, number of molecules (system size); horizontal axis, kdiff/kreact. Quadrants: deterministic non-spatial (ODE), deterministic spatial (PDE, compartmentalized ODEs), stochastic non-spatial (SDE, SSA), and stochastic spatial (SPDE, MC methods).]
Figure 17.1 Tabulation of mathematical methods that are often employed in kinetic studies: ODE, ordinary differential equation; PDE, partial differential equation; SPDE, stochastic PDE; SDE, stochastic ODE; MC, Monte Carlo; SSA, Gillespie-type stochastic simulation approach; kreact, rate of biochemical processes; kdiff, rate of diffusion processes.
Multiscale Aspects of Biological Simulations
develop among the positions and among the interaction patterns of the molecules. Although it is a crude representation, the RRE has the advantage of being the least computationally expensive approach, and it can accommodate vastly different time scales that the model system may require. The latter aspect (the “dynamical stiffness” condition) arises due to the wide differences in the rates at which the various reactions can proceed. The dynamical stiffness issue has been the subject of many studies in a multitude of scientific disciplines, and there are good ODE solvers which overcome the numerical difficulties (Ascher and Petzold, 1998). Even though it has been employed successfully in many studies, an ODE-based deterministic approach alone may be inappropriate for certain systems when the underlying problem is inherently stochastic (Bortz et al., 1975; Gillespie, 1976, 1977b). Stochastic effects could be due to the large fluctuations in the concentrations of biomolecules, which becomes particularly important when reacting particles exist at small copy number levels. At this limit, since the assumption of continuous molecular concentrations is no longer valid, differential equation-based models break down. Such limits are often encountered for molecular species involved in the transcriptional regulation of gene expression (McAdams and Arkin, 1999). Under such conditions, special treatment is required to model the studied biological network (Gillespie, 1976, 1977a,b), and a more general approach based on discrete stochastic simulation methods in which natural fluctuations are treated in an explicit manner has been advocated for the quantitative analysis of biological networks (Arkin et al., 1998; McAdams and Arkin, 1997). At the fundamental level, the stochastic approach involves the solution of the same underlying reaction equations that also define the corresponding deterministic models. 
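To make the discrete stochastic approach concrete, the following minimal Python sketch implements Gillespie's direct method for a hypothetical reversible binding step L + R <-> RL at low copy numbers. The reaction scheme, rate constants, and copy numbers are illustrative placeholders, not values from this chapter.

```python
import math
import random

def gillespie_ssa(x0, stoich, propensities, t_end, seed=0):
    """Minimal Gillespie direct-method stochastic simulation.

    x0: initial copy numbers; stoich: per-reaction copy-number changes;
    propensities: per-reaction functions mapping state -> firing rate.
    """
    rng = random.Random(seed)
    x = dict(x0)
    t, times, states = 0.0, [0.0], [dict(x0)]
    while t < t_end:
        a = [f(x) for f in propensities]
        a0 = sum(a)
        if a0 == 0.0:                                # no reaction can fire
            break
        t += -math.log(1.0 - rng.random()) / a0      # exponential waiting time
        r, acc = rng.random() * a0, 0.0
        for k, ak in enumerate(a):                   # pick reaction k with prob a_k/a0
            acc += ak
            if r < acc:
                break
        for species, change in stoich[k].items():
            x[species] += change
        times.append(t)
        states.append(dict(x))
    return times, states

# Hypothetical example: reversible binding L + R <-> RL at low copy number
k1, k2 = 0.01, 0.1
stoich = [{"L": -1, "R": -1, "RL": +1},              # forward: L + R -> RL
          {"L": +1, "R": +1, "RL": -1}]              # reverse: RL -> L + R
props = [lambda x: k1 * x["L"] * x["R"], lambda x: k2 * x["RL"]]
times, states = gillespie_ssa({"L": 100, "R": 50, "RL": 0}, stoich, props, 5.0)
```

Each realization of such a run is one stochastic trajectory; averaging many runs recovers the deterministic mean while the run-to-run spread quantifies the intrinsic fluctuations.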
However, in addition to the mean solutions that deterministic methods can provide, stochastic simulations also provide information about the intrinsic fluctuations of the system and about the possible roles of limiting reaction species. Discrete stochastic simulations were pioneered by Gillespie, who showed the equivalency of deterministic and stochastic formalisms in the limit where the abundances of all constituent chemical species tend to infinity, and discussed the importance of stochastic fluctuations when the abundances are small (Gillespie, 1977a,b). We note the distinction between the discrete stochastic solution of the reaction kinetic equations and stochastic differential equations (Resat et al., 2009). The latter have long been used in other physical sciences (Gillespie, 1992b, 2000) and are usually obtained by starting with an assumed ODE and then adding a "noise" term. Although such approaches have the advantage of being much less expensive computationally, they are basically phenomenological equations designed to capture the intrinsic fluctuations, where it is assumed implicitly that a presumed statistical distribution (e.g., Gaussian noise) characterizes the system fluctuations. Hence, they are more appropriate when system sizes are large enough to justify the employed
statistical distribution profile. Therefore, the use of discrete stochastic approaches to solve the reaction equations is often more suitable in studying biological systems, particularly when the expression levels of the reacting species are small. However, discrete stochastic simulations are often too slow and computationally expensive, and the exclusive use of this technique is impractical for most realistic systems. Thus, there is a need to develop multiresolution methods which make use of fast deterministic simulations for those parts of the system where justified, and employ discrete stochastic simulations wherever necessary. Such hybrid methods can utilize a hierarchy of mathematical schemes that range from the most accurate but most computationally intensive (discrete stochastic) to the coarsest but most computationally efficient (ODE) method. The major focus areas of this chapter, spatial organization and structural hierarchies, add an additional degree of complexity to biological system simulations. Hierarchical organization is frequently observed in biological systems, where biochemical activities and biological functions are restricted to defined niches that respond to the local environment. An obvious example is the cell: cells often survey their immediate surroundings and adjust their metabolic or signaling patterns according to the local extracellular conditions they are exposed to. Similarly, within the cell there are additional levels of organization where the local intracellular conditions become important: cells utilize specialized compartments that create favorable environments for specific biological activities. These subcellular compartments exchange material in a concerted fashion to fulfill their roles in fundamental cellular processes such as synthesis, degradation, and signal transduction. Further, in a multicellular organism cells reside and function in the context of a tissue.
Thus, compartmentalized reactions that occur at many smaller spatial scales simultaneously drive and respond to changes at the tissue level. Capturing this hierarchical organization in dynamic models presents significant additional challenges in multiscale modeling. Particle transport is central to the information and material transfer that occurs between compartments in biological systems (Berg, 1993; Slepchenko et al., 2003; Stundzia and Lumsden, 1996). We note that material transfer often occurs over spatial distances many orders of magnitude larger than the size of the involved transporters, and its mechanisms can include both passive transport through the relaxation of concentration gradients and active transport through the involvement of transporters, chaperones, or vesicles. This chapter focuses on modeling the dynamics of biological systems when spatial aspects play a role. We advocate a system-of-systems approach and discuss how such approaches can be constructed by the inclusion of spatial aspects either directly or through compartment-based formalisms. As rapid progress in experimental techniques, particularly fluorescent labeling and high-resolution microscopy, has started to make it possible to collect biological information about local dynamical activities
(see, e.g., Burke et al., 2001; Judd et al., 2003; McAdams and Shapiro, 2003; Ozcelik et al., 2004; Viollier et al., 2004), development of kinetic models that can capture the effects due to local inhomogeneities is certainly becoming feasible. These efforts will allow researchers to make headway into the last frontier—spatial dynamics—in systems biology efforts (Kholodenko, 2003, 2006; Lemerle et al., 2005).
2. Methods and Frameworks
2.1. Construction of spatial frameworks
Approaches to incorporate spatial aspects into biological models mostly fall into two broad categories which we designate as: (i) multicompartment models and (ii) models that explicitly incorporate the spatial dimension. Specifically, spatial models are usually built using the following two approaches, or using a combination thereof: (i) Multicompartment models involve the phenomenological partitioning of the system into distinct reaction volumes, and they are arguably the simplest to construct and simulate (Chaturvedi et al., 1977). The system is constructed as a collection of idealized compartments, which may not be faithful representations of the physical reality, but possess biological relevance. Each compartment constitutes an independent reactor volume, with compartments being linked at the higher whole-system scale. Mass/information transport is included in these models as exchange reactions between compartments. For example, modeling a eukaryotic cell as a collection of subcellular structures, defining which types of reactions can occur within each subcellular structure, and defining how these subcompartments exchange material through excretion or uptake of biomolecules would result in a multicompartment model of the cell. This model can be extended to a higher spatial level of organization, where each cell in a collection of cells forms a compartment of the system. In multicompartment models, the compartment identity is merely used to dictate reaction and mass/information transport rates, while the spatial location of the compartment is not modeled. This representation is particularly well suited to study vesicle-mediated processes such as endocytosis/phagocytosis (Birtwistle and Kholodenko, 2009; Resat et al., 2003; Shankaran et al., 2006) while keeping the complexity to a minimum.
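As a toy illustration of this compartment-exchange bookkeeping, the sketch below tracks a single hypothetical species X in two idealized compartments coupled by first-order transfer reactions; all rate constants and initial amounts are invented for illustration only.

```python
# Hypothetical two-compartment sketch: species X is degraded internally in
# each compartment and exchanged between compartments by first-order transfer.
def step(x1, x2, kdeg, k12, k21, dt):
    """One explicit Euler step; k12 moves X from compartment 1 to 2, k21 back."""
    t12 = k12 * x1                      # transfer rate, compartment 1 -> 2
    t21 = k21 * x2                      # transfer rate, compartment 2 -> 1
    dx1 = -kdeg * x1 + t21 - t12        # internal reactions + inflow - outflow
    dx2 = -kdeg * x2 + t12 - t21
    return x1 + dt * dx1, x2 + dt * dx2

x1, x2 = 1.0, 0.0                       # all material starts in compartment 1
for _ in range(1000):
    x1, x2 = step(x1, x2, kdeg=0.0, k12=0.2, k21=0.1, dt=0.01)
# with no degradation, exchange relaxes toward the balance k12*x1 = k21*x2
```

With degradation switched off, the total amount is conserved and the compartments relax toward the steady-state ratio x2/x1 = k12/k21; adding internal reaction terms per compartment turns this into a full multicompartment kinetic model.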
(ii) In models that explicitly incorporate space, a grid-like construction is typically used for physical partitioning of the system (Beyenal et al., 2004; Noguera et al., 1999a). The total system volume is subdivided into physical subvolumes that are small enough for each to be considered spatially homogeneous within the desired level of accuracy and sensitivity. Each
grid unit can be thought of as a distinct reactor and, whenever a chemical reaction occurs in the system, where it occurred is known within the resolution of the subdivisions. The spatial resolution comes at a considerable cost in computational complexity. If ODEs are used to represent the reaction kinetics, the model converts to a more demanding PDE where the kinetics are ruled by the local conditions at the subdivision grid points. The PDE model would also require the inclusion of transport/exchange between subvolumes and the definition of spatial boundary conditions. These types of models are well suited when diffusion and/or hydrodynamics play a significant role in system dynamics. Conceptually speaking, there are equivalencies between the above approaches. Each grid unit in a grid-based explicit spatial simulation can in fact be thought of as a "compartment" when it comes to constructing mathematical models. For both treatments, we need to specify the equations that govern the reaction kinetics within each system partition and the rules for mass/information exchange. In multicompartment models, we need to phenomenologically define the rules that govern mass/information transport between the compartments. In explicit spatial simulations, mass transport between the grid units is taken care of using physical laws such as the diffusion equation or the Navier–Stokes equation. We note that while explicit spatial simulations do not necessarily have to include a grid, we only consider grid-based formalisms in detail here. Scenarios where the problem is simple enough to allow the use of Newtonian-dynamics-type models, where exact locations are tracked in a continuum space, are likely to be rare as we construct increasingly realistic 3D models of biological systems. Obviously, an approach that combines the above two fundamental representations is also possible and often required for realistic models of biological systems.
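A minimal sketch of this grid-based coupling follows: a hypothetical 1-D example (not code from the chapter) that discretizes the diffusion term linking neighboring grid units with the method of lines and an explicit Euler step. The explicit scheme requires the stability condition D*dt/h**2 <= 1/2.

```python
import numpy as np

def diffuse_step(c, D, h, dt):
    """One explicit Euler step of dc/dt = D d2c/dx2 on a 1-D grid.

    c holds one concentration per grid unit; h is the grid spacing;
    the two ends are treated as reflecting (no-flux) boundaries.
    """
    lap = np.empty_like(c)
    lap[1:-1] = (c[2:] - 2.0 * c[1:-1] + c[:-2]) / h**2
    lap[0] = (c[1] - c[0]) / h**2          # no-flux left boundary
    lap[-1] = (c[-2] - c[-1]) / h**2       # no-flux right boundary
    return c + dt * D * lap

c = np.zeros(51)
c[25] = 1.0                                # point source in the central unit
for _ in range(200):
    c = diffuse_step(c, D=1.0, h=1.0, dt=0.25)   # D*dt/h**2 = 0.25 is stable
```

The no-flux boundary formulas conserve total mass, so the initial point source spreads without loss; adding per-unit reaction terms to the update converts this transport skeleton into a reaction-diffusion (PDE-type) model.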
Such mixed representations can help with splitting the investigated system into two parts: components whose spatial dynamics can be defined on the grid structure, and components that are directly included as compartments. The latter can also be recorded according to their positions in the underlying grid frame. These mixed representations are particularly well suited when the details included in the model are disparate and emphasize different aspects. For example, the dynamics of small messenger reactants are primarily ruled by their uptake/release and diffusion, so their profiles can be represented using the grid frame. In contrast to such reactant molecules, individual cells can have several attributes (cell states) that dictate their internal kinetics and behavior. Thus, it may be necessary to represent cells as individual compartments with their own dynamics and regulatory rules, possibly containing their own subcompartments if subcellular structures are included in the model. Such mixed representations have been successfully used in individual-based models where cellular motility and migration dynamics or the dynamics of subcellular vesicles are investigated, or to study higher order structure formation in research areas such as
tissue/tumor modeling (Zhang et al., 2009a), biofilms (Xavier et al., 2005, 2007), and bacterial colonies (Allison, 2005; Picioreanu et al., 2005, 2007).
2.2. Treatment of reaction kinetics
Irrespective of the model formulation, the kinetics of the compartments essentially consist of two parts: what happens internally, and how a compartment interacts with its surroundings or with other compartments. These can be dealt with in different ways according to the desired accuracy and resolution. This section discusses several of the most commonly used approaches to model these aspects. It should be noted that in principle, apart from practical implementation issues, one is not limited to using a single type of formalism and can mix-and-match different representations in a unified model as desired.
2.2.1. Models for internal kinetics of compartments
As discussed in Section 1, regardless of whether compartments are defined from a biological standpoint or as spatial grid units, an ODE-based deterministic treatment is the easiest approach to model the internal kinetics of system compartments. To develop such a model (Resat et al., 2009), one starts with identifying the input and output variables as well as the reaction intermediates; that is, key species and physical properties to be simulated are determined and tabulated. For example, assume that a receptor (R) system is studied, where R can bind to a ligand L to form ligand-bound receptors (RL). RL can then dimerize to form the dimer complex (D), which can bind to an effector/adaptor/scaffold molecule E to form the DE complex and trigger the involved signal transduction cascade. Let us also assume that the level of formed DE complexes regulates ligand shedding by the cell. In this example, one would need to include the reactants L, R, RL, D, E, and DE. If the reaction occurrences are modeled with mass-action formulas, the mathematical model would consist of the following rate equations:

VR d[L]/dt = -k1 [L][R] + k2 [RL] + fs(DE)
d[R]/dt = -k1 [L][R] + k2 [RL]
d[RL]/dt = +k1 [L][R] - k2 [RL] - 2 k3 [RL]^2 + 2 k4 [D]
d[D]/dt = +k3 [RL]^2 - k4 [D] - k5 [D][E] + k6 [DE]
d[E]/dt = -k5 [D][E] + k6 [DE]
d[DE]/dt = +k5 [D][E] - k6 [DE]          (17.1)

In these equations, square brackets [X] denote the concentration of species X in cellular volume. The cytoplasmic to extracellular volume ratio VR accommodates the fact that free ligand L resides in the
extracellular media while the other reactant species are defined in association with the cell. fs(DE) is the function defining the ligand synthesis and shedding rate. For example, if the rate is hyperbolically proportional to the DE level, then fs(DE) = ks [DE]/(KDE + [DE]). We note that the rate equations tabulated in Eq. (17.1) could equally be written out if they were not occurring as part of the internal kinetics of a cell but instead as a set of reactions in a particular unit of the grid that discretizes the physical system into subcompartments. A grid unit may constitute a part of a subcompartment or of the medium that surrounds the compartment objects. In this case, the mathematical equations (e.g., Eq. (17.1)) need to be expanded to include terms that represent the movement of reacting species between grid units. This will be discussed in the next section.
2.2.2. Coupling the compartments
In most instances, how compartments interact with each other or with the environment in the context of the larger biological system is the defining factor in biological responses. For example, the ligand/messenger molecule in the above model might be brought to the vicinity of the cell through blood stream flow, or messenger agents could be secreted and absorbed as part of intercellular communication. We note that intercellular communication itself can be autocrine, juxtacrine, or endocrine in origin, and these processes couple the system elements over vastly different length scales. Another example relevant to direct intercellular interactions in eukaryotic systems is signaling through cell–cell junction formation. In this case, the dynamics of the receptors could depend on whether they occupy the junction regions. Such details can be included in the models by splitting the receptor pool into "at the junction" and "away from the junction" categories, where each category can have different binding and complexation rates.
Receptor movement in and out of the junctions can be modeled as reactant exchanges in the equations. Similarly, cell–cell contacts can impact cell polarity, motility, internal cellular kinetics, and cell behavior. Bacterial systems, too, provide many relevant examples where interactions between the system elements are critical, a prominent example being the synergy between the organisms in a microbial community, where trophic interactions are one of the dominant factors in determining system characteristics. System (sub)compartments are usually coupled in the models using material transfer reactions or rules that alter the kinetic parameters. For example, material can be transported in and out of organelles or the nucleus to exchange material with the cytoplasm. One way to implement this is to add transfer terms to the rate equations:

d[X]i/dt = "internal reactions" + TX,j→i - TX,i→j          (17.2)
where TX,j→i is the transfer rate of X from (sub)compartment j to (sub)compartment i. "Internal reactions" denotes the kinetic terms that characterize the fate of X internally in the compartment, for example, the network model described with Eq. (17.1).
Example 1. Endocytic trafficking of many types of receptors can have profound effects on subsequent signaling events. In an earlier study, we constructed a multicompartment model to investigate the role of endocytic trafficking in epidermal growth factor receptor (EGFR) signaling (Resat et al., 2003). Our quantitative model combined the trafficking and signaling processes in an integrated model (Fig. 17.2). This model consisted of hundreds of distinct endosome compartments and about 13,000 reactions/events that occur over a broad spatiotemporal range. Dynamic formation and recycling of the endocytic vesicles were built into the model by defining them as
Figure 17.2 Multicompartment model for receptor signaling. Early endosome (EE) vesicles can form from the cell membrane. EE vesicles can be of two types, smooth and coated pit, with different receptor complex internalization characteristics. Smooth and coated pit EE vesicles also have different trafficking dynamic rates. These vesicles can either recycle back to the plasma membrane or merge into the sorting endosome, which regulates the eventual protein degradation process through the lysosome pathway. The number of EE vesicles changes in time according to the system dynamics. Each compartment is represented with its own receptor signaling network, which is essentially a more complex version of the example given in Eq. (17.1). Further details can be found in Resat et al. (2001, 2003).
probabilistic events with defined mean occurrence rates λf and λx, respectively (Resat et al., 2001). When an early endosome is formed, the vesicle encompasses a certain portion of the receptors and other protein complexes, which get internalized through the trafficking vesicle. As the conditions in the vesicles and at the transmembrane/cytoplasm boundary may differ (e.g., due to pH differences, concentration enhancement, colocalization, etc.), this downregulation process allows cells to regulate their signaling patterns spatially. Use of a realistic multicompartment model allowed us to investigate the distribution of the receptors among cellular compartments as well as their potential signal transduction characteristics (Resat et al., 2003). Our model accurately predicted the differential kinetics and extent of receptor downregulation induced by different ligands of EGFR, and our results led to the prediction that receptor trafficking controls the compartmental bias of signal transduction, rather than simply modulating signal magnitude. It should be noted that a similar strategy is applicable when a grid frame is used to define system subvolumes (Example 2). Continuing with the network model example from Eq. (17.1), say that we are dealing with the receptor R level in grid unit i, Ri. The corresponding rate equation would then need to be modified to:

d[R]i/dt = -k1 [L]i [R]i + k2 [RL]i + Σj (TR,j→i - TR,i→j)          (17.3)
Here TR,j→i denotes the transfer rate of R from the neighboring grid unit j to unit i. The sum in Eq. (17.3) is over the grid units to/from which the exchange is allowed. Notice that the mechanisms or causes of the material exchanges can be manifold, ranging from transporter-mediated active transport to passive diffusion to advective flows. Depending on the involved mechanism, appropriate mathematical expressions should be used to represent the material transfer in the equations. We also note that the above discussion is not limited to hypothetical compartments or compartmentalization via gridding; a mixed approach can similarly be used where one utilizes internal kinetic models for the explicitly included objects (e.g., cells, organelles, vesicles) and overlays those objects on the grid. In this hybrid approach, objects are defined in terms of their grid location, which enables the specification of the relative physical separation between objects. Such mixed approaches are discussed in Section 2.3.3.
Example 2. The spatially regulated signaling discussed in Example 1 can also be studied using a grid frame to define system subvolumes (Fig. 17.3) (Pettigrew and Resat, 2005).
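The structure of Eq. (17.3) can be sketched numerically as follows. This illustrative Python fragment (all rate constants, grid size, and initial values are hypothetical placeholders) evolves only the ligand-binding step of Eq. (17.1) in each unit of a small 1-D chain and adds first-order transfer of R between neighboring units:

```python
# Illustrative sketch of Eq. (17.3): internal mass-action kinetics per grid
# unit plus first-order transfer of receptor R between neighboring units.
def step(L, R, RL, k1, k2, kT, dt):
    """One explicit Euler step over lists of per-grid-unit concentrations."""
    n = len(R)
    Ln, Rn, RLn = L[:], R[:], RL[:]
    for i in range(n):
        react_L = -k1 * L[i] * R[i] + k2 * RL[i]
        react_R = -k1 * L[i] * R[i] + k2 * RL[i]
        react_RL = k1 * L[i] * R[i] - k2 * RL[i]
        # sum over neighboring grid units j of (T_{R,j->i} - T_{R,i->j})
        transfer = sum(kT * R[j] - kT * R[i]
                       for j in (i - 1, i + 1) if 0 <= j < n)
        Ln[i] = L[i] + dt * react_L
        Rn[i] = R[i] + dt * (react_R + transfer)
        RLn[i] = RL[i] + dt * react_RL
    return Ln, Rn, RLn

n = 5
L = [1.0] * n                      # ligand present in every grid unit
R = [1.0, 0.0, 0.0, 0.0, 0.0]      # free receptor initially only in unit 0
RL = [0.0] * n
for _ in range(2000):
    L, R, RL = step(L, R, RL, k1=0.5, k2=0.1, kT=0.3, dt=0.01)
```

In this sketch only free R is transferred, so total receptor (R plus RL) and total ligand (L plus RL) are conserved while R gradually redistributes across the chain; a full model would assign transfer terms to every mobile species.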
Figure 17.3 Schematic diagram showing how spatial details can be incorporated into grid-based multicompartment models during kinetic model development. In this example, the cell and its surroundings are modeled as having four compartments: extracellular (X), transmembrane (T), endosomes (E), and cytoplasm (C). These subcellular compartments are laid out on a regular rectangular grid where each grid unit is labeled (e.g., E1, X8, etc.) to track location information at the subcompartment level. Note that, if desired, a finer grid may be used for particular regions of the system (e.g., the subdivision of the E1 grid unit).
In this case, the reactions can be characterized either as biochemical or as mass transfer reactions, and rules can be imposed on the reactions and species locations. For example, unbound effector proteins can reside in the cytoplasm only, but upon forming a complex with receptors they become part of the compartment that the interacting receptor resides in. Similarly, upon dissociation from the receptor complex the effector protein returns to the cytoplasm:

E(C, Ks) + D(T, Kt) → DE(T, Kt)   with rate constant ka,t
E(C, Ks) + D(E, Kt) → DE(E, Kt)   with rate constant ka,e

In the above reaction equations, the leading symbol shows the reactant type, the first index labels the compartment (Fig. 17.3), and the last index indicates the grid unit in which the reacting molecule resides before and after the reaction. In these reactions, effector (E) present in the cytoplasm (C) can complex with the receptor dimers that reside in the transmembrane (T) and endosome (E) compartments. Such association/dissociation reactions
couple the different compartments through biochemical reactions. Compartment coupling can also occur through mass transfer reactions, which rule the localization and spatial distribution of molecules between system compartments. Material exchange between two grid units that belong to the same compartment, L(X, Ks) → L(X, Kt) with rate constant kt, results in material redistribution within that major cellular compartment. In contrast, material exchange between two adjacent grid units that belong to different major compartments results in material transport between cellular compartments. We note that modeling the mass transfer reactions as first-order reactions can accurately mimic the diffusion equation if the transfer rate constants are adjusted according to the relationship between the average squared displacement ⟨r2⟩ of a particle and its diffusion coefficient D for random motion (Berg, 1993).
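A short Monte Carlo check of this rate matching (hypothetical numbers throughout): for random motion ⟨r2⟩ = 2dDt in d dimensions, while hopping between grid units of spacing h at a first-order rate kt per neighbor direction gives ⟨r2⟩ = 2d kt h^2 t, so choosing kt = D/h^2 reproduces the target diffusion coefficient D.

```python
import random

def msd_1d(kt, h, t_end, n_walkers=2000, dt=0.001, seed=1):
    """Mean squared displacement of 1-D grid hoppers (small-time-step sketch)."""
    rng = random.Random(seed)
    total = 0.0
    steps = int(t_end / dt)
    for _ in range(n_walkers):
        x = 0
        for _ in range(steps):
            u = rng.random()
            if u < kt * dt:              # hop to the right neighbor
                x += 1
            elif u < 2.0 * kt * dt:      # hop to the left neighbor
                x -= 1
        total += (x * h) ** 2
    return total / n_walkers

D, h, t_end = 1.0, 1.0, 1.0
kt = D / h**2                # transfer rate constant that mimics diffusion
msd = msd_1d(kt, h, t_end)   # should be close to 2*D*t_end for 1-D motion
```

The simulated mean squared displacement should approach 2Dt as the number of walkers grows, confirming that first-order inter-unit transfer with kt = D/h^2 behaves diffusively at scales larger than the grid spacing.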
2.2.3. Issues related to internal kinetic models
The development of kinetic models requires the specification of the variables to represent the system internal states and readouts, and the appropriate equations governing the interactions between these variables. In practical terms, a modeler has to carefully choose the biomolecular reactions and processes to include in the model, and decide the level of detail at which these processes are to be modeled. Cells utilize exceedingly complex interconnected networks of biomolecular reactions to carry out their myriad functions. An all-encompassing model of the cell that includes all of these reactions at the molecular level of detail would be impractical. Fortunately, in practical terms, cellular processes exhibit a certain level of modularity, which makes it possible to model particular portions of the cell in isolation from the rest of the cell. As a simple example, a model of a signaling pathway does not need to include the transcriptional responses triggered by the pathway if these responses have little effect on the species abundances and reaction rates within the pathway during the time scale of interest. Such notions of modularity have enabled the development of several successful models for particular cellular pathways (e.g., signaling pathways) and processes (e.g., cell cycle). These models are typically developed for well-studied pathways and processes for which the detailed molecular mechanisms are known. While this enables the construction of an initial reaction network, inclusion of all the mechanistic details, as often done in biological modeling, leads to extremely large models even for relatively simple systems, and makes parameter estimation, and hence the model, unreliable when the supporting experimental information is lacking.
An effective model is one whose size is limited by the scope and limitations of the experimentally measured quantities, while still capturing the fundamental biochemical/biophysical processes of the studied system. We further elaborate on this point below.
First, we address the issue of model scope (breadth) and granularity (depth). Consider the network model presented in Eq. (17.1). This is a model of a receptor system that is restricted to a few key species: a receptor R, a ligand L, and a scaffold molecule E. The implicit assumption here is that there are no other species that have a significant bearing on the rate expressions listed in Eq. (17.1); that is, this reaction network can be treated as a self-contained module. Although networks such as those in Eq. (17.1) may be embedded as part of larger, more complex signaling pathways, modeling such portions of the signaling pathway can often provide sufficient insight into the operation of the pathway as a whole. It is useful to identify such networks where possible, and thereby restrict the scope of the model to make modeling tractable. It should be noted that the scope of the model can also be imposed upon the modeler by the unavailability of detailed mechanistic information about large signaling networks of interest. In this regard, an important aspect of Eq. (17.1) is the use of the phenomenological function fs(DE) to link the ligand shedding rate to the number of dimer–adaptor complexes; in other words, the actual molecular mechanisms that effect this link are considered to be beyond the scope of the model because they are either unavailable or irrelevant. With regard to granularity (the level of detail in the model), Eq. (17.1) is a much simpler abstraction of the actual molecular mechanisms at work in this core network. For instance, in the receptor system that the model is built for, receptor dimerization is followed by phosphorylation of the receptor cytoplasmic tails at several distinct amino acid sites, with particular phosphorylation sites being capable of recruiting specific adaptor molecules (Wolf-Yadlin et al., 2006).
Should we then employ distinct variables in our model to explicitly track the various possible phosphorylated receptor types and their associated adaptors? This would vastly increase the number of model parameters due to combinatorial complexity (Blinov et al., 2004) and, as discussed below, would render the model unreliable if we did not have the right kind of experimental data to establish these parameter values. In order to construct a predictive kinetic model, in addition to the reactants and rate expressions (e.g., Eq. (17.1)) we also need parameter values: rate constants and the reactant abundances that would form the initial conditions for the ODEs. While at first glance species abundances and rate constants may be available in the literature, these quantities are often cell-type specific and may not be applicable to the particular cell type under study. Further, due to the specific experimental techniques involved, some of the available parameters may have large uncertainties associated with them. Hence, it is always desirable to establish a consistent set of parameter values by fitting the entire model to experimental data collected in the cell type of interest. Such parameter estimation exercises place additional constraints on model building. Several biochemical reactions occur at second or subsecond time scales and may be unobservable in
traditional biological experiments, where the sampling times are on the order of tens of seconds or minutes. Such reactions can hence be treated as pseudo-steady-state reactions, thereby reducing the number of model variables. Further, once the relevant system observables have been established based on constraints imposed by experimental techniques, one could then perform structural and practical identifiability analysis (Raue et al., 2009) to further assist in model simplification. Structural identifiability is related to the model structure (potential correlations between the model parameters) and is independent of the experimental data. Practical identifiability is related to the amount and quality of the available data. There are other techniques that can specifically be applied to formally reduce kinetic models (Ho, 2008). Any model reduction should be done with care so as to ensure that the resulting lumped parameters, if any, can be properly interpreted. Finally, most parameter estimation exercises culminate with the tabulation of best-fit point estimates for the model parameters, which may provide a false sense of security when it comes to understanding system behavior. Proper care should be taken to establish that the obtained solution is unique, and to calculate the confidence intervals in the parameters and the predictions (Shankaran et al., 2008). On a related note, any point estimate for a particular rate constant that is presented in the literature based on model-based analysis of experimental data may be completely unrealistic when used outside the context of the presented model, and needs to be interpreted with care. In summary, the development of an effective kinetic model is a nontrivial exercise that contains several pitfalls.
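The structural identifiability issue mentioned above can be illustrated with a deliberately pathological toy model (all names and values here are hypothetical): if the observable is y(t) = a*b*exp(-k*t), the parameters a and b enter only through their product, so rescaling a and b in opposite directions leaves the fit unchanged and no amount or quality of data can resolve them individually.

```python
import math

def model(t, a, b, k):
    """Toy observable in which a and b appear only as the product a*b."""
    return a * b * math.exp(-k * t)

ts = [0.1 * i for i in range(20)]
data = [model(t, a=2.0, b=3.0, k=1.5) for t in ts]   # synthetic "truth"

def sse(a, b, k):
    """Sum of squared errors of the model against the synthetic data."""
    return sum((model(t, a, b, k) - d) ** 2 for t, d in zip(ts, data))

fit1 = sse(2.0, 3.0, 1.5)   # the true parameter set
fit2 = sse(6.0, 1.0, 1.5)   # different (a, b) with the same product a*b
# fit1 and fit2 are identical: a and b are structurally non-identifiable,
# whereas perturbing k changes the fit, so k is identifiable from these data
```

In a real kinetic model such correlated parameter pairs are harder to spot by inspection, which is why formal structural identifiability analysis (e.g., Raue et al., 2009) is performed before trusting best-fit point estimates.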
A general rule of thumb is that a model should be parsimonious—as simple as possible to address the question at hand—and should take into account the limitations of the available information about the system and of the experimentally measured quantities. Example 3 The human epidermal growth factor receptor (HER) family comprises four homologous receptors (named HER1–4; HER1 is more commonly called EGFR); HER2 and HER3 in particular are overexpressed in many cancers. HER receptors can form homo- and heterodimers, and the pattern of receptor dimerization between HER molecules is a critical determinant of cell signaling and biological outcomes. While dimer abundances are difficult to measure experimentally, the levels of phosphorylated HER molecules can easily be measured using ELISAs. The level of receptor phosphorylation contains information about receptor dimerization affinities, and this information needs to be deconvoluted from the experimental measurements using a mathematical model (Fig. 17.4). In an ongoing research effort, we measured the phosphorylation patterns of HER receptors in epithelial cells expressing endogenous levels of EGFR and varying levels of HER2 and HER3 receptors that were exposed to various ligand treatments (Zhang et al., 2009b). We then used a parsimonious mathematical model that included the fundamental
Figure 17.4 Parsimonious mathematical model to analyze receptor dimerization (Shankaran et al., 2008). The model contains 10 species: receptor monomers (R1: EGFR and R2: HER2) and phosphorylated receptor dimers (R11*, EGFR homodimer; R12*, EGFR–HER2 heterodimer; and R22*, HER2 homodimer) at the cell surface (subscript s) and in the internal (i) compartment. Monomers at the cell surface interact to yield phosphorylated dimers with a forward rate kfs and a dimer-dependent reverse rate krs. Internalized dimers dissociate with a dimer-dependent rate kri. Monomers are internalized with rate kt. Dimers are internalized with a dimer-dependent rate ke. In the absence of ligand, EGFR and HER2 monomers are assumed to have surface-to-internal receptor ratios of a1 and a2, respectively, and the respective monomer degradation rates kd1 and kd2 can be expressed as the product of these a values and kt. Internalized dimers are degraded with a dimer-dependent rate kd*. VR1 and VR2 are the zero-order synthesis rates for EGFR and HER2.
processes that occur in this system—receptor dimerization, phosphorylation, and endocytic trafficking—to analyze the experimental data and to extract the receptor dimerization affinities (Shankaran et al., 2008). Note that while this model shares the basic features presented in Example 1 and in other detailed models from our group (Shankaran et al., 2006), it is vastly simpler—it has a single internal compartment and does not explicitly model receptor–ligand binding. These model simplification choices were made based on the nature of the experimental data being collected; for example, ELISA measurements sampled at minute time scales do not provide the sampling coverage needed to determine the kinetic rates for ligand–receptor
binding that occurs at much faster time scales. The parsimonious model contained 19 parameters, all of which could be estimated with a good degree of confidence solely based on the experimental data collected in the study.
2.3. Algorithms to model spatial aspects

2.3.1. Multicompartment models

Multicompartment models are not truly spatially resolved approaches. From a computational perspective, they are simply an artificial partitioning of the system into subsystems whose components are built to mimic the known spatial aspects of the problem. In effect, each compartment is its own internal subsystem, and these subsystems also interact with each other at the whole-system level; that is, a system-of-systems is formed. Spatial information can be introduced into such models phenomenologically by adjusting the material transfer rates based on presumed locations and distances. If there are N compartments, each with an internal reaction network of size Sin, then a system-of-systems modeling approach results in a network of size N·Sin + Str, where Str is the number of intercompartment reactions. If the compartments interact bidirectionally in a pairwise manner through Spair interactions, then Str = [N(N − 1)/2]·Spair. Once the mathematical model equations are constructed, handling of the model is the same as in the single-system case, with the exception that the system size is now much larger. So either a deterministic or a stochastic method can be used, as needed and desired, to solve the model equations during simulations.

2.3.1.1. Solution of model equations

Deterministic solution of ODEs is a very well-established area with widely available software (Ascher and Petzold, 1998). Thus, it will not be discussed here. Deterministic methods can provide acceptable solutions for population-based representations where spatiotemporal properties are captured using averaging techniques (Fig. 17.1). One computes mean quantities, such as the abundances of organisms or the levels of reactant proteins.
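The compartment bookkeeping above can be sketched with a minimal two-compartment example: a hypothetical species is synthesized in compartment 1, transferred to compartment 2, and degraded there. All rate values are made up for illustration, and a forward-Euler step stands in for a production ODE solver:

```python
# Two-compartment system-of-systems sketch (illustrative rates only):
#   ds1/dt = vr - k12*s1        (synthesis, transfer out of compartment 1)
#   ds2/dt = k12*s1 - kd*s2     (transfer in, degradation in compartment 2)
# The transfer term k12*s1 is the single intercompartment reaction (Str = 1).

def step(s1, s2, dt, vr=1.0, k12=0.2, kd=0.1):
    """One forward-Euler step of the coupled compartment ODEs."""
    ds1 = vr - k12 * s1
    ds2 = k12 * s1 - kd * s2
    return s1 + dt * ds1, s2 + dt * ds2

s1, s2 = 0.0, 0.0
for _ in range(100000):                 # integrate to t = 1000
    s1, s2 = step(s1, s2, 0.01)

# Approaches the steady state s1 = vr/k12 = 5, s2 = k12*s1/kd = 10.
print(round(s1, 3), round(s2, 3))
```

The same structure scales to N compartments by stacking the per-compartment networks into one large ODE system, which is then handled exactly like a single-system model.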
This approach may be suitable for well-mixed systems (e.g., lysates of tissue samples, or samples taken from a stirred batch reactor) in which the role of spatial variations in the dynamics of the constituents can be overlooked because mixing tends to equalize local conditions. The discrete stochastic simulation algorithm (SSA) is an alternative way to solve the model equations (Gillespie, 1976, 1977b). The SSA provides unbiased stochastic realizations of the system dynamics based on the model equations. As the solution space is sampled statistically, stochastic methods can provide, in addition to the mean observations, information about the variability in the system. Therefore, they are more applicable when it is expected that
small portions of the system may show noticeable and functionally important differences in their dynamics. The steps of the SSA are (Resat et al., 2009): (i) Initialize the system's state x = x0 at time t = t0; (ii) Evaluate the reaction propensities aj(x) and their sum a0(x); (iii) Draw two random numbers r1 and r2 from a uniform distribution in the unit interval, compute the time interval to the next reaction as τ = −ln(r1)/a0(x), and determine which reaction type Rj occurs next by finding the smallest integer j that satisfies a1(x) + a2(x) + ⋯ + aj(x) > r2 a0(x); (iv) Record and update the changes in the system, t → t + τ and x → x + νj, where the stoichiometry vector νj indicates how the chosen reaction affects the species; and (v) Return to step (ii) or end the simulation.
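The five steps can be sketched for a toy birth–death network (0 → X with constant propensity kb; X → 0 with propensity kd·x). The network and rate values are illustrative, not the chapter's model:

```python
import math
import random

# Minimal Gillespie SSA sketch for a toy birth-death process.
def ssa(x0, kb, kd, t_end, rng):
    t, x = 0.0, x0                       # (i) initialize state
    births = deaths = 0
    while True:
        a = [kb, kd * x]                 # (ii) propensities a_j(x)
        a0 = sum(a)
        if a0 == 0.0:
            break                        # no reaction can fire
        r1, r2 = rng.random(), rng.random()
        tau = -math.log(r1) / a0         # (iii) time to next reaction
        if t + tau > t_end:
            break
        t += tau
        if r2 * a0 < a[0]:               # smallest j with cumulative sum > r2*a0
            x += 1; births += 1          # (iv) stoichiometry of reaction 1
        else:
            x -= 1; deaths += 1          # (iv) stoichiometry of reaction 2
        # (v) loop back to step (ii)
    return x, births, deaths

x, births, deaths = ssa(x0=10, kb=5.0, kd=0.5, t_end=50.0, rng=random.Random(42))
assert x == 10 + births - deaths         # event bookkeeping is consistent
print(x, births, deaths)
```

Over many independently seeded runs the population fluctuates around kb/kd = 10, and the run-to-run spread is exactly the variability information mentioned above that a deterministic solution cannot supply.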
2.3.2. Models that explicitly include spatial aspects

Arguably the simplest way to directly include spatial resolution in kinetic models is through the use of a grid framework. Inasmuch as the grid units in a spatial simulation are analogous to individual system compartments, explicit spatial models can also be thought of as system-of-systems formulations. Grid structures can be formed and investigated using either regular static grids or adaptive meshes that can fold and bend as the simulation progresses. Finite element methods employing adaptive meshes may be more appropriate when there are considerable gradients in the system but, for simplicity, we discuss finite difference methods that employ regular grids here. Extension of the presented ideas to adaptive mesh representations is conceptually straightforward.

2.3.2.1. Deterministic solutions

When spatial resolution is included, the rate equations become PDEs. If diffusive motion of the reactants is incorporated, the resulting reaction–diffusion equations take the form
∂S/∂t = ∇·(DS∇S) + “internal reactions”    (17.4)
As in Eq. (17.2), “internal reactions” stand for the kinetic reactions that occur within a particular grid unit that have a bearing on the reactant abundances. With the use of a regular grid and the finite difference method, reaction–diffusion equations reduce to differential algebraic equations (DAEs), the discrete algebraic form of PDEs (Ascher and Petzold, 1998). Example 4 Spatial grids have long been used to simulate the reaction dynamics in 3D systems. For example, Noguera and coworkers have used such an approach to study the growth patterns of sulfate-reducing microbial communities in anaerobic environments (Noguera et al., 1999b) (Fig. 17.5).
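A minimal sketch of such a finite-difference treatment in one dimension may help: forward Euler in time, a regular grid in space, no-flux boundaries, and a simple first-order decay as the "internal reaction". The grid size and the values of D and k are illustrative, not those of Example 4:

```python
# Explicit finite-difference update for the 1D reaction-diffusion equation
# dS/dt = D*d2S/dx2 - k*S with no-flux (reflecting) boundaries.
def rd_step(s, dt, dx, D, k):
    n = len(s)
    new = []
    for i in range(n):
        left = s[i - 1] if i > 0 else s[0]            # reflecting boundary
        right = s[i + 1] if i < n - 1 else s[-1]
        lap = (left - 2.0 * s[i] + right) / dx ** 2   # discrete Laplacian
        new.append(s[i] + dt * (D * lap - k * s[i]))  # diffusion + decay
    return new

s = [0.0] * 50
s[25] = 1.0                                           # point source mid-grid
for _ in range(200):
    s = rd_step(s, dt=0.1, dx=1.0, D=1.0, k=0.01)     # dt*D/dx**2 <= 0.5 for stability

# Diffusion conserves mass with these boundaries, so the total decays only
# through the k*S term: after n Euler steps it equals (1 - k*dt)**n.
print(sum(s))
```

Stacking one such field per species and adding the coupling terms of the internal reaction network gives the DAE/ODE system referred to in the text.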
Figure 17.5 Description of a grid framework in biofilm simulations (adapted from Noguera et al., 1999b). The physical domain is divided into rectangular grid units. The first layer forms the surface to which microbial species can attach and grow into biofilms. Grid units are labeled as occupied by the growing biofilm (gray) or as colony boundary units (yellow). Note that, for clarity, only the units in the first layer are illustrated in the figure; the rest are not shown. The diffusing elements can occupy any of the grid units, and their dynamics can be modeled using the reaction–diffusion equation, Eq. (17.4). The time evolution of biomass in the grid units is tracked. When a grid unit is filled with biomass, the growing colony occupies the neighboring sites and the interface boundary layer moves accordingly.
Even though this example employed a simple metabolic network for substrate utilization and biomass growth, as well as simple geometries, these aspects are not limiting in such studies. For example, using a conceptually equivalent computational approach, Scheibe and coworkers have studied the dynamics of underground systems in which heterogeneous soil structure, hydrodynamic mass transport, and genome-based metabolic network models were combined into a much more realistic and first-principles-based model (Scheibe et al., 2009).

2.3.2.2. Stochastic solutions

With the exception of the diffusion term, stochastic solution of the reaction–diffusion equation parallels Section 2.3.1.1. Diffusion aspects can be incorporated in several ways. In the uniform treatment case, all reactant species are considered stochastically. In this case, diffusive motion can be handled as material transport from one grid unit to the next according to the kinetic rate parameters; that is, diffusion is treated as a transfer reaction in which reactants hop between grid units. This
type of jump Markov approach has roots in the Ising (1925) and Potts models (Potts, 1952; Wu, 1982), and it has been adapted to biological simulations (Chatterjee and Vlachos, 2006; Collins et al., 2008; Fallahi-Sichani and Linderman, 2009; Gillespie, 1978; Pettigrew and Resat, 2005; Stiles and Bartol, 2001; Turner et al., 2004). Example 2 given above uses fixed boundaries to identify the system subparts. This rigidity can be removed from the layout to allow for temporal geometrical variations. A good example is the CompuCell3D simulation environment (www.compucell3d.org), where the assignment of the grid units is decided based on biomechanical properties to allow for cellular growth, motility, or simply shape changes. It should be noted that varying spatial resolutions can be introduced in different parts of the system using a coarse-graining approach that may employ multiple grids and/or treatment methods. Utilization of multiple grids for different regions of the system is one way of allowing the spatial resolution to differ between system parts. In addition, different regions defined by the grid partitioning can be treated with different methods; for example, spatial Monte Carlo (MC) approaches that treat the particles/objects individually and stochastic simulation approaches that ignore the local spatial details can be used for different parts of the system. These concepts are further explained in Example 5. Example 5 In receptor signaling, a model that can track individual receptor molecules would allow the investigation of how receptor distribution in the plasma membrane affects cell-wide signaling patterns. The large difference between the volume fractions of the plasma membrane and the cell interior makes it impractical to use the same resolution to model all of the relevant molecular events in the system.
Hence, a multiresolution model in which the cytoplasm is described with the well-mixed SSA model and a grid framework is employed to describe the reaction and diffusion processes on the heterogeneous plasma membrane could be appropriate (Costa et al., 2009). In this setup, updates for membrane-localized species were done for each individual molecule on the grid framework, whereas the populations of intracellular species were updated via SSA simulations. Molecules were moved from the membrane to the cytosol and vice versa based on the stoichiometry of the molecular events. The spatial MC was performed only for those reactions that contain at least one membrane-bound species. Time was updated in a “combined” approach by computing the total propensity. Using this framework, the effects of receptor clustering on downstream signal transduction were investigated. Simulation results indicated that spatial heterogeneity due to receptor-dense domains can have pronounced effects on the signaling patterns of downstream proteins, indicating the importance of spatiotemporal modeling (Costa et al., 2009). The above model can be taken one step further in complexity by extending the spatial grid into the intracellular space (Fig. 17.6)
Figure 17.6 Construction of a multigrid framework in multiresolution coarse-grained approaches. In this particular cell receptor signaling study, three levels of resolution were used: the top lattice (#1) contains seven layers at a z-lattice spacing of 10 nm. Its first and second layers respectively represent the plasma membrane and the part of the cytosol that forms the membrane boundary. The next two lattices (#2 and #3) are coarse grained at a z-spacing of 50 nm, and the last lattice (#4) is coarse grained to 100 nm. Layers 2–7 of lattice 1 and the bottom coarse-grained lattices 2–4 represent the cytoplasm. Layers in the lattices use a 50 × 50 grid mesh with 10 nm spacing in the x–y directions, and individual receptors are represented by their occupancy on a single grid unit. When bound to a membrane receptor, cytoplasmic species can diffuse as part of a receptor complex in the first layer of the top lattice. Upon dissociation from the receptor complex, adaptor proteins are placed in the second layer. Cytoplasmic species can diffuse between lattices as they move around in the cytoplasm.
(Costa et al., in preparation). It is desirable to individually track cytoplasmic adaptor proteins near the membrane, where location-specific effects are expected to be important. On the other hand, reactions are expected to occur homogeneously further downstream in the cytoplasm and, therefore, do not need to be tracked individually. Thus, the well-mixed SSA representation can be used for cytoplasmic regions away from the membrane. These different requirements can be met by applying coarse-graining along the axis normal to the cell membrane (the z-axis in the figure). Figure 17.6 shows a framework that uses three levels of resolution: the top lattice (#1) represents the plasma membrane and the membrane-proximal portion of the cytoplasm. This lattice contains seven layers at a z-lattice spacing of 10 nm. The first and second layers of the top lattice respectively represent the plasma membrane and the part of the cytoplasm adjacent to the
membrane. The next three lattices (#2–#4) are coarse grained at the indicated lengths. Layers 2–7 of lattice 1 and the coarse-grained lattices 2–4 together represent the cytoplasm in this framework. Conceptually, coarse-graining is equivalent to merging microscopic grid units into larger entities: a unit Ck in a coarse-grained lattice can consist of qk microscopic sites, where qk = qkx·qky·qkz and qkj is the number of microscopic partitions in Ck along the j axis (j = x, y, or z). To adjust for coarse-graining, the reaction rates are redefined according to the mean occupancies ρ̄k of the coarser grid units. For example, the diffusion transition rate from a coarse cell Ck to an adjacent coarse cell Cl in the z-direction is Γd,k→l = [qkΓd/(qk + ql)] ρ̄k(1 − ρ̄l), where Γd is the macroscopic diffusion rate. The coarse-grained reaction occurrence rates are adjusted analogously, but with ρ̄l defined as the mean concentration of the reactant l over the neighboring microscopic lattice sites. Detailed derivation of these rate expressions and further details on the implementation of adaptive coarse-graining techniques can be found in Chatterjee et al. (2004, 2005) and Chatterjee and Vlachos (2006).

2.3.3. Mixed representations and individual-based methods

Realistic modeling of heterogeneous biological systems often demands the incorporation of compartmental aspects in addition to explicit spatial modeling. Such mixed frameworks can provide significant advantages, particularly when the desired resolution requires the use of individual-based modeling (O'Donnell et al., 2007) for certain particles of the system (most commonly for the cells).
For heterogeneous systems where local environmental conditions can vary (e.g., soil systems for bacterial and fungal dynamics, or depth profiles in tissues or organs for eukaryotic systems), treating each cell as a separate entity with its own dynamics and rules about its behavior makes it possible to investigate the properties of multicellular systems more realistically while including the effects of local variations on the kinetics. Individual-based approaches, which are one type of agent-based modeling, have their roots in cellular automata. Each object/agent (which can be a real physical object or a fictitious object that plays a direct role in defining the dynamics) can be treated as having its own rules that define the dynamics of autonomous agents and how they interact with other objects. Rules can be based on the solution of mathematical equations that incorporate the internal and external conditions of the agents, or they can be logical rules that define the regulatory dynamics. Logical rules, too, can depend on an agent's environmental/boundary conditions. In most instances, logical rules lead to a hierarchy in the system, where the dynamics of the system takes the form of a branched tree in terms of decision-making mechanisms (Ratze et al., 2007). To give an example, say that receptor signaling in a 3D tissue is being investigated. The tissue can then be thought of as a collection of cell
objects/agents that organize to form a 3D structure. The receptor signaling patterns of the system objects (i.e., cells) can then be formulated using equations such as Eq. (17.1) reported above. In this case, the local concentration around every cell provides the input (local ligand concentration) to the network of that particular cell object. The dynamics of the ligand diffusing through the tissue can be modeled using Eq. (17.4), for example. These mathematical equations can be further supplemented with equations that describe the translational motion (i.e., motility) of the cell objects. Individual-based modeling approaches have been used to model bacterial systems and biofilms for some time (Kreft et al., 2001; Vlachos et al., 2006). For example, the BacSim modeling approach treats the biological system as consisting of two parts: bacterial cells as autonomous agents, and the metabolic substrates and products that diffuse and react in the surrounding environment. PDEs describing the metabolite dynamics are solved on a Cartesian grid, and the obtained steady-state values are then used to define the local environmental conditions of the cells. In the next step, the cellular dynamics are propagated using the determined metabolite concentrations, with cell growth and division decisions based on the growth pattern and size of the cells. Using the BacSim simulation software, Kreft and coworkers were able to simulate biofilm growth for a composite system consisting of nitrite- and ammonia-oxidizing organisms (Kreft et al., 2001). Example 6 The power of mixed modeling frameworks will be illustrated with an example from cellulose-degrading microbial communities in soil aggregates (Resat et al., in preparation). Although the realistic models are constructed in 3D, here we report results for 2D simulations to keep the illustration simple (Fig. 17.7).
The dynamics of the microbes were simulated in a cellulose-rich system in which polymeric cellulose is seeded at certain locations (a total of 12 grid points in four separate locations, labeled in yellow). The initial concentration of the deposited cellulose was set to 95 picograms of carbon (pgC) per grid unit. The initial configuration consisted of 20 randomly placed microbial cells in the aggregate. The individual diffusive movements of the cells were modeled as random Markov motion, while the resource distribution was computed on the grid using the reaction–diffusion equation. The top panels show the time progression of the concentration of cellobiose, the simpler carbon substrate, in the system. The modeled organism had a doubling time of 2 h when growing actively. It is clear from the time progression that the dynamics is regional in character. Since microbial cells graze around the system to locate and use the cellulose resource for growth, the growth and substrate consumption cycles differ between regions. Hydrolysis of cellulose precedes the active growth of the cells; cells first create a substrate-rich environment by hydrolyzing the cellulose and then use it to grow. As growth in turn increases the demand for carbon, hydrolysis and
Figure 17.7 Model for cellulose utilization by a microbial community. Although the actual model is constructed and simulated in 3D, the results of a 2D simulation are reported here to keep the illustration simple. (A) Metabolic network of the microbial cells. Cells take up the soluble carbon substrate S and use it for maintenance, protein synthesis, and biomass growth. Cells express (i) regulatory ribosomal proteins EnzP, which control the growth rate; (ii) transport enzymes EnzT, which facilitate substrate uptake; and (iii) hydrolases EnzX, which convert the polymeric carbon cellulose to simpler carbon forms such as cellobiose, which is consumed by the cells for maintenance and growth. The synthesis rates of these proteins depend on the growth state of the cells. (B) Spatiotemporal profile of cellobiose and biomass in the system. Simulations use a 31 × 31 grid with a 5 μm grid size; the dark blue areas are soil-covered grid points and the light blue units are pores in which the cells can reside. Panels report and compare the soluble substrate and the biomass in the system at the indicated times. Although the simulations keep track of individual cells, the total biomass in the grid units is reported using a color scale.
growth processes create a spatially dependent causal cycle. Tracking the cells individually made it possible to define the biochemical reactions locally, which implicitly incorporates the causal coupling between resource consumption and cellular metabolism into the unified biological model.
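A stripped-down sketch of the individual-based idea follows. It is a made-up toy, not the model of Example 6: agent cells random-walk on a small periodic grid and consume a locally deposited substrate field, omitting the enzyme kinetics, growth, and reaction–diffusion coupling of the actual study:

```python
import random

# Toy individual-based model: each cell is an autonomous agent that moves
# and then consumes substrate at its local grid unit (all values illustrative).
def run(steps, rng):
    substrate = {(x, y): 5.0 for x in range(10) for y in range(10)}  # 500 units total
    cells = [(rng.randrange(10), rng.randrange(10)) for _ in range(5)]
    consumed = 0.0
    for _ in range(steps):
        new_cells = []
        for (x, y) in cells:
            # agent rule 1: random-walk one step (periodic boundaries)
            dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
            x, y = (x + dx) % 10, (y + dy) % 10
            # agent rule 2: eat up to one unit of the local substrate
            bite = min(1.0, substrate[(x, y)])
            substrate[(x, y)] -= bite
            consumed += bite
            new_cells.append((x, y))
        cells = new_cells
    return substrate, consumed

substrate, consumed = run(50, random.Random(7))
# Mass balance: what the agents ate equals what left the field.
assert abs((500.0 - sum(substrate.values())) - consumed) < 1e-9
print(round(consumed, 1))
```

Coupling such agent rules to a gridded reaction–diffusion solver for the substrate, as in Example 6 and BacSim, turns this toy into a mixed individual-based/continuum framework.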
3. Summary and Future Prospects

Computational biology has progressed rapidly in recent years. It is now quite common to use computational models to analyze and integrate experimental observations and to make in silico predictions that can be validated in wet-lab experiments. Smart adaptations to biological problems of methods that have been used in other natural science fields, together with the development of modeling and simulation approaches that are particularly well suited to biology research, have fueled this progress. This chapter outlined the basics of how mathematical models can be constructed for biological systems, along with the approaches and algorithms used to simulate the dynamical behavior of the constructed models. Examples to illustrate the presented methods were chosen mostly from our own research for practical purposes. Excellent recent reviews summarize the substantial contributions of many groups to the computational biology field, and readers are referred to those publications for further reading on the state of the art in kinetic simulations. Even though the progress has been immense, there is still plenty of room for improvement in developing computationally efficient methods to build and simulate biological models. The need is particularly dire for methods that can handle multiple spatiotemporal scales while still providing the resolution necessary for accurate representation. The development of efficient and adaptive multiscale algorithms is critical. Stochastic methods are the preferred choice when resolution is paramount, and deterministic methods when computational cost is the main concern. However, stochastic methods can be extremely expensive computationally, and deterministic methods may not provide the needed resolution.
Hybrid schemes that split the system into continuous and discrete regimes to combine the traditional deterministic RRE with the SSA can nicely balance these extremes and show the best promise for progress (Elf et al., 2003; Isaacson and Peskin, 2004; Wagner et al., 2006). However, although they may be able to efficiently address the multiscale issues, there are still unsolved fundamental problems with hybrid methods as well. For example, automatic partitioning of the system into deterministic and stochastic parts needs to be done very carefully, because both accuracy and efficiency depend on a proper partitioning (Resat et al., 2009; Wagner et al., 2006). Similar concerns exist about automatic step-size selection. Still, continuing improvements in algorithm and software development, in model development, and in simulation methods are making kinetic simulations justly occupy a central role in ongoing systems biology research. While there is still plenty of room for improvement, the rapid past progress is an indicator of significant future advances.
ACKNOWLEDGMENTS

The research described in this chapter was funded by the National Institutes of Health Grant 5R01GM072821 to H. R. and by the Microbial Communities Initiative LDRD Program at the Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RL01830.
REFERENCES

Allison, S. D. (2005). Cheaters, diffusion and nutrients constrain decomposition by microbial enzymes in spatially structured environments. Ecol. Lett. 8, 626–635.
Arkin, A., et al. (1998). Stochastic kinetic analysis of developmental pathway bifurcation in phage lambda-infected Escherichia coli cells. Genetics 149, 1633–1648.
Ascher, U. M., and Petzold, L. R. (1998). Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. SIAM, Philadelphia.
Berg, H. C. (1993). Random Walks in Biology. Princeton University Press, Princeton, NJ.
Beyenal, H., et al. (2004). Three-dimensional biofilm structure quantification. J. Microbiol. Methods 59, 395–413.
Birtwistle, M. R., and Kholodenko, B. N. (2009). Endocytosis and signalling: A meeting with mathematics. Mol. Oncol. 3, 308–320.
Blinov, M. L., et al. (2004). BioNetGen: Software for rule-based modeling of signal transduction based on the interactions of molecular domains. Bioinformatics 20, 3289–3291.
Bortz, A. B., et al. (1975). New algorithm for Monte-Carlo simulation of Ising spin systems. J. Comput. Phys. 17, 10–18.
Burke, P., et al. (2001). Regulation of epidermal growth factor receptor signaling by endocytosis and intracellular trafficking. Mol. Biol. Cell 12, 1897–1910.
Chatterjee, A., and Vlachos, D. G. (2006). Multiscale spatial Monte Carlo simulations: Multigriding, computational singular perturbation, and hierarchical stochastic closures. J. Chem. Phys. 124, 64110.
Chatterjee, A., et al. (2004). Spatially adaptive lattice coarse-grained Monte Carlo simulations for diffusion of interacting molecules. J. Chem. Phys. 121, 11420–11431.
Chatterjee, A., et al. (2005). Spatially adaptive grand canonical ensemble Monte Carlo simulations. Phys. Rev. E Stat. Nonlin. Soft. Matter. Phys. 71, 026702.
Chaturvedi, S., et al. (1977). Stochastic analysis of a chemical reaction with spatial and temporal structures. J. Stat. Phys. 17, 469–489.
Collins, S. D., et al. (2008). Coarse-grained kinetic Monte Carlo models: Complex lattices, multicomponent systems, and homogenization at the stochastic level. J. Chem. Phys. 129, 184101.
Costa, M. N., et al. (2009). Coupled stochastic spatial and non-spatial simulations of ErbB1 signaling pathways demonstrate the importance of spatial organization in signal transduction. PLoS ONE 4, e6316.
Elf, J., et al. (2003). Mesoscopic Reaction-Diffusion in Intracellular Signaling. SPIE's "First International Symposium on Fluctuations and Noise" Vol. 5110, pp. 114–124.
Fallahi-Sichani, M., and Linderman, J. J. (2009). Lipid raft-mediated regulation of G-protein coupled receptor signaling by ligands which influence receptor dimerization: A computational study. PLoS ONE 4, e6604.
Gillespie, D. T. (1976). A general method for numerically simulating stochastic time evolution of coupled chemical-reactions. J. Comput. Phys. 22, 403–434.
Gillespie, D. T. (1977a). Concerning validity of stochastic approach to chemical-kinetics. J. Stat. Phys. 16, 311–318.
Gillespie, D. T. (1977b). Exact stochastic simulation of coupled chemical-reactions. J. Phys. Chem. 81, 2340–2361.
Gillespie, D. T. (1978). Monte-Carlo simulation of random-walks with residence time-dependent transition-probability rates. J. Comput. Phys. 28, 395–407.
Gillespie, D. T. (1992a). Markov Processes: An Introduction for Physical Scientists. Academic Press, London, UK.
Gillespie, D. T. (1992b). A rigorous derivation of the chemical master equation. Physica A 188, 404–425.
Gillespie, D. T. (2000). The chemical Langevin equation. J. Chem. Phys. 113, 297–306.
Ho, T. C. (2008). Kinetic modeling of large-scale reaction systems. Catal. Rev. Sci. Eng. 50, 287–378.
Isaacson, S. A., and Peskin, C. S. (2006). Incorporating diffusion in complex geometries into stochastic chemical kinetics simulations. SIAM J. Sci. Comput. 28, 47–74.
Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Z. Phys. 31, 253–258.
Judd, E. M., et al. (2003). Fluorescence bleaching reveals asymmetric compartment formation prior to cell division in Caulobacter. Proc. Natl. Acad. Sci. USA 100, 8235–8240.
Kholodenko, B. N. (2003). Four-dimensional organization of protein kinase signaling cascades: The roles of diffusion, endocytosis and molecular motors. J. Exp. Biol. 206, 2073–2082.
Kholodenko, B. N. (2006). Cell-signaling dynamics in time and space. Nat. Rev. Mol. Cell Biol. 7, 165–176.
Kreft, J. U., et al. (2001). Individual-based modelling of biofilms. Microbiology 147, 2897–2912.
Lemerle, C., et al. (2005). Space as the final frontier in stochastic simulations of biological systems. FEBS Lett. 579, 1789–1794.
McAdams, H. H., and Arkin, A. (1997). Stochastic mechanisms in gene expression. Proc. Natl. Acad. Sci. USA 94, 814–819.
McAdams, H. H., and Arkin, A. (1999). It's a noisy business! Genetic regulation at the nanomolar scale. Trends Genet. 15, 65–69.
McAdams, H. H., and Shapiro, L. (2003). A bacterial cell-cycle regulatory network operating in time and space. Science 301, 1874–1877.
Noguera, D. R., et al. (1999a). Biofilm modeling: Present status and future directions. Water Sci. Technol. 39, 273–278.
Noguera, D. R., et al. (1999b). Simulation of multispecies biofilm development in three dimensions. Water Sci. Technol. 39, 123–130.
O'Donnell, A. G., et al. (2007). Visualization, modelling and prediction in soil microbiology. Nat. Rev. Microbiol. 5, 689–699.
Ozcelik, S., et al. (2004). FRET measurements between small numbers of molecules identifies subtle changes in receptor interactions. Proc. Int. Soc. Opt. Eng. 5323, 119–127.
Pettigrew, M. F., and Resat, H. (2005). Modeling signal transduction networks: A comparison of two stochastic kinetic simulation algorithms. J. Chem. Phys. 123, 114707.
Picioreanu, C., et al. (2005). Multidimensional modelling of anaerobic granules. Water Sci. Technol. 52, 501–507.
Picioreanu, C., et al. (2007). Microbial motility involvement in biofilm structure formation—A 3D modelling study. Water Sci. Technol. 55, 337–343.
Potts, R. B. (1952). Some generalized order-disorder transformations. Proc. Camb. Philol. Soc. 48, 106–109.
Ratze, C., et al. (2007). Simulation modelling of ecological hierarchies in constructive dynamical systems. Ecol. Complex. 4, 13–25.
Multiscale Aspects of Biological Simulations
511
Raue, A., et al. (2009). Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25, 1923–1929. Resat, H., et al. (2001). Probability-weighted dynamic Monte Carlo method for reaction kinetics simulations. J. Phys. Chem. B 105, 11026–11034. Resat, H., et al. (2003). An integrated model of epidermal growth factor receptor trafficking and signal transduction. Biophys. J. 85, 730–743. Resat, H., et al. (2009). Kinetic modeling of biological systems. In “Computational Systems Biology,” ( J. McDermott, et al., eds.). Humana Press. Scheibe, T. D., et al. (2009). Coupling a genome-scale metabolic model with a reactive transport model to describe in situ uranium bioremediation. Microb. Biotechnol. 2, 274–286. Shankaran, H., et al. (2006). Modeling the effects of HER/ErbB1-3 coexpression on receptor dimerization and biological response. Biophys. J. 90, 3993–4009. Shankaran, H., et al. (2008). Quantifying the effects of co-expressing EGFR and HER2 on HER activation and trafficking. Biochem. Biophys. Res. Commun. 371, 220–224. Slepchenko, B. M., et al. (2003). Quantitative cell biology with the Virtual Cell. Trends Cell Biol. 13, 570–576. Stiles, J. R., and Bartol, T. M. (2001). Monte Carlo methods for simulating realistic synaptic microphysiology using MCell. In “Computational Neuroscience: Realistic Modeling for Experimentalists,” (E. De Schutter, ed.), pp. 87–127. CRC Press. Stundzia, A. B., and Lumsden, C. J. (1996). Stochastic simulation of coupled reactiondiffusion processes. J. Comput. Phys. 127, 196–207. Turner, T. E., et al. (2004). Stochastic approaches for modelling in vivo reactions. Comput. Biol. Chem. 28, 165–178. Viollier, P. H., et al. (2004). Rapid and sequential movement of individual chromosomal loci to specific subcellular locations during bacterial DNA replication. Proc. Natl. Acad. Sci. USA 101, 9257–9262. Vlachos, C., et al. (2006). 
A rule-based approach to the modelling of bacterial ecosystems. Biosystems 84, 49–72. Wagner, H., et al. (2006). COAST: Controllable approximative stochastic reaction algorithm. J. Chem. Phys. 125, 174104. Wolf-Yadlin, A., et al. (2006). Effects of HER2 overexpression on cell signaling networks governing proliferation and migration. Mol. Syst. Biol. 2, 54. Wu, F.-Y. (1982). The Potts model. Rev. Mod. Phys. 54, 235–268. Xavier, J. B., et al. (2005). A framework for multidimensional modelling of activity and structure of multispecies biofilms. Environ. Microbiol. 7, 1085–1103. Xavier, J. B., et al. (2007). Multi-scale individual-based model of microbial and bioconversion dynamics in aerobic granular sludge. Environ. Sci. Technol. 41, 6410–6417. Zhang, L., et al. (2009a). Multiscale agent-based cancer modeling. J. Math. Biol. 58, 545–559. Zhang, Y., et al. (2009b). HER/ErbB receptor interactions and signaling patterns in human mammary epithelial cells. BMC Cell Biol. 10, 68.
CHAPTER EIGHTEEN
Computational Approaches to Modeling Viral Structure and Assembly

Stephen C. Harvey,*,† Anton S. Petrov,* Batsal Devkota,‡ and Mustafa Burak Boz†

Contents

1. Introduction
2. Double-Stranded DNA (dsDNA) Bacteriophage
   2.1. DNA models
   2.2. Capsid models
   2.3. Packaging protocols
   2.4. Data analysis
   2.5. Ejection protocols
   2.6. Results
3. Single-Stranded RNA Viruses
   3.1. A specific model system: Pariacoto virus (PaV)
   3.2. Conversion of RNA secondary structure into a 3D coarse-grained model
   3.3. PaV: The RNA model
   3.4. PaV: Adding the capsid to the model
   3.5. PaV: Results
Acknowledgments
References
Abstract

The structures of biological macromolecules and macromolecular assemblies can be experimentally determined by X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM). The refinement of such structures is a difficult task, because of the size of the experimental data sets, and because of the very large number of degrees of freedom. Molecular modeling tools—particularly those based on the principles of molecular mechanics—have long been employed to assist in the refinement of macromolecular structures. Molecular mechanics methods are also used to generate de novo models when only limited experimental data are available. Ideally, such models provide information on structure–function relationships, and on the thermodynamic and kinetic properties of the system of interest. Here, we summarize some of the molecular mechanics methods used to investigate questions of viral structure and assembly, including both all-atom and coarse-grained approaches.

* School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA
† School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia, USA
‡ Research Collaboratory for Structural Biology, Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, New Jersey, USA

Methods in Enzymology, Volume 487; ISSN 0076-6879; DOI: 10.1016/S0076-6879(11)87018-7
© 2011 Elsevier Inc. All rights reserved.
1. Introduction

The simplest viruses have a nucleic acid genome that is surrounded by a protein capsid. Genomes can be single-stranded or double-stranded, and they may be either DNA or RNA. In some viruses, the capsid proteins spontaneously assemble into a procapsid that matures as the genome is inserted in an energy-consuming process. In others, capsid formation requires the proteins to bind to a genome that has already been partially or completely synthesized. Assembly is a critical step in the life cycle of viruses, so a detailed understanding of assembly might offer new opportunities for the design of antiviral agents. In addition, the design of novel nanoparticles might be based on the principles of viral assembly. A wide variety of experimental, theoretical, and computational studies have been aimed at increasing our understanding of viral assembly (Jardine and Anderson, 2006; Johnson and Chiu, 2007; Knobler and Gelbart, 2009; Petrov and Harvey, 2008). Wherever possible, atomic detail is desirable, but all-atom modeling is not always possible. Sometimes, there are not sufficient data to provide an atomistic representation. Sometimes—even if the structure is known in atomistic detail—simulations on biologically relevant time scales are computationally intractable. In such cases, investigators often resort to lower-resolution coarse-grained models. Here, we review methods for studying the structure and assembly of small icosahedral DNA and RNA viruses, sometimes with coarse-grained approaches, and sometimes combining all-atom and coarse-grained methods.
2. Double-Stranded DNA (dsDNA) Bacteriophage

Bacteriophages are viruses that infect bacteria. They consist of a protein shell (capsid) surrounding a DNA or RNA genome. Bacteriophage capsids vary in size (from several hundred to several thousand Ångströms), shape (from isometric to highly elongated, with axial ratios up to 5:1), and T number (from 1 to 7; Ackermann and DuBow, 1987; Granoff and Webster, 1999). The genome of most bacteriophages is dsDNA and ranges in size from about 20,000 to 150,000 bp. The genome generally occupies 30–50% of the available volume inside the capsid (Purohit et al., 2005). Packaging dsDNA into this highly compacted state requires energy to overcome electrostatic repulsions, hydration forces, and the loss of conformational entropy. DNA is forced into the bacteriophage by an ATP-driven protein motor located at one vertex of the icosahedral capsid (Smith et al., 2001). In vivo, packaging has a characteristic timescale on the order of minutes. Because of the large size of bacteriophages and the time scale of packaging, all-atom simulations of packaging using conventional molecular dynamics (MD) are not possible, and it is necessary to use coarse-grained models. This is not a serious limitation, however, as many of the structural, kinetic, and thermodynamic aspects of DNA packaging are well described by simplified low-resolution models. Here, we discuss coarse-grained models used to represent the constituents of bacteriophages (i.e., dsDNA, the capsid, and the protein portal and core structure). We also summarize our studies on the packaging of DNA into bacteriophages, and on the ejection of DNA from the capsid into the host bacterium.
2.1. DNA models

In our simulations, we have two distinct DNA models (Locker and Harvey, 2006; Rollins et al., 2008; Tan et al., 2006). The first model represents dsDNA as a string of beads on a chain, with each spherical bead (pseudoatom) representing N consecutive base pairs. In our viral packaging studies, we most commonly use a model with N = 6, which we designate 1DNA6 (Fig. 18.1A). The model accounts for the stiffness of stretching and bending, volume exclusion effects, and long-range interactions between DNA strands, but it excludes torsional stiffness from consideration. The elastic stretching and bending properties of DNA are reproduced by appropriately parameterized harmonic terms for bond stretching and bond angle bending:

$$E_{\mathrm{bond}} = k_b\,(b - b_0)^2 \qquad (18.1)$$

and

$$E_{\mathrm{angle}} = k_\theta\,(\theta - \theta_0)^2 \qquad (18.2)$$

$k_b$ and $k_\theta$ are the stretching and bending force constants, $b_0$ is the equilibrium value of the distance between two consecutive beads, and $\theta_0$ is the
Figure 18.1 Coarse-grained models for DNA. (A) An all-atom representation of double-helical DNA (bottom) can be simplified to the 1DNA model, with one spherical pseudoatom per base pair (lower middle). Further coarse-graining leads to the 1DNA6 model, with one bead for every 6 bp (upper middle). Both the 1DNA and 1DNA6 models have pseudoatoms with a diameter of 25 Å. Chain stretching is opposed by elastic bonds, while bending is opposed by elastic bond angle terms. All four representations in this panel are shown to the same scale, but the radii of the beads have been scaled down for graphical purposes in the top representation of the 1DNA6 model, to permit visualization of one bond length (b) and one angle (θ). (B) The 3DNA model, in which the energetic cost of torsional deformations is included. Each base pair is represented by three pseudoatoms: the center atom (C), lying on the axis of the double-stranded DNA molecule; the "left" dummy atom (L), whose position approximates that of one phosphate group; and the "front" dummy atom (F), which lies in the major groove. The stretching elastic modulus determines the force constant for the harmonic bond between successive C atoms. Bending stiffness requires parameterization of several bond angles, for example, F1–C1–C2; L1–C1–C2; C1–C2–C3; C1–C2–F2; C1–C2–L2. Torsional stiffness requires parameterization of two improper torsions per base pair step, for example, F1–C1–C2–F2 and L1–C1–C2–L2. Volume exclusion is treated through the radius of the C atoms, since the dummy F and L atoms have no volume; it is identical to the volume exclusion of the 1DNA model. We can generate a double-helical graphical representation of any conformation by reversing each C–L vector to generate "right" dummy atoms located symmetrically opposite each L atom. This model can be further coarse-grained to the 3DNA6 model (not shown) by eliminating the pseudoatoms for base pairs 2–6 and making appropriate choices of parameters for bond stretching, angle bending, and improper torsions for the successive triads representing base pairs 1, 7, 13, and so on. The volume of the 3DNA6 model is essentially identical to that of the 1DNA6 model.
equilibrium bending angle for consecutive triplets. The stretching modulus was parameterized from the variance in the distance between successive base pairs (rise) of B-DNA from the Nucleic Acids Data Bank (www.pdb.org; Berman et al., 2000), and the bending modulus was parameterized to reproduce the DNA persistence length of 510 Å (Hagerman, 1988). The details of the parameterization are given elsewhere (Locker and
Harvey, 2006; Locker et al., 2007). In the 1DNA6 model, the numerical values of the parameters are k_b = 3.5 kcal/(mol Å²), b_0 = 19.9 Å, k_θ = 22.4 kcal/(mol rad²), and θ_0 = 0.

To avoid interpenetration between DNA strands, each bead is spherical, with a radius of 12.5 Å. Nonbonded (volume exclusion) interactions are modeled by a semiharmonic repulsive potential, often called a "soft sphere" potential:

$$E_{\mathrm{nb}} = \begin{cases} k_{\mathrm{nb}}\,(d_0 - d)^2, & d < d_0 \\ 0, & d \ge d_0 \end{cases} \qquad (18.3)$$

where d is the distance between the two interacting pseudoatoms, k_DNA–DNA = 11.0 kcal/(mol Å²), and d_0 = 25.0 Å. When modeling DNA as a simple elastic polymer (ignoring electrostatic effects), we used a cutoff of 50 Å for all volume exclusion calculations.

The second model allows the definition of a local DNA twist angle and the inclusion of torsional stiffness in the simulation (Fig. 18.1B). It contains two additional "left" and "front" dummy atoms attached to the central bead and placed orthogonally to the DNA helical axis (Rollins et al., 2008; Tan and Harvey, 1989; Tan et al., 2009). In the original model ("3DNA1"), each triad of atoms defines a plane representing a single base pair; the "left" atom points toward the position of one backbone phosphate group, and the "front" atom defines the major groove of dsDNA. The torsional stiffness of DNA is represented by defining an improper torsion angle about the bond connecting successive backbone beads (Fig. 18.1B), with deformation energy

$$E_{\mathrm{improper}} = k_\phi\,(\phi - \phi_0)^2 \qquad (18.4)$$

and by proper choice of the torsional force constant k_φ. The 3DNA1 model is suitable for studying supercoiling in closed circular DNAs with lengths up to about 3000 bp (Tan et al., 1996), but its application to bacteriophage systems is impractical because of their sizes. We use a coarser version of this model for large DNA molecules, with N base pairs represented by a single triad. The 3DNA6 model has N = 6, and we used it in our investigations of the effects of torsional stiffness on viral packaging (Rollins et al., 2008). The 1DNA and 3DNA models are easily parameterized for other values of N (Tan et al., 2006).

DNA is a charged polyelectrolyte, so it is essential to describe DNA–DNA interactions as accurately as possible. Experimental data on osmotic pressure show that this interaction is very complex (Parsegian et al., 1995, 2000). In monovalent salts, DNA molecules are electrostatically repelled, though these repulsions are partially screened by counterions at long range.
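The harmonic and soft-sphere terms above are straightforward to evaluate. The following NumPy sketch (our own illustration, not the authors' YUP implementation; it omits the improper-torsion term of the 3DNA model for brevity) computes Eqs. (18.1)–(18.3) for a 1DNA6 chain with the parameters quoted in the text.

```python
import numpy as np

# 1DNA6 parameters quoted in the text (illustrative sketch, not YUP code)
KB = 3.5        # bond stretching constant, kcal/(mol A^2)
B0 = 19.9       # equilibrium bead separation, A
KTHETA = 22.4   # bending constant, kcal/(mol rad^2)
THETA0 = 0.0    # equilibrium bending angle (straight chain), rad
KNB = 11.0      # DNA-DNA soft-sphere constant, kcal/(mol A^2)
D0 = 25.0       # soft-sphere contact distance (bead diameter), A

def bond_energy(xyz):
    """Eq. (18.1): sum of k_b (b - b0)^2 over consecutive beads."""
    b = np.linalg.norm(np.diff(xyz, axis=0), axis=1)
    return np.sum(KB * (b - B0) ** 2)

def angle_energy(xyz):
    """Eq. (18.2): sum of k_theta (theta - theta0)^2 over bead triplets.
    theta is the deviation from a straight chain, so theta0 = 0."""
    v = np.diff(xyz, axis=0)
    u1, u2 = v[:-1], v[1:]
    cosang = np.sum(u1 * u2, axis=1) / (
        np.linalg.norm(u1, axis=1) * np.linalg.norm(u2, axis=1))
    theta = np.arccos(np.clip(cosang, -1.0, 1.0))
    return np.sum(KTHETA * (theta - THETA0) ** 2)

def soft_sphere_energy(xyz):
    """Eq. (18.3): semiharmonic repulsion between nonbonded bead pairs."""
    n = len(xyz)
    e = 0.0
    for i in range(n):
        for j in range(i + 2, n):       # skip directly bonded neighbors
            d = np.linalg.norm(xyz[i] - xyz[j])
            if d < D0:
                e += KNB * (D0 - d) ** 2
    return e

# a straight 10-bead chain at the equilibrium spacing has zero energy
chain = np.array([[i * B0, 0.0, 0.0] for i in range(10)])
print(bond_energy(chain) + angle_energy(chain) + soft_sphere_energy(chain))
```

Stretching the terminal bond by 1 Å raises the energy by exactly k_b, which is a convenient sanity check on the parameterization.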
At short distances (25–30 Å), hydration forces become important. These are due to the loss of conformational freedom of water molecules at the DNA surface. Trivalent or tetravalent cations in solution cause DNA condensation (Bloomfield, 1991; Hud and Vilfan, 2005). Because of the complexity of the problem and the very large size of bacteriophage systems, we used a phenomenological approach to describe DNA–DNA interactions; instead of providing the exact physical formulation for every component of this interaction, we derived a set of functions and parameters that accurately match the experimental potentials of mean force of DNA interactions in vitro. We treat two regimes: the repulsive regime is observed in the presence of most monovalent and divalent cations, while the attractive regime appears upon the addition of condensing agents (trivalent and tetravalent cations). For the repulsive regime, we empirically derived the functional form of DNA–DNA interactions from the experimental data of Rau and Parsegian (Rau et al., 1984) and modeled them as a function of distance, r, by a modified Debye–Hückel function (Petrov and Harvey, 2007):

$$E^{\mathrm{rep}}_{\mathrm{DNA-DNA}}(r) = 0.59\,L_b\,\frac{q_{\mathrm{eff}}^2\,\exp[-\kappa_{\mathrm{eff}}(r - 2a)]}{r} \qquad (18.5)$$

where L_b = 7.135 Å is the Bjerrum length and 0.59 is the conversion factor to kcal/mol. The other parameters (effective charge, q_eff = 12.6e per pseudoatom; effective screening constant, κ_eff = 0.31 Å⁻¹; and DNA radius, a = 10.0 Å) correspond to a buffer containing 10 mM MgCl2, 100 mM NaCl, and 10 mM TrisCl. The interaction between DNA double helices in the attractive regime is described by the following empirical relationship, applied to pairs of DNA pseudoatoms in separate double helices, separated by a distance r:

$$E^{\mathrm{attr}}_{\mathrm{DNA-DNA}}(r) = A_1\left[\exp\!\left(\frac{2(b_1 - r)}{c_1}\right) - 2\exp\!\left(\frac{b_1 - r}{c_1}\right)\right] - A_2\left[\exp\!\left(\frac{2(b_2 - r)}{c_2}\right) - 2\exp\!\left(\frac{b_2 - r}{c_2}\right)\right] \qquad (18.6)$$

with A_1 = 0.011 kcal/(mol bp), A_2 = 0.012 kcal/(mol bp), b_1 = 30.5 Å, b_2 = 37.5 Å, c_1 = 2.6 Å, and c_2 = 2.2 Å. The parameters were derived to match the data for the attractive interactions occurring in the range r ≈ 25–34 Å, with a minimum of −130 cal/(mol bp) at r ≈ 27.2 Å (Tzlil et al., 2003), and the repulsive interactions in the range 35–50 Å observed experimentally by osmotic pressure data obtained in the presence of polycations (Rau and Parsegian, 1992). A cutoff of 70 Å was used to treat all long-range DNA–DNA interactions. We stress that the parameterization was done to
mimic properties of DNA free in solution, and there are no free parameters in our model that must be adjusted to match force–distance curves or other data from viral packaging experiments.
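The repulsive pair potential of Eq. (18.5) can be evaluated directly from the quoted parameters. The sketch below is our own illustration (variable and function names are ours); it confirms that the interaction decays smoothly with distance, which is why a 70 Å cutoff is adequate.

```python
import math

# Parameters quoted for Eq. (18.5); this is an illustrative sketch only
LB = 7.135      # Bjerrum length, A
QEFF = 12.6     # effective charge per pseudoatom, in units of e
KAPPA = 0.31    # effective screening constant, 1/A
A_DNA = 10.0    # DNA radius, A
CONV = 0.59     # conversion factor to kcal/mol

def e_rep(r):
    """Eq. (18.5): screened electrostatic repulsion between two DNA
    pseudoatoms on separate helices, separated by r (A); kcal/mol."""
    return CONV * LB * QEFF ** 2 * math.exp(-KAPPA * (r - 2.0 * A_DNA)) / r

# the repulsion falls off steeply; a 70 A cutoff truncates only the tail
print([round(e_rep(r), 4) for r in (25.0, 35.0, 50.0, 70.0)])
```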
2.2. Capsid models

The protein–protein interactions in a bacteriophage capsid are relatively strong, and the capsid assembles spontaneously in the absence of genomic DNA. In contrast, the interactions between DNA and the walls of the capsid are relatively weak. The major role of the capsid proteins is to keep the DNA stored inside the capsid volume under high pressure after it is packaged. Thus, the capsids in our models play the role of a container that confines the DNA within a volume of defined geometry. We implemented two different approaches to model bacteriophage capsids.

Many bacteriophage capsids have an isometric, icosahedral morphology. The simplest approximation for such a capsid is a sphere. We model spherical capsids by placing an additional dummy atom at the center of the spherical cavity of radius R and applying semiharmonic restraints between this pseudoatom and all DNA pseudoatoms. We call this energy function a "NOEN," because of its resemblance to the semiharmonic restraint often used in the refinement of nuclear magnetic resonance structures using contacts detected by the Nuclear Overhauser Effect. The energy is zero for any pseudoatom that lies within the sphere, and the energy penalty rises quadratically for pseudoatoms that violate the spherical boundary. The dummy atom does not move in response to the NOEN forces, and the energy for a pseudoatom at a distance d from the center of the sphere is

$$E_{\mathrm{NOEN}} = \begin{cases} k_{\mathrm{NOEN}}\,(d - R)^2, & d \ge R \\ 0, & d < R \end{cases} \qquad (18.7)$$

where k_NOEN = 8.8 kcal/(mol Å²).

Some spherical dsDNA viruses, for example, bacteriophage Lambda, undergo a significant capsid expansion during maturation (Lander et al., 2008). Partially packaged DNA pushes against the capsid walls and triggers the transition of the capsid proteins to a new conformation. The expansion also affects the thermodynamics of the packaging process.
In order to account for the expansion in a phenomenological fashion when modeling Lambda, we gradually increased the radius of confinement from 210 to 290 Å between 20% and 40% of the Lambda genome packaged, which is the range where expansion occurs (Dokland and Murialdo, 1993; Fuller et al., 2007). This simple model of capsid expansion is empirical and does not contain any regulatory feedback mechanism.
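The NOEN restraint of Eq. (18.7), together with the Lambda expansion schedule, can be sketched as follows. This is our own illustration, not the YUP implementation; in particular, the linear interpolation of the radius between 20% and 40% packed is our reading of the prescription in the text.

```python
import numpy as np

K_NOEN = 8.8   # kcal/(mol A^2), from Eq. (18.7)

def noen_energy(xyz, radius, center=np.zeros(3)):
    """Eq. (18.7): zero inside the sphere; quadratic penalty outside.
    The central dummy atom never moves, so only DNA beads feel a force."""
    d = np.linalg.norm(xyz - center, axis=1)
    viol = np.maximum(d - radius, 0.0)
    return np.sum(K_NOEN * viol ** 2)

def lambda_radius(frac_packed):
    """Phenomenological capsid expansion for Lambda: radius grows from
    210 to 290 A between 20% and 40% of the genome packaged (linear
    interpolation is our assumption)."""
    return 210.0 + 80.0 * np.clip((frac_packed - 0.2) / 0.2, 0.0, 1.0)

beads = np.array([[0.0, 0.0, 200.0], [0.0, 0.0, 215.0]])
print(noen_energy(beads, lambda_radius(0.1)))  # second bead violates R = 210
```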
The second model describes the capsid as a polyhedron: an icosahedron, an elongated icosahedron, or a more complex polyhedron (Fig. 18.2). We build such models from a set of triangular faces, edges, and vertices, each of which is filled with a set of spherical pseudoatoms. The function of these spheres is to prevent the DNA chain from leaking out of the capsid, so the most important parameter of this model is the density of soft spheres; too low a density runs the risk of DNA escape, while too high a density increases simulation time. We cover the capsid surface with a hexagonal array of soft spheres, each with a radius of 8 Å, and we have found that the minimum density required to keep DNA inside the capsid corresponds to a separation between the spheres of 28 Å (Petrov and Harvey, 2007). The interactions between DNA and the soft spheres are purely repulsive (Eq. (18.3)); the parameters for the DNA–capsid interactions are k_nb = 8.8 kcal/(mol Å²) and d_0 = 20.5 Å.

Both of the above capsid models may (optionally) have an additional feature. In bacteriophages such as T7 (Agirrezabala et al., 2005), epsilon15 (Jiang et al., 2006), and P22 (Lander et al., 2006), there are portal proteins at one of the capsid's vertices, in addition to the motor assembly. Sometimes, there is a well-developed structure (the core) that propagates into the viral interior, occupying as much as 15–20% of the inside volume of the capsid. The presence of a core structure can affect both the DNA conformation inside the bacteriophage and the thermodynamics of DNA packaging, so we have included cores in the models for those viruses where they are known to occur. The simplest model of the core structure is a
Figure 18.2 The model capsid for epsilon15. The triangular faces, edges, and vertices of the icosahedral capsid are defined by collections of appropriately placed pseudoatoms, which are shown as opaque spheres in the left panel; a fragment of the portal structure emerges from the vertex at the bottom of the capsid. The full structure of the portal and core assembly is made visible in the right panel by making the capsid pseudoatoms transparent. The portal/core assembly is composed of a series of coaxial cylinders whose diameters are based on the structures seen in cryo-electron microscopic reconstructions (Jiang et al., 2006).
hollow cylinder with an inner diameter of 30–40 Å, composed of soft spheres identical to those in the capsid walls. The outer radius and the length of the cylinder depend on the particular bacteriophage. In a few bacteriophages, for example, epsilon15, the outer radius of the protein portal varies with depth inside the capsid (Jiang et al., 2006). We use a set of hollow, connected, coaxial cylinders to model such complex geometries, for example, Fig. 18.2 (Petrov et al., 2007a).
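One simple way to pave a cylindrical core section with soft spheres is to stack rings of pseudoatoms along the cylinder axis. The sphere spacing follows the ~28 Å figure given for the capsid walls; the layout itself, and the function name, are our own illustrative choices, not the authors' exact construction.

```python
import numpy as np

def cylinder_spheres(radius, length, spacing=28.0):
    """Return (n, 3) coordinates of soft-sphere centers paving the wall
    of a hollow cylinder of the given radius and length (Angstroms)."""
    # number of spheres per ring, from the circumference and spacing
    n_ring = max(3, int(round(2.0 * np.pi * radius / spacing)))
    phis = 2.0 * np.pi * np.arange(n_ring) / n_ring
    zs = np.arange(0.0, length + 1e-9, spacing)   # ring positions on axis
    pts = [(radius * np.cos(p), radius * np.sin(p), z)
           for z in zs for p in phis]
    return np.array(pts)

# one section of a core model: outer radius 35 A, length 200 A
core = cylinder_spheres(radius=35.0, length=200.0)
print(core.shape)
```

Concatenating several such sections with different radii gives the connected, coaxial-cylinder geometry of Fig. 18.2.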
2.3. Packaging protocols

The packaging of the DNA genome into a bacteriophage is not a spontaneous process but is driven by a motor. The current level of simulation cannot model the dynamics of the motor itself, only the phenomenological result of its action. In the framework of our model, packaging is driven by four auxiliary atoms ("stud atoms") separated by exactly the equilibrium distance between DNA pseudoatoms, b_0, and placed along the DNA axis, either outside of the capsid or inside the core structure, if present. Four successive DNA pseudoatoms (j through j + 3) are attached via harmonic springs to the stud atoms (Locker and Harvey, 2006; Petrov and Harvey, 2007). The functional form of the stud energy is identical to Eq. (18.1), with b_0 = 0 and a force constant of 0.01 pN/Å. We ratchet the DNA forward into the capsid in a series of steps. The first half-step is achieved by moving the stud positions toward the center of the capsid by a distance of b_0/2, followed by extensive equilibration using MD, to gradually move the DNA forward the same distance. The other half-step involves resetting the stud atoms back to their original positions and changing the harmonic restraints so that the studs are now attached to DNA pseudoatoms j + 1 through j + 4. Again, extensive MD equilibration moves the DNA forward by a distance of b_0/2. All MD trajectories were generated using the YUP package (Tan et al., 2006), which is specifically designed for molecular modeling of coarse-grained systems. Simulations were performed with a time step of 1 ps in the repulsive regime and 0.5 ps in the attractive regime. Packaging was performed at 300 K by coupling the system to a Berendsen thermostat (Berendsen et al., 1984). The nonbonded lists were updated every 10 steps.
Extensive equilibration is required during each step to ensure that the structural and thermodynamic properties remain close to equilibrium along the packaging trajectory. Each simulation begins with an equilibration time of 6 ns per half-step. As more DNA is crowded into the capsid, it takes longer to equilibrate the structure after each advance, so the equilibration time is increased linearly by 4–8 ps per monomer as packaging progresses. The total trajectory time depends on the size of the model genome but typically ranges from 10 to 250 μs (Petrov and Harvey, 2007).
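The two half-steps of the ratcheting protocol can be sketched as a simple loop. YUP drives the real simulations; `equilibrate_md` below is a stand-in stub, and all bookkeeping names are ours, not YUP's API.

```python
# Sketch of the ratcheting protocol described above (illustrative only)
B0 = 19.9          # equilibrium bead spacing, A
N_STUDS = 4        # four stud atoms drive four consecutive DNA beads

def equilibrate_md(stud_z, attached_beads, time_ns):
    """Placeholder for the MD equilibration that drags the DNA along."""
    pass

def package(n_beads, base_positions):
    """Ratchet beads 0..n_beads-1 through the portal, half a bead
    spacing at a time. base_positions are the resting stud z-coordinates."""
    j = 0
    steps = []
    while j + N_STUDS + 1 <= n_beads:
        attached = list(range(j, j + N_STUDS))
        # first half-step: studs advance by b0/2, DNA relaxes after them
        advanced = [z - B0 / 2.0 for z in base_positions]
        equilibrate_md(advanced, attached, time_ns=6.0)
        steps.append(("advance", tuple(attached)))
        # second half-step: studs snap back and grab the next bead window
        j += 1
        attached = list(range(j, j + N_STUDS))
        equilibrate_md(base_positions, attached, time_ns=6.0)
        steps.append(("reattach", tuple(attached)))
    return steps

log = package(6, [0.0, B0, 2 * B0, 3 * B0])
print(len(log))
```

Each pair of entries in `log` advances the chain by one full bead spacing, mirroring the advance/reattach cycle in the text.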
2.4. Data analysis

The MD trajectories yield a range of structural and thermodynamic information. To determine the packaging forces, equilibrated intermediate conformations obtained at regular intervals along the packaging trajectories (typically at intervals of 10% of the length of the DNA) are taken as starting points for a series of new MD runs, with the DNA atoms at the entrance point held fixed. The time step during the force calculations is reduced to 0.1 ps. As the DNA tries to push its way out of the capsid, the springs connecting the DNA beads to the stud atoms are stretched from their equilibrium lengths. To collect statistically uncorrelated data, 1000 of these displacements are recorded at 500 ps intervals along the MD trajectory. The forces are calculated by multiplying the displacements by the force constants. Integrating the force–distance curve over the full genome length gives the work done during DNA packaging. Since the force is calculated in a series of simulations with a fixed amount of DNA held in the capsid, there is no net motion during the force calculations, and the forces are equilibrium values. As a consequence, the work done represents the free energy cost of packaging. The internal energies are extracted from the same MD trajectories, simply by summing the average component energies (Eqs. (18.1)–(18.6)) and subtracting the corresponding values for free DNA at the same temperature in the absence of capsid restraints. The entropic penalty associated with DNA confinement is then calculated as the difference between the free energy and the internal energy (Petrov and Harvey, 2007). Typically, 10–50 independent packaging trajectories were carried out for each system that we investigated; by averaging over all of these, we obtained very accurate estimates of the forces and free energies. Simulated low-resolution electron density maps are reconstructed by averaging over individual structures from 10 to 50 independent packaging trajectories.
In a single structure, each DNA segment between successive pseudoatoms along the chain is modeled as a cylinder with a radius of 10 Å. Each cylinder is uniformly filled with 2000 points ("atoms"), and these point sets are converted to single-particle density maps with a voxel size of 3 Å using Spider (Frank, 2002). Superposing the individual densities generates average density maps that can be compared with experimental density maps from electron microscopy.
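The force and free-energy analysis above reduces to two numerical operations: Hooke's law on the average stud-spring displacement, and integration of the force–distance curve. The sketch below (our own names and toy data, not the authors' analysis scripts) shows both, using a manual trapezoidal rule.

```python
import numpy as np

K_STUD = 0.01   # stud spring constant, pN/A (value quoted in the text)

def mean_force(displacements):
    """Equilibrium force (pN) from spring displacements (A): F = k <x>."""
    return K_STUD * float(np.mean(displacements))

def packaging_work(lengths, forces):
    """Trapezoidal integral of the force-distance curve, in pN*A.
    Since the forces are equilibrium values, this is the free energy
    cost of packaging."""
    return float(np.sum(0.5 * (forces[1:] + forces[:-1]) * np.diff(lengths)))

# toy data: force rising linearly from 0 to 50 pN over 10,000 A of DNA
L = np.linspace(0.0, 1.0e4, 11)
F = np.linspace(0.0, 50.0, 11)
print(packaging_work(L, F))  # triangle area: 0.5 * 1e4 * 50 pN*A
```

The trapezoidal rule is exact for the linear toy curve; for real force profiles, the 10% sampling intervals mentioned above set the integration resolution.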
2.5. Ejection protocols

The main difference between packaging and ejection is that the latter is a spontaneous process (at least in its initial stage) and does not require the help of an external motor. Ejection is driven by the high pressure of the packaged DNA (Jeembaeva et al., 2008), which arises from hydration, electrostatic, and entropic forces (Petrov and Harvey, 2007). The models
and parameters for DNA and capsids applied to study ejection are essentially the same as those used to study packaging, except that there are no stud atoms, so the DNA spontaneously escapes from the capsid. In addition, we include a model of the bacterial cell by constraining the ejected portion of the DNA within a sphere of appropriate volume (Fig. 18.3). The full ejection model includes the DNA, the capsid, the connector channel, and a bacterial cell. For simplicity, we describe the capsid using the spherical approximation. The protein channel connecting the capsid and the bacterial cell is constructed as a hollow cylinder made of soft spheres, with an inner diameter of 40 Å and a length of 200 Å, similar to the protein cores used in the packaging simulations. The bacterial cell is modeled as a second NOE-like sphere, with a radius of 1 μm. During the course of the simulation, we maintain and update a list of pseudoatoms that have been ejected from the capsid; let us designate this list as containing beads 1 through N_ejected. We also maintain a list of 20 ejection candidates (atoms N_ejected+1 through N_ejected+20), which are still located inside the capsid. If a pseudoatom in this list is found within 60 Å of the capsid boundary, the spherical NOE-like capsid constraint for this atom is removed, so the bead is free to move down the connector channel and leave the capsid. After a pseudoatom comes out of the channel, enters the bacterial cell, and moves at least 100 Å into the cell, it is subjected to the spherical restraint of the bacterial cell. Thus, a pseudoatom cannot reenter the capsid after entering the bacterial cell; our model assumes that the probability of this event is very small. The addition and deletion of restraints are done on the fly during the ejection simulations, which is possible because of the structure of YUP.
The interval between updates of the ejection candidate list and the associated spherical restraints varies between 0.5 and 10 ns, depending on the rate of ejection.
Figure 18.3 Simulation of the ejection of genomic dsDNA from bacteriophage φ29. The genome was packaged into the spherical capsid as described in the text, and the full model is shown in the left panel, with the hollow core connecting the interior of the virus with the interior of a large sphere with the same radius (1 μm) as a typical bacterium. Upon release of the restraint holding the DNA inside the virus, the DNA is ejected into the bacterium by the combined electrostatic and entropic forces (right panel).
The viscosity of the medium inside the bacterial cell (or outside the bacteriophage, if the bacterial cell is excluded from the model) strongly affects the kinetics of both packaging and ejection (Evilevitch et al., 2003), so we carried out the ejection simulations using a Langevin dynamics (LD) protocol. The temperature was 298 K, and the simulation time step was 0.5 ps. The frequency of the applied stochastic forces (the collision frequency) varied over the range 0.001–0.02 ps⁻¹. Different viscosity regimes were studied to probe how the viscosity of the medium affects ejection kinetics. Figure 18.3 shows the result of a typical ejection trajectory. We have analyzed these trajectories by plotting the amount of genome ejected versus time; numerical differentiation of this function gives the ejection rate along the trajectory. Additionally, the forces acting on the DNA were calculated by a procedure similar to that described in the packaging protocol. Ejection was interrupted at every 10% of the genome ejected, and four successive DNA pseudoatoms inside the channel were connected by harmonic restraints to four stud atoms placed inside the protein channel (Eq. (18.1), with b_0 = 0; recall that stud atoms are dummy atoms and do not move). We measured the average displacements of these DNA atoms with respect to the stud atoms along the packaging axis and converted them to forces by multiplying by the stud force constant, in accordance with Hooke's law. No net motion of the DNA occurred during the force calculations, so these are equilibrium measurements. After the force measurements were complete, the stud atoms were detached and the ejection resumed. The proposed model of DNA ejection could be further improved to account for the explicit presence of the proteins, DNA, and organelles that occupy bacterial cells. It is known that the total volume fraction of DNA and proteins inside bacterial cells is 0.35–0.4.
The presence of these crowders is expected to affect both the thermodynamics and kinetics of ejection. A reduced void volume should give rise to an additional osmotic pressure that would act against the ejection force and may eventually stall ejection. A high concentration of crowders also changes the viscosity of the solvent, which is treated implicitly in the framework of our model: increasing the collision frequency parameter in the Langevin Dynamics simulations would slow down the kinetics. All of these additional factors would increase the complexity of the model, resulting in a significant increase in required computational resources.
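The two analyses described above (numerical differentiation of the ejected-fraction curve, and Hooke's-law forces from the stud-atom displacements) can be sketched as follows. This is a minimal illustration with made-up data: the trajectory curve, the displacement values, and the force constant are assumptions for demonstration, not results from the simulations.

```python
import numpy as np

# Hypothetical ejection trajectory: fraction of genome ejected vs. time (ns).
time_ns = np.linspace(0.0, 100.0, 101)
fraction_ejected = 1.0 - np.exp(-time_ns / 30.0)   # illustrative curve only

# Ejection rate along the trajectory by numerical differentiation.
rate = np.gradient(fraction_ejected, time_ns)

# Equilibrium force from Hooke's law: average displacement of the four
# restrained DNA pseudoatoms relative to the fixed stud atoms, times the
# stud force constant (Eq. (1) with b0 = 0).
stud_k = 40.0                                       # kcal/(mol Å²), assumed
displacements = np.array([0.50, 0.60, 0.55, 0.58])  # Å, along packaging axis
force = stud_k * displacements.mean()               # kcal/(mol Å)
force_pN = force * 69.5                             # 1 kcal/(mol Å) ≈ 69.5 pN
```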
2.6. Results

We have recently summarized our understanding of DNA packaging inside bacteriophage systems elsewhere (Harvey et al., 2009; Petrov and Harvey, 2008); here, we present only the highlights.
Modeling Virus Structure and Assembly
The high force developed by an ATP-driven motor is required to confine DNA inside the small volume of the bacteriophage capsid. The free energy cost of packaging is primarily electrostatic and entropic in nature; these two components account for up to 90% of the total free energy cost, while the elastic bending energy accounts for most of the rest (Petrov and Harvey, 2008). The confined DNA may fold into a number of conformations, all of which show significant disorder around certain idealized forms, including coaxial spools, concentric spools, twisted toroids, and folded toroidal structures (Petrov et al., 2007b). The specific DNA conformation inside a given bacteriophage depends upon the size and shape of the capsid, the size and shape of the core at the portal (if any), and the ionic composition of the surrounding buffer. Under fixed environmental conditions, the electrostatic and entropic costs of confinement are largely independent of the final conformation, so the optimum conformation minimizes the elastic bending energy (Petrov et al., 2007b). Simulations reproduce the multiple-shell pattern of DNA density often seen in experimental reconstructions. The latter reveal little about individual conformations, because the reconstructions are averages over thousands of individual viruses (Petrov and Harvey, 2007; Petrov et al., 2007a); the simulations provide these details. The current modeling method captures the essential physics of DNA packaging, but it cannot yet describe complex features such as specific interactions between DNA and proteins in the capsid walls. Nor does it treat the interactions of DNA with the packaging motor in enough detail to illuminate the mechanochemical transduction behind DNA translocation.
Torsional stiffness does not significantly affect either the final DNA conformation or the thermodynamics of packaging if one end of the DNA molecule is free (unattached) inside the bacteriophage, so that it can rotate and relax torsional strain (Rollins et al., 2008). When both ends are tethered, torsional stiffness has only a small effect on the thermodynamics of packaging, but the final conformations differ from those in the untethered case (Spakowitz and Wang, 2005). During ejection of the first 50–60% of the genome, the ejection force decreases drastically, dropping to a few piconewtons. Further ejection, however, leads to a slight increase in the force that acts on the DNA and pulls it out of the capsid. This observation lends support to the dual "push-pull" mechanism of DNA ejection (Grayson and Molineux, 2007; Jeembaeva et al., 2008). The initial decrease of the force during genome ejection is due to the drop in pressure inside the capsid. The subsequent increase of the force, which pulls the remaining DNA out of the capsid, is due to the entropic force developed by the ejected portion of the genome. This force is on the order of a few piconewtons and correlates well with the radius of gyration of the ejected DNA.
3. Single-Stranded RNA Viruses

3.1. A specific model system: Pariacoto virus (PaV)

PaV is an icosahedral T = 3 RNA virus with a bipartite genome. The 4322-nucleotide genome consists of RNA1 (3011 nucleotides) and RNA2 (1311 nucleotides). The protein capsid is composed of 180 identical subunits, each containing 401 amino acids. There are 60 copies of the crystallographic asymmetric unit, each of which contains three copies of the capsid protein in three different conformations, called A, B, and C (Tang et al., 2001). The asymmetric unit also contains an RNA segment of 25 nucleotides. This RNA forms half of a double-stranded duplex that is perpendicular to the crystallographic twofold axis and that lies just inside the protein capsid. The full structure of the virus can be generated from the asymmetric unit using the 60 matrices provided in REMARK 350 of the PDB file (PDB id: 1F8V), using the oligomer generator application from the VIPER website (Shepherd et al., 2006). The RNA forms a dodecahedral cage with a 25-bp duplex lying on each of the 30 edges. Thus, the crystallographically resolved RNA accounts for about 35% (25 × 2 × 30 = 1500 nt) of the total genome. The remaining 65% of the RNA lies inside the dodecahedral cage and is not resolved in the crystal structure because it lacks icosahedral symmetry. In addition, the RNA at the 20 vertices where the duplexes are connected is not crystallographically resolved, presumably because fragments at different vertices have different structures. Similarly, protein subunit A is missing 6 residues at the N-terminus and 15 at the C-terminus in the crystal structure, while the B and C subunits are missing about 50 residues at the N-terminus and 19 residues at the C-terminus, owing to the lack of clear electron density. Again, this almost certainly reflects structural heterogeneity. The challenge is to model the complete virus in as much detail as possible.
The system is very large, and there are only limited experimental data to guide modeling of the parts of the structure not revealed by crystallography. Because of the size of the system and the limited data on the protein tails and on the RNA in the interior of the virus, coarse-grained modeling is appropriate for building and refining the model, although we converted the final coarse-grained model to an all-atom model at the end.
3.2. Conversion of RNA secondary structure into a 3D coarse-grained model

As will be seen presently, we based the model of the PaV RNA genome on a plausible secondary structure model (Tihova et al., 2004). We built the 3D model by connecting fragments from crystal structures with junctions that
we built manually at the all-atom level, interconverting all-atom and coarse-grained representations as appropriate. In some of our RNA modeling efforts, we use an entirely automated procedure for converting secondary structures into 3D models. Although we did not use this procedure in our PaV model (Devkota et al., 2009), we present the automated method here for completeness. RNA presents a more difficult modeling challenge than dsDNA. Unlike dsDNA, ssRNA molecules contain various structural motifs, including double-stranded regions, single-stranded regions, stem-loops, and a variety of bulges and junctions. The simplest coarse-grained model of RNA is a linear beads-on-a-string model, but it cannot capture this variety of structural motifs, and it does not describe the RNA secondary structure, which plays a crucial role in defining RNA conformation in 3D space. Such a model necessarily has limited utility for investigating the structure and assembly of RNA viruses. Previously, we developed a coarse-grained "PX" model of RNA that provides a good 3D description of RNA composed of different structural elements (Malhotra et al., 1994; Tan et al., 2009). Figure 18.4B shows that model, which we have implemented in YUP as the rrRNAv1 model (Tan et al., 2006). In the framework of this model, each nucleotide is represented by one pseudoatom (P-atom). Single-stranded regions are described by flexible strings of connected P-atoms, and helices are explicitly represented by semirigid fragments, in which hydrogen bonding between the strands is replaced by unbreakable bonds between P-atoms on the two
Figure 18.4 Models of tRNA. (A) All-atom model, with phosphorus atoms highlighted as small dark spheres. The larger gray spheres are the "2N" pseudoatoms, each representing two successive nucleotides and each placed at the midpoint of two successive glycosidic nitrogen atoms. (B) The PX model, also implemented as the rrRNAv1 model. Each residue is represented by a single P-atom, centered at the position of the phosphate group (black). There is an additional pseudoatom (X-atom) for each base pair in the double-stranded regions; it is located at the geometric center of the base pair and has a sufficiently large radius to provide appropriate volume exclusion. (C) The 2N model, with one pseudoatom representing two successive nucleotides.
strands. There are terms in the energy function that describe bond angle bending between successive triplets of P-atoms along the backbone, and other angular terms that define the ideal geometry of double-helical regions. An improper torsion (j − 1, j, k, k + 1) is associated with the j–k base pair to enforce the right-handed chirality of double helices. A model containing only P-atoms would have hollow double helices, running the risk of artifactual interhelical penetrations; proper treatment of volume exclusion arises from a series of additional X-atoms along the axis of each double-helical fragment (Fig. 18.4B). Both the PX and rrRNAv1 models have too many parameters to be given here; they are reported elsewhere (Malhotra et al., 1994; Tan et al., 2009). If the coordinates of all RNA atoms are known in 3D, then the positions of the P-atoms can be easily extracted and the rrRNAv1 model can be generated according to a previously described procedure (Cui et al., 2006). If the crystal structure is not known, small fragments can be built by manual modeling. For large systems of unknown structure, a common goal is to create a plausible 3D model that is compatible with a specified secondary structure. This is particularly important in studies of viral assembly and other properties of viral RNAs. We have developed an algorithm that generates the rrRNAv1 model from a specified secondary structure. It can be used without any additional 3D data, or, when such data are available, they can be incorporated into the model as restraints. RNA secondary structure predictions from programs like Mfold (Zuker, 2003) are often given in the CT file format: Columns 1 and 2 specify the index (residue number) and type (A, C, G, U) of each residue, while Column 5 contains the index of the complementary base-pairing residue, if any (zero otherwise).
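A minimal reader for the CT column layout just described might look like the following sketch. It is not CT2BLUE.py itself; the handling of the header line and of extra columns is an assumption about typical CT files.

```python
# Minimal CT-file parser sketch: column 1 is the residue index, column 2 the
# base type, and column 5 the index of the pairing partner (0 if unpaired).
# Skipping the first (header) line is an assumption about the file layout.
def parse_ct(lines):
    pairs = {}
    seq = []
    for line in lines[1:]:              # skip the header line
        cols = line.split()
        if len(cols) < 5:
            continue
        idx, base, partner = int(cols[0]), cols[1], int(cols[4])
        seq.append(base)
        if partner > idx:               # record each base pair once
            pairs[idx] = partner
    return "".join(seq), pairs

ct = [
    "5 example",
    "1 G 0 2 5 1",
    "2 C 1 3 4 2",
    "3 A 2 4 0 3",
    "4 G 3 5 2 4",
    "5 C 4 0 1 5",
]
seq, pairs = parse_ct(ct)   # seq = "GCAGC"; pairs = {1: 5, 2: 4}
```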
This information is extracted and converted to the BLUEPRINT format of the rrRNAv1 model using the utility CT2BLUE.py, located in the rrRNAv1 folder of the YUP package (Tan et al., 2006). The format of the BLUEPRINT file used by YUP to create the rrRNAv1 model is described in the YUP documentation and will only be outlined here. Fragments of the RNA secondary structure must be given in hierarchical form. In the simplest case, all the elements of the 2D RNA structure (loops, single- and double-stranded regions) may be described at the same hierarchical level, but a more complex organization containing multiple levels is also possible. The latter does not affect the properties of the rrRNAv1 model but simply provides additional structural information for complex RNA molecules containing multiple domains. The BLUEPRINT file (BP_NAME.py, written in Python) contains a dictionary "BLUE" with several keywords. The first keyword, "RNA_RNA", contains information about the RNA secondary structure, given as a tuple of tuples, (DOMAIN, 'all', (D_1, D_2, . . .)), where D_i is the label of the ith region.
For example, (DOMAIN, 'all', (S_1, H_1, S_2, H_2, S_3, H_3, . . .)) could specify a single-stranded region at the 5′ end of the molecule, followed by a series of three double-helical regions connected by single-stranded regions, with other entries to identify the structure of the rest of the molecule. Here, the entries S_1 and S_2 are labels for single-stranded regions, and the entries H_1, H_2, and H_3 represent double-helical regions. (Other labels might be used for loops, bulges, and strands that are part of multibranch junctions; these are all "single-stranded" in the sense that they do not have base-paired partners.) Each entry in the nested tuple is given in a format that defines the characteristics of the corresponding region, for example, S_1 = (TRACT, 'tract_1', (1,3)) and H_1 = (HELIX, 'helix_1', (4,7,45)), where the first entry defines the type of the RNA fragment, the second entry labels it, and the third entry provides the structural information. TRACT and HELIX define single- and double-stranded domains, respectively. The third entry is a tuple that contains two residue indices for tracts and three for helices. For tracts, the two indices define the beginning and the end of a single-stranded fragment (in this example, S_1 is single-stranded and contains nucleotides 1–3). For double helices, the first and third indices define the 5′-end positions of the antiparallel strands that form the double-helical region, and the second index defines the length of the double-stranded region. (Here, H_1 contains 7 bp, between residues 4 and 10 and residues 51 and 45.) The second keyword, "RNA_BSQ", contains the sequence in the format of a tuple: ("C", "A", "U", "C", "C", . . .). Finally, the last two keywords, RNA_XYZ and RNA_FIX, are, by default, empty tuples: (). They may contain information about the positions of the P-atoms and additional constraints (e.g., for loop regions), if such data are known from other sources.
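The HELIX convention above can be made concrete with a small sketch that expands a (start5p, length, other5p) tuple into its list of base pairs. The expansion rule is our reading of the H_1 example (7 bp pairing residues 4–10 with residues 51–45); helix_pairs is a hypothetical helper, not part of YUP.

```python
# Expand a HELIX spec (start5p, length, other5p) into base pairs
# (start5p + i, other5p + length - 1 - i): the two 5' residue indices and
# the helix length fully determine the pairing of antiparallel strands.
def helix_pairs(spec):
    start5p, length, other5p = spec
    return [(start5p + i, other5p + length - 1 - i) for i in range(length)]

H_1 = ("HELIX", "helix_1", (4, 7, 45))
pairs = helix_pairs(H_1[2])   # [(4, 51), (5, 50), ..., (10, 45)]
```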
The file BP_NAME.py is used as the input file to generate the rrRNAv1 model. The model is generated in several steps using the YUP package. The first step, R = rrRNAFFA(), activates the model. The second and most important step reads the data from the BLUEPRINT file and creates the RNA: R.addRNA(blueprint("BP_NAME"), modelname = "M_NAME", randomize = 1, dimensions = (5.6, 0.0, 180.0/n, 0.0, 0.0, 0.0)). The procedure blueprint reads the Python dictionary "BLUE" from the file "BP_NAME.py", which contains the keywords describing the RNA secondary structure. The variable modelname is a string that defines the name of the molecule. If the variable randomize is set to 1, the coordinates of the RNA are generated by an internal YUP routine; if it is set to 0, the coordinates are read from the dictionary entry RNA_XYZ (if available). The variable dimensions is a tuple that contains the average and standard deviation of the distances (Å) between two adjacent P-atoms, the average and standard deviation of the angles (degrees), and the average and standard
deviation of the improper torsions (degrees). The dimensions argument is used to generate the initial coordinates of the RNA model in the form of a circular arc. YUP can also generate a random chain using a random walk algorithm, but we found that random initial coordinates may lead to topological traps once the constraints describing helical regions are applied, whereas an initial conformation in the form of an arc avoids this problem. To generate an initial model in which all P-atoms lie on a planar 180° circular arc, one sets the variable dimensions to (5.6, 0.0, 180.0/n, 0.0, 0.0, 0.0). In this example, 5.6 Å is the equilibrium distance between adjacent P-atoms, n is the number of residues in the model, and the initial torsions and standard deviations are all set to zero. Figure 18.5A shows the result for a more open circular arc. The method R.addRNA() also activates all necessary force field terms. Note that the structure in Fig. 18.5A does not satisfy any of the restraints in the model except the P–P bond lengths along the chain; optimization of the structure produces a model that does satisfy those restraints (Fig. 18.5B). Finally, the model is completed by the M = R.finish() method, which creates an object representing the RNA model in YUP. The model object contains the detailed description of the model, including all force field terms and the initial coordinates; it resides in the computer's memory, so its properties can be easily modified. After creation, the model is optimized by extensive minimization (e.g., 500,000 steps of steepest descent), followed by thermal equilibration using
Figure 18.5 Conversion of the tRNA secondary structure model into a three-dimensional model. (A) 76 successive P-atoms are initially equally spaced along a circular arc in the xy plane, with pseudobonds corresponding to the secondary structure; X-atoms are present but are not shown, for graphical clarity. (B) Simulated annealing and minimization yield a three-dimensional structure that satisfies all the distance, angle, and pseudotorsion restraints of the secondary structure, as well as the volume-exclusion requirements. (C) A plausible three-dimensional model of tRNA is produced by refinement after the addition of restraints representing the 18–55 and 19–56 base pairs between the D-loop and T-loop, along with restraints for correct stacking of the acceptor stem on the T-stem and of the anticodon stem on the D-stem. These restraints are not sufficient to completely define the three-dimensional structure of tRNA, because there are fewer restraints than degrees of freedom. The addition of a single distance restraint between the anticodon loop and the 3′ end of the acceptor stem does produce a model that resembles the crystal structure (not shown).
MD (e.g., simulated annealing, or, in the example of Fig. 18.5, 8.10 ns at 300 K with a time step of 10 fs). After this procedure, the RNA adopts a 3D conformation that is folded in accordance with the secondary structure (Fig. 18.5B), plus any 3D restraints (Fig. 18.5C), as enforced by the rrRNAv1 force field. At this point, one may continue the simulations within the YUP package, or one can convert the rrRNAv1 model (force field terms and XYZ coordinates) into the format of AMBER (Pearlman et al., 1995) or LAMMPS (Plimpton, 1995) for further simulations. AMBER is, of course, a very widely used package for biomolecular simulations; LAMMPS (http://lammps.sandia.gov) is a newer open-source package developed for simulating a wide range of condensed systems. We have previously published the AMBER conversion protocol (Cui et al., 2006) but have not yet done so for the LAMMPS conversion. Briefly, the conversions are done by executing the utility programs AMBER.py and LAMMPS.py, which are also contained in the rrRNAv1 folder of the YUP package (Tan et al., 2006). Simulations in AMBER and LAMMPS significantly speed up the production stage of MD, because these packages are available in parallel versions, while YUP is currently available only as a single-processor code.
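The arc initialization described above can be sketched as follows. arc_coordinates is a hypothetical stand-in for YUP's internal routine: it places n P-atoms on a planar 180° arc with exactly 5.6 Å between adjacent atoms, by choosing the arc radius so that the chord between neighbours equals the target spacing.

```python
import math

# Place n P-atoms equally spaced along a planar circular arc (default 180°),
# with a fixed distance between adjacent atoms.  The chord of one angular
# step must equal the spacing: spacing = 2 R sin(dtheta / 2).
def arc_coordinates(n, spacing=5.6, arc_deg=180.0):
    dtheta = math.radians(arc_deg) / (n - 1)            # angular step
    radius = spacing / (2.0 * math.sin(dtheta / 2.0))   # solve for R
    return [(radius * math.cos(i * dtheta),
             radius * math.sin(i * dtheta),
             0.0) for i in range(n)]

coords = arc_coordinates(76)   # e.g., the 76 residues of tRNA
```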
3.3. PaV: The RNA model

To begin with, we converted the all-atom initial model to a coarse-grained representation, with each nucleotide represented by a single pseudoatom at the phosphate position. A more complete description of this "all-P" model is available elsewhere (Malhotra et al., 1994). We built the complete PaV model in two steps. First, we modeled those parts of the viral genome that are not resolved in the crystal structure, attaching them to the 1500 crystallographically defined nucleotides in the RNA dodecahedral cage. Then, we added the missing residues of the protein subunits. Modeling the missing parts of the PaV RNA requires us to visualize, manually manipulate, and refine the coarse-grained RNA model without tangling it. It is impossible to do this within the confines of the model capsid, because the capsid is so small. Instead, we built the model in an expanded framework and then shrank it down to the correct size in a series of scaling/optimization steps (Fig. 18.6). The initial, correctly scaled framework is defined by 20 pseudoatoms, each at a vertex of a virtual dodecahedron whose edges are coaxial with the RNA double helices that define the RNA dodecahedral cage in the crystal structure. Multiplying the coordinates of the 20 pseudoatoms by a factor of two provides a dodecahedral framework with eight times the volume of the virus, in which it is easy to build and manipulate the RNA model (Fig. 18.6A). Once that is done, we shrink the
Figure 18.6 Optimization of the RNA model for Pariacoto virus (PaV). It is not possible to manipulate the RNA model within the confines of the virus, so we define a dodecahedral framework that initially has twice the diameter and eight times the volume of the actual virus, build the RNA model in that framework, and then refine it by a series of shrinkage/minimization steps. (A) RNA modeled in the expanded framework. Each RNA double helix on one edge of the original dodecahedral framework is cut into two fragments, with one attached to each vertex in the expanded framework. The "stalactites" of RNA that reach from 12 vertices into the interior of the virus are then attached, giving a complete model of the genome. (B, C) Two snapshots during the refinement, as the dodecahedral framework is shrunk stepwise to the correct size, with minimization of the RNA model at each step. (D) The final RNA model after complete contraction of the dodecahedral framework to the size it has in the crystal structure.
framework back down to its correct size by repeated scaling steps, each of which is followed by extensive minimization. To expand the RNA dodecahedral cage without deformation, we separated each RNA duplex between the 12th and 13th nucleotides and moved each half duplex to the appropriate vertex of the expanded dodecahedral frame. This gave three pieces of RNA at each vertex. Previously, we had postulated a plausible secondary structure of the PaV RNA genome (Tihova et al., 2004), based in part on the density of the cryo-electron microscopy map just below the vertices, which had suggested that there are approximately 12 connections between the RNA dodecahedral cage and the remainder of the genome in the center of the virus. Our secondary structure model defines a set of two-, three- and four-way junctions at the 20 vertices of the dodecahedral cage, with 12 of these connecting to RNA in the center of the virus through short double-helical “stubs”.
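The expand/contract refinement described above can be sketched as follows. This is a schematic, not the yammp/YUP protocol: the minimize callback is a placeholder for the minimization run after each scaling step, and the coordinates are stand-ins. The edge lengths (149.0 Å initial, 78.5 Å final, 5 Å steps) are taken from the description later in this section.

```python
import numpy as np

# Shrink a framework from its expanded size back to the crystallographic
# size by repeated uniform scaling steps, minimizing after each step.
# Uniform scaling moves every atom radially inward about the origin.
def shrink_framework(xyz, edge_start=149.0, edge_end=78.5, step=5.0,
                     minimize=lambda x: x):
    xyz = np.asarray(xyz, dtype=float)
    edge = edge_start
    while edge > edge_end:
        new_edge = max(edge - step, edge_end)
        xyz = xyz * (new_edge / edge)   # scale edge length down one step
        xyz = minimize(xyz)             # placeholder for yammp minimization
        edge = new_edge
    return xyz

expanded = 2.0 * np.eye(3) * 74.5       # doubled stand-in vertex coordinates
final = shrink_framework(expanded)      # net scale factor: 78.5 / 149.0
```

Because each step rescales by new_edge/edge, the factors telescope, so the final coordinates equal the expanded coordinates times 78.5/149.0 regardless of the step size (the minimization, omitted here, is what makes the stepwise path matter).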
We began 3D modeling by building all-atom models of the junctions and stubs. We then modeled the rest of the RNA inside the dodecahedral cage by attaching 12 identical copies of a globular RNA to the stubs. For this, we chose a 225-nucleotide fragment (residues 1764–1988) from domain IV of the large subunit of the E. coli ribosome (PDB id: 2WA4). We call these pieces of RNA "stalactites." Within the expanded framework, it was relatively easy to add these stalactites without any steric clashes (Fig. 18.6A). The model in the expanded framework has 4322 P-atoms (one per RNA residue) plus the 20 pseudoatoms at the vertices of the virtual dodecahedral framework. Some RNA fragments are based on crystal structures, while others are based on idealized double helices and junctions, so the RNA model is stereochemically correct, except that the double helices connecting the adjacent vertices on the dodecahedral cage are split into two separate pieces on the expanded framework. To rejoin these, we scaled the framework downward in size (and moved the RNA radially inward) in a series of steps, each of which shortens the edges of the framework by 5 Å; the RNA model was reminimized after each scaling. Figure 18.6 shows a series of snapshots from this process. Minimization was done using yammp (Tan and Harvey, 1993), which requires two input files. The archive file consists of the (x, y, z) coordinates of the structure. The descriptor file contains the ideal values for different parameters (bonds, angles, etc.) and the force constants. These are given in Table 18.1. Standard bond and angle energy functions are used for the connections between appropriate pairs and triplets of pseudoatoms.

Table 18.1 Energy terms used during refinement of the coarse-grained model of the PaV RNA genome

Energy              Equation             Force constant
Bond^a              Eq. (1)              RNA cage: 20 kcal/(mol Å²); stalactites: 2 kcal/(mol Å²)
Angle^a             Eq. (2)              RNA cage: 20 kcal/mol; stalactites: 2 kcal/mol
Improper torsion^a  Eq. (4)              RNA cage: 20 kcal/mol; stalactites: 2 kcal/mol
Nonbond             Eq. (3), d0 = 10 Å   2 kcal/(mol Å²)
NOEN                Eq. (7)              2 kcal/(mol Å²)
Stud                Eq. (3), d0 = 0      40 kcal/(mol Å²)

^a Ideal values of bond lengths (b0), angles (θ0), and improper torsions (φ0) are based on crystallographic values for those fragments of the model derived from either the PaV or ribosome crystal structure, and on ideal values for those fragments for which we built manual models.

There are, for
example, pseudobonds between successive P-atoms along the backbone of the molecule; there are also pseudobonds connecting the P-atoms representing the phosphate groups of pairs of nucleotides that interact through Watson-Crick base pairing. As in the full PX and rrRNAv1 models discussed above, the simplified all-P model also includes pseudotorsions to guarantee the proper chirality of the right-handed double helices (Malhotra et al., 1994). There are two classes of bond, angle, and pseudotorsion energy terms. The first class is designed to enforce idealized local geometry on the RNA model. Here, "idealized" refers to values taken from the crystal structure of the RNA dodecahedral cage, from the crystal structure of the ribosomal RNA fragment used to model the stalactites, from model stem-loops, from the model three- and four-way junctions, and from the double-helical stubs used to connect the dodecahedral cage to the stalactites. The second class consists of a set of restraints between the pseudoatoms of the expanded dodecahedral framework and pseudoatoms in the broken RNA double helices from the crystallographic dodecahedral cage; these keep the double helices correctly positioned as the framework is contracted, so that they are reconnected with the crystallographic geometry at the end of the contraction/refinement process. As seen in Table 18.1, there are two different families of force constants (not to be confused with the two classes of bonds, angles, and pseudotorsions). One family is applied to the distances and angles between atoms in the double-helical RNA cage, while the other is applied to those in the stalactites. The former are 10 times stronger than the latter, to prevent distortion of the cage away from the structure seen in the crystal; almost all deformations are thus forced onto the stalactites, since there are no data on the actual RNA structures in the viral interior.
As in the rrRNAv1 model discussed above, a soft-sphere semiharmonic repulsion is used for the nonbonded interaction between pairs of P-atoms that are not covalently connected through a bond or angle term and that are not part of the same double helix. To reduce computational complexity, no X-atoms were included in the coarse-grained PaV model, so we used a rather large P–P contact distance (10 Å) to prevent interpenetration of double helices. This has the added advantage of keeping the RNA structure rather open, mimicking RNA–RNA electrostatic repulsions in the real system and leaving room in the interior of the virus model for the penetration of positively charged protein tails that help neutralize the RNA and stabilize the structure (see below). In early trials, we observed that the stalactite RNAs had a tendency to escape through the faces of the RNA dodecahedral cage during the contraction/minimization steps. To prevent this, we added an NOE-like restraint (NOEN in yammp) to confine all the RNA within a spherical boundary of radius R (Eq. (7)). This radius is decreased by 7% during
each step of scaling. This term also helps to keep the RNA helices attached at the vertex pseudoatoms properly oriented with respect to the dodecahedral framework during contraction. The NOEN is defined with respect to the center of the virus, which coincides with the origin of coordinates. The 20 pseudoatoms defining the vertices of the dodecahedral framework are tethered to specified points in space with a harmonic "stud" energy function, as discussed above. There are also 30 bonds between adjacent pairs of these pseudoatoms, to help rigidify the framework; they coincide with the edges of the dodecahedron (Fig. 18.6). The tethering positions of the vertex pseudoatoms were moved inward, and the ideal bond lengths of the edges of the dodecahedral framework (b0) were shortened, in a series of 5 Å steps. The initial framework had an edge length of 149.0 Å, and the final framework has b0 = 78.5 Å. The model is minimized to convergence using the energy minimization protocol of yammp after each step. Since all the terms in the potential energy function of the all-P model are harmonic, full minimization should lead to zero energy if all restraints can be satisfied without steric overlaps. During minimization, the stalactite RNAs were free to move and adjust their conformations to avoid steric overlap; they had softer force constants in the energy terms than did the RNA domains on the dodecahedral cage (Table 18.1). The crystallographic regions were restrained by the use of strong force constants and by the addition of pseudobonds connecting each vertex pseudoatom to the ends of the RNA duplexes on each edge. These regions did not deviate significantly from the crystal structure during the contraction/minimization cycles. The output file at the end of each step is a new archive file representing an intermediate model whose total energy has converged to a minimum.
This structure became the starting model for the next round of contraction/minimization, using a new descriptor file with new ideal values for the edges and the NOEN radius decreased appropriately. Our collaborator Sébastien Lémieux (University of Montreal) converted the coarse-grained RNA model to an energy-refined all-atom model using a suite of programs that he had developed. This conversion is quite straightforward for double-helical regions. In single-stranded regions, conversion begins by generating candidate structures for fragments defined by four successive phosphate atoms along the backbone. Candidates are extracted from the same library that is used for modeling with MC-SYM (Parisien and Major, 2008), subject to the requirement that the four phosphate groups in the library fragment have a root-mean-square deviation of less than 1.5 Å from the P-atom positions in the coarse-grained model. Once candidates are identified, the problem becomes one of searching all combinations of candidates to identify the set that satisfies the RMSD restriction with the lowest nonbonded energy (van der Waals plus electrostatics). This optimizes base pairing and stacking while minimizing steric clashes.
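The fragment-acceptance criterion described above can be sketched as follows. This is a simplification: it assumes the fragment has already been superimposed on the model (the alignment step is omitted), and the coordinates are made up for illustration.

```python
import numpy as np

# Accept a library fragment if its four phosphate positions lie within a
# 1.5 Å RMSD of the four P-atom positions in the coarse-grained model.
def rmsd(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

def accept_fragment(fragment_p, model_p, cutoff=1.5):
    return rmsd(fragment_p, model_p) < cutoff

# Hypothetical pre-aligned coordinates (Å):
model = [(0.0, 0, 0), (5.6, 0, 0), (11.2, 0, 0), (16.8, 0, 0)]
close = [(0.3, 0, 0), (5.9, 0, 0), (11.0, 0, 0), (16.5, 0, 0)]
far   = [(3.0, 0, 0), (9.0, 0, 0), (15.0, 0, 0), (21.0, 0, 0)]
```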
3.4. PaV: Adding the capsid to the model

The final step in generating the model of PaV was to reconstruct the protein residues missing from the crystal structure. As mentioned before, the crystal structure of the asymmetric unit is missing residues from the N- and C-terminal tails of each protein because of the lack of clear electron density. The missing residues are listed in Table 18.2. The N-terminal tails contain an excess of arginine and lysine residues compared with the rest of the protein, so the tails carry a net positive charge. These basic residues interact with the RNA through electrostatic attractions, presumably stabilizing the structure of the virus. The C-terminal tails are composed of neutral residues. Our approach to modeling the capsid proteins was similar to the approach we used for modeling the RNA. We defined a framework with the same icosahedral symmetry as the virus, with 60 triangular faces, one for each copy of the asymmetric unit, and then expanded it by a factor of three (a 27-fold expansion in volume). We radially translated a coarse-grained model of the crystallographically resolved parts of the capsid proteins onto this expanded frame, keeping the RNA fixed at the center, to generate enough space for placing the missing protein residues (Fig. 18.7). We generated the C- and N-terminal tails, then compressed the capsid radially in multiple steps, with minimization of the tails at each compression step, using YUP (Tan et al., 2006). This repeated compression/minimization protocol allows the protein tails to find their way into the fixed RNA. The details of the modeling and simulation protocol are as follows. We first converted all of the RNA models and the crystallographically resolved protein residues into coarse-grained models to reduce the number of atoms. Our goal was to remove as many residues as possible from both the RNA and the protein while maintaining their surface integrity.
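The coarse-graining used here relies on pairwise averaging, as detailed below: one pseudoatom per two successive Cα atoms for protein (the 2Cα model) and one per two successive nucleotides for RNA (the 2N model). A minimal sketch, with the boundary handling for an odd trailing atom being our assumption:

```python
import numpy as np

# Replace successive pairs of atoms (Cα or glycosidic N) by a single
# pseudoatom at the midpoint of each pair; an unpaired final atom is
# kept unchanged (an assumption about boundary handling).
def two_bead_model(coords):
    xyz = np.asarray(coords, dtype=float)
    n = (len(xyz) // 2) * 2
    beads = 0.5 * (xyz[0:n:2] + xyz[1:n:2])   # midpoints of pairs
    if len(xyz) > n:                          # odd trailing atom
        beads = np.vstack([beads, xyz[-1:]])
    return beads

ca = np.array([[0.0, 0, 0], [2, 0, 0], [4, 0, 0], [6, 0, 0]])
beads = two_bead_model(ca)   # two pseudoatoms, at x = 1 and x = 5
```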
Table 18.2 Protein residues not observed in the PaV crystal structure

Protein   N-terminal   C-terminal   Total number
A         1–6          379–393      21
B         1–48         383–401      64
C         1–50         383–401      68

Figure 18.7 Addition of the capsid proteins to the RNA model for PaV, starting with a protein cage structure that is expanded to three times its final diameter. Coarse-grained models of different resolutions are used to model different regions of the proteins. Those parts of the crystallographically resolved regions that lie nearest to the RNA are represented in a 2Cα model, with one pseudoatom representing two successive amino acids. The remaining residues are represented by a very coarse-grained model, with 12 pseudoatoms representing the face, sides, and vertices of the triangular asymmetric unit. The protein tails, whose conformations are not revealed in the crystal structure, are represented by one pseudoatom per amino acid and extend radially inward from the inside of the capsid toward the RNA genome. The expanded cage is shrunk to its crystallographic dimension in a series of steps, with energy minimization at each step. The protein tails are pulled toward the center of the virus during this process, and their ability to penetrate the porous RNA cage depends on the van der Waals radius assigned to these residues.

The inner side of the capsid proteins and the outer surface of the RNA are particularly important, because these surfaces are in contact with the missing amino acid residues. We used a very coarse-grained model for regions of the protein on the outer surface of the capsid. To make this selection quantitative, we defined a triangle connecting the alpha carbons of residue 175 in the A, B, and C proteins. Atoms outside this triangular plane were completely removed and replaced with 12 pseudoatoms (12C model), each with a radius of 35 Å. These pseudoatoms covered the whole triangle, preventing any flexible chains from leaving the virus during the minimization protocol. The atoms below the triangular plane were converted into a 2Cα model by averaging the coordinates of successive pairs of Cα atoms and replacing them with a single pseudoatom. Residues 7–50 of protein A were exceptions to this conversion. These residues are in contact with the RNA in the crystal structure, so they were kept in their crystallographically defined positions; we modeled them with one pseudoatom per residue, placed at the position of the alpha carbon. We converted the all-atom RNA model into a 2N model, with two consecutive nucleotides represented by one pseudoatom at the center of the two glycosidic nitrogen atoms. This model conserves the minor and major grooves of the RNA double helices. It has less excluded volume
than an actual RNA molecule, so a 2N model of a viral genome is quite porous. In the case of PaV, this facilitates penetration of the polycationic tails of the capsid proteins into the RNA grooves in the viral interior. After the RNA and the crystallographically resolved part of the asymmetric unit were converted into coarse-grained models, the asymmetric unit was moved out from the center of the RNA. This radial expansion was achieved by multiplying all coordinates by a factor of three, since the model is centered on the origin. This provided enough space for us to generate the missing tail residues, using one pseudoatom per amino acid (Fig. 18.7). Residues 7–50 of protein A were not moved, because these residues interact with the RNA. We generated the positively charged N-terminal tails of both proteins B and C as linear chains extending radially inward toward the center of the virus (Fig. 18.7). The C-terminal tails of proteins B and C were generated as random coils, because they are not charged. The gap between residues 379 and 393 of protein A was closed by a random coil connected to those residues, using a Monte Carlo algorithm, as follows: Given the first pseudoatom in the chain, the algorithm first generates trial coordinates for the second pseudoatom at a fixed distance from the first, but in a random direction from it. If the new pseudoatom is within 3.0 Å of any other atom, the trial position is rejected and a new one is generated. Repeating this process 11 times generates a 12-residue chain of random configuration. This chain is rotated into a position where it lies in the gap between residues 379 and 393 of protein A; energy minimization yields a conformation that closes that gap. After generating the missing residues, the complete coarse-grained capsid was generated by applying icosahedral transformation matrices to the asymmetric unit (Fig. 18.7).
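The Monte Carlo chain-growth step can be sketched directly in code. The implementation below is a toy version with names of our own: the 3.8 Å bond length matches the Cα spacing used elsewhere in this protocol, the 3.0 Å clash cutoff is the one quoted above, and the attempt cap is our own safeguard against a trapped chain.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Vec3 { double x, y, z; };

static double dist(const Vec3& a, const Vec3& b) {
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Grow a pseudoatom random coil: each new atom is placed one bond length
// from the previous one in a uniformly random direction, and a trial is
// rejected if it comes within `clash` of any existing atom. The bonded
// neighbor sits at `bond` > `clash`, so it always passes the check.
std::vector<Vec3> growChain(Vec3 start, const std::vector<Vec3>& context,
                            std::size_t nResidues, double bond = 3.8,
                            double clash = 3.0, unsigned seed = 7) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> g(0.0, 1.0);
    std::vector<Vec3> chain{start};
    std::size_t attempts = 0;
    while (chain.size() < nResidues && ++attempts < 100000) {
        // A random unit vector from three Gaussian deviates.
        double x = g(rng), y = g(rng), z = g(rng);
        const double n = std::sqrt(x * x + y * y + z * z);
        if (n < 1e-9) continue;
        const Vec3 prev = chain.back();
        const Vec3 trial{prev.x + bond * x / n,
                         prev.y + bond * y / n,
                         prev.z + bond * z / n};
        bool clashes = false;
        for (const Vec3& a : context)
            if (dist(trial, a) < clash) { clashes = true; break; }
        for (const Vec3& a : chain)
            if (dist(trial, a) < clash) { clashes = true; break; }
        if (!clashes) chain.push_back(trial);
    }
    return chain;
}
```

With nResidues = 12, eleven accepted trials extend the starting pseudoatom into the 12-residue random coil described above; the calling code would then rotate the coil into the gap before minimization.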
The coarse-grained capsid was compressed in a series of steps, with each step followed by steepest-descent minimization of the protein tails, while keeping the rest of the capsid proteins and the RNA fixed. The protein tails are pulled toward the center of coordinates and penetrate into the genomic RNA. The force field terms and parameters used for the different components of the coarse-grained model are summarized in Table 18.3. In the expanded framework, the interior of the capsid is a distance D ≈ 300 Å from the outside of the RNA (Fig. 18.7). We divided the process of compressing the capsid to its correct size into two stages. The first stage consisted of a series of nine scalings, each of which moved the capsid inward by a distance 0.1D, and each of which was followed by extensive minimization. At this point, it becomes more difficult to resolve steric problems with large scaling steps, so the second stage consisted of a series of five scaling steps, moving the capsid inward 0.02D at each step, each followed by extensive minimization. After the final step of compression/minimization, we converted the protein tails into an all-atom model using PULCHRA (Rotkiewicz and Skolnick, 2008) and connected these with the rest of the all-atom protein crystal structure. Since the RNA had not been allowed to move during the modeling of the protein tails, we simply replaced the coarse-grained RNA model with the all-atom model described above. The final all-atom model of PaV was further minimized with NAMD, using the CHARMM27 force field (MacKerell et al., 2000), with all protein and RNA atoms free to move. This eliminates any unacceptable steric conflicts and gives bond lengths and angles within standard ranges.

Table 18.3 Energy terms used for the protein component of the coarse-grained model for pariacoto virus, and for the protein–RNA volume exclusion term

Energy term        Atoms affected              Equation   Parameters
Bond               Flexible tails (Cα model)   Eq. (1)    kb = 3 kcal/(mol Å²), b0 = 3.8 Å
Angle              Flexible tails (Cα model)   Eq. (2)    kθ = 3 kcal/mol, θ0 = 1.94 rad (111.154°)
Volume exclusion   Outer capsid (12C model)    Eq. (3)    k = 3 kcal/(mol Å²), d0 = 35.0 Å
Volume exclusion   Inner capsid (2Cα model)    Eq. (3)    k = 3 kcal/(mol Å²), d0 = 7.6 Å
Volume exclusion   Flexible tails (Cα model)   Eq. (3)    k = 3 kcal/(mol Å²), d0 = 3.8 Å
Volume exclusion   Tails/RNA (Cα/2N)           Eq. (3)    k = 3 kcal/(mol Å²), d0 = 12.5 Å
3.5. PaV: Results

We generated two different models, to determine the energetic consequences of allowing the polycationic protein tails to penetrate deeply into the viral interior versus having them associate predominantly with RNA in the outer regions. The first was achieved with the tail–RNA soft-sphere contact distance d0 = 8 Å, while a larger contact distance (d0 = 12 Å) provides less penetration. We evaluated the electrostatic energies of these two models, finding that deep penetration does, as expected, provide substantial additional stabilization (Devkota et al., 2009). The final model (d0 = 8 Å) is shown in Fig. 18.8. This study also led to a new model for the assembly of icosahedral single-stranded RNA viruses like PaV, which are quite different from bacteriophages. Phage capsids are formed from proteins that interact strongly with one another, so that capsid formation is the first step in viral assembly, and the DNA must be loaded into the empty capsid by an ATP-driven motor. In contrast, protein–protein interactions in PaV are weak, and capsid formation requires the presence of the viral genome. We have suggested
Figure 18.8 Final all-atom model of pariacoto virus. Half of the protein capsid is shown, with all nonhydrogen atoms represented as van der Waals spheres. The RNA model also specifies the coordinates of all nonhydrogen atoms, but only the backbone trace is shown here, for clarity. Some RNA double helices that are part of the dodecahedral cage are clearly seen around the periphery.
that assembly begins with the condensation of the RNA by the polycationic protein tails, and that this compaction leaves the globular protein cores in a spherical shell surrounding the condensate, where their effective concentration is high enough to drive the cooperative association of those globular cores into the mature capsid (Devkota et al., 2009).
ACKNOWLEDGMENTS

Supported by NIH R01-GM70785 to SCH. We are grateful to Robert Tan for the development of yammp and YUP, to Sébastien Lémieux for converting coarse-grained RNA models to all-atom form, and to our collaborators, Jack Johnson and Anette Schneemann, for experimental data and stimulating discussions.
REFERENCES

Ackermann, H.-W., and DuBow, M. S. (1987). Viruses of Prokaryotes. CRC Press, Boca Raton, FL.
Agirrezabala, X., et al. (2005). Structure of the connector of bacteriophage T7 at 8 angstrom resolution: Structural homologies of a basic component of a DNA translocating machinery. J. Mol. Biol. 347, 895–902.
Berendsen, H. J. C., Postma, J. P. M., Vangunsteren, W. F., Dinola, A., and Haak, J. R. (1984). Molecular-dynamics with coupling to an external bath. J. Chem. Phys. 81, 3684–3690.
Berman, H. M., et al. (2000). The protein data bank. Nucleic Acids Res. 28, 235–242.
Bloomfield, V. A. (1991). Condensation of DNA by multivalent cations: Considerations on mechanism. Biopolymers 31, 1471–1481.
Cui, Q., Tan, R. K., Harvey, S. C., and Case, D. A. (2006). Low-resolution molecular dynamics simulations of the 30S ribosomal subunit. Multiscale Model. Simul. 5, 1248–1263.
Devkota, B., et al. (2009). Structural and electrostatic characterization of Pariacoto virus: Implications for viral assembly. Biopolymers 91, 530–538.
Dokland, T., and Murialdo, H. (1993). Structural transitions during maturation of bacteriophage-lambda capsids. J. Mol. Biol. 233, 682–694.
Evilevitch, A., Lavelle, L., Knobler, C. M., Raspaud, E., and Gelbart, W. M. (2003). Osmotic pressure inhibition of DNA ejection from phage. Proc. Natl. Acad. Sci. USA 100, 9292–9295.
Frank, J. (2002). Single-particle imaging of macromolecules by cryo-electron microscopy. Annu. Rev. Biophys. Biomol. Struct. 31, 303–319.
Fuller, D. N., et al. (2007). Measurements of single DNA molecule packaging dynamics in bacteriophage lambda reveal high forces, high motor processivity, and capsid transformations. J. Mol. Biol. 373, 1113–1122.
Granoff, A., and Webster, R. G. (eds.), (1999). Encyclopedia of Virology. Academic Press, San Diego, CA.
Grayson, P., and Molineux, I. J. (2007). Is phage DNA 'injected' into cells-biologists and physicists can agree. Curr. Opin. Microbiol. 10, 401–409.
Hagerman, P. (1988). Flexibility of DNA. Annu. Rev. Biophys. Biophys. Chem. 17, 265–286.
Harvey, S. C., Petrov, A. S., Devkota, B., and Boz, M. B. (2009). Viral assembly: A molecular modeling perspective. Phys. Chem. Chem. Phys. 11, 10553–10564.
Hud, N. V., and Vilfan, I. D. (2005). Toroidal DNA condensates: Unraveling the fine structure and the role of nucleation in determining size. Annu. Rev. Biophys. Biomol. Struct. 34, 295–318.
Jardine, P. J., and Anderson, D. L. (2006). DNA packaging in double-stranded DNA phages. In "The Bacteriophages," (R. Calendar, ed.), 2nd edn., pp. 49–65. Oxford University Press, Oxford.
Jeembaeva, M., Castelnovo, M., Larsson, F., and Evilevitch, A. (2008). Osmotic pressure: Resisting or promoting DNA ejection from phage? J. Mol. Biol. 381, 310–323.
Jiang, W., et al. (2006). Structure of epsilon15 bacteriophage reveals genome organization and DNA packaging/injection apparatus. Nature 439, 612–616.
Johnson, J. E., and Chiu, W. (2007). DNA packaging and delivery machines in tailed bacteriophage. Curr. Opin. Struct. Biol. 17, 237–243.
Knobler, C. M., and Gelbart, W. M. (2009). Physical chemistry of DNA viruses. Annu. Rev. Phys. Chem. 60, 367–383.
Lander, G. C., et al. (2006). The structure of an infectious P22 virion shows the signal for headful DNA packaging. Science 312, 1791–1795.
Lander, G. C., et al. (2008). Bacteriophage lambda stabilization by auxiliary protein gpD: Timing, location, and mechanism of attachment determined by cryo-EM. Structure 16, 1399–1406.
Locker, C. R., and Harvey, S. C. (2006). A model for viral genome packing. Multiscale Model. Simul. 5, 1264–1279.
Locker, C. R., Fuller, S. D., and Harvey, S. C. (2007). DNA organization and thermodynamics during viral packaging. Biophys. J. 93, 2861–2869.
MacKerell, A. D., Jr., Banavali, N., and Foloppe, N. (2000). Development and current status of the CHARMM force field for nucleic acids. Biopolymers 56, 257–265.
Malhotra, A., Tan, R. K., and Harvey, S. C. (1994). Modeling large RNAs and ribonucleoprotein particles using molecular mechanics techniques. Biophys. J. 66, 1777–1795.
Parisien, M., and Major, F. (2008). The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 452, 51–55.
Parsegian, V. A., Rand, R. P., and Rau, D. C. (1995). Macromolecules and water: Probing with osmotic stress. Methods Enzymol. 259 (Energetics of Biological Macromolecules), 43–94.
Parsegian, V. A., Rand, R. P., and Rau, D. C. (2000). Osmotic stress, crowding, preferential hydration, and binding: A comparison of perspectives. Proc. Natl. Acad. Sci. USA 97, 3987–3992.
Pearlman, D. A., et al. (1995). AMBER: A computer program for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to elucidate the structures and energies of molecules. Comput. Phys. Commun. 91, 1–41.
Petrov, A. S., and Harvey, S. C. (2007). Structural and thermodynamic principles of viral packaging. Structure 15, 21–27.
Petrov, A. S., and Harvey, S. C. (2008). Packaging double-helical DNA into viral capsids: Structures, forces, and energetics. Biophys. J. 95, 497–502.
Petrov, A. S., Lim-Hing, K., and Harvey, S. C. (2007a). Packaging of DNA by bacteriophage epsilon15: Structure, forces, and thermodynamics. Structure 15, 807–812.
Petrov, A. S., Boz, M. B., and Harvey, S. C. (2007b). The conformation of double-stranded DNA inside bacteriophages depends on capsid size and shape. J. Struct. Biol. 160, 241–248.
Plimpton, S. (1995). Fast parallel algorithms for short-range molecular dynamics. J. Comp. Phys. 117, 1–19.
Purohit, P. K., et al. (2005). Forces during bacteriophage DNA packaging and ejection. Biophys. J. 88, 851–866.
Rau, D. C., and Parsegian, V. A. (1992). Direct measurement of the intermolecular forces between counterion-condensed DNA double helices. Biophys. J. 61, 246–259.
Rau, D. C., Lee, B., and Parsegian, V. A. (1984). Measurement of the repulsive force between polyelectrolyte molecules in ionic solution: Hydration forces between parallel DNA double helices. Proc. Natl. Acad. Sci. USA 81, 2621–2625.
Rollins, G. C., Petrov, A. S., and Harvey, S. C. (2008). The role of DNA twist in the packaging of viral genomes. Biophys. J. 94, L38–L40.
Rotkiewicz, P., and Skolnick, J. (2008). Fast procedure for reconstruction of full-atom protein models from reduced representations. J. Comput. Chem. 29, 1460–1465.
Shepherd, C. M., et al. (2006). VIPERdb: A relational database for structural virology. Nucleic Acids Res. 34, D386–D389.
Smith, D. E., et al. (2001). The bacteriophage phi29 portal motor can package DNA against a large internal force. Nature 413, 748–752.
Spakowitz, A. J., and Wang, Z. G. (2005). DNA packaging in bacteriophage: Is twist important? Biophys. J. 88, 3912–3923.
Tan, R. K. Z., and Harvey, S. C. (1989). Molecular mechanics model of supercoiled DNA. J. Mol. Biol. 205, 573–591.
Tan, R. K.-Z., and Harvey, S. C. (1993). Yammp: Development of a molecular mechanics program using the modular programming method. J. Comput. Chem. 14, 455–470.
Tan, R. K.-Z., Sprous, D., and Harvey, S. C. (1996). Molecular dynamics simulations of small DNA plasmids: Effects of sequence and supercoiling on intramolecular motions. Biopolymers 39, 259–278.
Tan, R. K., Petrov, A. S., and Harvey, S. C. (2006). YUP: A molecular simulation program for coarse-grained and multi-scaled models. J. Chem. Theory Comput. 2, 529–540.
Tan, R. K.-Z., Petrov, A. S., Devkota, B., and Harvey, S. C. (2009). Coarse-grained models for nucleic acids and large nucleoprotein assemblies. In "Coarse-Graining of Condensed Phase and Biomolecular Systems," (G. A. Voth, ed.), pp. 225–236. CRC Press, Boca Raton, FL.
Tang, L., et al. (2001). The structure of pariacoto virus reveals a dodecahedral cage of duplex RNA. Nat. Struct. Biol. 8, 77–83.
Tihova, M., et al. (2004). Nodavirus coat protein imposes dodecahedral RNA structure independent of nucleotide sequence and length. J. Virol. 78, 2897–2905.
Tzlil, S., Kindt, J. T., Gelbart, W. M., and Ben-Shaul, A. (2003). Forces and pressures in DNA packaging and release from viral capsids. Biophys. J. 84, 1616–1627.
Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31, 3406–3415.
C H A P T E R
N I N E T E E N
ROSETTA3: An Object-Oriented Software Suite for the Simulation and Design of Macromolecules

Andrew Leaver-Fay,* Michael Tyka,† Steven M. Lewis,* Oliver F. Lange,† James Thompson,† Ron Jacak,* Kristian W. Kaufmann,‡ P. Douglas Renfrew,§ Colin A. Smith,} Will Sheffler,† Ian W. Davis,k Seth Cooper,** Adrien Treuille,†† Daniel J. Mandell,} Florian Richter,‡‡‡ Yih-En Andrew Ban,‡‡ Sarel J. Fleishman,† Jacob E. Corn,† David E. Kim,† Sergey Lyskov,§§ Monica Berrondo,}} Stuart Mentzer,kk Zoran Popović,k James J. Havranek,*** John Karanicolas,††† Rhiju Das,§§§ Jens Meiler,‡ Tanja Kortemme,} Jeffrey J. Gray,§§ Brian Kuhlman,* David Baker,† and Philip Bradley}}}

Contents
1. Introduction                               546
2. Requirements                               548
   2.1. Preserving existing functionality     548
   2.2. Generality requirements               548
* Department of Biochemistry, University of North Carolina, Chapel Hill, North Carolina, USA
† Department of Biochemistry, University of Washington, Seattle, Washington, USA
‡ Department of Chemistry, Vanderbilt University, Nashville, Tennessee, USA
§ Center for Genomics and Systems Biology, New York University, New York, USA
} University of California, San Francisco, California, USA
k GrassRoots Biotechnology, Durham, North Carolina, USA
** Department of Computer Science, University of Washington, Seattle, Washington, USA
†† Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
‡‡ Arzeda Corporation, Seattle, Washington, USA
§§ Chemical & Biomolecular Engineering and the Program in Molecular Biophysics, Johns Hopkins University, Baltimore, Maryland, USA
}} Rosetta Design Group, Fairfax, Virginia, USA
kk Objexx Engineering, Boston, Massachusetts, USA
*** Washington University, St. Louis, Missouri, USA
††† Center for Bioinformatics and Department of Molecular Biosciences, University of Kansas, Lawrence, Kansas, USA
‡‡‡ Interdisciplinary Program in Biomolecular Structure & Design, University of Washington, Seattle, Washington, USA
§§§ Stanford University, Stanford, California, USA
}}} Fred Hutchinson Cancer Research Center, Seattle, Washington, USA

© 2011 Elsevier Inc. All rights reserved.
Methods in Enzymology, Volume 487
ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87019-9
545
546
Andrew Leaver-Fay et al.
   2.3. Code quality requirements             549
   2.4. Speed requirements                    550
3. Design Decisions                           550
   3.1. Object-oriented architecture          550
   3.2. Residue centrality                    551
   3.3. Pose                                  553
   3.4. Scoring                               554
4. Architecture                               554
   4.1. core library                          555
   4.2. core::chemical                        555
   4.3. core::kinematics                      556
   4.4. core::conformation                    557
   4.5. core::pose                            558
   4.6. core::scoring                         559
   4.7. core::optimization                    565
   4.8. core::pack                            566
   4.9. protocols Library                     567
   4.10. protocols::moves                     568
   4.11. JobDistributor                       569
   4.12. protocols::loops                     570
   4.13. Protocols from text files            570
5. Conclusion                                 571
Acknowledgments                               572
References                                    572
Abstract

We have recently completed a full rearchitecturing of the ROSETTA molecular modeling program, generalizing and expanding its existing functionality. The new architecture enables the rapid prototyping of novel protocols by providing easy-to-use interfaces to powerful tools for molecular modeling. The source code of this rearchitecturing has been released as ROSETTA3 and is freely available for academic use. At the time of its release, it contained 470,000 lines of code. Counting currently unpublished protocols at the time of this writing, the source includes 1,285,000 lines. Its rapid growth is a testament to its ease of use. This chapter describes the requirements for our new architecture, justifies the design decisions, sketches out central classes, and highlights a few of the common tasks that the new software can perform.
1. Introduction

The ROSETTA molecular modeling suite has proved useful in solving a wide variety of problems in structural biology (Das and Baker, 2008; Kaufmann et al., 2010; Table 19.1). ROSETTA was initially written in
Table 19.1 Some representative applications available within the ROSETTA molecular modeling suite

Application name       Brief description
AbinitioRelax          Predict the structure of a protein from its sequence (Bonneau et al., 2001, 2002; Bradley et al., 2005; Das et al., 2007; Raman et al., 2009; Simons et al., 1997)
enzdes                 Design a protein active site to catalyze a chemical reaction (Jiang et al., 2008; Rothlisberger et al., 2008; Zanghellini et al., 2006)
FixedBBProteinDesign   Redesign the amino acids on a fixed protein backbone (Dantas et al., 2003; Kortemme et al., 2004; Kuhlman and Baker, 2000)
protein_docking        Predict the docked conformation of two proteins with a known structure (Gray et al., 2003; Wang et al., 2005)
ligand_docking         Predict the orientation in which a small molecule binds to a protein (Davis and Baker, 2009; Kaufmann et al., 2008; Meiler and Baker, 2006)
loop_modeling          Predict the conformation of a set of protein loops (Mandell et al., 2009; Rohl et al., 2004)
rna_denovo             Predict the folded structure of an RNA molecule given its sequence (Das and Baker, 2007)
rna_design             Design a new sequence for an RNA molecule (Das et al., 2010)

FORTRAN77, as two separate programs for protein structure prediction (Simons et al., 1997) and for protein design (Kuhlman and Baker, 2000); the two were then merged, mechanically ported to C++, and refactored for several years thereafter. The code base has been in upheaval through the majority of its existence. Three years ago, we began a complete rewrite to recenter the program using modern software design principles. The final product, like its predecessor, remains in a state of flux; however, several core modules have solidified to provide a reliable foundation on which to build new protocols for macromolecular modeling. This document attempts to describe these central modules in the way one might describe industrial software: in terms of requirements, design decisions, and architecture. It provides the necessary background for constructing new modeling simulations using these library modules. We close the chapter with a concrete example of one such simulation. The new architecture has enabled a rapid expansion in ROSETTA's functionality. In addition to providing a solid foundation on which many new protocols have been built, the new architecture has enabled functionality
that would have been virtually impossible in ROSETTA2, including Python bindings for all ROSETTA classes (Chaudhury et al., 2010) and an interactive game, FOLDIT, which challenges users to predict a protein’s structure (Cooper et al., 2010).
2. Requirements

The driving requirements for our reimplementation of ROSETTA can be categorized into four major groups. Our new code should preserve the existing functionality. It should generalize that functionality to enable expansion. It should adhere to certain code-quality standards to enable new execution pathways. Finally, it should be fast.
2.1. Preserving existing functionality

Our new implementation needed to recreate the existing ROSETTA functionality. In particular, we required the new implementation to faithfully reproduce the terms in ROSETTA's score function (Rohl et al., 2004). We required that it reproduce the central algorithms: gradient-based minimization, rotamer packing/protein design, Monte Carlo conformational search, and ROSETTA's efficient reuse of scores when rescoring a structure that has changed very little. It needed to update a structure's Cartesian coordinates following changes to its internal degrees of freedom (DOFs; e.g., to a protein's backbone dihedral angles). Finally, it had to allow for user-defined restraints (a.k.a. "constraints" in ROSETTA jargon) between arbitrary groups of atoms.
2.2. Generality requirements

Beyond ensuring that ROSETTA3 was capable of performing the same functions as ROSETTA2, we required that it be more general on several levels so that it could be applied to new challenges in computational structural biology. (1) It should be able to represent new chemical moieties. (2) It should be amenable to the addition of new energy terms. (3) It should encourage the development of new algorithms. The implementations for these three aspects of the code should be as loosely coupled as possible to minimize the amount of work necessary to expand in one direction; adding a new term to the score function should require no updates to the chemical representation of structures, or to the algorithms used to evaluate that term on structures (Fig. 19.1). We required that the system be allowed to change its chemical composition at any point during a simulation; the new software could make no
Figure 19.1 Generality Wheel. Expanding ROSETTA’s functionality in one area (Energy Terms, Chemical Composition, or Algorithms) should not require an expansion to the other areas. The areas should be protected from each other through the use of generic interfaces.
assumptions that the sequence composition or length be fixed. We further required generality within protocols, such that they be nestable within one another. Moreover, we wanted to decouple job distribution from the protocols themselves so that protocols could be run in any one of several job-management environments (e.g., desktop computer, commodity cluster, distributed computing environment, and supercomputer).
2.3. Code quality requirements

In addition to requiring the new code to perform new computational tasks and broach new problems in macromolecular modeling, we also required that we be able to perform these tasks in novel ways. We wanted to ensure that ROSETTA could be executed in a multithreaded environment, where multiple threads execute simultaneously, working with separate structures and score functions, without corrupting one another's data. As a consequence, ROSETTA could not rely on nonconstant shared data (e.g., a global array containing the coordinates of the current structure, or a global score function). Furthermore, we were interested in enforcing code-quality requirements for the purpose of ensuring the greatest reusability of our code. Consider a piece of code, P, written to perform some task, T, in some context, C; P's reusability can be measured as the number of other contexts besides C in which P can perform T. While it is impossible to list all the alternate contexts in which a piece of code should be able to function, reusable code contains certain identifiable features, and so we imposed requirements on our code that it should contain these features. In particular:
- Reusable code is clearly written, with descriptive variable names and function names, and with comments describing the behavior of the classes and functions, so that users understand what will happen when invoking a particular function.
- Reusable code is factored into its component pieces, resulting in short functions and small classes with well-defined responsibilities, so that users can pick out just the pieces of functionality they are interested in reusing.
- Reusable code is easy to use and hard to misuse, because code that frustrates developers does not get reused.
2.4. Speed requirements

To ensure that our code was absolutely as fast as it could be, we required certain specific features of our algorithms and code.

Score function evaluation: scoring an N-residue structure, assuming there are no long-range energy terms, should proceed in O(N) time. Scoring a long-range energy term defined over M pairs of atoms should proceed in O(M lg M) time; that is, if M ∈ O(N), then long-range energy evaluation should be only logarithmically more expensive than short-range energy evaluation.

Kinematics: the number of coordinate update operations should be minimal and transparent; the user should never be allowed to access out-of-date coordinate data. Together, these requirements suggest using a just-in-time (lazy) coordinate update algorithm. Furthermore, updating the coordinates for k atoms following a set of changes to m internal DOFs should take O(k + m) time.

General: energy and coordinate calculations should be performed at double precision, since our gradient-based minimization techniques converge after fewer score-function evaluations at higher precision. Calls to new and delete should be avoided in performance-sensitive code. Finally, function calls in innermost loops should be inlined to the greatest extent possible.
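The just-in-time coordinate update can be illustrated with a toy 2-D kinematic chain (our own sketch, not the ROSETTA AtomTree): a DOF change merely sets a dirty flag, and the public accessor rebuilds the Cartesian coordinates before returning them, so stale coordinates are never observable.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Toy 2-D chain with lazy coordinate updates: internal DOFs are bond
// lengths (fixed) and accumulating bend angles; Cartesian coordinates
// are rebuilt only on demand.
class LazyChain {
public:
    explicit LazyChain(std::vector<double> bondLengths)
        : lengths_(std::move(bondLengths)), angles_(lengths_.size(), 0.0) {}

    void setAngle(std::size_t i, double radians) {  // DOF change: O(1), no rebuild yet
        angles_[i] = radians;
        dirty_ = true;
    }
    const std::vector<std::pair<double, double>>& xyz() {
        if (dirty_) { rebuild(); dirty_ = false; }  // refresh only when needed
        return coords_;
    }
private:
    void rebuild() {
        coords_.assign(1, {0.0, 0.0});
        double heading = 0.0;
        for (std::size_t i = 0; i < lengths_.size(); ++i) {
            heading += angles_[i];                  // angles accumulate along the chain
            const auto last = coords_.back();
            coords_.push_back({last.first + lengths_[i] * std::cos(heading),
                               last.second + lengths_[i] * std::sin(heading)});
        }
    }
    std::vector<double> lengths_, angles_;
    std::vector<std::pair<double, double>> coords_;
    bool dirty_ = true;
};
```

Because the only way to read coordinates is through xyz(), a caller cannot observe the structure between a DOF change and the corresponding refresh; a production version would additionally rebuild only the subtree downstream of the changed DOFs to meet the O(k + m) bound.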
3. Design Decisions

In response to the requirements for our new software, we made a series of decisions that shaped its design. This section lays out the rationale for some of the most important decisions, connecting these decisions to the requirements they were meant to address.
3.1. Object-oriented architecture

Our earliest design decision was that we would follow object-oriented design principles in the creation of our new software. There are two prominent features of object-oriented programs that we sought to take advantage of: the encapsulation of data within classes and the pairing of data
and algorithms through polymorphic lookup (virtual functions). Data encapsulation is arguably the most important advance in software design since the advent of high-level programming languages. Classes encapsulate data through a compiler-provided mechanism of privacy: code that is outside of a class is unable to read from or write to private data inside of a class. Instead, classes gate access to private data through public function calls. The use of gating functions allows a class to enforce data-integrity rules that might otherwise be broken if external code were able to change the class's data without its knowledge. The fact that classes assume responsibility for their data frees the remaining code in the program from the responsibility of maintaining that data. Global variables, in contrast, inflict their integrity-maintenance requirements on the entire program. Every new line of code has to be aware of and respect the data-integrity requirements stemming from the program's global variables. The more global variables a program contains, the more complicated extending that program becomes. Global variables restrict the alternate contexts in which a working piece of code can be harnessed; they make code hard to use and easy to misuse. Indeed, ROSETTA2's reliance on global variables motivated our rearchitecturing more than any other factor. Finally, it should be mentioned that multithreaded applications are significantly easier to write when data is held in classes instead of global variables. In addition to pursuing an object-oriented architecture, we also decided to impose "const-correctness" requirements on the classes we created. C++ compilers enforce the idea that, if an instance of a class is const, then its nonconst member functions may not be called, and the data inside the instance may not be modified. The primary benefit of const-correct code is speed.
Class A holding an instance of class B can provide read-only access to B by delivering a “B const &” instead of delivering a copy of B (which would be slow). As a secondary benefit, const-correctness makes code hard to misuse, as the const status of an object conveys which function calls are appropriate and which are not.
3.2. Residue centrality
Two requirements—the preservation of ROSETTA's protein design functionality, and the desire to easily incorporate new chemical moieties—quickly led to an early design decision that shaped much of our new implementation: ROSETTA3 would be "residue centric." This decision manifested in two ways: all atoms in a molecular system would be represented within residues (Fig. 19.2) and residues would be the unit for scoring. To justify this design decision, the remainder of this section introduces the fundamental concepts behind ROSETTA's protein-design module, the packer. In the packer, the task of designing a new sequence is accomplished by building new amino acids onto a fixed-protein-backbone scaffold.
Andrew Leaver-Fay et al.
Figure 19.2 Pose architecture. The components of the Pose class are illustrated for the case of a simple eight-residue system consisting of a two base-pair DNA duplex (residues 1–4) and a protein segment (residues 5–8). Conformational and chemical information are stored within the Conformation class as Residue objects (coordinates) with pointers to ResidueTypes (chemistry); the AtomTree class records the kinematic connectivity (the mapping between internal and Cartesian coordinates). Energies from the most recent evaluation of the scoring function are stored in the Energies class, which holds residue–residue interactions in the EnergyGraph. Finally, user-defined coordinate restraints are stored in the ConstraintSet, and additional Pose-associated data can be stored in the DataCache, where it will be copied along with the Pose during simulations.
Each design task is modeled as a combinatorial optimization problem where the optimal solution is the sequence and structure of side chains built upon the scaffold that minimizes the score function. At each residue, i, the algorithm considers a set of rotamers (Ponder and Richards, 1987), Si, which represent one or more amino acid types. The algorithm then searches for the vector assignment of rotamers to the backbone, s, where the assignment to residue i, si ∈ Si. The rotamer-vector search space is the Cartesian product of the individual rotamer spaces: S = ∏i Si. This problem is NP-complete (Pierce and Winfree, 2002). ROSETTA's design algorithm searches for a low-energy (if suboptimal) rotamer assignment using a Monte Carlo with simulated annealing approach (Kuhlman and Baker, 2000).
Starting from a particular rotamer vector s, it computes the change in the score, ΔE, induced by substituting the rotamer at residue i, a = si, with a new rotamer b ∈ Si. The computed ΔE is then fed to the Metropolis criterion, leading either to the acceptance or rejection of the rotamer substitution. For a typical design simulation, several million rotamer substitutions are considered. At the conclusion of simulated annealing, the design algorithm replaces residues from the input structure with the new rotamers it has selected. The task of computing ΔE for a given rotamer substitution suggests the utility of a residue-centric design. Design software that relies on a pairwise-decomposable energy function (as almost all design software does; Chowdry et al., 2007; Dahiyat and Mayo, 1996; Desmet et al., 1992; Hellinga et al., 1991) typically pretabulates rotamer-pair-interaction energies; that is, the interaction energies for the rotamers on residue i and the rotamers on residue j are computed before rotamer search begins and stored in a table Eij of size |Si| × |Sj|. When computing ΔE for replacing rotamer a on residue i with rotamer b, the interaction energies for both a and b may be looked up from the set of tables holding residue i's interactions with its neighbors. With residues as the unit of scoring, residue-pair energies can easily be pretabulated. Moreover, we required the new code to be sufficiently general to perform fixed-backbone-like design on nucleic acids and small molecules in addition to proteins. The analogy from designing protein residues to designing nucleic acid residues is so straightforward that it barely qualifies as an analogy; instead of building amino-acid rotamers at the design positions, the algorithm would build nucleic-acid rotamers. Each downstream step in the design process (energy pretabulation, simulated annealing, and residue replacement) could be identical. Design of RNA using ROSETTA3 has already been tested (Das et al., 2010).
The analogy from designing protein residues to designing small molecules is similar; the algorithm would need to build small-molecule rotamers, whatever they might look like. For the design algorithm to handle amino acids, nucleic acids, and small molecules uniformly, it is best to represent each class of molecule in the same fashion; ergo, we would represent all chemical entities with residues.
3.3. Pose
Following the residue-centrality decision, we sought to define the complete state of a molecular system within a single container class, termed a Pose. The Pose would be responsible for holding a set of residues and the result of a score-function evaluation on those residues: class Pose (Section 4.5) would hold a Conformation object (Section 4.4) and an Energies object (Section 4.6.3; see Fig. 19.2). By storing the scores with the structure, it would be possible to reuse certain scores from the last score-function evaluation when rescoring a structure. In addition to holding a structure and
its energies, class Pose would also be responsible for holding generic data for presently unforeseen purposes; as a generic container, the Pose would allow protocol developers to pair information relevant for a particular structure with that structure. Then, when copying a Pose, all the information that is relevant for that Pose would be copied with it. To recreate ROSETTA2's Monte Carlo mechanism, we would thus rely on Pose copy operations: the history of how a particular structure came into being could be recorded in a Pose during the course of its trajectory, and would be automatically copied along with the structural and energetic information.
3.4. Scoring
With the residue-centrality design decision (Section 3.2), residues would be the unit of scoring in ROSETTA. We extended this idea slightly by imposing a decomposition of the score-function terms into those that are defined on residues, those that are defined on residue pairs, and those that are defined on entire structures, and by representing this decomposition through a class hierarchy (Section 4.6.2). Including a new term in ROSETTA requires finding the appropriate base class from which to derive and implementing the interface to that class. As a consequence, once a new term is incorporated into this class hierarchy, it can be used in scoring, packing, and minimizing, and in any future algorithm that relies only on the class interfaces. This class hierarchy separates energy terms from the algorithms that use those terms, making it easy to add new terms and new algorithms. Currently, ROSETTA developers are experimenting with 174 terms, evidence that adding new terms is quite easy.
4. Architecture
The remainder of this chapter describes the layout of ROSETTA's classes and further sketches the rationale for the way we have organized data and algorithms. At its highest level, ROSETTA is composed of three sets of libraries: (a) a core library that defines structures and supports structure I/O, scoring, packing, and minimization; (b) a protocols library that consists of common structural modifications one might wish to make to a structure, and a means to control the distribution of jobs; and (c) several utility libraries that collect common data structures (a 1-indexed container, an owning pointer class, an optimized graph class) and numeric subroutines (vector and matrix classes, random number generators). Individual executables link against these libraries, allowing a protocol writer to rapidly prototype by creating a new
executable rather than modifying the single monolithic executable, a design flaw in ROSETTA2. Code is organized so that libraries and namespaces (C++ namespaces provide a mechanism for grouping related class and function names) mirror the directory structure, thus making it easy to find code. The top-level directory src/ (source) contains directories utility/, numeric/, core/, protocols/, and devel/, each corresponding to its own library. It also contains an apps/ directory, in which executables with main() functions live; apps is not linked as a library. Each library corresponds to a top-level namespace. Any subdirectory of a library directory corresponds to a nested namespace. Classes are generally declared and defined in files with the same name and a .hh and .cc extension. For example, class ScoreFunction is declared in src/core/scoring/ScoreFunction.hh and defined in src/core/scoring/ScoreFunction.cc. It lives in namespace core::scoring. Dividing up code into namespaces allows class writers to communicate their purpose by association, and avoids the name-collision problems that are common in large software projects.
4.1. core library
Namespace core contains data structures and algorithms for describing macromolecules chemically (Section 4.2) and structurally (Section 4.3), for scoring macromolecular conformations (Section 4.6), and for optimizing these conformations with two common techniques: minimizing (Section 4.7) and packing (Section 4.8).
4.2. core::chemical
The main class housed in the chemical namespace is ResidueType. A ResidueType describes the chemical connectivity of a single, abstract residue type. Every instance of alanine in a single structure will point to the alanine ResidueType; this ensures a minimal memory footprint by avoiding redundant representations. Class ResidueType lists the set of atoms, their names, their elements, their atom types, and their set of intraresidue chemical bonds, and notes which atoms are able to form interresidue chemical bonds. Each atom in a ResidueType must have a unique name. Two residue types are different if they have different chemical bonds; this means the two tautomers of histidine (where either ND1 or NE2 is protonated) are represented by two different ResidueTypes. To change the tautomerization state of a structure is to change its chemical identity. Similarly, the N- and C-terminal variants of each of the 20 amino acids must be represented by separate ResidueTypes from the "mid"
variants since they have different atom counts and different interresidue connection capacities. To avoid defining by hand each of the 60 additional variants for the 20 amino acids (the N-terminal variant, the C-terminal variant, and the free-amino-acid variant), we have implemented a system for patching residue types similar to the one used by CHARMM (Brooks et al., 2009). ResidueType is also responsible for defining two features that are used to determine "neighborness" of residue pairs: it nominates one of its atoms as its "neighbor atom," the coordinate of which is used in neighbor detection; and it defines a "neighbor radius," measured as the longest possible distance from the neighbor atom to any other heavy atom in the residue under all possible assignments of dihedral angles. Neighbor detection is discussed further in Sections 4.6.2 and 4.6.3. To hold the set of ResidueTypes, the chemical namespace also houses class ResidueTypeSet; each ResidueType must have a unique name among the other ResidueTypes belonging to the same ResidueTypeSet. ROSETTA protocols often rely on multiple residue type sets, the most common being the "centroid," or low-resolution, residue type set and the "fullatom" residue type set. The protein design subroutines (Section 4.8) consider alternate ResidueTypes for an existing residue by requesting the list of all available ResidueTypes from the ResidueTypeSet of the existing residue. Because the fullatom and centroid residue type sets are both represented with a single class, the design subroutines are capable of performing both full-atom and centroid design.
4.3. core::kinematics
ROSETTA uses an internal-coordinate representation of a molecular system when performing updates to the conformation. Thus, the primary DOFs during Monte Carlo perturbations and gradient-based minimization are dihedral angles about rotatable bonds and rigid-body transformations between subunits, rather than the xyz coordinates of individual atoms common in molecular dynamics simulations (bond lengths and angles can also be included but are typically held fixed). The internal-coordinate representation makes possible a dramatic reduction in the number of DOFs, permitting efficient exploration of conformational space. At the same time, it introduces a complication into the process of specifying a molecular system, namely that one must explicitly define the path by which internal-coordinate changes propagate through the system. In the case of flexible-backbone protein docking, for example, the kinematics can be specified by the choice of an anchor residue in each partner. Changes to the six rigid-body DOFs modify the relative orientation of these two
anchor residues; changes to the dihedral angles of the monomers preserve the relative orientation of the anchors, while the coordinates of the monomers are updated by folding the chains outward from the anchor points. In general, the kinematic connectivity of any molecular system can be specified by defining a tree (a connected, acyclic graph) whose nodes correspond to the atoms in the system and whose edges represent kinematic connections. We generate Cartesian coordinates for the system by starting at a defined root node and traversing outward (downstream) to the leaves. The internal coordinates of the system are mapped onto the individual atoms, with most atoms storing a bond length, bond angle, and dihedral angle relative to a reference frame defined by three upstream atoms. Where rigid-body connections between subunits are needed, a second flavor of atom is introduced which stores a full rigid-body rotation and translation between reference frames defined on the two partners. This kinematic tree of atoms is referred to as an AtomTree (Abagyan et al., 1994). To simplify the process of specifying the AtomTree, a user can define a residue-level tree of kinematic connectivity, termed the FoldTree, in which nodes correspond to individual residues rather than atoms. The AtomTree can be built automatically from the FoldTree. For efficient scoring, the AtomTree tracks which DOFs have changed since the last score-function evaluation; if two residues have not moved with respect to each other, then some of their interaction energies from the last score-function evaluation may be reused (Section 4.6.2). To communicate which residues have moved, the AtomTree creates a "coloring" of the residues in the structure (wherein each residue is assigned an integer) through a recursive traversal of the tree; if two residues have not moved with respect to each other, they are assigned the same color in this traversal.
If a residue has undergone a change to its internal DOFs, then it is assigned the color zero, signaling to the scoring machinery that none of its old energies may be reused in the next score evaluation. The data structure for holding the coloring is called the DomainMap.
4.4. core::conformation
The conformation layer describes the physical instantiation of a macromolecule; its main class, class Residue, contains the coordinate information (both Cartesian and internal) for a single residue. It contains none of the chemical information needed to describe the residue; rather, it keeps a pointer to the ResidueType (Section 4.2) of which it is an instance. Class Residue also holds all the information about the interresidue chemical bonds that it forms to other residues. Such information would not be appropriately held in the chemical layer.
A full structure is represented by a Conformation object, which is composed of a set of Residues and an AtomTree. As mentioned in Section 2.4, the user should never be able to access out-of-date coordinate information. The Conformation handles this responsibility by controlling access to the Residues it contains; it provides a set of mutator methods for setting Cartesian and internal coordinates so that it can shuttle these changes between the Residue objects and the AtomTree nodes efficiently, and it allows efficient read access (const access) to its Residues. By disallowing nonconst access to its Residue objects, it ensures data integrity; only the Conformation has permission to modify its Residues. The central classes of the chemical, conformation, and kinematic namespaces and their relationships are illustrated in Fig. 19.2.
4.5. core::pose
Class Pose, as described in Section 3.3, represents the complete state of a molecular system; it stores a Conformation object, an Energies object (Section 4.6.3), a ConstraintSet (Section 4.6.5), and a generic DataCache container (Fig. 19.2). When users copy a Pose, they copy all relevant information for that structure. The MonteCarlo object (Section 4.10) relies on the Pose's copy operations to keep track of the best-scoring structure encountered in a trajectory. We have two levels of observers for Pose objects: active and passive observers. The passive observers deliver just-in-time information about a Pose; these are the PoseMetric classes. A PoseMetric will report a certain property of a Pose that is pertinent for decision-making about the structure but which may be slow to compute, for example, its solvent-accessible surface area. The PoseMetric observes the Pose so that, if the Pose has not changed since the last time the property was calculated, the PoseMetric can report the previously calculated value. We have a second set of active observers that respond immediately to particular events, for example, residue insertion or deletion events. Such observers are commonly used to maintain residue-mapping information while a Pose is undergoing a series of residue insertions or deletions. For example, a Constraint (Section 4.6.5) between residue 5 and residue 50 needs to be remapped when residue 39 is deleted so that it applies to residue 5 and residue 49. These active observers allow protocol writers to perform residue insertions and deletions without having to provide an additional residue-remapping interface, and they allow users to rely on these protocols even when they would like to maintain application-specific residue-mapping information.
4.6. core::scoring
Namespace core::scoring contains the many classes that define and evaluate ROSETTA's score function. The key classes in this namespace are ScoreFunction, EnergyMethod, and Energies.

4.6.1. ScoreFunction as a container
The score for a structure is the weighted sum of the component energies. A ScoreFunction object holds a set of weights and a set of classes (EnergyMethods) that are able to evaluate the energies and derivatives for the components with nonzero weight. A score function may be evaluated on a Pose and will return its score:

core::scoring::ScoreFunction sfxn;
core::pose::Pose p;
... // initialization
double the_score = sfxn( p );
The ScoreFunction acts as a container, making it easy to pass the active components into subroutines that require a score function to guide their behavior (e.g., packing or minimizing) but that are indifferent to which components are active. The ScoreFunction holds its EnergyMethods in seven lists (one for each of the direct base classes in Fig. 19.3), and when scoring a Pose, iterates across each list to request each EnergyMethod evaluate the energies for certain residues and/or residue pairs in the Pose. Unlike ROSETTA2, there is not a singular (global) score function that is active. Separate threads will instantiate separate ScoreFunction instances; subroutines and classes that rely on computing the score will be given ScoreFunction objects to use. The terms available in ROSETTA’s score function are listed in the ScoreType enumeration. Each element in this enumeration corresponds to one term. EnergyMethods are allowed to compute more than one term at a time; they place the scores they calculate into an object of type EnergyMap, which contains an array with one double for each element in the ScoreType enumeration. The ScoreFunction and its EnergyMethods communicate through EnergyMaps. The total score is simply the dot product between the unweighted-energies vector and the weights vector. To activate a term in a ScoreFunction, a user needs merely to set the weight for that component to a nonzero value. For example, the call to sfxn.set_weight(fa_atr,0.8) would trigger the activation of the attractive portion of the Lennard-Jones energy. Behind the scenes, the ScoreFunction fetches an instance of the EnergyMethod that is responsible for evaluating the fa_atr ScoreType and stores that EnergyMethod. The class responsible for doling out EnergyMethods to ScoreFunctions is the ScoringManager; the ScoringManager maintains a map from ScoreTypes to EnergyMethodCreators, each of
Figure 19.3 EnergyMethod class hierarchy. The first level divides the one-body (1B), two-body (2B), and whole-structure (WS) energies. The second level divides the two-body energies into short-ranged (S2) and long-ranged (L2). The final level divides context-dependent (CD) from context-independent (CI) energy methods. The seven classes in gray are the direct base classes for concrete energy methods; for example, the HydrogenBondEnergy derives from the CDS2 class, as it is context-dependent, short-ranged, and two-body.
which is responsible for instantiating a particular EnergyMethod. EnergyMethodCreators register with the ScoringManager at load time (not compile time), thereby allowing definition of EnergyMethods outside of core (e.g., in protocols or devel).

4.6.2. Energy method class hierarchy
There are 12 abstract EnergyMethod classes, seven of which are intended for deriving concrete energy methods (Fig. 19.3). The ScoreFunction treats each of the seven classes differently when it comes to score evaluation and bookkeeping. At the top of the hierarchy is the EnergyMethod class. Three classes derive from it directly: OneBodyEnergy, TwoBodyEnergy, and WholeStructureEnergy. These classes represent energy functions that are defined on single residues, on residue pairs, or on entire structures. Derived OneBodyEnergy classes implement a method:

void residue_energy(
    conformation::Residue const & res,
    pose::Pose const & p,
    ScoreFunction const & sfxn,
    EnergyMap & emap
) const;
and derived TwoBodyEnergy classes implement a method
void residue_pair_energy(
    conformation::Residue const & res1,
    conformation::Residue const & res2,
    pose::Pose const & p,
    ScoreFunction const & sfxn,
    EnergyMap & emap
) const;
The presence of a Pose in the interfaces allows context-dependent EnergyMethods to use the Pose for context (see below). The presence of the ScoreFunction in the interface allows for EnergyMethods to alter their behavior in the presence of other EnergyMethods; for example, the Lennard-Jones term changes the way it counts the interaction energies for atom pairs separated by either three or four bonds when the CHARMM torsion term (mm_twist) is active in the score function. WholeStructureEnergy classes perform all of their work in the final stage of scoring (Section 4.6.4) in the call to their finalize_total_energy method. For example, the radius-of-gyration score is implemented as a WholeStructureEnergy. Two classes derive from the TwoBodyEnergy class: ShortRangeTwoBodyEnergy and LongRangeTwoBodyEnergy. The "short range" property of ShortRangeTwoBodyEnergy classes lies in the fact that they define some distance cutoff, d, beyond which any heavy-atom pair interaction is guaranteed to be zero. The ScoreFunction uses the maximum cutoff of its short-ranged two-body energy instances and the neighbor radii (Section 4.2) to define a sparse graph representing residue–neighbor relationships, an EnergyGraph, described in the next section. ShortRangeTwoBodyEnergy classes are not responsible for determining which pairs of residues to evaluate during scoring; rather, the ScoreFunction directs the short-ranged energy methods to evaluate particular residue-pair-interaction energies using the EnergyGraph. Long-range energy terms do not define a cutoff distance and so they cannot rely on the ScoreFunction to determine their neighbor relationships for them. Instead, they must provide their own data structure for directing the residue pairs over which they should be evaluated and for storing those energies once computed: a LongRangeEnergyContainer. This data structure may be as sparse or dense as the EnergyMethod requires.
Truly long-ranged energy functions, such as the Generalized Born solvation model (Onufriev et al., 2004), provide upper triangles of N × N tables to store all residue-pair interactions, but sparse nonlocal energy functions provide graphs (graphs are introduced in the next section). User-defined constraints (Section 4.6.5) are treated as long-ranged since a constraint score should be evaluated regardless of how far apart two residues become in a structure. Once a user has input their desired constraints, the ConstraintsEnergy (see below) creates a sparse graph representing their relationships. Thus, the cost to
evaluate M constraints is O(M lg M). Similarly, the DisulfideEnergy is defined as long-range so that it can control which pairs of residues it is evaluated on; for one, most interacting residue pairs are not disulfide bonded, but more importantly, the disulfide bond-stretch term should be applied regardless of the distance separating two disulfide-bonded residues. The final split in the EnergyMethod hierarchy is between context-dependency and context-independency. The short- and long-range two-body energy methods both split, as does the one-body energy method, defining six of the seven abstract classes meant for direct inheritance by concrete classes. Context-dependent terms are those where the context for a residue (or for a residue pair) influences the score; for example, many terms in ROSETTA depend on the number of neighbors within 10 Å. The centroid "environment" term depends on the number of neighbors, as do the fullatom hydrogen-bond terms. The Lennard-Jones term, in contrast, does not depend on the context, and is thus implemented as a context-independent term. Context-dependency is a crucial attribute in determining whether stored residue-pair energies may be reused; the Lennard-Jones interaction energy between two residues is unchanged provided that their relative orientation has not changed since the last energy evaluation, whereas the environment of a hydrogen bond (and hence its strength) may change even if the relative orientation of the interacting atoms does not.

4.6.3. Class Energies and class EnergyGraph
The Pose stores the results of its most recent score-function evaluation in an Energies object. Class Energies stores the total weighted energy, the unweighted component energies, the per-residue and per-residue-pair unweighted component energies, and the LongRangeEnergyContainers. It holds its per-residue-pair unweighted energies from ShortRangeTwoBodyEnergy classes in a "sparse graph" data structure, class EnergyGraph.
The key to the EnergyGraph data structure is that it stores energies for pairs of residues without the O(N²) cost associated with N × N tables. In this section, we show that the memory use for the EnergyGraph is O(N); this means that when copying a Pose, the expense of copying its EnergyGraph is O(N). It also means that the expense of traversing the graph during scoring is O(N) (Section 4.6.4). The concept of a graph comes from computer science: a graph is a set of vertices and edges, G = {V, E}. A vertex, v ∈ V, represents an object; an edge, e = {u, v} ∈ E, represents a relationship between two objects, u and v. In sparse graphs, the number of edges, |E|, is bounded by a linear function of the number of vertices: |E| ∈ O(|V|). In our graph implementation, edge addition and deletion cost O(1) time; this feature is necessary for the O(N) complexity bound for scoring a Pose. Edge addition and deletion in our graph data structures is further sped up by the use of "pool" data
structures (Cleary, 2001), which reduce the number of calls to new and delete. Each vertex in the EnergyGraph represents a residue in the Pose. Each edge in the EnergyGraph represents a short-range interaction between two residues (Fig. 19.2). Contained on each edge is an array used to store the unweighted energies for the active short-range two-body-energy components. If there are five active two-body components, then exactly five elements are allocated for each array on each edge. The EnergyGraph contains O(N) edges: the EnergyGraph contains an edge between residues i and j if the distance between the neighbor atoms of i and j is less than the sum of the neighbor radii of i and j (r_i and r_j; Section 4.2) and the ScoreFunction's maximum short-range distance cutoff, d (Section 4.6.2). Under the assumption that our simulations never exceed some residue-density maximum, ρ (e.g., by collapsing all residues on top of each other), each residue has fewer than (4/3)ρπ(2·max_i r_i + d)^3 ∈ O(1) neighbors, and thus the EnergyGraph contains O(N) edges.

4.6.4. Score function evaluation
To expedite score function evaluation, the ScoreFunction reuses previously computed interaction energies where possible. Here, we present the logic for rescoring a Pose. The Pose, to communicate its structural changes since the previous score function evaluation, hands a DomainMap (Section 4.3) to the ScoreFunction at the beginning of scoring. (Re)Scoring a Pose proceeds in eight stages:
1. The ScoreFunction iterates across all edges in the EnergyGraph (whose edges reflect the interactions present at the last score-function evaluation) and deletes out-of-date edges, that is, edges whose nodes have a different color or are color 0 (O(N)),
2. it detects residue neighbors (with an STL map in O(N lg N) time (Stepanov and Lee, 1995) or quite rapidly with a 3D grid in O(N) time),
3. it iterates across all residue neighbors, and for any pair of residues with nonmatching colors, adds new edges to the EnergyGraph (O(N)),
4. it calls a function setup_for_scoring on each of its EnergyMethods (O(N)¹),
5. it evaluates the one-body energies, reusing context-independent energies for residues with a nonzero color (O(N)),
6. it iterates across all edges in the EnergyGraph and evaluates the short-range two-body energies for each neighboring residue pair, reusing the context-independent energies for residue pairs with the same nonzero color (O(N)),
7. it iterates across all long-range two-body energies and iterates across the corresponding long-range two-body-energy containers to evaluate the requisite two-body energies, again reusing context-independent energies for those residue pairs assigned the same nonzero color (O(N²)),
8. finally, it calls finalize_total_energy on each of its EnergyMethods (O(N)¹). WholeStructureEnergy classes perform all of their work during the finalize stage.

564
Andrew Leaver-Fay et al.

Discounting the neighbor-detection expense and the presence of long-range energy terms, score function evaluation is an O(N) operation. During minimization (Section 4.7), we assume that the neighbor relationships will remain fixed, and avoid the neighbor-detection and graph update steps (steps 1, 2, and 3) during score function evaluation.

4.6.5. core::scoring::constraints
Constraints are used in ROSETTA protocols either to bias conformational sampling toward regions where the user thinks their solution lies, or to force the conformation into a high-energy state in order to study that state (e.g., to design an active site around the transition state geometry for a reaction). Constraints are used to bias sampling with homology information, or experimentally derived data. Their prominent role in so many protocols earns them a place directly in a Pose; each Pose contains a ConstraintSet object. The ConstraintSet holds the collection of Constraints and manages their assignment into 1-body, 2-body, and multibody terms; it creates a graph, the ConstraintGraph, as the LongRangeEnergyContainer for the ConstraintsEnergy class. The constraint system makes extensive use of polymorphism to provide tremendous expressibility. Class Constraint is the abstract base class for the various constraint forms in use in ROSETTA: for example, an AtomPairConstraint will compute a score and its derivative based on the distance of two particular atoms; AngleConstraints and DihedralConstraints operate on atom triples and atom quadruples.
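This polymorphism can be sketched in miniature (Python pseudocode with invented, simplified classes; the real ROSETTA classes are C++, carry more machinery, and also report derivatives):

```python
import math

class Constraint:
    """Abstract base class: each concrete constraint scores some geometry."""
    def score(self, xyz):                 # xyz maps atom id -> (x, y, z)
        raise NotImplementedError

class AtomPairConstraint(Constraint):
    """Penalize deviation of an atom-pair distance from a target value."""
    def __init__(self, atom1, atom2, x0, sd):
        self.atom1, self.atom2, self.x0, self.sd = atom1, atom2, x0, sd
    def score(self, xyz):
        d = math.dist(xyz[self.atom1], xyz[self.atom2])
        return ((d - self.x0) / self.sd) ** 2     # harmonic penalty

class AngleConstraint(Constraint):
    """Penalize deviation of the angle centered at atom2."""
    def __init__(self, atom1, atom2, atom3, theta0, sd):
        self.atoms = (atom1, atom2, atom3)
        self.theta0, self.sd = theta0, sd
    def score(self, xyz):
        a, b, c = (xyz[k] for k in self.atoms)
        v1 = [a[i] - b[i] for i in range(3)]
        v2 = [c[i] - b[i] for i in range(3)]
        cos_t = sum(x * y for x, y in zip(v1, v2)) / (math.dist(a, b) * math.dist(c, b))
        return ((math.acos(cos_t) - self.theta0) / self.sd) ** 2

# A ConstraintSet is then, in essence, a collection scored uniformly:
constraints = [
    AtomPairConstraint("CA1", "CA2", x0=3.8, sd=0.5),
    AngleConstraint("N", "CA1", "CA2", theta0=math.pi / 2, sd=0.2),
]
coords = {"CA1": (0.0, 0.0, 0.0), "CA2": (4.3, 0.0, 0.0), "N": (0.0, 2.0, 0.0)}
total = sum(c.score(coords) for c in constraints)
```

Because every concrete constraint answers the same score call, a scoring loop never needs to know which geometric feature is being restrained.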
ROSETTA3
565

The actual score that is computed for most Constraints comes from a generic Func class; the interface for class Func is simply two functions, func(x) and dfunc(x), that report the value and the derivative for some value x (x can be a distance, an angle, or a dihedral). Example concrete Func classes include the HarmonicFunc, the CircularHarmonicFunc, the PeriodicFunc, and the SquareWellFunc. Evaluating the score for a Constraint requires that it be provided access to coordinates, but each Constraint requires a different number of coordinates. To provide a uniform interface for all constraints, we pass coordinates into a Constraint via an XYZFunc whose job is to return a coordinate given an atom identifier. There are three concrete XYZFunc classes: the ResidueXYZFunc, the ResiduePairXYZFunc, and the ConformationXYZFunc, which are used to evaluate 1-body, 2-body, and multibody constraints, respectively. The same AngleConstraint class can be used for either intraresidue or interresidue constraints, and can be used with a wide variety of functional forms.

¹ These steps are O(N) assuming that the EnergyMethods perform only O(N) work in these steps.
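Since every Func must supply a matching func/dfunc pair, a cheap sanity check during development is to compare dfunc against a central finite difference of func. A small sketch of such a check (illustrative Python, not part of ROSETTA; the harmonic form and its parameters are assumptions):

```python
def check_dfunc(func, dfunc, x, h=1e-6, tol=1e-4):
    """Compare an analytic derivative against a central finite difference."""
    numeric = (func(x + h) - func(x - h)) / (2.0 * h)
    return abs(numeric - dfunc(x)) < tol

# A harmonic form, func(x) = ((x - x0)/sd)^2, and its hand-written derivative:
f = lambda x: ((x - 3.8) / 0.5) ** 2
df = lambda x: 2.0 * (x - 3.8) / 0.5 ** 2
ok = check_dfunc(f, df, x=4.3)
```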
4.7. core::optimization
The classes in core::optimization provide the functionality for gradient-based minimization of arbitrary DOFs. The most commonly used class is AtomTreeMinimizer. This class is specialized for the DOFs of interest to molecular modelers, although it depends upon more general minimization classes described below. The run method of the AtomTreeMinimizer takes as input a Pose, a MoveMap, a ScoreFunction, and a MinimizerOptions object. The MoveMap provides a detailed selection of the AtomTree-defined DOFs to vary. The MinimizerOptions provides a description of the desired minimization algorithm and control parameters such as required tolerances in objective function and maximum iterations before termination (Fig. 19.4A, Section 4).

The low-level class for implementing gradient-based minimization is Minimizer. This class is configured with Multifunc and MinimizerOptions objects at construction. The Multifunc class is an abstract class that calculates the objective function and the derivative of the objective function for a given set of DOFs. The AtomTreeMultifunc is the most commonly used subclass of Multifunc. In fact, one of the major tasks of the AtomTreeMinimizer is to construct a suitable AtomTreeMultifunc from the input MoveMap and ScoreFunction, and to enforce the correspondence between the DOFs expected by the AtomTreeMultifunc object and the DOFs manipulated by the Minimizer object. The AtomTreeMultifunc is responsible for converting the Cartesian derivative vectors into derivatives for the torsional DOFs (Abe et al., 1984). The minimization algorithm options are split into the direction-determining and line-minimization options. Currently, the only direction-determining option is a variable metric method using a Broyden–Fletcher–Goldfarb–Shanno (BFGS) update (Nocedal and Wright, 2006).
The available line minimization algorithms are an inexact method using the Armijo backtracking acceptance criterion (Nocedal and Wright, 2006), a similar but “nonmonotone” method that allows the minimization trajectory to temporarily move uphill in energy, and a more exact method due to Brent (1973). The termination criterion for the minimization is specified by a tolerance, which may be absolute or relative to the current value for the objective function.
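The Armijo backtracking acceptance criterion can be illustrated in one dimension (a generic textbook sketch, not the Minimizer class's implementation; all names here are invented):

```python
def armijo_backtrack(f, grad, x, direction, alpha0=1.0, c=1e-4, tau=0.5):
    """Backtracking line search: shrink the step until the Armijo
    sufficient-decrease condition holds. Assumes `direction` is a
    descent direction, so the loop terminates."""
    fx = f(x)
    slope = grad(x) * direction          # directional derivative (1D case)
    alpha = alpha0
    while f(x + alpha * direction) > fx + c * alpha * slope:
        alpha *= tau                     # reject the step and backtrack
    return alpha

# Minimize f(x) = (x - 2)^2 from x = 0, stepping along the negative gradient:
f = lambda x: (x - 2.0) ** 2
grad = lambda x: 2.0 * (x - 2.0)
x0 = 0.0
d = -grad(x0)                            # descent direction, here +4.0
step = armijo_backtrack(f, grad, x0, direction=d)
x1 = x0 + step * d
```

With these numbers the full step (alpha = 1) overshoots and is rejected once; the halved step lands exactly at the minimum, showing why an inexact criterion is usually sufficient.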
Figure 19.4 Simple ROSETTA3 protocol for performing a binding specificity calculation on a protein-single-stranded-DNA complex. The simulation code (A) is broken into five segments: (1) initialization of the molecular system from a PDB file and the scoring function from a text file containing the energy terms and weights; (2) setup of the kinematic connectivity via a FoldTree (illustrated in B) with a long-range rigid-body connection between residue 4 in the DNA and residue 15 in the protein; (3) redesign of the DNA sequence and simultaneous optimization of the protein sidechain conformations using a PackerTask object to direct the operation of Rosetta's packing subroutine pack_rotamers; (4) gradient-based minimization of the resulting Pose with flexibility of all chi angles (including glycosidic dihedrals in the DNA), the rigid-body linkage between the protein and the DNA, and the DNA backbone dihedrals (the MoveMap object communicates the allowed flexibility to the minimizer); and (5) output of the final optimized structures (superimposed in C) and sequence and score information (text output shown in D, sequences summarized by a sequence logo representation in E, which can be compared with the DNA sequence in the starting PDB file: GTTAGGG). This simulation code could be compiled into a free-standing C++ executable by linking against the ROSETTA libraries.
4.8. core::pack
Namespace core::pack houses the classes associated with two commonly used subroutines in ROSETTA protocols, pack_rotamers and rotamer_trials, which optimize rotamer placement. The packer builds a set of rotamers at each of several residues, computes their interaction energies, and, in the case of pack_rotamers, performs simulated
annealing to find low-energy rotamer placements. In rotamer_trials, the best rotamer is chosen at each residue, where each residue is optimized one at a time in a random order. Both subroutines take as input a Pose, a ScoreFunction, and a PackerTask.

In previous versions of ROSETTA, the most complicated portion of the packer was in how it decomposed the score function into rotamer-one-body energies and rotamer-pair energies. The packer had to be aware of all the score function components, their nature, and their interface. This knowledge was duplicated in several places as packer functionality expanded, making the incorporation of new terms difficult and error-prone. With the EnergyMethod hierarchy, the ScoreFunction is predecomposed; once an EnergyMethod can be incorporated as a one- or two-body energy into score-function evaluation, it can be included in packing.

Namespace core::pack::task houses a class whose sole purpose is to let users control the packer's behavior: PackerTask. The PackerTask communicates rotamer-building instructions, simulated annealing parameters, and other data between the various classes and subroutines of the packer. The PackerTask contains a host of options, many of which are configurable on the per-residue level (see Fig. 19.4A; Section 3 for a packing example).

Namespace rotamer_set holds class RotamerSet, which builds the set of rotamers for a particular position in the structure (e.g., residue 10). It represents each rotamer with a single Residue object. Class RotamerSet also stores trie data structures for EnergyMethods that are able to take advantage of the trie-vs-trie algorithm (Leaver-Fay et al., 2005b).

Namespace interaction_graph houses several "interaction graph" classes (Leaver-Fay et al., 2005a) that store tables of rotamer-pair energies (or compute them on-the-fly; Leaver-Fay et al., 2008) on edges between neighboring residues.
The InteractionGraph abstraction also serves as an interface to the simulated annealing algorithms. Due to their ease of use, protocols that create and store their own InteractionGraph for use in multiple annealing trajectories have flourished. Namespace annealer houses the last components of the packer, the annealers, which search for low-energy rotamer assignments by considering rotamer substitutions and deciding whether each substitution should be accepted or rejected. Currently, three annealers are available; the first two differ mainly in their temperature schedule, the third, for DNA design, makes simultaneous base substitutions to preserve Watson-Crick base pairing.
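The division of labor just described, precomputed one-body and two-body energy tables consulted by a search routine, can be sketched with a toy rotamer_trials-style greedy optimizer (illustrative Python; the data layout and names are invented, not the packer's real interface):

```python
import random

def rotamer_trials(one_body, two_body, edges, rounds=2, seed=0):
    """Greedy one-at-a-time rotamer optimization in the spirit of
    rotamer_trials: visit residues in random order, picking the best
    rotamer at each while holding the others fixed.

    one_body[i][r]          : energy of rotamer r at residue i
    two_body[(i, j)][ri][rj]: pair energy for neighboring residues i < j
    edges                   : list of neighboring residue pairs (i, j)
    """
    rng = random.Random(seed)
    n = len(one_body)
    state = [0] * n                         # current rotamer index per residue

    def residue_energy(i, r):
        """One-body energy of rotamer r at i, plus its edge energies."""
        e = one_body[i][r]
        for (a, b) in edges:
            if a == i:
                e += two_body[(a, b)][r][state[b]]
            elif b == i:
                e += two_body[(a, b)][state[a]][r]
        return e

    order = list(range(n))
    for _ in range(rounds):
        rng.shuffle(order)                  # random visitation order
        for i in order:
            state[i] = min(range(len(one_body[i])),
                           key=lambda r: residue_energy(i, r))
    return state

# Two residues, two rotamers each; the assignment (1, 1) is the unique minimum.
one_body = [[1.0, 0.0], [1.0, 0.0]]
two_body = {(0, 1): [[2.0, 2.0], [2.0, -1.0]]}
best = rotamer_trials(one_body, two_body, edges=[(0, 1)])
```

A pack_rotamers-style annealer differs mainly in that it proposes random substitutions and accepts uphill moves with a temperature-dependent probability rather than always taking the per-residue minimum.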
4.9. protocols Library
The protocols library contains code representing protocols for specific purposes (e.g., protein/protein docking), whereas the core library contains code with more general purposes that the protocols rely on. This chapter
does not detail all the protocols contained within this library (and which are published separately), but rather, highlights a few classes and algorithms that are common to various protocols and could be useful for developers interested in writing their own protocols.
4.10. protocols::moves
Namespace protocols::moves houses the class responsible for recapitulating ROSETTA's classic Monte Carlo technique, class MonteCarlo. In particular, this class keeps track of two Poses: the best scoring Pose it has ever encountered, and the most-recently-accepted Pose. After the class is initialized (with a starting Pose and a ScoreFunction), a protocol writer can invoke its boltzmann(core::pose::Pose & p) method passing in a structurally perturbed Pose, p, for evaluation. The MonteCarlo object scores p, and will then accept or reject p either by copying p into the most-recently-accepted Pose, or by copying the most-recently-accepted Pose back into p, thereby undoing whatever structural perturbation had just been applied. The MonteCarlo object also keeps track of how frequently structural perturbations are rejected, and can optionally increase its temperature if it has been too long since the last acceptance.

Namespace protocols::moves also contains an important base class that makes ROSETTA3 protocols nestable. On one level, protocols are nestable simply because protocols are represented by classes; one could create an instance of a protocol, or one could have two instances of a protocol and arrange the second instance to be invoked from within the first. Our protocols are generically nestable because our protocols all derive from a common base class: Mover. Mover defines an interface method

virtual apply( core::pose::Pose & p ) = 0;
which takes a nonconst Pose reference where the Mover is meant to change the input Pose. A Mover can model an entire protocol (e.g., docking) which can be seen as taking an input Pose and producing an output Pose. ROSETTA's job distribution system, described in the next section, relies on this premise. The TrialMover exemplifies the way in which this generic nestability is so powerful. A TrialMover (itself derived from Mover) is constructed from an instance of another Mover and a MonteCarlo object. In its apply method, it invokes the Mover's apply method and follows up by invoking the MonteCarlo's boltzmann method. A portion of a protocol might be written succinctly as a series of TrialMover applications.

A final class housed in namespace protocols::moves worth mentioning is the abstract base class, Filter. Filter defines a single function

virtual bool apply (Pose const & p) const = 0;

which will return "true" if Pose p meets some quality standard, and "false" otherwise. Classes Mover and Filter both play an important role in ROSETTA3's scripting language, described in Section 4.13 below.
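The Mover/MonteCarlo/TrialMover relationship can be sketched in miniature (Python stand-ins operating on a toy "pose" of numbers; the class and method names echo the text, but none of this is the real ROSETTA or PyRosetta API):

```python
import math
import random

class Mover:
    """Base class: a Mover transforms a pose in place (cf. Mover::apply)."""
    def apply(self, pose):
        raise NotImplementedError

class MonteCarlo:
    """Tracks the last-accepted and best poses; Metropolis accept/reject."""
    def __init__(self, pose, score_fn, temperature=1.0, seed=0):
        self.score_fn, self.kT = score_fn, temperature
        self.rng = random.Random(seed)
        self.last_accepted = list(pose)
        self.best = list(pose)

    def boltzmann(self, pose):
        delta = self.score_fn(pose) - self.score_fn(self.last_accepted)
        if delta <= 0 or self.rng.random() < math.exp(-delta / self.kT):
            self.last_accepted = list(pose)       # accept the perturbation
            if self.score_fn(pose) < self.score_fn(self.best):
                self.best = list(pose)
            return True
        pose[:] = self.last_accepted              # reject: undo the move
        return False

class PerturbMover(Mover):
    """Illustrative Mover: nudge one coordinate by a random amount."""
    def __init__(self, magnitude=0.5, seed=1):
        self.magnitude, self.rng = magnitude, random.Random(seed)
    def apply(self, pose):
        i = self.rng.randrange(len(pose))
        pose[i] += self.rng.uniform(-self.magnitude, self.magnitude)

class TrialMover(Mover):
    """Nesting in action: apply an inner Mover, then let MonteCarlo judge."""
    def __init__(self, mover, mc):
        self.mover, self.mc = mover, mc
    def apply(self, pose):
        self.mover.apply(pose)
        self.mc.boltzmann(pose)

# The "pose" is just a list of numbers; the score is distance from the origin.
score = lambda pose: sum(x * x for x in pose)
pose = [2.0, -2.0]
mc = MonteCarlo(pose, score, temperature=0.1)
trial = TrialMover(PerturbMover(), mc)
for _ in range(500):
    trial.apply(pose)
```

Because TrialMover is itself a Mover, it can be handed anywhere a Mover is expected, which is the generic nestability the text describes.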
4.11. JobDistributor
Due to the vast size of conformation space, and the ruggedness of the energy landscape, most protocols require the simulation of thousands of trajectories to produce reliable predictions. The job distributor layer is responsible for abstracting the details of structure I/O and the allocation of computational resources away from the task of writing protocols. If a protocol is written to interface with the job distributor classes, then it can be run on any kind of cluster supported by our job distribution framework.

The main task for a job distributor is to execute a pair of nested for loops: one outer loop over all input structures and one inner loop over all trajectories required for each input structure. Inside the inner loop, the job distributor generates a Pose from input, runs the intended protocol (by invoking a Mover's apply method) on that Pose, and then writes the results to disk. These central loops are implemented in the base JobDistributor class. A series of derived classes inherit from this base and provide the logic for how to farm out the jobs across the available resources. As of this writing, the choices include one job distributor for distributed computing within the BOINC framework (Anderson, 2004), one for use with one or more processes that communicate via the file system, and four MPI variants. The MPI variants run as nearly independent processes (that are "embarrassingly parallel"), where interprocess communication is limited to signaling which jobs have been completed and managing disk access. Besides managing the assignment of jobs to processors, the JobDistributor has two extra responsibilities: determining which jobs exist and writing output at their conclusion. These tasks are handled by the JobInputter and JobOutputter classes. Derived members of these classes specialize for particular types of input and output.
The JobInputter informs the JobDistributor which jobs are present from command-line inputs and also turns the information for those jobs into Pose objects. The major JobInputters handle standard PDBs and silent files (compressed output files in which a structure is described by its internal DOFs only). There are also JobInputters specialized for ab initio folding (which has no starting structure) and other purposes. The JobOutputter is responsible for outputting the final Pose from a trajectory, usually to disk. Again, the standard output methods are PDBs and silent files, with more exotic choices available. The JobOutputter class is also responsible for determining what trajectories have already been finished; if ROSETTA is interrupted and later restarted, it can resume the last job it was processing by examining the set of outputs that have already been generated.
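The central nested loops, together with the restart-by-inspecting-existing-output behavior, can be sketched as follows (illustrative Python; the function and argument names are invented stand-ins for the JobDistributor/JobInputter/JobOutputter interplay):

```python
def distribute_jobs(inputs, nstruct, protocol, outputter, finished=None):
    """Core job-distribution loop: for each input, run nstruct trajectories.

    inputs    : maps a job name to a starting pose (a JobInputter stand-in)
    protocol  : callable taking and returning a pose (a Mover's apply)
    outputter : callable(job_name, pose) that records results (JobOutputter)
    finished  : set of already-completed job names, enabling restarts
    """
    finished = finished or set()
    for name, start_pose in inputs.items():       # outer loop: input structures
        for i in range(1, nstruct + 1):           # inner loop: trajectories
            job = f"{name}_{i:04d}"
            if job in finished:                   # resume: skip completed jobs
                continue
            pose = list(start_pose)               # fresh copy per trajectory
            pose = protocol(pose)
            outputter(job, pose)

results = {}
distribute_jobs(
    inputs={"1abc": [0.0, 1.0]},
    nstruct=3,
    protocol=lambda pose: [x + 1.0 for x in pose],
    outputter=lambda job, pose: results.setdefault(job, pose),
    finished={"1abc_0002"},                       # pretend job 2 already ran
)
```

Only trajectories 1 and 3 run here; the "already finished" job is skipped, mirroring how an interrupted run can resume from existing outputs.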
4.12. protocols::loops
Loop modeling is heavily utilized during protein homology modeling and refinement as well as during flexible-backbone protein design. Loops are defined by a Loop class that stores the start and end residues for the loop, as well as a "cutpoint" residue position at which a chainbreak is introduced into the AtomTree (Wang et al., 2007). This chainbreak allows kinematic perturbations to propagate inward from the loop endpoints toward the cutpoint residue so that regions outside the loop are unmodified by dihedral angle changes within the loop. Loop modeling in ROSETTA3 can be used for de novo prediction of protein loop conformations (loop reconstruction), or for refinement of given loop structures. Loop refinement protocols derive from the class LoopMover and may alter the conformation of any loops in the Pose. Protocols that reconstruct only a single loop derive from the class IndependentLoopMover. A typical loop building task starts with a centroid representation of the loop, followed by an all-atom refinement.

Two main algorithms for loop building are implemented in ROSETTA. The first algorithm uses rounds of fragment insertion to modify the loop conformations where a "chainbreak" term is included in the score function to keep the two cutpoint residues close together, followed by cyclic coordinate descent (Canutescu and Dunbrack, 2003) to close the chain. This algorithm is fast and is used in homology modeling to construct the gap regions of a sequence alignment. The second type uses an algorithm called kinematic closure (KIC; Coutsias et al., 2004; Mandell et al., 2009). KIC uses inverse kinematics to solve analytically for assignments to six torsional DOFs that close the loop exactly, while sampling all other phi/psi angles in the loop region from Ramachandran space.
The KIC method provides enhanced high-resolution sampling and can be used for de novo prediction of loop conformations, high-resolution refinement of protein structures following low-resolution rebuilding, and remodeling of defined regions of proteins such as during protein design procedures.
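The flavor of analytic chain closure can be seen in the textbook planar two-link case, where joint angles that place the chain end exactly on a target are solved in closed form by the law of cosines (a drastic simplification for illustration only; KIC's six-torsion solution operates on real protein backbones and is far more involved):

```python
import math

def two_link_ik(l1, l2, target):
    """Solve joint angles (theta1, theta2) so a planar two-link arm anchored
    at the origin places its end point exactly on `target`."""
    x, y = target
    d2 = x * x + y * y
    # elbow angle from the law of cosines
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target out of reach")
    theta2 = math.acos(cos_elbow)
    # shoulder angle: direction to target minus the interior correction
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2

def forward(l1, l2, theta1, theta2):
    """End-point position for the given joint angles (forward kinematics)."""
    elbow = (l1 * math.cos(theta1), l1 * math.sin(theta1))
    return (elbow[0] + l2 * math.cos(theta1 + theta2),
            elbow[1] + l2 * math.sin(theta1 + theta2))

t1, t2 = two_link_ik(1.0, 1.0, (1.2, 0.8))
end = forward(1.0, 1.0, t1, t2)
```

Exact closure of this kind is what distinguishes inverse-kinematic approaches from iterative ones such as cyclic coordinate descent, which only approach the target over repeated sweeps.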
4.13. Protocols from text files
Most ROSETTA protocols can be abstracted into a series of steps where each step either changes the structure being operated on (the Pose) or decides that the trajectory should be restarted. The classes Mover and Filter introduced above (Section 4.10) provide the building blocks for such an abstraction. We have leveraged these two classes to generate a framework for writing protocols in a user-friendly, XML scripting language that can be read by a ROSETTA application called ROSETTASCRIPTS. ROSETTASCRIPTS programs are written as text files and are converted into
a sequence of Movers and Filters at runtime. Each Mover and Filter accessible within ROSETTASCRIPTS implements an initialization function, parse_my_tag, which allows the script writer to control various features of each class; however, great care has been taken to make the default behavior for each Mover or Filter as robust and intuitive as possible. There are several advantages of this scripting functionality. First, it allows users to rapidly create and tune protocols without having to recompile C++ source code. This is especially useful for ROSETTA@home (Chivian et al., 2003) where distributing a new executable is expensive both in labor and in server load. Second, the ability to pass parameters to ROSETTASCRIPTS Movers through the XML input file helps eliminate command-line flags, which, as global variables, frustrate code reuse. Third, ROSETTASCRIPTS protocols are self-contained and written to work within the standard job-distribution framework (Section 4.11), making ROSETTASCRIPTS protocols as easy to deploy on a given cluster as any other ROSETTA application. Finally, ROSETTASCRIPTS is just as fast as any hard-coded protocol it might replace, since the underlying Movers and Filters it relies upon are fully compiled C++ classes.
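The runtime conversion of an XML script into an ordered sequence of configured objects can be sketched generically (the tag names, attributes, and registry below are invented for illustration and do not reflect the actual ROSETTASCRIPTS schema or the parse_my_tag interface):

```python
import xml.etree.ElementTree as ET

# A toy script in the spirit of an XML protocol description:
SCRIPT = """
<PROTOCOL>
  <Perturb magnitude="0.5"/>
  <Minimize tolerance="0.01"/>
  <Perturb magnitude="0.1"/>
</PROTOCOL>
"""

# Registry mapping tag names to factories; each factory plays the role of a
# parse_my_tag-style initializer reading options from the XML attributes.
REGISTRY = {
    "Perturb":  lambda tag: ("perturb",  float(tag.get("magnitude", 1.0))),
    "Minimize": lambda tag: ("minimize", float(tag.get("tolerance", 0.001))),
}

def parse_protocol(xml_text):
    """Convert an XML script into an ordered list of configured movers."""
    root = ET.fromstring(xml_text)
    return [REGISTRY[child.tag](child) for child in root]

movers = parse_protocol(SCRIPT)
```

Because the script is data, not compiled code, a user can reorder, retune, or duplicate steps without touching the underlying implementations.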
5. Conclusion
Our new architecture has greatly advanced the functional capacity of ROSETTA. It has allowed users to rapidly develop new protocols, to model a wider set of chemical structures, and to easily experiment with new scoring terms. As a concrete example, Fig. 19.4 illustrates a simple ROSETTA3 simulation for predicting protein-single-stranded-DNA binding specificity using DNA redesign, followed by gradient-based minimization. The new architecture has allowed the creation of a multithreaded, interactive game where players are given access to ROSETTA's minimization, packing, and loop modeling routines to compete for the best protein structure predictions (Cooper et al., 2010). It has enabled the creation of PYROSETTA (Chaudhury et al., 2010) that allows command-line interactivity with ROSETTA classes and functions from within the Python interpreter, which in turn, promotes even faster protocol prototyping. With an object-oriented approach toward protocol development, we are able to construct arbitrarily complicated protocols from component Mover classes using a simple XML scripting language. It is our fervent hope that this rearchitecturing will be a lasting foundation for the code base so that the tumult of another complete rewrite may be avoided for the next decade if not longer.
ACKNOWLEDGMENTS
This work was funded by NIH and HHMI. OFL was funded by the Human Frontier Science Program.
REFERENCES
Abagyan, R., Totrov, M. M., and Kuznetsov, D. N. (1994). ICM—A new method for protein modeling and design: Applications to docking and structure prediction from the distorted native conformation. J. Comput. Chem. 15, 488–506.
Abe, H., Braun, W., Noguti, T., and Go, N. (1984). Rapid calculation of first and second derivatives of conformational energy with respect to dihedral angles for proteins. General recurrent equations. Comput. Chem. 8, 239–247.
Anderson, D. P. (2004). BOINC: A system for public-resource computing and storage. In 5th IEEE/ACM International Conference on Grid Computing, pp. 4–10. IEEE Computer Society, Pittsburgh, PA.
Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss, C. E., and Baker, D. (2001). Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins 5, 119–126.
Bonneau, R., Strauss, C. E., Rohl, C. A., Chivian, D., Bradley, P., Malmstrom, L., Robertson, T., and Baker, D. (2002). De novo prediction of three-dimensional structures for major protein families. J. Mol. Biol. 322, 65–78.
Bradley, P., Malmstrom, L., Qian, B., Schonbrun, J., Chivian, D., Kim, D. E., Meiler, J., Misura, K. M., and Baker, D. (2005). Free modeling with Rosetta in CASP6. Proteins 61(Suppl 7), 128–134.
Brent, R. P. (1973). Algorithms for minimization without derivatives. Prentice Hall, Englewood Cliffs, NJ.
Brooks, B. R., Brooks, C. L., III, Mackerell, A. D., Nilsson, L., Petrella, R. J., Roux, B., Won, Y., Archontis, G., Bartels, C., Boresch, S., Caflisch, A., Caves, L., et al. (2009). CHARMM: The biomolecular simulation program. J. Comput. Chem. 30, 1545–1615.
Canutescu, A. A., and Dunbrack, R. L., Jr. (2003). Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 12, 963–972.
Chaudhury, S., Lyskov, S., and Gray, J. J. (2010). PyRosetta: A script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691.
Chivian, D., Kim, D. E., Malmstrom, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C. E., Bonneau, R., Rohl, C. A., and Baker, D. (2003). Automated prediction of CASP-5 structures using the Robetta server. Proteins 53(Suppl. 6), 524–533.
Chowdry, A. B., Reynolds, K. A., Hanes, M. S., Voorhies, M., Pokala, N., and Handel, T. M. (2007). An object-oriented library for computational protein design. J. Comput. Chem. 28, 2378–2388.
Cleary, S. (2001). The Boost Pool Library, www.boost.org/libs/pool.
Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., and Popović, Z. (2010). Predicting protein structures with a multiplayer online game. Nature 466, 756–760.
Coutsias, E. A., Seok, C., Jacobson, M. P., and Dill, K. A. (2004). A kinematic view of loop closure. J. Comput. Chem. 25, 510–528.
Dahiyat, B. I., and Mayo, S. L. (1996). Protein design automation. Protein Sci. 5, 895–903.
Dantas, G., Kuhlman, B., Callender, D., Wong, M., and Baker, D. (2003). A large scale test of computational protein design: Folding and stability of nine completely redesigned globular proteins. J. Mol. Biol. 332, 449–460.
Das, R., and Baker, D. (2007). Automated de novo prediction of native-like RNA tertiary structures. Proc. Natl. Acad. Sci. USA 104, 14664–14669.
Das, R., and Baker, D. (2008). Macromolecular modeling with Rosetta. Annu. Rev. Biochem. 77, 363–382.
Das, R., Qian, B., Raman, S., Vernon, R., Thompson, J., Bradley, P., Khare, S., Tyka, M. D., Bhat, D., Kim, D. E., Sheffler, W. H., Malmstrom, L., et al. (2007). Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins 106, 18978–18983.
Das, R., Karanicolas, J., and Baker, D. (2010). Atomic accuracy in predicting and designing noncanonical RNA structure. Nat. Methods 7, 291–294.
Davis, I. W., and Baker, D. (2009). RosettaLigand docking with full ligand and receptor flexibility. J. Mol. Biol. 385, 381–392.
Desmet, J., De Maeyer, M., Hazes, B., and Lasters, I. (1992). The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356, 539–541.
Gray, J. J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C. A., and Baker, D. (2003). Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J. Mol. Biol. 331, 281–299.
Hellinga, H. W., Caradonna, J. P., and Richards, F. M. (1991). Construction of new ligand binding sites in proteins of known structure. II. Grafting of a buried transition metal binding site into Escherichia coli thioredoxin. J. Mol. Biol. 222, 787–803.
Jiang, L., Althoff, E. A., Clemente, F. R., Doyle, L., Rothlisberger, D., Zanghellini, A., Gallaher, J. L., Betker, J. L., Tanaka, F., Barbas, C. F., Hilvert, D., Houk, K. N., et al. (2008). De novo computational design of retro-aldol enzymes. Science 319, 1387–1391.
Kaufmann, K., Glab, K., Mueller, R., and Meiler, J. (2008). Small molecule rotamers enable simultaneous optimization of small molecule and protein degrees of freedom in ROSETTALIGAND docking. In “German Conference on Bioinformatics,” (A. Beyer and M. Schroeder, eds.), pp. 148–157. Gesellschaft für Informatik, Bonn, Germany.
Kaufmann, K. W., Lemmon, G. H., DeLuca, S. L., Sheehan, J. H., and Meiler, J. (2010). Practically useful: What the Rosetta protein modeling suite can do for you. Biochemistry 49, 2987–2998.
Kortemme, T., Joachimiak, L. A., Bullock, A. N., Schuler, A. D., Stoddard, B. L., and Baker, D. (2004). Computational redesign of protein–protein interaction specificity. Nat. Struct. Mol. Biol. 11, 371–379.
Kuhlman, B., and Baker, D. (2000). Native protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. USA 97, 10383–10388.
Leaver-Fay, A., Kuhlman, B., and Snoeyink, J. S. (2005a). An adaptive dynamic programming algorithm for the side chain placement problem. In “Pacific Symposium on Biocomputing”, 2005, pp. 17–28. World Scientific, The Big Island, HI.
Leaver-Fay, A., Kuhlman, B., and Snoeyink, J. S. (2005b). Rotamer-pair energy calculations using a trie data structure. Workshop on Algorithms in Bioinformatics (WABI), pp. 500–511.
Leaver-Fay, A., Snoeyink, J. S., and Kuhlman, B. (2008). On-the-fly rotamer pair energy evaluation in protein design. The 4th International Symposium on Bioinformatics Research and Applications (ISBRA 2008), pp. 343–354.
Mandell, D. J., Coutsias, E. A., and Kortemme, T. (2009). Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nat. Methods 6, 551–552.
Meiler, J., and Baker, D. (2006). ROSETTALIGAND: Protein-small molecule docking with full side chain flexibility. Proteins 65, 538–548.
Nocedal, J., and Wright, S. J. (2006). Numerical Optimization, 2nd edn. Springer.
Onufriev, A., Bashford, D., and Case, D. A. (2004). Exploring protein native states and large-scale conformational changes with a modified generalized born model. Proteins 55, 383–394.
Pierce, N. A., and Winfree, E. (2002). Protein design is NP-hard. Protein Eng. 15, 779–782.
Ponder, J. W., and Richards, F. M. (1987). Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes. J. Mol. Biol. 193, 775–791.
Raman, S., Vernon, R., Thompson, J., Tyka, M., Sadreyev, R., Pei, J., Kim, D., Kellogg, E., Dimaio, F., Lange, O., Kinch, L., Sheffler, W., et al. (2009). Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins 77, 89–99.
Rohl, C. A., Strauss, C. E., Chivian, D., and Baker, D. (2004). Modeling structurally variable regions in homologous proteins with Rosetta. Proteins 55, 656–677.
Rothlisberger, D., Khersonsky, O., Wollacott, A. M., Jiang, L., DeChancie, J., Betker, J., Gallaher, J. L., Althoff, E. A., Zanghellini, A., Dym, O., Albeck, S., Houk, K. N., et al. (2008). Kemp elimination catalysts by computational enzyme design. Nature 453, 190–195.
Simons, K. T., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225.
Stepanov, A., and Lee, M. (1995). The Standard Template Library (WG21/N0482, ISO Programming Language C++ Project).
Wang, C., Schueler-Furman, O., and Baker, D. (2005). Improved side chain modeling for protein–protein docking. Protein Sci. 14, 1328–1339.
Wang, C., Bradley, P., and Baker, D. (2007). Protein–protein docking with backbone flexibility. J. Mol. Biol. 373, 503–519.
Zanghellini, A., Jiang, L., Wollacott, A. M., Cheng, G., Meiler, J., Althoff, E. A., Rothlisberger, D., and Baker, D. (2006). New algorithms and an in silico benchmark for computational enzyme design. Protein Sci. 15, 2785–2794.
CHAPTER TWENTY
Computational Design of Intermolecular Stability and Specificity in Protein Self-assembly
Vikas Nanda,*,† Sohail Zahid,*,† Fei Xu,*,† and Daniel Levine*,†

Contents
1. Introduction 576
2. Similarities and Differences Between Unimolecular Folding and Self-assembly 577
3. Computational Approaches to Optimizing Stability and Specificity 579
4. Collagen Self-assembly 584
5. Considerations in Computational Design of Collagen Heteromers 587
6. Conclusions 591
References 591
Abstract
The ability to engineer novel proteins using the principles of molecular structure and energetics is a stringent test of our basic understanding of how proteins fold and maintain structure. The design of protein self-assembly has the potential to impact many fields of biology from molecular recognition to cell signaling to biomaterials. Most progress in computational design of protein self-assembly has focused on α-helical systems, exploring ways to concurrently optimize the stability and specificity of a target state. Applying these methods to collagen self-assembly is very challenging, due to fundamental differences in folding and structure of α- versus triple-helices. Here, we explore various computational methods for designing stable and specific oligomeric systems, with a focus on α-helix and collagen self-assembly.
* Department of Biochemistry, Robert Wood Johnson Medical School, UMDNJ, Piscataway, New Jersey, USA
† The Center for Advanced Biotechnology and Medicine, Piscataway, New Jersey, USA
Methods in Enzymology, Volume 487 ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87020-5
© 2011 Elsevier Inc. All rights reserved.
575
576
Vikas Nanda et al.
1. Introduction
This chapter presents computational strategies for designing self-assembling protein complexes, using model α-helix and collagen peptides as an instructive example. The goal of de novo protein design and computational protein design is to apply our knowledge of intermolecular interactions and protein folding toward the construction of new proteins tailored for novel biomedical or industrial applications. It also serves as a stringent test of our understanding of these principles. Carl Sagan remarked, “If you want to make an apple pie from scratch, you must first invent the universe.” In order to effectively create proteins with previously unrealized three-dimensional topologies, or tailor-made enzymes capable of carrying out new chemistries, one must clearly understand how proteins fold, maintain their structure, and function.

Computational methods are playing an increasingly prominent role in molecular design. The number of possible amino acid combinations to form a protein of length n is 20^n, an astronomical number for even the shortest of proteins. Similarly, predicting the three-dimensional structure of a novel protein sequence is time-consuming, and the accuracy of the outcome depends strongly on the degree of sequence homology with existing high-resolution protein structures. Together, these two factors make a brute-force, explicit consideration of all sequences impossible. A number of computational tools have been developed to reduce the process of searching sequence and conformational space to a more tractable scale. Using such tools, researchers have replaced a structural metal with a hydrophobic core in a zinc-finger fold (Dahiyat and Mayo, 1997), designed a novel three-dimensional fold from scratch (Kuhlman et al., 2003), and repurposed the surfaces of existing proteins to carry out novel enzymatic reactions (Jiang et al., 2008).
Computational protein design software is becoming more sophisticated and has the potential to make engineering novel protein functionalities an accessible and powerful strategy for answering questions in many fields of biology. With some exceptions, much of the recent progress in computational design has focused on globular, single-chain proteins. In contrast, early foundational work in protein design instead focused on the self-assembly of oligomeric, a-helical complexes. In this methods chapter, we will compare and contrast the challenges faced in the molecular design of single-chain versus self-assembling systems. We will discuss computational approaches to optimizing the stability and specificity of self-assembling systems, focusing primarily on work concerning a-helical coiled-coils. More recent work on the computational design of model collagen peptides will be presented, focusing on unique challenges such as the stability–specificity tradeoffs in designing optimal assemblies. Finally, we will present
Computational Design of Self-Assembly
577
some strategies envisioned for multiscale design of self-assembling proteins, bridging the nanoscale world of proteins with the meso- and macroscale design of protein and peptide-based materials.
2. Similarities and Differences Between Unimolecular Folding and Self-assembly The first successful de novo-designed proteins were self-assembling helical coiled-coils. The coiled-coil has proved a useful system for understanding the molecular principles governing the stability and structure not only of oligomeric assemblies, but of single-chain globular proteins as well. To a large extent, the intermolecular forces that govern folding and self-assembly are the same: hydrophobic packing in protein cores, hydrogen bonding primarily driving secondary structure formation, and electrostatic interactions providing stability and specificity on the protein surface. These principles are nearly universal across all classes of proteins. The application of compositional constraints, specifically "binary patterning" of sequences with the correct order of nonpolar and polar amino acids, is a powerful strategy in protein design (Kamtekar et al., 1993; West and Hecht, 1995). This approach seeks to specify the protein fold by burying hydrophobic amino acids in the protein core or at interfaces between protein elements. Natural coiled-coils form a continuous hydrophobic core by following a seven-residue repeat of H–P–P–H–P–P–P (H = hydrophobic, P = polar) amino acids. Violations of this pattern can have adverse effects on folding: placing a polar amino acid within the core would require overcoming the energy of desolvation, breaking favorable protein–water interactions; placing a nonpolar amino acid on the surface might promote misfolded alternative conformations in which that position is buried, or drive aggregation with other like proteins presenting an exposed nonpolar surface. Using this strategy, libraries of random sequences that followed this essential pattern yielded a high proportion of successfully folded coiled-coil designs (West and Hecht, 1995).
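The binary-patterning idea above lends itself to a simple computational check. The sketch below is an illustration, not code from the chapter; the hydrophobic classification and the function name are assumptions. It tests whether a sequence places hydrophobic residues at, and only at, the H slots of the H–P–P–H–P–P–P repeat:

```python
# Simplifying assumption: a fixed set of residues counted as hydrophobic.
HYDROPHOBIC = set("AVLIMFWYC")

def follows_heptad_pattern(seq, pattern="HPPHPPP"):
    """True if hydrophobic residues occur at, and only at, H slots."""
    for i, aa in enumerate(seq):
        if (pattern[i % len(pattern)] == "H") != (aa in HYDROPHOBIC):
            return False
    return True

# Leucines at the first and fourth slot of each heptad fit the pattern:
print(follows_heptad_pattern("LEELEEE" * 4))   # True
print(follows_heptad_pattern("LLEEEEE" * 4))   # False: L at a P slot
```
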
Modifying the pattern slightly to an 11-residue repeat resulted in a coiled-coil with a right-handed superhelical twist, rather than the left-handed motif adopted by nearly all natural coiled-coils (Harbury et al., 1998). Using an H–P alternating pattern results in a b-strand-like structure rather than an a-helix (West and Hecht, 1995). These two elements were combined in the de novo design of an a/b-barrel (Silverman et al., 2001). Although binary patterning can be sufficient to drive assembly and folding in both self-assembling systems and single-chain proteins, a key difference between the two processes is the order of the reaction in the case of self-assembly. The efficiency of self-assembly reactions is
concentration-dependent, with the fraction of folded species determined by the equilibrium dissociation constant (Kd) between the components. This allows biophysical characterization of folding thermodynamics without the use of chaotropic denaturants such as guanidine hydrochloride or urea, or perturbation by temperature or pressure. A key challenge in designing self-assembling systems is specifying the structure of the native state. This is shown schematically in Fig. 20.1, where two scenarios are compared, one unimolecular folding and the other coupled folding and assembly. In the case of the unimolecular folding reaction, two a-helices A and B are connected by a flexible linker, which permits association of nonpolar amino acids to form a continuous hydrophobic core. If this linker is removed, as in the case of self-assembly, these interactions may still drive assembly of a helix–helix dimer. However, at least six states are possible, including homodimers and heterodimers combined with parallel and antiparallel topologies. Often, because these conformations are equivalent in terms of the burial of hydrophobic amino acids, there are minimal differences in their relative stabilities, resulting potentially in a mixture of multiple species. Many examples of this heterogeneity have been observed in designed a-helical complexes, requiring the use of linkers and disulfides to drive specificity (Grosset et al., 2001; Harbury et al., 1993).
[Figure 20.1 schematic: unimolecular folding of linked helices A and B (U) compared with folding and assembly of separate chains (UA + UB) into AA, AB, and BB dimers in parallel and antiparallel topologies.]
Figure 20.1 Achieving a specific native state is more difficult in intermolecular folding due to the number of states with equivalent stabilities. At least six topologies that bury hydrophobic residues (ovals) are possible, and even more might be considered if trimers, tetramers, or higher order structures form.
The problem is further compounded by the choice of nonpolar amino acids at core positions in an a-helical coiled-coil, which can promote trimer, tetramer, or pentamer topologies. An alternative to using linkers or disulfides to achieve specificity is to include unique polar interactions that enforce a unique structure. From the analysis of natural coiled-coils such as GCN4-p1, it has been found that an asparagine at a core position, normally occupied by a nonpolar amino acid, can specify both dimer stoichiometry and a parallel topology (Harbury et al., 1993). This is achieved by a sidechain–sidechain hydrogen bond between the two asparagines in trans across the dimer interface. It is thought that the formation of higher order oligomers or antiparallel topologies would result in desolvation of this buried asparagine without the compensating hydrogen bond. Replacing a nonpolar amino acid such as valine, which promotes trimer formation in the case of GCN4-p1, with asparagine yields a unique dimer, but one less stable than the trimer. This tradeoff between stability and specificity is a common theme in the folding of both single-chain and self-assembling proteins. The inclusion of a core asparagine is a good strategy for enforcing a parallel homodimer, but what if an alternative topology, stoichiometry, or heteromeric association is desired? In such cases, the judicious patterning of surface electrostatics can facilitate the folding of a unique state. A scheme for how such interactions may be used is shown in Fig. 20.2; in this instance, the parallel homodimer is destabilized by four repulsive like-charge pairs, whereas the antiparallel structure has four attractive charge pairs. This energy difference drives the assembly of predominantly antiparallel associations. Amino acids such as arginine, lysine, glutamate, and aspartate are introduced primarily at surface positions, as the energetic cost of burying charged amino acids is very high.
Most of the examples of computational design described in this chapter will make use of electrostatics in achieving target specificity. It should be noted that the extent to which charge pairs stabilize proteins is still a matter of some debate. As with polar interactions in protein cores, there is a stability–specificity tradeoff in incorporating extensive electrostatic interactions (Sindelar et al., 1998). It has even been suggested that thermophilic proteins may be highly charged to prevent the formation of misfolded states, rather than to stabilize the native state (Berezovsky et al., 2007).
3. Computational Approaches to Optimizing Stability and Specificity The separate and sometimes competing goals of optimizing stability and specifying a unique native state have been traditionally termed “positive” and “negative” design, respectively. A typical design protocol starts
[Figure 20.2 schematic: charge patterning on two helices, giving repulsive like-charge pairs in the parallel dimer and attractive charge pairs in the antiparallel dimer.]
Figure 20.2 One strategy for achieving specificity in self-assembly is to pattern charges on the surface such that the native state, in this case, antiparallel, has the maximum number of favorable interactions. Unfavorable interactions in competing states can also drive target specificity.
with a random or otherwise nonoptimal sequence mapped onto a three-dimensional scaffold representing the target structure. Mutations are introduced either singly or in groups, and the change in calculated stability is determined based on the perturbations of favorable and unfavorable interactions. Over the course of a simulation, deleterious substitutions are discarded and stabilizing mutations are retained. In positive design, the primary goal is to stabilize the native state: all calculations are focused on maximizing the number of favorable interactions in the target topology. In negative design, alternate conformational states are considered either explicitly or implicitly when evaluating the impact of a mutation. In the case of self-assembling protein systems, the difficulty of achieving a target conformation can be exacerbated by the stabilities of alternative, near-native structures. In this section, we analyze several approaches to implementing computational negative design in constructing unique a-helical assemblies. Binary patterning implicitly implements negative design. Enforcing a particular pattern of polar and nonpolar amino acids favors one secondary structure over others, such as HPPHPPP preferentially stabilizing the a-helix
over the b-strand. It reduces the opportunity for hydrophobic amino acids to be placed at surface-exposed positions, alleviating the potential for favoring aggregation over the native state. Disallowing polar amino acids at buried positions favors the native state over the ensemble of unfolded states by preventing unfavorable sidechain desolvation upon folding. In computational protein design, many of these constraints have been implemented as reference energies, which describe the contribution of amino acids to the stability of the unfolded state (Kuhlman et al., 2003). Often, these are parameterized based on amino acid partition coefficients between polar and nonpolar solvent phases, reflecting the favorability of hydrating an amino acid. The reference energy for a sequence is subtracted from the solvation energy of the target conformation, which scales with the solvent-accessible surface area of the sidechain (Wesson and Eisenberg, 1992). A nonpolar amino acid such as leucine will have an unfavorable reference energy in the unfolded state. If the same leucine is buried in the native state, the solvation energy will be near zero and the net contribution of solvation and the reference will be favorable. Conversely, a buried lysine will have an unfavorable net contribution due to its preference to be hydrated. This approach results in sequences that are generally consistent with the constraints of binary patterning. Both computational and binary patterning approaches have difficulty with interfacial positions, which are only partially buried and for which the choice of polar versus nonpolar amino acids is less clear (Marshall and Mayo, 2001). As described earlier, binary patterning is not sufficient to enforce a unique native state in the case of a-helical coiled-coils. Additional interactions must be introduced to favor the native state over alternate conformations.
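The reference-energy bookkeeping described above can be sketched as follows. The transfer energies and helper names here are invented for illustration; they are not the parameterization of Wesson and Eisenberg:

```python
# Assumed solvation energies for a fully exposed sidechain (kcal/mol):
# positive = unfavorable to hydrate (nonpolar), negative = favorable.
EXPOSED_SOLVATION = {"LEU": 1.9, "LYS": -3.0}

def net_contribution(aa, fraction_buried):
    """Folded-state solvation energy minus the unfolded-state reference.

    The folded term scales with the sidechain's remaining solvent
    exposure; the reference takes the sidechain as fully exposed.
    """
    e_solv = EXPOSED_SOLVATION[aa] * (1.0 - fraction_buried)
    e_ref = EXPOSED_SOLVATION[aa]
    return e_solv - e_ref

print(net_contribution("LEU", 1.0))   # -1.9: burying leucine is rewarded
print(net_contribution("LYS", 1.0))   # 3.0: burying lysine is penalized
```
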
The most widely used approach is to use electrostatic interactions, mediated through pairs of acidic and basic residues at adjacent positions across a helix–helix interface. Sequences in coiled-coils are assigned unique positions based on the heptad notation (Fig. 20.3), where a and d positions
Figure 20.3 A single a-helix heptad consists of seven residues, labeled a–g. By convention, a and d positions face the helix–helix interface of a coiled-coil. e and g positions are interfacial and are often exploited in design for engineering stabilizing intermolecular charge–pair interactions.
present the hydrophobic core, b, c, and f positions are fully solvent-exposed, and e and g positions are intermediate, often referred to as interfacial. It is at these positions that charge pairs are often found. Computational design schemes to enforce both stability and specificity of the native coiled-coil have focused primarily on these e and g positions. Interestingly, the earliest studies in the computational design of self-assembly were carried out on DNA tetramer junctions, not proteins (Seeman and Kallenbach, 1983). This was accomplished using a stepwise sequence selection algorithm, where oligonucleotide sequences were first selected for the formation of stable base-pairing complexes, and then screened according to a fidelity score in which the relative stability of the target state was compared to the base-pairing energies of multiple competing states. Nautiyal et al. (1995) adopted a similar stepwise approach in the computational design of coiled-coil trimers. In this study, the goal was to form an ABC heterotrimer in which each helix was a different sequence (denoted A, B, or C). For each peptide, there were eight e and g positions, resulting in 2^8, or 256, charge patterns per helix. For an ABC heterotrimer, there would be over 10 million (256^3) possible combinations of sequences. In order to search these efficiently, sequences were first screened for ones in which at least six repulsions were found in a homotrimer (i.e., AAA), reducing the number of possibilities from 256 to 176. These were then screened for triplets of sequences that formed a stable ABC heterotrimer, resulting in 208 ABC candidates. The remaining sequences were again screened for ones that had repulsive interactions if either A, B, or C was antiparallel to the other two, reducing the number of possible sequences to two. The approach was successful, resulting in a set of sequences for A, B, and C where the most stable trimer required all three peptides.
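The first filtering step of such a stepwise screen can be sketched as below. The contact map between e and g positions is a simplifying assumption rather than the geometry used by Nautiyal et al., so the number of survivors differs from the published 176; the sketch only illustrates the enumerate-and-filter logic:

```python
from itertools import product

# Assumed interhelical contacts between the eight e/g position indices:
CONTACTS = [(1, 2), (3, 4), (5, 6)]
N_INTERFACES = 3  # three equivalent interfaces in a parallel homotrimer

def homotrimer_repulsions(pattern):
    """Count like-charge contacts when three copies of one pattern pack."""
    per_interface = sum(pattern[i] == pattern[j] for i, j in CONTACTS)
    return N_INTERFACES * per_interface

patterns = list(product("EK", repeat=8))          # all 256 charge patterns
survivors = [p for p in patterns
             if homotrimer_repulsions(p) >= 6]    # homotrimer destabilized
print(len(patterns), len(survivors))
```

Subsequent steps would apply analogous filters for heterotrimer stability and antiparallel repulsion, shrinking the candidate list at each stage.
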
Although all sequences were sampled at various steps of this procedure, the order of the specificity and stability selections was judiciously chosen so as to minimize the computational burden. An alternative to the stepwise selection of sequences based on stability or specificity criteria is the use of stochastic sequence optimization algorithms that minimize the value of a single scoring function. This score usually reflects the stability of the target native state, and optimizing it is often sufficient to ensure specificity as well (Shakhnovich and Gutin, 1993a,b). However, this approach is unlikely to work in the case of coiled-coils, where it has proved difficult to compute the effect of amino acid sequence on oligomerization state and topology (Ramos and Lazaridis, 2006). Summa and colleagues used a stochastic sequence optimization approach on the e and g positions of a four-helix bundle metalloprotein (Summa et al., 2002). Specificity was selected for by optimizing the difference in energies between the native topology and a single competing state: Etarget − Ecompeting. A sequence-based scoring function was used, summing over all interactions, where favorable charge pairs were assigned an energy of −1 and repulsive
interactions +2 or +3. At each step, a position was picked at random and its charge flipped from + to − or vice versa. After 700,000 such trials, one of the top-ranking sequences found during the search was selected for experimental validation. This approach was extended to the design of coiled-coil dimers which preferentially formed an AB heterodimer rather than AA and BB homodimers (Fig. 20.4; Havranek and Harbury, 2003). Although conceptually similar to previous approaches, this study calculated interaction energies using an explicit atomic model. Additionally, unfolded and aggregated states were included as competitors for the native state. The aggregate was modeled as folded but with a lower dielectric, and the unfolded state included the cost of breaking the dimer as well as the energetic cost of unfolding the helix. All of these terms were combined into a single fitness function:

fitness = −RT ln[Σi exp(−Ai/RT)] − Atarget, (20.1)

where Ai represents the free energies of the competing states. This study sampled sequences at the a, d, e, and g positions in the central heptad of the design, using a core asparagine in the first heptad to enforce a dimer structure. Three designs were tested, all exhibiting the predicted specificity. The field continues to progress rapidly with recent improvements in both computational estimates of stability and algorithmic approaches for co-optimizing stability and specificity. A common obstacle in design is the
[Figure 20.4 schematic: energy diagram comparing unfolded, aggregate, and folded species, with the gap between Etarget and Ecompeting indicated.]
Figure 20.4 Havranek and Harbury (2003) used the gap between target and competing states to optimize specificity of homodimer or heterodimer formation.
tradeoff between stability and specificity. To address this, the Keating group developed CLASSY (cluster expansion and linear programming-based analysis of specificity and stability; Grigoryan et al., 2009). Key to this approach was the conversion of interaction energies determined from expensive structure-based calculations into sequence-based functions. These functions were trained on a set of calculations and then applied to the design of coiled-coils. The approach was constructed such that it could be generalized to any number of targets and competing states. It was recently applied to develop a synthetic "interactome" in which 55 peptides formed 27 hetero-associating pairs (Reinke et al., 2010). The ability of such systems to be linked together to form more complex nanoscale topologies is now being explored (Bromley et al., 2009).
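The stochastic charge-flip search used by Summa and colleagues, described earlier in this section, can be sketched as follows. The pairing maps below are an invented toy geometry, and the greedy accept rule is a simplification of the published Monte Carlo protocol; the −1/+2/+3 values follow the text, with the split between +2 and +3 by residue type assumed:

```python
import random

def pair_score(a, b):
    """-1 for an attractive E/K pair; +2 (K/K) or +3 (E/E) for repulsion.
    The +2 versus +3 split by residue type is an assumption here."""
    if a != b:
        return -1
    return 3 if a == "E" else 2

def energy(seq, pairs):
    """Sum the discrete pair scores over a list of position pairs."""
    return sum(pair_score(seq[i], seq[j]) for i, j in pairs)

def optimize(n_pos, target_pairs, competing_pairs, steps, seed=0):
    """Greedy variant of the charge-flip search: minimize the difference
    E_target - E_competing by flipping one random charge at a time."""
    rng = random.Random(seed)
    seq = [rng.choice("EK") for _ in range(n_pos)]
    best = energy(seq, target_pairs) - energy(seq, competing_pairs)
    for _ in range(steps):
        i = rng.randrange(n_pos)
        seq[i] = "E" if seq[i] == "K" else "K"          # flip
        score = energy(seq, target_pairs) - energy(seq, competing_pairs)
        if score <= best:
            best = score                                 # keep the flip
        else:
            seq[i] = "E" if seq[i] == "K" else "K"       # revert
    return "".join(seq), best

# Toy example: two pairs stabilize the target, two the competitor.
seq, gap = optimize(4, [(0, 1), (2, 3)], [(0, 2), (1, 3)], steps=300)
print(seq, gap)
```
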
4. Collagen Self-assembly Much of protein design has focused on a-helical proteins, leading to an extensive understanding of coiled-coil interactions. By comparison, the design of b-sheet structures lags behind. Even less frequently considered is the collagen triple-helix, which is surprising given that collagen is the most abundant protein in higher animals, accounting for approximately one-third of total protein mass. Individual collagen chains trimerize into triple-helices, which further assemble into higher order structures such as long fibers and mesh-like networks (Kadler et al., 1996; Ramachandran and Kartha, 1955; Rich and Crick, 1961). These structures provide tensile strength and flexibility to tissues. Collagens also play important functional roles in mediating cell polarity, initiating thrombosis, and modulating tumor metastasis (Kalluri, 2003). Triple-helix-forming domains are defined by a canonical Gly–X–Y triplet repeat. These repeats can extend for over one thousand amino acids, resulting in triple-helices around 300 nm long. The X and Y positions are frequently proline and (4R)-hydroxyproline (abbreviated Hyp or O), respectively. Peptides as short as 18 amino acids can form triple-helices if they adhere to a G–X–Y pattern where X is often P and Y is often O. While the motif is simple, designing specific hetero-associations of collagen is more difficult. Our understanding of the molecular forces that drive triple-helix formation is less developed than for other classes of proteins. Unlike globular proteins, the fibrillar regions of collagens do not have a hydrophobic core. Instead, the triple-helix structure is mediated by a network of interchain backbone hydrogen bonds (Bella et al., 1994; Rich and Crick, 1961). Sidechains of nonglycine positions project into solvent, where the energetic contributions of these groups to folding and stability are highly dependent on interactions with water.
High-resolution crystal structures of collagen peptides show an extended, structured hydration network surrounding the triple-helix (Bella
et al., 1995). Modeling solvent contributions to protein folding and structure has always been a major challenge in computational methods, which must strike a balance between the efficiency of continuum solvation and the accuracy of explicit water models (Jaramillo and Wodak, 2005; Pokala and Handel, 2004). Additionally, the folding of collagen is slow and not two-state, complicating a thermodynamic characterization of the contribution of amino acid sequence to the energies of native and unfolded states (Persikov et al., 2004). A number of natural collagens form heteromeric associations. There are 28 known collagen types in humans (Heino, 2007). Multiple subtypes exist within each of these types, and subtypes are often coexpressed to assemble into heterotrimeric triple-helices. The most abundant collagen, Type I, is composed of two a1(I) chains and one a2(I) chain. Type IV collagen, a primary component of basement membranes, can exist as 2a1(IV):a2(IV), a3(IV):a4(IV):a5(IV), or a5(IV):2a6(IV) heterotrimers (Khoshnoodi et al., 2008). It is believed that stoichiometry is controlled at multiple levels, from gene expression to protein–protein interactions. Due to the extended structure of the collagen triple-helix, interactions that are adjacent in structure are also proximal in sequence (Fig. 20.5). For a homotrimeric collagen
[Figure 20.5 schematic: three collagen chains in register, with G–X–Y triplets labeled; the Y position of chain 1 lies 5.5–5.7 Å from the X and X′ positions of chain 2.]
Figure 20.5 Pairwise interactions in the collagen triple-helix are proximal in sequence. The Y position on one chain is able to make charge–pair interactions with both the X and X′ positions on the adjacent chain. In the high-resolution structure of a Type III collagen fragment, one observes a single X′ aspartate on chain 2 interacting with two Y-position arginines on chain 1.
with a sequence G–X–Y–G–X′–Y′, the Y position on one chain is within a few angstroms of both the X and the X′ positions on an adjacent chain. Natural collagens have a higher than normal frequency of acidic and basic residues in triple-helical domains, suggesting that electrostatic interactions play an important role in the stability and the specificity of chain–chain recognition during folding (Salem and Traub, 1975; Traub and Fietzek, 1976). In human Type I collagen, which has a 1014-residue triple-helical domain, if one assigns a score of +1 for each charge repulsion across Y–X or Y–X′ positions and −1 for each charge attraction, the net score is −135, indicating that around a third of all G–X–Y triplets are involved in some form of favorable electrostatic interaction. The correct stoichiometry of Type I collagen isoforms, 2a1:a2, has a more favorable charge-pairing score than an a1 homotrimer, suggesting that both the level of protein synthesis and the energetics of association may contribute to ensuring the correct stoichiometry of Type I collagen in biological systems (Traub and Fietzek, 1976). Due to the repetitive nature of the collagen sequence, it is possible to envision staggered associations of individual chains, where one or more chains are shifted by three amino acids. Such staggering destabilizes the triple-helix in two ways. As chains shift, the overhangs at the N and C termini are not able to form interchain hydrogen bonds, which are the primary driving force for folding. Furthermore, shifting the chains breaks interchain Y–X and Y–X′ charge–pair interactions. The further from the native registry, the weaker the electrostatic interactions, resulting in a funnel-like energy landscape (Fig. 20.6). Although this has not been demonstrated experimentally, this downhill energy surface leading to the native state may facilitate the correct folding and assembly of natural collagens.
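The ±1 bookkeeping described above can be sketched as follows. The toy chain and the contact rule (the Y of triplet t on one chain paired with the X of triplets t and t+1 on the next chain) are illustrative assumptions; shifting the registry loses terminal contacts, reproducing the funnel qualitatively:

```python
CHARGE = {"E": -1, "D": -1, "K": +1, "R": +1}

def pair_term(y, x):
    """+1 for like charges (repulsion), -1 for opposite (attraction)."""
    qy, qx = CHARGE.get(y, 0), CHARGE.get(x, 0)
    if qy == 0 or qx == 0:
        return 0
    return 1 if qy == qx else -1

def registry_score(chain_a, chain_b, shift=0):
    """Score chain A against chain B displaced by `shift` triplets.
    Each triplet is a (G, X, Y) tuple; Y of triplet t sees X of
    triplets t and t+1 on the adjacent chain (assumed contact rule)."""
    score = 0
    for t, (_g, _x, y) in enumerate(chain_a):
        for u in (t + shift, t + shift + 1):
            if 0 <= u < len(chain_b):
                score += pair_term(y, chain_b[u][1])
    return score

# A chain with glutamate at X and lysine at Y pairs favorably in register;
# a three-triplet stagger loses terminal contacts and weakens the score.
chain = [("G", "E", "K")] * 8
print(registry_score(chain, chain, 0), registry_score(chain, chain, 3))
# -15 -9
```
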
Model peptide systems have been essential tools in exploring the molecular basis of the stability, specificity, and stoichiometry of collagen assembly. Most model peptide work has focused on homotrimers, although cysteine-knot systems can be used to engineer controlled heterotrimeric assemblies (Fiori et al., 2002; Ottl et al., 1996). Host–guest studies on model collagen peptides by Brodsky and colleagues established that favorable interchain charge–pair interactions can increase thermal stability (Persikov et al., 2005; Venugopal et al., 1994; Yang et al., 1997). Complementary pairing of electrostatic interactions can also drive the formation of highly stable heterotrimeric triple-helices. The Hartgerink laboratory has made extensive use of charge–pair interactions between individual chains in the triple-helix to form an A:B:C heterotrimer (Gauba and Hartgerink, 2007a,b, 2008). Two of the peptides, (EOG)10 and (PRG)10, are highly charged, and the third, (POG)10, is neutral. Together, they form an extensive network of charge–pair interactions between arginines at the Y position and glutamates at the X position. Between the peptides, there are 19 potential ion pairs. The high P/O content of (POG)10 makes it highly stable as a homotrimer (Tm ≈ 68 °C). When heated past 80 °C and annealed as an A:B:C mixture,
[Figure 20.6 surface plot: charge–pair interaction energy as a function of the registries of chains 2 and 3, each shifted from −5 to +5 triplets.]
Figure 20.6 Pairwise electrostatic interactions in human Type I collagen. The native-like registry results in the maximum number of favorable charge pairs. Displacing either chain by several triplets quickly disrupts many of these interactions. This funnel-shaped energy surface is not observed for randomized Type I collagen sequences in which the amino acid frequencies are held constant.
the peptides form a heterotrimeric species with a Tm around 54 °C (Gauba and Hartgerink, 2007a,b).
5. Considerations in Computational Design of Collagen Heteromers We recently sought to apply computational approaches for optimizing interfacial electrostatic interactions between a-helical domains to the design of self-assembling heterotrimeric collagen-like peptides (Xu et al., 2010). The goal was to develop a computational scheme that optimized the stability of a target species, an ABC collagen-like heterotrimeric triple-helix, while disfavoring the formation of competing states. The interaction energies of all species, the target ABC heterotrimer and 26 undesired states (AAA, AAB, AAC, . . ., CCA, CCB, CCC), were calculated using a scoring function adapted from a-helical coiled-coil design (Summa et al., 2002):

Ei,j = +2 for R/R; +3 for E/E; −1 for R/E or E/R. (20.2)
The energy of an Arg–Arg repulsion was weighted less than Glu–Glu due to the expectation that the longer, flexible sidechain of Arg would more easily avoid unfavorable interactions. Interactions of any residue with proline or hydroxyproline were assigned a score of zero. A single-body energy of −1.5 kcal/mol was added for each POG triplet to approximate the favorable backbone stability provided by this neutral motif. Stability and specificity were simultaneously optimized to separate the computed stability of the target ABC heterotrimer from those of the competing hetero- and homotrimers. The stability of the ABC species was optimized using positive design, maximizing the number of favorable interactions in a discrete, sequence-based interaction model. Specificity was targeted using negative design, applying the same discrete interaction model to the competing species and optimizing a Boltzmann probability-based specificity score:

PABC = exp(−bEABC) / Σi=1..27 exp(−bEi), (20.3)

where i ∈ (AAA, AAB, AAC, . . ., CCC) and b is a system temperature set to 1.0. In cases where all peptides in solution are predicted to assemble as ABC, PABC approaches one. If ABC is not formed, PABC approaches zero. To evaluate the relative contributions of positive and negative design to energy and specificity distributions, 1000 simulations were performed for each scenario: positive design alone, negative design alone, or concurrent optimization of both stability and specificity. Designs were evaluated using the energy gap between the native target and the lowest-energy unwanted competing state:

Egap = EABC − min(Ei), i ≠ ABC. (20.4)
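Equations (20.2)–(20.4) can be strung together in a short sketch. The contact rule and the toy A, B, and C sequences below are invented for illustration; only the scoring values, the −1.5 kcal/mol POG bonus, and the Boltzmann form follow the text:

```python
from itertools import product
from math import exp

def pair_e(a, b):
    """Eq. (20.2): +2 for R/R, +3 for E/E, -1 for R/E; 0 otherwise."""
    s = {a, b}
    if s == {"R"}:
        return 2.0
    if s == {"E"}:
        return 3.0
    if s == {"R", "E"}:
        return -1.0
    return 0.0

def trimer_energy(chains):
    """Charge pairs over the three interfaces plus the -1.5 kcal/mol
    single-body bonus per POG triplet. The contact rule (Y of triplet t
    against X of triplets t and t+1 on the next chain) is assumed."""
    e = 0.0
    for c in chains:
        e += -1.5 * sum((x, y) == ("P", "O") for (_g, x, y) in c)
    for a, b in ((0, 1), (1, 2), (2, 0)):
        for t, (_g, _x, y) in enumerate(chains[a]):
            for u in (t, t + 1):
                if u < len(chains[b]):
                    e += pair_e(y, chains[b][u][1])
    return e

# Invented toy chains, three triplets each:
SEQS = {"A": [("G", "E", "O")] * 3,
        "B": [("G", "P", "R")] * 3,
        "C": [("G", "E", "R")] * 3}

energies = {"".join(k): trimer_energy([SEQS[c] for c in k])
            for k in product("ABC", repeat=3)}      # all 27 species
beta = 1.0
z = sum(exp(-beta * e) for e in energies.values())
p_abc = exp(-beta * energies["ABC"]) / z            # Eq. (20.3)
e_gap = energies["ABC"] - min(e for k, e in energies.items() if k != "ABC")
print(len(energies), round(p_abc, 3))
```
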
As parameterized in this study, combining positive and negative design raised the mean gap energy over either component alone (Fig. 20.7). Based on 10,000 MCSA simulations combining positive and negative design elements, a final set of A, B, and C peptide sequences was selected that had an optimal stability of ABC and a maximal energy gap. Experimental characterization of these designs did not match the computational predictions, and upon careful evaluation, both errors in the method and shortcomings of the energy function were found to be at fault. However, this first round of design raised a number of interesting questions that would be relevant to future successful construction of heteromeric collagen associations. One feature of the ABC designs was the near balance of acidic and basic triplets across the three peptides. Peptide A was primarily acidic, B primarily basic, and C half and half. Of the top-scoring structures, the difference
[Figure 20.7 scatter plot: energy of ABC (kcal/mol) versus energy gap (kcal/mol) for the stability-only, specificity-only, and combined optimization scenarios.]
Figure 20.7 Using a Monte Carlo search algorithm, sequences for an ABC collagen heterotrimer were optimized using positive design (stability), negative design (specificity), and both terms simultaneously (combined). In the combined case, solutions with good target energies and large energy gaps were found.
between acidic and basic triplets was never greater than 4 out of a possible 30. This makes sense in terms of positive design: the greater the number of favorable charge pairs, the lower the energy of state ABC. However, part of the scoring function also considered negative design, the specificity of ABC over other states. To test the role of the net charge balance on specificity, a series of sequence optimization simulations was performed in which the net charge across the three peptides was fixed. Positively and negatively charged triplets in different locations were swapped randomly, and charges were allowed to move from X to Y positions. Overall, it was found that the greater the charge differential, whether in favor of acidic or basic residues, the smaller the energy gap between ABC and the next most stable state (Fig. 20.8). Thus, a net charge balance of zero ensures both stability and specificity. Another consequence of balancing charge is an increase in sequence diversity across the three peptides. In cells, where protein concentrations are very high, it has been suggested that sequence diversity prevents nonspecific aggregation. One way of quantifying sequence diversity is based on the number of each amino acid type in a protein (Slovic et al., 2003):

S = ln[(Σi Ni)! / Πi Ni!], (20.5)

with the sum and product running over the 20 amino acid types,
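Equation (20.5) is straightforward to compute from amino acid counts; a minimal sketch (the function name is assumed):

```python
from collections import Counter
from math import factorial, log

def diversity(seq):
    """Eq. (20.5): S = ln( (sum_i N_i)! / prod_i N_i! )."""
    counts = Counter(seq)
    numerator = factorial(sum(counts.values()))
    denominator = 1
    for n in counts.values():
        denominator *= factorial(n)
    # The multinomial coefficient is an exact integer, so // is safe.
    return log(numerator // denominator)

print(diversity("AAAAAA"))              # 0.0: a single residue type
print(round(diversity("ADEKLV"), 2))    # 6.58: six distinct types
```
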
Figure 20.8 Energy gap between native and competing states and sequence diversity (Eq. (20.5)) versus the relative fraction of acidic and basic groups in an ABC heterotrimer.
Figure 20.9 Effect of POG triplet content on the pairwise interaction energy of an ABC heterotrimer target state and lowest energy competing state. Reduction in the energy gap is primarily due to loss of favorable interactions in the native state.
Computational Design of Self-Assembly
where N_i is the number of instances of amino acid type i. As the simulation shows, there is a strong correlation between the energy gap and sequence diversity. In addition to presenting a net charge of zero, the ABC designs also had a high frequency of charged amino acids, with no neutral POG triplets found. The POG content represents a tradeoff between target stability and specificity. The greater the POG content of ABC, the more stable it is, but similar increases in stability would be expected for other stoichiometries, which would also have high POG content. Replacing charged triplets with POG reduces the number of favorable interactions in ABC as well as unfavorable interactions in competing states. To assess the relative contributions of these to specificity, a series of simulations was run with varying fractions of POG fixed across the three sequences. As POG content was increased, the number of native-state charge-pair interactions decreased more quickly than unfavorable interactions in competing states (Fig. 20.9).
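The diversity measure of Eq. (20.5) is straightforward to compute; a minimal sketch follows, assuming sequences are given as plain amino acid strings (the function name is ours, not from the original work):

```python
import math
from collections import Counter

def sequence_diversity(seq):
    """Eq. (20.5): S = ln[(sum_i N_i)! / prod_i N_i!], where N_i is the
    number of instances of amino acid type i in the sequence."""
    counts = Counter(seq)
    total = sum(counts.values())
    log_s = math.lgamma(total + 1)        # ln(N!) via the log-gamma function
    for n_i in counts.values():
        log_s -= math.lgamma(n_i + 1)     # subtract ln(N_i!)
    return log_s

# A homopolymer has zero diversity; mixing residue types raises S.
print(sequence_diversity("PPPPPP"))                                  # 0.0
print(sequence_diversity("ADEKPR") > sequence_diversity("PPPPPP"))   # True
```

Using `math.lgamma` avoids overflow for long sequences, since the factorials themselves are never formed.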
6. Conclusions

Progress in computational design of self-assembling protein systems has both contributed to and benefited from our basic understanding of how proteins fold and are stabilized. Work on α-helices indicates that even with approximate energy functions, the design of target stability and specificity can be achieved. Extending these ideas to collagen self-assembly has been challenging, due to the rarity of high-resolution structural information and of good thermodynamic models for triple-helix folding and stability. Strategies similar to those used for α-helices, such as mediating specificity through electrostatics, have seen some success but have yet to be generalized in computational design. A better understanding of how positive and negative design can be effectively implemented is needed. The relative importance of electrostatics, sequence diversity, and backbone stability is being explored in model peptide systems.
REFERENCES

Bella, J., Eaton, M., et al. (1994). Crystal and molecular structure of a collagen-like peptide at 1.9 Å resolution. Science 266(5182), 75–81.
Bella, J., Brodsky, B., et al. (1995). Hydration structure of a collagen peptide. Structure 3(9), 893–906.
Berezovsky, I. N., Zeldovich, K. B., et al. (2007). Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol. 3(3), e52.
Bromley, E. H., Sessions, R. B., et al. (2009). Designed alpha-helical tectons for constructing multicomponent synthetic biological systems. J. Am. Chem. Soc. 131(3), 928–930.
Dahiyat, B. I., and Mayo, S. L. (1997). De novo protein design: Fully automated sequence selection. Science 278(5335), 82–87.
Fiori, S., Sacca, B., et al. (2002). Structural properties of a collagenous heterotrimer that mimics the collagenase cleavage site of collagen type I. J. Mol. Biol. 319(5), 1235–1242.
Gauba, V., and Hartgerink, J. D. (2007a). Self-assembled heterotrimeric collagen triple helices directed through electrostatic interactions. J. Am. Chem. Soc. 129(9), 2683–2690.
Gauba, V., and Hartgerink, J. D. (2007b). Surprisingly high stability of collagen ABC heterotrimer: Evaluation of side chain charge pairs. J. Am. Chem. Soc. 129(48), 15034–15041.
Gauba, V., and Hartgerink, J. D. (2008). Synthetic collagen heterotrimers: Structural mimics of wild-type and mutant collagen type I. J. Am. Chem. Soc. 130(23), 7509–7515.
Grigoryan, G., Reinke, A. W., et al. (2009). Design of protein-interaction specificity gives selective bZIP-binding peptides. Nature 458(7240), 859–864.
Grosset, A. M., Gibney, B. R., et al. (2001). Proof of principle in a de novo designed protein maquette: An allosterically regulated, charge-activated conformational switch in a tetra-alpha-helix bundle. Biochemistry 40(18), 5474–5487.
Harbury, P. B., Zhang, T., et al. (1993). A switch between two-, three-, and four-stranded coiled coils in GCN4 leucine zipper mutants. Science 262(5138), 1401–1407.
Harbury, P. B., Plecs, J. J., et al. (1998). High-resolution protein design with backbone freedom. Science 282(5393), 1462–1467.
Havranek, J. J., and Harbury, P. B. (2003). Automated design of specificity in molecular recognition. Nat. Struct. Biol. 10(1), 45–52.
Heino, J. (2007). The collagen family members as cell adhesion proteins. Bioessays 29(10), 1001–1010.
Jaramillo, A., and Wodak, S. J. (2005). Computational protein design is a challenge for implicit solvation models. Biophys. J. 88(1), 156–171.
Jiang, L., Althoff, E. A., et al. (2008). De novo computational design of retro-aldol enzymes. Science 319(5868), 1387–1391.
Kadler, K. E., Holmes, D. F., et al. (1996). Collagen fibril formation. Biochem. J. 316(Pt 1), 1–11.
Kalluri, R. (2003). Basement membranes: Structure, assembly and role in tumour angiogenesis. Nat. Rev. Cancer 3(6), 422–433.
Kamtekar, S., Schiffer, J. M., et al. (1993). Protein design by binary patterning of polar and nonpolar amino acids. Science 262(5140), 1680–1685.
Khoshnoodi, J., Pedchenko, V., et al. (2008). Mammalian collagen IV. Microsc. Res. Tech. 71(5), 357–370.
Kuhlman, B., Dantas, G., et al. (2003). Design of a novel globular protein fold with atomic-level accuracy. Science 302(5649), 1364–1368.
Marshall, S. A., and Mayo, S. L. (2001). Achieving stability and conformational specificity in designed proteins via binary patterning. J. Mol. Biol. 305(3), 619–631.
Nautiyal, S., Woolfson, D. N., et al. (1995). A designed heterotrimeric coiled coil. Biochemistry 34(37), 11645–11651.
Ottl, J., Battistuta, R., et al. (1996). Design and synthesis of heterotrimeric collagen peptides with a built-in cystine-knot. Models for collagen catabolism by matrix metalloproteases. FEBS Lett. 398(1), 31–36.
Persikov, A. V., Xu, Y., et al. (2004). Equilibrium thermal transitions of collagen model peptides. Protein Sci. 13(4), 893–902.
Persikov, A. V., Ramshaw, J. A., et al. (2005). Electrostatic interactions involving lysine make major contributions to collagen triple-helix stability. Biochemistry 44(5), 1414–1422.
Pokala, N., and Handel, T. M. (2004). Energy functions for protein design I: Efficient and accurate continuum electrostatics and solvation. Protein Sci. 13(4), 925–936.
Ramachandran, G. N., and Kartha, G. (1955). Structure of collagen. Nature 176(4482), 593–595.
Ramos, J., and Lazaridis, T. (2006). Energetic determinants of oligomeric state specificity in coiled coils. J. Am. Chem. Soc. 128(48), 15499–15510.
Reinke, A. W., Grant, R. A., et al. (2010). A synthetic coiled-coil interactome provides heterospecific modules for molecular engineering. J. Am. Chem. Soc. 132(17), 6025–6031.
Rich, A., and Crick, F. H. (1961). The molecular structure of collagen. J. Mol. Biol. 3, 483–506.
Salem, G., and Traub, W. (1975). Conformational implications of amino acid sequence regularities in collagen. FEBS Lett. 51(1), 94–99.
Seeman, N. C., and Kallenbach, N. R. (1983). Design of immobile nucleic acid junctions. Biophys. J. 44(2), 201–209.
Shakhnovich, E. I., and Gutin, A. M. (1993a). Engineering of stable and fast-folding sequences of model proteins. Proc. Natl. Acad. Sci. USA 90, 7195–7199.
Shakhnovich, E. I., and Gutin, A. M. (1993b). A new approach to the design of stable proteins. Protein Eng. 6(8), 793–800.
Silverman, J. A., Balakrishnan, R., et al. (2001). Reverse engineering the (beta/alpha)8 barrel fold. Proc. Natl. Acad. Sci. USA 98(6), 3092–3097.
Sindelar, C. V., Hendsch, Z. S., et al. (1998). Effects of salt bridges on protein structure and design. Protein Sci. 7(9), 1898–1914.
Slovic, A. M., Summa, C. M., et al. (2003). Computational design of a water-soluble analog of phospholamban. Protein Sci. 12(2), 337–348.
Summa, C. M., Rosenblatt, M. M., et al. (2002). Computational de novo design and characterization of an A(2)B(2) diiron protein. J. Mol. Biol. 321(5), 923–938.
Traub, W., and Fietzek, P. P. (1976). Contribution of the A2 chain to the molecular stability of collagen. FEBS Lett. 68(2), 245–249.
Venugopal, M. G., Ramshaw, J. A., et al. (1994). Electrostatic interactions in collagen-like triple-helical peptides. Biochemistry 33(25), 7948–7956.
Wesson, L., and Eisenberg, D. (1992). Atomic solvation parameters applied to molecular dynamics of proteins in solution. Protein Sci. 1(2), 227–235.
West, M. W., and Hecht, M. H. (1995). Binary patterning of polar and nonpolar amino acids in the sequences and structures of native proteins. Protein Sci. 4(10), 2032–2039.
Xu, F., Zhang, L., et al. (2010). De novo self-assembling collagen heterotrimers using explicit positive and negative design. Biochemistry 49(11), 2307–2316.
Yang, W., Chan, V. C., et al. (1997). Gly-Pro-Arg confers stability similar to Gly-Pro-Hyp in the collagen triple-helix of host-guest peptides. J. Biol. Chem. 272(46), 28837–28840.
CHAPTER TWENTY-ONE

Differential Analysis of 2D Gel Images

Feng Li* and Françoise Seillier-Moiseiwitsch†

Contents
1. Introduction 596
2. Differential Analysis of 2D Gel Images 597
3. Analyzing 2D Gel Images Using RegStatGel 599
 3.1. Fully automatic mode 600
 3.2. Interactive automatic mode 601
 3.3. Stepwise operation and exploration mode 603
4. Illustration of an Exploratory Analysis Using RegStatGel 606
5. Concluding Remarks 608
References 609
Abstract

Two-dimensional polyacrylamide gel electrophoresis remains a popular and powerful tool for identifying proteins that are differentially expressed across treatment conditions. Due to the overwhelming number of proteins and the tremendous variation shown in gel images, the differential analysis of 2D gel images is challenging. While commercial software packages are available for such analysis, they require considerable human intervention for spot detection and matching. Moreover, the quantitative comparison across groups of gels is based on simple classical tests that often do not fully account for the experimental design. We developed software with a graphical user interface, RegStatGel, which implements a novel statistical algorithm for identifying differentially expressed proteins. Unlike current commercial software packages, it is free, open-source, easy to use, and almost fully automated. It also provides more advanced statistical tools. More importantly, by using a master watershed map, RegStatGel bypasses the spot-matching procedure, which is a time-consuming bottleneck in gel image analysis. The software is freely available for academic use and has been tested in Matlab 7.01 under Windows XP. Detailed instructions on how to use RegStatGel to analyze 2D gel images are provided.

* Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland, USA
† Infectious Disease Clinical Research Program, Department of Preventive Medicine and Biometrics, Uniformed Services University of the Health Sciences, Bethesda, Maryland, USA
Methods in Enzymology, Volume 487, ISSN 0076-6879, DOI: 10.1016/S0076-6879(11)87021-7
© 2011 Elsevier Inc. All rights reserved.
1. Introduction

Although there are newer technologies for protein separation and quantification, two-dimensional polyacrylamide gel electrophoresis (2D PAGE, 2D gel) (O'Farrell, 1975) remains an important tool in proteomics. It has often been used by researchers as a screening tool for selecting interesting proteins and combined with mass spectrometry for sequence identification, which is a central paradigm in proteomic analysis. The goal of 2D gel imaging is to display proteomes in a form that is amenable to human vision and computer analysis, which can subsequently facilitate the comparison of samples from different experimental conditions and assist in protein localization (Wilkins et al., 1996, 2007). The volume of a protein on a gel is proportional to the darkness of the corresponding spot on the scanned gel image. By comparing spot intensities across images, volumes of the same protein under different treatment conditions can be compared and proteins that are changed in quantity can be identified. To date, many researchers still select protein spots manually by visual inspection, which is unwieldy when there are thousands of spots to consider. Moreover, gel images show a great deal of variability, which should be appropriately accounted for by utilizing suitable statistical methods. In contrast to the analysis of microarrays, there is limited published research, and freely available software packages are scarce for statistical differential analysis of 2D PAGE images. The main difficulties in automated analysis of 2D PAGE images are the discrimination between actual protein spots and noise, the quantification of protein expression levels thereafter, and spot matching for individual comparison (Roy et al., 2003). Although there are commercial software packages for 2D gel image analysis (e.g., PDQuest, Dymension), considerable human intervention is still needed for spot detection and matching.
Indeed, spot matching is a time-consuming bottleneck in the currently available software packages. Spot matching is the process by which one maps each spot on a gel to the corresponding spots on other gels in order to compare the volume of a specific protein across all gels. Moreover, the comparison of quantitative spot features is based on simple classical tests, such as the t-test or the F-test, that do not always reflect the experimental design adequately. Therefore, there has been a longstanding need for a fast and statistically sound procedure, along with software that is simple, robust, and easy to implement. We developed software with a graphical user interface, RegStatGel, based on a novel method for identifying differentially expressed proteins. In contrast to commercial software packages, it is free, open-source, easy to utilize, and requires limited human intervention. It also contains more advanced statistical tools. More importantly, by using a master watershed
map generated by applying the watershed algorithm (Vincent and Soille, 1991) to the mean image, RegStatGel bypasses the spot-matching procedure and thus speeds up the whole analysis. The software is freely available for academic use and has been tested in Matlab 7.01 under Windows XP. The organization of this chapter is as follows. First, we briefly review the current workflow in 2D gel image analysis and introduce our region-based statistical methodology implemented in RegStatGel. Second, instructions on analyzing images with RegStatGel are provided. Third, an exploratory analysis using the software is presented. Lastly, advantages and limitations of the procedure are summarized and discussed.
2. Differential Analysis of 2D Gel Images

The differential analysis of 2D gel images involves a number of steps (Roy et al., 2003). The first step is to clean the image and detect spots. Gel preparation and image scanning introduce noise into the gel images such as spikes or streaks. Some spikes and streaks look very similar to true protein spots. Some spots merge into a streak. The images must be denoised and smoothed before actual spots can be located. Caution must be exercised to avoid oversmoothing, which would blur small spots into what an automated procedure would recognize as noise. Wavelet denoising methods are often used to remove spiky noise. Spot detection aims to determine the location and size of spots. The challenge in spot detection is the irregularity in their shape. Next, the detected spots are quantified by means of one of various measurements such as volume, median intensity, or shape parameters. The local background of a spot must be subtracted to account for the background variation from spot to spot. The second step is image alignment. The gel material is not rigid and undergoes distortions. The location of spots corresponding to the same protein may be quite different from image to image. To enable the detection of true differences (due to treatment, exposure, or disease, whichever is the focus of the study for which the gels were run), gel images must be aligned to some reference image. Landmarks must be picked carefully since they are crucial to the estimation of the dewarping function. Smoothness of the dewarping function is a desired feature in order to avoid awkward results such as discontinuities after alignment (Potra et al., 2006). Even after image alignment, spot matching is still necessary since the alignment is never perfect. Spot matching is the process by which one maps a spot on a particular gel to the corresponding spots on the other gels so that spots corresponding to the same protein are identified and labeled.
Checking matches produced by an automated algorithm is laborious since images contain hundreds or even thousands of spots. Spot matching has indeed
been recognized as the bottleneck in 2D PAGE image analysis. Because automated algorithms often fail to recognize corresponding spots, one often ends up with an analysis dataset that contains a large amount of missing information. In fact, the more gels for comparison, the higher the number of missing spots. Hence, we end up with the paradoxical situation where the more observations are generated the less reliable the dataset is. Once the images are cleaned, aligned, and spots are matched, the next step is to analyze the gel-to-gel variation. The variation within the data is a mixture of three types of variability: (i) experimental variation, (ii) biological variation, (iii) treatment (or disease or exposure) effects. The experimental variation is largely due to the complicated protocol involved in the assay. Any difference in the amount of material loaded on the gel or in the length of time electrophoresis is run, for instance, will be reflected in changes in spot intensities on the image. Biological variation is inevitable due to subject differences. Treatment (or disease or exposure) effects are what the investigators are interested in. To distinguish treatment effects from nuisance variation, valid statistical methods are needed. Statistical differential analysis of 2D gel images is still in its infancy. Models based on image pixels (Conradsen and Pedersen, 1992) are not practical given the huge number of pixels, high variability in the background, sensitivity to misalignment, and strong spatial correlation among pixels. Instead of working with individual pixels or spots, a region-based approach (Li, 2007) was developed, eschewing the need for spot matching. Instead of striving for perfect quantification of protein spots, we consider protein regions as our analytical units. The spot-matching procedure is bypassed by creating a single master watershed map from the mean image (Vincent and Soille, 1991). 
With this master watershed map, we divide the gel into different protein regions first, with each region containing a single spot, and compute a region-based measurement thereafter (such as total volume or median pixel intensity). The master watershed map, which defines the protein regions, is imposed on individual images. A master watershed map that separates spots on the mean image well should also separate spots on the individual images well given that all spots are represented on the mean image. While most of the proteins are segmented into different watershed regions, a few regions may contain overlapping protein spots. The pixels in each watershed region on each gel image are then classified as belonging either to the object or to the background. The object is the protein spot or the overlapping protein spots. A summary statistic for each protein region is obtained as a means of quantification of the object for comparative analysis. At this point, one of a number of normalization techniques can be applied. We implement a new normalization procedure for correction of spatial bias. After normalization, hypothesis testing is conducted on the normalized regional statistics. Regions showing statistically significant differences in the summary statistics across treatment
conditions are highlighted for subsequent analysis. The usual Benjamini–Hochberg (BH-FDR) procedure for controlling the false discovery rate (FDR) is less powerful when there are high correlations among subsets of proteins (Benjamini and Hochberg, 1995). Evidence (Efron, 2007; Qiu et al., 2005) shows that such correlations can have a strong impact on statistical inference. Based on the detected correlations, we separate protein regions into independent sets of correlated proteins. We use MANOVA tests for correlated protein regions and ANOVA for independent regions. The final p-values are considered together to find those protein regions with significant changes across experimental conditions.
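The BH-FDR step-up rule itself is simple: sort the m p-values, find the largest rank k with p_(k) ≤ (k/m)·q, and reject the k smallest. A minimal sketch in Python (RegStatGel itself is Matlab; the function name here is ours):

```python
def benjamini_hochberg(pvals, q=0.05):
    """BH-FDR step-up (Benjamini and Hochberg, 1995): reject the k smallest
    p-values, where k is the largest rank with p_(k) <= (k/m) * q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank                       # step-up: keep the largest passing rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

# The p = 0.03 region is rejected only because of the step-up rule:
# its threshold (3/4)*0.05 = 0.0375 exceeds 0.03.
print(benjamini_hochberg([0.01, 0.5, 0.02, 0.03]))   # [True, False, True, True]
```

Note that the step-up search keeps the largest passing rank even if intermediate ranks fail, which is what distinguishes BH from a simple per-test threshold.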
3. Analyzing 2D Gel Images Using RegStatGel

RegStatGel is implemented in Matlab 7.01. There are three modes of operation: fully automatic mode, interactive automatic mode, and stepwise exploration mode. These modes are activated when the user clicks different menus and their submenus. Figure 21.1 shows a snapshot of the
Figure 21.1 A snapshot of RegStatGel in fully automatic mode.
interface of RegStatGel under the fully automatic mode. The “Design Input/Output” panel displays the number of experimental groups, the number of replicates in each group, and the image size. The “Regional ANOVA,” “Region ID,” and “Graphics” panels in the main window are active under the exploratory mode where the features of a selected protein region can be investigated. The “Options” menu contains global options that affect all modes of operation. Detailed instructions for each mode of operation are provided below.
3.1. Fully automatic mode

New users should opt for this mode first as it is the simplest and fastest implementation. After starting the software under Matlab, the first step is to load the gel images using the “Load” menu. The gel images are loaded from separate image files when the analysis involves a new dataset or from a previously saved file that includes all gel images. The user will be prompted to input the number of groups (i.e., treatment/exposure/disease groups) and the number of replicates (i.e., images) within each group. The user will also be requested to select the gel replicate to be displayed. For example, in Fig. 21.1, gel 1 in each of the two groups is selected for display. Once all the gel images are loaded, the user can start the fully automatic analysis procedure by clicking the “Gel Analysis” menu and choosing the “Fully Automatic” submenu. The software then analyzes the gel images using the default options and a set of widely applicable parameter values. The processing status is conveyed to the user through the waiting bars and the Matlab command window. At the end of the analysis, protein regions with significant expression changes will be highlighted. The user has the option to update the list of marked protein regions for further inspection using the “Explore” menu. All the results generated by the software can be saved for future use using the submenus under the “Save” menu. The fully automated procedure executes the following key steps sequentially: (1) smoothing and rescaling the raw images; (2) generating the master watershed map from the mean image; (3) segmenting each watershed region; (4) quantifying each region; (5) separating regions into independent sets of correlated proteins; (6) running MANOVA for each set of correlated proteins; (7) selecting significantly changed protein regions using the BH-FDR procedure. First, the loaded images are smoothed using the default level 1 “sym8” wavelet approximation.
Alternative families of wavelets will not yield notable differences. The smoothed images are rescaled to the same intensity range [0, 1]. The mean image is computed, smoothed with the wavelet approximation, and enhanced via morphological processing techniques (Gonzalez and Woods, 2002) to improve contrast. This mean image contains all the protein spots from each image, which is key for bypassing spot
matching. The watershed algorithm is applied to the enhanced mean image to separate protein spots into different regions. The watershed regions of the mean image define the master watershed map and are imposed on each individual image for protein separation. The pixels in each region are classified as “background” or “object” by applying Otsu’s method (Otsu, 1979) to the log-transformed intensity. This procedure is similar to K-means clustering. For each region, by default, the difference in the means of the log-transformed pixel intensity of the background and the object is used as the summary statistic for differential analysis. Based on these statistics, regions are separated into mutually independent sets of correlated proteins using a prespecified threshold. Within each set, protein regions are highly correlated and analyzed using MANOVA. If a protein region is not correlated with any other region, ANOVA is automatically applied. Finally, the BH-FDR procedure is utilized to find the threshold for the p-values from the multivariate analysis to control the FDR. When no significantly correlated protein regions are identified, the program automatically switches to ANOVA. The analysis results generated from the automated procedure depend on a set of default parameter values that can be modified using the corresponding submenu under “Options”. Figure 21.2 displays a snapshot of the “Options” menu. The default analysis is a one-way multivariate analysis. If a two-way analysis is required, the user should check the two-way option under “Regional ANOVA” before starting the analysis. The user can change the default quantification method in the “Quantification” submenu. For example, the height or volume of a spot can be used as the summary statistic for comparison. By clicking “Edit Default Parameters”, an input dialog window will pop up for the user to modify or check the default parameter values.
Most of these values are widely applicable and need to be changed only when deemed necessary or purely for exploratory purposes. The “Options” menu has a global impact in the sense that it affects all operation modes.
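The per-region classification and default summary statistic can be sketched as follows. This is a plain-Python stand-in for Otsu's method and the region statistic described above, not the Matlab implementation; the helper names are ours, and pixel intensities are assumed positive so the log transform is defined:

```python
import math

def otsu_threshold(values, nbins=64):
    """Otsu (1979): pick the histogram split maximizing between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return hi                            # degenerate region: single intensity
    width = (hi - lo) / nbins
    hist = [0] * nbins
    for v in values:
        hist[min(int((v - lo) / width), nbins - 1)] += 1
    centers = [lo + (b + 0.5) * width for b in range(nbins)]
    best_t, best_var = lo, -1.0
    for k in range(1, nbins):
        w0 = sum(hist[:k])                   # weight of the lower class
        w1 = len(values) - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum(hist[b] * centers[b] for b in range(k)) / w0
        mu1 = sum(hist[b] * centers[b] for b in range(k, nbins)) / w1
        var_b = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance (unnormalized)
        if var_b > best_var:
            best_var, best_t = var_b, lo + k * width
    return best_t

def region_statistic(intensities):
    """Default summary: difference in mean log-intensity between the background
    and the object pixels of one watershed region (darker class = spot)."""
    logs = [math.log(v) for v in intensities]
    t = otsu_threshold(logs)
    obj = [v for v in logs if v < t]
    bg = [v for v in logs if v >= t]
    return sum(bg) / len(bg) - sum(obj) / len(obj)
```

For a region whose pixels split cleanly into a dark spot and a light background, the statistic reduces to the log-ratio of the two mean intensities.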
3.2. Interactive automatic mode

This implementation is more flexible than the fully automatic mode, but at the price of the user needing to answer a sequence of prompted questions. It executes all the key steps in the fully automatic mode but prompts the user to make a decision whenever comparable options are available. It also features processing options not included in the fully automatic mode. The user should have a good understanding of the underlying procedures and the influence of parameter values to take advantage of this mode. Specifically, the user is asked to modify or keep the default parameter values for each processing step. The parameters are those pertaining to smoothing the raw images, enhancing contrast in the mean image, constructing the master
Figure 21.2 A snapshot of the items in the “Options” menu.
watershed map, and statistical analysis. To avoid oversmoothing the raw images, the level 1 wavelet approximation is the default. The user can select a level 2 approximation when smoother images are needed. The quality of the mean image has a direct influence on how well the master watershed boundaries separate spots. The intensity distribution of an image can be altered with a power transformation, with a larger power yielding a darker image. The default value for the power parameter is 1, which produces a linear scaling. The contrast of the mean image can be enhanced by using morphological processing tools such as the top-hat and bottom-hat filters (Gonzalez and Woods, 2002). Morphological processing techniques utilize a predefined structuring element to smooth uneven illumination of an image. In RegStatGel, the user can change the size of the default disk-shaped structuring element to alter the contrast. To separate protein spots into different regions, the watershed transform is applied to the enhanced mean image. The watershed transform finds “catchment basins” and “watershed ridge lines” in an image by treating it as a surface where light pixels are high and dark pixels are low. The algorithm delineates the watershed boundaries
around local minima on the image. The procedure needs input from the user to know the depth of a spot so that shallow regional minima can be suppressed to avoid oversegmentation. Thus, this parameter needs to be adjusted according to the depth of the faintest spot. The software always rescales the mean image to the intensity range [0, 255] for morphological processing and the watershed transform; the related default parameters are thus applicable to any intensity range. The values of these parameters have a direct impact on the watershed boundaries and should be changed if the mean image is oversegmented (i.e., too many tiny regions containing only noise) or undersegmented (i.e., too many regions containing more than one spot). The user should be aware that the minimum depth of a spot should be specified on the 0–255 scale. All the above parameters affect the master watershed map directly and the statistical analysis indirectly; the only parameter that directly affects the statistical analysis is the correlation threshold for identifying correlated protein regions. The default value is 0.96, targeting strong correlations. The software provides a permutation method to help the user choose the threshold by displaying the histogram of the correlation coefficients and estimating the FDR under the “Check Region Correlation” submenu of the “STAT” menu. Under the interactive automatic mode, the user will be prompted to load the original raw images for regional quantification instead of the smoothed and rescaled images in case it is a concern that smoothing and rescaling bias the comparison of the regional statistics. Moreover, the user will be asked whether to apply the normalization procedure based on 2D LOESS (Cleveland, 1993) to remove possible spatial bias.
Since background correction is performed by subtracting the background mean (or maximum depending upon the “Quantification” option) in the regional quantification, normalization of the summary statistics should be used with caution.
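The flooding idea behind the watershed transform can be illustrated in a few lines. The following is a toy marker-based priority flood, not the Vincent and Soille (1991) algorithm RegStatGel uses, and the function and variable names are ours; 4-connectivity is assumed:

```python
import heapq

def watershed(image, markers):
    """Priority-flood watershed: pixels are visited in order of increasing
    intensity and inherit the label of the basin that reaches them first."""
    rows, cols = len(image), len(image[0])
    labels = [row[:] for row in markers]   # 0 = unlabeled
    heap = [(image[r][c], r, c)
            for r in range(rows) for c in range(cols) if markers[r][c]]
    heapq.heapify(heap)
    while heap:
        _, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] == 0:
                labels[nr][nc] = labels[r][c]
                heapq.heappush(heap, (image[nr][nc], nr, nc))
    return labels

# Two seeded minima flood toward the ridge (intensity 3) between them.
ridge = [[1, 2, 3, 2, 1],
         [1, 2, 3, 2, 1]]
seeds = [[1, 0, 0, 0, 2],
         [0, 0, 0, 0, 0]]
print(watershed(ridge, seeds))   # [[1, 1, 1, 2, 2], [1, 1, 1, 2, 2]]
```

Suppressing shallow minima before seeding, as the depth parameter above does, is what prevents every noise dimple from becoming its own basin.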
3.3. Stepwise operation and exploration mode

This is the most flexible and versatile mode, designed for exploratory analysis. It enables the user to execute each step of the analysis and investigate the intermediate results. The quality of the master watershed map can be inspected and different sets of parameter values can be applied to improve protein separation on the mean image. The user can also select protein regions for closer visual inspection of features such as their 3D shapes. The “Region ID”, “Regional ANOVA”, and “Graphics” panels in the main window play important roles in this exploration mode. The main menus in this mode include “Options”, “Stepwise Operation”, “Explore”, and “STAT”. Figure 21.3 displays the items in the “Stepwise Operation” menu. Most of the submenus are self-evident. The operations listed sequentially in the
Figure 21.3 A snapshot of the items under the “Stepwise Operation” menu.
“Stepwise Operation” menu need not be executed in that order unless it is the very first run after starting the software. For example, the user can jump to “Regional Quantification” in order to recalculate the regional summary statistics based on alternative quantification methods selected in a submenu of “Options” and then choose ANOVA, nonparametric one-way analysis, or multivariate analysis for the statistical analysis. The “Build Master Watershed Region” menu is very helpful for understanding the influence of the different parameters involved in image enhancement and the watershed algorithm on the resulting master watershed map. Whenever needed, the user can explore the intermediate or final results using the “Explore” menu. Its submenus are displayed in Fig. 21.4. By clicking the “Master Watershed Region” submenu, the user can use the “Next” and “Back” buttons within the “Region ID” panel in the main window to explore protein regions sequentially. The user can also type a region’s identification number into the text box in the “Region ID” panel to check a specific region. The size of the displayed image section is determined by the slider position within the “Graphics” panel. The user can zoom in or out by moving the slider. Figure 21.4 shows the image section surrounding a protein region, with the black borders denoting the master watershed boundary. The “Regional ANOVA” panel displays the one-way ANOVA and nonparametric Kruskal–Wallis test results. The user can also choose not to show the watershed boundary by unchecking the “Impose Watershed Region” item in the “Display Options” submenu under “Options.” The 3D shape of spots can be displayed when “Display 3D Image of a Region” is checked, as shown in Fig. 21.5 in the case of four groups of gels under a two-way factorial design. A spot is flipped upside down for displaying its 3D shape. When there are few gel images, all replicates within a
Differential Analysis of 2D Gel Images
Figure 21.4 A snapshot of the items under the “Explore” menu.
group can be displayed simultaneously by checking “Display all Region Replicates” under “Display Options” of “Options”, as shown in Fig. 21.6. The whole image is displayed whenever the “Show Whole Image” button is clicked. An image section or a specific region of the currently displayed image can be selected by clicking “Manually Pick an Image Section” or “Check Clicked Region”, respectively. This is very useful for checking the quality of the master watershed map and for setting the parameters that govern construction of the master watershed boundaries. When the “Regional ANOVA and Distribution of Regional Statistics” submenu is selected, the normality plot of the summary statistics is displayed in a new figure. Clicking the “Check Spatial Bias” submenu shows all pairwise differences between gels together with the trend fitted by the LOESS method. When deemed necessary, the spatial trend can be corrected with the spatial-normalization item in the “Normalization” submenu under the “STAT” menu. The submenus of the “STAT” menu are shown in Fig. 21.7. The “STAT” menu contains submenus for statistical plots of the regional quantifications, such as boxplots and histograms. Various normalization procedures for the regional statistics are also
Feng Li and Françoise Seillier-Moiseiwitsch
Figure 21.5 A snapshot of RegStatGel displaying the 3D shape of four spots.
provided. Using the items in the “Checking Region Correlation” submenu, independent sets of correlated proteins are constructed from a user-selected correlation threshold. The “Permutation” item within this submenu helps identify a suitable threshold.
4. Illustration of an Exploratory Analysis Using RegStatGel

The previous section presented the features and key functions of RegStatGel. In this section, we illustrate the use of the software by showing explicitly all the steps a novice user would follow to analyze a set of gel images. Consider a user who wants to analyze a set of gels generated from a two-way factorial design. After loading the gels, the user would select the two-way option in the “Regional ANOVA” submenu of “Options”, since the default is a one-way analysis. The user would then activate the fully automatic operation by choosing the “Fully Automatic” submenu of “Gel Analysis”. After the analysis is completed, a list of
Figure 21.6 A snapshot of RegStatGel displaying all replicates.
Figure 21.7 A snapshot of the “STAT” menu.
significantly changed protein regions will be highlighted. The user would then choose to update this list when prompted by the software. To step through the marked regions for closer inspection, the user would select “Marked Region List” by clicking the “Working Data” submenu under “Options”. The user can then move through the marked regions by using the
“Next” and “Back” buttons in the “Region ID” panel. Regions with significant expression changes are also listed in the Matlab command window, so the user can always type a selected region number to explore a specific region. All replicates within each group can be displayed by checking the relevant item in the “Display Options” submenu of “Options”. If the analysis results are unsatisfactory, for instance when many detected regions contain more than one protein spot, the user may need to reconstruct the master watershed map with a different set of parameter values. This can be done by editing the default parameters for the fully automatic operation or by using the interactive automatic or stepwise operations. The user should also check, with the functions under the “STAT” menu, whether normalization of the regional statistics is needed. The regional quantification method can also be changed for exploratory purposes: for example, the height or volume of the object part of each region can be used as the quantitative measure instead of the mean value. All results can be saved through the “Save” menu for future use.
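The alternative quantitative measures mentioned above (mean, height, and volume of the object part of a region) amount to simple summaries of the background-corrected intensities. A minimal sketch follows, with an invented function name and a deliberately crude background estimate; RegStatGel's local segmentation and background correction are more elaborate.

```python
import numpy as np

def region_quantifications(patch, background=None):
    """Alternative summary statistics for one protein region.

    `patch` is the 2-D intensity array inside the region's watershed
    boundary; `background` is an optional scalar baseline.  Returns the
    mean, height (peak above background), and volume (integrated
    intensity above background) of the region.
    """
    if background is None:
        background = patch.min()                  # crude local background
    corrected = np.clip(patch - background, 0.0, None)
    return {
        "mean": float(corrected.mean()),
        "height": float(corrected.max()),
        "volume": float(corrected.sum()),
    }
```

Because the three measures weight the spot profile differently, comparing results across them is a quick robustness check on any region flagged as significant.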
5. Concluding Remarks

In this chapter, we introduced software with a graphical user interface for analyzing 2D gel images based on a novel method. The software offers fully automatic processing as well as many alternative options for stepwise exploratory analysis, and aims to give the user a friendly tool for fast statistical analysis. The time-consuming spot-matching procedure is bypassed: a master watershed map built from the mean image assigns the same label to spots in the same region across all gel images. The quantitative characteristics of the spots within a watershed region are extracted by local segmentation and background correction. Because a single summary statistic only partially captures the features of a protein region, the software provides several quantification methods. Once the region quantifications are obtained, appropriate statistical methods can be applied easily. To cope with protein correlations, a multivariate analysis based on correlation sets is included in the software. Although different image-preprocessing methods are available, we recommend limiting the preprocessing of the raw images, as it may alter spot features and create artificial differences or blur group effects. The default processing involves only a first-level wavelet smoothing and linear scaling. It should be noted that RegStatGel does not provide an image-alignment function; it is aimed at postalignment statistical analysis. The underlying methodology is, however, not sensitive to slight misalignment. The software runs under Matlab, and the open-source code allows users to add further macros.
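The central idea summarized above, building one master label map from the mean image and applying it to every gel so that spot matching becomes unnecessary, can be illustrated as follows. This sketch substitutes a plain threshold-and-label step (via scipy.ndimage) for the actual watershed construction, and all function names are illustrative rather than RegStatGel's.

```python
import numpy as np
from scipy import ndimage

def master_labels(mean_image, threshold):
    """Label connected regions of the mean image.

    RegStatGel derives its partition from a watershed of the enhanced
    mean image; a simple threshold-and-label step stands in for that here.
    Returns the label array and the number of regions found.
    """
    labels, n = ndimage.label(mean_image > threshold)
    return labels, n

def quantify_all_gels(gels, labels, n):
    """Apply the one master label map to every gel, so the same label
    refers to the same region across all gels (no spot matching).

    Returns an array with one row per gel and one column per region,
    holding the integrated intensity of each region.
    """
    index = np.arange(1, n + 1)
    return np.array([ndimage.sum(g, labels, index) for g in gels])
```

The resulting gels-by-regions matrix is exactly the input the per-region statistical tests need, one column of quantifications per protein region.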
REFERENCES
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300.
Cleveland, W. (1993). A model for studying display methods of statistical graphics. J. Comput. Graph. Stat. 2, 323–364.
Conradsen, K., and Pedersen, J. (1992). Analysis of two-dimensional electrophoretic gels. Biometrics 48, 1273–1287.
Efron, B. (2007). Correlation and large-scale simultaneous significance testing. J. Am. Stat. Assoc. 102, 93–103.
Gonzalez, R., and Woods, R. (2002). Digital Image Processing. Prentice Hall, New York.
Li, F. (2007). Empirical Bayes Methods for Proteomics. Ph.D. Dissertation, University of Maryland, Baltimore County.
O’Farrell, P. H. (1975). High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem. 250, 4007–4021.
Otsu, N. (1979). A threshold selection method from gray level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66.
Potra, F., Liu, X., Seillier-Moiseiwitsch, F., Roy, A., Hang, Y., Marten, M. M., and Raman, B. (2006). Protein image alignment via piecewise affine transformations. J. Comput. Biol. 13, 614–630.
Qiu, X., Klebanov, L., and Yakovlev, A. (2005). Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Stat. Appl. Genet. Mol. Biol. 4, 1–13.
Roy, A., Seillier-Moiseiwitsch, F., Lee, K., Hang, Y., Marten, M. M., and Raman, B. (2003). Analyzing two-dimensional gel images. Chance 16, 13–18.
Vincent, L., and Soille, P. (1991). Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell. 13, 583–598.
Wilkins, M. R., Pasquali, C., Appel, R. D., Ou, K., Golaz, O., Sanchez, J., Yan, J. X., Gooley, A. A., Hughes, G., Humphery-Smith, I., Williams, K. L., and Hochstrasser, D. F. (1996). From proteins to proteomes: Large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Nat. Biotechnol. 14, 61–65.
Wilkins, M. R., Appel, R. D., Williams, K. L., and Hochstrasser, D. F. (eds.) (2007). Proteome Research: Concepts, Technology and Application. Springer, Berlin.
Author Index
A Abagyan, R., 557 Abbyad, P., 28 Abe, H., 565 Abodeely, M., 44 Abou-Jaoude´, W., 171, 174–175, 185–186, 204, 208 Ackerman, M. S., 102 Ackermann, H.-W., 515 Acrivos, A., 176 Adalsteinsson, D., 394 Agirrezabala, X., 520 Agutter, P. S., 209 Aihara, K., 193, 207 Akhoon, B. A., 357, 359–360 Akman, O. E., 247–248 Akutsu, T., 336 Albeck, S., 547 Albert, A., 68 Aldridge, B. B., 218 Alex, A., 3, 15 Allan, D. W., 413–414 Allan, K., 410 Allan, W., 410 Allen, D. L., 43 Allen, M. S., 148–149, 254 Allison, S. D., 491 Almeida, J. S., 336 Alonso, A. A., 339 Alon, U., 173, 185–187, 190, 205, 209 Alper, H. E., 76 Althoff, E. A., 547, 576 Altinok, A., 345 Altman, R. B., 78–79, 221 Alvarado, C., 103 Alvarez-Vasquez, F., 341 Alves, R., 335 Amann, H., 157 Amaral, L. A. N., 410–412, 414–416, 418, 425, 428 Amato, N. M., 103 Amzel, L. M., 101 Anderson, D. L., 514 Anderson, D. P., 569 Anderson, K. S., 73, 76, 89, 93, 95 Andersson, K., 15 Andrec, M., 434 Andrew Ban, Y.-E., 545
Andrianantoandro, E., 312 Andricioaei, I., 2, 25 Anfinsen, C. B., 100 Ang, J., 279 Angulo-Brown, F., 409–410 Appel, R. D., 596 Archontis, G., 556 Arkin, A. P., 149, 174, 182, 207, 228, 487 Artesi, M., 68 Arumugam, K., 77, 84 Ascher, U. M., 487, 500–501 Ashburner, M., 134 Ashyraliyev, M., 338 A˚stro¨m, K. J., 280, 292, 307 Auffray, C., 361 Ausio, J., 102 Austin, D. W., 254 Av-Ron, E., 174, 208 B Bachmann, J., 343 Bae, K., 43 Baevsky, R., 428 Bahar, I., 105 Baiocchi, D., 348 Bajzer, Z., 25 Baker, D., 105, 545–548, 552–553, 570–571 Balakrishnan, R., 577 Baldwin, R. L., 102 Ballarini, P., 233, 248 Balsa-Canto, E., 335–336, 339 Baltrusch, S., 361 Banavali, N., 539 Banga, J. R., 301, 335–336, 339, 352 Bapat, R. B., 152–153, 156, 160 Barbas, C. F., 547 Barbero, J., 548, 571 Bargiello, T. A., 42 Barik, D., 182 Barkai, N., 174 Barkley, M. D., 26–27 Barta, A., 74 Bartels, C., 556 Bartol, T. M., 503 Barysz, M., 15 Basche´, T., 441 Bashford, D., 561 Basu, S., 312
612 Battistuta, R., 586 Baumann, W. T., 182, 208 Baumgartner, B. L., 281, 297, 307, 309, 311–313 Baxter, D. A., 45, 48, 50, 174, 208 Baxter, J. D., 467 Baylies, M. K., 42 Bayliss, L. E., 305 Beausang, J. F., 431, 433–434, 439, 447 Beck, B., 3, 15 Becker, O. M., 76 Becker-Weimann, S., 204 Beckwith, J., 387 Beebe, D. J., 281 Beechem, J. M., 4 Beenen, M., 548, 571 Beierlein, F. R., 3, 6, 25 Bella, J., 584–585 Bell, E., 63 Benjamini, Y., 599 Bennett, M. R., 281, 297, 307, 309, 311–313 Ben-Shaul, A., 518 Berendsen, H. J. C., 521 Berezovsky, I. N., 579 Berg, H. C., 281, 297, 306–309, 313, 488, 496 Berg, O. G., 209 Berman, A., 152–153, 156 Berman, H. M., 103, 516 Bernaola-Galva´n, P., 410–412, 415 Bernhardsson, A., 15 Berrondo, M., 545 Berry, R. S., 207–208 Bersani, A. M., 150, 177, 182 Betker, J. L., 547 Bevington, P. R., 450 Beyenal, H., 489 Bhalerao, K. D., 73, 89 Bhat, D., 547 Bhat, T. N., 103 Bhola, P., 233 Biham, O., 193 Birshtein, T. M., 102 Birtwistle, M. R., 348, 489 Blackford, J. A. Jr., 468, 470–475, 480–481 Blair, K. B., 76 Blake, W. J., 173, 254, 312 Blau, H. M., 186, 205 Blau, J., 40, 44–45 Bleisteiner, B., 3 Blinov, M. L., 497 Block, S. M., 281, 297, 306–309 Blomberg, M. R. A., 15 Blom, J. G., 338 Bloomfield, V. A., 518 Boehr, D. D., 102 Bolouri, H., 228 Bona, J. L., 42, 46, 49, 58–61, 63–66 Bonneau, R., 547, 571 Boresch, S., 556
Borghans, J. A. M., 177 Bornstein, B. J., 228 Bortz, A. B., 487 Bourne, P. E., 103 Bours, V., 68 Bowen, J. R., 176 Boxer, S. G., 28 Boyce, W. E., 292 Boyd, R. H., 102–103 Boz, M. B., 513, 524–525 Bracken, C., 102 Bradley, P., 545, 547, 570–571 Brand, L., 4, 24 Braun, W., 565 Bray, D., 228 Breaker, R., 74 Bredel, M., 68 Breitling, R., 148 Brenowitz, M., 77 Brent, R. P., 41–42, 68, 565 Briggs, G. E., 176–177 Brigham, K. L., 361 Brodsky, B., 584–585 Bromley, E. H., 584 Brooks, B. R. III., 556 Brooks, C. L., 175, 556 Brooks, S. P., 175 Broos, J., 14–15, 28 Brovelli, A., 135 Brown, C. J., 102 Brown, M. P., 166 Brown, P. E., 247 Bruck, J., 315 Brugnano, L., 157 Bryngelson, J. D., 101 Buchsbaum, A. M., 40, 45 Bullock, A. N., 547 Buma, W. J., 15, 28 Bunde, A., 426, 428 Burant, J. C., 15 Burdick, J. W., 102 Burgess, B. K., 15, 21 Burke, A., 53 Burke, J. M., 218 Burke, P., 489 Burkhardt, F., 15 Burrage, K., 150, 177, 182 Burrage, P., 150 Butte, A. J., 134 Byrne, H. M., 67 Byrne, J. H., 45, 48, 174, 208 Byrn, J. H., 48, 50 C Caflisch, A., 556 Cagatay, T., 209 Cagney, G., 403
Cai, L., 166 Cai, X., 208 Califice, S., 68 Callender, D., 547 Callis, P. R., 1, 3–4, 7, 13–15, 21–24, 26–28 Campen, A. M., 102 Canters, G. W., 14 Cantor, C. R., 174, 186, 198–199, 206, 210, 254 Canutescu, A. A., 103, 570 Cao, Y., 150, 182 Capuani, F., 177 Caradonna, J. P., 553 Cardelli, L., 220, 224, 228 Carlson, G., 324 Caron, E., 220, 228 Carpena, P., 411 Cartwright, H. M., 403 Caruana, R., 399 Cascante, M., 350–352, 354–355 Case, D. A., 528, 561 Castagna, G., 228 Castelnovo, M., 522 Caves, L. S. D., 76, 556 Cech, T., 74 Ceriani, M. F., 44 Chan, D. C., 356 Chandler, D., 14 Chandrasekhar, J., 15 Chan, H. S., 102 Chan, K., 346 Chan, V. C., 586 Cha, S., 177 Chatterjee, A., 503, 505 Chatterjee, S., 135 Chaturvedi, S., 489 Chaudhury, S., 547, 571 Chaulk, S., 77 Chavali, A. K., 334 Cheeseman, J. R., 15 Chemla, D. S., 441 Cheng, F., 357 Cheng, G., 547 Chen, J., 434, 474 Chen, K. C., 291, 322 Chen, L. X. Q., 25–26, 193, 207 Chen, R., 63, 102 Chen, S.-J., 74, 102 Childs, W., 28 Chin, D. N., 76 Chirikjian, G. S., 99–100, 102–105, 107–108, 111, 113, 115, 123–124 Chiu, W., 514 Chivian, D., 547, 571 Choi, P. J., 166 Cho, S., 473–475, 480 Chou, I., 326, 338 Chow, C. C., 465, 468–474, 480–481 Chowdry, A. B., 553
Chuang, K. Y., 43 Chun, H. M., 76 Cieplak, P., 79 Ciliberto, A., 53, 177, 207 Ciocchetta, F., 220, 225, 228, 247–248 Clark, A., 242 Clarke, E. M., 220 Clark, T., 3, 6, 15, 25 Cleary, S., 563 Clemente, F. R., 547 Clementi, C., 103 Cleveland, W., 603 Cluett, W. R., 307 Cohen, B. E., 28 Cohen, I. R., 227 Cohn, M., 389 Collins, J. J., 173–174, 186, 198–199, 206, 210, 254, 291, 312 Collins, S. D., 503 Conrad, E. D., 312 Conradsen, K., 598 Cooke, J. F., 148–149 Cooper, D. L., 15 Cooper, S., 545, 548, 571 Cornish-Bowden, A., 173, 175–177, 208, 228 Corn, J. E., 545 Corrie, J. E. T., 439, 447 Cortese, M. S., 102 Cory, M. G., 14 Costa, M. N., 485, 503–504 Coutsias, E. A., 570 Cova, S., 2, 25 Cowart, L. A., 341 Cox, C. D., 148–149, 254 Cox, E. C., 166 Cramer, C. J., 15 Crick, F. H., 584 Crippen, G. M., 102 Crouzy, S., 77, 84 Crozier, P. S., 76 Cuellar, A. A., 228 Cui, Q., 528, 531 Curto, R., 350–352, 354–355 Cyran, S. A., 40, 45 D Dahiyat, B. I., 553, 576 Dahms, T. E. S., 25 Danos, V., 220, 222 Dantas, G., 547, 576, 581 Dapprich, S., 15 D’Aquino, J. A., 101 D’Ari, R., 172–173, 204 Darlington, T. K., 44 Dar, R. D., 148–149, 254 Das, P., 103 Das, R., 545–547, 553
Dauwalder, B., 44 Davey, R. E., 233 Davidson, F. A., 182, 205 Davis, I. W., 545, 547 de Atauri, P., 351 De Boer, R. J., 177 DeChancie, J., 547 Degasperi, A., 247–248 de Gennes, P. G., 103 de Graaf, C., 15 de Groot, M., 15, 28 Delahaye-Brown, A. M., 40, 43 Delp, S. L., 78 DeLuca, S. L., 546 De Maeyer, M., 553 Demin, O. V., 227 Deprez, M., 68 des Cloizeaux, J., 103 Desmet, J., 553 Devkota, B., 513, 517, 524, 527, 539–540 de Waal, E., 14 Dill, K. A., 74, 102–103, 570 Dimaio, F., 547 Ding, M., 411 Dinola, A., 521 DiPrima, R. C., 292 Diraviyam, K., 14 Dodd, I. B., 166 D’Odorico, P., 394 Doi, M., 103 Dokland, T., 519 Domeniconi, C., 402 Donaldson, R., 222 Donder-Lardeux, C., 15 Donzel, B., 25 Dor, Y., 227 Dougherty, E. J., 465 Doyle, J. C., 228 Doyle, L., 547 Dronov, S., 227–228 Dubey, A., 103 DuBow, M. S., 515 Dudek, S. M., 45 Dudka, A., 220, 228–229, 233, 235, 242–243, 246 Duguid, A., 240 Dunbrack, R. L. Jr., 103, 570 Dunker, A. K., 102 Dunwiddie, C. T., 320 Dussmann, H., 341 Dym, O., 547 E Eaton, M., 584 Eccher, C., 228 Eddy, J. A., 334 Edery, I., 43–44, 64
Edwards, A., 450 Edwards, S. F., 103 Efron, B., 599 Eftink, M. R., 4 Egan, J. B., 166 Eichner, J. F., 426 Eisenberg, D., 581 Eisen, M. B., 134–135 Elaydi, S., 154–156 Elf, J., 508 Ellner, S. P., 52 Elowitz, M. B., 136, 149, 174, 186, 190–191, 205–206, 209–210, 254 Elston, T. C., 254, 394 Emery, P., 43 Enderle, T., 441 Endy, D., 41–42, 68 Engh, R. A., 25 Evilevitch, A., 522, 524 F Fallahi-Sichani, M., 503 Fang, Q. J., 105 Fathallah-Shaykh, H. M., 39, 42, 46, 49, 58–61, 63–68 Featherstone, R., 89 Fender, B. J., 23 Feng, Z., 103 Fengzhong, W., 411 Ferrell, J. E., 205 Fiebig, K. M., 102 Fieire, E., 101 Fietzek, P. P., 586 Finney, A., 228, 325 Finn, P. W., 103 Fiori, S., 586 Fisher, J., 219, 227, 237 Fitzgerald, J. B., 324, 349–350 Fitzkee, N. S., 102, 116 Flach, E. H., 205 Fleishman, S. J., 545 Fleming, G. R., 21, 25 Flocard, J., 77, 84 Florescu, A. M., 209 Flores, S. C., 73, 78–79 Flory, P. J., 103, 115 Foloppe, N., 539 Fomekong-Nanfack, Y., 338 Forger, D. B., 208 Forkey, J. N., 433, 439, 441, 447, 459 Fo¨rster, Th., 23 Frank, J., 522 Frederick, K. K., 102 Freidman, J. H., 400, 402 Frieda, K., 166 Friedman, N., 134–135 Frisch, M. J., 15
Fritsch-Yelle, J., 428 Fromm, H. J., 469, 471 Fujita, A., 135 Fukuda, K. H., 412 Fuller, C. A., 42 Fuller, D. N., 519 Fuller, S. D., 517 Fu¨lscher, M. P., 15 Funahashi, A., 227 G Galas, D., 324 Galitski, T., 291 Gallaher, J. L., 547 Gammaitoni, L., 394 Garcia-Ojalvo, J., 209 Gardiner, C. W., 174 Gardner, T. S., 134, 174, 186, 198–199, 206, 210 Gauba, V., 586–587 Gaudechon, P., 25 Gedeck, P., 3, 15 Gehart, J. C., 254 Gehlen, J. N., 14 Gekakis, N., 40, 43–44 Gelbart, W. M., 514, 518, 524 Gel’fand, I. M., 102 Ge´rard, C., 207 Ghazal, P., 227 Gibney, B. R., 578 Gibson, M. A., 315 Gilbert, D., 222 Gillespie, D. T., 49–50, 149–151, 174, 208, 211, 221, 255, 268, 315, 384–386, 486–487, 500, 503 Gilliland, G., 103 Gilmore, S., 240, 242 Girolami, M. A., 219 Glab, K., 547 Glass, J., 441 Glass, L., 414 Glendinning, P., 313 Glossop, N. R., 40, 44–45 Goel, G., 326 Golaz, O., 596 Goldberger, A. L., 410, 412–414, 416, 418, 425, 428 Goldbeter, A., 42–43, 45, 47–48, 50–51, 173–175, 183, 185, 191, 206–208, 345, 466, 480 Golding, I., 166 Goldman, R., 103 Goldman, Y. E., 431, 433, 439, 441, 447, 459 Golub, G. H., 158 Gomez, J., 101 Go´mez-Uribe, C., 281, 297, 303, 307, 309–313 Go, N., 565 Gong, H. P., 102
Gonza´lez-Alco´n, C., 350–352 Gonzalez, R., 600, 602 Gonze, D., 47–48, 50, 171, 174, 183, 191, 206, 208 Gooley, A. A., 596 Gordon, H. L., 25 Goryanin, I., 227 Goutsias, J., 182 Granger, C. W. J., 135, 141 Granoff, A., 515 Grant, R. A., 584 Gray, J. J., 545, 547, 571 Grayson, P., 525 Gregoriou, G. G., 135 Gregory, P. C., 481 Griffin, R. G., 103 Grigoryan, G., 584 Grillo, A. O., 166 Grilly, C., 313 Grima, R., 182 Groetsch, C. W., 158 Gronenborn, A. M., 24 Grosberg, A. Yu., 103 Grosset, A. M., 578 Grosshans, C., 74 Grossman, A. D., 209, 254 Grumberg, O., 220 Grundy, F., 74 Guckenheimer, J., 52 Guerriero, M. L., 217, 220, 228–229, 233, 235, 237, 240, 242–243, 246–248 Guicherit, O. M., 186, 205 Gunawan, R., 338–339 Gunopulos, D., 402 Guo, F., 74 Guo, S., 135, 144 Gupta, A., 434 Gupta, S. K., 319, 357, 359–360 Gutin, A. M., 582 Gu, W., 175 Guzma´n-Vargas, L., 409–411, 415 H Haak, J. R., 521 Hager, G. L., 469 Hagerman, P., 516 Hagiwara, M., 45, 48 Hahn, D. K., 23 Hahn, J., 233 Haile, J., 74 Hains, M. D., 282 Hairer, E., 76 Haldane, J. B., 176–177 Halle, B., 25 Hall, J. C., 42–43 Halloy, J., 47–48, 50, 171, 174, 183, 191, 206, 208
616 Halperin, D., 103 Hamilton, J. D., 136, 140 Hamm, H. E., 282 Handel, T. M., 585 Hanes, M. S., 553 Haney, D. N., 76 Hanggi, P., 394 Hang, Y., 596–597 Hannun, Y. A., 341 Hao, H., 43 Haranczyk, M., 2 Harbury, P. B., 577–579, 583 Hardin, P. E., 40, 42–45, 48, 64 Harel, D., 227 Hartgerink, J. D., 586–587 Harvey, S. C., 513–515, 517–518, 520–521, 524–525, 527–528, 533 Hassanali, A. A. P., 24–25 Hassett, B., 103 Hasty, J., 208, 281, 291, 297, 307, 309, 311–313 Ha, T., 434, 441 Hausdorff, J., 410, 412, 414, 425 Havlin, S. H., 410–414, 425–426, 428 Havranek, J. J., 545, 583 Hayden, C., 441 Haykin, S., 280, 297 Hazes, B., 553 Heath, J. K., 217–220, 228–229, 233, 235, 242–243, 246 Hecht, M. H., 577 Heijnen, J. J., 335 Heiner, M., 222 Heino, J., 585 Hellinga, H. W., 553 Henderson, R., 443–444, 449, 459 Hendsch, Z. S., 579 Henkin, T., 74 Henzinger, T. A., 219, 237 Herna´ndez-Bermejo, B., 335 Herna´ndez-Pe´rez, R., 409 Hersen, P., 281, 296–297, 307, 309–311, 313 Herzel, H., 204, 288 Hesp, B. H., 14–15, 28 Hess, B. A., 15 He, Y., 473, 480 Higham, D. J., 151 Hillston, J., 220, 225, 240 Hilser, V. J., 101 Hilvert, D., 2, 547 Hinze, G., 441 Hipps, K. W., 102 Hnadel, T. M., 553 Hochberg, Y., 599 Hochstrasser, D. F., 596 Hoffhines, A., 320 Hoffmann, A., 208 Hohwy, M., 103 Holmes, D. F., 584
Hood, L., 291, 324 Hopfield, J. J., 23 Horibata, K., 389 Horike, D., 386, 388 Horn, A., 15 Ho, T. C., 498 Houk, K. N., 547 Houl, J. H., 41, 44–46, 63–64 Hsu, C. P., 186 Hsu, D., 103 Huang, E., 547 Huang, Y., 474 Hubbard, E. J., 227 Huber, H. J., 341 Hucka, M., 228, 325 Hud, N. V., 518 Hudson, B. S., 25 Hughes, G., 596 Humphery-Smith, I., 596 Hunter-Ensor, M., 43 Huq, E., 313 Huse, B., 474 Huston, J. M., 25 Hutcheson, R. M., 7 Hutter, M., 15 Hwang, M. J., 186 I Iakoucheva, L. M., 102 Ibrahim, S., 341 Ideker, T., 291 Iglesias, P. A., 305 Ingalls, B. P., 279, 305 Isaacs, F., 291, 312 Isaacson, S. A., 508 Ishiwata, S., 441 Ising, E., 503 Isorce, M., 77, 84 Itsukaichi, T., 43 Ivanov, P. C., 410–412, 414–416, 418, 425, 428 J Jacak, R., 545 Jackson, F. R., 42 Jackson, J. B., 14 Jacobson, M. P., 570 Janes, K. A., 218 Jannink, G., 103 Jaramillo, A., 585 Jardine, P. J., 514 Jaroniec, C. P., 103 Jarrell, H. C., 25 Jayaraman, A., 233 Jeembaeva, M., 522, 525 Jernigan, R. L., 103, 105 Jiang, L., 547, 576
Jiang, W., 520–521 Jimenez, R., 21 Joachimiak, L. A., 547 Johnson, J. E., 514 Joo, C., 434 Jouvet, C., 15 Joyeux, M., 209 Judd, E. M., 489 Jung, P., 394 K Kaandorp, J. A., 338 Kadener, A., 63 Kadener, S., 40, 42, 46, 49, 58–61, 63–66 Kadler, K. E., 584 Kaern, M., 136, 173, 254 Kagan, B. L., 468, 470–473, 480–481 Kahramanogullari, O., 220, 228 Kalisch, S. M., 209 Kallenbach, N. R., 582 Kalluri, R., 584 Kamerlin, S. C. L., 2 Kampen, V., 254 Kamtekar, S., 577 Kaneko, M., 43 Kane, T. R., 95 Kantelhardt, J. W., 426 Kao, Y. T., 24–25 Karanicolas, J., 545, 547, 553 Karig, D. K., 148–149, 312 Karlov, V. I., 76 Karlstro¨m, G., 15 Karnchanaphanurach, P., 2, 25 Kærn, M., 312 KarNovak, B., 53 Karplus, M., 2, 25, 102, 105 Kar, S., 53, 207–208 Kartaschoff, P., 413 Kartha, G., 584 Kasukawa, T., 41, 46, 63–64 Kato, M., 2 Kaufman, K., 545 Kaufman, M., 174–175, 185–186, 204, 208 Kaufmann, K. W., 546–547 Kaul, S., 473–475, 480 Kavraki, L. E., 103 Kay, L. E., 102 Kay, S. A., 44 Kazanci, C., 371, 394 Kazerounian, K., 103 Kazuko, Y., 411 Kearney, R. E., 303, 306 Keizer, J., 254, 257–262, 264 Keller, A. D., 173, 185–186 Kellogg, E., 547 Kemper, P., 242 Kent, O., 77
Khalil, A. S., 312 Khalili, M., 74 Khammash, M., 150 Khare, S., 547 Khatib, F., 548, 571 Khersonsky, O., 547 Khokhlov, A. R., 103 Kholodenko, B. N., 489 Khoo, M. C. K., 305 Khoshnoodi, J., 585 Kierdaszuk, B., 25 Kim, D. E., 545, 547, 571 Kim, E. Y., 44, 64 Kim, H., 441 Kim, J. S., 103, 115 Kim, M. K., 103 Kimple, R. J., 282 Kim, P. S., 21 Kim, Y., 468–469, 473–474, 480 Kinch, L., 547 Kindt, J. T., 518 King, J. T., 158 Kinosita, K., 441 Kirkpatrick, B., 103 Kirschner, M. W., 254 Kitano, H., 40–41, 45, 48, 68, 134, 227–228, 282–283, 291, 293, 313, 315 Klebaner, F. C., 262, 267 Klebanov, L., 599 Klenin, K. V., 209 Klimyk, A. U., 102, 112 Klingmu¨ller, U., 233 Kloss, B., 44 Knobler, C. M., 514, 524 Kobayashi, T. J., 193, 207 Koca, J., 77 Kohn, K. W., 227 Ko, H. W., 44, 64 Kollman, P. A., 79 Kooperberg, C., 547 Kopelman, R., 209 Korswagen, H. C., 134 Kortemme, T., 105, 545, 547, 570 Koshland, D. E. J., 466, 480 Kramer, A., 204 Kraus, M., 174, 208 Kreft, J. U., 506 Kremer, K., 76 Kringstein, A. M., 186, 205 Kryschi, C., 3 Kugler, H., 227 Kuhara, S., 336 Kuhlman, B., 545, 547, 552, 567, 576, 581 Kulasiri, D., 45, 49, 56, 60, 253, 255, 269 Kumar, P. V., 21 Kunz, M., 341 Kurtser, I., 209, 254 Kuznetsov, D. N., 557
Kwiatkowska, M. Z., 218, 220, 248 Kyatkin, A. B., 102–103, 107–108, 111 L Laederach, A., 73, 77 Lahav, G., 322 Laio, F., 394 Lais, P., 174, 208 Lai, X., 343, 346, 349 Lakowicz, J. R., 4, 25–26 Lamouroux, A., 41, 46 Lander, G. C., 519–520 Laneve, C., 220, 222 Lange, O. F., 545, 547 Langowski, J., 209 Lang, X., 47, 49–50 Lanig, H., 3, 6, 25 Larjo, A., 227 Larsson, F., 522 Lasters, I., 553 Latif, K., 103 Latombe, J. C., 103 Lauffenburger, D. A., 218, 467 Laughton, C. A., 356 Lavalle, S. M., 103 Lavelle, L., 524 Lavery, R., 74 Lawson, J. D., 102 Laws, W. R., 4 Lazaridis, T., 105, 582 Leach, A. R., 74 Leaver-Fay, A., 545, 548, 567, 571 Lebrun, A., 74 Lee, A., 5 Lee, B., 518 Lee, C., 43, 63 Lee, I., 324 Lee, J., 548, 571 Lee, K. H., 101, 596–597 Lee, S., 100, 103 Leibler, S., 174, 186, 190–191, 206, 210 Leise, T., 48 Leloup, J.-C., 43, 48, 50, 174–175, 185, 208 Lemerle, C., 489 Lemmon, G. H., 546 Le Nove`re, N., 325 Leontis, N., 77 Le´vi, F., 345 Levin, D. A., 50 Levine, A. J., 149, 174, 209, 254 Levine, D., 575 Levinson, D. A., 95 Levitt, M., 102 Levy, R. M., 434 Lewin, B., 173, 378 Lewis, S. M., 545 Liang, J., 102
Liebal, U. W., 346, 349 Liebovitch, L. S., 411 Li, F., 595, 598 Li, J., 15 Lim-Hing, K., 521 Lim, H. N., 205, 210, 389, 394 Linderman, J. J., 467, 503 Lindh, R., 15 Lin, M. C., 40, 45, 77, 84, 102 Lin, S., 74 Lipan, O., 297, 308, 312–313 Li, Q., 47, 49–50 Li, T. P., 24–25 Liu, H. B., 2 Liu, J. L., 182, 205 Liu, L., 102 Liu, T. Q., 3, 7, 15, 22, 28 Liu, W. E. D., 150 Liu, X., 597 Li, W., 357 Liwo, A., 74 Li, Z., 102 Ljung, L., 281, 303 Lo, B. S., 45, 48 Locker, C. R., 515–517, 521 Lockhart, D. J., 21 Loeb, J. N., 468 Loewe, L., 240, 248 Loinger, A., 193 Lomholt, M. A., 209 Longo, D. M., 208 Loring, R. A., 3 Lotan, I., 103 Louie, T.-M., 2, 25 Lozano-Perez, T., 103 Lumsden, C. J., 488 Luo, G. B., 2, 25 Luo, M., 474 Lutkepohl, H., 136, 140 Lyons, L. C., 44 Lyskov, S., 545, 547, 571 M Machleder, E. M., 205 MacKerell, A. D. Jr., 539, 556 Mackey, M. C., 161, 386–388 MacMillan, A., 77 Macnamara, S., 150, 177, 182 Maddalena, F., 28 Mahadevan, L., 281, 296, 307, 309–311, 313 Mahdavi, A., 233 Maia, M. A. G. M., 336 Maini, P. K., 53, 67, 177 Major, F., 74, 535 Makse, H. A., 425 Malhotra, A., 527–528, 531, 534 ˚ ., 15 Malmqvist, P.-A
Malmstrom, L., 547, 571 Malone, P. C., 209 Mandell, D. J., 545, 547, 570 Mandziuk, M., 76 Manocha, D., 103 Marchesoni, F., 394 Marchi, M., 14 Marcus, R. A., 2 Marı´n-Sanguino, A., 319, 331, 350–352 Mark, R., 414 Marlow, M. S., 102 Maroncelli, M., 21, 26 Marshall, S. A., 581 Marten, M. M., 596–597 Martin, B., 15 Martin, D. H., 68 Martinez, T. J., 3 Matsumoto, A., 41, 46, 63–64 Mattice, W. L., 103 Mavroidis, C., 103 Mayo, S. L., 553, 576, 581 May, R., 210 McAdams, H. H., 149, 174, 207, 487, 489 McAnaney, T. B., 28 McCarthy, D. D., 413 McClean, M. N., 281, 296–297, 307, 309–311, 313 McCollum, J. M., 148–149, 254 McCudden, C. R., 282 McDonald, M., 40, 46, 63 McKinney, S. A., 434 McMahon, M. T., 103 McMillen, D. R., 291, 312, 394 Mcnally, J. G., 469 McQuarrie, D. A., 207–208 Meadow, N. D., 24 Meech, S. R., 5 Meghan, A., 53 Meiler, J., 545–547 Mendes, P., 301 Mendoza, E. R., 331 Menet, J. S., 63 Mensing, G. A., 281 Menten, M. L., 52, 177 Mentzer, S., 545 Merlitz, H., 209 Messer, B., 2 Mettetal, J. T., 281, 297, 303, 307, 309–313 Metzler, R., 209 Meyer, C. D., 154, 156, 160 Michaelis, L., 52, 177 Michalet, X., 433 Michard-Vanhee, C., 41, 46 Mietus, J., 412, 414 Mi, H., 325 Millam, J. M., 15 Millar, A. J., 247 Miller, C. J., 137, 140
Miller, G. W., 324, 343, 345–346 Miller, W. Jr., 102 Milner, R., 223–224 Miner, J., 320 Minlos, R. A., 102 Minton, A. P., 209 Misura, K. M., 547 Mitra, S., 77 Miyano, S., 336 Miyata, H., 441 Moin, E., 48 Moles, C. G., 301 Molineux, I. J., 525 Moll, M., 103 Montgomery, J. A., 15 Moody, G., 414 Moore-Ede, M. C., 42 Moorman, J. R., 400–403 Morin, S., 77, 84 Morohashi, M., 227 Morozov, A. V., 105 Motwani, R., 103 Moughon, S., 547 Moult, J., 105 Mueller, R., 547 Muı´n˜o, P. L., 3, 21–22, 26–27 Mukherjee, R. M., 76, 89, 93, 95 Mukhopadhyay, N. D., 135 Mu¨ller, T. G., 233 Muller, W. G., 469 Munoz-Diosdado, A., 410 Munsky, B., 150 Murialdo, H., 519 Murphy, P., 571 Murray, J. D., 52–53, 58, 67 Murray, R. M., 280, 292, 307 Muzzey, D., 281, 297, 303, 307, 309–313 Myers, M. P., 40, 43 Mytelka,D. S., 320 N Nagaich, A. K., 469 Nagarajan, R., 133, 135, 140 Nagle, R., 76 Nakajima, T., 15 Nanda, V., 575 Narang, A., 186 Nautiyal, S., 582 Nawathean, P., 40, 46, 63 Nayak, S., 281, 297, 307, 309, 311–313 Nelson, P. C., 431 Newell, G. F., 411, 426 Newton, C. D., 102 Newton, M., 14 Ngarajan, R., 134 Ng, F. S., 45 Nguyen-Khac, M. T., 68
Niculescu-Mizil, A., 399 Nielsen, U. B., 324 Nielson, F., 225 Nielson, H. R., 225 Nikerel, I. E., 335 Nikolov, S., 343, 346, 349 Nilsson, L., 25, 556 Nissen, M. S., 102 Nocedal, J., 565 Noe´, M., 443 Noguera, D. R., 489, 501–502 Noguti, T., 565 Norman, G., 220 Novak, B., 207, 291, 322 Novick, A., 389 Novotny, J., 105 Nussinov, R., 102 Nyberg, A., 76 O Obradovic, Z., 102 O’Donnell, A. G., 505 O’Farrell, P. H., 596 Ogletree, D. F., 441 Ogryzko, V. V., 474 Oh, J. S., 102 Okamoto, Y., 341 Okamura, H., 175, 288 Okoniewski, M. J., 137, 140 Oldfield, C. J., 102 Oldham, W. M., 282 Olsson, M. H. M., 2 Ong, K. M., 465, 468, 470–473, 480–481 Onuchic, J. H., 101 Onufriev, A., 561 Oppenheim, A. K., 176 O’Shea, E. K., 174, 254 Ostroff, N. A., 281, 297, 307, 309, 311–313 Othersen, O. G., 3, 6, 25 Otsu, N., 601 Ottl, J., 586 Ouattara, D. A., 171, 174–175, 185–186, 204, 208 Oudenaarden, A., 389, 394 Ou, K., 596 Ousley, A., 43 Overmars, M. H., 103 Owen, A. B., 443 Owen, M. R., 67 Ozbudak, E., 389, 394 Ozbudak, E. M., 205, 209–210, 254 Ozcelik, S., 489 Ozkan, S. B., 74 P Pace, E. A., 349–350 Padilla, C. E., 76
Pahle, J., 211 Paliy, K., 227 Palmer, A. G. III., 102 Palsson, B. O., 331 Pan, C.-P., 26–27 Pande, V. S., 78, 102 Pang, W. L., 281, 297, 307, 309, 311–313 Panina, E. M., 220, 224 Papin, C., 41, 46 Papin, J. A., 334 Pappu, R. V., 102, 104, 115 Parikh, V., 43 Parisien, M., 74, 535 Parker, D., 220 Parsegian, V. A., 517–518 Pasquali, C., 596 Patriciu, A., 104 Paul, M. R., 182, 208 Paul, S. M., 320 Paulsson, J., 149, 166 Pa˘un, G., 220, 222 Pearlman, D. A., 531 Pedchenko, V., 585 Pedersen, J., 598 Pedraza, J. M., 149, 254 Pe’er, D., 134–135 Pei, J., 547 Peled, D., 220 Peleg, M., 221 Peng, C.-K., 410, 412–414, 425 Peng, J., 402 Peng, K., 102 Peng, X., 150 Peon, 26 Peres, Y., 50 Perival, V., 41, 50, 59, 68 Persikov, A. V., 585–586 Perun, S., 15 Peskin, C. S., 76, 208, 508 Petrella, R. J., 556 Petrenko, A., 3, 22, 26 Petrov, A. S., 513–515, 517–518, 520–522, 524–525 Pettersson, G., 175 Pettigrew, M. F., 494, 503 Petzold, L. R., 150, 385, 487, 500–501 Pfeifer, A. C., 343 Phillips, A., 228 Phillips, D., 5 Phillips, G. N. Jr., 103 Phillips, P. J., 102–103 Picioreanu, C., 491 Pierce, N. A., 552 Pilegaard, H., 225 Piterman, N., 227 Plecs, J. J., 577 Plemmons, R. J., 152–153, 156 Plimpton, S. J., 76, 531
Pokala, N., 553, 585 Pommier, Y. G., 468–469, 473–474, 480 Ponder, J. W., 552 Popovic´, Z., 545, 548, 571 Postma, J. P. M., 521 Potra, F., 597 Potts, R. B., 503 Poursina, M., 73, 89, 95 Prakash, M. K., 2 Praprotnik, M., 76 Prehn, J. H., 341 Prendergast, F. G., 4, 25 Priami, C., 219–220, 224–225, 228–229, 233, 235, 242–243, 246 Price, J. L., 44 Ptashne, M., 173 Ptitsyn, O. B., 102 Purnick, P. E. M., 312 Purohit, P. K., 515 Q Qian, B., 547 Qiu, X., 599 Qi, Z., 324, 343, 345–346 Quaglia, P., 220, 225, 228, 233 Quail, P. H., 313 Queener, S. F., 356 Quinlan, M. E., 433, 439, 441, 447, 459 R Raatz, Y., 341 Radivojac, P., 102 Raghavan, T. E. S., 152–153, 156, 160 Raj, A., 174 Ramachandran, G. N., 115, 584 Ramakrishnan, C., 115 Ramanathan, S., 281, 296–297, 307, 309–311, 313 Raman, B., 596–597 Raman, S., 547 Ramos, J., 582 Ramshaw, J. A., 586 Rand, D. A., 247 Rand, R. P., 517 Rao, C. V., 182 Raser, J. M., 174, 254 Raspaud, E., 524 Rath, O., 335–336 Ratliff, C. R., 102 Ratze, C., 505 Rau, D. C., 517–518 Raue, A., 498 Rauhut, G., 15 Raychaudhuri, S., 102 Rayner, D. M., 25 Ra´zga, F., 77
Re´blova, K., 77 Rech, I., 2, 25 Reddy, K. L., 40, 45 Redon, S., 77, 84 Regev, A., 219–220, 223–224 Rehm, M., 341 Reif, B., 103 Reinke, A. W., 584 Reisig, W., 220–221 Renfrew, P. D., 545 Resat, H., 485–486, 489, 491, 493–494, 501, 503, 506, 508 Reyes-Ramı´rez, I., 409, 411, 415 Reynolds, K. A., 553 Rhee, Y. M., 102 Rice, S. A., 207–208 Rich, A., 584 Richards, F. M., 552–553 Richier, B., 41, 46 Richman, J. S., 397, 400–403 Richter, F., 545 Ridley, J., 14 Ridolfi, L., 394 Rienstra, C. M., 103 Robb, M. A., 15 Robe, P. A., 68 Roberson, R. E., 91, 95 Robertson, T., 547, 571 Robeva, R., 148 Robinson, D. K., 450 Roca, M., 2 Rodriguez, K., 103 Rohl, C. A., 547, 571 Rollins, G. C., 515, 517, 525 Romero, P., 102 Roos, B. O., 15 Rorner, P. R., 102 Rosa, D., 225 Rosbash, M., 40, 42, 46, 63 Rose, G. D., 102, 115–116 Rosenberg, S. A., 433, 439 Rosenblatt, M. M., 411, 426, 582, 587 Rosenblum, M., 410, 414 Rosenfeld, N., 205, 209, 322 Roshbash, M., 43 Rossi, F. M., 186, 205 Rossi, R., 77, 84 Ross, J. B. A., 4, 149, 207–208 Rothenfluh, A., 44 Rothlisberger, D., 547 Rotkiewicz, P., 538–539 Roux, B., 556 Rouyer, F., 41, 46 Rowland, M., 341 Roy, A., 596–597 Royer, C. A., 166 Ruczinski, I., 547 Rusconi, S., 474
Russell, R., 78–79 Rusu, C. F., 3 S Sabouri-Ghomi, M., 53, 207 Sacca, B., 586 Sachs, K., 134–135 Sadreyev, R., 547 Saeki, M., 308 Saez, L., 40, 43–44 Saito, K., 308 Salem, G., 586 Saltelli, A., 346 Samatova, N. F., 254 Sanchez, J., 596 Sandra, O., 233 Santilla´n, M., 147, 150, 161–162, 164, 386–388 Sase, I., 441 Sasisekharan, V., 115 Sasson-Corsi, P., 44 Sauer, W., 15 Sauro, H. M., 228 Savageau, M. A., 209, 335–336 Savtchenko, R. S., 24 Saxena, A., 14 Sayler, G. S., 254 Scheibe, T. D., 502 Scheraga, H. A., 74 Schiffer, J. M., 577 Schindler, T., 15 Schlegel, H. B., 15 Schleif, R., 378 Schlick, T., 76 Schmidt, J. P., 78 Schmitt, D. T., 410 Schneider, S., 3, 6, 25 Schnell, S., 177, 205 Schoeberl, B., 324, 349–350 Schoer, R., 63 Schonbrun, J., 547 Schroeder, R., 74 Schueler-Furman, O., 547 Schuler, A. D., 547 Schulten, K., 14 Schultz, J., 341 Schwartz, M., 425 Schwarzer, F., 103 Schwertassek, R., 91, 95 Scott, E. M., 346 Scott, S. K., 382 Scribner, E. Y., 39 Scuseria, G. E., 15 Seber, G. A. F., 154, 156, 291 Seeman, N. C., 582 Segall, J. E., 281, 297, 306–309 Segel, I. H., 173, 175, 185, 469, 471, 475 Segel, L. A., 175–177
Sehgal, A., 40, 43 Seidelmann, P. K., 413 Seillier-Moiseiwitsch, F., 595–597 Selkov, A., 227 Selvin, P. R., 441 Semrad, K., 74 Seo, D. O., 63 Seok, C., 570 Sept, D., 14 Serre, D., 156 Sessions, R. B., 584 Seth, A. K., 135 Setty, Y., 227 Shahrezaei, V., 149, 162 Shakhnovich, E. I., 582 Shankaran, H., 485, 489, 498–499 Shannon, C. E., 105, 124 Shapiro, E. Y., 219–220, 223–224 Shapiro, L., 489 Shapiro, Z. Ya., 102 Sharma, P. K., 2 Shaw, M. A., 439, 447 Shcherbakova, I., 77 Shea, M. A., 4 Shearwin, K. E., 166 Sheehan, J. H., 546 Sheffler, W. H., 545, 547 Shehu, A., 103 Shell, M. S., 74 Shen, J., 357 Shepherd, C. M., 526 Sherman, M. A., 78 Shimizu-Sato, S., 313 Shimizu, T. S., 281, 297, 306–307, 309, 313 Shindyalov, I. N., 103 Shi, X. H., 28 Short, K. W., 23–24 Shortle, D., 102, 105 Shraiman, B. I., 205, 210, 389, 394 Siderovski, D. P., 282 Sidje, R. B., 150, 177, 182 Sidote, D., 43 Sigal, A., 322 Siggia, E. D., 149, 174, 209, 254 Sikes, J. G., 102 Silverman, J. A., 577 Silverman, W., 220, 223–224 Simon, J. D., 207–208 Simon, M. I., 282–283, 293, 313, 315 Simons, K. T., 547 Simons, S. S. Jr., 465, 468–475, 480–481 Simpson, M. L., 148–149, 254 Sims, K. J., 341 Sindelar, C. V., 579 Singer, S. J., 24–25 Singh, A., 233 Sippl, M., 105 Site, L., 76
Skliros, A., 103 Skolnick, J., 538 Slater, L. S., 4, 23 Slemrod, M., 176 Slepchenko, B. M., 488 Slovic, A. M., 589 Smith, C. A., 545 Smith, C. L., 469 Smith, D. E., 515 Smith, D. K., 102 Smolen, P., 45, 48, 50, 174, 208 Snoeyink, J. S., 567 Sobel, D., 410 Sobolewski, A. L., 15 Socci, N. D., 101 Soille, P., 597–598 Somorjai, R. L., 25 Song, G., 103 Song, H., 174, 208 Song, L.-N., 473–474 Soosaar, K., 76 Sorger, P. K., 218, 324 Sorokin, A., 227 Sorribas, A., 335, 355 Soto-Campos, G., 25 Sourjik, V., 307, 309 So, W. V., 43 Spackova, N., 77 Spakowitz, A. J., 525 Spicher, A., 186, 205 Sponer, J., 77 Sprous, D., 517 Sreerama, N., 15 Sridharan, D., 135 Srinath, S., 338–339 Srinivasan, R., 102, 115 Staknis, D., 44 Stamati, H., 103 Stamper, I. J., 67 Stanley, H. E., 410–416, 418, 425, 428 Stavreva, D. A., 469 Steeves, T. D., 44 Stelling, J., 41, 50, 59, 68 Stern, M. J., 227 Stevens, M. F., 356 Stiles, J. R., 503 Stoddard, B. L., 547 Stoleriu, I., 182, 205 Stoleru, D., 40, 46, 63 Storey, K. B., 175 Storti, R. V., 40, 45 Stratmann, R. E., 15 Strauss, C. E., 547, 571 Strickland, S., 468 Strickler, J., 313 Stundzia, A. B., 488 Stuzik, Z., 410, 414 Subramanian, A., 134
Suel, G. M., 209 Sugihara, G., 410 Sugiura, M., 102, 112 Sulzman, F. M., 42 Summa, C. M., 582, 587, 589 Sun, Y., 433, 439, 447, 468–469, 473–474, 480 Suter, U. W., 103 Svestka, P., 103 Swain, P. S., 149, 162, 174, 205, 209, 254 Swameye, I., 233 Szabo, A. G., 25, 441 Szallasi, Z., 41, 50, 59, 68 Szapary, D., 473–474 T Takahashi, J. S., 44 Talaga, D. S., 434, 441 Talman, J., 102 Tamanini, F., 175 Tanaka, F., 547 Tang, L., 526 Tang, X. Y., 103 Tang, Y., 357 Tanimura, N., 227 Tan, R. K.-Z., 515, 517, 521, 527–528, 531, 533, 536 Tao, Y.-G., 480 Taylor, C. A., 78 Tedesco, P. M., 307 Teodoro, M., 103 Tepperman, J. M., 313 Thattai, M., 149, 162, 205, 209–210, 254, 389, 394 Theiste, D., 15 Thomas, R., 172–173, 204 Thomas, S., 103 Thompson, J., 545, 547 Thompson, M. A., 14 Tian, T., 150 Tidor, B., 103 Tihova, M., 526, 532 Timmer, J., 233 Tomkins, G. M., 467 Toptygin, D., 24 Torres, N. V., 335, 350–352, 354–355 Totrov, M. M., 557 Tozer, T. N., 341 Traub, W., 586 Treuille, A., 545, 548, 571 Trigiante, D., 157 Troein, C., 248 Trucks, G. W., 15 Truhlar, D. G., 15 Tsai, J., 547 Tsimring, L. S., 208, 281, 297, 307, 309, 311–313 Tucker, B., 74 Tucker-Kellogg, L., 103
Turcotte, M., 209 Turner, T. E., 503 Tusell, J. R., 3, 22, 26 Tu, Y., 137, 140, 281, 297, 306–307, 309, 313 Tveen-Jensen, K., 14 Tyka, M. D., 545, 547 Tymchyshyn, O., 248 Tyson, J. J., 53, 177, 182, 207–208, 291, 312, 322 Tzlil, S., 518 U Ueda, H. R., 45, 48 Ufimtsev, I. S., 3 Ukai-Tadenuma, M., 41, 46, 63–64 Underhill-Day, N., 220, 228–229, 233, 235, 242–243, 246 Upreti, M., 133, 135, 140 Uversky, V. N., 102 V Vacic, V., 102 Vajda, S., 105 Vaknin, A., 307, 309 Valentine, K. G., 102 Vanbelle, S., 68 van den Broek, B., 209 Vanden-Eijnden, E., 150 van der Horst, G. T., 175 van Gulik, W. M., 335 Vangunsteren, W. F., 521 Vanier, J., 413 van Kampen, N. G., 174 Van Loan, C. F., 158 van Oudenaarden, A., 149, 162, 174, 205, 209–210, 254, 281, 297, 303, 307, 309–313 Van Veen, B., 280, 297 van Winden, W. A., 335 Veflingstad, S. R., 336 Venugopal, M. G., 586 Vera, J., 319, 322, 326, 335–336, 341, 343, 346, 349–352, 354–355 Verma, V., 357, 359–360 Vernon, R., 547 Vig, J., 413 Vilaprinyo, E., 335 Vilela, M., 336 Vilenkin, N. J., 102, 112 Vilfan, I. D., 518 Vincent, L., 597–598 Vinga, S., 336 Viollier, P. H., 489 Vivaudou, M., 77, 84 Vivian, J. T., 3–4, 13, 22–24 Vlachos, C., 503, 506 Vlachos, D. G., 503, 505 Voit, E. O., 319, 322, 324–326, 331, 335–336, 338, 341, 343, 345–346, 352, 355, 361
von Gall, C., 63 von Hippel, P. H., 209 von Kriegsheim, A., 348 Voorhies, M., 553 Vucetic, S., 102 Vyshemirsky, V., 219 W Wager-Smith, K., 44 Wagner, H., 508 Wahl, P., 25 Walker, D. A., 469 Walker, G. M., 281 Wand, A. J., 102 Wand, J., 102 Wang, C. S. E., 103, 547, 570 Wang, J. Y., 79, 102 Wang, K., 324 Wang, L., 103, 307 Wang, L. J., 25 Wang, R., 193, 207 Wang, W., 102 Wang, Y., 102–103, 113, 123, 150 Wang, Z. G., 525 Wanner, G., 76 Wan, Y., 78–79 Warshel, A., 2–3, 13–14, 24 Watenabe, M., 76 Watkins, L. P., 437–438, 443, 445, 447, 449–450, 458–459, 461 Weaver, D. L., 102 Webster, R. G., 515 Weikl, T. R., 74 Weissig, H., 103 Weiss, R., 312 Weiss, S., 433, 441 Weitz, C. J., 40, 43–44 Wellstead, P., 335–336 Welsh, D. K., 288 Wesson, L., 581 Westbrook, J., 103 Westermark, P. O., 288 West, M. W., 577 Westwick, D. T., 303, 306 Wheatley, D. N., 209 White, R. A., 103 Whitfield, M. L., 143–144 Whittemore, J. D., 334 Wiener, M., 389 Wiener, N., 305 Wild, C. J., 291 Wilgus, R. J. R., 254 Wilkie, J., 385 Wilkins, M. R., 596 Willard, F. S., 282 Williams, B., 15 Williams, K. L., 596
Williams, K. T., 334 Williams, R. M., 102 Willis, K. J., 25 Wilmer, E. L., 50 Winfree, E., 552 Winkler, G. M. R., 413 Wlodarczyk, J., 25 Wodak, S. J., 585 Wolf, B., 174, 208 Wolf, J., 204 Wolford, R., 469 Wolf-Yadlin, A., 497 Wolkenhauer, O., 322, 326, 335–336, 341, 343, 346, 349, 361 Wollacott, A. M., 547 Wolynes, P. G., 101 Wong, D., 14 Wong, J. W., 403 Wong, M., 547 Wong, W. H., 297, 308 Wong, Y. M., 385 Won, Y., 556 Woodson, S., 74 Woods, R., 600, 602 Woody, R. W., 15 Woolfson, D. N., 582 Wright, P. E., 102 Wright, S. J., 565 Wright, W., 103 Wu, C. X., 209 Wu, F.-Y., 503 Wuite, G. J., 209 X Xavier, J. B., 491 Xiang, Y., 2 Xie, X. S., 2, 25, 166, 441 Xie, Z., 41, 45–46, 49, 54, 56, 58, 60, 255, 269 Xu, C., 441 Xu, F., 575, 587 Xu, H. E., 480 Xu, M., 474 Xun, L., 2, 25 Xu, X. C., 14, 26 Xu, Y., 357, 480, 585 Y Yaffe, M. B., 218
Yagita, K., 175 Yakovlev, A., 599 Yamada, R. G., 41, 46, 63–64 Yang, D., 102 Yang, H. T., 2, 25, 186, 437–438, 441, 443, 445, 447, 449–450, 458–459, 461 Yang, W., 411, 586 Yang, Y., 25 Yan, J. X., 596 Yannoni, N. F., 413 Yan, Z., 150 Yarmush, M. L., 103 Yeh, I., 221 Yildirim, N., 371, 386–388 Ying, S., 74 Yin, T., 233 Yi, T.-M., 282–283, 293, 305, 313, 315 Young, J. W., 205, 209 Young, M. W., 40, 42–43, 45 Yuan, Z. M., 208 Yu, W., 44, 64 Z Zacharias, M., 77 Zahid, S., 575 Zakrzewski, V. G., 15 Zandstra, P. W., 233 Zanghellini, A., 547 Zaug, A., 74 Zawilski, S. M., 166 Zeldovich, K. B., 579 Zerner, M. C., 14 Zeron, E. S., 147, 150, 162–165 Zerr,D. M., 42 Zhang, G., 76 Zhang, J., 102 Zhang, L. Y., 25, 491, 587 Zhang, M., 103 Zhang, T., 578–579 Zhang, Y., 498 Zheng, H., 44–45 Zhong, D. P., 24–25 Zhou, H.-X., 122 Zhou, W., 150 Zhou, Y., 103 Zhu, G., 102 Zhu, Y. S., 103 Zuker, M., 528
Subject Index
A Adaptive coarse grain simulation, RNA base-pairing and stacking, 77 DCA scheme (see Divide and conquer algorithm) degrees of freedom, 76 development of model description, 78–79 simulation results, 80–83 guide transitions coarser to finer models, 87–89 finer to coarser models, 83–87 implicit-Euler (IE), 76 molecular dynamics (MD) types, 74, 75 multirigid-body dynamics approach, 76, 77 simulation cost, 76 subdomain, 77 Van der Waals/electrostatic forces, 76 Allan variance (AVAR), 411 AtomTree, 557 B BacSim modeling approach, 506 Bacterial chemotaxis, 307–309 Bayesian Markov Chain Monte Carlo method, 475, 476, 481 Bayes’ rule, 101 Benjamini–Hochberg (BH-FDR) procedure, 599 Biochemical pathway modeling biomedical knowledge and data retrieval (see Biomedical knowledge and data retrieval) network model calibration, 337–341 drug target detection (see Drug target detection) dynamic model, 335–337 flux balance analysis, 331–334 model sensitivity analysis, 345–350 network reconstruction, 329–331 predictive model, drug discovery, 341–345 Biochemical reaction networks deterministic and stochastic modeling, 372, 394 lactose operon β-galactosidase, 387 E. coli, 386 mRNA concentration, 389 S-shaped curve, 390
stable steady state, 392–393 Yildirim–Mackey model, 388 mathematical modeling chemical reaction system, 373 coupled reactions and bistability, 378–381 higher order kinetics and Hill equations, 377–379 mass-action kinetics, 372, 373 simple enzymatic reactions and Michaelis–Menten equation, 374–375 steady state and linear stability analysis, 377–378 stochastic simulations algorithms, 384–386 stochasticity matters, 382–384 Biochemical reaction systems binding and unbinding TF to E-boxes forward and backward reactions, 270 mean fluctuations, 271 mean molecular numbers, 275–276 variance time evolution, 272 Wiener processes, 274 Drosophila, 255 elementary chemical reactions chemical potential, 263–264 collision, 260 column vector, 261 diffusion process, 262 Fokker-Planck type equation, 261–262 H matrix, 265 Orenstein-Uhlenbeck process, 262 thermodynamic extensive variables, 263 transition rates, 261 zero-mean stochastic process, 260 gene expression, 254–255 Hill function, 255 intrinsic and extrinsic noise, 254 mathematical model, 255 molecular fluctuations, 254 SDE, 255–256 theoretical developments binary collisions, 258–259 Boltzmann equations, 258 equilibrium, 259 extensive variables, 256 fluctuations models, 257 Onsager regression hypothesis, 257–258
Biochemical reaction systems (cont.) physical processes, 256 total differential entropy, 256–257 zero-mean stochastic process, 260 transcriptional factors activation dimers formation, 266 fluctuations variance, 268 Gillespie’s Monte Carlo simulation algorithm, 268 Ito SDE theory, 267 mean-zero property, 268 phosphorylation, 266 Biological system simulation algorithm BacSim modeling approach, 506 explicitly include spatial aspects, 501–505 heterogeneous systems, 505 individual-based approaches, 505–506 Markov motion, 506 microbial cells, 507 model equations, 500–501 computational models, 508 discrete stochastic approach, 488 hierarchical organization, 488 mathematical approach, 486 spatial frameworks explicitly incorporate spatial dimension, 489–490 grid structure, 490 multicompartment models, 489 stochastic effects, 487 treatment model biomolecular reactions, 496 cell-cell junction formation, 492 compartments, 491–492 cytoplasm, 495 endocytic vesicles, 493–494 experimental data, 498 grid frame, 494 HER receptors, 498–499 integrated model, 493 internal reactions, 492–493 kinetic model, 498 ligand-receptor binding, 499–500 mass transfer reaction, 495–496 model scope, 497 signaling pathway, 496–497 Biomedical knowledge and data retrieval annotated map, 326 biosynthetic pathway, 327 conceptual maps, 326 M1 and M2 metabolites, 327 protein-protein interactions, 325 BlenX model, 243–247 Bode plots, 285 Boltzmann distribution, 103–104 Bond mobility, 79
C Carbon source utilization, 311 Cartesian configurational entropy, 105–106 Cartesian conformational entropy, 118–119 Changepoint analysis application, 440 multiple-channel, 441–442 photon emission rate, 436 single-molecule fluorescence measurements, 434 (see also Polarized total internal reflection fluorescence microscopy) Charmm program, 18 Chemical Langevin equations, 385 Chemical master equation (CME), 47 biochemical noise, 149 deterministic models, 149 finite state projection algorithm, 150 gene expression profiles, 160 global chemical-reaction, 148 irreducible chemical reaction systems, 152–153 kth reaction, 151 Langevin equation, 148 negative feedback regulation (see Negative feedback regulation) ODE, 148 probability distribution function, 151–152 SSA, 149–150 stationary probability distribution arbitrary initial distribution, 158–159 diagonal matrix D, 154 Jordan blocks, 154 lexicographic order, 153 nonnegative functions, 157, 159–160 nonzero eigenvalue, 155–156 nonzero stationary distribution, 157 vector denoting, 148–149 Chromophore bond lengths, 18 Charmm program, 18 Coulomb sum, 19 electron density matrix, 21 energy calculation, 20 Fock operator, 20 linker atoms, 17 QM procedure, 17 Stark effect, 21 state energy shifts, 20–21 subroutine bond, 18–19 potential, 19–20 Circadian clock. See Drosophila circadian clock Clockwork orange (CWO), 40 Cluster expansion and linear programming-based analysis of specificity and stability (CLASSY), 584 CME. See Chemical master equation Collagen
cell polarity, 584 electrostatic interactions, 586 heteromers ABC heterotrimer, 590, 591 Arg-Arg repulsion, 588 computational scheme, 587 energy gap, 589, 590 a-helical coiled-coil design, 587 positive and negative design, 588, 589 sequence optimization simulations, 589 human type I, 586, 587 model peptide systems, 586 pairwise interactions, 585 triple-helix forming domains, 584 Computational modeling circadian clocks, 247–248 formal models biological formulation, 226 computer languages, 226–227 graphical notations, 227 intuitive notations, 227 NL (see Narrative language) SBML, 228 SPiM, 228 supporting tools, 232–233 textual representations, 227–228 graphical user interface, 219 JAK/STAT pathway bio-PEPA model, 237, 240–243 blenX model, 243–247 dephosphorylation, 234 graphical representation, 234 LIF and OSM, 235 narrative language model, 235, 237 signaling, 234 STAT3, 235 language expressiveness, 219 mathematical formalisms, 218 modeling languages model analysis techniques, 220 Petri nets, 221–222 PRISM, 220 process algebra (see Process algebra) rewriting systems, 222–223 stochastic simulation techniques, 221 techniques definition, 220 SymBiology toolbox, 219 system behavior, 219 Concentration limiting step (CLS), 470 Conformational entropy, 100 Corner frequency, 290 Coulomb sums, 18 CWO. See Clockwork orange D Data encapsulation, 551 DCA. See Divide and conquer algorithm
3D coarse-grained model AMBER and LAMMPS, 531 coarse-grained model, 527 double-helical regions, 529 P-atoms, 528 RNA_BSQ information, 529 RNA_RNA information, 528–529 tRNA secondary structure model, 530 variable dimensions, 529–530 Debye–Hückel function, 518 Detrended fluctuation analysis (DFA), 411 CHF data, 422–423 excursions, 421 scaling exponent, 424 vs. healthy and heart failure groups, 424 wake and sleep excursion, 423 2D gel images automated analysis, 596 differential analysis BH-FDR procedure, 599 gel-to-gel variation, 598 hypothesis testing, 598 image alignment, 597 image cleaning and spot detection, 597 master watershed map, 598 statistical differential analysis, 598 treatment effects, 598 2D polyacrylamide gel electrophoresis, 596 proteomes, 596 RegStatGel (see RegStatGel image analysis) spot matching, 596 watershed algorithm, 597 Dihedral configurational entropy, 105–106 Divide and conquer algorithm (DCA) acceleration levels, 91 adaptive framework, 92 assembly process, 92, 93 binary tree, 95 coarse models transition, 93, 94 disassembly process, 93 moving-window standard deviation, 93 multiple closed loops, 89 new coarse model, 94 open loops, 89 spatial joint free-motion maps, 93–94 two-handle equations, 89, 91 two-handle impulse–momentum equations, 95 DNA models bond stretching and bond angle bending, 515–516 coarse-grain models, 516 DNA–DNA interactions, 518–519 3DNA1 model, 517 torsional stiffness, 517 Dose-response curves binary reactions, 469 biochemical reactions, 467 CLS, 470
Dose-response curves (cont.) enzyme kinetics, 480 first-order Hill equation, 466 gene-induction reactions, 479–480 inhibitors definition, 472 FHDC, 471 uncompetitive and noncompetitive, 472 model application graphical analysis, 475–479 inferring cofactor mechanisms, 473–475 Amax and EC50, 472–473 telescoping property reaction, 472 parameters functions, 471 properties, 467 second-order Hill function, 466 sigmoidal curve, 466 steady state equation, 467–468 steroid-receptor binding reaction, 467, 468 Ubc9 and glucocorticoid receptor, 481–482 Double-stranded DNA (dsDNA) bacteriophage capsid models bacteriophage, 519 hollow cylinder, 521 Lambda genome-packed, 519 spherical pseudoatoms, 520 data analysis, 522 DNA models bond stretching and bond angle bending, 515–516 coarse-grain models, 516 DNA–DNA interactions, 518–519 3DNA1 model, 517 torsional stiffness, 517 ejection protocols bacterial cell, 523–524 pseudoatoms, 523 thermodynamics and kinetics, 524 packaging protocols MD trajectories, 521 stud atoms, 521 push-pull mechanism, 525 2D polyacrylamide gel electrophoresis, 596 Drosophila circadian clock clockwork orange (CWO), 40 CWO anomaly and network regulatory rule CLK–CYC-mediated transcription, 63 master equation, 65 null mutations, 63–64 oscillating rhythms, 67 rate equations parameters, 63 strong activator (SA), 64–65 wild-type and mutant model, 64 wild-type model, 66 developmental history, 43 CLK-CYC, 44 clock-controlled gene vrille, 45 CME, 47
cwo gene, 46, 47 expanded model, 44, 45 gene dClock (Clk), 43 genetic screens, 42 Goldbeter model, 42, 43 Leloup model, 43, 44 mathematical models and computer simulations, 47–49 par-domain-protein, 45 per gene, 42 PER/TIM complex, 43 SSA, 50 direct target genes, 40–41 Hill-type kinetics, 41 Michaelis–Menten kinetics, 41 molecule activation, 40 network behavior, 42 network regulatory models binding probabilities, transcription factors, 54 CWO-expanded network, 58 E-boxes, 54–55 first-order linear expressions, 54 logistic equation, 62 Michaelis Menten enzyme kinetics, 51–53 module identification, 59 parameters, 59 phase-adjustments, 58 positive regulatory influences (PRI), 61 probability binding equations, 55–57 probability binding rates, 60 regulatory equation, 62–63 sigmoidal graph, 61 simplification, 54 single activation/repression parameters, 59 transcription rate, 57 replication ability, 41 repression/negative feedback, 40 Drug target detection biochemical systems, 363 biological data, 321 biomedical knowledge and data retrieval annotated map, 326 biosynthetic pathway, 327 conceptual maps., 326 M1 and M2 metabolites, 327 protein-protein interactions, 325 computational modeling techniques, 321 computer assisted software, 366 databases and software tools, 363 mathematical modeling techniques biochemical network, 328–329 dynamic models, 335–337 flux balance analysis, 331–334 network reconstruction, 329–331 model calibration experimental data, 337–338 left-bottom panel, 339, 340
parameter estimation, 338–339 right-bottom panel, 339, 340 model optimization critical parameters and processes, 352 hyperuricemia, 354–355 metabolic diseases, 351 multifactorial strategies, 353, 354 therapeutic strategies, 351–352 UA levels, 355–356 model sensitivity analysis bifurcation point, 349 biochemical processes, 345–346 computational simulations, 348 ErbB receptors, 349 ERK signaling, 348–349 global sensitivity analysis, 346–347 local sensitivity analysis, 346 parameter modulation, 347 ODE equations, 361–362 potential drug targets, 364–365 predictive model simulations application, 345 cronotherapy, 347 dopamine system, 345 Epo blood levels, 343 HS and PS, 343 murine erythropoiesis, 343 pharmacokinetics and pharmacodynamics, 342 physiological effects, 341 protein docking-based techniques drug design, 356–357 HER-2, 357–359 oligopeptide, 359 pharmacokinetics properties, 357 structural analysis, 356 systems biology approach, 322–323 therapeutic effects, 320 validation techniques, 324 E Electron transfer (ET) electrostatic interactions, 11, 12 energy gap modulation, 6 flavins and dyes quenching, 7 Hamiltonian operator, 6 reorganization stabilization, 10 S1 and CT states identification, 21–22 Entropy algorithmic classifiers., 399 classification and prediction, 398 data reduction/feature selection, 398–399 kNN analysis, 407 multivariate neighborhood sample entropy (see Multivariate neighborhood sample entropy) sample entropy (see Sample entropy)
statistical models, 399 Epidermal growth factor receptor (EGFR), 493 Escherichia coli, 190 ESTs. See Expressed sequence tags Euler angles, 112 Excursions and simulated noise Allan deviation statistics, 427 AVAR and CHF data, 426 cumulative distribution, 425 evolution, 420 heart failure group, 421 long-term correlated data., 426 segmentation process, 425 stability, 426 Exponential parameterization, 110 Expressed sequence tags (ESTs), 135 F FBA. See Flux balance analysis Fermi rule, 6 FHDC. See First-order Hill dose-response curve First-order Hill dose-response curve (FHDC), 466 First-order Hill functions, 470 Fluorescence ab initio computation, ET coupling matrix, 29–34 intensities and lifetimes energy gap modulation, 6 ET, 5–6 Fermi rule, 6 FRET, 7 intersystem crossing, 5 quantum yield, 4 spatial distance modulation, 5 time-resolved methods, 4 intuition development blue-shifted fluorescence, 14 calculated vs. experimental quantum yields, 7, 8 CT and S1 energy separation, 12 electron density shifts, 13 electrostatic interactions, 11 energy fluctuations, 11 QM–MD trajectories, 7, 10 quenching event, 7 solvent reorganization, 9 stabilizing and destabilizing interactions, 13 staphylococcal nuclease (SNase), TrP, 9–10 MD simulations, 14 nonexponential fluorescence decay, 25–28 philosophy, 14 QM and MD interface chromophore input file, 17–21 CT energy, 22 flow chart, 15, 16 Fortran programs, 16 MD charges updation, 22
Fluorescence (cont.) raw structure editing, 16–17 S1 and CT states, 21–22 vacuum, 24–25 quantum mechanics, 14–15 wavelength prediction, 4–5 Fluorescence resonance energy transfer (FRET), 7 Flux balance analysis degenerate solutions, 333 drug action, 334 feasible region, 332–333 steady states, 331–332 Flux balance analysis (FBA), 331 Fock operator, 20 Fokker-Planck type equation, 261–262 Förster resonance energy transfer (FRET) fluorescent probes, 433 photon streams, 441 simple derivation, 437 Fortran programs, 16 FRET. See Fluorescence resonance energy transfer G Gaussian distribution, 112–113 Gene expression profiles, 160. See also Negative feedback regulation Bayesian structure, 135 cell cycle, GC analysis, 143–144 diagnostic tests, 140–141 EST, 135 first order bivariate VAR mean-squared forecast error, 137–138 noise variance affects, 139 power series expansion, 137 synthetic two-gene network, 136–137 transcriptional mechanisms, 136 high-throughput assays, 134 noise variance impact, GC, 141–143 signaling pathways, 134 temporal data interference, 135 Gene ontology (GO), 325 Gibbs free energy, 263 Gillespie algorithm, 50, 182 Monte Carlo simulation algorithm, 268 stochastic method, 385 Glucocorticoid receptor (GR), 469, 481–482 Goldbeter model, 42, 43 G-protein pathway experimental design and oscillatory response bacterial chemotaxis, 307–309 Nyquist limit, 297 output response, 297 signaling cascades, yeast (see Yeast) sinusoidal inputs, 298 SNR, 297 square waves, 296
Fourier filtering, 298–299 frequency response alpha factor pheromone, 283, 284 Bode plots, 285, 287 corner frequency, 290 filtering behaviors, 288–289 gain plot, 285 high-pass filter, 291 input-output map, 291 input-output pair, 288 linear systems, 283 low-frequency (DC) gain, 290 low-pass filter, 288–289 oscillatory input signals, 283 phase plot, 285 roll-off, 290 sinusoidal inputs, 283 steady-state input-response pairs, 285, 286 system phase shift, 284 time-series measurements, 284 model trajectories simulation, 301–302 model validation, 303–304 nonlinear rectifier, 302–303 nonsinusoidal inputs, 304–305 procedure decay time, 295 dose-response behavior, 293 exponential behavior, 295 rectangular pulse input, 295, 296 Saccharomyces cerevisiae, 282 transfer function model, 299–301 in yeast, 282 Granger causality (GC) cell cycle gene expression profiles, 143–144 forecasting ability, 136 mean-squared forecast errors, 136 noise variance impact, 141–143 VAR parameter estimation, 140 Guide transitions coarser to finer models constraint load magnitude, 87–88 constraint torque, 88 kinetic energy, 89 spatial constraint loads, 87 finer to coarser models dynamic behavior vs. coarse models, 84–86 math-based metrics, 83–84 modes of motion, 83 moving-window average, 84 truth model, 84 H Heartbeat, wake-sleep period characterization, 428 data analysis CHF patients, 416 correlation excursions, 420
cumulative distributions, 415, 417 day-night transitions, 414 detrended fluctuation analysis, 421–424 excursions and simulated noise, 425–427 heart failure group, 421 kth successor, 419, 420 wake vs. sleep average, 418 DFA analysis, 427 evaluation, 410 methods Allan variance, 413–414 autocorrelation function, 412–413 stationary segments, 411–412 parameters, 427 scaling properties, 411 Hill kinetics, 41 amplitude reduction, 189–190 binding sites cooperativity, 186 compact vs. developed version, 186 cooperativity, 188 free inhibitor, 187 quasi-steady-state approximation, 187 sigmoidal functions, 185 synthesis rate, 186 transcriptional inhibition, 185 transients, 188 Human epidermal growth factor receptor (HER), 498 I Inferring cofactor mechanisms. See also Doseresponse curves direct data fitting cofactor effects on Amax and EC50, 473, 474 dose-response curve, GR and Ubc9, 475, 476 posterior statistics, 475, 476 Ubc9 effects on GR concentration, 473, 475 graphical analysis Amax/EC50 plot, 475 decision tree, 477 TIF2 coactivator activity, 478, 479 Ubc9 activity, 477, 478 Irreducible chemical reaction systems, 152–153 Ito SDE theory, 267 J Jacobian factor, 112 JAK/STAT pathway blenX model, 243–247 dephosphorylation, 234 graphical representation, 234 LIF and OSM, 235 narrative language model, 235, 237 signaling, 234 STAT3, 235
K Kappa, 222–223 k-nearest neighbors (kNN), 399–400 L Lactose operon b-galactosidase, 387 E. coli, 386 mRNA concentration, 389 S-shaped curve, 390 stable steady state, 392–393 Yildirim-Mackey model, 388 Langevin dynamics (LD), 524 Langevin equation, 148 Law of mass action, 52 Leloup model, 43, 44 Lie theory, 108 Loop entropy model Bayes’ rule, 101 Cartesian conformational entropy, 118–119 conformational, 100 covariance matrices to entropy, 124–126 folded ensemble, 119–120 Gaussian chains, 120–122, 126 helix–helix crossing angle, 103 hyper-redundant, 102 noncommutative harmonic analysis, 102 protein folding, 101–102 reference frames, polypeptide chain, 100, 101 rigid-body motion Euclidean group, 107 Euler angles, 112 exponential parameterization, 110 Gaussian distribution, 112–113 Jacobian determinant, 112 Lie theory, 108 transformation, 109 vee operator, 110 semiflexible polymers, 122–124, 126–127 serial polymer chains Cartesian conformational distribution, 115–116 conformational distribution, 114 end-to-end distance distribution, 115 pairwise energy model, 113–114 Ramachandran map, 114–115 reference frame attachment, 113 statistical mechanics ab initio potentials, 105 Boltzmann distribution, 103–104 Cartesian configurational entropy, 105–106 conformational changes, 104 dihedral configurational entropy, 105–106 mass metric tensor, 104 volume effects average density, 116–117 phantom polymer chain model, 116
M
Mass action law, 176 Mathematical modeling biochemical network, 328–329 chemical reaction system, 373 coupled reactions and bistability bistable system, 378 Hill function, 380 nonlinear equation, 379 steady states, 380–381 dynamic models, 335–337 power law, 335–336 rate law, 335 flux balance analysis degenerate solutions, 333 drug action, 334 feasible region, 332–333 steady states, 331–332 higher order kinetics and Hill equations equilibrium assumption, 376 mass-action law, 377 mass-action kinetics, 372, 373 network reconstruction algebraic property, 331 genome, 330 stoichiometric matrix, 329 simple enzymatic reactions and Michaelis Menten equation, 374–375 steady state and linear stability analysis, 377–378 Metropolis criterion, 553 Metropolis Hastings algorithm, 481 Michaelis Menten kinetics cellular rhythmic behavior, 51 compact model vs. developed model, 175 embedded enzymatic reaction decomposing, 179 linear chains of reactions, 182 mass section law, 179 periodic forcing, 180 stochastic simulations, 179 substrate synthesis, 180 total enzyme concentration, 180 in vivo enzymatic reaction, 178 enzymatic reactions, 175 enzyme-catalyzed transformation, 52 Hill-type equation, 53 isolated reaction classical reaction scheme, 176 mass action law, 176 QSSA, 176–177 rate of production, 176 time evolution vs. compact model, 177–178 total substrate concentration, 177 law of mass action, 52 PER protein phosphorylation, 51
phosphorylation-dephosphorylation, proteins, 175 standard kinetic equations, 175 stochastic simulation compact version, 182, 183 developed version, 182, 183 enzyme concentration, 183 Molecular dynamics (MD), 515 Monte Carlo (MC) approach, 503, 556 Multiple-channel changepoint (MCCP), 458 Multirigid-body dynamics approach, 76, 77 Multivariate neighborhood sample entropy (MN SampEn) algorithmic implementation, 403–405 data-reduction method, 407 definition, 401 kernel function, 401 large and complex dataset analysis, 407–408 limitations, 408 optimization, 403–405 predictive algorithm, 407 proteomics data, 403 vs. kNN, 402, 406 vs. SampEn, 402–403 N Narrative language binding reaction, 231 components and compartments, 229–231 definition, 228 molecular representation, 228–229 narrative of event, 231–232 Navier Stokes equation, 490 Negative feedback regulation mean of the variable, 163 metabolite-synthesis reaction, 162–163 mRNA bursting, 166 parameter values, 163, 164 Poisson distribution, 162 promoter flips, active and inactive states, 166 Riboswitch B12, 161 simple gene network, schematic representation, 161 stationary probability distribution function, 164, 165 stochastic dynamic behavior, 160–161 strong and weak feedback loop, 164 synthesis and degradation, metabolites, 162 Nonexponential fluorescence decay chromophore–quencher intermolecular distance, 25–26 ET coupling, 26–27 lifetime and wavelength, 26 QM–MD simulation, 25 relaxation model, 26 universal physical principle, 27–28
Subject Index

O

Object-oriented architecture, 550–551
ODE. See Ordinary differential equation
OneBodyEnergy class, 560–561
Onsager regression hypothesis, 257–258
Ordinary differential equation (ODE), 148, 329, 381, 486
Ornstein-Uhlenbeck process, 262
Osmo-adaptation, 310–311

P

Pairwise energy model, 113–114
Pariacoto virus (PaV)
  capsid model
    coarse-grained model, 537–539
    crystal structure, 536
    N- and C-terminal, 536
  RNA model
    coarse-grained model, 533
    pseudoatoms, 533–534
    RNA dodecahedral cage, 531–532
    rrRNAv1, 534
    stud energy function, 535
  specific model system, 526
Partial differential equations (PDEs), 486
Petri nets, 221–222
Phantom polymer chain model, 116
Phospho-AKT (pAKT), 350
Picogram carbon (pgC), 506
Poisson distribution, 162
Polarized total internal reflection fluorescence microscopy (polTIRF)
  automated, multiple-channel changepoint detection algorithm, 447–448
  critique, 448–450
  false positives, 445–447
  heuristic, 436–437
  multiple channels
    changepoint analysis, 441–442
    experimental procedure, 439–441
  no-changepoint simulations
    correction factors, 451–453
    false-positive threshold, 453
  nonuniform distribution, 443–445
  simple derivation, 437–439
  single-changepoint simulations
    arbitrary rate change detection, 453–455
    myosin lever arm change detection, 455–456
  single-molecule biophysics, 433
  SPC technology
    log likelihood function, 458–459
    N and SBR, 459–460
    photon emission rates, 458
    polTIRF, 460
  threshold, false positives detection, 442–443
  traditional approach, 434–436
  transient state detection, 460–461
  two-changepoint detection, 456–457
PRISM, 220
Process algebra
  beta-binders, 225
  bioambients, 224–225
  biochemical π-calculus, 224
  biochemical system abstraction, 223
  bio-PEPA, 225–226
  components, 223
  reachability and causality analysis, 224
Protein docking-based techniques
  drug design, 356–357
  HER-2, 357–359
  oligopeptide, 359
  pharmacokinetics properties, 357
  structural analysis, 356
Protein self-assembly
  collagen
    cell polarity, 584
    electrostatic interactions, 586
    human type I, 586, 587
    model peptide systems, 586
    pairwise interactions, 585
    triple-helix forming domains, 584
  computational methods, 576
  enzymatic reactions, 576
  heteromers, 587–591
  model α-helix and collagen peptides, 576
  multiscale design, 577
  stability and specificity optimization
    ABC heterotrimer, 582
    CLASSY, 584
    computational protein design, 581
    electrostatic interactions, 581
    α-helical assemblies, 580
    homodimer or heterodimer formation, 583
    leucine, 581
    positive and negative design, 579
    stepwise sequence selection algorithm, 582
    stochastic sequence optimization algorithms, 582
  vs. unimolecular folding
    asparagines, 579
    binary patterning, 577
    continuous hydrophobic core, 577
    de novo-designed proteins, 577
    efficiency, 577–578
    GCN4–1, 579
    helix–helix dimer, 578
    H–P alternating pattern, 577
    schematic representation, 578
    thermophilic proteins, 579

Q

Quantitative structure activity relationships (QSARs), 357
Quasi-steady-state assumption (QSSA), 176–177
R
Reaction rate equations (RREs), 486
RegStatGel image analysis
  fully automatic mode, 599–601
  interactive automatic mode
    advantage, 601
    catchment basins, 602
    morphological processing techniques, 602
    normalization procedure, 603
  regional quantification methods, 608
  stepwise operation and exploration mode
    Build Master Watershed Region, 604
    Explore menu, 604, 605
    Graphics, 603
    Permutation menu, 603, 604
    Region ANOVA, 603
    Region ID, 603
    STAT menu, 605, 607
    Stepwise Operation menu, 603, 604
  usage, 606
Repressilator model
  cooperative binding sites, 191–193
  deterministic simulation, 193–196
  Escherichia coli, 190
  Hill-based, 190–191
  stochastic simulation, 196–198
Rewriting systems
  Kappa rules, 222–223
  membrane systems, 222
RNA model
  coarse-grained model, 533
  pseudoatoms, 533–534
  RNA dodecahedral cage, 531–532
  rrRNAv1, 534
  stud energy function, 535
Rosetta3 software
  applications, 547
  architecture
    AtomTree, 557
    class energies and energygraph, 562–563
    class pose, 558
    conformation layer, 557–558
    ConstraintsEnergy, 561–562
    flexible-backbone protein docking, 556
    libraries, 554
    LongRangeEnergyContainer, 561
    Monte Carlo perturbations, 556
    namespace core, 555
    OneBodyEnergy class, 560–561
    residue type class, 555–556
    ScoreFunction, 559–560, 563–564
    TwoBodyEnergy class, 561
  code quality requirements, 549–550
  core
    optimization, 565–566
    pack, 566–567
    scoring and constraints, 564–565
  DNA redesign, 571
  generality requirements, 548–549
  industrial software, 547
  JobDistributor, 569
  object-oriented architecture, 550–551
  pose, 553–554
  preserving existing functionality, 548
  protocols
    library, 567–568
    loops, 570
    moves, 568–569
    from text files, 570–571
  residue centrality
    low-energy rotamer assignment, 552
    metropolis criterion, 553
    packer, 551
    pose architecture, 551, 552
    requirements, 551
    RNA design, 553
  scoring, 554
  speed requirements, 550

S

Sample entropy
  nonlinear/chaotic processes, 400
  Shannon's information-theoretic entropy, 400
  vs. MN-SampEn, 402–403
ScoreFunction, 559–560, 563–564
Second-order Hill function, 466
Signal-to-noise ratios (SNRs), 281
Single-changepoint simulations
  arbitrary rate change detection, 453–455
  myosin lever arm change detection, 455–456
Single-molecule measurement. See Polarized total internal reflection fluorescence microscopy
Single-photon counting (SPC), 437
  log likelihood function, 458–459
  N and SBR, 459–460
  photon emission rates, 458
  polTIRF, 460
Single-stranded RNA viruses
  3D coarse-grained model
    AMBER and LAMMPS, 531
    coarse-grained model, 527
    double-helical regions, 529
    P-atoms, 528
    RNA_BSQ information, 529
    RNA_RNA information, 528–529
    tRNA secondary structure model, 530
    variable dimensions, 529–530
  pariacoto virus, 540 (see also Pariacoto virus)
  protein–protein interactions, 539
Singular perturbation theory, 176
SNRs, 281
Spatial frameworks
  explicitly incorporate spatial dimension, 489–490
  grid structure, 490
  multicompartment models, 489
Spatial model, biological systems. See also Biological system simulation
  BacSim modeling approach, 506
  explicitly include spatial aspects
    deterministic solutions, 501–502
    stochastic solutions, 502–505
  heterogeneous systems, 505
  individual-based approaches, 505–506
  Markov motion, 506
  microbial cells, 507
  model equations, 500–501
SPC. See Single-photon counting
Stark effect, 21
Stationary probability distribution function, 164, 165
Stepwise sequence selection algorithm, 582
Stochastic differential equations (SDE), 255–256, 381
Stochastic sequence optimization algorithms, 582
Stochastic simulation, 500. See also Gillespie algorithm
  algorithms, 50, 149–150
  chemical Langevin equations, 385
  Gillespie's stochastic method, 385
  iterative scheme, 386
  ODE solutions, 384
  stochasticity matters
    bistable system, 383–384
    ODE and SDE simulations, 382–383
SymBiology toolbox, 219
Synthetic two-gene network, 136–137
Systems biology markup language (SBML), 228
System identification method
  bandwidth, 281
  cellular mechanisms, 280
  frequency response, 280–281
  G-protein pathway (see G-protein pathway)
  transfer function models, 291–293

T

Timescale, 280
Toggle switch model
  cooperative binding sites, 199–200
  deterministic simulation, 200–201
  Hill-based, 198–199
  stochastic simulation, 201–204
Transcriptional factors (TF)
  activation
    dimers formation, 266
    fluctuations variance, 268
    Gillespie's Monte Carlo simulation algorithm, 268
    Ito SDE theory, 267
    mean-zero property, 268
    phosphorylation, 266
  binding and unbinding, E-boxes
    forward and backward reactions, 270
    mean fluctuations, 271
    mean molecular numbers, 275–276
    variance time evolution, 272
    Wiener processes, 274
Transfer function models, 281
  black box, 292
  Fourier transform, 292–293
  input-output relationships, 292
  Laplace transform, 292
tRNA secondary structure model, 530
TwoBodyEnergy class, 561
Two-handle equations, 89, 91

U

Ubiquitin-conjugating enzyme, 143–146
Uric acid (UA), 355

V

Vector-autoregressive (VAR) process
  cell-cycle gene expression profiles, 140–141
  frequency-domain approach, 141
  mean-squared forecast error, 137–138
  noise impact, 141
  noise variance magnitude, 139
  power series expansion, 137
  reverse characteristic polynomial, 142
  synthetic two-gene network, 136–137
  transcriptional mechanisms, 136

W

Watershed algorithm, 597
Wavelength prediction, 4–5
Wiener processes, 274
Wild-type model, 66

X

Xie and Kulasiri model, 45, 46

Y

Yeast
  G-protein signaling pathway, 281, 282
  signaling cascades
    carbon source utilization, 311
    osmo-adaptation, 310–311
Yildirim-Mackey lac operon model, 394