METHODS IN ENZYMOLOGY
Editors-in-Chief
JOHN N. ABELSON AND MELVIN I. SIMON
Division of Biology, California Institute of Technology, Pasadena, California
Founding Editors
SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
Academic Press is an imprint of Elsevier
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
225 Wyman Street, Waltham, MA 02451, USA
32 Jamestown Road, London NW1 7BY, UK
First edition 2011
Copyright © 2011, Elsevier Inc. All Rights Reserved.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: permissions@elsevier.com. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.
Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
For information on all Academic Press publications visit our website at elsevierdirect.com
ISBN: 978-0-12-385120-8
ISSN: 0076-6879
Printed and bound in United States of America
CONTRIBUTORS
Laura Adam Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA Rivka Adar Weizmann Institute of Science, Biological Chemistry Department, Rehovot, Israel Andreu Alibés EMBL/CRG Systems Biology Research Unit, Center for Genomic Regulation (CRG), UPF, Barcelona, Spain J. Christopher Anderson Department of Bioengineering; SynBERC: Synthetic Biology Engineering Research Center; QB3: California Institute for Quantitative Biological Research, Emeryville, and Physical Biosciences Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, California, USA Aseem Z. Ansari Department of Biochemistry, and Genome Center, University of Wisconsin, Madison, Wisconsin, USA Swapnil Bhatia Department of Electrical and Computer Engineering, Boston University, Boston, Massachusetts, USA Lesia Bilitchenko Department of Computer Science, California State Polytechnic University, Pomona, California, USA Jennifer Brophy Department of Bioengineering, University of California, Berkeley, California, USA Ben Bubenheim Department of Bioengineering, University of California, Berkeley, California, USA George M. Church Department of Genetics, Harvard Medical School, Boston, and Wyss Institute for Biologically Inspired Engineering, Harvard University, Massachusetts, USA
Maisam Dadgar Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA Sarah E. Davis Department of Biochemistry, and Genome Center, University of Wisconsin, Madison, Wisconsin, USA Douglas Densmore Department of Electrical and Computer Engineering, and Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA Andrew D. Ellington Applied Research Laboratories, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, USA Daniel G. Gibson1 J. Craig Venter Institute Inc., Synthetic Biology Group, La Jolla, California, USA Jeff Grass Department of Biochemistry, University of Wisconsin, Madison, Wisconsin, USA Ilan Gronau Weizmann Institute of Science, Computer Science and Mathematics Department, Rehovot, Israel Claes Gustafsson DNA2.0, Inc., Suite A, Menlo Park, California, USA Russell Hertzberg Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA Tony Ho Life Technologies Corporation, Carlsbad, California, USA Randall A. Hughes Applied Research Laboratories, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, USA Mitsuhiro Itaya Institute for Advanced Biosciences of Keio University, Nipponkoku, Tsuruoka-shi, Yamagata, Japan Elenita I. Kanin Department of Biochemistry, and Genome Center, University of Wisconsin, Madison, Wisconsin, USA
1 Present address: Daniel G. Gibson, Science Center Drive, San Diego, CA
Shai Kaplan Weizmann Institute of Science, Biological Chemistry Department, and Computer Science and Mathematics Department, Rehovot, Israel Federico Katzen Life Technologies Corporation, Carlsbad, California, USA Yiannis N. Kaznessis Department of Chemical Engineering and Materials Science, and Digital Technology Center, University of Minnesota, Minneapolis, Minnesota, USA Tae Yong Kim Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Program), and Bioinformatics Research Center, KAIST, Daejeon, Republic of Korea Thomas F. Knight2 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA Wieslaw Kudlicki Life Technologies Corporation, Carlsbad, California, USA Robert Landick Department of Biochemistry, and Department of Bacteriology, University of Wisconsin, Madison, Wisconsin, USA Sang Yup Lee Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Program); BioProcess Engineering Research Center, Center for Systems and Synthetic Biotechnology, Institute for the BioCentury; Bioinformatics Research Center, and Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea Mariana Leguia Department of Bioengineering; QB3: California Institute for Quantitative Biological Research, and Synthetic Biology Engineering Research Center, University of California, Berkeley, California, USA Ke Li Life Technologies Corporation, Carlsbad, California, USA Xiquan Liang Life Technologies Corporation, Carlsbad, California, USA
2 Current address: Ginkgo BioWorks, Inc., Boston, Massachusetts, USA
Gregory Linshiz Weizmann Institute of Science, Biological Chemistry Department, and Computer Science and Mathematics Department, Rehovot, Israel Michael Liss Life Technologies Inc./GeneArt AG, Regensburg, Germany Adam Liu Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA Meagan Lizarazo Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA Aleksandr E. Miklos Applied Research Laboratories, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, USA Jeremy Minshull DNA2.0, Inc., Suite A, Menlo Park, California, USA Kentaro Miyazaki Bioproduction Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Central 6, 1-1-1 Higashi, Tsukuba, Ibaraki, Japan Rachel A. Mooney Department of Biochemistry, University of Wisconsin, Madison, Wisconsin, USA Alejandro D. Nadra Departamentos de Química Biológica y Fisiología, Biología Molecular y Celular, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina Frank Notka Life Technologies Inc./GeneArt AG, Regensburg, Germany Jong Myoung Park Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Program), and BioProcess Engineering Research Center, Center for Systems and Synthetic Biotechnology, Institute for the BioCentury, KAIST, Daejeon, Republic of Korea Jean Peccoud Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA Lansha Peng Life Technologies Corporation, Carlsbad, California, USA Todd Peterson Life Technologies Corporation, Carlsbad, California, USA
Jason Potter Life Technologies Corporation, Carlsbad, California, USA Sivan Ravid Weizmann Institute of Science, Computer Science and Mathematics Department, Rehovot, Israel Randy Rettberg Department of Biological Engineering, and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA Howard M. Salis Department of Chemical Engineering, and Department of Agricultural and Biological Engineering, Pennsylvania State University, University Park, Pennsylvania, USA Luis Serrano EMBL/CRG Systems Biology Research Unit, and ICREA Professor, Center for Genomic Regulation (CRG), UPF, Barcelona, Spain Ehud Shapiro Department of Biological Chemistry, and Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel Reshma Shetty3 Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA Josh Shirley Life Technologies Corporation, Carlsbad, California, USA Tetsuro Toyoda Bioinformatics and Systems Engineering division, RIKEN, Yokohama, Kanagawa, Japan Kenji Tsuge Institute for Advanced Biosciences of Keio University, Nipponkoku, Tsuruoka-shi, Yamagata, Japan Billyana Tsvetanova Life Technologies Corporation, Carlsbad, California, USA Alan Villalobos DNA2.0, Inc., Suite A, Menlo Park, California, USA
3 Current address: Ginkgo BioWorks, Inc., Boston, Massachusetts, USA
Ralf Wagner Life Technologies Inc./GeneArt AG, and Institute of Medical Microbiology and Hygiene, Molecular Microbiology and Gene Therapy, University of Regensburg, Regensburg, Germany Harris H. Wang Department of Genetics, Harvard Medical School, Boston, and Wyss Institute for Biologically Inspired Engineering, Harvard University, Massachusetts, USA Mark Welch DNA2.0, Inc., Suite A, Menlo Park, California, USA Mandy L. Wilson Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA Bing Xia Department of Bioengineering, University of California, Berkeley, California, USA Liewei Xu Life Technologies Corporation, Carlsbad, California, USA Jian-Ping Yang Life Technologies Corporation, Carlsbad, California, USA Tuval Ben Yehezkel Department of Biological Chemistry, Weizmann Institute of Science, Rehovot, Israel
PREFACE
This is the second volume of a two-part series in Methods in Enzymology on tools and techniques used in synthetic biology. Synthetic biology is an engineering discipline that seeks to construct living systems that do not exist in nature. The field refers to the process by which genetic systems are designed and constructed, as opposed to any particular application. Along these lines, these volumes are organized into two areas. Volume I focuses on the assay techniques and design principles underlying the characterization of genetic parts, their combination into devices and programs, and their integration into various hosts. Volume II focuses on computational tools and biophysical models to aid in the design and organization of genetic programs and modern methods to synthesize and assemble the associated DNA. The first set of chapters focuses on computational tools to predict the function of genetic parts and to organize them into more complex systems. Biophysical methods for modeling common genetic parts are described, including promoters, codon usage, and ribosome binding sites. Computer-aided design (CAD) will become an increasingly important tool for synthetic biology, as designs become larger and more complex. Several academic simulation packages have emerged (e.g., Clotho and SynBioSS) that enable the prediction of how multiple genetic parts will behave. Several innovative grammars (e.g., EUGENE and GenoCAD) have been proposed that simplify the rules for the combination of genetic parts in a way that enables designers to organize large systems and to facilitate the sharing of designs. Ultimately, these methods will need to be applicable to genome-scale engineering. The output of the CAD programs is a DNA sequence that needs to be assembled. The combination of automated DNA synthesis with modern cloning techniques has enabled a revolution in the scale and ambition of genetic constructs.
Several methods for gene synthesis are described, including error correction, which becomes critical for large constructs. The “Gibson Method” and variations thereof provide a convenient one-step method for the assembly of multiple genetic parts. This builds on restriction enzyme-based methods, such as BioBricks™ assembly. Our lab has made great use of the MEGAWHOP method for rapidly inserting large pieces of DNA into a plasmid (anybody familiar with Quickchange will get the gist of the method). Synthetic biology ultimately needs to work at the scale of whole-genome engineering. Methods are described for the assembly, engineering, and analysis of genome-sized fragments of DNA.
METHODS IN ENZYMOLOGY
VOLUME I. Preparation and Assay of Enzymes Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN VOLUME II. Preparation and Assay of Enzymes Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN VOLUME III. Preparation and Assay of Substrates Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN VOLUME IV. Special Techniques for the Enzymologist Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN VOLUME V. Preparation and Assay of Enzymes Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN VOLUME VI. Preparation and Assay of Enzymes (Continued) Preparation and Assay of Substrates Special Techniques Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN VOLUME VII. Cumulative Subject Index Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN VOLUME VIII. Complex Carbohydrates Edited by ELIZABETH F. NEUFELD AND VICTOR GINSBURG VOLUME IX. Carbohydrate Metabolism Edited by WILLIS A. WOOD VOLUME X. Oxidation and Phosphorylation Edited by RONALD W. ESTABROOK AND MAYNARD E. PULLMAN VOLUME XI. Enzyme Structure Edited by C. H. W. HIRS VOLUME XII. Nucleic Acids (Parts A and B) Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE VOLUME XIII. Citric Acid Cycle Edited by J. M. LOWENSTEIN VOLUME XIV. Lipids Edited by J. M. LOWENSTEIN VOLUME XV. Steroids and Terpenoids Edited by RAYMOND B. CLAYTON
VOLUME XVI. Fast Reactions Edited by KENNETH KUSTIN VOLUME XVII. Metabolism of Amino Acids and Amines (Parts A and B) Edited by HERBERT TABOR AND CELIA WHITE TABOR VOLUME XVIII. Vitamins and Coenzymes (Parts A, B, and C) Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT VOLUME XIX. Proteolytic Enzymes Edited by GERTRUDE E. PERLMANN AND LASZLO LORAND VOLUME XX. Nucleic Acids and Protein Synthesis (Part C) Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN VOLUME XXI. Nucleic Acids (Part D) Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE VOLUME XXII. Enzyme Purification and Related Techniques Edited by WILLIAM B. JAKOBY VOLUME XXIII. Photosynthesis (Part A) Edited by ANTHONY SAN PIETRO VOLUME XXIV. Photosynthesis and Nitrogen Fixation (Part B) Edited by ANTHONY SAN PIETRO VOLUME XXV. Enzyme Structure (Part B) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME XXVI. Enzyme Structure (Part C) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME XXVII. Enzyme Structure (Part D) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME XXVIII. Complex Carbohydrates (Part B) Edited by VICTOR GINSBURG VOLUME XXIX. Nucleic Acids and Protein Synthesis (Part E) Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE VOLUME XXX. Nucleic Acids and Protein Synthesis (Part F) Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN VOLUME XXXI. Biomembranes (Part A) Edited by SIDNEY FLEISCHER AND LESTER PACKER VOLUME XXXII. Biomembranes (Part B) Edited by SIDNEY FLEISCHER AND LESTER PACKER VOLUME XXXIII. Cumulative Subject Index Volumes I-XXX Edited by MARTHA G. DENNIS AND EDWARD A. DENNIS VOLUME XXXIV. Affinity Techniques (Enzyme Purification: Part B) Edited by WILLIAM B. JAKOBY AND MEIR WILCHEK
VOLUME XXXV. Lipids (Part B) Edited by JOHN M. LOWENSTEIN VOLUME XXXVI. Hormone Action (Part A: Steroid Hormones) Edited by BERT W. O’MALLEY AND JOEL G. HARDMAN VOLUME XXXVII. Hormone Action (Part B: Peptide Hormones) Edited by BERT W. O’MALLEY AND JOEL G. HARDMAN VOLUME XXXVIII. Hormone Action (Part C: Cyclic Nucleotides) Edited by JOEL G. HARDMAN AND BERT W. O’MALLEY VOLUME XXXIX. Hormone Action (Part D: Isolated Cells, Tissues, and Organ Systems) Edited by JOEL G. HARDMAN AND BERT W. O’MALLEY VOLUME XL. Hormone Action (Part E: Nuclear Structure and Function) Edited by BERT W. O’MALLEY AND JOEL G. HARDMAN VOLUME XLI. Carbohydrate Metabolism (Part B) Edited by W. A. WOOD VOLUME XLII. Carbohydrate Metabolism (Part C) Edited by W. A. WOOD VOLUME XLIII. Antibiotics Edited by JOHN H. HASH VOLUME XLIV. Immobilized Enzymes Edited by KLAUS MOSBACH VOLUME XLV. Proteolytic Enzymes (Part B) Edited by LASZLO LORAND VOLUME XLVI. Affinity Labeling Edited by WILLIAM B. JAKOBY AND MEIR WILCHEK VOLUME XLVII. Enzyme Structure (Part E) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME XLVIII. Enzyme Structure (Part F) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME XLIX. Enzyme Structure (Part G) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME L. Complex Carbohydrates (Part C) Edited by VICTOR GINSBURG VOLUME LI. Purine and Pyrimidine Nucleotide Metabolism Edited by PATRICIA A. HOFFEE AND MARY ELLEN JONES VOLUME LII. Biomembranes (Part C: Biological Oxidations) Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LIII. Biomembranes (Part D: Biological Oxidations) Edited by SIDNEY FLEISCHER AND LESTER PACKER VOLUME LIV. Biomembranes (Part E: Biological Oxidations) Edited by SIDNEY FLEISCHER AND LESTER PACKER VOLUME LV. Biomembranes (Part F: Bioenergetics) Edited by SIDNEY FLEISCHER AND LESTER PACKER VOLUME LVI. Biomembranes (Part G: Bioenergetics) Edited by SIDNEY FLEISCHER AND LESTER PACKER VOLUME LVII. Bioluminescence and Chemiluminescence Edited by MARLENE A. DELUCA VOLUME LVIII. Cell Culture Edited by WILLIAM B. JAKOBY AND IRA PASTAN VOLUME LIX. Nucleic Acids and Protein Synthesis (Part G) Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN VOLUME LX. Nucleic Acids and Protein Synthesis (Part H) Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN VOLUME 61. Enzyme Structure (Part H) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME 62. Vitamins and Coenzymes (Part D) Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT VOLUME 63. Enzyme Kinetics and Mechanism (Part A: Initial Rate and Inhibitor Methods) Edited by DANIEL L. PURICH VOLUME 64. Enzyme Kinetics and Mechanism (Part B: Isotopic Probes and Complex Enzyme Systems) Edited by DANIEL L. PURICH VOLUME 65. Nucleic Acids (Part I) Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE VOLUME 66. Vitamins and Coenzymes (Part E) Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT VOLUME 67. Vitamins and Coenzymes (Part F) Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT VOLUME 68. Recombinant DNA Edited by RAY WU VOLUME 69. Photosynthesis and Nitrogen Fixation (Part C) Edited by ANTHONY SAN PIETRO VOLUME 70. Immunochemical Techniques (Part A) Edited by HELEN VAN VUNAKIS AND JOHN J. LANGONE
VOLUME 71. Lipids (Part C) Edited by JOHN M. LOWENSTEIN VOLUME 72. Lipids (Part D) Edited by JOHN M. LOWENSTEIN VOLUME 73. Immunochemical Techniques (Part B) Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS VOLUME 74. Immunochemical Techniques (Part C) Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS VOLUME 75. Cumulative Subject Index Volumes XXXI, XXXII, XXXIV–LX Edited by EDWARD A. DENNIS AND MARTHA G. DENNIS VOLUME 76. Hemoglobins Edited by ERALDO ANTONINI, LUIGI ROSSI-BERNARDI, AND EMILIA CHIANCONE VOLUME 77. Detoxication and Drug Metabolism Edited by WILLIAM B. JAKOBY VOLUME 78. Interferons (Part A) Edited by SIDNEY PESTKA VOLUME 79. Interferons (Part B) Edited by SIDNEY PESTKA VOLUME 80. Proteolytic Enzymes (Part C) Edited by LASZLO LORAND VOLUME 81. Biomembranes (Part H: Visual Pigments and Purple Membranes, I) Edited by LESTER PACKER VOLUME 82. Structural and Contractile Proteins (Part A: Extracellular Matrix) Edited by LEON W. CUNNINGHAM AND DIXIE W. FREDERIKSEN VOLUME 83. Complex Carbohydrates (Part D) Edited by VICTOR GINSBURG VOLUME 84. Immunochemical Techniques (Part D: Selected Immunoassays) Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS VOLUME 85. Structural and Contractile Proteins (Part B: The Contractile Apparatus and the Cytoskeleton) Edited by DIXIE W. FREDERIKSEN AND LEON W. CUNNINGHAM VOLUME 86. Prostaglandins and Arachidonate Metabolites Edited by WILLIAM E. M. LANDS AND WILLIAM L. SMITH VOLUME 87. Enzyme Kinetics and Mechanism (Part C: Intermediates, Stereo-chemistry, and Rate Studies) Edited by DANIEL L. PURICH VOLUME 88. Biomembranes (Part I: Visual Pigments and Purple Membranes, II) Edited by LESTER PACKER
VOLUME 89. Carbohydrate Metabolism (Part D) Edited by WILLIS A. WOOD VOLUME 90. Carbohydrate Metabolism (Part E) Edited by WILLIS A. WOOD VOLUME 91. Enzyme Structure (Part I) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME 92. Immunochemical Techniques (Part E: Monoclonal Antibodies and General Immunoassay Methods) Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS VOLUME 93. Immunochemical Techniques (Part F: Conventional Antibodies, Fc Receptors, and Cytotoxicity) Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS VOLUME 94. Polyamines Edited by HERBERT TABOR AND CELIA WHITE TABOR VOLUME 95. Cumulative Subject Index Volumes 61–74, 76–80 Edited by EDWARD A. DENNIS AND MARTHA G. DENNIS VOLUME 96. Biomembranes [Part J: Membrane Biogenesis: Assembly and Targeting (General Methods; Eukaryotes)] Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 97. Biomembranes [Part K: Membrane Biogenesis: Assembly and Targeting (Prokaryotes, Mitochondria, and Chloroplasts)] Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 98. Biomembranes (Part L: Membrane Biogenesis: Processing and Recycling) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 99. Hormone Action (Part F: Protein Kinases) Edited by JACKIE D. CORBIN AND JOEL G. HARDMAN VOLUME 100. Recombinant DNA (Part B) Edited by RAY WU, LAWRENCE GROSSMAN, AND KIVIE MOLDAVE VOLUME 101. Recombinant DNA (Part C) Edited by RAY WU, LAWRENCE GROSSMAN, AND KIVIE MOLDAVE VOLUME 102. Hormone Action (Part G: Calmodulin and Calcium-Binding Proteins) Edited by ANTHONY R. MEANS AND BERT W. O’MALLEY VOLUME 103. Hormone Action (Part H: Neuroendocrine Peptides) Edited by P. MICHAEL CONN VOLUME 104. Enzyme Purification and Related Techniques (Part C) Edited by WILLIAM B. JAKOBY
VOLUME 105. Oxygen Radicals in Biological Systems Edited by LESTER PACKER VOLUME 106. Posttranslational Modifications (Part A) Edited by FINN WOLD AND KIVIE MOLDAVE VOLUME 107. Posttranslational Modifications (Part B) Edited by FINN WOLD AND KIVIE MOLDAVE VOLUME 108. Immunochemical Techniques (Part G: Separation and Characterization of Lymphoid Cells) Edited by GIOVANNI DI SABATO, JOHN J. LANGONE, AND HELEN VAN VUNAKIS VOLUME 109. Hormone Action (Part I: Peptide Hormones) Edited by LUTZ BIRNBAUMER AND BERT W. O’MALLEY VOLUME 110. Steroids and Isoprenoids (Part A) Edited by JOHN H. LAW AND HANS C. RILLING VOLUME 111. Steroids and Isoprenoids (Part B) Edited by JOHN H. LAW AND HANS C. RILLING VOLUME 112. Drug and Enzyme Targeting (Part A) Edited by KENNETH J. WIDDER AND RALPH GREEN VOLUME 113. Glutamate, Glutamine, Glutathione, and Related Compounds Edited by ALTON MEISTER VOLUME 114. Diffraction Methods for Biological Macromolecules (Part A) Edited by HAROLD W. WYCKOFF, C. H. W. HIRS, AND SERGE N. TIMASHEFF VOLUME 115. Diffraction Methods for Biological Macromolecules (Part B) Edited by HAROLD W. WYCKOFF, C. H. W. HIRS, AND SERGE N. TIMASHEFF VOLUME 116. Immunochemical Techniques (Part H: Effectors and Mediators of Lymphoid Cell Functions) Edited by GIOVANNI DI SABATO, JOHN J. LANGONE, AND HELEN VAN VUNAKIS VOLUME 117. Enzyme Structure (Part J) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME 118. Plant Molecular Biology Edited by ARTHUR WEISSBACH AND HERBERT WEISSBACH VOLUME 119. Interferons (Part C) Edited by SIDNEY PESTKA VOLUME 120. Cumulative Subject Index Volumes 81–94, 96–101 VOLUME 121. Immunochemical Techniques (Part I: Hybridoma Technology and Monoclonal Antibodies) Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS VOLUME 122. Vitamins and Coenzymes (Part G) Edited by FRANK CHYTIL AND DONALD B. MCCORMICK
VOLUME 123. Vitamins and Coenzymes (Part H) Edited by FRANK CHYTIL AND DONALD B. MCCORMICK VOLUME 124. Hormone Action (Part J: Neuroendocrine Peptides) Edited by P. MICHAEL CONN VOLUME 125. Biomembranes (Part M: Transport in Bacteria, Mitochondria, and Chloroplasts: General Approaches and Transport Systems) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 126. Biomembranes (Part N: Transport in Bacteria, Mitochondria, and Chloroplasts: Protonmotive Force) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 127. Biomembranes (Part O: Protons and Water: Structure and Translocation) Edited by LESTER PACKER VOLUME 128. Plasma Lipoproteins (Part A: Preparation, Structure, and Molecular Biology) Edited by JERE P. SEGREST AND JOHN J. ALBERS VOLUME 129. Plasma Lipoproteins (Part B: Characterization, Cell Biology, and Metabolism) Edited by JOHN J. ALBERS AND JERE P. SEGREST VOLUME 130. Enzyme Structure (Part K) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME 131. Enzyme Structure (Part L) Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF VOLUME 132. Immunochemical Techniques (Part J: Phagocytosis and Cell-Mediated Cytotoxicity) Edited by GIOVANNI DI SABATO AND JOHANNES EVERSE VOLUME 133. Bioluminescence and Chemiluminescence (Part B) Edited by MARLENE DELUCA AND WILLIAM D. MCELROY VOLUME 134. Structural and Contractile Proteins (Part C: The Contractile Apparatus and the Cytoskeleton) Edited by RICHARD B. VALLEE VOLUME 135. Immobilized Enzymes and Cells (Part B) Edited by KLAUS MOSBACH VOLUME 136. Immobilized Enzymes and Cells (Part C) Edited by KLAUS MOSBACH VOLUME 137. Immobilized Enzymes and Cells (Part D) Edited by KLAUS MOSBACH VOLUME 138. Complex Carbohydrates (Part E) Edited by VICTOR GINSBURG
VOLUME 139. Cellular Regulators (Part A: Calcium- and Calmodulin-Binding Proteins) Edited by ANTHONY R. MEANS AND P. MICHAEL CONN VOLUME 140. Cumulative Subject Index Volumes 102–119, 121–134 VOLUME 141. Cellular Regulators (Part B: Calcium and Lipids) Edited by P. MICHAEL CONN AND ANTHONY R. MEANS VOLUME 142. Metabolism of Aromatic Amino Acids and Amines Edited by SEYMOUR KAUFMAN VOLUME 143. Sulfur and Sulfur Amino Acids Edited by WILLIAM B. JAKOBY AND OWEN GRIFFITH VOLUME 144. Structural and Contractile Proteins (Part D: Extracellular Matrix) Edited by LEON W. CUNNINGHAM VOLUME 145. Structural and Contractile Proteins (Part E: Extracellular Matrix) Edited by LEON W. CUNNINGHAM VOLUME 146. Peptide Growth Factors (Part A) Edited by DAVID BARNES AND DAVID A. SIRBASKU VOLUME 147. Peptide Growth Factors (Part B) Edited by DAVID BARNES AND DAVID A. SIRBASKU VOLUME 148. Plant Cell Membranes Edited by LESTER PACKER AND ROLAND DOUCE VOLUME 149. Drug and Enzyme Targeting (Part B) Edited by RALPH GREEN AND KENNETH J. WIDDER VOLUME 150. Immunochemical Techniques (Part K: In Vitro Models of B and T Cell Functions and Lymphoid Cell Receptors) Edited by GIOVANNI DI SABATO VOLUME 151. Molecular Genetics of Mammalian Cells Edited by MICHAEL M. GOTTESMAN VOLUME 152. Guide to Molecular Cloning Techniques Edited by SHELBY L. BERGER AND ALAN R. KIMMEL VOLUME 153. Recombinant DNA (Part D) Edited by RAY WU AND LAWRENCE GROSSMAN VOLUME 154. Recombinant DNA (Part E) Edited by RAY WU AND LAWRENCE GROSSMAN VOLUME 155. Recombinant DNA (Part F) Edited by RAY WU VOLUME 156. Biomembranes (Part P: ATP-Driven Pumps and Related Transport: The Na, K-Pump) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 157. Biomembranes (Part Q: ATP-Driven Pumps and Related Transport: Calcium, Proton, and Potassium Pumps) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 158. Metalloproteins (Part A) Edited by JAMES F. RIORDAN AND BERT L. VALLEE VOLUME 159. Initiation and Termination of Cyclic Nucleotide Action Edited by JACKIE D. CORBIN AND ROGER A. JOHNSON VOLUME 160. Biomass (Part A: Cellulose and Hemicellulose) Edited by WILLIS A. WOOD AND SCOTT T. KELLOGG VOLUME 161. Biomass (Part B: Lignin, Pectin, and Chitin) Edited by WILLIS A. WOOD AND SCOTT T. KELLOGG VOLUME 162. Immunochemical Techniques (Part L: Chemotaxis and Inflammation) Edited by GIOVANNI DI SABATO VOLUME 163. Immunochemical Techniques (Part M: Chemotaxis and Inflammation) Edited by GIOVANNI DI SABATO VOLUME 164. Ribosomes Edited by HARRY F. NOLLER, JR., AND KIVIE MOLDAVE VOLUME 165. Microbial Toxins: Tools for Enzymology Edited by SIDNEY HARSHMAN VOLUME 166. Branched-Chain Amino Acids Edited by ROBERT HARRIS AND JOHN R. SOKATCH VOLUME 167. Cyanobacteria Edited by LESTER PACKER AND ALEXANDER N. GLAZER VOLUME 168. Hormone Action (Part K: Neuroendocrine Peptides) Edited by P. MICHAEL CONN VOLUME 169. Platelets: Receptors, Adhesion, Secretion (Part A) Edited by JACEK HAWIGER VOLUME 170. Nucleosomes Edited by PAUL M. WASSARMAN AND ROGER D. KORNBERG VOLUME 171. Biomembranes (Part R: Transport Theory: Cells and Model Membranes) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 172. Biomembranes (Part S: Transport: Membrane Isolation and Characterization) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 173. Biomembranes [Part T: Cellular and Subcellular Transport: Eukaryotic (Nonepithelial) Cells] Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 174. Biomembranes [Part U: Cellular and Subcellular Transport: Eukaryotic (Nonepithelial) Cells] Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 175. Cumulative Subject Index Volumes 135–139, 141–167 VOLUME 176. Nuclear Magnetic Resonance (Part A: Spectral Techniques and Dynamics) Edited by NORMAN J. OPPENHEIMER AND THOMAS L. JAMES VOLUME 177. Nuclear Magnetic Resonance (Part B: Structure and Mechanism) Edited by NORMAN J. OPPENHEIMER AND THOMAS L. JAMES VOLUME 178. Antibodies, Antigens, and Molecular Mimicry Edited by JOHN J. LANGONE VOLUME 179. Complex Carbohydrates (Part F) Edited by VICTOR GINSBURG VOLUME 180. RNA Processing (Part A: General Methods) Edited by JAMES E. DAHLBERG AND JOHN N. ABELSON VOLUME 181. RNA Processing (Part B: Specific Methods) Edited by JAMES E. DAHLBERG AND JOHN N. ABELSON VOLUME 182. Guide to Protein Purification Edited by MURRAY P. DEUTSCHER VOLUME 183. Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences Edited by RUSSELL F. DOOLITTLE VOLUME 184. Avidin-Biotin Technology Edited by MEIR WILCHEK AND EDWARD A. BAYER VOLUME 185. Gene Expression Technology Edited by DAVID V. GOEDDEL VOLUME 186. Oxygen Radicals in Biological Systems (Part B: Oxygen Radicals and Antioxidants) Edited by LESTER PACKER AND ALEXANDER N. GLAZER VOLUME 187. Arachidonate Related Lipid Mediators Edited by ROBERT C. MURPHY AND FRANK A. FITZPATRICK VOLUME 188. Hydrocarbons and Methylotrophy Edited by MARY E. LIDSTROM VOLUME 189. Retinoids (Part A: Molecular and Metabolic Aspects) Edited by LESTER PACKER
VOLUME 190. Retinoids (Part B: Cell Differentiation and Clinical Applications) Edited by LESTER PACKER VOLUME 191. Biomembranes (Part V: Cellular and Subcellular Transport: Epithelial Cells) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 192. Biomembranes (Part W: Cellular and Subcellular Transport: Epithelial Cells) Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER VOLUME 193. Mass Spectrometry Edited by JAMES A. MCCLOSKEY VOLUME 194. Guide to Yeast Genetics and Molecular Biology Edited by CHRISTINE GUTHRIE AND GERALD R. FINK VOLUME 195. Adenylyl Cyclase, G Proteins, and Guanylyl Cyclase Edited by ROGER A. JOHNSON AND JACKIE D. CORBIN VOLUME 196. Molecular Motors and the Cytoskeleton Edited by RICHARD B. VALLEE VOLUME 197. Phospholipases Edited by EDWARD A. DENNIS VOLUME 198. Peptide Growth Factors (Part C) Edited by DAVID BARNES, J. P. MATHER, AND GORDON H. SATO VOLUME 199. Cumulative Subject Index Volumes 168–174, 176–194 VOLUME 200. Protein Phosphorylation (Part A: Protein Kinases: Assays, Purification, Antibodies, Functional Analysis, Cloning, and Expression) Edited by TONY HUNTER AND BARTHOLOMEW M. SEFTON VOLUME 201. Protein Phosphorylation (Part B: Analysis of Protein Phosphorylation, Protein Kinase Inhibitors, and Protein Phosphatases) Edited by TONY HUNTER AND BARTHOLOMEW M. SEFTON VOLUME 202. Molecular Design and Modeling: Concepts and Applications (Part A: Proteins, Peptides, and Enzymes) Edited by JOHN J. LANGONE VOLUME 203. Molecular Design and Modeling: Concepts and Applications (Part B: Antibodies and Antigens, Nucleic Acids, Polysaccharides, and Drugs) Edited by JOHN J. LANGONE VOLUME 204. Bacterial Genetic Systems Edited by JEFFREY H. MILLER VOLUME 205. Metallobiochemistry (Part B: Metallothionein and Related Molecules) Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 206. Cytochrome P450 Edited by MICHAEL R. WATERMAN AND ERIC F. JOHNSON VOLUME 207. Ion Channels Edited by BERNARDO RUDY AND LINDA E. IVERSON VOLUME 208. Protein–DNA Interactions Edited by ROBERT T. SAUER VOLUME 209. Phospholipid Biosynthesis Edited by EDWARD A. DENNIS AND DENNIS E. VANCE VOLUME 210. Numerical Computer Methods Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 211. DNA Structures (Part A: Synthesis and Physical Analysis of DNA) Edited by DAVID M. J. LILLEY AND JAMES E. DAHLBERG VOLUME 212. DNA Structures (Part B: Chemical and Electrophoretic Analysis of DNA) Edited by DAVID M. J. LILLEY AND JAMES E. DAHLBERG VOLUME 213. Carotenoids (Part A: Chemistry, Separation, Quantitation, and Antioxidation) Edited by LESTER PACKER VOLUME 214. Carotenoids (Part B: Metabolism, Genetics, and Biosynthesis) Edited by LESTER PACKER VOLUME 215. Platelets: Receptors, Adhesion, Secretion (Part B) Edited by JACEK J. HAWIGER VOLUME 216. Recombinant DNA (Part G) Edited by RAY WU VOLUME 217. Recombinant DNA (Part H) Edited by RAY WU VOLUME 218. Recombinant DNA (Part I) Edited by RAY WU VOLUME 219. Reconstitution of Intracellular Transport Edited by JAMES E. ROTHMAN VOLUME 220. Membrane Fusion Techniques (Part A) Edited by NEJAT DÜZGÜNEŞ VOLUME 221. Membrane Fusion Techniques (Part B) Edited by NEJAT DÜZGÜNEŞ VOLUME 222. Proteolytic Enzymes in Coagulation, Fibrinolysis, and Complement Activation (Part A: Mammalian Blood Coagulation Factors and Inhibitors) Edited by LASZLO LORAND AND KENNETH G. MANN
VOLUME 223. Proteolytic Enzymes in Coagulation, Fibrinolysis, and Complement Activation (Part B: Complement Activation, Fibrinolysis, and Nonmammalian Blood Coagulation Factors) Edited by LASZLO LORAND AND KENNETH G. MANN VOLUME 224. Molecular Evolution: Producing the Biochemical Data Edited by ELIZABETH ANNE ZIMMER, THOMAS J. WHITE, REBECCA L. CANN, AND ALLAN C. WILSON VOLUME 225. Guide to Techniques in Mouse Development Edited by PAUL M. WASSARMAN AND MELVIN L. DEPAMPHILIS VOLUME 226. Metallobiochemistry (Part C: Spectroscopic and Physical Methods for Probing Metal Ion Environments in Metalloenzymes and Metalloproteins) Edited by JAMES F. RIORDAN AND BERT L. VALLEE VOLUME 227. Metallobiochemistry (Part D: Physical and Spectroscopic Methods for Probing Metal Ion Environments in Metalloproteins) Edited by JAMES F. RIORDAN AND BERT L. VALLEE VOLUME 228. Aqueous Two-Phase Systems Edited by HARRY WALTER AND GÖTE JOHANSSON VOLUME 229. Cumulative Subject Index Volumes 195–198, 200–227 VOLUME 230. Guide to Techniques in Glycobiology Edited by WILLIAM J. LENNARZ AND GERALD W. HART VOLUME 231. Hemoglobins (Part B: Biochemical and Analytical Methods) Edited by JOHANNES EVERSE, KIM D. VANDEGRIFF, AND ROBERT M. WINSLOW VOLUME 232. Hemoglobins (Part C: Biophysical Methods) Edited by JOHANNES EVERSE, KIM D. VANDEGRIFF, AND ROBERT M. WINSLOW VOLUME 233. Oxygen Radicals in Biological Systems (Part C) Edited by LESTER PACKER VOLUME 234. Oxygen Radicals in Biological Systems (Part D) Edited by LESTER PACKER VOLUME 235. Bacterial Pathogenesis (Part A: Identification and Regulation of Virulence Factors) Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL VOLUME 236. Bacterial Pathogenesis (Part B: Integration of Pathogenic Bacteria with Host Cells) Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL VOLUME 237. Heterotrimeric G Proteins Edited by RAVI IYENGAR VOLUME 238. Heterotrimeric G-Protein Effectors Edited by RAVI IYENGAR
VOLUME 239. Nuclear Magnetic Resonance (Part C) Edited by THOMAS L. JAMES AND NORMAN J. OPPENHEIMER VOLUME 240. Numerical Computer Methods (Part B) Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 241. Retroviral Proteases Edited by LAWRENCE C. KUO AND JULES A. SHAFER VOLUME 242. Neoglycoconjugates (Part A) Edited by Y. C. LEE AND REIKO T. LEE VOLUME 243. Inorganic Microbial Sulfur Metabolism Edited by HARRY D. PECK, JR., AND JEAN LEGALL VOLUME 244. Proteolytic Enzymes: Serine and Cysteine Peptidases Edited by ALAN J. BARRETT VOLUME 245. Extracellular Matrix Components Edited by E. RUOSLAHTI AND E. ENGVALL VOLUME 246. Biochemical Spectroscopy Edited by KENNETH SAUER VOLUME 247. Neoglycoconjugates (Part B: Biomedical Applications) Edited by Y. C. LEE AND REIKO T. LEE VOLUME 248. Proteolytic Enzymes: Aspartic and Metallo Peptidases Edited by ALAN J. BARRETT VOLUME 249. Enzyme Kinetics and Mechanism (Part D: Developments in Enzyme Dynamics) Edited by DANIEL L. PURICH VOLUME 250. Lipid Modifications of Proteins Edited by PATRICK J. CASEY AND JANICE E. BUSS VOLUME 251. Biothiols (Part A: Monothiols and Dithiols, Protein Thiols, and Thiyl Radicals) Edited by LESTER PACKER VOLUME 252. Biothiols (Part B: Glutathione and Thioredoxin; Thiols in Signal Transduction and Gene Regulation) Edited by LESTER PACKER VOLUME 253. Adhesion of Microbial Pathogens Edited by RON J. DOYLE AND ITZHAK OFEK VOLUME 254. Oncogene Techniques Edited by PETER K. VOGT AND INDER M. VERMA VOLUME 255. Small GTPases and Their Regulators (Part A: Ras Family) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 256. Small GTPases and Their Regulators (Part B: Rho Family) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 257. Small GTPases and Their Regulators (Part C: Proteins Involved in Transport) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 258. Redox-Active Amino Acids in Biology Edited by JUDITH P. KLINMAN VOLUME 259. Energetics of Biological Macromolecules Edited by MICHAEL L. JOHNSON AND GARY K. ACKERS VOLUME 260. Mitochondrial Biogenesis and Genetics (Part A) Edited by GIUSEPPE M. ATTARDI AND ANNE CHOMYN VOLUME 261. Nuclear Magnetic Resonance and Nucleic Acids Edited by THOMAS L. JAMES VOLUME 262. DNA Replication Edited by JUDITH L. CAMPBELL VOLUME 263. Plasma Lipoproteins (Part C: Quantitation) Edited by WILLIAM A. BRADLEY, SANDRA H. GIANTURCO, AND JERE P. SEGREST VOLUME 264. Mitochondrial Biogenesis and Genetics (Part B) Edited by GIUSEPPE M. ATTARDI AND ANNE CHOMYN VOLUME 265. Cumulative Subject Index Volumes 228, 230–262 VOLUME 266. Computer Methods for Macromolecular Sequence Analysis Edited by RUSSELL F. DOOLITTLE VOLUME 267. Combinatorial Chemistry Edited by JOHN N. ABELSON VOLUME 268. Nitric Oxide (Part A: Sources and Detection of NO; NO Synthase) Edited by LESTER PACKER VOLUME 269. Nitric Oxide (Part B: Physiological and Pathological Processes) Edited by LESTER PACKER VOLUME 270. High Resolution Separation and Analysis of Biological Macromolecules (Part A: Fundamentals) Edited by BARRY L. KARGER AND WILLIAM S. HANCOCK VOLUME 271. High Resolution Separation and Analysis of Biological Macromolecules (Part B: Applications) Edited by BARRY L. KARGER AND WILLIAM S. HANCOCK VOLUME 272. Cytochrome P450 (Part B) Edited by ERIC F. JOHNSON AND MICHAEL R. WATERMAN VOLUME 273. RNA Polymerase and Associated Factors (Part A) Edited by SANKAR ADHYA
VOLUME 274. RNA Polymerase and Associated Factors (Part B) Edited by SANKAR ADHYA VOLUME 275. Viral Polymerases and Related Proteins Edited by LAWRENCE C. KUO, DAVID B. OLSEN, AND STEVEN S. CARROLL VOLUME 276. Macromolecular Crystallography (Part A) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 277. Macromolecular Crystallography (Part B) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 278. Fluorescence Spectroscopy Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 279. Vitamins and Coenzymes (Part I) Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER VOLUME 280. Vitamins and Coenzymes (Part J) Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER VOLUME 281. Vitamins and Coenzymes (Part K) Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER VOLUME 282. Vitamins and Coenzymes (Part L) Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER VOLUME 283. Cell Cycle Control Edited by WILLIAM G. DUNPHY VOLUME 284. Lipases (Part A: Biotechnology) Edited by BYRON RUBIN AND EDWARD A. DENNIS VOLUME 285. Cumulative Subject Index Volumes 263, 264, 266–284, 286–289 VOLUME 286. Lipases (Part B: Enzyme Characterization and Utilization) Edited by BYRON RUBIN AND EDWARD A. DENNIS VOLUME 287. Chemokines Edited by RICHARD HORUK VOLUME 288. Chemokine Receptors Edited by RICHARD HORUK VOLUME 289. Solid Phase Peptide Synthesis Edited by GREGG B. FIELDS VOLUME 290. Molecular Chaperones Edited by GEORGE H. LORIMER AND THOMAS BALDWIN VOLUME 291. Caged Compounds Edited by GERARD MARRIOTT VOLUME 292. ABC Transporters: Biochemical, Cellular, and Molecular Aspects Edited by SURESH V. AMBUDKAR AND MICHAEL M. GOTTESMAN
VOLUME 293. Ion Channels (Part B) Edited by P. MICHAEL CONN VOLUME 294. Ion Channels (Part C) Edited by P. MICHAEL CONN VOLUME 295. Energetics of Biological Macromolecules (Part B) Edited by GARY K. ACKERS AND MICHAEL L. JOHNSON VOLUME 296. Neurotransmitter Transporters Edited by SUSAN G. AMARA VOLUME 297. Photosynthesis: Molecular Biology of Energy Capture Edited by LEE MCINTOSH VOLUME 298. Molecular Motors and the Cytoskeleton (Part B) Edited by RICHARD B. VALLEE VOLUME 299. Oxidants and Antioxidants (Part A) Edited by LESTER PACKER VOLUME 300. Oxidants and Antioxidants (Part B) Edited by LESTER PACKER VOLUME 301. Nitric Oxide: Biological and Antioxidant Activities (Part C) Edited by LESTER PACKER VOLUME 302. Green Fluorescent Protein Edited by P. MICHAEL CONN VOLUME 303. cDNA Preparation and Display Edited by SHERMAN M. WEISSMAN VOLUME 304. Chromatin Edited by PAUL M. WASSARMAN AND ALAN P. WOLFFE VOLUME 305. Bioluminescence and Chemiluminescence (Part C) Edited by THOMAS O. BALDWIN AND MIRIAM M. ZIEGLER VOLUME 306. Expression of Recombinant Genes in Eukaryotic Systems Edited by JOSEPH C. GLORIOSO AND MARTIN C. SCHMIDT VOLUME 307. Confocal Microscopy Edited by P. MICHAEL CONN VOLUME 308. Enzyme Kinetics and Mechanism (Part E: Energetics of Enzyme Catalysis) Edited by DANIEL L. PURICH AND VERN L. SCHRAMM VOLUME 309. Amyloid, Prions, and Other Protein Aggregates Edited by RONALD WETZEL VOLUME 310. Biofilms Edited by RON J. DOYLE
VOLUME 311. Sphingolipid Metabolism and Cell Signaling (Part A) Edited by ALFRED H. MERRILL, JR., AND YUSUF A. HANNUN VOLUME 312. Sphingolipid Metabolism and Cell Signaling (Part B) Edited by ALFRED H. MERRILL, JR., AND YUSUF A. HANNUN VOLUME 313. Antisense Technology (Part A: General Methods, Methods of Delivery, and RNA Studies) Edited by M. IAN PHILLIPS VOLUME 314. Antisense Technology (Part B: Applications) Edited by M. IAN PHILLIPS VOLUME 315. Vertebrate Phototransduction and the Visual Cycle (Part A) Edited by KRZYSZTOF PALCZEWSKI VOLUME 316. Vertebrate Phototransduction and the Visual Cycle (Part B) Edited by KRZYSZTOF PALCZEWSKI VOLUME 317. RNA–Ligand Interactions (Part A: Structural Biology Methods) Edited by DANIEL W. CELANDER AND JOHN N. ABELSON VOLUME 318. RNA–Ligand Interactions (Part B: Molecular Biology Methods) Edited by DANIEL W. CELANDER AND JOHN N. ABELSON VOLUME 319. Singlet Oxygen, UV-A, and Ozone Edited by LESTER PACKER AND HELMUT SIES VOLUME 320. Cumulative Subject Index Volumes 290–319 VOLUME 321. Numerical Computer Methods (Part C) Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 322. Apoptosis Edited by JOHN C. REED VOLUME 323. Energetics of Biological Macromolecules (Part C) Edited by MICHAEL L. JOHNSON AND GARY K. ACKERS VOLUME 324. Branched-Chain Amino Acids (Part B) Edited by ROBERT A. HARRIS AND JOHN R. SOKATCH VOLUME 325. Regulators and Effectors of Small GTPases (Part D: Rho Family) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 326. Applications of Chimeric Genes and Hybrid Proteins (Part A: Gene Expression and Protein Purification) Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON VOLUME 327. Applications of Chimeric Genes and Hybrid Proteins (Part B: Cell Biology and Physiology) Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON
VOLUME 328. Applications of Chimeric Genes and Hybrid Proteins (Part C: Protein–Protein Interactions and Genomics) Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON VOLUME 329. Regulators and Effectors of Small GTPases (Part E: GTPases Involved in Vesicular Traffic) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 330. Hyperthermophilic Enzymes (Part A) Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY VOLUME 331. Hyperthermophilic Enzymes (Part B) Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY VOLUME 332. Regulators and Effectors of Small GTPases (Part F: Ras Family I) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 333. Regulators and Effectors of Small GTPases (Part G: Ras Family II) Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 334. Hyperthermophilic Enzymes (Part C) Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY VOLUME 335. Flavonoids and Other Polyphenols Edited by LESTER PACKER VOLUME 336. Microbial Growth in Biofilms (Part A: Developmental and Molecular Biological Aspects) Edited by RON J. DOYLE VOLUME 337. Microbial Growth in Biofilms (Part B: Special Environments and Physicochemical Aspects) Edited by RON J. DOYLE VOLUME 338. Nuclear Magnetic Resonance of Biological Macromolecules (Part A) Edited by THOMAS L. JAMES, VOLKER DÖTSCH, AND ULI SCHMITZ VOLUME 339. Nuclear Magnetic Resonance of Biological Macromolecules (Part B) Edited by THOMAS L. JAMES, VOLKER DÖTSCH, AND ULI SCHMITZ VOLUME 340. Drug–Nucleic Acid Interactions Edited by JONATHAN B. CHAIRES AND MICHAEL J. WARING VOLUME 341. Ribonucleases (Part A) Edited by ALLEN W. NICHOLSON VOLUME 342. Ribonucleases (Part B) Edited by ALLEN W. NICHOLSON VOLUME 343. G Protein Pathways (Part A: Receptors) Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT VOLUME 344. G Protein Pathways (Part B: G Proteins and Their Regulators) Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT
VOLUME 345. G Protein Pathways (Part C: Effector Mechanisms) Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT VOLUME 346. Gene Therapy Methods Edited by M. IAN PHILLIPS VOLUME 347. Protein Sensors and Reactive Oxygen Species (Part A: Selenoproteins and Thioredoxin) Edited by HELMUT SIES AND LESTER PACKER VOLUME 348. Protein Sensors and Reactive Oxygen Species (Part B: Thiol Enzymes and Proteins) Edited by HELMUT SIES AND LESTER PACKER VOLUME 349. Superoxide Dismutase Edited by LESTER PACKER VOLUME 350. Guide to Yeast Genetics and Molecular and Cell Biology (Part B) Edited by CHRISTINE GUTHRIE AND GERALD R. FINK VOLUME 351. Guide to Yeast Genetics and Molecular and Cell Biology (Part C) Edited by CHRISTINE GUTHRIE AND GERALD R. FINK VOLUME 352. Redox Cell Biology and Genetics (Part A) Edited by CHANDAN K. SEN AND LESTER PACKER VOLUME 353. Redox Cell Biology and Genetics (Part B) Edited by CHANDAN K. SEN AND LESTER PACKER VOLUME 354. Enzyme Kinetics and Mechanisms (Part F: Detection and Characterization of Enzyme Reaction Intermediates) Edited by DANIEL L. PURICH VOLUME 355. Cumulative Subject Index Volumes 321–354 VOLUME 356. Laser Capture Microscopy and Microdissection Edited by P. MICHAEL CONN VOLUME 357. Cytochrome P450, Part C Edited by ERIC F. JOHNSON AND MICHAEL R. WATERMAN VOLUME 358. Bacterial Pathogenesis (Part C: Identification, Regulation, and Function of Virulence Factors) Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL VOLUME 359. Nitric Oxide (Part D) Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 360. Biophotonics (Part A) Edited by GERARD MARRIOTT AND IAN PARKER VOLUME 361. Biophotonics (Part B) Edited by GERARD MARRIOTT AND IAN PARKER
VOLUME 362. Recognition of Carbohydrates in Biological Systems (Part A) Edited by YUAN C. LEE AND REIKO T. LEE VOLUME 363. Recognition of Carbohydrates in Biological Systems (Part B) Edited by YUAN C. LEE AND REIKO T. LEE VOLUME 364. Nuclear Receptors Edited by DAVID W. RUSSELL AND DAVID J. MANGELSDORF VOLUME 365. Differentiation of Embryonic Stem Cells Edited by PAUL M. WASSARMAN AND GORDON M. KELLER VOLUME 366. Protein Phosphatases Edited by SUSANNE KLUMPP AND JOSEF KRIEGLSTEIN VOLUME 367. Liposomes (Part A) Edited by NEJAT DÜZGÜNEŞ VOLUME 368. Macromolecular Crystallography (Part C) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 369. Combinatorial Chemistry (Part B) Edited by GUILLERMO A. MORALES AND BARRY A. BUNIN VOLUME 370. RNA Polymerases and Associated Factors (Part C) Edited by SANKAR L. ADHYA AND SUSAN GARGES VOLUME 371. RNA Polymerases and Associated Factors (Part D) Edited by SANKAR L. ADHYA AND SUSAN GARGES VOLUME 372. Liposomes (Part B) Edited by NEJAT DÜZGÜNEŞ VOLUME 373. Liposomes (Part C) Edited by NEJAT DÜZGÜNEŞ VOLUME 374. Macromolecular Crystallography (Part D) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 375. Chromatin and Chromatin Remodeling Enzymes (Part A) Edited by C. DAVID ALLIS AND CARL WU VOLUME 376. Chromatin and Chromatin Remodeling Enzymes (Part B) Edited by C. DAVID ALLIS AND CARL WU VOLUME 377. Chromatin and Chromatin Remodeling Enzymes (Part C) Edited by C. DAVID ALLIS AND CARL WU VOLUME 378. Quinones and Quinone Enzymes (Part A) Edited by HELMUT SIES AND LESTER PACKER VOLUME 379. Energetics of Biological Macromolecules (Part D) Edited by JO M. HOLT, MICHAEL L. JOHNSON, AND GARY K. ACKERS VOLUME 380. Energetics of Biological Macromolecules (Part E) Edited by JO M. HOLT, MICHAEL L. JOHNSON, AND GARY K. ACKERS
VOLUME 381. Oxygen Sensing Edited by CHANDAN K. SEN AND GREGG L. SEMENZA VOLUME 382. Quinones and Quinone Enzymes (Part B) Edited by HELMUT SIES AND LESTER PACKER VOLUME 383. Numerical Computer Methods (Part D) Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 384. Numerical Computer Methods (Part E) Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 385. Imaging in Biological Research (Part A) Edited by P. MICHAEL CONN VOLUME 386. Imaging in Biological Research (Part B) Edited by P. MICHAEL CONN VOLUME 387. Liposomes (Part D) Edited by NEJAT DÜZGÜNEŞ VOLUME 388. Protein Engineering Edited by DAN E. ROBERTSON AND JOSEPH P. NOEL VOLUME 389. Regulators of G-Protein Signaling (Part A) Edited by DAVID P. SIDEROVSKI VOLUME 390. Regulators of G-Protein Signaling (Part B) Edited by DAVID P. SIDEROVSKI VOLUME 391. Liposomes (Part E) Edited by NEJAT DÜZGÜNEŞ VOLUME 392. RNA Interference Edited by ENGELKE ROSSI VOLUME 393. Circadian Rhythms Edited by MICHAEL W. YOUNG VOLUME 394. Nuclear Magnetic Resonance of Biological Macromolecules (Part C) Edited by THOMAS L. JAMES VOLUME 395. Producing the Biochemical Data (Part B) Edited by ELIZABETH A. ZIMMER AND ERIC H. ROALSON VOLUME 396. Nitric Oxide (Part E) Edited by LESTER PACKER AND ENRIQUE CADENAS VOLUME 397. Environmental Microbiology Edited by JARED R. LEADBETTER VOLUME 398. Ubiquitin and Protein Degradation (Part A) Edited by RAYMOND J. DESHAIES
VOLUME 399. Ubiquitin and Protein Degradation (Part B) Edited by RAYMOND J. DESHAIES VOLUME 400. Phase II Conjugation Enzymes and Transport Systems Edited by HELMUT SIES AND LESTER PACKER VOLUME 401. Glutathione Transferases and Gamma Glutamyl Transpeptidases Edited by HELMUT SIES AND LESTER PACKER VOLUME 402. Biological Mass Spectrometry Edited by A. L. BURLINGAME VOLUME 403. GTPases Regulating Membrane Targeting and Fusion Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 404. GTPases Regulating Membrane Dynamics Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 405. Mass Spectrometry: Modified Proteins and Glycoconjugates Edited by A. L. BURLINGAME VOLUME 406. Regulators and Effectors of Small GTPases: Rho Family Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 407. Regulators and Effectors of Small GTPases: Ras Family Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 408. DNA Repair (Part A) Edited by JUDITH L. CAMPBELL AND PAUL MODRICH VOLUME 409. DNA Repair (Part B) Edited by JUDITH L. CAMPBELL AND PAUL MODRICH VOLUME 410. DNA Microarrays (Part A: Array Platforms and Web-Bench Protocols) Edited by ALAN KIMMEL AND BRIAN OLIVER VOLUME 411. DNA Microarrays (Part B: Databases and Statistics) Edited by ALAN KIMMEL AND BRIAN OLIVER VOLUME 412. Amyloid, Prions, and Other Protein Aggregates (Part B) Edited by INDU KHETERPAL AND RONALD WETZEL VOLUME 413. Amyloid, Prions, and Other Protein Aggregates (Part C) Edited by INDU KHETERPAL AND RONALD WETZEL VOLUME 414. Measuring Biological Responses with Automated Microscopy Edited by JAMES INGLESE VOLUME 415. Glycobiology Edited by MINORU FUKUDA VOLUME 416. Glycomics Edited by MINORU FUKUDA
VOLUME 417. Functional Glycomics Edited by MINORU FUKUDA VOLUME 418. Embryonic Stem Cells Edited by IRINA KLIMANSKAYA AND ROBERT LANZA VOLUME 419. Adult Stem Cells Edited by IRINA KLIMANSKAYA AND ROBERT LANZA VOLUME 420. Stem Cell Tools and Other Experimental Protocols Edited by IRINA KLIMANSKAYA AND ROBERT LANZA VOLUME 421. Advanced Bacterial Genetics: Use of Transposons and Phage for Genomic Engineering Edited by KELLY T. HUGHES VOLUME 422. Two-Component Signaling Systems, Part A Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE VOLUME 423. Two-Component Signaling Systems, Part B Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE VOLUME 424. RNA Editing Edited by JONATHA M. GOTT VOLUME 425. RNA Modification Edited by JONATHA M. GOTT VOLUME 426. Integrins Edited by DAVID CHERESH VOLUME 427. MicroRNA Methods Edited by JOHN J. ROSSI VOLUME 428. Osmosensing and Osmosignaling Edited by HELMUT SIES AND DIETER HAUSSINGER VOLUME 429. Translation Initiation: Extract Systems and Molecular Genetics Edited by JON LORSCH VOLUME 430. Translation Initiation: Reconstituted Systems and Biophysical Methods Edited by JON LORSCH VOLUME 431. Translation Initiation: Cell Biology, High-Throughput and Chemical-Based Approaches Edited by JON LORSCH VOLUME 432. Lipidomics and Bioactive Lipids: Mass-Spectrometry–Based Lipid Analysis Edited by H. ALEX BROWN
VOLUME 433. Lipidomics and Bioactive Lipids: Specialized Analytical Methods and Lipids in Disease Edited by H. ALEX BROWN VOLUME 434. Lipidomics and Bioactive Lipids: Lipids and Cell Signaling Edited by H. ALEX BROWN VOLUME 435. Oxygen Biology and Hypoxia Edited by HELMUT SIES AND BERNHARD BRÜNE VOLUME 436. Globins and Other Nitric Oxide-Reactive Proteins (Part A) Edited by ROBERT K. POOLE VOLUME 437. Globins and Other Nitric Oxide-Reactive Proteins (Part B) Edited by ROBERT K. POOLE VOLUME 438. Small GTPases in Disease (Part A) Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 439. Small GTPases in Disease (Part B) Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL VOLUME 440. Nitric Oxide, Part F Oxidative and Nitrosative Stress in Redox Regulation of Cell Signaling Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 441. Nitric Oxide, Part G Oxidative and Nitrosative Stress in Redox Regulation of Cell Signaling Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 442. Programmed Cell Death, General Principles for Studying Cell Death (Part A) Edited by ROYA KHOSRAVI-FAR, ZAHRA ZAKERI, RICHARD A. LOCKSHIN, AND MAURO PIACENTINI VOLUME 443. Angiogenesis: In Vitro Systems Edited by DAVID A. CHERESH VOLUME 444. Angiogenesis: In Vivo Systems (Part A) Edited by DAVID A. CHERESH VOLUME 445. Angiogenesis: In Vivo Systems (Part B) Edited by DAVID A. CHERESH VOLUME 446. Programmed Cell Death, The Biology and Therapeutic Implications of Cell Death (Part B) Edited by ROYA KHOSRAVI-FAR, ZAHRA ZAKERI, RICHARD A. LOCKSHIN, AND MAURO PIACENTINI VOLUME 447. RNA Turnover in Bacteria, Archaea and Organelles Edited by LYNNE E. MAQUAT AND CECILIA M. ARRAIANO
VOLUME 448. RNA Turnover in Eukaryotes: Nucleases, Pathways and Analysis of mRNA Decay Edited by LYNNE E. MAQUAT AND MEGERDITCH KILEDJIAN VOLUME 449. RNA Turnover in Eukaryotes: Analysis of Specialized and Quality Control RNA Decay Pathways Edited by LYNNE E. MAQUAT AND MEGERDITCH KILEDJIAN VOLUME 450. Fluorescence Spectroscopy Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON VOLUME 451. Autophagy: Lower Eukaryotes and Non-Mammalian Systems (Part A) Edited by DANIEL J. KLIONSKY VOLUME 452. Autophagy in Mammalian Systems (Part B) Edited by DANIEL J. KLIONSKY VOLUME 453. Autophagy in Disease and Clinical Applications (Part C) Edited by DANIEL J. KLIONSKY VOLUME 454. Computer Methods (Part A) Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 455. Biothermodynamics (Part A) Edited by MICHAEL L. JOHNSON, JO M. HOLT, AND GARY K. ACKERS (RETIRED) VOLUME 456. Mitochondrial Function, Part A: Mitochondrial Electron Transport Complexes and Reactive Oxygen Species Edited by WILLIAM S. ALLISON AND IMMO E. SCHEFFLER VOLUME 457. Mitochondrial Function, Part B: Mitochondrial Protein Kinases, Protein Phosphatases and Mitochondrial Diseases Edited by WILLIAM S. ALLISON AND ANNE N. MURPHY VOLUME 458. Complex Enzymes in Microbial Natural Product Biosynthesis, Part A: Overview Articles and Peptides Edited by DAVID A. HOPWOOD VOLUME 459. Complex Enzymes in Microbial Natural Product Biosynthesis, Part B: Polyketides, Aminocoumarins and Carbohydrates Edited by DAVID A. HOPWOOD VOLUME 460. Chemokines, Part A Edited by TRACY M. HANDEL AND DAMON J. HAMEL VOLUME 461. Chemokines, Part B Edited by TRACY M. HANDEL AND DAMON J. HAMEL VOLUME 462. Non-Natural Amino Acids Edited by TOM W. MUIR AND JOHN N. ABELSON VOLUME 463. Guide to Protein Purification, 2nd Edition Edited by RICHARD R. BURGESS AND MURRAY P. DEUTSCHER
VOLUME 464. Liposomes, Part F Edited by NEJAT DÜZGÜNEŞ VOLUME 465. Liposomes, Part G Edited by NEJAT DÜZGÜNEŞ VOLUME 466. Biothermodynamics, Part B Edited by MICHAEL L. JOHNSON, GARY K. ACKERS, AND JO M. HOLT VOLUME 467. Computer Methods Part B Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 468. Biophysical, Chemical, and Functional Probes of RNA Structure, Interactions and Folding: Part A Edited by DANIEL HERSCHLAG VOLUME 469. Biophysical, Chemical, and Functional Probes of RNA Structure, Interactions and Folding: Part B Edited by DANIEL HERSCHLAG VOLUME 470. Guide to Yeast Genetics: Functional Genomics, Proteomics, and Other Systems Analysis, 2nd Edition Edited by GERALD FINK, JONATHAN WEISSMAN, AND CHRISTINE GUTHRIE VOLUME 471. Two-Component Signaling Systems, Part C Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE VOLUME 472. Single Molecule Tools, Part A: Fluorescence Based Approaches Edited by NILS G. WALTER VOLUME 473. Thiol Redox Transitions in Cell Signaling, Part A Chemistry and Biochemistry of Low Molecular Weight and Protein Thiols Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 474. Thiol Redox Transitions in Cell Signaling, Part B Cellular Localization and Signaling Edited by ENRIQUE CADENAS AND LESTER PACKER VOLUME 475. Single Molecule Tools, Part B: Super-Resolution, Particle Tracking, Multiparameter, and Force Based Methods Edited by NILS G. WALTER VOLUME 476. Guide to Techniques in Mouse Development, Part A Mice, Embryos, and Cells, 2nd Edition Edited by PAUL M. WASSARMAN AND PHILIPPE M. SORIANO VOLUME 477. Guide to Techniques in Mouse Development, Part B Mouse Molecular Genetics, 2nd Edition Edited by PAUL M. WASSARMAN AND PHILIPPE M. SORIANO VOLUME 478. Glycomics Edited by MINORU FUKUDA
VOLUME 479. Functional Glycomics Edited by MINORU FUKUDA VOLUME 480. Glycobiology Edited by MINORU FUKUDA VOLUME 481. Cryo-EM, Part A: Sample Preparation and Data Collection Edited by GRANT J. JENSEN VOLUME 482. Cryo-EM, Part B: 3-D Reconstruction Edited by GRANT J. JENSEN VOLUME 483. Cryo-EM, Part C: Analyses, Interpretation, and Case Studies Edited by GRANT J. JENSEN VOLUME 484. Constitutive Activity in Receptors and Other Proteins, Part A Edited by P. MICHAEL CONN VOLUME 485. Constitutive Activity in Receptors and Other Proteins, Part B Edited by P. MICHAEL CONN VOLUME 486. Research on Nitrification and Related Processes, Part A Edited by MARTIN G. KLOTZ VOLUME 487. Computer Methods, Part C Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND VOLUME 488. Biothermodynamics, Part C Edited by MICHAEL L. JOHNSON, JO M. HOLT, AND GARY K. ACKERS VOLUME 489. The Unfolded Protein Response and Cellular Stress, Part A Edited by P. MICHAEL CONN VOLUME 490. The Unfolded Protein Response and Cellular Stress, Part B Edited by P. MICHAEL CONN VOLUME 491. The Unfolded Protein Response and Cellular Stress, Part C Edited by P. MICHAEL CONN VOLUME 492. Biothermodynamics, Part D Edited by MICHAEL L. JOHNSON, JO M. HOLT, AND GARY K. ACKERS VOLUME 493. Fragment-Based Drug Design Tools, Practical Approaches, and Examples Edited by LAWRENCE C. KUO VOLUME 494. Methods in Methane Metabolism, Part A Methanogenesis Edited by AMY C. ROSENZWEIG AND STEPHEN W. RAGSDALE VOLUME 495. Methods in Methane Metabolism, Part B Edited by AMY C. ROSENZWEIG AND STEPHEN W. RAGSDALE VOLUME 496. Research on Nitrification and Related Processes, Part B Edited by MARTIN G. KLOTZ AND LISA Y. STEIN
VOLUME 497. Synthetic Biology: Methods for Part/Device Characterization and Chassis Engineering, Part A Edited by CHRISTOPHER VOIGT VOLUME 498. Synthetic Biology: Computer Aided Design and DNA Assembly, Part B Edited by CHRISTOPHER VOIGT
CHAPTER ONE
DNA-Binding Specificity Prediction with FoldX

Alejandro D. Nadra,* Luis Serrano,†,‡ and Andreu Alibés†

Contents
1. Introduction
2. Description of the Force Field, FoldX, and the Implementation of DNA Energy Terms and Base Mutation
   2.1. Force field
   2.2. DNA parameterization
   2.3. Detection of DNA base pairs
   2.4. Distance constraints to automatically identify base pairs
   2.5. DNA stacking energy
   2.6. DNA mutation
3. Predicting Specificities
   3.1. Prediction of DNA-binding profiles using FoldX
4. Changing Binding Capabilities
5. Designing New Specificities
6. Known Caveats
   6.1. Resolution
   6.2. Water molecules
   6.3. DNA flexibility and base independence
   6.4. Clashes
7. Conclusions and Future Outlook
Acknowledgments
References
Abstract

With the advent of Synthetic Biology, a field between basic science and applied engineering, new computational tools are needed to help scientists reach their design goals while optimizing resources. In this chapter, we present a simple and powerful method to either determine the DNA specificity of a wild-type protein or design new specificities by using the protein design algorithm FoldX. The only basic requirement is a good-resolution structure of the complex. Protein–DNA interaction design may aid the development of new parts designed to be orthogonal, decoupled, and precise in their targets. Further, it could help to fine-tune systems in terms of specificity, discrimination, and binding constants. In the age of newly developed devices and invented systems, computer-aided engineering promises to be an invaluable tool.

* Departamentos de Química Biológica y Fisiología, Biología Molecular y Celular, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina
† EMBL/CRG Systems Biology Research Unit, Center for Genomic Regulation (CRG), UPF, Barcelona, Spain
‡ ICREA Professor, Center for Genomic Regulation (CRG), UPF, Barcelona, Spain

Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00001-2. © 2011 Elsevier Inc. All rights reserved.
1. Introduction

One of the goals of Synthetic Biology is the design and construction of new biological parts, devices, and systems. Most biological parts have been obtained from nature with slight modification, if any; the exceptions are some experimentally selected mutants. This is particularly true of DNA-binding proteins, probably because engineering tools to design protein–DNA interfaces were lacking until very recently (Alibés et al., 2010a,b; Morozov et al., 2005). In the past few years, there have also been successful attempts to predict binding specificity from structure, either by using existing crystal structures (Angarica et al., 2008; Arnould et al., 2006; Endres and Wingreen, 2006; Havranek et al., 2004; Jamal Rahi et al., 2008; Marcaida et al., 2008; Morozov et al., 2005; Paillard et al., 2004; Redondo et al., 2008) or by using a docking approach (Liu et al., 2008). These predictions were evaluated in zinc fingers (Benos et al., 2002; Paillard et al., 2004), where a sensitivity to docking geometry was reported (Siggers and Honig, 2007), and in meganucleases (Arnould et al., 2006; Marcaida et al., 2008; Redondo et al., 2008), highlighting the importance of having multiple templates to enhance accuracy.

Protein–DNA interactions are key to the regulation of cellular activities such as transcription and replication. The specificity displayed by a DNA-binding protein toward its DNA target sites is an essential feature for the function of any transcription factor (TF) in a given genome, where an excess of nonspecific sites exists (Rohs et al., 2010).

For Synthetic Biology, a binding site should be distinguished from a binding profile or motif, the former being an extreme simplification of DNA-binding capabilities. To gain fine control of a system, a complete description of the behavior of a TF is desired. For example, it may be desirable to have a unique site in the complete genome.
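The difference between a binding site and a binding profile can be made concrete with a small sketch. The Python example below is only an illustration under stated assumptions: the four aligned sites are invented, and the helper names `frequency_profile` and `consensus` are hypothetical rather than part of FoldX. It builds a per-position base frequency profile from the aligned sites and then collapses it into a single consensus site, which shows what information the simplification discards.

```python
from collections import Counter

BASES = "ACGT"

def frequency_profile(sites):
    """Return per-position base frequencies for aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # Bases never observed at this position get frequency 0.0.
        profile.append({base: counts[base] / len(sites) for base in BASES})
    return profile

def consensus(profile):
    """Collapse a profile into one site by taking the most frequent base."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in profile)

# Hypothetical aligned binding sites, invented for illustration only.
sites = ["TGACA", "TGACT", "TGTCA", "TGACA"]

profile = frequency_profile(sites)
site = consensus(profile)

print("consensus site:", site)
for pos, column in enumerate(profile, start=1):
    print(pos, column)
```

The consensus string keeps only the preferred base at each position, whereas the profile also records how strongly each alternative base is tolerated, which is the information needed to reason about graded affinities.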
Alternatively, the requirement could be to have several sites, each with a different affinity, or overlapping binding sites, or even a collection of TFs with graduated affinity/activity. Further, the goal for certain devices could be to have orthologous systems with the same affinity but different kon/koff. A working definition of Synthetic Biology (http://www.synthetic-biology.info) suggests that Synthetic Biology is the engineering of biological
DNA-Binding Specificity Prediction with FoldX
components and systems that do not exist in nature and the reengineering of existing biological elements; it is centered on the intentional design of artificial biological systems, rather than on the understanding of natural biology. This is particularly challenging in biological systems, where each process may be heavily influenced by regulatory and nonregulatory interactions with the environment. In any case, to achieve this goal for DNA-binding proteins, a specific protein–DNA engineering tool is needed.

Protein–DNA interactions are complex phenomena, as they involve direct and indirect interactions, and there is no general recognition code to predict base–residue interactions. Thus, structural information is required in order to understand the specificity of any DNA-binding protein. Structure-based DNA-binding prediction is a powerful tool to infer protein-binding sites and design new specificities. This approach has developed in the past few years, and there are some examples of successful binding site prediction for several proteins, particularly for zinc fingers. There is, however, only one example of structure-based design of a new specificity (Ashworth et al., 2006), underlining the state-of-the-art nature of the approach. These techniques usually require a lot of expertise and are difficult for nonexperts to reproduce.

The aim of this chapter is to establish a simple protocol for structure-based prediction of protein binding profiles. This protocol can be used to derive the binding motif of an existing protein and also to modify existing specificities. Because it is designed to be simple, the protocol may be somewhat limited, as we tried to strike a balance between power and simplicity. Some hints on how to fine-tune the predictions to make them more powerful are presented in Section 6. De novo design is very straightforward with the newly developed ability to optimize and synthesize genes.
In what follows, we present the FoldX protein design algorithm as an example of a tool to evaluate protein–DNA binding specificities, which may allow tighter control of your favorite transcription factor or the design of new DNA sequence specificities.
2. Description of the Force Field, FoldX, and the Implementation of DNA Energy Terms and Base Mutation

FoldX (Guerois et al., 2002; Schymkowitz et al., 2005; http://foldx.crg.es) is an application that provides a fast and quantitative estimation of the importance of the interactions contributing to the stability of proteins, protein–protein complexes, and protein–DNA complexes. The capability of FoldX to deal with protein–DNA complexes has been recently added and its predictions have been validated (Alibés et al., 2010a) and particularly applied to meganucleases (Arnould et al., 2006; Marcaida et al., 2008;
Alejandro D. Nadra et al.
Redondo et al., 2008). Given a template structure, this method generates a PWM/motif displayed by the DNA-binding protein of interest, derived by an energetic criterion. Its main feature in the framework of this chapter is its ability to mutate both protein and DNA and evaluate the effect of these mutations on the interaction energy and the stability of the complex. FoldX is freely available for academic users upon registration (http://foldx.crg.es).
2.1. Force field

The FoldX force field defines the following terms (Eq. (1.1)) to calculate the free energy:

\[
\Delta G = \Delta G_{\mathrm{vdw}} + \Delta G_{\mathrm{solvH}} + \Delta G_{\mathrm{solvP}} + \Delta G_{\mathrm{wb}} + \Delta G_{\mathrm{hbond}} + \Delta G_{\mathrm{el}} + \Delta G_{\mathrm{kon}} + T\Delta S_{\mathrm{mc}} + T\Delta S_{\mathrm{sc}} + \Delta G_{\mathrm{clash}} \tag{1.1}
\]

ΔGvdw, the van der Waals term; ΔGsolvH and ΔGsolvP, the interaction with the solvent, separated between the hydrophobic and polar groups; ΔGwb, the water bonds term; ΔGhbond, the term that takes into account hydrogen bonds; ΔGel, the electrostatic contribution to free energy; ΔGkon, the electrostatic contribution coming from the interaction of atoms of two different molecules; ΔSmc and ΔSsc, the entropy cost of fixing the main chain or the side chain to a particular conformation; ΔGclash, the term that takes into account the steric overlaps between atoms in the structure; and finally T, the temperature (Schymkowitz et al., 2005).
2.2. DNA parameterization

The four bases, A, C, G, and T, as well as the methylated A and C, were added using standard FoldX amino acid atoms as reference atoms. For each atom of each nucleic acid, we used the standard parameters (van der Waals radii, volumes, solvation energies, hydrogen bond energies, and angles) of the amino acid atom that was closest in nature (type of atom, hybridization). For example, the polar atom N6 of the adenine base, which can make two hydrogen bonds, uses the parameters of the atom ND2 of asparagine, whereas the parameters of the phosphate and the two oxygen atoms O1P and O2P are taken from the phosphorylated serine (Table 1.1).
2.3. Detection of DNA base pairs

To evaluate the preferred base at each position of the binding site, we need to calculate the energy for the four possible nucleotides, and the first step when moving or mutating DNA bases is to take into account that most of
Table 1.1 One-to-one correspondence between the atoms of adenine and the reference atoms chosen among the amino acids to define their parameters

DNA atom   Amino acid   Reference atom
P          SEP          P
O1P        SEP          O1P
O2P        SEP          O2P
O5*        HIS          O
C5*        GLY          CA
C4*        PRO          CA
O4*        HIS          O
C3*        PRO          CA
O3*        HIS          O
C2*        PRO          CG
C1*        PRO          CA
N9         HIS          N
C4         TRP          CE2
N3         HIS          N
C2         TYR          CD2
N1         HIS          N
C6         TYR          CZ
N6         ASN          ND2
C5         TRP          CE2
N7         HIS          N
C8         TYR          CD2

Comments: four of the entries above carry the comment "Except for H bonds (none possible)."
the bases are paired. Therefore, the problem is not a one-body move but a two-body move. In principle, one could identify the complementary base by looking at the DNA sequence in the structure. However, not all bases in a structure are always paired, and in some sequences this could lead to confusion. Thus, a design tool should be able to identify those pairs automatically in order to move (mutate) them at the same time. This is a simple task for the human eye, but not straightforward for a computational algorithm. The classical approach would be to consider two bases as paired if they make hydrogen bonds (H bonds), but in a test we found that, considering only high-resolution structures, 7.7% of the base pairs were misassigned. This is due to DNA distortion/opening that may occur upon protein binding, which has two main consequences: (i) bases are too far away from each other to make an H bond; (ii) bases are able to interact with more than one base and, as a result, it is difficult to decide, based on hydrogen bonding, which is the correct partner. To solve this problem, we examined the set of structures above, analyzed all pairwise distances between atoms of facing DNA bases, and derived some simple distance constraints to automatically identify base pairs (see below).
We then tested these constraints against structures not included in the previous set and containing double-stranded DNA in complex with a protein, and against structures of double-stranded DNA alone. All the base pairs in the protein–DNA set were assigned correctly by using the distance constraints, while 5.4% would be misassigned using a hydrogen-bonding criterion only. This ratio dropped to 3.6% for the set composed of double-stranded DNA alone because of the higher regularity of the considered structures. In all cases, our distance constraint method correctly identified the corresponding pairing base.
2.4. Distance constraints to automatically identify base pairs

We define DistC1 as the distance between the C1′ atoms of the two bases (C1′ is the ribose atom to which the base is attached), and DistN as the distance between atom N1 of the purine and atom N3 of the pyrimidine. Their ratio, Div = DistC1/DistN, is the determining factor for base pair recognition. To be considered a base pair, two bases must fulfill the following rules (average values in the dataset were considered for determining the parameters): Div is smaller than 5.0 and strictly either larger than 2.9, or larger than 2.5 if DistC1 is between 10.25 and 11.3 Å. If there was any ambiguity between two possible pairs, we chose the one presenting stronger H bonds. For the few cases that did not fit the criteria, we made a second pass with the following rules: bases that were not paired before and for which Div is between 2.1 and 5.0 and DistC1 is between 9 and 13 Å were considered base pairs.
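The two-pass rule can be illustrated with a short Python sketch. This is not FoldX code: the function name, its arguments, and the choice to treat "between" as inclusive bounds are assumptions made for illustration only, and the tie-break on H-bond strength mentioned in the text is not implemented.

```python
import math

def is_base_pair(c1_a, c1_b, n1_purine, n3_pyrimidine, second_pass=False):
    """Apply the distance constraints to one candidate pair of bases.

    Arguments are (x, y, z) coordinates in angstroms of the two C1' atoms,
    the purine N1 atom, and the pyrimidine N3 atom. Hypothetical helper.
    """
    dist_c1 = math.dist(c1_a, c1_b)
    dist_n = math.dist(n1_purine, n3_pyrimidine)
    if dist_n == 0.0:
        return False  # degenerate input, no meaningful ratio
    div = dist_c1 / dist_n

    if second_pass:
        # Looser rule, meant only for bases left unpaired by the first pass.
        return 2.1 <= div <= 5.0 and 9.0 <= dist_c1 <= 13.0

    if div >= 5.0:
        return False
    if div > 2.9:
        return True
    # Lower Div is tolerated only inside the C1'-C1' distance window.
    return div > 2.5 and 10.25 <= dist_c1 <= 11.3
```

A caller would run the first pass over all candidate pairs, resolve ambiguous assignments, and then call the function again with `second_pass=True` for the bases that remain unpaired.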
2.5. DNA stacking energy

To take into account the stacking of bases and their preferred conformation, we added an entropy-like term to the van der Waals clash term of FoldX for adjacent bases. This term was derived from a statistical analysis of all DNA structures in the Protein Data Bank, examining each possible pair of consecutive bases and the angle between the planes of the two bases. Those angles were discretized and, for each bin of 2°, an energy cost was calculated based on the probability p of having two bases at such an angle (ΔG = −RT ln(p)).
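As a minimal sketch of how such a statistical term can be tabulated, the Python function below bins a list of observed inter-plane angles and converts bin frequencies into energy costs. It assumes the usual inverse-Boltzmann sign convention (cost = −RT ln p), a gas constant in kcal/(mol·K), and a temperature of 298 K; the function name and the input format are hypothetical, and the actual FoldX tables are not reproduced here.

```python
import math
from collections import Counter

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def stacking_costs(angles_deg, bin_width=2.0, temperature=298.0):
    """Return {bin_start_in_degrees: energy cost in kcal/mol}.

    angles_deg: observed angles between the planes of consecutive bases.
    Bins that were never observed are absent from the result (their cost
    would be infinite under this model).
    """
    counts = Counter(int(angle // bin_width) for angle in angles_deg)
    total = sum(counts.values())
    rt = R_KCAL * temperature
    return {
        index * bin_width: -rt * math.log(count / total)
        for index, count in counts.items()
    }
```

Frequently observed angles receive a small cost and rare angles a large one, which is the behavior expected of an entropy-like penalty.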
2.6. DNA mutation

To mutate DNA, we replace a base in the crystal structure by its mutant, superposing the two bases on the N1 atom and the C1′–N1–C2 plane for the pyrimidines and on the N9 atom and the C1′–N9–C4 plane for the purines. Based on a statistical analysis of the available DNA structures in the PDB, and in order to simulate movements of the bases relative to the ribose, we allow two degrees of freedom: the angle C1′–N9–C4 (125 ± 5°) and the dihedral
angle O4′–C1′–N9–C4 (±5° around the original dihedral angle from the structure) for the purines, and the angle C1′–N1–C2 (118 ± 5°) and the dihedral angle O4′–C1′–N1–C2 (±5° around the original dihedral angle from the structure) for the pyrimidines. We replace in one step both the desired base and its base pair (when present). The side chains of the surrounding residues and bases are then moved sequentially to adapt to the new mutated bases.
3. Predicting Specificities

The binding profile of a transcription factor can be predicted in silico from the structure of its complex with DNA (Alibés et al., 2010a). The general strategy is to generate mutations in the DNA and evaluate the energy of each mutant toward the protein. In this way, we evaluate the binding energy as well as the discrimination between related DNA sequences. The first step is to find a template structure. Then, the DNA is mutated at each position to all four bases, the interaction energy is evaluated for each variant, and the specificity is calculated.
3.1. Prediction of DNA-binding profiles using FoldX

The only requirement is a good-quality structure of the desired protein in complex with DNA (X-ray cocrystal structures of up to 2.2 Å resolution are preferred over NMR structures). Structures may be obtained from the Protein Data Bank (Berman et al., 2000). DNA-binding-site motifs may then be derived for this protein using FoldX. To do so, one should mutate every base in the structure to the other three in an exhaustive manner and determine the predicted binding energy between the protein and the mutant DNA. Additionally, it is useful to consider the intramolecular van der Waals clashes of the DNA molecule that could appear upon base mutation. This extra term penalizes those DNA variants that may have a good binding energy but are forced into the DNA structure. Considering these energy terms, one can derive a DNA sequence profile by computing a partition function for each position. This procedure requires the following steps (Fig. 1.1):

1. The first step is to look for the best template in a database. Search the RCSB-PDB Web site for crystal structures of the candidate bound to DNA (structures solved by NMR can also be used, but crystal structures usually give better results). As the prediction results can vary depending on the template, only high-resolution structures should be considered. A resolution better than 2.2 Å is a reasonable cutoff. If the structure has poorly determined regions, they should not occur in the protein–DNA interface.
Figure 1.1 Scheme of the procedures described. In italics, FoldX function used in each step. [Flowchart. Common steps: Get structure (Protein Data Bank); Select chains (text editor/visualization tool); Optimize structure (RepairPDB). Wild-type branch: Mutate each nucleotide (BuildModel); Evaluate changes in ΔΔGint (AnalyseComplex); Convert ΔΔGint to probabilities; Create logo (SeqLogo R package); result: wild-type PWM and logo. Mutant branch: Mutate residue(s) to Ala (BuildModel); Mutate each nucleotide (BuildModel); Mutate Ala to desired mutant (BuildModel); Evaluate changes in ΔΔGint (AnalyseComplex); Convert ΔΔGint to probabilities; Create logo (SeqLogo R package); result: mutant PWM and logo.]
2. PDB files contain a lot of information and, usually, more molecules than those needed. Delete everything but the molecules you want to model (i.e., DNA strands, protein chain, metal ions, and water molecules in the interface). This can be done with any text editor or a structure visualization program, such as VMD, PyMol, or Swiss-PdbViewer.

3. The next step is to prepare the reference structure for the simulations. To optimize the crystal structure, use the FoldX RepairPDB function to minimize the energy according to the FoldX force field, removing small van der Waals clashes; generating missing side chains; and choosing the correct histidine, glutamine, and asparagine rotamers. After running the RepairPDB function, check the generated PDB file for the removal of relevant unrecognized atoms. This function creates a new file called “RepairPDB_,” followed by the name of the original structure file. This file will be used in the next step. Crystal structures are not perfect, and sometimes interacting atoms are placed too close (van der Waals clashes) or too far apart (missing a hydrogen bond). As a result, FoldX can move out a side chain that makes specific contacts with a base or another amino acid. The user
should check the resulting structure after repair to see if this is the case. There is an option in FoldX that will fix those problematic side chains if needed. As an example, consider the GCN4 crystal structure (PDB: 1YSA; Fig. 1.2). When FoldX repairs this structure, the side chain of arginine 243 moves outward and no longer makes a specific contact with the bases. The distance involved is 3.2 Å, from the NH1 atom of the arginine to the C2 atom of the facing adenine, which results in van der Waals clashes, and consequently Arg243 is flipped out. This behavior is not frequent, but the user should be aware of it. When such a case is detected, the user should “fix” the residue to force the side chain to remain in contact with the ligand (available for the RepairPDB and BuildModel commands).

4. Using the optimized structure obtained in the previous step, mutate the DNA sequence in the template structure to all possible bases at each position. The number of mutants will be four times the length (avoid mutating bases that are too far away to contact the protein, i.e., at distances to the protein larger than 6 Å). This approximation requires considering each protein–DNA base pair contact as independent and additive. In general, this could be the case, but in a densely packed area it is quite probable that context effects due to the neighboring bases are important. If the user suspects this is the case, she can proceed by mutating the base pairs as triplets centered on the target base. So, instead of considering four variants for
Figure 1.2 Example of an optimized residue (Arg243) that is flipped out of the binding interface due to a too short distance. Black: crystal structure. Gray: optimized structure by FoldX (without fixing the side chain). [Distances labeled in the figure: 3.2 and 3.5.]
each position, we should analyze 64 base combinations. This kind of analysis may be more relevant when it is known that flexibility and/or significant DNA deformation is important either in intermediates or in the final complex. However, the user should be aware that it will significantly increase the computation time.

5. Once the DNA variants have been generated, we have to evaluate the changes in interaction energy with respect to the original structure. When mutating residues or bases, some variants that give good interaction energies but introduce internal van der Waals clashes (either within the DNA or within the protein) can be wrongly selected as a good option. To take this into account, when the mutant has a higher intraclashes energy than the reference (optimized structure), add the difference to the change in interaction energy (note that the intraclashes value should never be subtracted if the mutant displays a lower value than the crystal, as this reports a crystallographic problem and not a gain in interaction energy). As a rule of thumb, changes in intraclashes smaller than 0.6 kcal/mol can be omitted (0.6 kcal/mol is the standard FoldX error). To avoid disrupting the interaction with the mutants introduced, the interaction energy should not be much higher than that of the crystal structure. If you have fixed a side chain because it has bad contacts in the crystal structure, you should not consider the changes in van der Waals clashes in the overall interaction energy upon mutating the interacting base unless they become worse than the reference.

6. The changes in interaction energy corresponding to the nucleotides at each position are converted to probabilities proportional to exp(−ΔΔGint/RT) and then, using the seqLogo package from R Bioconductor, graphically represented as logos (see examples in Fig. 1.3).
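The conversion in the last step can be sketched in Python as follows. The sketch assumes the usual Boltzmann convention, in which a lower ΔΔGint gives a higher probability, energies in kcal/mol, and a temperature of 298 K. The function names are hypothetical, and the information content is the plain Shannon measure in bits (2 minus the entropy) without small-sample correction, which is an assumption about how logo heights are computed rather than a description of seqLogo internals.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def position_probabilities(ddg_by_base, temperature=298.0):
    """Convert {base: ddG_int in kcal/mol} at one position to probabilities."""
    rt = R_KCAL * temperature
    weights = {base: math.exp(-ddg / rt) for base, ddg in ddg_by_base.items()}
    partition = sum(weights.values())  # per-position partition function
    return {base: weight / partition for base, weight in weights.items()}

def information_content(probabilities):
    """Shannon information content in bits for a four-letter alphabet."""
    entropy_term = sum(p * math.log2(p) for p in probabilities.values() if p > 0)
    return 2.0 + entropy_term

# Illustrative numbers only, not FoldX output.
column = position_probabilities({"A": 0.0, "C": 1.5, "G": 2.0, "T": 0.8})
```

Applying `position_probabilities` to every position of the binding site yields the columns of a PWM, which can then be passed to a logo-drawing tool.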
4. Changing Binding Capabilities

DNA-binding capabilities can be changed quantitatively, qualitatively, or both. The interaction energy can be changed without significantly affecting sequence discrimination by varying the electrostatic interactions between the protein and the DNA backbone, without affecting the side chain–base contacts that dictate sequence specificity. Conversely, one may wish to modify base discrimination while keeping the interaction strength relatively invariant. This task is considerably more challenging and involves very carefully introducing specific contacts that favor (or disfavor) either one particular base or its size (to specify a purine or pyrimidine at a particular position). Both approaches can be combined to completely change a binding site, slightly modify it, extend it, restrict it, introduce variable bases in the middle of a conserved binding region, etc.
Figure 1.3 Comparison between the experimental logos (top) and the FoldX predicted ones (bottom) for Usf1 and Gcn4. Usf1 experimental logo from Jaspar (Sandelin et al., 2004); Gcn4 experimental logo from UniPROBE (Newburger and Bulyk, 2009). [Sequence logos. Panels: Usf1 (1AN4), positions 1–7; Gcn4 (1YSA), positions 1–12. Vertical axis: information content, 0–2.]
5. Designing New Specificities

If we want to modify the binding profile of a transcription factor, we can run a first in silico screen of the possible mutants before doing the experimental work. To do so:

1. Select the residues that are close to the DNA positions where the specificity needs to be changed.

2. Mutate those residues to alanine and then use the alanine mutant structure to scan all possible DNA combinations at the selected DNA positions, as done above.
3. Remove all DNA combinations that have bad internal energies due to van der Waals clashes (incompatibility of the DNA sequence with the DNA structure).

4. Mutate the alanine in each of the structures, with its specific combination of DNA bases, to the selected new residues (remember that you can explore as many protein positions and amino acids as you want, but this comes at a computational cost). As a rule of thumb, more than three positions with 20 amino acids each will make the calculation too long. It is useful in many cases to learn from nature. Try to find all proteins related to the one you are mutating and see if there is enough sequence variability at the positions of interest. If this is the case, use those residues at each position and explore all combinations. If this is not possible, first mutate each protein position to the 20 amino acids, select the residues that have favorable interaction energies with your DNA template, and then do all combinations. When mutating to residues with larger side chains, it is necessary to repeat the mutations several times (3–5). This is accomplished with the multiple-run option (5 runs). Then, for a statistical analysis, the average value should be taken, while for a particular design, the mutant with the best value should be taken. The reason is that the algorithm could be trapped in a minimum due to the starting residue and rotamer in the mutagenesis procedure. By using the multiple-run option, it will start with a different position and rotamer each time, thus increasing the chance of finding the real energy optimum. The alanine mutation step is very important and should be done before mutating the DNA bases, so that the newly introduced bases can move more freely and adopt a favorable conformation that may be sterically blocked by the wild-type residue. By deleting the contact between the residue and the nucleotide, alternative bases can adopt the best conformation according to the DNA structure only.

5. Finally, the change in interaction energy with respect to the original wild-type structure should be calculated. We should consider as possible candidates only those where the change in interaction energy between the wild-type complex and the mutant is not too large. Failure to consider this factor could yield unrealistic logos.
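Two bookkeeping points from the steps above lend themselves to a short Python sketch: the combinatorial growth of the protein mutant library, and the summary of repeated runs (average for statistics, best value for a specific design). The function names and mutant labels are hypothetical, and "best" is taken to mean the lowest (most favorable) energy, which is an assumption.

```python
from statistics import mean

def n_protein_variants(n_positions, residues_per_position=20):
    """Number of protein variants when every position is fully randomized."""
    return residues_per_position ** n_positions

def summarize_runs(energies_by_mutant):
    """Summarize repeated runs: mean for statistics, minimum for design.

    energies_by_mutant: {mutant_label: [interaction energy per run, ...]}
    """
    return {
        mutant: {"mean": mean(energies), "best": min(energies)}
        for mutant, energies in energies_by_mutant.items()
    }

# Illustrative values only; the mutant label is made up.
summary = summarize_runs({"R243K": [-1.0, -2.0, -3.0]})
```

With three fully randomized positions there are already 8000 protein variants, each to be combined with the DNA variants and repeated over several runs, which is why the text advises restricting the residues explored.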
6. Known Caveats

The procedures and examples described above rely on having “the best case scenario”: a high-resolution structure, very stable and rigid upon mutation, whose binding sites are not affected by the chromatin structure. However, that is not the usual case and several factors have to be taken into account.
6.1. Resolution

High-resolution structures (resolution < 2.2 Å) should preferably be used. Low-resolution structures may have unresolved side chains, as well as errors in backbone conformation or side chain placement, that limit the predictive power of the procedure described.
6.2. Water molecules

Specificities are more challenging to predict accurately for proteins whose contact with DNA depends on water-mediated interactions. Even though FoldX is capable of predicting water molecule positions, the energetics of water-mediated interactions are difficult to predict.
6.3. DNA flexibility and base independence

For complexes where the intrinsic DNA structure or flexibility plays an important role, nucleotide mutations could trigger important structural changes. FoldX is not capable of moving backbones upon mutation and cannot take these cases into account. Further, in our procedure we assume base independence, that is, that a nucleotide in one position does not change the probability of having another nucleotide in the neighboring positions. Although this is clearly not true in all cases, it is usually a good approximation. For those structures where changes in the DNA sequence could affect the neighboring bases, we recommend mutating the DNA in triplets. Also, if there are several structures of the same protein, or of a closely related protein, with DNA, use all available templates; this will increase the number of possible hits when changing specificity, as well as the accuracy of the logo.
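The difference in workload between single-base and triplet scans can be made concrete with a small Python sketch that enumerates the sequence variants for one target position. It only manipulates the sequence of one strand (the paired base is assumed to be handled by the modeling tool), the function names are hypothetical, and the example sequence is purely illustrative.

```python
from itertools import product

BASES = "ACGT"

def single_variants(sequence, position):
    """All four sequences differing only at the given 0-based position."""
    return [sequence[:position] + base + sequence[position + 1:] for base in BASES]

def triplet_variants(sequence, position):
    """All 64 sequences varying the triplet centered on the given position."""
    if position < 1 or position > len(sequence) - 2:
        raise ValueError("a centered triplet needs one neighbor on each side")
    return [
        sequence[:position - 1] + "".join(triplet) + sequence[position + 2:]
        for triplet in product(BASES, repeat=3)
    ]

site = "ATGACTCAT"  # illustrative sequence
singles = single_variants(site, 4)
triplets = triplet_variants(site, 4)
```

Each position therefore costs 16 times more models in triplet mode than in single-base mode, before any repeated runs are counted.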
6.4. Clashes

FoldX frequently reports high van der Waals clashes between bases and the DNA backbone or residues in the wild-type structure. When this happens and introducing a mutation considerably relieves the clashes present in the wild type, it can be assumed that there was a crystallographic problem, and the clashes should be dismissed. Further, small increments in van der Waals clashes can be omitted if they are lower than 0.6 kcal/mol. A fine analysis of interaction energies requires analyzing both the gain in energy due to specific and nonspecific contacts and the repulsive forces due to close contacts. Indeed, this analysis can be done by discriminating between the effect within each ligand (i.e., intraclashes) and the interface interclashes. It is quite common to introduce a mutation that significantly enhances binding, and thus the interaction energy, but introduces high intraclashes that destabilize the protein, possibly to a greater extent.
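The clash rules can be summarized in a few lines of Python. This is a sketch of one possible reading of the rules, not FoldX output handling: the function and argument names are hypothetical, energies are assumed to be in kcal/mol, and an increase in intraclashes is counted only when it reaches the 0.6 kcal/mol threshold.

```python
FOLDX_ERROR = 0.6  # kcal/mol, quoted in the text as the standard FoldX error

def corrected_ddg(int_mutant, int_reference, clash_mutant, clash_reference,
                  threshold=FOLDX_ERROR):
    """Change in interaction energy, penalized for newly introduced intraclashes.

    A clash increase of at least `threshold` is added to the ddG. A clash
    decrease is never subtracted, since it is taken to reflect a
    crystallographic problem rather than a real gain.
    """
    ddg = int_mutant - int_reference
    clash_increase = clash_mutant - clash_reference
    if clash_increase >= threshold:
        ddg += clash_increase
    return ddg
```

For example, a mutant that loses 2.0 kcal/mol of interaction energy and gains 2.0 kcal/mol of intraclashes is scored at about 4.0 kcal/mol, whereas a mutant whose clashes decrease keeps its uncorrected value.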
7. Conclusions and Future Outlook

We have described a procedure in which, using a self-consistent force field, protein–DNA interaction specificity can be accurately predicted from a crystal structure. The protocol can also be used to change protein specificity and affinity. Speed and accuracy can each be improved at the expense of the other, but when applying FoldX in a high-throughput analysis, a balance can be struck between them. FoldX in combination with a protein–DNA structure can provide the user with a position weight matrix (PWM) for scanning the genome or locus of interest for putative binding sites. Although we only discuss using X-ray structures, structures coming from NMR or from homology modeling with a high percentage of identity (especially in the interface with DNA) could also be used. When dealing with a known flexible protein/interaction, incorrect prediction of specificity at positions where there is no contact may result from the specific local conformation of the DNA, emphasizing the importance of local backbone moves. Adding DNA and protein backbone flexibility, especially at the edges of the binding site, could improve the prediction. In the near future, we plan to incorporate backbone moves into FoldX with the use of BriX (Baeten et al., 2008), a collection of protein fragments.
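As an illustration of how a predicted PWM might be used to scan a sequence, the Python sketch below scores every window with a log-odds score. The scoring scheme is a common textbook choice and not something prescribed by the protocol: a uniform background of 0.25, a small pseudocount, forward strand only, and uppercase A/C/G/T input are all assumptions, as are the function names and the toy matrix.

```python
import math

def log_odds(pwm, background=0.25, pseudocount=1e-3):
    """Turn probability columns into log2-odds columns against the background."""
    norm = 1.0 + 4.0 * pseudocount
    return [
        {base: math.log2((column.get(base, 0.0) + pseudocount) / norm / background)
         for base in "ACGT"}
        for column in pwm
    ]

def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at least `threshold`."""
    scores = log_odds(pwm)
    width = len(scores)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(scores[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy three-column PWM that prefers the site ACG.
toy_pwm = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
]
hits = scan("TTACGTT", toy_pwm, threshold=3.0)
```

In practice the reverse strand would also be scanned and the threshold calibrated against known sites.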
ACKNOWLEDGMENTS

We would like to thank François Stricher for the continuous development of FoldX during these past years and for insight on the characterization of DNA.
Appendix. Mutation Protocol

For each template considered, the positions of the amino acid side chains and bases in the crystal structure were first energetically optimized using the FoldX RepairPDB function. Then, each base was mutated to the other three possible bases five times to increase the conformational space analyzed. Each single point mutation takes less than 60 s using a single CPU (Intel Xeon 3.00 GHz, 8 GB of RAM). Using the average value, the difference in the interaction energy with respect to the wild type (ΔΔGint) was calculated, adding the difference in intramolecular clashes if they were higher than for the crystal structure. This function is graphically displayed in Fig. 1.3 as information content by means of the R package seqLogo (Bembom, 2007), where the height of a given nucleotide is proportional to exp(−ΔΔGint/RT). When more than one
structure/chain exists for a given protein, the one with the better resolution should be chosen. In general, the same physical conditions may be used unless the particularities of the system require otherwise. Standard values may be: temperature of 298 K, pH of 7.0, and ionic strength of 150 mM (for proteins from extremophiles, changing these parameters may be required).
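The figures quoted in this appendix allow a rough upper bound on the CPU time of a scan. The Python sketch below assumes that each of the five repeats counts as one mutation of at most 60 s and that the runs are executed serially on a single CPU; both are assumptions, and the function name is hypothetical.

```python
def scan_cpu_seconds_upper_bound(n_positions, runs=5, alternatives=3,
                                 seconds_per_mutation=60):
    """Upper bound on serial CPU time for a single-base scan, in seconds."""
    return n_positions * alternatives * runs * seconds_per_mutation

# A 10-position binding site under these assumptions.
seconds = scan_cpu_seconds_upper_bound(10)
hours = seconds / 3600
```

Under these assumptions a 10-position site needs at most 9000 s, that is, 2.5 h, of CPU time.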
REFERENCES

Alibés, A., Nadra, A. D., De Masi, F., Bulyk, M. L., Serrano, L., and Stricher, F. (2010a). Using protein design algorithms to understand the molecular basis of disease caused by protein-DNA interactions: The Pax6 example. Nucleic Acids Res. 38, 7422–7431.
Alibés, A., Serrano, L., and Nadra, A. D. (2010b). Structure-based DNA-binding prediction and design. Methods Mol. Biol. 649, 77–88.
Angarica, V. E., Perez, A. G., Vasconcelos, A. T., Collado-Vides, J., and Contreras-Moreira, B. (2008). Prediction of TF target sites based on atomistic models of protein-DNA complexes. BMC Bioinform. 9, 436.
Arnould, S., Chames, P., Perez, C., Lacroix, E., Duclert, A., Epinat, J. C., Stricher, F., Petit, A. S., Patin, A., Guillier, S., Rolland, S., Prieto, J., et al. (2006). Engineering of large numbers of highly specific homing endonucleases that induce recombination on novel DNA targets. J. Mol. Biol. 355, 443–458.
Ashworth, J., Havranek, J. J., Duarte, C. M., Sussman, D., Monnat, R. J., Jr., Stoddard, B. L., and Baker, D. (2006). Computational redesign of endonuclease DNA binding and cleavage specificity. Nature 441, 656–659.
Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F., and Schymkowitz, J. (2008). Reconstruction of protein backbones from the BriX collection of canonical protein fragments. PLoS Comput. Biol. 4, e1000083.
Bembom, O. (2007). seqLogo: An R package for plotting DNA sequence logos. http://bioconductor.org/packages/2.6/bioc/html/seqLogo.html.
Benos, P. V., Lapedes, A. S., and Stormo, G. D. (2002). Probabilistic code for DNA recognition by proteins of the EGR family. J. Mol. Biol. 323, 701–727.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
Endres, R. G., and Wingreen, N. S. (2006). Weight matrices for protein-DNA binding sites from a single co-crystal structure. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 73, 061921.
Guerois, R., Nielsen, J. E., and Serrano, L. (2002). Predicting changes in the stability of proteins and protein complexes: A study of more than 1000 mutations. J. Mol. Biol. 320, 369–387.
Havranek, J. J., Duarte, C. M., and Baker, D. (2004). A simple physical model for the prediction and design of protein-DNA interactions. J. Mol. Biol. 344, 59–70.
Jamal Rahi, S., Virnau, P., Mirny, L. A., and Kardar, M. (2008). Predicting transcription factor specificity with all-atom models. Nucleic Acids Res. 36, 6209–6217.
Liu, Z., Guo, J. T., Li, T., and Xu, Y. (2008). Structure-based prediction of transcription factor binding sites using a protein-DNA docking approach. Proteins 72, 1114–1124.
Marcaida, M. J., Prieto, J., Redondo, P., Nadra, A. D., Alibés, A., Serrano, L., Grizot, S., Duchateau, P., Paques, F., Blanco, F. J., and Montoya, G. (2008). Crystal structure of I-DmoI in complex with its target DNA provides new insights into meganuclease engineering. Proc. Natl. Acad. Sci. USA 105, 16888–16893.
Morozov, A., Havranek, J., Baker, D., and Siggia, E. (2005). Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 33, 5781–5798.
Newburger, D. E., and Bulyk, M. L. (2009). UniPROBE: An online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 37, D77–D82.
Paillard, G., Deremble, C., and Lavery, R. (2004). Looking into DNA recognition: Zinc finger binding specificity. Nucleic Acids Res. 32, 6673–6682.
Redondo, P., Prieto, J., Munoz, I. G., Alibés, A., Stricher, F., Serrano, L., Cabaniols, J. P., Daboussi, F., Arnould, S., Perez, C., Duchateau, P., Paques, F., et al. (2008). Molecular basis of xeroderma pigmentosum group C DNA recognition by engineered meganucleases. Nature 456, 107–111.
Rohs, R., Jin, X., West, S. M., Joshi, R., Honig, B., and Mann, R. S. (2010). Origins of specificity in protein-DNA recognition. Annu. Rev. Biochem. 79, 233–269.
Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W. W., and Lenhard, B. (2004). JASPAR: An open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32, D91–D94.
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F., and Serrano, L. (2005). The FoldX web server: An online force field. Nucleic Acids Res. 33, W382–W388.
Siggers, T. W., and Honig, B. (2007). Structure-based prediction of C2H2 zinc-finger binding specificity: Sensitivity to docking geometry. Nucleic Acids Res. 35, 1085–1097.
CHAPTER TWO

The Ribosome Binding Site Calculator

Howard M. Salis*,†

Contents
1. Introduction
   1.1. Inputs, outputs, and usage
   1.2. Considerations
2. Applications of the RBS Calculator
   2.1. Manipulating the protein expression level
   2.2. Optimizing synthetic metabolic pathways
   2.3. Designing and connecting genetic circuits
   2.4. Evolutionary robustness of RBSs
   2.5. Predicting translation initiation rates across a genome
3. RBSs and Bacterial Translation
   3.1. Translation initiation as a rate-limiting step
   3.2. The RBS genetic part
   3.3. The rate-limiting molecular interactions of translation initiation
4. A Biophysical Model and Optimization Method for RBSs
   4.1. Thermodynamics of RNA interactions
   4.2. A free energy model for ribosome assembly
   4.3. A statistical thermodynamic model of translation initiation
   4.4. Optimization of synthetic RBSs
   4.5. Accuracy and limitations
5. Precise Measurements of Fluorescent Protein Expression Levels
   5.1. Protocol
   5.2. Considerations
Acknowledgments
References
Abstract

The Ribosome Binding Site (RBS) Calculator is a design method for predicting and controlling translation initiation and protein expression in bacteria. The method can predict the rate of translation initiation for every start codon in an mRNA transcript. The method may also optimize a synthetic RBS sequence to achieve a targeted translation initiation rate. Using the RBS Calculator, a protein coding sequence’s translation rate may be rationally controlled across a 100,000+ fold range. We begin by providing an overview of the potential biotechnology applications of the RBS Calculator, including the optimization of synthetic metabolic pathways and genetic circuits. We then detail the definitions, methodologies, and algorithms behind the RBS Calculator’s thermodynamic model and optimization method. Finally, we outline a protocol for precisely measuring steady-state fluorescent protein expression levels. These methods and protocols provide a clear explanation of the RBS Calculator and its uses.

* Department of Chemical Engineering, Pennsylvania State University, University Park, Pennsylvania, USA
† Department of Agricultural and Biological Engineering, Pennsylvania State University, University Park, Pennsylvania, USA

Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00002-4. © 2011 Elsevier Inc. All rights reserved.
1. Introduction

The Ribosome Binding Site (RBS) Calculator is a predictive design method for controlling translation initiation and protein expression in bacteria (Salis et al., 2009). In its reverse engineering mode, the method predicts the rate of translation initiation for every start codon in an mRNA transcript. In its forward engineering mode, the method optimizes a synthetic RBS sequence to achieve a targeted translation initiation rate. Using the RBS Calculator, the translation rate of a protein coding sequence (CDS) may be rationally controlled across a 100,000+ fold range.

The RBS Calculator employs a thermodynamic model of bacterial translation initiation to calculate the Gibbs free energy of ribosome binding. Using a statistical thermodynamic approach, we relate this Gibbs free energy change to a protein CDS’s translation initiation rate. For forward engineering, the thermodynamic model is combined with a stochastic optimization method to design synthetic RBSs according to the desired specifications.

A Web interface to the RBS Calculator is located at http://salis.psu.edu/software. The source code (v1.0) is freely available at http://github.com/hsalis/Ribosome-Binding-Site-Calculator.
1.1. Inputs, outputs, and usage

To use the reverse engineering mode, an mRNA sequence is inputted. The translation initiation rate for each start codon is then predicted on a proportional scale from 0.001 to 100,000+ au. A protein CDS translated at 1000 au will produce 10 times more protein than one translated at 100 au, assuming that all other conditions are equal (e.g., transcription rates and mRNA stabilities). Because only ratios of rates are compared, these comparisons are unaffected by the proportionality constant of the scale.

The method will warn you if the prediction may not satisfy a model assumption. The following warnings are generated:

a. Kinetic trap (K) or not at equilibrium (NEQ): the folding of the mRNA transcript may experience a kinetic trap, preventing it from reaching an equilibrium and invalidating a key assumption of the thermodynamic model.
b. Overlapping start codons (OS or OLS): the presence of two or more closely spaced start codons may alter the accuracy of the model’s prediction.
c. Short protein CDS: the model requires at least 35 nucleotides of the protein CDS to make a valid prediction. A longer CDS should be inputted.

A key benefit of the forward engineering mode is the ability to design synthetic sequences that always satisfy the model’s assumptions, which leads to higher predictive accuracies.

To use the forward engineering mode, a target translation initiation rate and the first 35 nucleotides of a protein CDS are inputted. The translation initiation rate is selected from a proportional scale that ranges from 0.001 to 100,000+ au. An optional presequence may also be inputted, which is any sequence that must appear upstream (5′) of the RBS. The method then generates a synthetic RBS sequence (30–35 nucleotides) that will initiate translation of the inputted protein CDS at the target rate. The forward engineering mode uses a stochastic optimization method to efficiently search through 4^35 (about 10^21) possible RBS sequences until it identifies one with the desired specifications.

A variant of the forward engineering mode allows you to input an initial RBS sequence and to specify which nucleotides are allowed to be altered according to the IUPAC degenerate nucleotide code. For example, when an XbaI restriction site must be located near the start codon, the sequence NNNNNTCTAGANNNNNN could be inputted. We refer to these sequences as RBS constraints.
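An RBS constraint can be read as a pattern over the four bases. The following Python sketch is only an illustration and is not taken from the RBS Calculator source code; the function names are hypothetical. It uses the standard IUPAC degenerate nucleotide table to test whether a candidate sequence satisfies a constraint and to count how many sequences the constraint permits.

```python
import re

# Standard IUPAC degenerate nucleotide codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def constraint_to_regex(constraint):
    """Compile an IUPAC constraint into a regex matching whole sequences."""
    body = "".join("[" + IUPAC[c] + "]" for c in constraint.upper())
    return re.compile(body + r"\Z")

def satisfies(candidate, constraint):
    """True if the candidate has the constraint's length and allowed bases."""
    return constraint_to_regex(constraint).match(candidate.upper()) is not None

def search_space_size(constraint):
    """Number of distinct sequences permitted by the constraint."""
    size = 1
    for c in constraint.upper():
        size *= len(IUPAC[c])
    return size

# The constraint from the text: XbaI site (TCTAGA) fixed, 11 free positions.
constraint = "NNNNNTCTAGANNNNNN"
print(search_space_size(constraint))               # 4**11 candidates
print(satisfies("ACGTATCTAGAGGAGGT", constraint))  # XbaI site intact
print(search_space_size("N" * 35))                 # unconstrained 35-nt RBS: 4**35
```

Fixing the six XbaI positions shrinks the search space from 4^35 to 4^11 sequences, which is why a constraint with few alterable nucleotides may leave no solution.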
1.2. Considerations

a. For Escherichia coli K12, the maximum possible translation initiation rate is 5,687,190 au, which is only obtainable for extremely AT-rich protein CDSs that do not form any secondary structure. The maximum predicted translation initiation rate in the E. coli K12 genome is 250,000 au.
b. Reusing the same RBS sequence with different protein CDSs can decrease the translation initiation rate by 500-fold (Salis et al., 2009). A synthetic RBS sequence should be designed for each protein CDS.
c. Typically, there are multiple RBS sequences that equally satisfy an inputted specification, which is referred to as degeneracy in an optimization problem. The RBS Calculator’s forward engineering mode will return the first synthetic RBS sequence that satisfies the design specifications.
d. A synthetic RBS sequence may not exist given the desired inputs; for example, if the 5′ end of the protein CDS is GC-rich and a high translation initiation rate is selected (100,000+ au). An RBS constraint with few alterable nucleotides may also not have a solution. In these cases, the optimization method will self-terminate, and an RBS sequence is not returned.
e. The RBS Calculator does not predict the physiological effect of translating a protein CDS at a selected rate. Sufficiently high expression of a protein will result in cell growth inhibition, due to competition for metabolic resources. A protein may also exhibit an activity that causes cell death prior to this threshold.
2. Applications of the RBS Calculator

Proteins are the workhorses of cells. They are responsible for carrying out metabolism, replication, signal transduction, and differentiation. They sense the environment, regulate gene expression, catalyze chemical reactions, and act as therapeutics for treating human disease. Consequently, the ability to rationally control the protein expression level has broad applications in biotechnology.
2.1. Manipulating the protein expression level

Recombinant protein production is a multibillion dollar industry. The RBS Calculator allows you to produce your protein of choice at a selected expression level in bacteria (e.g., E. coli BL21 or C43) by balancing the translation rate with the transcription and folding rates (Grunberg and Serrano, 2010). Using the reverse engineering mode, you may also predict a protein CDS’s translation rate before cloning it into a vector, allowing you to know in advance whether protein expression will be low. Then, using the forward engineering mode, you can systematically increase the translation rate of a protein CDS in controlled steps (e.g., 1000 to 10,000 to 100,000 au) to identify the optimal expression conditions.
2.2. Optimizing synthetic metabolic pathways

Bacteria employ synthetic metabolic pathways to manufacture high-value chemical products from sugars or plant biomass. The RBS Calculator allows you to efficiently identify the optimal enzyme expression levels of a metabolic pathway, eliminating any bottlenecks in the pathway, and maximizing its productivity (Bujara and Panke, 2010; Holtz and Keasling, 2010; Na et al., 2010). Both combinatorial and convergent optimization strategies are possible.

In combinatorial optimization, the RBS Calculator’s forward engineering mode is used to systematically sample combinations of the enzyme expression levels across their entire range. The optimal enzyme expression levels are then identified using a high-throughput screen or selection. In convergent optimization, the forward engineering mode is used to measure the local gradient of productivity with respect to enzyme expression level, followed by a large leap in expression level space toward the optimal levels. Convergent optimization requires fewer experiments than combinatorial optimization, but is not guaranteed to yield the global optimum. The development and demonstration of these optimization strategies are active areas of research.
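As an illustration of the combinatorial strategy, the Python sketch below enumerates target translation initiation rates for a small pathway. The enzyme names, the range of 100 to 100,000 au, and the choice of four log-spaced levels are assumptions made for the example; each resulting target rate would then be passed, one CDS at a time, to the forward engineering mode.

```python
import itertools

def log_spaced_rates(low, high, n):
    """Return n target rates spaced evenly on a log scale from low to high."""
    if n == 1:
        return [low]
    ratio = (high / low) ** (1.0 / (n - 1))
    return [low * ratio ** i for i in range(n)]

def combinatorial_targets(enzymes, low=100.0, high=100000.0, n=4):
    """Yield one {enzyme: target rate} dictionary per combination of levels."""
    levels = log_spaced_rates(low, high, n)
    for combo in itertools.product(levels, repeat=len(enzymes)):
        yield dict(zip(enzymes, combo))

# Hypothetical three-enzyme pathway sampled at four levels per enzyme.
designs = list(combinatorial_targets(["enzA", "enzB", "enzC"]))
print(len(designs))  # levels ** enzymes combinations to build and screen
```

The number of designs grows as (levels)^(enzymes), which is the practical reason a convergent strategy may require fewer experiments.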
2.3. Designing and connecting genetic circuits

Genetic circuits (networks of regulated genes) employ transcription factors, regulatory RNAs, signaling proteins, and cell–cell communication to control gene expression according to a logical program or a dynamical behavior. Genetic circuits have been engineered to exhibit bistability, oscillations, pattern formation, traveling waves, and Boolean logic (Danino et al., 2010; Ellis et al., 2009; Khalil and Collins, 2010; Lu et al., 2009; Purnick and Weiss, 2009; Stricker et al., 2008; Tabor et al., 2009).

The RBS Calculator allows you to rationally tune the expression levels of the proteins in a genetic circuit, either during the initial design phase or when connecting two genetic circuits together. A quantitative model of the genetic circuit can be used to determine which experimental perturbations are needed to successfully obtain a desired behavior; however, the model can only report these perturbations as numbers (e.g., increase a transcription factor’s expression level by fivefold). Using the reverse and forward engineering modes, you may convert these numbers into a synthetic RBS sequence, increasing or decreasing a protein’s expression level by a selected fold change. The circuit’s quantitative model may be physics based or empirical, dynamical or at steady state, stochastic or deterministic; regardless, the RBS Calculator connects the model’s numerical solution to a specific DNA sequence.
2.4. Evolutionary robustness of RBSs

During evolutionary mutation and selection, protein expression level changes can result in physiological changes that increase an organism’s fitness. The RBS Calculator’s reverse engineering mode allows you to perform a sensitivity analysis on an RBS sequence, identifying which nucleotide mutations will most affect its translation initiation rate and protein expression level. Thus, probable paths during evolution can be identified a priori. Synthetic RBSs can also be designed for robustness to evolutionary pressure, by ensuring that any nucleotide mutation does not result in a large change in protein expression level.
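A sensitivity analysis of this kind can be sketched in Python as a scan over all single-nucleotide mutants. The sketch assumes a caller-supplied function `predict_rate(rbs, cds)` that stands in for the reverse engineering mode; that name, and the toy predictor used to make the example runnable, are hypothetical and are not the biophysical model.

```python
import math

def single_nucleotide_mutants(rbs):
    """Yield (position, new_base, mutant) for every point mutation of rbs."""
    for i, old in enumerate(rbs):
        for new in "ACGU":
            if new != old:
                yield i, new, rbs[:i] + new + rbs[i + 1:]

def sensitivity(rbs, cds, predict_rate):
    """Rank point mutations by |log10 fold change| in the predicted rate."""
    base = predict_rate(rbs, cds)
    results = []
    for pos, new, mutant in single_nucleotide_mutants(rbs):
        fold = predict_rate(mutant, cds) / base
        results.append((abs(math.log10(fold)), pos, new, fold))
    return sorted(results, reverse=True)

def toy_rate(rbs, cds):
    """Toy stand-in predictor (NOT the RBS Calculator model): rewards a GGAGG motif."""
    return 10.0 ** rbs.count("GGAGG")

results = sensitivity("UAAGGAGGU", "AUGGCUGAAGCG", toy_rate)
for score, pos, new, fold in results[:3]:
    print(pos, new, round(fold, 3))
```

An RBS of length L has 3L single-nucleotide mutants, so the scan is cheap; a robust design is one for which even the top-ranked mutations have fold changes close to 1.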
2.5. Predicting translation initiation rates across a genome

The RBS Calculator’s reverse engineering mode can predict the translation initiation rate of every start codon in a bacterial genome. These predictions enable you to find the correct start codons of open reading frames, estimate the expression levels of proteins within an operon, and identify internal start codons that exhibit significant translation and produce variant proteins. The RBS Calculator software uses MPI for parallel programming and its calculations may be distributed across many processors on a supercomputer. For example, the E. coli K12 genome contains 629,738 start codons. Over 600 internal start codons have significant amounts of translation, compared to the annotated open reading frame. In particular, the RBS Calculator correctly predicts the significant translation (6700 au) of a five-amino acid peptide encoded within its 23S rRNA (Tenson et al., 1996).
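The first step of such a genome-wide scan, locating candidate start codons and extracting the surrounding sequence, can be sketched in Python as follows. This is an illustration, not the RBS Calculator's own code. It assumes the four start codons AUG, GUG, UUG, and CUG, a window of up to 35 nucleotides upstream, and the first 35 nucleotides of the putative CDS counted from the start codon itself; that last convention is an assumption of the sketch.

```python
START_CODONS = ("AUG", "GUG", "UUG", "CUG")
WINDOW = 35  # nucleotides taken on each side of the start codon position

def start_codon_windows(mrna):
    """Yield (position, codon, upstream, downstream) for each start codon.

    position is the 0-based index of the first nucleotide of the codon.
    Candidates with fewer than WINDOW nucleotides of CDS are skipped,
    mirroring the "short protein CDS" warning.
    """
    mrna = mrna.upper().replace("T", "U")
    for i in range(len(mrna) - 2):
        codon = mrna[i:i + 3]
        if codon in START_CODONS:
            upstream = mrna[max(0, i - WINDOW):i]
            downstream = mrna[i:i + WINDOW]
            if len(downstream) == WINDOW:
                yield i, codon, upstream, downstream

# Hypothetical transcript with a single AUG.
hits = list(start_codon_windows("A" * 40 + "AUG" + "C" * 40))
for pos, codon, up, down in hits:
    print(pos, codon, len(up), len(down))
```

Each window is independent of the others, which is what makes the calculation easy to distribute across many processors.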
3. RBSs and Bacterial Translation

Synthetic biology cannot advance as an engineering discipline without clear definitions and predictive rules. In this section, we clearly state the rules that control bacterial translation initiation, including the boundaries of the RBS genetic part and its molecular interactions with the ribosome.
3.1. Translation initiation as a rate-limiting step

Translation is the process by which ribosomes bind to an mRNA sequence and produce a corresponding protein according to the genetic code. The process has three phases: initiation, elongation, and termination (Kozak, 1999; Laursen et al., 2005). Translation initiation is the key rate-limiting step; it controls how many ribosomes begin the elongation process and how many can potentially finish it. To obtain the highest protein production rates, both the translation initiation and elongation rates must be maximized. Codon optimization of protein CDSs is a commonly used method for increasing the translation elongation rate (Welch et al., 2009). However, any reduction in translation initiation will always lead to a reduction in protein production. Thus, by codon optimizing the protein CDS and manipulating its translation initiation rate, we can exert complete control over a protein’s production rate across its entire range.

The rate of translation initiation is determined by two factors: global changes in the cell’s metabolism and the specific mRNA sequence surrounding a start codon. When bacteria are grown in nutrient-rich media, their number of available ribosomes is increased to support faster protein production rates and self-replication. Conversely, in nutrient-poor media or when protein
production exceeds a critical threshold, bacteria will reduce the overall protein production rate to conserve resources (e.g., the stringent response). These changes are global; they affect the translation of all mRNAs inside the cell. Bacteria differentially regulate the translation of individual mRNAs according to the sequence of the RBS, a ubiquitous genetic part that is located upstream of a protein CDS. By modifying the RBS sequence, one alters a CDS’s translation initiation relative to all other translated CDSs inside the cell. In other words, the RBS sequence controls the distribution of protein synthesis resources to transcribed protein CDSs (a proportion)—not their absolute translation initiation rate (proteins/mRNA/second).
3.2. The RBS genetic part

We propose the following functional definition of the RBS genetic part and its boundaries:

Definition
(i) The RBS part begins at least 35 nucleotides before the start codon of a protein CDS, up to the start (+1) of the mRNA transcript.
(ii) The translation initiation rate of a protein CDS depends on at least the 35 nucleotides before and after its start codon on an mRNA transcript.

Our RBS part length is longer than previous definitions to include the presence of long-range molecular interactions between the RBS sequence, the CDS, and the ribosome. If the RBS part length is shortened, these interactions would not be incorporated into the prediction and would instead appear as “context effects”: unknown position-dependent molecular interactions that somehow alter the translation initiation rate. Our definition substantially reduces these context effects.
3.3. The rate-limiting molecular interactions of translation initiation

Bacterial translation initiation requires the coordinated assembly of the 30S ribosomal complex onto the RBS at a protein CDS’s start codon. This preinitiation complex includes translation initiation factors IF1, IF2, and IF3, and the initiator tRNA^fMet (Ramakrishnan, 2002). Successful initiation also requires GTP as an energy source. Once bound, the ribosome protects a large mRNA region from hydroxyl radical attack, consisting of about 35 nucleotides before the start codon and extending to 19–22 nucleotides after the start codon (Hüttenhofer and Noller, 1994).

The RBS region is subdivided into a standby site, a 16S rRNA binding site, a spacer region, and a start codon (Fig. 2.1). Initially, the 30S subunit binds to the standby site, which is a single-stranded region farther upstream of the start codon. Once bound, the 30S subunit slides downstream into its position over the RBS, assembles the 30S preinitiation complex, and initiates translation.

[Figure 2.1 (image not reproduced). Labeled elements: 30S complex; standby site; 16S rRNA binding site; spacing; start; 30S ribosomal subunit footprint; ribosome-binding site and 5′ protein coding sequence.]

Figure 2.1 The ribosome binding site and 5′ protein coding sequence control an mRNA’s translation initiation rate. The RBS region is subdivided into a standby site, a 16S rRNA-binding site, a start codon, and a spacer sequence. The footprint of a bound 30S subunit extends from the beginning of the standby site to about +22 bp, centered on the start codon.

The existence of the standby site was hypothesized to solve a paradox (de Smit and van Duin, 2003): how would a freely diffusing, cytoplasmic 30S subunit have sufficient time to bind to an RBS that has been sequestered by an mRNA secondary structure? As the mRNA’s structure dynamically unfolds and refolds, it will only be single stranded for a span of microseconds. During this brief time, the 30S must diffuse toward the mRNA and bind to it, which would require it to have an association rate constant of 10^10 M^−1 s^−1 or more; however, its actual association rate constant is around 10^7 M^−1 s^−1. The presence of the standby site eliminates this paradox and has been experimentally demonstrated; the 30S binds to a standby site, followed by a nondiffusive downstream slide along the mRNA and assembly at the RBS. mRNAs that contain single-stranded standby sites and sequestered 16S rRNA-binding sites can be translated efficiently; however, if the standby site itself is sequestered by secondary structures (Studer and Joseph, 2006) or bound by sRNAs (Darfeuille et al., 2007), then the mRNA’s translation initiation decreases.

As the 30S complex slides across the mRNA into its preinitiation position, many noncovalent bonds are created and broken. The stability of the preinitiation complex and the translation initiation rate is determined by the energetics of these bonds. The complex’s assembly rate is decreased by the unfolding of mRNA structures that sequester the 16S rRNA binding site, spacer region, start codon, or ribosome footprint region (Studer and Joseph, 2006). These mRNA structures are composed of intramolecular nucleotide base pairings (hydrogen bonds) that form helices, knots, loops, and bulges. The absence of these mRNA structures will increase the
translation initiation rate. Importantly, both the RBS and protein CDSs can participate in these mRNA structures.

The preinitiation complex may disassemble before translation initiation takes place. Stabilizing interactions will decrease this disassembly rate. These interactions include (a) hybridization between the mRNA and the last nine nucleotides of the 30S complex’s 16S rRNA, which is called the anti-Shine–Dalgarno sequence (ASD; Shine and Dalgarno, 1974); (b) hybridization between the tRNA^fMet anticodon and the start codon; and (c) attractive interactions between the mRNA and ribosomal proteins (e.g., S1; Aliprandi et al., 2008; Boni, 1991). Conversely, the preinitiation complex is destabilized when it is forced to stretch or compress itself to occupy the RBS in the required position, analogous to a rigid spring. These distortions occur when the aligned distance between the 16S rRNA-binding site and the start codon deviates from an optimum of five nucleotides (Chen et al., 1994).
4. A Biophysical Model and Optimization Method for RBSs

The RBS Calculator uses a statistical thermodynamic model to predict the translation initiation rate of a protein CDS. Given an RBS and a protein CDS, the model calculates the free energy change during the assembly of the 30S complex onto the mRNA (ΔG_tot). We then use a statistical ensemble approach to relate the protein CDS’s translation initiation rate r to the ΔG_tot. The biophysical model bridges the gap between an RBS sequence and its translation initiation rate, creating a quantitative relationship between a string of letters (As, Gs, Cs, and Us) and a number. The RBS Calculator combines the biophysical model with stochastic optimization to identify a synthetic (nonnatural) RBS sequence that will yield a user-selected translation initiation rate. Importantly, this relationship also depends on the first 35 nucleotides of the protein CDS and the synthetic RBS sequence must be designed with this sequence included.
4.1. Thermodynamics of RNA interactions

Using thermodynamics, the strengths of the intermolecular and intramolecular interactions that govern ribosome binding to mRNA may be calculated. At constant temperature and pressure, the available energy of a chemical species (the maximum amount of internal energy that is convertible to nonmechanical work) is called the Gibbs free energy, G. The change in Gibbs free energy (ΔG) quantifies the strengths of the interactions that cause the system to transition from a well-defined initial molecular state to a well-defined final molecular state. Importantly, the ΔG of a transition is path independent; the ΔG calculation does not depend on the transition rate or the number of intermediate states in between the initial and final state.

The path-independence of thermodynamics allows us to define an arbitrary reference state and calculate the ΔG of a transition from state 1 to state 2 using two independent terms: ΔG_12 = ΔG_2 − ΔG_1, where ΔG_1 = G_1 − G_ref and ΔG_2 = G_2 − G_ref. We are allowed to define the reference state and its energy, but the same reference state must be used in all calculations. For RNA-related ΔG calculations, the reference state is defined as the fully unfolded and unstructured RNA molecule with a free energy of zero (G_ref = 0 kcal/mol).

RNA is single stranded and will fold back onto itself to form secondary structures (helices, loops, bulges, and knots; Bevilacqua and Blose, 2008). Multiple RNA molecules will cofold or hybridize together, forming a combination of intermolecular and intramolecular nucleotide base pairings. The Gibbs free energy of an RNA secondary structure is calculated by a semiempirical model that predicts the individual free energies of a structure’s helices, loops, bulges, and mismatches (Badhwar et al., 2007; Blose et al., 2007; Christiansen and Znosko, 2008; Mathews et al., 1999; Miller et al., 2008; Vecenie et al., 2006; Xia et al., 1998). For short RNA sequences, the model’s predicted free energies are within 5–10% of experimentally measured free energies.

Under equilibrium conditions, a solution of RNA molecules forms a mixture (an ensemble) of secondary structures. The probability of an RNA molecule forming a particular structure with free energy ΔG is proportional to its Boltzmann weight, exp(−ΔG/RT), where T is the system’s temperature and R is the gas constant. The most probable RNA secondary structure has the lowest ΔG (the minimum free energy, MFE). However, there are many suboptimal RNA structures that will coexist with the MFE RNA structure; the most common structures will have energies within 2RT (1.24 kcal/mol at 37 °C) of the MFE ΔG.

Algorithms have been developed to calculate the MFE structure and energy, suboptimal structures and free energies, and base pairing probabilities of folded or cofolded RNA molecules. These algorithms use dynamic programming to implicitly enumerate all possible RNA secondary structures, without explicit calculation of every structure’s free energy, in order to find ones with minimal or near-minimal energies (Dirks et al., 2007; Gruber et al., 2008; Markham and Zuker, 2008; Mathews and Turner, 2006). More recent algorithms can generate RNA structures from the ensemble in proportion to their Boltzmann weight (Mathews, 2006).
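The relationship between free energies and structure probabilities can be illustrated with a short Python sketch. The energies below are hypothetical and the list is treated as if it were the complete ensemble, which a real RNA ensemble never is; the sketch only shows how Boltzmann weights turn free energy differences into relative populations.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol K)

def boltzmann_probabilities(energies, temperature_c=37.0):
    """Map structure free energies (kcal/mol) to equilibrium probabilities."""
    rt = R_KCAL * (temperature_c + 273.15)
    g_min = min(energies)
    # Subtracting the minimum keeps the exponentials well scaled;
    # it cancels in the normalization below.
    weights = [math.exp(-(g - g_min) / rt) for g in energies]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical ensemble: an MFE structure plus three suboptimal structures.
energies = [-10.0, -9.4, -8.8, -6.0]
probs = boltzmann_probabilities(energies)
for g, p in zip(energies, probs):
    print(f"{g:6.1f} kcal/mol  p = {p:.3f}")
```

Structures within about 2RT of the MFE retain appreciable probability, while the structure 4 kcal/mol above the MFE is populated only rarely.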
4.2. A free energy model for ribosome assembly

The thermodynamic model calculates the difference in Gibbs free energy before and after the 30S complex assembles onto an mRNA transcript, denoted by ΔG_tot. The model considers an mRNA subsequence consisting of 35 nucleotides before and after a start codon. Given an mRNA subsequence, five free energy terms are calculated and summed together:

\[ \Delta G_{\mathrm{tot}} = \Delta G_{\mathrm{final}} - \Delta G_{\mathrm{initial}} = \Delta G_{\mathrm{mRNA:rRNA}} + \Delta G_{\mathrm{start}} + \Delta G_{\mathrm{spacing}} - \Delta G_{\mathrm{standby}} - \Delta G_{\mathrm{mRNA}} \tag{2.1} \]

The initial state is the unbound 30S complex and mRNA subsequence. The mRNA subsequence is folded to its MFE secondary structure with a corresponding free energy ΔG_mRNA. We do not include the free energy of the unbound 30S complex in the initial state.

The final state is the 30S complex bound to the mRNA subsequence. The strengths of the participating interactions are quantified by four free energy terms (Fig. 2.2). The ΔG_mRNA:rRNA is the energy released when the last nine nucleotides of the 16S rRNA cofold and hybridize with the mRNA subsequence at the 16S rRNA-binding site. The ΔG_mRNA:rRNA calculation includes both intermolecular nucleotide base pairings between the 16S rRNA and mRNA and intramolecular nucleotide base pairings within the mRNA itself (ΔG_mRNA:rRNA < 0). These intramolecular nucleotide base pairings are mutually exclusive with the ribosome footprint. The ΔG_start is the energy released when the tRNA^fMet’s anticodon hybridizes to the start codon (ΔG_start < 0). The ΔG_standby is the energy released when the standby site is folded (ΔG_standby < 0); accordingly, −ΔG_standby is the amount of energy needed to unfold the standby site. We define the standby site as the four nucleotides upstream of the 16S rRNA-binding site. The ΔG_spacing is an energetic penalty for a nonoptimal distance between the 16S rRNA-binding site and the start codon (ΔG_spacing > 0).

The quantitative relationship between the distance s and ΔG_spacing was experimentally determined and fit to a simple model (Salis et al., 2009). RBS sequences were constructed to have a high affinity 16S rRNA-binding site, minimal secondary structure, and an aligned distance s that was varied from s = 0 to s = 15 nucleotides (TCTAGA A_7 TAAGGAGGT A_s ATG ...).
These sequences are predicted to have the same ΔG_mRNA:rRNA, ΔG_mRNA, ΔG_start, and ΔG_standby free energies. Accordingly, the differences in their translation initiation rates may be directly related to their ΔG_spacing energies as a function of their aligned spacing s:

\[ \Delta G_{\mathrm{spacing}}(s) = \frac{1}{\beta} \log\left(\frac{r_{s=5}}{r_s}\right) \tag{2.2} \]

Here, we define the optimal spacing (s_opt = 5) to have a zero ΔG_spacing. The translation initiation rates of these sequences were measured and their ΔG_spacing energies were calculated and fit to an appropriate equation.
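The step from measured rates to spacing energies can be written out explicitly. The sketch below assumes the proportionality between a translation initiation rate and its total free energy change that Eq. (2.2) implies, with r_s the rate measured at aligned spacing s; the constant C is introduced here only for illustration.

```latex
% Assumed form of the rate-energy relationship (C is an illustrative constant):
r_s = C \exp\left(-\beta\,\Delta G_{\mathrm{tot}}(s)\right)

% The test sequences share every term of Eq. (2.1) except the spacing term, so
\frac{r_{s=5}}{r_s}
  = \exp\left(\beta\left[\Delta G_{\mathrm{tot}}(s) - \Delta G_{\mathrm{tot}}(5)\right]\right)
  = \exp\left(\beta\left[\Delta G_{\mathrm{spacing}}(s) - \Delta G_{\mathrm{spacing}}(5)\right]\right)

% With \Delta G_{\mathrm{spacing}}(5) = 0 by definition, taking the logarithm gives Eq. (2.2):
\Delta G_{\mathrm{spacing}}(s) = \frac{1}{\beta}\,\log\left(\frac{r_{s=5}}{r_s}\right)
```

A sequence translated more slowly than the s = 5 reference therefore has a positive spacing penalty, as required of ΔG_spacing.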
[Figure 2.2 (image not reproduced). Each panel shows the minimum free energy structures of the initial and final states. Energies shown in the figure, in kcal/mol:
A (araC, s = 4): ΔG_initial = −14.2; ΔG_mRNA:rRNA = −6.8; ΔG_standby = −4.6; ΔG_spacing = 0.005; ΔG_start = −1.2; ΔG_final = −3.4; ΔG_tot = +10.8
B (sucB, s = 4): ΔG_initial = −4.3; ΔG_mRNA:rRNA = −0.9; ΔG_standby = 0; ΔG_spacing = 0.005; ΔG_start = −1.2; ΔG_final = −2.1; ΔG_tot = +2.2
C (aceE, s = 6): ΔG_initial = −3.4; ΔG_mRNA:rRNA = −6.2; ΔG_standby = 0; ΔG_spacing = 0.67; ΔG_start = −1.2; ΔG_final = −6.7; ΔG_tot = −3.3
D (eno, s = 5): ΔG_initial = −4.0; ΔG_mRNA:rRNA = −11.2; ΔG_standby = 0; ΔG_spacing = 0.0; ΔG_start = −1.2; ΔG_final = −12.4; ΔG_tot = −8.4]

Figure 2.2 The RBS Calculator’s forward engineering mode generates synthetic ribosome binding sites for the (A) araC, (B) sucB, (C) aceE, and (D) eno protein coding sequences with targeted ΔG_tot energies (translation initiation rates). The minimum free energy structures for the initial (left) and final (right) states are shown with corresponding free energies. The aligned spacing s is also shown.
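The energies printed in Fig. 2.2 provide a worked check of Eq. (2.1). The Python sketch below is an illustration rather than part of the RBS Calculator: it recomputes ΔG_tot for each panel from the five terms. The numerical values are read from the figure, and the function and variable names are hypothetical.

```python
def dg_total(dg_mrna_rrna, dg_start, dg_spacing, dg_standby, dg_mrna):
    """Eq. (2.1): free energy change of 30S assembly, in kcal/mol."""
    dg_final = dg_mrna_rrna + dg_start + dg_spacing - dg_standby
    dg_initial = dg_mrna
    return dg_final - dg_initial

# Energies as printed in Fig. 2.2 (kcal/mol); panel letters follow the caption.
panels = {
    "A (araC)": dict(dg_mrna_rrna=-6.8, dg_start=-1.2, dg_spacing=0.005,
                     dg_standby=-4.6, dg_mrna=-14.2),
    "B (sucB)": dict(dg_mrna_rrna=-0.9, dg_start=-1.2, dg_spacing=0.005,
                     dg_standby=0.0, dg_mrna=-4.3),
    "C (aceE)": dict(dg_mrna_rrna=-6.2, dg_start=-1.2, dg_spacing=0.67,
                     dg_standby=0.0, dg_mrna=-3.4),
    "D (eno)": dict(dg_mrna_rrna=-11.2, dg_start=-1.2, dg_spacing=0.0,
                    dg_standby=0.0, dg_mrna=-4.0),
}

totals = {name: dg_total(**terms) for name, terms in panels.items()}
for name, total in totals.items():
    print(f"{name}: dG_tot = {total:+.1f} kcal/mol")
```

In panel A, unfolding the strongly structured initial state (ΔG_mRNA = −14.2 kcal/mol) dominates the sum and makes ΔG_tot positive, whereas panel D combines a strong 16S rRNA-binding site with a weakly folded mRNA to give the most negative ΔG_tot of the four.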
When the ribosome is stretched (s > 5), the ΔG_spacing has a quadratic behavior, well fit by the equation:

\[ \Delta G_{\mathrm{spacing}} = c_1 \left(s - s_{\mathrm{opt}}\right)^2 + c_2 \left(s - s_{\mathrm{opt}}\right) \tag{2.3} \]

where c1 = 0.048 kcal/mol/nt^2, and c2 = 0.24 kcal/mol/nt. When the ribosome is compressed (s < 5), the ΔG_spacing has a sigmoidal behavior, fit by the equation:

\[ \Delta G_{\mathrm{spacing}} = \frac{c_1}{\left[1 + \exp\left(c_2 \left(s - s_{\mathrm{opt}} + 2\right)\right)\right]^3} \tag{2.4} \]

where c1 = 12.2 kcal/mol and c2 = 2.5 nt^−1. We next describe the steps to calculate the ΔG_tot of an mRNA subsequence.

4.2.1. The free energy calculation of the initial state

Using a dynamic programming algorithm, the mRNA subsequence is folded to its MFE secondary structure with a corresponding free energy ΔG_mRNA. This straightforward calculation may be carried out by the MFE program of NuPACK (Dirks et al., 2007), the RNAfold program of ViennaRNA (Gruber et al., 2008), or the hybrid-ss-min program of UNAFold (Markham and Zuker, 2008).

4.2.2. The free energy calculation of the final state

The free energy calculation of the final state begins with cofolding the last nine nucleotides of the 16S rRNA with the mRNA subsequence. MFE and suboptimal mRNA–rRNA structures are efficiently enumerated with the constraint that nucleotides located in the standby site or ribosome footprint are not allowed to base pair. For each structure, the free energies ΔG_mRNA:rRNA, ΔG_spacing, and ΔG_standby are calculated and summed together. The final state is the mRNA–rRNA structure that minimizes the summation of ΔG_mRNA:rRNA, ΔG_spacing, and ΔG_standby. Using a lookup table, the ΔG_start term is then added to this summation. The ΔG_start is −1.194 kcal/mol for AUG, −0.0748 kcal/mol for GUG, −0.0435 kcal/mol for UUG, and −0.03406 kcal/mol for CUG start codons.

The algorithm finds the physiological 16S rRNA-binding site by identifying the rRNA–mRNA base pairings that will minimize the entire system’s Gibbs free energy, including the ΔG_spacing and ΔG_standby. It is possible that the physiological 16S rRNA-binding site is not the highest affinity one within the mRNA. For example, if a very strong rRNA-binding site (ΔG_mRNA:rRNA = −12 kcal/mol) is located far upstream of the start codon, then its penalty for nonoptimal spacing would be large (ΔG_spacing = +12 kcal/mol), which yields a nonminimal ΔG_final.
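The spacing penalty, the start codon lookup table, and the example of a strong but poorly spaced binding site can be combined in a short Python sketch. It is an illustration under stated assumptions, not the RBS Calculator's implementation: the candidate sites, their energies, and the function names are hypothetical, the compressed branch follows the sigmoidal form c1/[1 + exp(c2(s − s_opt + 2))]^3, and ΔG_standby is set to zero for both candidates.

```python
import math

S_OPT = 5  # optimal aligned spacing, in nucleotides
DG_START = {"AUG": -1.194, "GUG": -0.0748, "UUG": -0.0435, "CUG": -0.03406}

def dg_spacing(s, s_opt=S_OPT):
    """Spacing penalty (kcal/mol): quadratic if stretched, sigmoidal if compressed."""
    d = s - s_opt
    if d == 0:
        return 0.0  # defined as zero at the optimal spacing
    if d > 0:
        return 0.048 * d * d + 0.24 * d
    return 12.2 / (1.0 + math.exp(2.5 * (d + 2))) ** 3

def dg_final(dg_mrna_rrna, s, start_codon="AUG", dg_standby=0.0):
    """Final-state free energy for one candidate 16S rRNA-binding site."""
    return dg_mrna_rrna + dg_spacing(s) - dg_standby + DG_START[start_codon]

# Hypothetical candidates: (label, dG_mRNA:rRNA in kcal/mol, aligned spacing s)
candidates = [
    ("strong, far upstream", -12.0, 19),
    ("weaker, well spaced", -7.0, 5),
]
best = min(candidates, key=lambda c: dg_final(c[1], c[2]))
for label, dg, s in candidates:
    print(f"{label}: dG_final = {dg_final(dg, s):.2f} kcal/mol")
print("selected:", best[0])
```

With these numbers the far-upstream site pays a spacing penalty of about 12.8 kcal/mol, so the weaker but well-spaced site gives the lower ΔG_final and would be chosen. For s = 4, `dg_spacing` evaluates to about 0.005 kcal/mol, the value shown for s = 4 in Fig. 2.2.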
To find the physiological 16S rRNA-binding site with a brute force approach, one would enumerate all potential 16S rRNA-binding sites with a large range in energies, generating millions of possible configurations.
Instead, our algorithm uses an efficient spacing-indexed approach. First, classes of suboptimal mRNA–rRNA structures that have the same aligned spacing s and ΔGspacing penalty, ranging from 0 to +15 kcal/mol, are enumerated. Each class contains MFE and suboptimal mRNA–rRNA structures where the ΔGspacing and ΔGstart are constant, but the ΔGmRNA:rRNA and ΔGstandby vary. The number of structures in each class is relatively small in contrast to the brute-force approach. The enumeration procedure may use the subopt program of NuPACK (Dirks et al., 2007) or the RNAsubopt program of ViennaRNA (Gruber et al., 2008) to generate the MFE and suboptimal mRNA–rRNA structures.

Then, for each structure in each class, the ΔGmRNA:rRNA and ΔGstandby are calculated in a multistep procedure. The positions of the 16S rRNA–mRNA base pairs that are closest and farthest from the start codon are, respectively, labeled as (x1, y1) and (x2, y2). The borders of the 16S rRNA-binding site are between [y2 − x2 + 1] and [y1 − x1 + 9], and the aligned spacing is s = start − [y1 − x1 + 9]. Each mRNA–rRNA structure is allowed to form intramolecular base pairs that do not sequester its standby site, its 16S rRNA-binding site, or its ribosome footprint. The mRNA subsequence that begins at position 1 and ends at ([y2 − x2 + 1] − Nstandby) is folded to its MFE structure, preventing the standby site from participating in base pairings. The number of nucleotides in the standby site is defined as Nstandby = 4. Then, the mRNA subsequence that begins at start + footprint and ends at start + 35 is folded to its MFE structure. The footprint is the number of ribosome-bound nucleotides after the start codon. In current practice, we assume that the footprint is large enough to prevent significant base pairing in the protein CDS; this assumption only applies to the final state and may change in the future.
Finally, we add these intramolecular folding energies to ΔGmRNA:rRNA so that it quantifies the total intermolecular and intramolecular interactions between the mRNA and rRNA, including our folding constraints. By introducing our constraints during the folding process, the ΔGmRNA:rRNA will include the ΔGstandby penalty. We may also calculate the individual value of the ΔGstandby penalty by comparing the free energies with and without the standby site folding constraints. For clarity, Eq. (2.1) explicitly includes the individual ΔGmRNA:rRNA and ΔGstandby values. The total Gibbs free energy for each structure in each spacing-indexed class is summed according to ΔGmRNA:rRNA + ΔGstart + ΔGspacing − ΔGstandby. The mRNA–rRNA structure that minimizes the total Gibbs free energy is identified and labeled as the final state.

4.2.3. Considerations
a. The thermodynamic model assumes a two-state system where both initial and final states have reached equilibrium. Based on kinetic studies of RNA folding (Chen, 2008), some highly structured mRNAs may exhibit long-lived intermediate states that may invalidate this assumption.
b. The thermodynamic model does not include nonspecific protein–mRNA interactions. For example, the ribosome likely has a strong attractive and nonspecific interaction with all mRNAs. This sequence-independent interaction would appear as a large, negative constant in our free energy model; instead, it is currently incorporated into the proportionality constant that lumps together all sequence-independent effects.
c. The minimum possible value for ΔGtot is −17.2 kcal/mol, yielding the maximum possible translation initiation rate. The translation initiation rate becomes negligibly small when ΔGtot is greater than +25 kcal/mol.
d. The mRNA subsequence begins at the mRNA transcript's +1 nucleotide if the start codon's position is less than 35 nucleotides. In this case, the +1 nucleotide is considered "dangling" and an energetic bonus for dangling nucleotides is included in the free energy calculations of ΔGmRNA:rRNA and ΔGmRNA.
e. The model for calculating ΔGspacing has the highest accuracy within the range 2 < s < 12 nucleotides, where s is the aligned distance between the 16S rRNA-binding site and the start codon. Outside this range, the translation initiation rate was sufficiently decreased to make precise experimental measurements difficult, leading to uncertainty during model fitting.
f. Multiple configurations of the initial or final states may all have identical minimum free energies (e.g., degenerate structures). In these cases, the calculation of ΔGtot will produce the same result and predict the same translation initiation rate.
g. According to statistical thermodynamics, a system at equilibrium will occupy a mixture of configurations; the majority will have energies between MFE and MFE + 2RT. The thermodynamic model may be improved by sampling the configurations and computing the statistics of their ΔGtot free energies. The improvement will be limited to about 1–2 kcal/mol.
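The final-state selection of Section 4.2.2 can be sketched as a minimum over candidate structures. The sketch below assumes that the candidate energies have already been produced by a folding program, and that ΔGtot is the final-state sum minus the initial-state ΔGmRNA, as implied by the two-state description; the `Candidate` class and all example energies are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One constrained mRNA-rRNA structure (energies in kcal/mol)."""
    spacing: int         # aligned spacing s (nt); indexes the class
    dg_mrna_rrna: float  # hybridization plus allowed intramolecular folding
    dg_spacing: float    # spacing penalty for this class
    dg_standby: float    # standby-site term


def final_state_energy(c, dg_start):
    # Sum stated in the text for each structure in each class.
    return c.dg_mrna_rrna + dg_start + c.dg_spacing - c.dg_standby


def dg_tot(candidates, dg_start, dg_mrna):
    """Return (dG_tot, chosen structure) for a set of candidates."""
    best = min(candidates, key=lambda c: final_state_energy(c, dg_start))
    # Assumed two-state difference: final state minus initial state.
    return final_state_energy(best, dg_start) - dg_mrna, best


if __name__ == "__main__":
    # Hypothetical numbers: a strong but badly spaced site versus a
    # weaker, well-spaced one.
    sites = [
        Candidate(spacing=18, dg_mrna_rrna=-12.0, dg_spacing=12.0, dg_standby=0.0),
        Candidate(spacing=6, dg_mrna_rrna=-8.0, dg_spacing=0.5, dg_standby=0.0),
    ]
    total, chosen = dg_tot(sites, dg_start=-1.194, dg_mrna=-5.0)
    print(f"chosen spacing = {chosen.spacing} nt, dG_tot = {total:.3f} kcal/mol")
```

With these made-up values the well-spaced site wins even though its hybridization energy is weaker, which mirrors the point made in the text about the physiological binding site not being the highest-affinity one.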
4.3. A statistical thermodynamic model of translation initiation
We apply statistical thermodynamics to predict the translation initiation rates of all mRNAs inside the cell according to their Gibbs free energies of ribosome binding, ΔGtot. Our system of interest is the cellular pool of ribosomes and mRNA transcripts. Ribosomes and mRNA transcripts are dynamically produced and degraded, but during the exponential phase of growth, their numbers will fluctuate around an average. The system achieves a nonequilibrium steady-state condition called detailed balance, where statistical thermodynamics may be validly applied.
For each mRNA transcript inside the cell, we describe the 30S ribosomal subunit's binding and assembly according to the following association and dissociation reactions:

$$\mathrm{mRNA}_i + 30\mathrm{S} \rightleftharpoons \mathrm{mRNA}_i{::}30\mathrm{S} \tag{2.5}$$

where the index i = 1 ... N enumerates over all transcribed protein CDSs with their corresponding RBS sequences. At this point, we make our first assumption.

Assumption #1: The pool of ribosomes and mRNAs remains at chemical equilibrium.

During chemical equilibrium, ribosomes dynamically bind to mRNAs, initiate translation, or dissociate from the mRNA; however, the average number of free ribosomes, free mRNAs, or mRNA–ribosome complexes remains constant. The number of ribosomes R, mRNAs m_i, and 30S–mRNA complexes C_i are then related according to

$$C_i = m_i\,R\,\exp(-\beta\,\Delta G_i) \tag{2.6}$$

where ΔG_i is the change in Gibbs free energy before and after the 30S complex of the ribosome assembles onto the ith protein CDS of an mRNA transcript. The total amount of 30S complex R_tot is the sum of the free and bound forms, which is

$$R_{\mathrm{tot}} = R + \sum_j C_j = R\left(1 + \sum_j m_j \exp(-\beta\,\Delta G_j)\right) \tag{2.7}$$

Equation (2.7) may be rearranged to give the free amount of 30S complex:

$$R = \frac{R_{\mathrm{tot}}}{1 + \sum_j m_j \exp(-\beta\,\Delta G_j)} \tag{2.8}$$

Substituting Eq. (2.8) into Eq. (2.6) and rearranging, we obtain the following relationship between the number of 30S–mRNA complexes C_i and the assembly reaction's Gibbs free energy change:

$$C_i = \frac{m_i\,R_{\mathrm{tot}}\,\exp(-\beta\,\Delta G_i)}{1 + \sum_j m_j \exp(-\beta\,\Delta G_j)} \tag{2.9}$$

Here, we make our second assumption.

Assumption #2: The translation initiation rate of a protein CDS is proportional to the number of assembled 30S–mRNA complexes.

This assumption allows us to rewrite Eq. (2.9) in terms of a relative translation initiation rate r:

$$r_i \propto \frac{m_i\,R_{\mathrm{tot}}\,\exp(-\beta\,\Delta G_i)}{1 + \sum_{j=1}^{N} m_j \exp(-\beta\,\Delta G_j)} \tag{2.10}$$
where the j summation is performed over all transcribed protein CDSs. Equation (2.10) describes the relative translation initiation rates of the "ribosome ensemble" (the pool of ribosomes interacting with the pool of mRNAs inside the cell) in terms of the Gibbs free energy changes of their individual assembly reactions. It indicates that the translation initiation rate of the ith protein CDS is (a) proportional to the number of mRNA transcripts and ribosomes, (b) higher when the assembly reaction possesses a more negative Gibbs free energy change, and (c) lower when a competing assembly reaction (j ≠ i) has a more negative Gibbs free energy change. Its denominator quantifies the portion of the total protein synthesis rate that is defined by the cell's genome and transcriptome.

Equation (2.10) also allows us to quantify the effect of expressing new proteins on the cell's total protein synthesis rate. Without any cellular modifications, the total protein synthesis rate is a generally large number. When a new protein is modestly expressed inside the cell, the total synthesis rate and the translation rate of other mRNAs are generally unaffected. However, when one or more new proteins are highly expressed, Eq. (2.10)'s denominator will greatly increase. While the cell contains feedback loops to increase protein synthesis capacity under stress, a sufficiently large demand will result in lower translation rates for all mRNAs inside the cell. The resulting slowdown in global protein production will have many physiological effects, including a longer cell doubling time. With these potential physiological changes in mind, we may simplify Eq. (2.10) by approximating its denominator as a constant:

$$r_i = K \exp(-\beta\,\Delta G_i) \tag{2.11}$$
where the proportionality constant K now includes the denominator. Equation (2.11) has two important conclusions: (i) a plot of the natural logarithm of r versus ΔGtot will be linear and (ii) the linear plot's slope is −β. Experimentally, we confirmed the validity of Eq. (2.11) and measured β = 0.45 ± 0.05 mol/kcal (Salis et al., 2009; Fig. 2.3). We choose K = 2500 so that physiologically possible translation initiation rates r will vary between 0.1 and 100,000. According to statistical thermodynamics, a system at thermal equilibrium with its surroundings will have β = (RT)⁻¹. According to the experimental data, the system's apparent temperature is about 1100 K, which is much higher than the actual system temperature of 310 K. Understanding this discrepancy will require further experimentation; however, our calculation of ΔGtot assumes a two-state model at equilibrium. It is not uncommon for complicated thermodynamic systems to exhibit greater effective temperatures when modeled as a simpler system.

Figure 2.3 The translation initiation rates of 29 synthetic ribosome binding sites are experimentally compared to the RBS Calculator's predictions. According to the theory, a linear relationship between the calculated ΔGtot and the natural logarithm of the measured protein expression level is expected. This theory is validated with an R² = 0.84. The average error is 1.82 kcal/mol (0.82RT). [Plot axes: fluorescence (au) versus predicted ΔGtot (kcal/mol); frequency versus error |ΔΔG| (RT).]
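The relationships in Section 4.3 can be checked numerically. The following sketch evaluates Eq. (2.11) with the quoted β and K, recovers the apparent temperature from β = (RT)⁻¹, and illustrates the competition effect in Eq. (2.10). The function names and the example transcript numbers are illustrative assumptions, and `ensemble_rates` returns rates only up to the unspecified proportionality constant.

```python
import math

R_GAS = 1.987204e-3  # gas constant in kcal/(mol K)
BETA = 0.45          # mol/kcal, measured value quoted in the text
K = 2500.0           # proportionality constant chosen in the text


def initiation_rate(dg_tot, beta=BETA, k=K):
    """Eq. (2.11): relative translation initiation rate."""
    return k * math.exp(-beta * dg_tot)


def apparent_temperature(beta=BETA):
    """Temperature (K) implied by beta = 1 / (R T)."""
    return 1.0 / (R_GAS * beta)


def ensemble_rates(m, dg, r_tot, beta=BETA):
    """Eq. (2.10), up to a constant factor, for transcripts competing
    for a shared ribosome pool.

    m: transcript numbers m_j; dg: free energies dG_j (kcal/mol).
    """
    weights = [mj * math.exp(-beta * gj) for mj, gj in zip(m, dg)]
    denom = 1.0 + sum(weights)
    return [r_tot * w / denom for w in weights]


if __name__ == "__main__":
    print(f"apparent temperature: {apparent_temperature():.0f} K")
    for dg in (-10.0, 0.0, 10.0, 25.0):
        print(f"dG_tot = {dg:+6.1f} kcal/mol -> r = {initiation_rate(dg):.3g}")
    alone = ensemble_rates([1.0], [-5.0], r_tot=100.0)[0]
    crowded = ensemble_rates([1.0, 50.0], [-5.0, -10.0], r_tot=100.0)[0]
    print(f"rate alone = {alone:.3g}, with a strong competitor = {crowded:.3g}")
```

With β = 0.45 mol/kcal the implied temperature comes out near 1100 K, matching the value quoted in the text, and adding a highly expressed competitor lowers the first transcript's rate because the shared denominator grows.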
4.4. Optimization of synthetic RBSs
Synthetic RBSs are nonnatural sequences that are optimized to yield a targeted translation initiation rate. The target translation initiation rate and a selected protein CDS are inputted into a stochastic optimization method, called simulated annealing, which then performs iterative rounds of mutation, calculation, and selection until it identifies an RBS sequence that satisfies all constraints. The target translation initiation rate is converted into a target ΔGtot (ΔGtarget) using Eq. (2.11) and an experimentally measured β = 0.45 mol/kcal. The mRNA subsequence is initialized by concatenating a randomly generated 35 nucleotide RBS sequence to the inputted protein CDS.
Its ΔGtot is calculated according to Eq. (2.1) and its objective function is evaluated as Oold = |ΔGtot − ΔGtarget|. New mRNA subsequences are generated by inserting, deleting, or replacing nucleotides in the RBS, followed by calculation of their ΔGtot, and evaluation of their objective function Onew = |ΔGtot − ΔGtarget|. The mutation is then accepted or rejected according to the Metropolis criterion (Metropolis et al., 1953), using an annealing temperature TSA, and three additional sequence constraints. The Metropolis criterion compares the objective functions of the current and new mRNA subsequences and calculates a probability of accepting the mutation according to

$$P = \min\left(1,\ \exp\left(\frac{O_{\mathrm{old}} - O_{\mathrm{new}}}{T_{\mathrm{SA}}}\right)\right) \tag{2.12}$$

If the new mRNA subsequence is closer to satisfying the target and constraints, then it is immediately accepted with probability one. Otherwise, it is conditionally accepted with probability P < 1. The annealing temperature TSA is dynamically adjusted to maintain a conditional acceptance ratio between 5% and 20%.

Sequence constraints are added to prevent the optimization algorithm from generating a synthetic RBS sequence that may invalidate the thermodynamic model's assumptions. If a mutated sequence fails to satisfy these rules, then the mutation is discarded. First, the mutated RBS sequence may not contain start codons. Second, the energy required to unfold the 16S rRNA-binding site must be less than 6 kcal/mol. Third, the MFE structure of the mRNA subsequence may not contain base pairings that are farther than 35 nucleotides apart. The first rule prevents translation from initiating at additional start codons. The second rule prevents the formation of potentially long-lived intermediate states that may result when the 16S rRNA-binding site is strongly sequestered by secondary structure. The third rule encourages the formation of local secondary structures that reach equilibrium quickly. Conversely, nonlocal secondary structures may encounter kinetic traps that prevent them from reaching equilibrium.
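The optimization loop of Section 4.4 can be sketched as follows. This is a simplified illustration rather than the published implementation: the free energy model is passed in as a callable `dg_tot` (a real one requires an RNA folding program), the annealing temperature is held fixed instead of being adjusted dynamically, only the start-codon constraint is enforced, and the tolerance `tol` and the toy energy function in the demonstration are hypothetical.

```python
import math
import random

BETA, K = 0.45, 2500.0  # values quoted in the text
START_CODONS = ("AUG", "GUG", "UUG", "CUG")


def target_dg(rate):
    """Invert Eq. (2.11): dG_target = -ln(rate / K) / beta."""
    return -math.log(rate / K) / BETA


def accept_probability(o_old, o_new, t_sa):
    """Metropolis rule, Eq. (2.12)."""
    if o_new <= o_old:
        return 1.0  # improvements are always accepted
    return math.exp((o_old - o_new) / t_sa)


def has_start_codon(rbs):
    return any(codon in rbs for codon in START_CODONS)


def mutate(rbs, rng):
    """Insert, delete, or replace one nucleotide at a random position."""
    i = rng.randrange(len(rbs))
    op = rng.choice(("insert", "delete", "replace"))
    nt = rng.choice("ACGU")
    if op == "insert":
        return rbs[:i] + nt + rbs[i:]
    if op == "delete" and len(rbs) > 1:
        return rbs[:i] + rbs[i + 1:]
    return rbs[:i] + nt + rbs[i + 1:]


def anneal(rbs, dg_tot, dg_target, t_sa=1.0, steps=2000, tol=0.25, seed=0):
    """Return (sequence, objective) after at most `steps` mutations."""
    rng = random.Random(seed)
    o_old = abs(dg_tot(rbs) - dg_target)
    for _ in range(steps):
        if o_old <= tol:
            break
        new = mutate(rbs, rng)
        if has_start_codon(new):  # first sequence constraint
            continue
        o_new = abs(dg_tot(new) - dg_target)
        if rng.random() < accept_probability(o_old, o_new, t_sa):
            rbs, o_old = new, o_new
    return rbs, o_old


if __name__ == "__main__":
    # Toy stand-in for the thermodynamic model (NOT a real energy function).
    def toy_dg(seq):
        return 10.0 - 2.0 * seq.count("G")

    goal = target_dg(10000.0)
    seq, err = anneal("A" * 35, toy_dg, goal)
    print(f"target dG = {goal:.2f} kcal/mol, final |error| = {err:.2f}")
    print(seq)
```

Passing the energy model as a callable keeps the annealing logic separate from the folding calculation, so the same loop can be exercised with a cheap stand-in before a full model is attached.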
4.5. Accuracy and limitations
The thermodynamic model's predictions have been experimentally tested on 119 natural and synthetic mRNA sequences (Salis et al., 2009). On average, a synthetic RBS sequence may be designed to achieve a targeted translation initiation rate to within a factor of 2.3 over a range of 100,000-fold (Fig. 2.3). The average error in the ΔGtot calculation is 1.82 kcal/mol (0.82RT).
The model's ΔGtot error is well fit by a one-sided Gaussian distribution with a variance of 2.44 kcal/mol (1.1RT). However, there are extreme sequences where the model's error exceeds 6 kcal/mol (2.7RT); in these cases, molecular interactions that play an important role in altering the translation initiation rate may be absent from the model. The thermodynamic model has the following limitations: (a) the model does not include the interaction between the mRNA and ribosomal S1 protein; (b) the model does not consider the presence of antisense RNA- or RNAse-binding sites; (c) the model assumes that multiple start codons are independently translated, ignoring the potential for coupling between closely spaced start codons; and (d) the model ignores the potential for translational coupling between protein CDSs in an operon, which likely occurs when an RBS and upstream protein CDS overlap.
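As a consistency check, the fold error and the RT-scaled values quoted in this section appear to follow from Eq. (2.11) if the "RT" units are taken to mean the apparent thermal energy 1/β with β = 0.45 mol/kcal. That reading is an assumption made here, not a statement from the text:

```latex
\[
\frac{r_{\text{measured}}}{r_{\text{predicted}}}
  = \exp\bigl(\beta\,\lvert\Delta\Delta G\rvert\bigr)
  = \exp\bigl(0.45~\text{mol/kcal} \times 1.82~\text{kcal/mol}\bigr)
  = \exp(0.819) \approx 2.3
\]
\[
\beta \times 1.82 \approx 0.82, \qquad
\beta \times 2.44 \approx 1.1, \qquad
\beta \times 6 = 2.7
\]
```

Under this reading, an average error of 1.82 kcal/mol corresponds to the stated factor of 2.3 in translation initiation rate.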
5. Precise Measurements of Fluorescent Protein Expression Levels
Fluorescent proteins are versatile reporters that enable the measurement of in vivo protein expression levels. The following protocol describes the use of spectrophotometers and/or flow cytometry to record fluorescent protein expression levels from E. coli cultures in 96-well microplate format. Importantly, our protocol uses long culture times of 24–36 h to achieve steady-state conditions so that the average number of proteins per cell is constant over time. This protocol improves the measurement's precision and its day-to-day and lab-to-lab reproducibility, and enables a more accurate comparison between model predictions and experimental data.
5.1. Protocol
Equipment: a spectrophotometer with monochromators and incubation capability (e.g., a TECAN M1000); a flow cytometer with appropriate lasers and detectors for measuring selected fluorescent proteins in a high-throughput microplate format (e.g., a BD LSRII or Fortessa).
a. A standard (or deep) 96-well microplate containing 200 µL (or 1 mL) Luria–Bertani (LB) media (10 g/L tryptone, 5 g/L yeast extract, 10 g/L NaCl) and selective antibiotic is inoculated from single colonies. A nonfluorescent (white) cell culture is also inoculated. Cultures are grown overnight at 37 °C with 250 rpm orbital shaking to an optical cell density (OD600) of 2.0.
b. A fresh 96-well transparent, flat bottom microplate is filled with 198 µL M9 minimal media (1× M9 salts: 6.8 g/L Na2HPO4, 3 g/L KH2PO4, 0.5 g/L NaCl, 1 g/L NH4Cl; 2 mM MgSO4, 100 µM CaCl2; selective antibiotic; and 0.4% glucose, adjusted to a pH of 7.4). Microplate wells are inoculated from the overnight cultures using a 1:100 dilution. The microplate is placed inside the spectrophotometer and incubated at 37 °C with 250 rpm orbital shaking.
c. The spectrophotometer records OD600 and fluorescence (FLU) measurements every 10 min. These measurements are used to calculate cell growth rates and steady-state bulk FLU per OD600.
d. Once a culture reaches an OD600 between 0.15 and 0.20, 10–20 µL samples of each culture are transferred to fresh transparent, flat bottom microplates containing 180–190 µL prewarmed M9 minimal media and selective antibiotic (a 1:5–1:10 dilution). The new microplate is placed inside the spectrophotometer and incubated at 37 °C with 250 rpm orbital shaking.
e. Additional 10 µL samples of each culture are transferred to a round-bottom microplate containing 190 µL PBS and 2 mg/mL kanamycin for flow cytometry measurements. The excess kanamycin stops bacterial protein production.
f. Repeat steps c, d, and e. Steady-state conditions are not reached until about 4–10 h of culture time. This media replacement strategy can be continued until sufficient protein expression level data have been gathered. Typically, at least three to four serial dilutions are performed for a total culture time of 24–36 h.
5.2. Considerations
a. The M9 minimal media is supplemented with 0.05 g/L leucine when using E. coli DH10B cells, which have a leucine auxotrophic phenotype.
b. The M9 minimal media may be replaced with another defined media, such as MOPS minimal media or yeast synthetic defined (SD) media. The key requirement is that the culture's growth rate remains constant between media batches. The formulation of media that use tryptone, yeast extract, or peptide mixtures will vary greatly between batches.
c. Only the inner 60 wells of a 96-well microplate should be used for cell culture. During an 8- to 12-h culture, evaporation will cause the liquid levels in the outer wells to drop significantly. The outer wells should be filled with blank media to buffer this effect.
d. Both the spectrophotometer and the flow cytometer use a gain parameter to convert photon emission counts into digitized data. It is essential that this gain parameter remain constant throughout all experiments. Some machines offer dynamic control of the gain to maximize FLU sensitivity; this dynamic control should be turned off. During preliminary experiments, the gain parameter may be optimized such that (i) the background FLU of media or white cells is twofold higher than detector noise, and (ii) the data from cell cultures expressing the highest protein expression level are twofold less than the machine's digital overflow value. The optimal gain parameter depends on fluorescent protein brightness and detector path length.
e. The spectrophotometer records cell optical density (OD) and FLU data of cell culture samples, white cells, and blank media over time. For each time point, the average bulk fluorescence per cell (FLPC) is calculated according to

$$\mathrm{FLPC}_{\mathrm{sample}} = \frac{\mathrm{FLU}_{\mathrm{sample}} - \mathrm{FLU}_{\mathrm{media}}}{\mathrm{OD}_{\mathrm{sample}} - \mathrm{OD}_{\mathrm{media}}} - \frac{\mathrm{FLU}_{\mathrm{white}} - \mathrm{FLU}_{\mathrm{media}}}{\mathrm{OD}_{\mathrm{white}} - \mathrm{OD}_{\mathrm{media}}} \tag{2.13}$$

which correctly subtracts the background OD and accounts for background FLU from both blank media and white cell sources. This measurement is useful when the single-cell FLU distribution is unimodal (single-peaked) and when stochastic contributions to gene expression are not relevant.
f. The flow cytometer records forward scatter (FSC), side scatter (SSC), and FLU data for single cells taken from cell culture samples or white cells. At least 100,000 single-cell measurements should be recorded. Cell debris and clumped cells are eliminated from the data by discarding events ("gating") unless they satisfy the following criteria: (i) the FLU is greater than zero; (ii) the FSC to SSC ratio is greater than 0.50 and less than 1.50; and (iii) the FSC and SSC values are inside a circle, drawn with its center located at the average FSC and SSC data and with a radius of 10,000. The circle's radius may be adjusted according to the flow cytometer's gain parameter. Automated gating may be performed using MATLAB and the fca_readfcs script (Laszlo Balkay). The arithmetic mean (not geometric) of the FLU distribution should be used to calculate the average FLU per single cell. The variance, kurtosis, and skewness of the distribution, respectively, quantify its variation, peakedness (heavy tailedness), and asymmetry.
g. A cell culture has reached steady state when (i) the rate of change of FLPCsample over time is approximately zero for a sufficiently long time, or (ii) the distribution of single-cell FLU does not significantly change between two time points. Once steady-state conditions have been reached, the recorded FLPCsample data are time averaged to reduce measurement noise and the FLU distribution's statistics are calculated.
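The data reduction in considerations e and f can be sketched in plain Python. This is a minimal illustration under stated assumptions: events are `(FSC, SSC, FLU)` tuples, the gating circle is centered on the mean of all recorded events (the text does not say whether the mean is taken before or after the other filters), and the guard against non-positive SSC is added here only to avoid division by zero.

```python
import math


def flpc(flu_sample, od_sample, flu_white, od_white, flu_media, od_media):
    """Eq. (2.13): background-corrected bulk fluorescence per cell."""
    sample = (flu_sample - flu_media) / (od_sample - od_media)
    white = (flu_white - flu_media) / (od_white - od_media)
    return sample - white


def gate(events, radius=10_000.0):
    """Keep (fsc, ssc, flu) events that pass criteria (i)-(iii)."""
    mean_fsc = sum(e[0] for e in events) / len(events)
    mean_ssc = sum(e[1] for e in events) / len(events)
    kept = []
    for fsc, ssc, flu in events:
        if flu <= 0 or ssc <= 0:           # (i), plus a divide-by-zero guard
            continue
        if not 0.50 < fsc / ssc < 1.50:    # (ii) FSC/SSC ratio window
            continue
        if math.hypot(fsc - mean_fsc, ssc - mean_ssc) > radius:  # (iii)
            continue
        kept.append((fsc, ssc, flu))
    return kept


def mean_flu(events):
    """Arithmetic mean of single-cell fluorescence."""
    return sum(e[2] for e in events) / len(events)


if __name__ == "__main__":
    print(flpc(1100.0, 0.6, 150.0, 0.6, 100.0, 0.1))
    cells = [
        (20000, 20000, 500), (21000, 19000, 700),  # typical single cells
        (20000, 20000, -5),                        # non-positive FLU
        (20000, 5000, 300),                        # FSC/SSC ratio too high
        (60000, 50000, 900),                       # far from the mean
    ]
    kept = gate(cells)
    print(len(kept), mean_flu(kept))
```

In practice the same filters would be applied to arrays read from FCS files; the tuple-based version is used here only to keep the example free of external dependencies.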
ACKNOWLEDGMENTS We would like to thank Joseph Wade (Wadsworth Center) for an insightful discussion on genome-scale translation rates. This work was supported by a DARPA Young Faculty Award.
REFERENCES
Aliprandi, P., Sizun, C., et al. (2008). S1 ribosomal protein functions in translation initiation and ribonuclease RegB activation are mediated by similar RNA-protein interactions. J. Biol. Chem. 283(19), 13289.
Badhwar, J., Karri, S., et al. (2007). Thermodynamic characterization of RNA duplexes containing naturally occurring 1 × 2 nucleotide internal loops. Biochemistry 46(50), 14715–14724.
Bevilacqua, P., and Blose, J. (2008). Structures, kinetics, thermodynamics, and biological functions of RNA hairpins. Phys. Chem. 59(1), 79.
Blose, J., Manni, M., et al. (2007). Non-nearest-neighbor dependence of the stability for RNA bulge loops based on the complete set of group I single-nucleotide bulge loops. Biochemistry 46(51), 15123–15135.
Boni, I. (1991). Ribosome-messenger recognition: mRNA target sites for ribosomal protein S1. Nucleic Acids Res. 19(1), 155.
Bujara, M., and Panke, S. (2010). Engineering in complex systems. Curr. Opin. Biotechnol. 21, 586–591.
Chen, S. (2008). RNA folding: Conformational statistics, folding kinetics, and ion electrostatics. Annu. Rev. Biophys. 37, 197.
Chen, H., Bjerknes, M., et al. (1994). Determination of the optimal aligned spacing between the Shine-Dalgarno sequence and the translation initiation codon of Escherichia coli mRNAs. Nucleic Acids Res. 22(23), 4953.
Christiansen, M., and Znosko, B. (2008). Thermodynamic characterization of the complete set of sequence symmetric tandem mismatches in RNA and an improved model for predicting the free energy contribution of sequence asymmetric tandem mismatches. Biochemistry 47(14), 4329–4336.
Danino, T., Mondragón-Palomino, O., et al. (2010). A synchronized quorum of genetic clocks. Nature 463(7279), 326–330.
Darfeuille, F., Unoson, C., et al. (2007). An antisense RNA inhibits translation by competing with standby ribosomes. Mol. Cell 26(3), 381–392.
de Smit, M., and van Duin, J. (2003).
Translational standby sites: How ribosomes may deal with the rapid folding kinetics of mRNA. J. Mol. Biol. 331(4), 737–743.
Dirks, R., Bois, J., et al. (2007). Thermodynamic analysis of interacting nucleic acid strands. SIAM Rev. 49(1), 65.
Ellis, T., Wang, X., et al. (2009). Diversity-based, model-guided construction of synthetic gene networks with predicted functions. Nat. Biotechnol. 27(5), 465–471.
Gruber, A., Lorenz, R., et al. (2008). The Vienna RNA websuite. Nucleic Acids Res. 36(Web Server issue), W70.
Grunberg, R., and Serrano, L. (2010). Strategies for protein synthetic biology. Nucleic Acids Res. 38, 2663–2675.
Holtz, W., and Keasling, J. (2010). Engineering static and dynamic control of synthetic pathways. Cell 140(1), 19–23.
Hüttenhofer, A., and Noller, H. (1994). Footprinting mRNA-ribosome complexes with chemical probes. EMBO J. 13(16), 3892.
Khalil, A., and Collins, J. (2010). Synthetic biology: Applications come of age. Nat. Rev. Genet. 11(5), 367–379.
Kozak, M. (1999). Initiation of translation in prokaryotes and eukaryotes. Gene 234(2), 187–208.
Laursen, B., Sorensen, H., et al. (2005). Initiation of protein synthesis in bacteria. Microbiol. Mol. Biol. Rev. 69(1), 101.
Lu, T., Khalil, A., et al. (2009). Next-generation synthetic gene networks. Nat. Biotechnol. 27(12), 1139–1150.
Markham, N., and Zuker, M. (2008). Software for nucleic acid folding and hybridization. Methods Mol. Biol. 453, 3–31.
Mathews, D. (2006). Revolutions in RNA secondary structure prediction. J. Mol. Biol. 359(3), 526–532.
Mathews, D., and Turner, D. (2006). Prediction of RNA secondary structure by free energy minimization. Curr. Opin. Struct. Biol. 16(3), 270–278.
Mathews, D., Sabina, J., et al. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288(5), 911–940.
Metropolis, N., Rosenbluth, A., et al. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087.
Miller, S., Jones, L., et al. (2008). Thermodynamic analysis of 5′ and 3′ single- and 3′ double-nucleotide overhangs neighboring wobble terminal base pairs. Nucleic Acids Res. 36(17), 5652.
Na, D., Kim, T., et al. (2010). Construction and optimization of synthetic pathways in metabolic engineering. Curr. Opin. Microbiol. 13(3), 363–370.
Purnick, P., and Weiss, R. (2009). The second wave of synthetic biology: From modules to systems. Nat. Rev. Mol. Cell Biol. 10(6), 410–422.
Ramakrishnan, V. (2002). Ribosome structure and the mechanism of translation. Cell 108(4), 557–572.
Salis, H. M., Mirsky, E. A., et al. (2009). Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 27(10), 946–950.
Shine, J., and Dalgarno, L. (1974). The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: Complementarity to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. USA 71(4), 1342.
Stricker, J., Cookson, S., et al. (2008). A fast, robust and tunable synthetic gene oscillator. Nature 456(7221), 516–519.
Studer, S., and Joseph, S. (2006). Unfolding of mRNA secondary structure by the bacterial translation initiation complex. Mol. Cell 22(1), 105–115.
Tabor, J. J., Salis, H. M., et al. (2009). A synthetic genetic edge detection program. Cell 137(7), 1272–1281.
Tenson, T., DeBlasio, A., et al. (1996). A functional peptide encoded in the Escherichia coli 23S rRNA. Proc. Natl. Acad. Sci. USA 93(11), 5641.
Vecenie, C., Morrow, C., et al. (2006). Sequence dependence of the stability of RNA hairpin molecules with six nucleotide loops. Biochemistry 45(5), 1400–1407.
Welch, M., Villalobos, A., et al. (2009). You're one in a googol: Optimizing genes for protein expression. J. R. Soc. Interface 6(Suppl. 4), S467.
Xia, T., SantaLucia, J., Jr., et al. (1998). Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37(42), 14719–14735.
CHAPTER THREE

Designing Genes for Successful Protein Expression
Mark Welch, Alan Villalobos, Claes Gustafsson, and Jeremy Minshull

Contents
1. Introduction
2. Gene Design Software
3. General Sequence Parameters Affecting Protein Expression
3.1. Initiation of translation
3.2. Codon bias
3.3. mRNA structure and translational elongation
4. Protein-Specific Factors Providing Additional Complexity
4.1. Protein toxicity
4.2. Transmembrane proteins
4.3. cis-Regulatory regions
5. Conclusions
References
Abstract
DNA sequences are now far more readily available in silico than as physical DNA. De novo gene synthesis is an increasingly cost-effective method for building genetic constructs, and effectively removes the constraint of basing constructs on extant sequences. This allows scientists and engineers to experimentally test their hypotheses relating sequence to function. Molecular biologists, and now synthetic biologists, are characterizing and cataloging genetic elements with specific functions, aiming to combine them to perform complex functions. However, the most common purpose of synthetic genes is for the expression of an encoded protein. The huge number of different proteins makes it impossible to characterize and catalog each functional gene. Instead, it is necessary to abstract design principles from experimental data: data that can be generated by making predictions followed by synthesizing sequences to test those predictions. Because of the degeneracy of the genetic code, design of gene sequences to encode proteins is a high-dimensional problem, so there is no single simple formula to guarantee success. Nevertheless, there are several straightforward steps that can be taken to greatly increase the probability that a designed sequence will result in expression of the encoded protein. In this chapter, we discuss gene sequence parameters that are important for protein expression. We also describe algorithms for optimizing these parameters, and troubleshooting procedures that can be helpful when initial attempts fail. Finally, we show how many of these methods can be accomplished using the synthetic biology software tool Gene Designer.

DNA2.0, Inc., Suite A, Menlo Park, California, USA
Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00003-6. © 2011 Elsevier Inc. All rights reserved.
1. Introduction A major objective of synthetic biology is to characterize biological components with sufficient precision to enable these components to be combined to produce predictable outcomes. Progress has been made in defining functional parameters for some elements. Particularly, those with regulatory functions that act to control transcription (promoters, operators, repressors, and activators) are now reasonably well characterized (see http:// www.partsregistry.org; Lisser and Margalit, 1993; Peccoud et al., 2008). However, reaching the ultimate targets of synthetic biology projects will require the balanced control of both transcription and translation in order to achieve controlled protein expression, whether those targets are engineered pathways for producing metabolites, remodeled photosynthesis, or trees that can turn into houses. Proteins are not necessarily the components of regulatory networks; they may also be catalysts that interact with cellular metabolism, structural parts of the cell, or therapeutically active compounds. Unfortunately, understanding transcriptional regulation is not sufficient to provide control of protein production. The characterization of sequences governing translation has proved challenging. This is largely because translational determinants interact with, or are embedded within the sequences that encode the polypeptide. Consequently, there is not yet a perfectly robust way to convert a virtual amino acid sequence to a DNA sequence that will, when introduced into a desired host cell, yield sufficient protein for a specific downstream application. Here, we describe recently developed tools and technologies for gene design, and discuss the heuristic basis of our understanding of particularly important design features. Translation can be controlled at the level of initiation and elongation. 
Initiation of translation is primarily dependent on the sequence of the ribosome binding site (RBS) and early mRNA secondary structure (Allert et al., 2010; Kudla et al., 2009; Salis et al., 2009). Other determinants of protein expression are less well understood but equally potent. Different proteins expressed from the same promoter with the same RBS or 50
Gene Design and Protein Expression
untranslated region (UTR) may be expressed at wildly different levels. Even different ways of encoding the same protein, under otherwise identical conditions, can result in protein concentrations differing by 100-fold (Allert et al., 2010; Kudla et al., 2009; Welch et al., 2009b). Understanding these determinants would greatly enhance our ability to express proteins at specific desired levels. In the best case, we could hope to use the control they offer. At the least, it would be helpful if we could eliminate them so that we could rely on the controls we do understand. Experimental data on the influence of gene design on heterologous expression are rapidly growing, and design algorithms derived from these experiments provide both an increased probability of success in individual projects and a starting point for further experimentation.
2. Gene Design Software

Backtranslation from a polypeptide sequence to obtain a DNA sequence requires choosing between an enormous number of possibilities (Welch et al., 2009b). We use the backtranslation module of Gene Designer, a free software tool (www.dna20.com/genedesigner2), to select sequences with specific design characteristics. Backtranslation parameters can be altered by selecting backtranslation profiles from the Configure menu in the Project Window (see Fig. 3.1). These parameters will be discussed in more detail in the following sections.
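To give a feel for the size of the search space mentioned above, the following minimal sketch counts the distinct DNA sequences that encode a short peptide under the standard genetic code. The function name and the example peptide are illustrative assumptions and are not part of Gene Designer.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# (one-letter codes; stop codons are not included).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def count_encodings(protein: str) -> int:
    """Return the number of distinct DNA sequences that encode `protein`."""
    return prod(CODON_DEGENERACY[aa] for aa in protein.upper())

# Hypothetical 10-residue peptide used only for illustration.
print(count_encodings("MKTAYIAKQR"))  # 18432
```

Because the count is a product over residues, it grows roughly exponentially with protein length, which is why exhaustive enumeration is impractical for real genes.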
3. General Sequence Parameters Affecting Protein Expression

Evidence that recoding a gene can radically change its expression has been accumulating over the past two decades (Gustafsson et al., 2004; Welch et al., 2009b). However, it is only in the last year or two that experiments have compared the expression of many different individual genes encoding the same protein. These experiments are finally allowing hypotheses about the causes of expression differences to be tested.
3.1. Initiation of translation

A key component affecting initiation of translation in prokaryotes is the RBS that occurs between 5 and 15 bases upstream of the open reading frame (ORF) AUG start codon. Binding of the ribosome to the Shine–Dalgarno (SD) sequence within the RBS localizes the ribosome to the initiation codon. This binding is primarily due to direct base pairing with the anti-SD region
Mark Welch et al.
Figure 3.1 Project Window, My Backtranslation Profiles, and Backtranslation Profile Editor. From the Configure Menu, choose Backtranslation Profiles to open My Backtranslation Profiles. Then select a profile and click edit (pencil icon) or double click on a profile to open the Backtranslation Editor. In the editor, you can change parameters related to the genetic algorithm, codon usage, sequences to avoid, 5′ structure, repeats, and homologous DNA.
of the 16S rRNA of the small ribosome subunit and can be greatly influenced by context (Komarova et al., 2005; Lee et al., 1996; Shultzaberger et al., 2001; Vimberg et al., 2007). Changes in RBS sequences can change expression levels over more than three orders of magnitude. Affinity of the RBS for the ribosome is a critical factor controlling the efficiency with which new polypeptide chains are initiated. This interaction is in competition with possible base-pairing interactions involving the RBS region that may form within the mRNA itself. Thus, SD sequences with weaker base pairing to the ribosome are more susceptible to interference from mRNA structure. However, some experiments suggest that SD sequences with too strong affinity can be deleterious, particularly at lower temperatures, by stalling initial elongation
(Komarova et al., 2002; Vimberg et al., 2007). Also critical is the distance between the RBS and the start codon, with 5–7 bases from the consensus SD AGGAGG being optimal (Chen et al., 1994). Models that factor in competition between the anti-SD and mRNA structure as well as start codon spacing have been shown to approximate actual translation initiation rates (de Smit and van Duin, 2003; Na et al., 2010; Salis et al., 2009).

Much prior work has demonstrated that mRNA structures that occlude the region of the RBS and/or start codon in genes expressed in prokaryotes can impair expression (de Smit and van Duin, 1990, 1994; Griswold et al., 2003; Kozak, 1986; Kudla et al., 2009; Studer and Joseph, 2006). For this reason, gene design strategies often avoid such structures in choosing the coding of the first several amino acids. Salis and coworkers have recently developed a thermodynamic model that captures competition between internal mRNA structures and the binding of the ribosome to the RBS (Salis et al., 2009). An alternative mathematical model of initiation has also been proposed based on similar considerations (Na et al., 2010). The Salis model is the basis of an online tool that can be used to design RBSs with modified rates of initiation of translation (http://www.voigtlab.ucsf.edu/software/). In its current stage of development, this tool is best suited for attenuating expression of an existing gene.

In eukaryotes, translation initiation is significantly different from that in prokaryotes, and multiple mechanisms have been characterized. Most initiation of translation from polymerase II-derived transcripts proceeds via recognition of the m7G cap at the 5′ terminus of the mRNA followed by scanning of the ribosome to the initiation codon, which is identified by proximity to the 5′-end and sequence context (Kozak, 1999, 2005; Pestova et al., 2001; Preiss and Hentze, 1999).
Several factors are apparently involved in unwinding structure in the region from the cap to the start codon (Parsyan et al., 2009; Pisareva et al., 2008). Alternatively, initiation for some genes can occur via recognition of internal mRNA elements that recruit ribosomes to the message and direct them to the start codon (Berry et al., 2010; Gazo et al., 2004; Pestova et al., 2001).

Numerous lines of evidence suggest that the initial 15–25 codons of the ORF deserve special consideration in gene optimization (Allert et al., 2010; Chen and Inouye, 1994; Eyre-Walker and Bulmer, 1993; Gonzalez de Valdivia and Isaksson, 2004, 2005; Kudla et al., 2009; Stenström and Isaksson, 2002; Stenström et al., 2001a,b; Tuller et al., 2010). Studies have shown that the impact of rare codons on translation rate is particularly strong in these first codons, for expression in both Escherichia coli and Saccharomyces cerevisiae (Chen and Inouye, 1990, 1994; Hoekema et al., 1987). In E. coli, peptidyl-tRNA drop-off during translation of the initial codons appears to be accentuated by the presence of rare or NGG codons (Cruz-Vera et al., 2004; Gonzalez de Valdivia and Isaksson, 2004, 2005). These effects appear to be independent of local mRNA secondary structure. The impact of early
rare codons may in some cases be suppressed by the overexpression of cognate tRNAs; however, such a strategy does not suppress the effect of NGG codons (Gonzalez de Valdivia and Isaksson, 2005). It is also true that expression may be recovered by 5′ sequence replacement even for sequences that do not show especially strong mRNA structure or contain rare codons or other obvious deleterious elements in this region (Welch et al., 2009a).

3.1.1. Avoiding mRNA structure in gene design

Backtranslation in Gene Designer allows special treatment of the 5′-end of the mRNA, with the goal of reducing secondary RNA structure. The user is able to define multiple structure identification strategies. Each strategy is weighted for fitness scoring in the genetic algorithm. To configure each strategy, the user can define the search window (in base pairs), minimum stem size, minimum loop size, maximum loop size, and the scoring weight. During backtranslation, Gene Designer uses a sliding window technique to evaluate all possible single loop structures within the constraints given by the strategy. One challenge in trying to both minimize 5′ structure and match a codon bias is that the two often pull designs in opposite directions. To mitigate this conflict, it can be helpful to first minimize the 5′ structure, and then create the remainder of the gene to give an overall match to the desired codon bias. It can sometimes be more difficult to minimize structure without using undesired codons in the important early coding region.

3.1.2. N-terminal tags to improve expression

Making N-terminal fusions can be a way to improve the expression of recalcitrant proteins either by displacing mRNA structure from the initiation region or by improving the physical integrity of the protein (Hammarstrom et al., 2002; Korepanova et al., 2007; Smyth et al., 2003). Some useful fusion tags are loaded into the Gene Designer Library.
They can be added to the N-terminus of a protein by dragging them from the Library to a position in front of the coding region of the protein (see Fig. 3.2). Because the sequences can also be edited, the original N-terminal methionine may be removed if desired.

As an example, the mRNA encoding one particular protein (“ProtA”) was prone to form a very strong hairpin in the first 15 codons of the ORF. No coding could be found to remove strong predicted structure in this region. A codon-optimized version of the gene showed weak full-length expression from either of two 5′-end codings designed to minimize mRNA structure. One coding gave no detectable expression (not shown), whereas the other gave a weak yield of full-length protein along with a more significant level of a truncated product, perhaps due to internal initiation or protein degradation. However, displacement of the initial sequence by an N-terminal fusion to maltose binding protein (MBP; Korepanova et al., 2007; Smyth et al., 2003) greatly improved expression
Figure 3.2 Library Explorer and Project Window showing Sequence View. To edit the sequence of an element in sequence view, simply select the DNA region of the element in question and click on the Edit link button below the DNA strands.
(see Fig. 3.3). Improved full-length expression was also seen when an 18-codon phage gIII secretion leader sequence was added to the 5′-end. Although it is tempting to interpret these results as meaning that a limiting 5′ sequence was replaced with nonlimiting ones, we have observed the effect of fusions to be highly gene dependent. In the case of another gene (“ProtB”), adding the same gIII coding sequence proved to lower expression significantly, well below that of gIII_ProtA (see Fig. 3.3). This discrepancy is not explained by predicted local mRNA structure. Neither the original nor gIII-fused versions of ProtB genes have strong predicted structure in the RBS and initial coding regions of the mRNA and both show significantly less structure than the higher expressing gIII_ProtA. It remains to be determined why such conditional effects are observed. Clearly 5′ replacement can be a useful tool to improve gene expression in some cases, but much is still to be learned about the interdependence of the 5′ region and downstream sequence or other protein characteristics.
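The sliding-window search for single stem-loop structures described in Section 3.1.1 can be sketched as follows. This is only an illustration under simplifying assumptions: it considers perfect Watson–Crick stems only (no G–U pairs, no thermodynamics), it reports "seed" stems of exactly the minimum stem size, and the function and parameter names are hypothetical rather than those used by Gene Designer.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_hairpins(seq, window=30, min_stem=4, min_loop=3, max_loop=8):
    """Return (start, stem_len, loop_len) for perfect stem-loop seeds.

    A seed is a stretch of `min_stem` bases followed, after a loop of
    min_loop..max_loop bases, by its reverse complement. Only structures
    whose total span fits within `window` bases are reported.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq)):
        for loop in range(min_loop, max_loop + 1):
            span = 2 * min_stem + loop
            if span > window or start + span > len(seq):
                continue
            left = seq[start:start + min_stem]
            right = seq[start + min_stem + loop:start + span]
            if right == reverse_complement(left):
                hits.append((start, min_stem, loop))
    return hits

def structure_penalty(seq, weight=1.0, **strategy):
    """Weighted penalty for one structure-identification strategy."""
    return weight * len(find_hairpins(seq, **strategy))

# Hypothetical sequence: GGAC ... GTCC can pair across a 4-base loop.
print(find_hairpins("CCGGACTTTTGTCCCC"))  # [(2, 4, 4)]
```

A stem longer than the minimum size shows up as several overlapping seeds, so longer stems are penalized more heavily. Several strategies with different windows and weights could be combined by summing their penalties.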
3.2. Codon bias

Each amino acid is encoded by as few as one (methionine and tryptophan) to as many as six codons (arginine, leucine, and serine) in the canonical genetic code. Different organisms use synonymous codons with different apparent preferences. This is exemplified in the wide range of G + C content
Figure 3.3 Impact of N-terminal fusions on expression of ProtA and ProtB in E. coli. Left panel: PAGE analysis of ProtA and N-terminal fusions with gIII and MBP are shown (lanes labeled MBP_ProtA, gIII_ProtA, ProtA, and C). C, control with empty vector. Numbers to left indicate positions of MW standards (62, 49, 38, and 28 kDa). Right panel: PAGE showing expression for uninduced (−) and induced (+) cultures of ProtB and gIII-fused ProtB.
found in bacterial coding sequences, in which the G + C content of the third codon position ranges from as low as approximately 10% (e.g., Buchnera sp.) to as high as approximately 90% (e.g., Streptomyces sp.; Sharp et al., 2005). Further, codon usage differs significantly between the complete transcriptome and the highly expressed genes in some organisms (Sharp et al., 1988). The reasons for these differences have been the subject of considerable speculation (Akashi, 2001; Akashi and Gojobori, 2002; Eyre-Walker, 1996; Eyre-Walker and Bulmer, 1993, 1995; Holm, 1986; Knight et al., 2001; Marquez et al., 2005; Rocha, 2004; Suzuki et al., 2008; Yang and Nielsen, 2008). Initial gene designs were guided by host codon bias, a reasonable approach given that the abundance of cognate tRNAs is generally correlated with codon usage frequency (Bulmer, 1987; Dong et al., 1996; Kanaya et al., 2001).

3.2.1. Approximating the host codon bias

There are two intuitively sensible ways in which host codon use frequencies can be adapted for gene design. The first is to select the codon that is used most often for each amino acid, either among the entire transcriptome or among the most highly expressed genes, and use that exclusively within the
design. Genes preferring such codons are often referred to as more “adapted” for expression (Sharp and Li, 1987). The underlying and unproven assumption is that the most common codon corresponds to the highest translational efficiency in heterologous expression. This method has many potential drawbacks. If only one codon is used to encode each amino acid, there is only a single possible DNA sequence with which to encode a specific protein. This eliminates any flexibility in other design criteria, such as the elimination or incorporation of restriction sites, the avoidance of repetitive elements within the sequence that can compromise stability, or the avoidance of sequences that could form structures at or around the site of translational initiation. Overuse of particular codons may also result in significant amino acid misincorporation (Kurland and Gallant, 1996), which might compromise the function of the protein. Most important, however, is that such codon usage may not be optimal for expression. Indeed, there is ample empirical evidence that the use of common codons in a designed gene is not correlated with high protein expression (Kudla et al., 2009; Welch et al., 2009a) and evidence that in some cases it may be detrimental (Maertens et al., 2010; Welch et al., 2009a; see Fig. 3.4).

The second way in which host codon frequencies can be used is to match the host codon frequencies in the designed gene. This can be done simply by choosing each codon with a probability that matches the host codon frequency. Although simple to implement, this approach has the limitation that probabilistic selection will sometimes result, by pure chance, in a gene design where the frequencies of some codons are quite far from those in the host. This skewing can be exacerbated by subsequent sequence
Figure 3.4 Comparison of expression of genes coded using experimentally optimized codon usage (“Exp”) or that preferring codons used at highest frequency in naturally highly expressed E. coli genomic genes (“HiCAI”). Panels: scFv1, Bl amylase, Taq Pol, and scFv C6.5, each with lanes labeled C, Exp, and HiCAI. In each case shown, genes were expressed from a strong repressible promoter, either T5 or T7, carried on a high copy plasmid. Transformed BL21 cells were cultured in Luria broth at 37 °C until mid-log growth (OD at 600 nm of 0.6). Expression was induced by addition of IPTG to 1 mM, and cultures were incubated at 30 °C for 4 h. PAGE analysis was performed on normalized amounts of total culture protein. Gels were stained using Sypro Ruby and imaged by UV fluorescence.
modification steps, for example, if an undesired sequence element is removed by replacing one of its codons probabilistically.

3.2.2. Experimental determination of an optimal E. coli codon bias

The ease with which genes can now be synthesized has allowed researchers to perform experiments that test previous assumptions regarding optimal codon bias for heterologous protein expression. Using sets of genes broadly varied in gene design features, Welch et al. found that variation in synonymous codon usage frequencies had a profound effect on the amount of protein produced in E. coli, independent of local 5′ sequence effects. Variation of at least two orders of magnitude in expression was seen due to substitution beyond the initial 15 codons of the ORF (Welch et al., 2009a). This variation was strongly correlated with the global codon usage frequencies of the genes, although the codon frequencies found in the highest expressed variants did not correspond to those found in the genome or in highly expressed endogenous genes of E. coli. Multivariate analysis showed that the frequencies of specific codons for about six amino acids could predict the observed differences in expression. It is not clear what the biochemical basis is for this correlation. It is possible that it reflects a physiological shock to the host cells as they attempt to synthesize large amounts of a single protein, biasing the consumption of the aminoacyl-tRNA population: most of the best codons for high expression are also those that are predicted to remain more highly charged under starvation conditions (Dittmar et al., 2005; Elf et al., 2003; Welch et al., 2009a). Regardless of its biochemical basis, the effect of codon frequencies on expression is not limited to bacteria. Similar results have been obtained in yeast, plant, fungal, and mammalian hosts (Welch, unpublished data; see https://www.dna20.com/index.php?pageID=330).
In all cases to date, expression is highly correlated with codon usage but does not show a general preference for use of codons used at highest frequency in the genome or in the highly expressed gene subset of the host. Much further research is needed to fully understand the nature of these effects; however, the observed correlations can already serve as the basis for more reliable design algorithms as well as providing direction for gene improvement strategies.

3.2.3. Designing genes using codon tables

Backtranslation in Gene Designer is performed in two steps. First, the design parameters are entered in the backtranslation profile (see Fig. 3.1). Codon bias tables corresponding to the ORFs from the genome of almost any organism can be downloaded easily. To do so, select Import from the File menu, then Codon Table (see Fig. 3.5). A dialog box will appear where you can enter search criteria. DNA2.0’s Web service will try to match your criteria with common and scientific names of species in its database. Once
Figure 3.5 Library Explorer, Project Window, and Import Codon Table dialog box. From the File menu, choose Import, then Codon Table. In the Import Codon Table dialog box, type in at least three characters as a search criterion. Once the desired table has been found, select it and then click on Import. The Library Explorer will open to reveal your newly imported codon table.
you have found the organism you are looking for, simply select it from the Results list, and click on the Import button. Gene Designer will proceed to download the codon table and show its new location in the Codon Table Library pane of the Library Explorer. One of these can then be loaded into a backtranslation profile by dragging it out of the Library Explorer and into the profile. The program will use the table probabilistically, but it can also be used to search iteratively within additional constraints to provide a solution that precisely matches the selected table. Additional design criteria frequently include avoiding specific sequences such as restriction sites, internal RBSs, transcriptional terminators, and RNA splice sites. These sequences can be set by selecting Edit under “Sequences to Avoid.” A dialog box with two panes will appear (see Fig. 3.6). The top pane contains the list of sequences to avoid for the backtranslation profile. The bottom pane contains lists corresponding to motifs and restriction sites. To add new sequences to the unwanted list, simply drag and drop them from the bottom pane to the top pane. To remove sequences from the unwanted list, you can drag them into the trash can on the right. It is also often desirable to avoid
Figure 3.6 Managing Unwanted Sequences to avoid for a given backtranslation profile. Motifs for RBS and Shine–Dalgarno, and restriction sites EcoRI and HindIII are already added. XbaI is being dragged in. Sites and motifs may be added via drag and drop.
repeated sequences, both to simplify synthesis and to prevent genetic instability. The repeat size to avoid is set by a slider in the backtranslation profile editor. Backtranslation will avoid creating repeats within the ORF; it will also avoid creating a sequence within the ORF that occurs elsewhere within the construct. To address the challenge of searching through the large sequence space available for evaluation during backtranslation (Welch et al., 2009b), Gene Designer uses a genetic algorithm that helps to avoid getting trapped in suboptimal local minima. Initially, a population of sequences is generated by random selection of codons weighted on codon bias. Then, each individual (sequence) is evaluated against a set of criteria (occurrence of unwanted
sequences, codon bias, occurrence of 5′ secondary structure, repetitiveness, homology with specified DNA), each criterion is weighted (see Fig. 3.1), and a score is summed for each individual. Then, new individuals are created by crossing individuals from the existing population and introducing random mutations. The new individuals are then also evaluated. Finally, the best individuals are kept for the next generation of offspring and the cycle continues. Parameters for the genetic algorithm such as the maximum number of generations, population size, and mutation rate can also be set within the backtranslation profile (see Fig. 3.1).

Although empirically optimal codon frequencies may not match the host’s bias, tables derived from experimental data can also be loaded into a Gene Designer backtranslation profile, and are used by the program in the same way as one prepared by analysis of host sequences. Genes for other proteins that are designed using data-driven tables frequently show similar improvements in expression even when predicted 5′ mRNA structure is suppressed. In head-to-head comparisons of genes using experimentally optimized bias and those using a bias favoring codons used most frequently in highly expressed host genes, the experimentally derived bias showed significantly better average yield and consistency (see Table 3.1 and Fig. 3.4). Among the genes listed in Table 3.1, both versions for Bl Amylase, Fs Cutinase, and scFv used identical sequences for the 5′-UTR and at least 47 bases into the ORF. Thus, differences observed in expression are not due to initial coding effects or mRNA secondary structure local to that region and must be due to substitutions outside the initiation region. Among the others, where coding was varied between the versions, no correlation was seen between predicted local mRNA structure and expression.
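The genetic algorithm cycle outlined earlier in this section (weighted random initialization, weighted scoring, crossover, mutation, and selection of the best individuals) can be sketched as below. This is a minimal illustration, not Gene Designer's implementation: the codon frequencies are placeholder values, only two criteria (unwanted sequences and codon bias) are scored, and all names, weights, and default parameters are assumptions.

```python
import random

# Toy codon table: target codon frequencies per amino acid.
# The frequencies are illustrative placeholders, not measured values.
CODON_TABLE = {
    "M": {"ATG": 1.0},
    "E": {"GAA": 0.7, "GAG": 0.3},
    "F": {"TTT": 0.5, "TTC": 0.5},
    "K": {"AAA": 0.7, "AAG": 0.3},
    "G": {"GGT": 0.4, "GGC": 0.4, "GGA": 0.1, "GGG": 0.1},
}
CODON_TO_AA = {c: aa for aa, cs in CODON_TABLE.items() for c in cs}
UNWANTED = ["GAATTC"]  # EcoRI site

def random_gene(protein, rng):
    """Pick each codon at random, weighted by the codon table."""
    return [rng.choices(list(CODON_TABLE[aa]),
                        weights=list(CODON_TABLE[aa].values()))[0]
            for aa in protein]

def score(codons, w_unwanted=10.0, w_bias=1.0):
    """Weighted fitness; every term is a penalty, so higher is better."""
    dna = "".join(codons)
    unwanted = sum(dna.count(site) for site in UNWANTED)
    bias = 0.0  # total absolute deviation from the target frequencies
    for aa, targets in CODON_TABLE.items():
        used = [c for c in codons if CODON_TO_AA[c] == aa]
        if used:
            bias += sum(abs(used.count(c) / len(used) - f)
                        for c, f in targets.items())
    return -(w_unwanted * unwanted + w_bias * bias)

def evolve(protein, generations=50, pop_size=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [random_gene(protein, rng) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            mother, father = rng.sample(population, 2)
            cut = rng.randrange(len(protein) + 1)  # cross at a codon boundary
            child = mother[:cut] + father[cut:]
            for i, aa in enumerate(protein):       # synonymous mutations only
                if rng.random() < mutation_rate:
                    child[i] = rng.choice(list(CODON_TABLE[aa]))
            offspring.append(child)
        # Keep the best individuals among parents and offspring.
        population = sorted(population + offspring, key=score,
                            reverse=True)[:pop_size]
    return "".join(population[0])

best = evolve("MEFKGEFKKG")  # hypothetical peptide
print(best)
```

Because crossover happens at codon boundaries and mutations are synonymous, every individual always encodes the same protein. Changing the weights changes how conflicts between criteria are resolved, which mirrors the role of the weights in the backtranslation profile.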
Clearly, synonymous codon usage outside the 5′ region can have a dramatic effect on expression, and simply increasing the use of high “codon adaptation” codons is not a reliable strategy to maximize translation efficiency.

Inherent to an experimental approach to optimization is that solutions are subject to the idiosyncrasies of the training set. Individual proteins may have different sensitivities to codon bias. Expression data from one protein might be less useful for guiding the design of a gene for a different protein sequence, particularly if the amino acid compositions of the two proteins are very different. For example, the expression levels of an alanine-rich protein limited by alanine codon usage but not by serine codon usage will not be helpful in choosing which serine codons to use in a second protein limited by serine codon usage. With more experimentation to determine preferences for a broad range of protein targets, general and protein-specific design rules should emerge.

Table 3.1 Comparison of genes coded using experimentally optimized codon usage (Exper. Opt) or that preferring codons used at highest frequency in naturally highly expressed E. coli genomic genes (HiCAI)

                 HiCAI                Exper. Opt
Protein          CAI (a)  mg/ml (b)   CAI (a)  mg/ml (b)   Exper/HiCAI
scFv C6.5        0.90     5           0.71     200         40
Bl Amylase       0.89     50          0.71     200         4
Fs Cutinase      0.89     200         0.71     130         0.7
MCherry          0.91     220         0.68     240         1.1
Taq Pol          0.91     50          0.69     200         4
scFv1            0.88     20          0.69     140         7
NR2B             0.97     5           0.64     100         20

(a) The gene codon adaptation index as defined by Sharp and Li (1987).
(b) Approximate expression level in one to three E. coli cultures 4 h after induction at 30 °C. See Fig. 3.4 legend for expression method details.
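For readers unfamiliar with the codon adaptation index (CAI) reported in Table 3.1, the sketch below computes it as the geometric mean of the relative adaptiveness of each codon, where relative adaptiveness is the usage of a codon divided by the usage of the most frequent synonymous codon in a reference gene set (Sharp and Li, 1987). The reference counts and the function names are hypothetical and serve only to illustrate the calculation.

```python
from math import exp, log

# Hypothetical codon counts from a reference set of highly expressed genes.
# All counts are assumed to be greater than zero.
REFERENCE_COUNTS = {
    "K": {"AAA": 80, "AAG": 20},
    "E": {"GAA": 75, "GAG": 25},
    "F": {"TTT": 30, "TTC": 70},
}

def relative_adaptiveness(reference):
    """w = codon count / count of the most used synonymous codon."""
    weights = {}
    for codons in reference.values():
        top = max(codons.values())
        for codon, n in codons.items():
            weights[codon] = n / top
    return weights

def cai(dna, weights):
    """Geometric mean of w over all codons that have a weight.

    Codons without a weight (here, anything outside the toy reference set,
    such as Met and Trp codons) are skipped.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    logs = [log(weights[c]) for c in codons if c in weights]
    return exp(sum(logs) / len(logs)) if logs else float("nan")

w = relative_adaptiveness(REFERENCE_COUNTS)
print(cai("AAAGAATTC", w))            # only the most frequent codons: 1.0
print(round(cai("AAGGAGTTT", w), 3))  # only the less frequent codons: 0.329
```

A gene built solely from the most frequent codons has a CAI of 1.0 by construction. As the HiCAI columns of Table 3.1 show, a CAI close to 1 did not by itself guarantee high expression.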
3.3. mRNA structure and translational elongation

While much evidence suggests that mRNA structure can interfere with translational initiation in both prokaryotes and eukaryotes, the effects of structure on elongation are less well understood. This may be due in part to the intrinsic helicase activity of ribosomes, which allows translation through even very strong hairpins and may preclude many structures from limiting the translation rate in either prokaryotes (Takyar et al., 2005) or eukaryotes (Minshull and Hunt, 1986). Perhaps more importantly, mRNA structure is difficult to predict, particularly for actively translated messages, which are in continuous flux between various folded and unfolded states. Some optimization strategies restrict structure analysis to local windows along the mRNA where structure could form between ribosomes, but it is not clear that such treatments accurately reflect structure in the context of the complete mRNA. Uncertainties in both the impact and the prediction of mRNA structure currently stand in the way of a rational approach to mRNA structure optimization. Any practical consideration of mRNA structure in gene design will depend on further systematic experimentation to identify reliable principles.
4. Protein-Specific Factors Providing Additional Complexity

Quite often the target protein itself, due to properties of its structure or its activity, is a strong determinant of expression yield. The protein may be particularly unstable in the host, especially if it is poorly folded due to
inherent instability, lack of sufficient prosthetic factors, or improper posttranslational modification. Expression of the protein may be toxic to the cell, leading to instability of the expression vector or host suppression of protein synthesis. Expression of secreted and membrane proteins may be limited by mechanisms for directing these proteins to the membrane. It is even possible that the protein amino acid sequence may limit translational efficiency. For example, proline is thought to be slowly translated in E. coli, regardless of which codon is used (Pavlov et al., 2009). Proteins containing runs of prolines or high proline content may therefore be intrinsically more difficult to express without either altering their sequence (and thus probably their function) or doing some serious tinkering with the process of translation in the host. There exists a growing list of strategies to circumvent protein-specific limitations, some of which are summarized below.
4.1. Protein toxicity

Quite often expression is limited by toxicity of the protein product or side products of attempted expression (Saida, 2007). Toxicity can greatly increase plasmid instability if gene expression is not tightly shut down during cell growth. A strongly repressed promoter and a host genetic background that promotes stability can be critical for very toxic genes. Upon induction, high toxicity may lead to a shutdown of protein expression. Optimal expression may require conditions where toxicity is mitigated.

A common strategy to reduce toxicity is to lower expression to tolerable levels. Promoters varied in strength can be valuable tools for finding an optimal expression rate for maximal yield. As one example, we observed toxicity in trying to express periplasm-directed heavy and light chains of a Fab antibody fragment in E. coli. Use of strong T5 and T7 promoters resulted in only poor yields of product, which was not efficiently directed to the periplasm. Lowered expression by use of a lac promoter reduced toxicity upon induction and increased both final yield and efficiency of secretion to the periplasm. Indeed, most accumulation in the periplasm from the lac-driven constructs appeared to occur prior to induction from this system, which showed measurable amounts of noninduced expression. The lowered expression of the uninduced lac promoter in our constructs perhaps allowed efficient transport to the periplasm without accumulation of toxic levels in the cytoplasm.

One potential way to avoid toxicity of some proteins is to direct expression to the periplasm or medium. This may be accomplished by N-terminal fusion of a secretion signal sequence. Many such sequences have been described for secretion from a wide range of prokaryotic and eukaryotic host cells (Baneyx, 1999; Brake et al., 1984; Korepanova et al., 2009; Peroutka et al., 2008), and several are provided in Gene Designer
Figure 3.7 Expression of ProtC and gIII-fused ProtC in E. coli. Induced (+) and uninduced (−) cultures are shown for each variant.
(see Fig. 3.2). In the example shown in Fig. 3.7, fusion of a phage gIII signal sequence to one protein (“ProtC”) proved critical to obtain substantial expression yield. Attempts to mitigate the high toxicity of this protein by tight promoter control, lowered temperature, and MBP fusion were not successful. However, fusion to the gIII signal sequence reduced toxicity substantially and significant yield was obtained. Intentionally directing proteins to insoluble inclusion bodies using “insolubility tags,” such as fusion with ketosteroid isomerase (Park et al., 2008), may also avoid toxicity from the soluble form of proteins, though the general usefulness of this approach is limited to applications where insoluble protein is acceptable, for example, in raising antibodies, or where successful refolding strategies are known.
4.2. Transmembrane proteins

Transmembrane proteins can be particularly difficult to express successfully in heterologous hosts (Freigassner et al., 2009). Such proteins are frequently poorly directed to the membrane and are often toxic to the cell (Luo et al., 2009; Steffensen and Pedersen, 2006; Wagner et al., 2006, 2008). For both reasons, attenuated expression constructs may prove useful (Wagner et al., 2008). Lowered transcription (e.g., by use of a weaker promoter) may help to limit the expression rate to that of the membrane insertion capacity of the cell, avoiding accumulation of unfolded protein and potential indirect or direct toxic effects of overexpression.
4.2.1. Addition of solubility tags

If the transmembrane portion of the protein is not absolutely required for function of the protein (e.g., if the protein is a membrane-bound enzyme rather than a transporter), modifications or elimination of the anchor site can allow expression and function of the protein in a heterologous system. A good example of this was published by Michelle Chang and colleagues (Chang et al., 2007; Craft et al., 2003). They showed that active expression was improved by replacing the N-terminal membrane anchor for a plant cytochrome P450 with several different sequences, including P450 sequences from yeast (Craft et al., 2003) or mammals (Barnes et al., 1991), bacterial secretion signals, or synthetic solubilization sequences (Roosild et al., 2005; Schafmeister et al., 1993; Schoch et al., 2003; Sueyoshi et al., 1995). These sequences are preloaded into Gene Designer (see Fig. 3.8).
4.3. cis-Regulatory regions
In some genes, particularly as exemplified in retroviral genes, regulation may be accomplished by sequence elements within the coding region itself (Kotsopoulou et al., 2000; Woltering and Duboule, 2009). One relatively simple way to eliminate such motifs is to perform backtranslation while maximizing the genetic distance from the sequence found in nature. Incorporating codon changes where possible will maximize the likelihood of disrupting any hidden or unknown elements within the mRNA sequence. This can be accomplished in Gene Designer by specifying a homologous DNA sequence for the Amino Acid Element in question. To do this, open the AA Element Properties dialog box by selecting an AA Element and clicking on the edit button (pencil icon) or double clicking on the AA Element. Once in the AA Element Properties dialog box (see Fig. 3.9), you can specify whether to aim for or avoid similarity with any given DNA sequence. To edit the homologous DNA, click on the Edit Homologous DNA button. Gene Designer will then allow you to enter the DNA, will translate that DNA into an amino acid sequence, and will show the alignment of the translated sequence with the sequence of the AA Element you are editing. This alignment is used for calculating the Homologous DNA similarity score used during backtranslation. The genetic algorithm in Gene Designer used for backtranslation is dependent on the weights specified in the backtranslation profile (see Fig. 3.1). These weights are used as a means to sort out conflicting requirements. For example, maximizing sequence similarity with a homologous DNA sequence might conflict with avoidance of a motif that is present in the homologous DNA. During backtranslation, whichever weighted scores (i.e., Unwanted Sequence Avoidance vs. Homologous DNA Matching)
contribute more to the overall score will have a stronger effect on the fitness of each individual of the population and therefore on the general search direction in sequence space. At the end of backtranslation, Gene Designer asks if you would like to see a Backtranslation Summary Report. Here, you can verify if the sequences you wanted to avoid were truly avoided. The report is also available under the Reports menu.
Figure 3.8 Library Explorer with various fusion tag folders open. Elements from the library can be dragged out and into the Project Window to add them to a Design Construct.
Figure 3.9 Amino Acid Element Properties dialog box. Shown on the left, the properties box has just been opened, and the Amino Acid Element EMP-PD1’s properties can be changed from here. To edit this element’s homologous DNA, click on Edit Homologous DNA and another box will open, shown on the right. From here, you can specify DNA which will be translated into an amino acid sequence that will then be aligned with the Amino Acid Element’s sequence.
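The way weights arbitrate between Homologous DNA Matching and Unwanted Sequence Avoidance can be illustrated with a minimal sketch. It is not Gene Designer's algorithm or interface: the scoring functions, the reduced codon table, the toy protein, the reference DNA, and the motif are all hypothetical, and exhaustive enumeration of synonymous sequences stands in for the genetic algorithm, which is only practical here because the example is tiny.

```python
from itertools import product

# Reduced synonymous-codon table covering only the residues of the toy protein.
CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["CTG", "CTC", "TTA", "TTG", "CTT", "CTA"],
    "S": ["AGC", "TCT", "TCC", "TCA", "TCG", "AGT"],
}

def identity(seq, reference):
    """Fraction of nucleotide positions at which seq matches the reference DNA."""
    return sum(a == b for a, b in zip(seq, reference)) / len(reference)

def motif_hits(seq, motifs):
    """Number of (non-overlapping) occurrences of unwanted motifs in seq."""
    return sum(seq.count(m) for m in motifs)

def fitness(seq, reference, motifs, w_match, w_avoid):
    """Weighted score: reward similarity to the reference, penalize motifs."""
    return w_match * identity(seq, reference) - w_avoid * motif_hits(seq, motifs)

def best_backtranslation(protein, reference, motifs, w_match, w_avoid):
    """Score every synonymous DNA sequence and return the fittest one."""
    candidates = ("".join(c) for c in product(*(CODONS[aa] for aa in protein)))
    return max(candidates,
               key=lambda s: fitness(s, reference, motifs, w_match, w_avoid))

PROTEIN = "MKLS"
REFERENCE = "ATGAAGCTTAGC"   # hypothetical homologous DNA; contains the motif
MOTIFS = ["AAGCTT"]          # hypothetical unwanted sequence

# Matching dominates: the reference itself is returned, motif included.
match_first = best_backtranslation(PROTEIN, REFERENCE, MOTIFS, 1.0, 0.0)
# Avoidance dominates: the closest sequence that lacks the motif is returned.
avoid_first = best_backtranslation(PROTEIN, REFERENCE, MOTIFS, 1.0, 10.0)
```

With the avoidance weight set to zero the search simply reproduces the homologous DNA; raising it forces one synonymous change that breaks the motif while keeping the rest of the sequence as close to the reference as possible.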
5. Conclusions
The promise of synthetic biology depends on understanding the behavior of genetic parts, the interactions between them, and the interactions between those parts and the host system. While there has been great progress in identifying fundamental genetic elements and standardizing frameworks for the expression and control of genes, we still lack the ability to reliably and rationally design genes for successful expression. This problem is in part one of not fully understanding the nature of the parts. Perhaps more significantly, it is difficult to predict the impact of protein-specific issues (folding, toxicity, etc.), which could greatly affect expression and optimal gene design. Tools such as Gene Designer that facilitate gene engineering will be essential for the development of reliable synthetic systems.
Mark Welch et al.
REFERENCES
Akashi, H. (2001). Gene expression and molecular evolution. Curr. Opin. Genet. Dev. 11, 660–666. Akashi, H., and Gojobori, T. (2002). Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc. Natl. Acad. Sci. USA 99, 3695–3700. Allert, M., Cox, J. C., and Hellinga, H. W. (2010). Multifactorial determinants of protein expression in prokaryotic open reading frames. J. Mol. Biol. 402, 905–918. Baneyx, F. (1999). Recombinant protein expression in Escherichia coli. Curr. Opin. Biotechnol. 10, 411–421. Barnes, H. J., Arlotto, M. P., and Waterman, M. R. (1991). Expression and enzymatic activity of recombinant cytochrome P450 17 alpha-hydroxylase in Escherichia coli. Proc. Natl. Acad. Sci. USA 88, 5597–5601. Berry, K. E., Waghray, S., and Doudna, J. A. (2010). The HCV IRES pseudoknot positions the initiation codon on the 40S ribosomal subunit. RNA 16, 1559–1569. Brake, A. J., Merryweather, J. P., Coit, D. G., Heberlein, U. A., Masiarz, F. R., Mullenbach, G. T., Urdea, M. S., Valenzuela, P., and Barr, P. J. (1984). Alpha-factor-directed synthesis and secretion of mature foreign proteins in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 81, 4642–4646. Bulmer, M. (1987). Coevolution of codon usage and transfer RNA abundance. Nature 325, 728–730. Chang, M. C., Eachus, R. A., Trieu, W., Ro, D. K., and Keasling, J. D. (2007). Engineering Escherichia coli for production of functionalized terpenoids using plant P450s. Nat. Chem. Biol. 3, 274–277. Chen, G., and Inouye, M. (1990). Suppression of the negative effect of minor arginine codons on gene expression; preferential usage of minor codons within the first 25 codons of the Escherichia coli genes. Nucleic Acids Res. 18, 1465–1473. Chen, G. T., and Inouye, M. (1994). Role of the AGA/AGG codons, the rarest codons in global gene expression in Escherichia coli. Genes Dev. 8, 2641–2652. Chen, H., Bjerknes, M., Kumar, R., and Jay, E. (1994).
Determination of the optimal aligned spacing between the Shine-Dalgarno sequence and the translation initiation codon of Escherichia coli mRNAs. Nucleic Acids Res. 22, 4953–4957. Craft, D. L., Madduri, K. M., Eshoo, M., and Wilson, C. R. (2003). Identification and characterization of the CYP52 family of Candida tropicalis ATCC 20336, important for the conversion of fatty acids and alkanes to alpha, omega-dicarboxylic acids. Appl. Environ. Microbiol. 69, 5983–5991. Cruz-Vera, L. R., Magos-Castro, M. A., Zamora-Romo, E., and Guarneros, G. (2004). Ribosome stalling and peptidyl-tRNA drop-off during translational delay at AGA codons. Nucleic Acids Res. 32, 4462–4468. de Smit, M. H., and van Duin, J. (1990). Secondary structure of the ribosome binding site determines translational efficiency: A quantitative analysis. Proc. Natl. Acad. Sci. USA 87, 7668–7672. de Smit, M. H., and van Duin, J. (1994). Control of translation by mRNA secondary structure in Escherichia coli. A quantitative analysis of literature data. J. Mol. Biol. 244, 144–150. de Smit, M. H., and van Duin, J. (2003). Translational standby sites: How ribosomes may deal with the rapid folding kinetics of mRNA. J. Mol. Biol. 331, 737–743. Dittmar, K. A., Sorensen, M. A., Elf, J., Ehrenberg, M., and Pan, T. (2005). Selective charging of tRNA isoacceptors induced by amino-acid starvation. EMBO Rep. 6, 151–157.
Dong, H., Nilsson, L., and Kurland, C. G. (1996). Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J. Mol. Biol. 260, 649–663. Elf, J., Nilsson, D., Tenson, T., and Ehrenberg, M. (2003). Selective charging of tRNA isoacceptors explains patterns of codon usage. Science 300, 1718–1722. Eyre-Walker, A. (1996). Synonymous codon bias is related to gene length in Escherichia coli: Selection for translational accuracy? Mol. Biol. Evol. 13, 864–872. Eyre-Walker, A., and Bulmer, M. (1993). Reduced synonymous substitution rate at the start of enterobacterial genes. Nucleic Acids Res. 21, 4599–4603. Eyre-Walker, A., and Bulmer, M. (1995). Synonymous substitution rates in enterobacteria. Genetics 140, 1407–1412. Freigassner, M., Pichler, H., and Glieder, A. (2009). Tuning microbial hosts for membrane protein production. Microb. Cell Fact. 8, 69. Gazo, B. M., Murphy, P., Gatchel, J. R., and Browning, K. S. (2004). A novel interaction of Cap-binding protein complexes eukaryotic initiation factor (eIF) 4F and eIF(iso)4F with a region in the 3′-untranslated region of satellite tobacco necrosis virus. J. Biol. Chem. 279, 13584–13592. Gonzalez de Valdivia, E. I., and Isaksson, L. A. (2004). A codon window in mRNA downstream of the initiation codon where NGG codons give strongly reduced gene expression in Escherichia coli. Nucleic Acids Res. 32, 5198–5205. Gonzalez de Valdivia, E., and Isaksson, L. A. (2005). Abortive translation caused by peptidyl-tRNA drop-off at NGG codons in the early coding region of mRNA. FEBS J. 272, 5306–5316. Griswold, K. E., Mahmood, N. A., Iverson, B. L., and Georgiou, G. (2003). Effects of codon usage versus putative 5′-mRNA structure on the expression of Fusarium solani cutinase in the Escherichia coli cytoplasm. Protein Expr. Purif. 27, 134–142. Gustafsson, C., Govindarajan, S., and Minshull, J. (2004). Codon bias and heterologous protein expression. Trends Biotechnol. 22, 346–353.
Hammarstrom, M., Hellgren, N., van Den Berg, S., Berglund, H., and Hard, T. (2002). Rapid screening for improved solubility of small human proteins produced as fusion proteins in Escherichia coli. Protein Sci. 11, 313–321. Hoekema, A., Kastelein, R. A., Vasser, M., and de Boer, H. A. (1987). Codon replacement in the PGK1 gene of Saccharomyces cerevisiae: Experimental approach to study the role of biased codon usage in gene expression. Mol. Cell. Biol. 7, 2914–2924. Holm, L. (1986). Codon usage and gene expression. Nucleic Acids Res. 14, 3075–3087. Kanaya, S., Yamada, Y., Kinouchi, M., Kudo, Y., and Ikemura, T. (2001). Codon usage and tRNA genes in eukaryotes: Correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J. Mol. Evol. 53, 290–298. Knight, R. D., Freeland, S. J., and Landweber, L. F. (2001). A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol. 2, RESEARCH0010. Komarova, A. V., Tchufistova, L. S., Supina, E. V., and Boni, I. V. (2002). Protein S1 counteracts the inhibitory effect of the extended Shine-Dalgarno sequence on translation. RNA 8, 1137–1147. Komarova, A. V., Tchufistova, L. S., Dreyfus, M., and Boni, I. V. (2005). AU-rich sequences within 5′ untranslated leaders enhance translation and stabilize mRNA in Escherichia coli. J. Bacteriol. 187, 1344–1349. Korepanova, A., Moore, J. D., Nguyen, H. B., Hua, Y., Cross, T. A., and Gao, F. (2007). Expression of membrane proteins from Mycobacterium tuberculosis in Escherichia coli as fusions with maltose binding protein. Protein Expr. Purif. 53, 24–30. Korepanova, A., Pereda-Lopez, A., Solomon, L. R., Walter, K. A., Lake, M. R., Bianchi, B. R., McDonald, H. A., Neelands, T. R., Shen, J., Matayoshi, E. D.,
Moreland, R. B., and Chiu, M. L. (2009). Expression and purification of human TRPV1 in baculovirus-infected insect cells for structural studies. Protein Expr. Purif. 65, 38–50. Kotsopoulou, E., Kim, V. N., Kingsman, A. J., Kingsman, S. M., and Mitrophanous, K. A. (2000). A Rev-independent human immunodeficiency virus type 1 (HIV-1)-based vector that exploits a codon-optimized HIV-1 gag-pol gene. J. Virol. 74, 4839–4852. Kozak, M. (1986). Influences of mRNA secondary structure on initiation by eukaryotic ribosomes. Proc. Natl. Acad. Sci. USA 83, 2850–2854. Kozak, M. (1999). Initiation of translation in prokaryotes and eukaryotes. Gene 234, 187–208. Kozak, M. (2005). Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene 361, 13–37. Kudla, G., Murray, A. W., Tollervey, D., and Plotkin, J. B. (2009). Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255–258. Kurland, C., and Gallant, J. (1996). Errors of heterologous protein expression. Curr. Opin. Biotechnol. 7, 489–493. Lee, K., Holland-Staley, C. A., and Cunningham, P. R. (1996). Genetic analysis of the Shine-Dalgarno interaction: Selection of alternative functional mRNA-rRNA combinations. RNA 2, 1270–1285. Lisser, S., and Margalit, H. (1993). Compilation of E. coli mRNA promoter sequences. Nucleic Acids Res. 21, 1507–1516. Luo, J., Choulet, J., and Samuelson, J. C. (2009). Rational design of a fusion partner for membrane protein expression in E. coli. Protein Sci. 18, 1735–1744. Maertens, B., Spriestersbach, A., von Groll, U., Roth, U., Kubicek, J., Gerrits, M., Graf, M., Liss, M., Daubert, D., Wagner, R., and Schafer, F. (2010). Gene optimization mechanisms: A multi-gene study reveals a high success rate of full-length human proteins expressed in Escherichia coli. Protein Sci. 19, 1312–1326. Marquez, R., Smit, S., and Knight, R. (2005). Do universal codon-usage patterns minimize the effects of mutation and translation error? Genome Biol. 6, R91. 
Minshull, J., and Hunt, T. (1986). The use of single-stranded DNA and RNase H to promote quantitative ‘hybrid arrest of translation’ of mRNA/DNA hybrids in reticulocyte lysate cell-free translations. Nucleic Acids Res. 14, 6433–6451. Na, D., Lee, S., and Lee, D. (2010). Mathematical modeling of translation initiation for the estimation of its efficiency to computationally design mRNA sequences with desired expression levels in prokaryotes. BMC Syst. Biol. 4, 71. Park, T. J., Choi, S. S., Gang, G. A., and Kim, Y. (2008). High-level expression and purification of the second transmembrane domain of wild-type and mutant human melanocortin-4 receptor for solid-state NMR structural studies. Protein Expr. Purif. 62, 139–145. Parsyan, A., Shahbazian, D., Martineau, Y., Petroulakis, E., Alain, T., Larsson, O., Mathonnet, G., Tettweiler, G., Hellen, C. U., Pestova, T. V., Svitkin, Y. V., and Sonenberg, N. (2009). The helicase protein DHX29 promotes translation initiation, cell proliferation, and tumorigenesis. Proc. Natl. Acad. Sci. USA 106, 22217–22222. Pavlov, M. Y., Watts, R. E., Tan, Z., Cornish, V. W., Ehrenberg, M., and Forster, A. C. (2009). Slow peptide bond formation by proline and other N-alkylamino acids in translation. Proc. Natl. Acad. Sci. USA 106, 50–54. Peccoud, J., Blauvelt, M. F., Cai, Y., Cooper, K. L., Crasta, O., DeLalla, E. C., Evans, C., Folkerts, O., Lyons, B. M., Mane, S. P., Shelton, R., Sweede, M. A., et al. (2008). Targeted development of registries of biological parts. PLoS One 3, e2671. Peroutka, R. J., Elshourbagy, N., Piech, T., and Butt, T. R. (2008). Enhanced protein expression in mammalian cells using engineered SUMO fusions: Secreted phospholipase A2. Protein Sci. 17, 1586–1595.
Pestova, T. V., Kolupaeva, V. G., Lomakin, I. B., Pilipenko, E. V., Shatsky, I. N., Agol, V. I., and Hellen, C. U. (2001). Molecular mechanisms of translation initiation in eukaryotes. Proc. Natl. Acad. Sci. USA 98, 7029–7036. Pisareva, V. P., Pisarev, A. V., Komar, A. A., Hellen, C. U., and Pestova, T. V. (2008). Translation initiation on mammalian mRNAs with structured 5′ UTRs requires DExH-box protein DHX29. Cell 135, 1237–1250. Preiss, T., and Hentze, M. W. (1999). From factors to mechanisms: Translation and translational control in eukaryotes. Curr. Opin. Genet. Dev. 9, 515–521. Rocha, E. P. (2004). Codon usage bias from tRNA’s point of view: Redundancy, specialization, and efficient decoding for translation optimization. Genome Res. 14, 2279–2286. Roosild, T. P., Greenwald, J., Vega, M., Castronovo, S., Riek, R., and Choe, S. (2005). NMR structure of Mistic, a membrane-integrating protein for membrane protein expression. Science 307, 1317–1321. Saida, F. (2007). Overview on the expression of toxic gene products in Escherichia coli. Curr. Protoc. Protein Sci. Chapter 5, Unit 5.19. Salis, H. M., Mirsky, E. A., and Voigt, C. A. (2009). Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 27, 946–950. Schafmeister, C. E., Miercke, L. J., and Stroud, R. M. (1993). Structure at 2.5 Å of a designed peptide that maintains solubility of membrane proteins. Science 262, 734–738. Schoch, G. A., Attias, R., Belghazi, M., Dansette, P. M., and Werck-Reichhart, D. (2003). Engineering of a water-soluble plant cytochrome P450, CYP73A1, and NMR-based orientation of natural and alternate substrates in the active site. Plant Physiol. 133, 1198–1208. Sharp, P. M., and Li, W. H. (1987). The codon Adaptation Index: A measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295. Sharp, P. M., Cowe, E., Higgins, D. G., Shields, D. C., Wolfe, K. H., and Wright, F. (1988).
Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens: A review of the considerable within-species diversity. Nucleic Acids Res. 16, 8207–8211. Sharp, P. M., Bailes, E., Grocock, R. J., Peden, J. F., and Sockett, R. E. (2005). Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res. 33, 1141–1153. Shultzaberger, R. K., Bucheimer, R. E., Rudd, K. E., and Schneider, T. D. (2001). Anatomy of Escherichia coli ribosome binding sites. J. Mol. Biol. 313, 215–228. Smyth, D. R., Mrozkiewicz, M. K., McGrath, W. J., Listwan, P., and Kobe, B. (2003). Crystal structures of fusion proteins with large-affinity tags. Protein Sci. 12, 1313–1322. Steffensen, L., and Pedersen, P. A. (2006). Heterologous expression of membrane and soluble proteins derepresses GCN4 mRNA translation in the yeast Saccharomyces cerevisiae. Eukaryot. Cell 5, 248–261. Stenström, C. M., and Isaksson, L. A. (2002). Influences on translation initiation and early elongation by the messenger RNA region flanking the initiation codon at the 3′ side. Gene 288, 1–8. Stenström, C. M., Holmgren, E., and Isaksson, L. A. (2001a). Cooperative effects by the initiation codon and its flanking regions on translation initiation. Gene 273, 259–265. Stenström, C. M., Jin, H., Major, L. L., Tate, W. P., and Isaksson, L. A. (2001b). Codon bias at the 3′-side of the initiation codon is correlated with translation initiation efficiency in Escherichia coli. Gene 263, 273–284. Studer, S. M., and Joseph, S. (2006). Unfolding of mRNA secondary structure by the bacterial translation initiation complex. Mol. Cell 22, 105–115.
Sueyoshi, T., Park, L. J., Moore, R., Juvonen, R. O., and Negishi, M. (1995). Molecular engineering of microsomal P450 2a-4 to a stable, water-soluble enzyme. Arch. Biochem. Biophys. 322, 265–271. Suzuki, H., Brown, C. J., Forney, L. J., and Top, E. M. (2008). Comparison of correspondence analysis methods for synonymous codon usage in bacteria. DNA Res. 15, 357–365. Takyar, S., Hickerson, R. P., and Noller, H. F. (2005). mRNA helicase activity of the ribosome. Cell 120, 49–58. Tuller, T., Waldman, Y. Y., Kupiec, M., and Ruppin, E. (2010). Translation efficiency is determined by both codon bias and folding energy. Proc. Natl. Acad. Sci. USA 107, 3645–3650. Vimberg, V., Tats, A., Remm, M., and Tenson, T. (2007). Translation initiation region sequence preferences in Escherichia coli. BMC Mol. Biol. 8, 100. Wagner, S., Bader, M. L., Drew, D., and de Gier, J. W. (2006). Rationalizing membrane protein overexpression. Trends Biotechnol. 24, 364–371. Wagner, S., Klepsch, M. M., Schlegel, S., Appel, A., Draheim, R., Tarry, M., Hogbom, M., van Wijk, K. J., Slotboom, D. J., Persson, J. O., and de Gier, J. W. (2008). Tuning Escherichia coli for membrane protein overexpression. Proc. Natl. Acad. Sci. USA 105, 14371–14376. Welch, M., Govindarajan, S., Ness, J. E., Villalobos, A., Gurney, A., Minshull, J., and Gustafsson, C. (2009a). Design parameters to control synthetic gene expression in Escherichia coli. PLoS One 4, e7002. Welch, M., Villalobos, A., Gustafsson, C., and Minshull, J. (2009b). You’re one in a googol: Optimizing genes for protein expression. J. R. Soc. Interface 6(Suppl 4), S467–S476. Woltering, J. M., and Duboule, D. (2009). Conserved elements within open reading frames of mammalian Hox genes. J. Biol. 8, 17. Yang, Z., and Nielsen, R. (2008). Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol. Biol. Evol. 25, 568–579.
CHAPTER FOUR
Application of Metabolic Flux Analysis in Metabolic Engineering
Sang Yup Lee,*,†,‡,§ Jong Myoung Park,*,† and Tae Yong Kim*,‡

Contents
1. Introduction
1.1. Systems metabolic engineering and metabolic flux analysis
1.2. 13C-based flux analysis
1.3. Constraints-based flux analysis
2. General Structure of Constraints-Based Flux Analysis
3. Algorithms of Metabolic Flux Analysis
3.1. Flux balance analysis
3.2. Identifying gene targets for engineering strain development: Gene knockout
3.3. Identifying gene targets for engineering strain development: Up- or downregulation of genes
3.4. Identifying gene targets for engineering strain development: Foreign genes insertion
3.5. Identifying gene targets for engineering strain development: Metabolite essentiality
3.6. Accurately describing cellular physiology: Incorporation of experimental data and physiological properties into the in silico model
4. Concluding Remarks
Acknowledgments
References
Abstract
Metabolic flux analysis (MFA) is an important analytical technique to quantify intracellular metabolic fluxes as a consequence of all catalytic and transcriptional interactions. In systems metabolic engineering, MFA has played an important role in understanding cellular physiology under particular conditions and in predicting metabolic capability after genetic or environmental perturbations. Two methods that use optimization procedures, 13C-based flux analysis and constraints-based flux analysis, have generally been used; both are based on the stoichiometry of metabolic reactions and mass balances around intracellular metabolites under the pseudo-steady state assumption. In practice, MFA has been applied to generate new knowledge on the biological system, analyze cellular physiology system-wide, and consequently design metabolic engineering strategies at a systems level. In this chapter, we study the basic principle of MFA (more particularly constraints-based flux analysis), inspect the characteristics of several in silico algorithms developed for system-wide analysis of cellular metabolic fluxes, and discuss their applications.

* Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Program), KAIST, Daejeon, Republic of Korea
† BioProcess Engineering Research Center, Center for Systems and Synthetic Biotechnology, Institute for the BioCentury, KAIST, Daejeon, Republic of Korea
‡ Bioinformatics Research Center, KAIST, Daejeon, Republic of Korea
§ Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea

Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00004-8
© 2011 Elsevier Inc. All rights reserved.
1. Introduction
1.1. Systems metabolic engineering and metabolic flux analysis
Systems metabolic engineering has been proposed as a new paradigm for generating new knowledge on biological systems and for systematically designing novel strategies to develop improved strains (Joyce and Palsson, 2006; Lee et al., 2005b, 2007; Park and Lee, 2008; Park et al., 2007). In systems metabolic engineering, metabolic flux analysis (MFA) has played an important role in understanding cellular physiology and predicting its metabolic capability under specified environmental or genetic conditions (Orth et al., 2010; Raman and Chandra, 2009; Sauer, 2006; Zamboni and Sauer, 2009). MFA is a powerful analytical technique that uses optimization procedures to quantify the intracellular metabolic fluxes resulting from all known catalytic and transcriptional interactions. MFA is based on the stoichiometry of the metabolic reactions and the mass balances around intracellular metabolites under the pseudo-steady state assumption. Two methods have been used to study the metabolic flux in a biological system: 13C-based flux analysis and constraints-based flux analysis.
1.2. 13C-based flux analysis
13C-based flux analysis utilizes an isotope-labeled carbon substrate and allows the determination of intracellular fluxes in metabolic networks by analyzing 13C enrichment patterns of metabolites with nuclear magnetic resonance (NMR) or gas chromatography–mass spectrometry (GC–MS) (Sauer, 2006; Zamboni and Sauer, 2009). The 13C-labeled substrates are fed to growing cells until the isotope-labeled carbon is distributed throughout the metabolic network. The measured 13C-isotope pattern data and
additional physiological data during cultivation, including exchange fluxes (uptake rate and production rate) determined from time courses of extracellular metabolite concentrations and biomass composition data, are simultaneously integrated with computational analysis. The intracellular fluxes are then estimated by iteratively fitting the simulated fluxes in stoichiometric models to the measured data, so that the difference between the simulated and measured labeling patterns is minimized (Sauer, 2004, 2006; Wiechert, 2001). Typically, 13C-based flux analysis has been used to understand the physiological status of a cell by quantifying intracellular fluxes under a particular condition (Al Zaid Siddiquee et al., 2004; Li et al., 2006; Peng et al., 2004; Schmidt et al., 1999; Zhao et al., 2004a,b). 13C-based flux analysis has also been used to discover and quantify the in vivo operation of unusual pathways within complex metabolic networks and to elucidate the pathways in less-characterized species (Rabinowitz, 2007; Risso et al., 2008; Sauer, 2006; Tang et al., 2007). Further applications include elucidating the mechanisms of network-wide balancing of intracellular components, for example, energy and redox balancing, and demonstrating the role of unfamiliar pathways in the metabolic network (Fuhrer and Sauer, 2009; Peyraud et al., 2009; Zamboni and Sauer, 2009). Combining 13C-based flux analysis with other data pertaining to the network being investigated has enabled characterization of the condition-dependent regulatory circuits that ultimately govern the metabolic phenotype (Ishii et al., 2007; Nanchen et al., 2008; Tang et al., 2009; Tannler et al., 2008; Zamboni and Sauer, 2009).
Experimental fluxes obtained by 13C-based flux analysis have been used with genome-scale metabolic models to predict cellular physiology with relatively high accuracy by constraining the flux solution space (Herrgard et al., 2006a; Kim and Lee, 2006; Sauer, 2006) and to evaluate model predictions (Park et al., 2010; Segre et al., 2002). This has allowed members of the biotechnology community to utilize 13C-based flux analysis for metabolic engineering (e.g., isoprenoid production in Escherichia coli; Kizer et al., 2008), drug development (e.g., dihydrofolate reductase inhibitor in E. coli; Kwon et al., 2008), and the identification of functional side effects of drugs (Schneider et al., 2009) because of the extensive perspectives on cellular energetics and network-wide balancing provided by 13C-based flux analysis (Zamboni and Sauer, 2009). In practice, despite its relatively accurate estimation of intracellular fluxes, 13C-based flux analysis typically focuses on a small-scale metabolic network (i.e., central metabolism) rather than the entire, or genome-scale, metabolic network, because of difficulties in the experimentation and subsequent computational calculations required for large-scale metabolic models; this limits its application to large-scale analysis (Kim et al., 2008a; Sauer, 2006).
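The fitting step described above, in which simulated labeling patterns are brought into agreement with measured ones, can be sketched in a deliberately reduced form. Everything in this example is an assumption made for illustration: a single product formed by two hypothetical pathways, invented labeling fractions for three carbon positions, and a golden-section search in place of the isotopomer models and optimizers used in real 13C-based flux analysis.

```python
import math

# Hypothetical 13C fractions that each pathway would imprint on three carbon
# positions of the product if it carried all of the flux.
PATHWAY_A = [0.90, 0.10, 0.50]
PATHWAY_B = [0.20, 0.60, 0.50]

def simulate(f):
    """Simulated labeling pattern when a fraction f of flux uses pathway A."""
    return [f * a + (1.0 - f) * b for a, b in zip(PATHWAY_A, PATHWAY_B)]

def sse(f, measured):
    """Sum of squared differences between simulated and measured labeling."""
    return sum((s - m) ** 2 for s, m in zip(simulate(f), measured))

def fit_split(measured, tol=1e-9):
    """Iteratively shrink [0, 1] around the split ratio that minimizes sse."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        x1 = hi - inv_phi * (hi - lo)
        x2 = lo + inv_phi * (hi - lo)
        if sse(x1, measured) < sse(x2, measured):
            hi = x2   # minimum lies in [lo, x2]
        else:
            lo = x1   # minimum lies in [x1, hi]
    return (lo + hi) / 2.0

measured = simulate(0.7)        # synthetic "measurement" with a known answer
split = fit_split(measured)
uptake = 10.0                   # hypothetical measured uptake rate
flux_a, flux_b = split * uptake, (1.0 - split) * uptake
```

The fitted split ratio is converted into absolute fluxes by scaling with a measured exchange flux, which mirrors how labeling data and extracellular rates are combined in the procedure described in the text.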
1.3. Constraints-based flux analysis
Constraints-based flux analysis is a general mathematical method that uses optimization-based simulation techniques to analyze cellular metabolism under a specified environmental or genetic condition and to predict metabolic capability when the specified conditions are perturbed (Park et al., 2009). To implement constraints-based flux analysis, a stoichiometric model first needs to be constructed based on genomic information, databases, and the literature. As the genomes of an increasing number of organisms have been completely sequenced, in silico (meaning “performed on a computer or via computer simulation”) genome-scale metabolic models have been constructed for several organisms in the domains of bacteria, archaea, and eukarya in order to explore their metabolic characteristics at a systems level (Duarte et al., 2007; Durot et al., 2009; Feist and Palsson, 2008; Joyce and Palsson, 2006; Kim et al., 2008a) (Fig. 4.1). Reconstruction of the in silico genome-scale metabolic model begins with using the genome annotation to generate a collection of metabolic reactions and the stoichiometric coefficients of the metabolites, giving a set of linear mass balance equations for cellular metabolites that describes the cellular metabolism (Davidsen et al., 2010; Kanehisa et al., 2010). This collection of equations forms the foundation of the metabolic network. Gaps in the metabolic network, due to insufficient data or characterization in the genome annotation, are filled in, and errors are corrected based on knowledge from the literature, databases, and experiments. The in silico genome-scale metabolic model is then validated by comparing simulation results with actual experimental data. If the simulation results differ greatly from experimental observations, the metabolic model should be refined iteratively until the discrepancies are resolved (Fig. 4.1).
After the in silico genome-scale metabolic model has been validated by this iterative process, constraints-based flux analysis can be performed using appropriate objective functions (e.g., maximization of cell growth rate) and constraints that restrict the solution space of the model to exclude incorrect or infeasible metabolic states. In this chapter, we focus on the methods of constraints-based flux analysis using the genome-scale metabolic model and its applications in systems metabolic engineering for strain improvement.
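How a collection of reactions becomes a set of linear mass balance equations can be shown with a small sketch. It uses three glycolytic reactions of the kind listed in Fig. 4.1; the reaction and metabolite abbreviations, the data layout, and the flux values are illustrative assumptions, not part of any published model or software.

```python
# Each reaction maps metabolites to stoichiometric coefficients
# (negative = consumed, positive = produced). Abbreviations are illustrative.
REACTIONS = {
    "PGI": {"g6p": -1, "f6p": +1},
    "PFK": {"f6p": -1, "atp": -1, "fbp": +1, "adp": +1},
    "FBA": {"fbp": -1, "dhap": +1, "gap": +1},
}

def stoichiometric_matrix(reactions):
    """Return (metabolites, reaction_names, S), where S[i][j] is the
    coefficient of metabolite i in reaction j."""
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    names = list(reactions)
    S = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, S

def net_rates(S, v):
    """Compute S·v, the net production rate of each metabolite for fluxes v."""
    return [sum(s * f for s, f in zip(row, v)) for row in S]

metabolites, names, S = stoichiometric_matrix(REACTIONS)

# Equal fluxes through the chain: the internal intermediates balance to zero.
balanced = dict(zip(metabolites, net_rates(S, [2.0, 2.0, 2.0])))
# Unequal fluxes: fructose 6-phosphate accumulates, violating steady state.
unbalanced = dict(zip(metabolites, net_rates(S, [2.0, 1.0, 1.0])))
```

In this fragment only the two intermediates (f6p and fbp) are balanced; the remaining metabolites show nonzero rates because the reactions that supply or drain them are not included. A complete model adds those reactions so that every internal metabolite satisfies S·v = 0.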
[Figure 4.1 panel labels: 1. Genome sequencing; 2. Genome annotation; 3. Automatic reconstruction (database); 4. Extracting biochemical reaction; 5. Manual curation (literature, database); 6. Gap filling in pathways; 7. Gene–reaction correlation; 8. Comparative genomics (phylogeny, gene neighborhood, gene fusion, co-occurrence); 9. Static simulation; 10. Experimental data; 11. Model validation; 12. Systems metabolic engineering (experimental and in silico procedure); 13. Strain improvement. Example reactions shown: D-glucose 6-phosphate → D-fructose 6-phosphate; D-fructose 6-phosphate + ATP → D-fructose 1,6-bisphosphate + ADP; D-fructose 1,6-bisphosphate → glycerone phosphate + D-glyceraldehyde 3-phosphate. Mass balance shown for the in silico genome-scale metabolic model: S·v = 0.]
Figure 4.1 Procedure for the reconstruction of in silico genome-scale metabolic model and its application to metabolic engineering. (A) Automatic reconstruction of metabolic network based on genome sequence and annotation data (1-2-3-4). (B) Manual curation and fine-tuning of the metabolic network using the literature, databases, gene/reaction correlation, and comparative genomics to fill gaps and correct errors in the pathways (4-7-8-2 or 5-6-8-2). (C) Validation of the metabolic model in comparison with experimental data (9-10-11). The biomass composition determined by the experiments is applied to the model. If the simulation results do not correspond with experimental data, the model needs to be refined further by an iterative process until the differences between predictions and experiments are resolved. (D) Systems metabolic engineering for strain improvement by combining experimental and in silico procedures (12-13). The gray arrows indicate the procedures for the construction of the model. The black arrows indicate the procedure for the simulation and validation of the model and its application.
2. General Structure of Constraints-Based Flux Analysis
The in silico algorithms developed to date for genome-scale metabolic models are based on optimization techniques, with various constraints applied to improve the accuracy of the simulation (Park et al., 2009). Before inspecting each in silico algorithm, it is worth understanding the general, optimization-based structure of constraints-based flux analysis. This general structure is the basic principle for constructing in silico algorithms or metabolic models, and it covers all classes of mathematical optimization methods. It consists of two parts: objective functions and the constraints on the metabolic fluxes in the metabolic model (Lee and Papoutsakis, 1999; Stephanopoulos et al., 1998). Constraints are the conditions that must be satisfied while solving for the optimal solution to the metabolic network by maximizing/minimizing the
objective function(s). Constraints can be in the form of either equality or inequality statements. A general form of the constraints is as follows:

$$a\,y_j \le v_j \le b\,y_j, \qquad y_j \in \{0, 1\} \tag{4.1}$$

where $v_j$ is a continuous variable, $y_j$ is a discrete variable having a binary value of 0 or 1, and a and b are constants that represent the lower and upper limits, respectively. Constraints-based flux analysis solves for an optimal solution to the metabolic network by maximizing or minimizing an objective function(s) subject to the constraints defined for the independent variables. More than one objective function can be selected and solved for. Solving the system of equations defining the metabolic network proceeds in the following manner:
1. Determine the decision (or control) variables
2. Formulate all objectives representing the purpose of the decision maker
3. Formulate constraints
4. Maximize or minimize the objective function(s) subject to the constraints

Objective function: Maximize/Minimize

$$Z(x) = \left(c_1 x_1^{m_1} + c_2 x_2^{m_2} + \dots + c_n x_n^{m_n}\right)^{k}, \quad \text{for all } n \tag{4.2}$$

Constraints: Subject to

$$\frac{dX_i}{dt} = \sum_j S_{ij} v_j, \qquad a_j \le v_j \le b_j \tag{4.3}$$
The objective function Z(x) is a mathematical expression of the goals for the system desired by the user, where c, m, and k are constants and $x_n$ is a variable that can represent the fluxes of metabolic reactions, the number of significant metabolic flux changes (as a binary variable), or any other characteristic of interest. The type of simulation is determined by the form of the objective function Z(x). The system of equations is linear or nonlinear according to the values of m and k, which consequently determine the method used to solve the system (i.e., linear programming (LP) or nonlinear programming). The number of objective functions determines whether the problem is a single- or multiple-objective problem. An in silico genome-scale metabolic model is composed of metabolic reactions that define the stoichiometric conversion of substrate metabolites into various intracellular metabolites that are precursors to different components important for cellular function. Mass balances can be set up as Eq. (4.3),
where the difference between the production rate and the consumption rate of a specific metabolite is equal to the rate of change of its concentration. The subscripts i and j represent the indices of metabolites and of the reactions in which the metabolite participates, respectively. X denotes the vector of metabolite concentrations. The stoichiometric matrix S is an m × n matrix, where m is the total number of metabolites and n is the total number of reactions in the metabolic network being described. v is the vector of fluxes of the reactions that consume and produce the metabolites. A stoichiometric coefficient is negative if metabolite i is a substrate of the reaction and positive if it is a product. The fluxes in v are subject to lower and upper bounds, a and b, respectively (Fig. 4.2). To simplify the process of solving this system of equations, the pseudo-steady state assumption is applied to eliminate the time derivative from Eq. (4.3), reducing it to a system of linear equations in the form of Eq. (4.4) (Lee and Papoutsakis, 1999; Stephanopoulos et al., 1998) (Fig. 4.2A). This pseudo-steady state assumption is based on the observation that changes in the intracellular concentrations of metabolites are negligibly small over the timescale of cellular functions such as cell division.

$$\sum_j S_{ij} v_j = 0, \qquad a_j \le v_j \le b_j \tag{4.4}$$
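The steady-state condition of Eq. (4.4) can be checked numerically. The following is a minimal sketch, assuming the example network of Fig. 4.2: the stoichiometric coefficients are transcribed from the mass balances in that figure, whereas the two flux distributions (`WILD_TYPE`, `KNOCKOUT_R6`) and all function names are illustrative assumptions derived by hand rather than values given in the text.

```python
# Reactions of the Fig. 4.2 example: internal (R) and exchange (E) reactions.
REACTIONS = ["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8",
             "E1", "E2", "E3", "E4", "E5"]

# One entry per internal metabolite: {reaction: stoichiometric coefficient}.
# Negative = the reaction consumes the metabolite, positive = it produces it.
MASS_BALANCES = {
    "A":   {"R1": -1, "E1": 1},
    "B":   {"R1": 1, "R2": -4, "R3": -2, "R4": -2},
    "C":   {"R2": 1, "R5": -1, "R6": -1},
    "D":   {"R3": 1, "R5": 1, "R7": -2},
    "E":   {"R6": 2, "R7": 1, "R8": 1, "E2": -1},
    "F":   {"R4": 1, "R8": -2},
    "P":   {"R2": 1, "R4": 1, "R7": 3, "E3": -1},
    "cof": {"R3": -1, "R5": -1, "R7": 2},
    "bp1": {"R2": 1, "E4": -1},
    "bp2": {"R8": 1, "E5": -1},
}


def stoichiometric_matrix():
    """Return S as a list of rows (metabolites) by columns (reactions)."""
    return [[row.get(r, 0) for r in REACTIONS] for row in MASS_BALANCES.values()]


def residuals(flux):
    """Compute S·v for a flux dict; all zeros means pseudo-steady state."""
    v = [flux.get(r, 0.0) for r in REACTIONS]
    return [sum(s * x for s, x in zip(row, v)) for row in stoichiometric_matrix()]


# Hand-derived candidate flux distributions (assumed values, mmol/gDCW/h).
# Reactions not listed carry zero flux.
WILD_TYPE = {"E1": 10, "R1": 10, "R2": 2.5, "R6": 2.5,
             "E2": 5, "E3": 2.5, "E4": 2.5}
KNOCKOUT_R6 = {"E1": 10, "R1": 10, "R3": 5, "R7": 2.5,
               "E2": 2.5, "E3": 7.5}

for name, flux in (("wild type", WILD_TYPE), ("R6 knockout", KNOCKOUT_R6)):
    print(name, residuals(flux))
```

Both candidate distributions give a zero residual for every metabolite, so they are admissible points of the flux solution space; the check says nothing about whether they are optimal for a given objective function.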
3. Algorithms of Metabolic Flux Analysis
Constraints-based flux analysis of the in silico genome-scale metabolic model allows us to investigate the metabolic status of the cell under specified conditions and to rapidly predict and evaluate phenotypes that would result from genetic and/or environmental perturbations to the cell. This has been used with great success in improving the metabolic capability for the overproduction of the desired product (Kim et al., 2008b; Park et al., 2009). Based on the general structure of constraints-based flux analysis, several in silico algorithms have been developed to tailor the objective functions or constraints according to the desired goals such that the cellular physiology can be accurately described for a specific condition and identify targets to engineer for strain improvement (Fig. 4.3). Recently, constraints-based flux analysis was used for identifying metabolic engineering targets for the overproduction of industrially important products, including petroleum-alternative biochemicals, amino acids, biopolymers, and biofuels (Kim et al., 2008a,b; Park et al., 2009; Raman and Chandra, 2009), and for identifying drug targets in pathogens (Hu et al., 2007; Jamshidi and Palsson, 2007; Kim et al., 2010; Lee et al., 2009; Yeh et al., 2004).
[Figure 4.2 panel contents, as far as they can be recovered from the figure. (A) Network diagram and mass balances, written as differential equations and then set to zero under the pseudo-steady state assumption. (B) Objective function: Maximize Z = E2, subject to S_ij·v_j = 0 and a_j ≤ v_j ≤ b_j for j = R1, R2, R3, ..., E4, E5, with the constraint R6 = 0 for the gene knockout. (C) Flux maps of the wild type and the knockout mutant. Stoichiometric matrix S (rows: metabolites; columns: reactions):

       R1  R2  R3  R4  R5  R6  R7  R8  E1  E2  E3  E4  E5
A      -1   0   0   0   0   0   0   0   1   0   0   0   0
B       1  -4  -2  -2   0   0   0   0   0   0   0   0   0
C       0   1   0   0  -1  -1   0   0   0   0   0   0   0
D       0   0   1   0   1   0  -2   0   0   0   0   0   0
E       0   0   0   0   0   2   1   1   0  -1   0   0   0
F       0   0   0   1   0   0   0  -2   0   0   0   0   0
P       0   1   0   1   0   0   3   0   0   0  -1   0   0
cof     0   0  -1   0  -1   0   2   0   0   0   0   0   0
bp1     0   1   0   0   0   0   0   0   0   0   0  -1   0
bp2     0   0   0   0   0   0   0   1   0   0   0   0  -1
]
Figure 4.2 Construction of a metabolic model expressed by a stoichiometric matrix and its simulation using constraints-based flux analysis. (A) An example metabolic model consisting of 13 reactions and 15 metabolites (10 internal and 5 external metabolites). Mass balances for each metabolite are set up as differential equations, where the difference between consumption rate and production rate of a specific metabolite is equal to the change rate of metabolite concentration. Based on the pseudo-steady state assumption, the time derivative can be eliminated, giving a set of linear equations. The stoichiometric coefficients for substrate and product of a reaction are negative and positive, respectively. The stoichiometric matrix S is an m × n matrix where m is the number of metabolites, and n is the number of reactions. v is a vector representing the fluxes of reactions that consume and produce the metabolites. In this model, internal reactions are represented by R, and reactions related with external metabolites are represented by E. The subscripts i and j represent the indices of metabolites and reactions. (B) Optimization based on constraints-based flux analysis is formulated with an objective function(s) subject to mass balances and additional constraints. v is subject to lower and upper bound constraints, represented as a and b, respectively. In this example, to investigate the metabolic capability after gene knockout, the metabolic flux of reaction R6 is constrained to zero while maximizing an objective function E2. (C) The distribution of metabolic fluxes calculated by constraints-based flux analysis for wild type and knockout mutant (i.e., R6 = 0) strains is shown. Upper black and lower gray values indicate the flux values of wild type and knockout mutant, respectively. The deletion of reaction R6 in this example mutant increases the production rate of metabolite P but decreases the flux rate of objective reaction E2, compared with those of wild-type strain. The units of fluxes are mmol/gDCW/h.
[Figure 4.3 flowchart contents. Purposes: to describe cellular physiology accurately (by kind of information: physiological property, 13C-based flux, transcriptional regulation, thermodynamics) and to identify gene targets to be engineered for strain improvement (by kind of genetic perturbation: gene knockout, gene downregulation, gene amplification, insertion of foreign genes). Algorithms shown: FBAwGR, FBAwMC, OMNI, TMFA, EBA, SR-FBA, MOMA, ROOM, flux-sum, OptKnock, OptGene, OptStrain, GDLS, OptORF, OptReg, OptForce, FVA, FCA, FSA, FRA, and FSEOF. Properties marked for each algorithm: linear programming (LP), quadratic programming (QP), mixed integer linear programming (MILP), multi-objective linear programming (MOLP), genetic algorithm (GA), requirement of template flux, gene knockout analysis, gene downregulation analysis, gene amplification analysis, gene insertion analysis, considering relationship among reactions, considering possible flux range, comparing with 13C-based flux, iterative procedure, model reduction, metabolite essentiality, transcriptional regulation, thermodynamics, molecular crowding, and genomic context analysis.]
Figure 4.3 Flowchart for the simulations based on constraints-based flux analysis of several representative in silico algorithms. The black box denotes a particular property that corresponds to an in silico algorithm.
3.1. Flux balance analysis
Flux balance analysis (FBA) is a widely used and basic approach to constraints-based flux analysis. FBA quantifies the intracellular flux distribution of a metabolic network, as represented in Eq. (4.4), by optimizing a linear objective function with LP. Additional constraints can be applied to represent perturbations to the system, thereby allowing the user to predict the change of metabolic fluxes in response to a perturbation (Fig. 4.2B and C). For example, the change in physiology caused by a gene knockout can be investigated by constraining to zero the metabolic flux of the reaction corresponding to the gene that is to be knocked out. Another example is to apply inequality constraints to the flux values (e.g., $v_j \le$ or $\ge$ a desired level of a flux) to represent intervention in gene expression (i.e., down- or upregulation of gene expression). To eliminate unrealistic metabolic fluxes, the flux solution space of the in silico metabolic model is
restricted by utilizing constraints determined from experimentally measured fluxes or physiological data. This limits the solution space to fluxes that are realistically within the cell’s capacity. These constraints are applied as inequality or equality constraints (e.g., $v_j \le$, $\ge$, or $=$ measured value). Generally, the metabolic reaction representing biomass formation, which is based on experimental measurements of biomass composition under various cultivation conditions, has been used as the objective function in FBA of the metabolic network. The selection of the biomass formation reaction as the objective function is based on the assumption that the cell seeks to maximize cellular growth to ensure survival (Orth et al., 2010; Raman and Chandra, 2009; Schuetz et al., 2007; Smallbone and Simeonidis, 2009; Varma et al., 1993). The maximization of growth rate in FBA is useful in predicting the essentiality of gene/reaction and the robustness of a cell under specific genetic or environmental conditions (Edwards and Palsson, 2000; Kauffman et al., 2003). The essentiality of a reaction and robustness of a cell can be explored by observing the change in the objective value for the biomass formation in response to variations in the flux of a particular reaction (Edwards and Palsson, 2000; Kauffman et al., 2003; Orth et al., 2010). For example, if the objective value for the biomass formation is zero when the flux of a particular reaction is constrained to zero to simulate gene deletion, then the relevant gene or reaction is determined to be essential. For the development of novel drugs to kill a pathogenic microorganism, gene/reaction essentiality analysis on the metabolic network of that microorganism can provide useful information for identifying drug targets.
To identify drug targets, essential genes or reactions identified through FBA are further characterized by sequence analyses and structural studies (Hu et al., 2007; Jamshidi and Palsson, 2007; Raman and Chandra, 2009; Yeh et al., 2004). Additionally, by applying constraints to external metabolites, such as substrate uptake rate, oxygen presence to simulate aerobic/anaerobic condition, and biochemical secretion rate, FBA also can be used for quantifying the cellular growth rate under different environmental conditions (Edwards et al., 2001; Oberhardt et al., 2009), investigating byproduct secretion under increasingly anaerobic conditions (Varma et al., 1993), evaluating carbon source utilization capacity (Oberhardt et al., 2009; Orth et al., 2010; Varma et al., 1993), and identifying the optimal growth media composition (Song et al., 2008). The capability of carbon source utilization and the optimal media composition for growth can be examined through varying the constraints related to carbon source uptake and media composition and observing the effects on the growth rate (i.e., viable, nonviable, or maximal growth rate). However, the objective function for FBA is not restricted to only biomass formation and other objective functions have been utilized to investigate other characteristics of the metabolic network, including maximization of ATP or reducing power (Ramakrishna et al., 2001; Schuetz et al., 2007), and maximization of a particular biochemical production
(Hong et al., 2003; Kauffman et al., 2003). This allows FBA the flexibility to investigate a wide range of target phenotypes. In investigating the production capability of a desired biochemical and identifying alternative metabolic pathways that lead to the production of desired biochemical, the in silico theoretical maximum yield is evaluated (i.e., maximizing the production rate of the target biochemical in FBA; Hong et al., 2003). FBA can also be used to calculate the yields of important cofactors, such as ATP, NADH, or NADPH (Orth et al., 2010; Varma et al., 1993). FBA has also been used for the refinement of in silico metabolic models by filling gaps in the metabolic network due to incomplete information in the genome annotation and the databases. Gaps in the metabolic network appear where the predicted results are inconsistent with experimental data. Analysis of the results generated using FBA can identify the missing reactions that are not annotated in the genome and are required in the metabolic network to reconcile the disagreements between predictions and experiments. By filling in these gaps for the model refinement, the genome annotation is concurrently updated (Oberhardt et al., 2008, 2009; Raman and Chandra, 2009; Reed et al., 2006). FBA and FBA-based approaches have been utilized in metabolic engineering to identify gene targets with the goal of improving the production yield of a desired biochemical. This is accomplished by selecting targets which increase the availability of metabolic precursors and cofactor balancing by redirecting the metabolic fluxes through fluxes that generate the desired biochemical (Park et al., 2009; Raman and Chandra, 2009). Gene knockout approaches, such as OptKnock and its derivatives, have been widely used to identify target genes that will block competing fluxes and funnel the flux toward the overproduction of biochemicals (Burgard et al., 2003; Lee et al., 2005a; Pharkya et al., 2003). 
In addition to gene knockout approaches, other FBA-based approaches have been developed to analyze features of metabolic network and the relationship among different reactions to each other in the metabolic network. These approaches include flux variability analysis (FVA), flux coupling analysis (FCA), flux sensitivity analysis (FSA), and flux response analysis (FRA) (Burgard et al., 2004; Jung et al., 2010; Lee et al., 2007; Mahadevan and Schilling, 2003; Price et al., 2004) and will be discussed later in detail (Fig. 4.3). By considering the relationship of the metabolic reaction to the desired biochemical, regulatory targets can be identified to improve the production of the desired biochemical.
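As a worked illustration of the FBA formulation and of knockout simulation by constraint, consider the example network of Fig. 4.2. The derivation below is a sketch under assumptions that the text does not state: all reactions are taken as irreversible ($v_j \ge 0$) and the substrate uptake is capped at $E_1 \le 10$ mmol/gDCW/h.

```latex
\begin{aligned}
&\text{maximize } Z = E_2 \quad \text{subject to } \sum_j S_{ij} v_j = 0,\quad v_j \ge 0,\quad E_1 \le 10 \\[4pt]
&\text{Mass balances give: } R_1 = E_1 = 4R_2 + 2R_3 + 2R_4,\quad R_6 = R_2 - R_5,\quad
  R_7 = \tfrac{1}{2}(R_3 + R_5),\quad R_8 = \tfrac{1}{2}R_4 \\[4pt]
&E_2 = 2R_6 + R_7 + R_8 = 2R_2 - \tfrac{3}{2}R_5 + \tfrac{1}{2}R_3 + \tfrac{1}{2}R_4 \\[4pt]
&\text{Wild type: } R_2 \text{ yields } \tfrac{2}{4} \text{ of } E_2 \text{ per unit of uptake, } R_3 \text{ and } R_4 \text{ yield } \tfrac{1/2}{2}
  \;\Rightarrow\; R_2 = 2.5,\; R_5 = 0,\; Z^{*} = 5 \\[4pt]
&\text{Knockout } (R_6 = 0):\; R_5 = R_2 \;\Rightarrow\; E_2 = \tfrac{1}{2}(R_2 + R_3 + R_4)
  \;\Rightarrow\; R_2 = 0,\; R_3 + R_4 = 5,\; Z^{*} = 2.5
\end{aligned}
```

Under these assumptions the optimum falls from 5 to 2.5 when R6 is removed but stays above zero, so R6 would not be classed as essential, whereas constraining R1 to zero forces every flux to zero and would mark R1 as essential. The knockout optimum is not unique (any split with R3 + R4 = 5 is optimal), which is one reason the flux ranges examined by FVA are informative.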
3.2. Identifying gene targets for engineering strain development: Gene knockout
In metabolic engineering, gene knockout is the most common and important tool that generates strategies leading to the overproduction of the desired biochemical by redirecting metabolic fluxes and redesigning the
metabolic pathways of the host strain (Fig. 4.2B and C). However, strategies generated from knockout simulations cannot always be trusted to reflect in vivo knockout phenotypes accurately, because biological systems do not instantly attain the optimal phenotype predicted by FBA. The cell requires a period of adjustment to the perturbation introduced to its metabolic network. To account for this adjustment, algorithms describing the physiological characteristics of a cell after gene knockout perturbations were developed: minimization of metabolic adjustment (MOMA) and regulatory on/off minimization (ROOM; Segre et al., 2002; Shlomi et al., 2005) (Fig. 4.3). These algorithms require a template flux distribution, from which the flux distribution of the mutant is calculated. Typically, the template flux distribution is the flux distribution of the wild-type strain or of the base strain for the next stage of engineering. MOMA assumes that the metabolic fluxes in the mutant undergo a minimal redistribution relative to the wild type. Therefore, the objective function in MOMA finds a unique flux distribution for the mutant network that is closest to the given template flux distribution in the Euclidean norm, using quadratic programming (QP; Segre et al., 2002). ROOM, in contrast, uses an objective function that looks for a flux distribution minimizing the number of significant flux changes from the template flux distribution, using mixed integer linear programming (MILP; Shlomi et al., 2005). Comparing the two algorithms, MOMA tends to spread small changes over most of the flux values in the network relative to the template flux distribution, whereas ROOM keeps the number of changed fluxes small.
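The difference between the two objective functions can be made concrete by evaluating them on fixed flux distributions. The sketch below only scores candidates; it does not solve the QP or MILP that MOMA and ROOM actually use. The template and the two R6-knockout candidates are hand-derived steady states of the Fig. 4.2 example, and the function names and the tolerances `delta` and `epsilon` are illustrative assumptions.

```python
import math

REACTIONS = ["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8",
             "E1", "E2", "E3", "E4", "E5"]


def moma_distance(template, candidate):
    """Euclidean distance between two flux distributions (MOMA-style objective)."""
    return math.sqrt(sum((candidate.get(r, 0.0) - template.get(r, 0.0)) ** 2
                         for r in REACTIONS))


def room_changes(template, candidate, delta=0.03, epsilon=0.001):
    """Count fluxes that leave a tolerance band around the template (ROOM-style objective).

    delta is a relative and epsilon an absolute tolerance; both values are assumptions.
    """
    count = 0
    for r in REACTIONS:
        w = template.get(r, 0.0)
        v = candidate.get(r, 0.0)
        if abs(v - w) > delta * abs(w) + epsilon:
            count += 1
    return count


# Template (wild type) and two candidate mutant states with R6 = 0.
# All values are hand-derived assumptions in mmol/gDCW/h; missing reactions are zero.
WILD_TYPE = {"E1": 10, "R1": 10, "R2": 2.5, "R6": 2.5,
             "E2": 5, "E3": 2.5, "E4": 2.5}
MUTANT_VIA_R3 = {"E1": 10, "R1": 10, "R3": 5, "R7": 2.5,
                 "E2": 2.5, "E3": 7.5}
MUTANT_VIA_R5 = {"E1": 10, "R1": 10, "R2": 2.5, "R5": 2.5, "R7": 1.25,
                 "E2": 1.25, "E3": 6.25, "E4": 2.5}

for name, mutant in (("via R3", MUTANT_VIA_R3), ("via R5", MUTANT_VIA_R5)):
    print(name, round(moma_distance(WILD_TYPE, mutant), 3),
          room_changes(WILD_TYPE, mutant))
```

In this example the candidate that reroutes through R5 is closer to the template under both measures even though its E2 flux is lower than that of the R3 candidate, which illustrates why template-based predictions can differ from a growth-optimal FBA prediction.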
MOMA was utilized to identify gene knockout targets to develop strains capable of enhanced production of lycopene (Alper et al., 2005; Choi et al., 2010), L-valine (Park et al., 2007), and polylactic acid (Jung et al., 2010) in E. coli and sesquiterpene in Saccharomyces cerevisiae (Asadollahi et al., 2009). In particular, a sequential and iterative optimization approach using MOMA, whereby single gene knockouts are investigated in the genetic background of mutants identified in previous iterations, was used to identify knockout target genes for the overproduction of lycopene and L-valine in E. coli (Alper et al., 2005; Park et al., 2007). For the production of sesquiterpene in S. cerevisiae, the effects of gene knockouts were evaluated using MOMA as the objective function and OptGene as the simulation framework (Asadollahi et al., 2009). ROOM was shown to improve flux predictions for a pyruvate kinase (pyk) knockout E. coli strain and to perform well in predicting gene essentiality in S. cerevisiae compared with either FBA or MOMA (Shlomi et al., 2005). Although enhanced production of the desired biochemical through genetic modifications is the desired outcome, increasing the production rate of the desired biochemical often negatively affects the cellular growth rate,
and vice versa. To resolve this dilemma, OptKnock, a bi-level optimization framework using MILP, was developed (Burgard et al., 2003) (Fig. 4.3). OptKnock allows the user to find a set of gene knockout targets that increase the fluxes toward the production of the desired biochemical, while biomass precursors are simultaneously generated to maintain a sufficient level of growth. OptKnock has been utilized to suggest gene knockout strategies for the production of amino acids (Pharkya et al., 2003), lactic acid (Fong et al., 2005; Hua et al., 2006), succinic acid (Burgard et al., 2003), and 1,3-propanediol (Burgard et al., 2003) in E. coli. Based on the OptKnock framework, OptGene was developed to identify knockout target genes for optimizing the production of a desired biochemical using a genetic algorithm, instead of MILP, to reduce computational time (Patil et al., 2005) (Fig. 4.3). A population of genotypes is initialized by assigning an on/off status to each gene, and each individual genotype is then scored for its fitness using FBA, MOMA, ROOM, or any other algorithm. After scoring, the best individual is selected for the generation of a new population by applying random genetic modifications, crossovers, and mutations. This cycle of evolution is repeated until a mutant with satisfactory performance is obtained. OptGene has suggested potential gene knockout targets for the improved production of vanillin, glycerol, succinic acid, and sesquiterpene in S. cerevisiae (Asadollahi et al., 2009; Patil et al., 2005). Other algorithms can also be applied to predict gene knockout targets (Fig. 4.3): OptStrain (Pharkya et al., 2004), OptReg (Pharkya and Maranas, 2006), OptORF (Kim and Reed, 2010), and OptForce (Ranganathan et al., 2010), which use the OptKnock framework as a starting point, and genetic design through local search (GDLS; Lun et al., 2009), a heuristic algorithm that reduces the computational burden.
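The bi-level idea behind OptKnock (an outer choice of knockouts evaluated at the growth optimum of an inner problem) can be sketched on the Fig. 4.2 example. This is not the MILP formulation of OptKnock: the sketch swaps in brute-force enumeration, and it replaces the inner LP with a hand-derived table of routes through the network at full uptake (E1 = 10), assuming irreversible reactions. E2 stands in for growth and E3 (secretion of P) for the product; `ROUTES`, `inner_problem`, and `outer_problem` are hypothetical names.

```python
from itertools import combinations

# Hand-derived routes of the Fig. 4.2 example at E1 = 10 (assumed values).
# Each entry: (internal reactions used, growth proxy E2, product secretion E3).
ROUTES = [
    ({"R2", "R6"}, 5.0, 2.5),
    ({"R3", "R7"}, 2.5, 7.5),
    ({"R4", "R8"}, 2.5, 5.0),
    ({"R2", "R5", "R7"}, 1.25, 6.25),
]
CANDIDATES = ["R2", "R3", "R4", "R5", "R6", "R7", "R8"]


def inner_problem(knockouts):
    """Inner level: maximize growth, then report (growth, min product, max product)
    over the growth-optimal routes that survive the knockouts."""
    alive = [r for r in ROUTES if not (r[0] & set(knockouts))]
    if not alive:
        return 0.0, 0.0, 0.0
    growth = max(r[1] for r in alive)
    products = [r[2] for r in alive if r[1] == growth]
    return growth, min(products), max(products)


def outer_problem(max_knockouts, min_growth):
    """Outer level: smallest knockout sets with the highest product at optimal growth."""
    scored = []
    for k in range(max_knockouts + 1):
        for ko in combinations(CANDIDATES, k):
            growth, worst, best_case = inner_problem(ko)
            if growth >= min_growth:
                scored.append((best_case, ko, growth, worst))
    top = max(s[0] for s in scored)
    fewest = min(len(s[1]) for s in scored if s[0] == top)
    return [s for s in scored if s[0] == top and len(s[1]) == fewest]


for best_case, ko, growth, worst in outer_problem(max_knockouts=1, min_growth=1.0):
    print(ko, "growth", growth, "product range", (worst, best_case))
```

With one knockout allowed, the sketch returns R2 and R6 as equivalent targets, consistent with the R6 knockout illustrated in Fig. 4.2. The reported product range also shows a known caveat of optimistic bi-level designs: alternative growth-optimal states of the same mutant can secrete less product than the best case.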
3.3. Identifying gene targets for engineering strain development: Up- or downregulation of genes
Increasing or decreasing gene expression levels is a widely recognized approach in the metabolic engineering community for increasing the production of a target biochemical (Jensen and Hammer, 1998; Koffas et al., 2003). Determining whether a gene should be upregulated or downregulated is based on the relationships among metabolic reactions and on how reactions respond when the flux of a specific reaction (e.g., production of the desired biochemical) is varied. Several different algorithms based on LP have been developed to investigate the relationship between the metabolic reactions in the metabolic network and the characteristic of interest (Burgard et al., 2004; Jung et al., 2010; Lee et al., 2007; Mahadevan and Schilling, 2003; Price et al., 2004) (Fig. 4.3). FVA investigates the possible flux ranges of reactions (i.e., flux solution space of the metabolic reactions) by examining the maximal and minimal fluxes for each reaction (Bushell et al., 2006;
Khannapho et al., 2008; Puchalka et al., 2008). FVA was utilized to identify inactive or infeasible reactions and to classify reactions according to their simulated behavior in in silico genome-scale metabolic models by considering their minimal and maximal flux values (Faria et al., 2010; Feist et al., 2007; Lun et al., 2009; Teusink et al., 2006). FVA can also analyze the changes in the flux ranges of reactions after the flux of a metabolic reaction is forced up or down, and was applied to identify gene targets to be engineered for the production of biochemicals, such as succinic acid, 1-butanol, and lycopene in E. coli (Choi et al., 2010; Ranganathan and Maranas, 2010; Ranganathan et al., 2010). FCA examines the correlations for every pair of metabolic fluxes in the metabolic network (Bundy et al., 2007; Burgard et al., 2004; Puchalka et al., 2008). FCA was used to analyze the coupled reaction sets in in silico genome-scale metabolic models of E. coli, Pseudomonas putida, Helicobacter pylori, S. cerevisiae, and Homo sapiens (Burgard et al., 2004; Duarte et al., 2007; Pal et al., 2005; Puchalka et al., 2008). FSA explores the change in the objective function flux in response to the flux changes of other metabolic reactions (Delgado and Liao, 1997; Price et al., 2004). FSA suggested metabolic engineering strategies for improving the production of biochemicals, such as acetate, phenylalanine, and erythromycin precursors in E. coli (Delgado and Liao, 1997; Gonzalez-Lergier et al., 2006; Wahl et al., 2004) and was applied to estimate the usefulness of a metabolite toward increasing the growth rate (Finley et al., 2010; Grafahrend-Belau et al., 2009). FRA examines the response of the flux values for target reactions (i.e., desired biochemical production rate and cell growth rate) to the variation in the fluxes of other metabolic reactions.
FRA was applied to identify metabolic engineering strategies to increase the production of L-threonine, malic acid, and polylactic acid (PLA) in E. coli (Jung et al., 2010; Lee et al., 2007; Moon et al., 2008). FRA was also employed, in conjunction with MOMA, to identify targets for developing a PLA-overproducing E. coli strain (Jung et al., 2010). Flux scanning based on enforced objective flux (FSEOF), a method for identifying gene amplification targets, scans the changes of all the metabolic fluxes in response to the enhancement of the flux toward the desired biochemical (Choi et al., 2010) (Fig. 4.3). FSEOF selects, as amplification targets, the reactions whose fluxes increase when the flux toward the production of the desired biochemical is forced to increase. This method was validated by identifying amplification targets that improved the production of lycopene in E. coli (Choi et al., 2010). To consider the simultaneous application of multiple up- or downregulations and eliminations of genes, OptReg and OptForce, derivatives of OptKnock, were developed (Pharkya and Maranas, 2006; Ranganathan et al., 2010) (Fig. 4.3). OptReg requires the determination of initial steady-state fluxes for all metabolic reactions (Pharkya and Maranas, 2006). The fluxes of metabolic reactions are defined as repressed or activated when they are sufficiently lower or higher, respectively, than the
corresponding initial steady-state fluxes. OptReg suggested metabolic engineering strategies for the production of ethanol and succinic acid in E. coli (Pharkya and Maranas, 2006). OptForce identifies possible engineering interventions by comparing the maximal ranges of flux variability for all metabolic reactions in the wild-type metabolic network with those in a hypothetical overproducing network (Ranganathan et al., 2010). In the overproducing network, the fluxes of the overproduction targets are forced to meet the desired levels through additional constraints (i.e., $v_{target} > v_{desired}$). By doing so, OptForce identifies the sets of reactions whose fluxes must change to achieve the prespecified overproducing network (i.e., MUST sets) and classifies them into reactions whose flux must increase, decrease, or become zero: MUSTU, MUSTL, and MUSTX, respectively. Based on these sets, OptForce subsequently extracts a minimal set of fluxes (i.e., FORCE set) that must be modified to obtain the desired phenotype. OptForce was employed to examine the metabolic network for the increased production of succinic acid and 1-butanol in E. coli (Ranganathan and Maranas, 2010; Ranganathan et al., 2010). OptORF, a bi-level optimization method, was developed to design optimal gene knockout and amplification strategies for strain improvement by integrating transcriptional regulatory networks and metabolic networks (Kim and Reed, 2010) (Fig. 4.3). OptORF identifies the targets for modification that maximize biochemical production along with cellular growth rate, using transcriptional regulatory constraints based on Boolean logic (e.g., AND, OR, TRUE, and FALSE) to account for transcriptional regulation in the metabolic network. OptORF was implemented for producing ethanol and higher alcohols (e.g., isobutanol) in E. coli (Kim and Reed, 2010).
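The comparison of flux ranges that underlies the MUST sets can be sketched as a simple interval test. This is a simplified reading of the description above, not the OptForce formulation itself: a reaction is flagged only when its flux range in the overproducing network does not overlap its wild-type range. The function name, the tolerance, and the example ranges (hand-derived for the Fig. 4.2 network with P secretion enforced at its maximum) are assumptions.

```python
def classify_must_sets(wild_type_ranges, overproducer_ranges, tol=1e-9):
    """Assign reactions to MUSTU, MUSTL, or MUSTX by comparing flux ranges.

    Each argument maps a reaction name to its (min_flux, max_flux) range.
    Reactions whose two ranges overlap are left unclassified.
    """
    must = {"MUSTU": [], "MUSTL": [], "MUSTX": []}
    for rxn, (wt_min, wt_max) in wild_type_ranges.items():
        op_min, op_max = overproducer_ranges[rxn]
        overlap = op_min <= wt_max + tol and op_max >= wt_min - tol
        if overlap:
            continue
        if abs(op_min) <= tol and abs(op_max) <= tol:
            must["MUSTX"].append(rxn)   # flux must become zero
        elif op_min > wt_max:
            must["MUSTU"].append(rxn)   # flux must increase
        else:
            must["MUSTL"].append(rxn)   # flux must decrease
    return must


# Assumed flux ranges (mmol/gDCW/h): growth-optimal wild type versus a
# hypothetical overproducer in which E3 (secretion of P) is forced to 7.5.
WILD_TYPE_RANGES = {"R2": (2.5, 2.5), "R3": (0.0, 0.0), "R4": (0.0, 0.0),
                    "R6": (2.5, 2.5), "R7": (0.0, 0.0),
                    "E2": (5.0, 5.0), "E3": (2.5, 2.5)}
OVERPRODUCER_RANGES = {"R2": (0.0, 0.0), "R3": (5.0, 5.0), "R4": (0.0, 0.0),
                       "R6": (0.0, 0.0), "R7": (2.5, 2.5),
                       "E2": (2.5, 2.5), "E3": (7.5, 7.5)}

print(classify_must_sets(WILD_TYPE_RANGES, OVERPRODUCER_RANGES))
```

In practice the ranges would come from FVA on the two networks; selecting a minimal FORCE set from the MUST sets is a further optimization step that this sketch does not attempt.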
Developing multiple genetic manipulation strategies is necessary to maximize the capabilities of the metabolic network in producing the desired biochemical and to achieve improved cellular performance. However, as the scale of the metabolic model and the number of genetic manipulations employed increase, the calculation time and computational resources needed to formulate these strategies increase exponentially. To resolve these difficulties, GDLS was developed; it involves first a model reduction and then an iterative cycle of sequential gene knockout simulation (Lun et al., 2009) (Fig. 4.3). The GDLS framework starts with the reduction of the metabolic model into a smaller model that is equivalent in performance to the original model. This step removes dead-end reactions and linked reactions that are not essential to the functioning of the metabolic model. Then, GDLS randomly selects an initial set of genetic manipulations and yields a recombinant network. Subsequently, GDLS searches for the best additional genetic manipulation that improves the
phenotype using MILP. The best perturbed network selected is used as the start point in the next round of the search. This search cycle continues until no further improvements are found within the allowed range of genetic manipulations. This GDLS framework can operate any other optimization algorithms, such as OptKnock or OptReg, in each cycle to find target genes for manipulation. The GDLS algorithm was applied to the production of acetate and succinic acid in E. coli, and its performance was compared to the global search method used by OptKnock (Lun et al., 2009).
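The iterative search cycle described above can be sketched as a plain local search. The sketch makes several substitutions that should be read as assumptions: the neighborhood is explored by toggling one knockout at a time and scoring every neighbor exhaustively (GDLS uses MILP for this step), the model reduction step is omitted, and the score is a hand-derived route table for the Fig. 4.2 example rather than a genome-scale simulation. `ROUTES`, `score`, and `local_search` are hypothetical names.

```python
# Hand-derived routes of the Fig. 4.2 example at E1 = 10 (assumed values).
# Each entry: (internal reactions used, growth proxy E2, product secretion E3).
ROUTES = [
    ({"R2", "R6"}, 5.0, 2.5),
    ({"R3", "R7"}, 2.5, 7.5),
    ({"R4", "R8"}, 2.5, 5.0),
    ({"R2", "R5", "R7"}, 1.25, 6.25),
]
CANDIDATES = ["R2", "R3", "R4", "R5", "R6", "R7", "R8"]


def score(knockouts, min_growth=1.0):
    """Best-case product secretion at optimal growth; zero if growth is too low."""
    alive = [r for r in ROUTES if not (r[0] & knockouts)]
    if not alive:
        return 0.0
    growth = max(r[1] for r in alive)
    if growth < min_growth:
        return 0.0
    return max(r[2] for r in alive if r[1] == growth)


def local_search(start=frozenset(), max_size=3):
    """Move to the best neighboring knockout set until no neighbor improves the score."""
    current = frozenset(start)
    best = score(current)
    while True:
        # Neighbors differ from the current set by adding or removing one knockout.
        neighbors = [current ^ {gene} for gene in CANDIDATES]
        neighbors = [n for n in neighbors if len(n) <= max_size]
        top = max(neighbors, key=score)
        if score(top) <= best:
            return current, best
        current, best = top, score(top)


print(local_search())
print(local_search(start={"R3"}))
```

Starting from the unmodified network or from a poor initial choice (R3 knocked out), the search settles on the same single knockout in this toy case. A local search of this kind gives no guarantee of reaching the global optimum, which is the trade-off GDLS accepts in exchange for a lower computational burden.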
3.4. Identifying gene targets for engineering strain development: Foreign genes insertion
To confer nonnative functionality on a host organism to achieve a desired phenotype, the insertion of foreign genes is considered. The OptStrain framework was developed for examining the insertion of foreign genes for strain improvement (Pharkya et al., 2004) (Fig. 4.3). OptStrain first identifies a pathway that can achieve the maximum in silico yield of the desired biochemical using a universal reaction database, which includes all elementally balanced metabolic reactions. OptStrain subsequently redesigns a stoichiometrically balanced metabolic pathway that contains the minimum number of nonnative reactions from the universal database, incorporates the nonnative reactions into the host’s metabolic model, and finally applies the OptKnock framework to optimize the phenotype of the newly designed strain. The OptStrain framework was validated by designing strategies for the production of hydrogen and vanillin in E. coli (Pharkya et al., 2004).
3.5. Identifying gene targets for engineering strain development: Metabolite essentiality
Biological systems maintain phenotypic stability against diverse genetic and environmental perturbations owing to redundant or alternative pathways. This inherent property of metabolic networks is called robustness (Kitano, 2004, 2007a,b; Stelling et al., 2004; Xu et al., 2009). The study of the robustness of an organism has generally depended on identifying the genes or reactions essential for cell growth, that is, on a reaction-centric viewpoint. However, the reaction-centric approach has met with difficulties, because only a limited number of genes or reactions have been identified as essential ones whose loss destroys cellular robustness. Thus, a metabolite-centric approach, flux-sum, for analyzing the robustness of the metabolic network was reported (Chung and Lee, 2009; Kim et al., 2007) (Fig. 4.3). The flux-sum is defined as one half of the summation of all consumption and generation fluxes around a particular metabolite under pseudo-steady state. Flux-sum analysis elucidates the essentiality of metabolites in the
network by observing how the flux-sum responds to perturbations of the metabolic network. For example, essential metabolites maintain a steady flux-sum against diverse perturbations through the redistribution of metabolic fluxes. Hence, the breakdown of the flux-sum around essential metabolites can have negative effects on cellular robustness and on cell growth or survival. This metabolite-centric approach provides unique insights into cellular robustness and the associated fragility. Flux-sum analysis has been applied in several settings where robustness or its disruption is examined, such as identifying drug target candidates in pathogens. Using the metabolic network of the pathogen Acinetobacter baumannii AYE, the flux-sum approach was used to find the most effective drug targets for killing the pathogen by targeting metabolites that are essential to the robustness of its metabolic network (Kim et al., 2010).
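The flux-sum definition given above can be illustrated with a minimal sketch. It assumes the flux-sum of metabolite i is computed as one half of the sum of |S_ij v_j| over all reactions j, where S_ij is the stoichiometric coefficient and v_j the flux. The toy network, the reaction names, and the flux values are hypothetical and are not taken from the cited studies.

```python
def flux_sum(stoichiometry, fluxes):
    """Return the flux-sum of every metabolite.

    stoichiometry: {metabolite: {reaction: coefficient}}
    fluxes:        {reaction: flux value}
    """
    result = {}
    for metabolite, row in stoichiometry.items():
        # |S_ij * v_j| is the turnover that reaction j contributes to
        # metabolite i, whether the reaction produces or consumes it.
        turnover = sum(abs(coeff * fluxes[rxn]) for rxn, coeff in row.items())
        # Under pseudo-steady state, production equals consumption,
        # so half of the total turnover is the flux through the metabolite.
        result[metabolite] = 0.5 * turnover
    return result


# Hypothetical toy network: uptake -> A, A -> B (r1), A -> C (r2),
# B -> out (out_B), C -> out (out_C).
stoichiometry = {
    "A": {"uptake": 1.0, "r1": -1.0, "r2": -1.0},
    "B": {"r1": 1.0, "out_B": -1.0},
    "C": {"r2": 1.0, "out_C": -1.0},
}

# Hypothetical steady-state flux distributions.
wild_type = {"uptake": 10.0, "r1": 6.0, "r2": 4.0, "out_B": 6.0, "out_C": 4.0}
# The same network after r2 is removed and the flux is rerouted through r1.
knockout = {"uptake": 10.0, "r1": 10.0, "r2": 0.0, "out_B": 10.0, "out_C": 0.0}

print(flux_sum(stoichiometry, wild_type))
print(flux_sum(stoichiometry, knockout))
```

In this toy case the flux-sum of A is unchanged by the perturbation because the flux is redistributed, whereas the flux-sum of C collapses to zero. This is the kind of behavior that flux-sum analysis compares across metabolites.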
3.6. Accurately describing cellular physiology: Incorporation of experimental data and physiological properties into the in silico model
The results simulated from an in silico genome-scale model usually do not agree with experimental data because the information used to reconstruct the model is incomplete. Discrepancies between predictions and experimental data can also arise from the broad flux solution space of the in silico genome-scale model, which represents all physiologically feasible states and is much larger than the biologically feasible flux solution space of a real organism (Kim et al., 2008a; Palsson et al., 2003; Park et al., 2009). Thus, there have been several efforts to reduce the flux solution space of the in silico genome-scale model and thereby reduce the differences between prediction and experiment.
In some cases, inaccurate predictions can be caused by differences between the sets of reactions that are active in the in vivo system and in the in silico model. The flux solution space of the in silico genome-scale model can be reduced by eliminating reactions that are unrealistic under a given condition. The optimal metabolic network identification (OMNI) algorithm determines the active reactions in the in silico genome-scale model by comparison with experimentally measured fluxes from 13C-based flux analysis (Herrgard et al., 2006a) (Fig. 4.3). Using MILP to minimize the discrepancies between experimental data and in silico predictions, OMNI efficiently identifies the set of reactions that need to be included in the metabolic model and finds bottleneck reactions that need to be excluded from it.
Another strategy for reducing the flux solution space of the in silico genome-scale model is to supply additional and more detailed information about the metabolic network in the form of additional constraints for the model and simulations (Fig. 4.3).
Experimentally measured flux data, typically obtained from 13C-based flux analysis, and
fermentation data can be used as constraints during the simulation of an in silico genome-scale model (Blank et al., 2005; Fischer and Sauer, 2005; Kim and Lee, 2006) (Fig. 4.3). In 13C-based flux analysis, the in vivo fluxes calculated using isotope-labeled substrates can serve as realistic constraints by limiting the flux values of the corresponding intracellular reactions.
In vivo systems are controlled by complex regulatory mechanisms that respond to various environmental changes, such as temperature, pH, and oxygenation conditions, or to genetic perturbations. Thus, attempts were made to integrate regulatory information into in silico metabolic models using Boolean logic to describe the regulatory mechanisms (Barrett et al., 2005; Covert et al., 2004, 2008; Herrgard et al., 2006b; Shlomi et al., 2007) (Fig. 4.3). Steady-state regulatory flux balance analysis (SR-FBA) was developed by combining an in silico genome-scale model with a transcriptional regulatory network, with GPR relationships represented through Boolean logic. Using MILP, SR-FBA can express the regulatory effects of environmental or genetic perturbations by treating the on/off states of genes, proteins, and reactions as binary variables.
Conventional FBA determines flux distributions from the mass balances of metabolites alone and does not consider the thermodynamics of reactions. As a result, several reactions can show thermodynamically infeasible fluxes. To resolve this issue, FBA extensions such as energy balance analysis (EBA) (Beard et al., 2004; Yang et al., 2005), thermodynamics-based metabolic flux analysis (TMFA) (Henry et al., 2007), and others (Feist et al., 2007; Kummel et al., 2006) have been developed (Fig. 4.3). The feasibility, directionality, and reversibility of metabolic reactions in the model can be determined by calculating the Gibbs free energy of the reactions based on the laws of thermodynamics.
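As a simplified illustration of how a Gibbs free energy calculation can assign directionality, the sketch below evaluates the standard relation ΔG' = ΔG°' + RT Σ n_i ln c_i over assumed concentration bounds and classifies a reaction accordingly. It is not the EBA or TMFA formulation, which embed such constraints in an optimization problem. The function names, the ΔG°' values, and the concentration ranges are hypothetical.

```python
import math

R = 8.314e-3  # gas constant in kJ/(mol K)
T = 298.15    # temperature in K (assumed)


def delta_g_range(dg0, stoich, conc_bounds):
    """Return (min, max) of dG' = dG0' + R*T*sum(n_i * ln c_i).

    dg0:         standard transformed Gibbs energy in kJ/mol
    stoich:      {metabolite: coefficient}, negative for substrates
    conc_bounds: {metabolite: (c_min, c_max)} in mol/L
    """
    low = high = dg0
    for met, n in stoich.items():
        c_min, c_max = conc_bounds[met]
        term_a = R * T * n * math.log(c_min)
        term_b = R * T * n * math.log(c_max)
        # Each metabolite contributes independently, so the extremes of
        # the sum are the sums of the per-metabolite extremes.
        low += min(term_a, term_b)
        high += max(term_a, term_b)
    return low, high


def directionality(dg0, stoich, conc_bounds):
    """Classify a reaction from the sign of dG' over the allowed range."""
    low, high = delta_g_range(dg0, stoich, conc_bounds)
    if high < 0:
        return "forward only"   # dG' is negative for all concentrations
    if low > 0:
        return "reverse only"   # dG' is positive for all concentrations
    return "reversible"         # the sign of dG' depends on concentrations


# Hypothetical reaction A -> B with both metabolites between 10 uM and 10 mM.
stoich = {"A": -1, "B": 1}
bounds = {"A": (1e-5, 1e-2), "B": (1e-5, 1e-2)}

for dg0 in (-30.0, -5.0, 30.0):
    print(dg0, directionality(dg0, stoich, bounds))
```

A reaction classified as "forward only" or "reverse only" in this way could have its flux bounds restricted to the corresponding sign before FBA is run.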
Metabolic reactions that are found to violate the laws of thermodynamics can be modified or excluded from the in silico metabolic model.
The intracellular cytoplasm is occupied by macromolecules, many of which are enzymes (Beg et al., 2007). Cytoplasmic enzymes are confined to the available cytoplasmic space. Thus, the concentration of cytoplasmic enzymes cannot increase further without drastic effects on protein structures, biochemical reaction kinetics, and transport dynamics within the limited cytoplasmic space of a cell. Consequently, competition among enzymes for the limited cytoplasmic space might affect the attainable flux of each reaction. To incorporate this physiological property into FBA, FBA with molecular crowding (FBAwMC), which represents physical and spatial constraints, was applied to predict the growth rates of E. coli wild-type and mutant strains and to examine the dynamic patterns of substrate utilization (i.e., the sequence and mode of substrate uptake) of E. coli cells in mixed-substrate media (Beg et al., 2007) (Fig. 4.3).
Intracellular proteins, including enzymes responsible for catalyzing metabolic reactions, interact directly through physical binding and may also
interact indirectly by sharing a substrate during enzymatic action, regulating each other transcriptionally, or forming larger multiprotein assemblies (von Mering et al., 2003, 2007). Functional associations among proteins can be analyzed from the genomic context of their genes in the form of conserved neighborhood (i.e., the degree of proximity), gene fusion (i.e., events forming a hybrid gene), and co-occurrence (i.e., presence or absence across organisms; Jensen et al., 2009). To incorporate this physiological property into FBA, constraints that group functionally and physically related reactions in the metabolic network were developed by considering genomic context and flux-converging patterns (Park et al., 2010) (Fig. 4.3). Based on genomic context analysis, functionally related reactions are organized together. Following genomic context analysis, the reactions in each group are further clustered by flux-converging pattern analysis, which considers the carbon number of the metabolites in the reactions and the patterns of fluxes converging from a carbon source in the metabolic network. Based on the assumption that functionally related reactions in the same group show similar expression patterns because they are similarly regulated under several conditions, FBA with grouping reaction constraints (FBAwGR) was applied to describe the flux changes under several different genotypic (pykF, zwf, ppc, and sucA knockout mutants) and environmental (i.e., carbon source shift from glucose to acetate) conditions in E. coli (Park et al., 2010).
Mutualisms, in which two or more organisms interact with one another and each individual obtains fitness benefits, may significantly influence the community structure and stability of ecosystems.
To describe the mutualistic interactions between two different organisms (a sulfate-reducing bacterium and a methanogen, Desulfovibrio vulgaris and Methanococcus maripaludis), a method using a multiobjective system was developed by designing a system of three compartments (Stolyar et al., 2007). Two metabolic models representing the two species were constructed and placed in separate compartments. The third compartment contained exchange reactions that connect the two organisms through the transfer of metabolites between them.
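The three-compartment layout described above can be sketched as a simple data structure. The example below is only an illustration of the bookkeeping: it is not the model of Stolyar et al. (2007), and the species labels (sp1, sp2), the metabolites (S, M, P), the reaction names, and the helper build_community are hypothetical placeholders.

```python
def build_community(models, exchanged):
    """Merge single-species models into one three-compartment network.

    models:    {species: {reaction: {metabolite: coefficient}}}
    exchanged: metabolites that may pass through the shared compartment
    Returns    {reaction: {compartmentalized metabolite: coefficient}}
    """
    community = {}
    for species, reactions in models.items():
        # Keep each species in its own compartment by tagging its
        # reactions and metabolites with the species label.
        for rxn, stoich in reactions.items():
            community[f"{species}.{rxn}"] = {
                f"{met}[{species}]": coeff for met, coeff in stoich.items()
            }
        # Add a transfer reaction to the shared compartment for every
        # exchanged metabolite that this species actually uses.
        # Positive flux means secretion into the shared pool; negative
        # flux means uptake from it.
        for met in exchanged:
            if any(met in stoich for stoich in reactions.values()):
                community[f"{species}.transfer_{met}"] = {
                    f"{met}[{species}]": -1.0,
                    f"{met}[shared]": 1.0,
                }
    return community


# Hypothetical toy models: sp1 converts S into M, and sp2 converts M into P.
models = {
    "sp1": {"make_M": {"S": -1.0, "M": 1.0}},
    "sp2": {"use_M": {"M": -1.0, "P": 1.0}},
}
community = build_community(models, exchanged=["M"])

compartments = sorted(
    {met.split("[")[1].rstrip("]")
     for stoich in community.values() for met in stoich}
)
print(compartments)
for rxn, stoich in community.items():
    print(rxn, stoich)
```

The resulting network could then be converted to a stoichiometric matrix and analyzed with an FBA solver, with a separate objective defined for each species if a multiobjective formulation is desired.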
4. Concluding Remarks
In systems metabolic engineering, MFA plays an important role in generating new biological knowledge about the cellular system, in the system-wide analysis of cellular physiology, and in developing metabolic engineering strategies at the systems level. In practice, MFA, by means of several in silico algorithms and constraints, has been applied to understand the metabolic characteristics of a cell and to design metabolic engineering strategies that identify target genes to be engineered for strain improvement. When performing simulations using in silico methods, determining a suitable
algorithm for the application is important (Fig. 4.3). The simulation starts with consideration of the desired purpose, whether it is to describe cellular physiology accurately or to identify target genes for strain improvement.
In describing cellular physiology, the flux solution space of an in silico genome-scale model, which represents all biologically feasible metabolic states for a given condition, is examined. However, the flux solution space of the model is broader than the physiologically feasible flux solution space of the real cell, because various cellular mechanisms, such as regulation, signaling, and homeostasis, are not considered owing to incomplete information about the metabolic network. Thus, applying additional constraints during simulation, including experimental flux data, Boolean logic representing transcriptional regulation, thermodynamics, and physiological data, can narrow the flux solution space of the metabolic model so that it better represents that of a real cell.
In improving cellular performance, metabolic engineering approaches using gene knockout, gene amplification, gene down-regulation, and the introduction of foreign genes have been considered. Several in silico algorithms can be applied to identify gene targets to be engineered for strain improvement, as discussed above.
Several important issues still need to be addressed before an in silico cell can successfully represent a real cell. The first task is to reconstruct the in silico genome-scale metabolic model thoroughly and with accurate information. For describing cellular physiology, new constraints and improved algorithms need to be developed to incorporate additional physiological properties into the in silico genome-scale metabolic model.
Additionally, innovative in silico algorithms that go beyond static approaches based on the pseudo-steady-state assumption need to be developed to describe the dynamic behavior of a cell. Moreover, any individual algorithm has limitations when used alone. Because several mechanisms in a real cell operate simultaneously, integrating several algorithms or constraints might be advisable for describing complex cellular physiology. Accordingly, in silico strategies that integrate metabolic, regulatory, and signal transduction mechanisms, such as integrated dynamic FBA (idFBA) (Lee et al., 2008) and integrated FBA (iFBA) (Covert et al., 2008), and that combine dynamic kinetics with an in silico genome-scale metabolic model (Yugi et al., 2005), have been proposed. In conclusion, progress on these issues will require sustained efforts to develop advanced models and in silico algorithms for application in biological and biotechnological studies.
ACKNOWLEDGMENTS This work was supported by the Korean Systems Biology Research Project (20100002164) of the Ministry of Education, Science and Technology (MEST). Further support by the World Class University Program (R32-2009-000-10142-0) through the National Research Foundation of Korea funded by the MEST is appreciated.
REFERENCES Al Zaid Siddiquee, K., Arauzo-Bravo, M. J., and Shimizu, K. (2004). Metabolic flux analysis of pykF gene knockout Escherichia coli based on 13C-labeling experiments together with measurements of enzyme activities and intracellular metabolite concentrations. Appl. Microbiol. Biotechnol. 63, 407–417. Alper, H., Jin, Y. S., Moxley, J. F., and Stephanopoulos, G. (2005). Identifying gene targets for the metabolic engineering of lycopene biosynthesis in Escherichia coli. Metab. Eng. 7, 155–164. Asadollahi, M. A., Maury, J., Patil, K. R., Schalk, M., Clark, A., and Nielsen, J. (2009). Enhancing sesquiterpene production in Saccharomyces cerevisiae through in silico driven metabolic engineering. Metab. Eng. 11, 328–334. Barrett, C. L., Herring, C. D., Reed, J. L., and Palsson, B. O. (2005). The global transcriptional regulatory network for metabolism in Escherichia coli exhibits few dominant functional states. Proc. Natl. Acad. Sci. USA 102, 19103–19108. Beard, D. A., Babson, E., Curtis, E., and Qian, H. (2004). Thermodynamic constraints for biochemical networks. J. Theor. Biol. 228, 327–333. Beg, Q. K., Vazquez, A., Ernst, J., de Menezes, M. A., Bar-Joseph, Z., Barabasi, A. L., and Oltvai, Z. N. (2007). Intracellular crowding defines the mode and sequence of substrate uptake by Escherichia coli and constrains its metabolic activity. Proc. Natl. Acad. Sci. USA 104, 12663–12668. Blank, L. M., Kuepfer, L., and Sauer, U. (2005). Large-scale 13C-flux analysis reveals mechanistic principles of metabolic network robustness to null mutations in yeast. Genome Biol. 6, R49. Bundy, J. G., Papp, B., Harmston, R., Browne, R. A., Clayson, E. M., Burton, N., Reece, R. J., Oliver, S. G., and Brindle, K. M. (2007). Evaluation of predicted network modules in yeast metabolism using NMR-based metabolite profiling. Genome Res. 17, 510–519. Burgard, A. P., Pharkya, P., and Maranas, C. D. (2003). 
Optknock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnol. Bioeng. 84, 647–657. Burgard, A. P., Nikolaev, E. V., Schilling, C. H., and Maranas, C. D. (2004). Flux coupling analysis of genome-scale metabolic network reconstructions. Genome Res. 14, 301–312. Bushell, M. E., Sequeira, S. I., Khannapho, C., Zhao, H., Chater, K. F., Butler, M. J., Kierzek, A. M., and Avignone-Rossa, C. A. (2006). The use of genome scale metabolic flux variability analysis for process feed formulation based on an investigation of the effects of the zwf mutation on antibiotic production in Streptomyces coelicolor. Enzyme Microb. Technol. 39, 1347–1353. Choi, H. S., Lee, S. Y., Kim, T. Y., and Woo, H. M. (2010). In silico identification of gene amplification targets for improvement of lycopene production. Appl. Environ. Microbiol. 76, 3097–3105. Chung, B. K., and Lee, D. Y. (2009). Flux-sum analysis: A metabolite-centric approach for understanding the metabolic network. BMC Syst. Biol. 3, 117. Covert, M. W., Knight, E. M., Reed, J. L., Herrgard, M. J., and Palsson, B. O. (2004). Integrating high-throughput and computational data elucidates bacterial networks. Nature 429, 92–96. Covert, M. W., Xiao, N., Chen, T. J., and Karr, J. R. (2008). Integrating metabolic, transcriptional regulatory and signal transduction models in Escherichia coli. Bioinformatics 24, 2044–2050. Davidsen, T., Beck, E., Ganapathy, A., Montgomery, R., Zafar, N., Yang, Q., Madupu, R., Goetz, P., Galinsky, K., White, O., and Sutton, G. (2010). The comprehensive microbial resource. Nucleic Acids Res. 38, D340–D345.
Delgado, J., and Liao, J. C. (1997). Inverse flux analysis for reduction of acetate excretion in Escherichia coli. Biotechnol. Prog. 13, 361–367. Duarte, N. C., Becker, S. A., Jamshidi, N., Thiele, I., Mo, M. L., Vo, T. D., Srivas, R., and Palsson, B. O. (2007). Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc. Natl. Acad. Sci. USA 104, 1777–1782. Durot, M., Bourguignon, P. Y., and Schachter, V. (2009). Genome-scale models of bacterial metabolism: Reconstruction and applications. FEMS Microbiol. Rev. 33, 164–190. Edwards, J. S., and Palsson, B. O. (2000). The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc. Natl. Acad. Sci. USA 97, 5528–5533. Edwards, J. S., Ibarra, R. U., and Palsson, B. O. (2001). In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data. Nat. Biotechnol. 19, 125–130. Faria, J. P., Focha, M., Stevens, R. L., and Henry, C. S. (2010). Analysis of the effect of reversibility constraints on the predictions of genome-scale metabolic models. In “Advances in Bioinformatics,” (M. P. Rocha, F. F. Riverola, H. Shatkay, and J. M. Corchado, eds.), pp. 209–215. Springer, Berlin. Feist, A. M., and Palsson, B. O. (2008). The growing scope of applications of genome-scale metabolic reconstructions using Escherichia coli. Nat. Biotechnol. 26, 659–667. Feist, A. M., Henry, C. S., Reed, J. L., Krummenacker, M., Joyce, A. R., Karp, P. D., Broadbelt, L. J., Hatzimanikatis, V., and Palsson, B. O. (2007). A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Mol. Syst. Biol. 3, 121. Finley, S. D., Broadbelt, L. J., and Hatzimanikatis, V. (2010). In silico feasibility of novel biodegradation pathways for 1,2,4-trichlorobenzene. BMC Syst. Biol. 4, 7. Fischer, E., and Sauer, U. (2005).
Large-scale in vivo flux analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism. Nat. Genet. 37, 636–640. Fong, S. S., Burgard, A. P., Herring, C. D., Knight, E. M., Blattner, F. R., Maranas, C. D., and Palsson, B. O. (2005). In silico design and adaptive evolution of Escherichia coli for production of lactic acid. Biotechnol. Bioeng. 91, 643–648. Fuhrer, T., and Sauer, U. (2009). Different biochemical mechanisms ensure network-wide balancing of reducing equivalents in microbial metabolism. J. Bacteriol. 191, 2112–2121. Gonzalez-Lergier, J., Broadbelt, L. J., and Hatzimanikatis, V. (2006). Analysis of the maximum theoretical yield for the synthesis of erythromycin precursors in Escherichia coli. Biotechnol. Bioeng. 95, 638–644. Grafahrend-Belau, E., Schreiber, F., Koschutzki, D., and Junker, B. H. (2009). Flux balance analysis of barley seeds: A computational approach to study systemic properties of central metabolism. Plant Physiol. 149, 585–598. Henry, C. S., Broadbelt, L. J., and Hatzimanikatis, V. (2007). Thermodynamics-based metabolic flux analysis. Biophys. J. 92, 1792–1805. Herrgard, M. J., Fong, S. S., and Palsson, B. O. (2006a). Identification of genome-scale metabolic network models using experimentally measured flux profiles. PLoS Comput. Biol. 2, e72. Herrgard, M. J., Lee, B. S., Portnoy, V., and Palsson, B. O. (2006b). Integrated analysis of regulatory and metabolic networks reveals novel regulatory mechanisms in Saccharomyces cerevisiae. Genome Res. 16, 627–635. Hong, S. H., Moon, S. Y., and Lee, S. Y. (2003). Prediction of maximum yields of metabolites and optimal pathways for their production by metabolic flux analysis. J. Microbiol. Biotechnol. 13, 571–577. Hu, W., Sillaots, S., Lemieux, S., Davison, J., Kauffman, S., Breton, A., Linteau, A., Xin, C., Bowman, J., Becker, J., Jiang, B., and Roemer, T. (2007). Essential gene identification and drug target prioritization in Aspergillus fumigatus. PLoS Pathog. 3, e24.
Hua, Q., Joyce, A. R., Fong, S. S., and Palsson, B. O. (2006). Metabolic analysis of adaptive evolution for in silico-designed lactate-producing strains. Biotechnol. Bioeng. 95, 992–1002. Ishii, N., Nakahigashi, K., Baba, T., Robert, M., Soga, T., Kanai, A., Hirasawa, T., Naba, M., Hirai, K., Hoque, A., Ho, P. Y., Kakazu, Y., et al. (2007). Multiple highthroughput analyses monitor the response of E. coli to perturbations. Science 316, 593–597. Jamshidi, N., and Palsson, B. O. (2007). Investigating the metabolic capabilities of Mycobacterium tuberculosis H37Rv using the in silico strain iNJ661 and proposing alternative drug targets. BMC Syst. Biol. 1, 26. Jensen, P. R., and Hammer, K. (1998). The sequence of spacers between the consensus sequences modulates the strength of prokaryotic promoters. Appl. Environ. Microbiol. 64, 82–87. Jensen, L. J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M., Bork, P., and von Mering, C. (2009). STRING 8—A global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 37, D412–D416. Joyce, A. R., and Palsson, B. O. (2006). The model organism as a system: Integrating ’omics’ data sets. Nat. Rev. Mol. Cell Biol. 7, 198–210. Jung, Y. K., Kim, T. Y., Park, S. J., and Lee, S. Y. (2010). Metabolic engineering of Escherichia coli for the production of polylactic acid and its copolymers. Biotechnol. Bioeng. 105, 161–171. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., and Hirakawa, M. (2010). KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355–D360. Kauffman, K. J., Prakash, P., and Edwards, J. S. (2003). Advances in flux balance analysis. Curr. Opin. Biotechnol. 14, 491–496. Khannapho, C., Zhao, H., Bonde, B. K., Kierzek, A. M., Avignone-Rossa, C. A., and Bushell, M. E. (2008). 
Selection of objective function in genome scale flux balance analysis for process feed development in antibiotic production. Metab. Eng. 10, 227–233. Kim, T. Y., and Lee, S. Y. (2006). Accurate metabolic flux analysis through data reconciliation of isotope balance-based data. J. Microbiol. Biotechnol. 16, 1139–1143. Kim, J., and Reed, J. L. (2010). OptORF: Optimal metabolic and regulatory perturbations for metabolic engineering of microbial strains. BMC Syst. Biol. 4, 53. Kim, P. J., Lee, D. Y., Kim, T. Y., Lee, K. H., Jeong, H., Lee, S. Y., and Park, S. (2007). Metabolite essentiality elucidates robustness of Escherichia coli metabolism. Proc. Natl. Acad. Sci. USA 104, 13638–13642. Kim, H. U., Kim, T. Y., and Lee, S. Y. (2008a). Metabolic flux analysis and metabolic engineering of microorganisms. Mol. Biosyst. 4, 113–120. Kim, T. Y., Sohn, S. B., Kim, H. U., and Lee, S. Y. (2008b). Strategies for systems-level metabolic engineering. Biotechnol. J. 3, 612–623. Kim, H. U., Kim, T. Y., and Lee, S. Y. (2010). Genome-scale metabolic network analysis and drug targeting of multi-drug resistant pathogen Acinetobacter baumannii AYE. Mol. Biosyst. 6, 339–348. Kitano, H. (2004). Biological robustness. Nat. Rev. Genet. 5, 826–837. Kitano, H. (2007a). A robustness-based approach to systems-oriented drug design. Nat. Rev. Drug Discov. 6, 202–210. Kitano, H. (2007b). Towards a theory of biological robustness. Mol. Syst. Biol. 3, 137. Kizer, L., Pitera, D. J., Pfleger, B. F., and Keasling, J. D. (2008). Application of functional genomics to pathway optimization for increased isoprenoid production. Appl. Environ. Microbiol. 74, 3229–3241. Koffas, M. A., Jung, G. Y., and Stephanopoulos, G. (2003). Engineering metabolism and product formation in Corynebacterium glutamicum by coordinated gene overexpression. Metab. Eng. 5, 32–41.
Kummel, A., Panke, S., and Heinemann, M. (2006). Systematic assignment of thermodynamic constraints in metabolic network models. BMC Bioinform. 7, 512. Kwon, Y. K., Lu, W., Melamud, E., Khanam, N., Bognar, A., and Rabinowitz, J. D. (2008). A domino effect in antifolate drug action in Escherichia coli. Nat. Chem. Biol. 4, 602–608. Lee, S. Y., and Papoutsakis, E. T. (1999). Metabolic Engineering. Marcel Dekker, New York. Lee, S. J., Lee, D. Y., Kim, T. Y., Kim, B. H., Lee, J., and Lee, S. Y. (2005a). Metabolic engineering of Escherichia coli for enhanced production of succinic acid, based on genome comparison and in silico gene knockout simulation. Appl. Environ. Microbiol. 71, 7880–7887. Lee, S. Y., Lee, D. Y., and Kim, T. Y. (2005b). Systems biotechnology for strain improvement. Trends Biotechnol. 23, 349–358. Lee, K. H., Park, J. H., Kim, T. Y., Kim, H. U., and Lee, S. Y. (2007). Systems metabolic engineering of Escherichia coli for L-threonine production. Mol. Syst. Biol. 3, 149. Lee, J. M., Gianchandani, E. P., Eddy, J. A., and Papin, J. A. (2008). Dynamic analysis of integrated signaling, metabolic, and regulatory networks. PLoS Comput. Biol. 4, e1000086. Lee, S. Y., Kim, H. U., Park, J. H., Park, J. M., and Kim, T. Y. (2009). Metabolic engineering of microorganisms: General strategies and drug production. Drug Discov. Today 14, 78–88. Li, M., Ho, P. Y., Yao, S., and Shimizu, K. (2006). Effect of sucA or sucC gene knockout on the metabolism in Escherichia coli based on gene expressions, enzyme activities, intracellular metabolite concentrations and metabolic fluxes by 13C-labeling experiments. Biochem. Eng. J. 30, 286–296. Lun, D. S., Rockwell, G., Guido, N. J., Baym, M., Kelner, J. A., Berger, B., Galagan, J. E., and Church, G. M. (2009). Large-scale identification of genetic design strategies using local search. Mol. Syst. Biol. 5, 296. Mahadevan, R., and Schilling, C. H. (2003). 
The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metab. Eng. 5, 264–276. Moon, S. Y., Hong, S. H., Kim, T. Y., and Lee, S. Y. (2008). Metabolic engineering of Escherichia coli for the production of malic acid. Biochem. Eng. J. 40, 312–320. Nanchen, A., Schicker, A., Revelles, O., and Sauer, U. (2008). Cyclic AMP-dependent catabolite repression is the dominant control mechanism of metabolic fluxes under glucose limitation in Escherichia coli. J. Bacteriol. 190, 2323–2330. Oberhardt, M. A., Puchalka, J., Fryer, K. E., Martins dos Santos, V. A., and Papin, J. A. (2008). Genome-scale metabolic network analysis of the opportunistic pathogen Pseudomonas aeruginosa PAO1. J. Bacteriol. 190, 2790–2803. Oberhardt, M. A., Chavali, A. K., and Papin, J. A. (2009). Flux balance analysis: Interrogating genome-scale metabolic networks. Methods Mol. Biol. 500, 61–80. Orth, J. D., Thiele, I., and Palsson, B. O. (2010). What is flux balance analysis? Nat. Biotechnol. 28, 245–248. Pal, C., Papp, B., and Lercher, M. J. (2005). Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat. Genet. 37, 1372–1375. Palsson, B. O., Price, N. D., and Papin, J. A. (2003). Development of network-based pathway definitions: The need to analyze real metabolic networks. Trends Biotechnol. 21, 195–198. Park, J. H., and Lee, S. Y. (2008). Towards systems metabolic engineering of microorganisms for amino acid production. Curr. Opin. Biotechnol. 19, 454–460. Park, J. H., Lee, K. H., Kim, T. Y., and Lee, S. Y. (2007). Metabolic engineering of Escherichia coli for the production of L-valine based on transcriptome analysis and in silico gene knockout simulation. Proc. Natl. Acad. Sci. USA 104, 7797–7802.
Park, J. M., Kim, T. Y., and Lee, S. Y. (2009). Constraints-based genome-scale metabolic simulation for systems metabolic engineering. Biotechnol. Adv. 27, 978–988. Park, J. M., Kim, T. Y., and Lee, S. Y. (2010). Prediction of metabolic fluxes by incorporating genomic context and flux-converging pattern analyses. Proc. Natl. Acad. Sci. USA 107, 14931–14936. Patil, K. R., Rocha, I., Forster, J., and Nielsen, J. (2005). Evolutionary programming as a platform for in silico metabolic engineering. BMC Bioinform. 6, 308. Peng, L., Arauzo-Bravo, M. J., and Shimizu, K. (2004). Metabolic flux analysis for a ppc mutant Escherichia coli based on 13C-labelling experiments together with enzyme activity assays and intracellular metabolite measurements. FEMS Microbiol. Lett. 235, 17–23. Peyraud, R., Kiefer, P., Christen, P., Massou, S., Portais, J. C., and Vorholt, J. A. (2009). Demonstration of the ethylmalonyl-CoA pathway by using 13C metabolomics. Proc. Natl. Acad. Sci. USA 106, 4846–4851. Pharkya, P., and Maranas, C. D. (2006). An optimization framework for identifying reaction activation/inhibition or elimination candidates for overproduction in microbial systems. Metab. Eng. 8, 1–13. Pharkya, P., Burgard, A. P., and Maranas, C. D. (2003). Exploring the overproduction of amino acids using the bilevel optimization framework OptKnock. Biotechnol. Bioeng. 84, 887–899. Pharkya, P., Burgard, A. P., and Maranas, C. D. (2004). OptStrain: A computational framework for redesign of microbial production systems. Genome Res. 14, 2367–2376. Price, N. D., Reed, J. L., and Palsson, B. O. (2004). Genome-scale models of microbial cells: Evaluating the consequences of constraints. Nat. Rev. Microbiol. 2, 886–897. Puchalka, J., Oberhardt, M. A., Godinho, M., Bielecka, A., Regenhardt, D., Timmis, K. N., Papin, J. A., and Martins dos Santos, V. A. (2008). Genome-scale reconstruction and analysis of the Pseudomonas putida KT2440 metabolic network facilitates applications in biotechnology. 
PLoS Comput. Biol. 4, e1000210. Rabinowitz, J. D. (2007). Cellular metabolomics of Escherchia coli. Expert Rev. Proteomics 4, 187–198. Ramakrishna, R., Edwards, J. S., McCulloch, A., and Palsson, B. O. (2001). Flux-balance analysis of mitochondrial energy metabolism: Consequences of systemic stoichiometric constraints. Am. J. Physiol. Regul. Integr. Comp. Physiol. 280, R695–R704. Raman, K., and Chandra, N. (2009). Flux balance analysis of biological systems: Applications and challenges. Brief. Bioinform. 10, 435–449. Ranganathan, S., and Maranas, C. D. (2010). Microbial 1-butanol production: Identification of non-native production routes and in silico engineering interventions. Biotechnol. J. 5, 716–725. Ranganathan, S., Suthers, P. F., and Maranas, C. D. (2010). OptForce: An optimization procedure for identifying all genetic manipulations leading to targeted overproductions. PLoS Comput. Biol. 6, e1000744. Reed, J. L., Patel, T. R., Chen, K. H., Joyce, A. R., Applebee, M. K., Herring, C. D., Bui, O. T., Knight, E. M., Fong, S. S., and Palsson, B. O. (2006). Systems approach to refining genome annotation. Proc. Natl. Acad. Sci. USA 103, 17480–17484. Risso, C., Van Dien, S. J., Orloff, A., Lovley, D. R., and Coppi, M. V. (2008). Elucidation of an alternate isoleucine biosynthesis pathway in Geobacter sulfurreducens. J. Bacteriol. 190, 2266–2274. Sauer, U. (2004). High-throughput phenomics: Experimental methods for mapping fluxomes. Curr. Opin. Biotechnol. 15, 58–63. Sauer, U. (2006). Metabolic networks in motion: 13C-based flux analysis. Mol. Syst. Biol. 2, 62. Schmidt, K., Nielsen, J., and Villadsen, J. (1999). Quantitative analysis of metabolic fluxes in Escherichia coli, using two-dimensional NMR spectroscopy and complete isotopomer models. J. Biotechnol. 71, 175–189.
Schneider, K., Kromer, J. O., Wittmann, C., Alves-Rodrigues, I., Meyerhans, A., Diez, J., and Heinzle, E. (2009). Metabolite profiling studies in Saccharomyces cerevisiae: An assisting tool to prioritize host targets for antiviral drug screening. Microb. Cell Fact. 8, 12. Schuetz, R., Kuepfer, L., and Sauer, U. (2007). Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli. Mol. Syst. Biol. 3, 119. Segre, D., Vitkup, D., and Church, G. M. (2002). Analysis of optimality in natural and perturbed metabolic networks. Proc. Natl. Acad. Sci. USA 99, 15112–15117. Shlomi, T., Berkman, O., and Ruppin, E. (2005). Regulatory on/off minimization of metabolic flux changes after genetic perturbations. Proc. Natl. Acad. Sci. USA 102, 7695–7700. Shlomi, T., Eisenberg, Y., Sharan, R., and Ruppin, E. (2007). A genome-scale computational study of the interplay between transcriptional regulation and metabolism. Mol. Syst. Biol. 3, 101. Smallbone, K., and Simeonidis, E. (2009). Flux balance analysis: A geometric perspective. J. Theor. Biol. 258, 311–315. Song, H., Kim, T. Y., Choi, B. K., Choi, S. J., Nielsen, L. K., Chang, H. N., and Lee, S. Y. (2008). Development of chemically defined medium for Mannheimia succiniciproducens based on its genome sequence. Appl. Microbiol. Biotechnol. 79, 263–272. Stelling, J., Sauer, U., Szallasi, Z., Doyle, 3rd, F. J., and Doyle, J. (2004). Robustness of cellular functions. Cell 118, 675–685. Stephanopoulos, G. N., Aristidou, A. A., and Nielsen, J. (1998). Metabolic Engineering. Academic Press, San Diego. Stolyar, S., Van Dien, S., Hillesland, K. L., Pinel, N., Lie, T. J., Leigh, J. A., and Stahl, D. A. (2007). Metabolic modeling of a mutualistic microbial community. Mol. Syst. Biol. 3, 92. Tang, Y. J., Chakraborty, R., Martin, H. G., Chu, J., Hazen, T. C., and Keasling, J. D. (2007). 
Flux analysis of central metabolic pathways in Geobacter metallireducens during reduction of soluble Fe(III)-nitrilotriacetic acid. Appl. Environ. Microbiol. 73, 3859–3864. Tang, Y. J., Martin, H. G., Dehal, P. S., Deutschbauer, A., Llora, X., Meadows, A., Arkin, A., and Keasling, J. D. (2009). Metabolic flux analysis of Shewanella spp. reveals evolutionary robustness in central carbon metabolism. Biotechnol. Bioeng. 102, 1161–1169. Tannler, S., Fischer, E., Le Coq, D., Doan, T., Jamet, E., Sauer, U., and Aymerich, S. (2008). CcpN controls central carbon fluxes in Bacillus subtilis. J. Bacteriol. 190, 6178–6187. Teusink, B., Wiersma, A., Molenaar, D., Francke, C., de Vos, W. M., Siezen, R. J., and Smid, E. J. (2006). Analysis of growth of Lactobacillus plantarum WCFS1 on a complex medium using a genome-scale metabolic model. J. Biol. Chem. 281, 40041–40048. Varma, A., Boesch, B. W., and Palsson, B. O. (1993). Stoichiometric interpretation of Escherichia coli glucose catabolism under various oxygenation rates. Appl. Environ. Microbiol. 59, 2465–2473. von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., and Snel, B. (2003). STRING: A database of predicted functional associations between proteins. Nucleic Acids Res. 31, 258–261. von Mering, C., Jensen, L. J., Kuhn, M., Chaffron, S., Doerks, T., Kruger, B., Snel, B., and Bork, P. (2007). STRING 7–recent developments in the integration and prediction of protein interactions. Nucleic Acids Res. 35, D358–D362. Wahl, A., El Massaoudi, M., Schipper, D., Wiechert, W., and Takors, R. (2004). Serial 13 C-based flux analysis of an L-phenylalanine-producing E. coli strain using the sensor reactor. Biotechnol. Prog. 20, 706–714. Wiechert, W. (2001). 13C metabolic flux analysis. Metab. Eng. 3, 195–206. Xu, Z., Sun, X., and Yu, S. (2009). Genome-scale analysis to the impact of gene deletion on the metabolism of E. coli: Constraint-based simulation approach. BMC Bioinform. 10 (Suppl. 1), S62.
Application of Metabolic Flux Analysis
93
Yang, F., Qian, H., and Beard, D. A. (2005). Ab initio prediction of thermodynamically feasible reaction directions from biochemical network stoichiometry. Metab. Eng. 7, 251–259. Yeh, I., Hanekamp, T., Tsoka, S., Karp, P. D., and Altman, R. B. (2004). Computational analysis of Plasmodium falciparum metabolism: Organizing genomic information to facilitate drug discovery. Genome Res. 14, 917–924. Yugi, K., Nakayama, Y., Kinoshita, A., and Tomita, M. (2005). Hybrid dynamic/static method for large-scale simulation of metabolism. Theor. Biol. Med. Model. 2, 42. Zamboni, N., and Sauer, U. (2009). Novel biological insights through metabolomics and 13 C-flux analysis. Curr. Opin. Microbiol. 12, 553–558. Zhao, J., Baba, T., Mori, H., and Shimizu, K. (2004a). Effect of zwf gene knockout on the metabolism of Escherichia coli grown on glucose or acetate. Metab. Eng. 6, 164–174. Zhao, J., Baba, T., Mori, H., and Shimizu, K. (2004b). Global metabolic response of Escherichia coli to gnd or zwf gene-knockout, based on 13C-labeling experiments and the measurement of enzyme activities. Appl. Microbiol. Biotechnol. 64, 91–98.
CHAPTER FIVE

Developer’s and User’s Guide to Clotho v2.0: A Software Platform for the Creation of Synthetic Biological Systems

Bing Xia,† Swapnil Bhatia,* Ben Bubenheim,† Maisam Dadgar,‡ Douglas Densmore,*,‡ and J. Christopher Anderson†,§,¶,‖

Contents
1. Introduction
1.1. Background
1.2. Current status
1.3. General overview
1.4. Resources
1.5. Article organization
2. Developers
2.1. Getting started (Windows version)
2.2. Writing your first App
3. Users
3.1. General remarks
3.2. Managing your Apps
3.3. Adding a New Institution, Lab, and User
3.4. Creating a new Feature
3.5. Creating a new Part
3.6. Creating a new Vector
3.7. Creating a new Plasmid
3.8. Looking at DNA sequences
3.9. Adding Notes and “Factoids” to your data
3.10. Using the right-click menu
3.11. Predicting PCR products
3.12. Connecting to a database
3.13. Common errors
3.14. Biosafety check
4. Concluding Remarks
Acknowledgments
References

* Department of Electrical and Computer Engineering, Boston University, Boston, Massachusetts, USA
† Department of Bioengineering, University of California, Berkeley, California, USA
‡ Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
§ SynBERC: Synthetic Biology Engineering Research Center, University of California, Emeryville, California, USA
¶ QB3: California Institute for Quantitative Biological Research, University of California, Emeryville, California, USA
‖ Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA

Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00005-X
© 2011 Elsevier Inc. All rights reserved.
Abstract
To design the complex systems that synthetic biologists propose to create, software tools must be developed. Critical to success is the enablement of collaboration across our community such that individual tools that perform specific tasks combine with other tools to provide multiplicative benefits. This will require standardization of the form of the data that exist within the field (Parts, Strains, measurements, etc.), a software environment that enables communication between tools, and a sharing mechanism for distributing the tools. Additionally, this data model must describe the data in a sufficiently rigorous and validated form such that meaningful layers of abstraction can be built upon the base. Herein, we describe a software platform called “Clotho” which provides such a data model, and the plugin and sharing mechanisms needed for a rich tool environment. This document provides a tutorial for users of Clotho and information for software developers who wish to contribute new tools (known as “Apps”) to it.
Abbreviations
API: application programming interface
CAD: computer-aided design
EDA: electronic design automation
1. Introduction
Synthetic biology research is driven by several goals described by the participants and observers of the field (Boldt and Müller, 2009; Purnick and Weiss, 2009). The overarching goal of “making biology easier to engineer,” however, looms large over the field. This goal not only captures the youthful spirit of the discipline but also underscores the need to make the science accessible to a larger audience and to add the formalisms, abstractions, and methodologies needed to truly make it an engineering discipline. The former goal of accessibility is being pursued through a variety of community-organized events such as iGEM and by the BioBricks Foundation (Smolke, 2009). The latter will be achieved only with the introduction of community standards, modular design tools, and rigorous design flows. This chapter presents Clotho, a software platform for creating biological systems, which attempts to integrate these various aspects.
1.1. Background
Clotho as a project was started in early 2008 with the observation that biological functionality (what some biological system ultimately does) could be separated from biological implementation (how it is physically realized). This “separation of concerns” is a driving factor behind many of the tools in the electronic design automation (EDA) industry (Keutzer et al., 2000) and a key abstraction that allows the unprecedented levels of productivity seen in the design of modern electronics. Clotho “Classic” was created as a project for the iGEM competition in 2008 (http://2008.igem.org/Team:UC_Berkeley_Tools; Densmore et al., 2009). Clotho v1.0 was an improved version of Clotho, released as a project in the 2009 competition (http://2009.igem.org/Team:Berkeley_Software). This version also introduced a data model based loosely on the work of POBOL (now SBOL; http://www.sbolstandard.org/). Both of these versions were mainly “proof-of-concept” vehicles. The 2008 version showcased the ability to connect to a variety of databases to retrieve biological part data. The 2009 version embodied the ideas of abstract design at a high level and automation in a basic architecture. This architecture is now fully realized in the current version, which presents a powerful App-based environment and the ability to integrate tools, thus raising the level of abstraction used to design systems.
1.2. Current status
Clotho is currently branded as Clotho v2.0. This software represents the best aspects of the previous two versions of Clotho and is a collaboration between developers at Boston University and the University of California, Berkeley. This software will continue to be improved and revised, but the software architecture and application programming interface (API) will remain backward compatible. Future branding will keep the versions within the 2.0 line (e.g., 2.01, 2.2). Thus, the ideas presented here will be applicable to future versions of Clotho.
1.3. General overview
Because of space constraints, and in keeping with the spirit of this chapter, an exhaustive overview of Clotho will not be given. However, Clotho can be summarized at a high level as shown in Fig. 5.1:
1. Clotho encodes various related objects with a basis in the world of genetic engineering, such as Parts, Formats, Features, Families, Oligos, Authors, and Notes. The set of objects and relations together forms the Clotho data model.
2. The data model provides not only objects but also an API for using these objects. The API allows developers to write Apps and allows Apps to communicate with other Apps via object manipulation.
3. Clotho represents the types of biological objects and their relationships in an extensible way, and its API provides the basis for developing Apps that create and operate on instances of these object types. Clotho and its Apps are written in Java and use the Netbeans Module framework to create a fully functioning application environment with support for installing, removing, updating, and managing Apps and their dependencies.
4. Clotho is an open-source project under a general BSD license. It is the intent of the creators that community-based App development be a key driver in Clotho’s adoption and growth.

[Figure 5.1: diagram of Apps (Tools, Widgets, Viewers, Algorithms, Formats, Connections) around the Clotho Core, which consists of an API layer (standardized data: Parts, Vectors, Plasmids, Formats, Samples, Data, Notes, Factoids, WikiText, Institutions, Labs, Persons, ...) above a Datum layer (primitive data); Connections link the Datum layer to a Database.]
Figure 5.1 Overview of the Clotho software architecture. Clotho provides a data model for representing biological objects, a common API for manipulating these objects, and a common platform for developing Apps for designing synthetic biological systems. The Core of Clotho (in blue) represents a standardized set of objects in its API that constitute the Clotho data model. Each object contains a set of primitive data fields stored in a Datum object. Most Clotho Apps (in gray) communicate with Clotho through the creation or modification of API objects. Clotho Connection Apps, in contrast, communicate directly with the Core’s Datum layer to enable two-way data conversion from a database.

While the Clotho data model facilitates the manipulation of data by a variety of Apps, Clotho is designed to separate the tools from the data. Bottom-up, Clotho stores data in a database. The database may be local (i.e., colocated with a running copy of Clotho on a user’s personal computer) or remote (i.e., where Clotho must use a network to communicate with a server). These data are manipulated by tools (Clotho Apps) through Clotho’s Core API to guarantee that the data are always represented in a form compliant with Clotho’s data model. Clotho Apps are packaged together with the ClothoCore as a rich client application and run on a user’s personal computer.

This design has several advantages. Apps can be developed independently, but they must employ the ClothoCore for retrieving and storing data. This allows every new tool to produce data that can be used by any other tool, thus multiplying the benefits of creating new tools and guaranteeing the integrity of the data. App developers are freed from routine data persistence details and can focus on creating tools that are easy to use. Staging Apps on a single platform eases the development of new Apps through code reuse and improves software quality through added redundancy in testing and debugging Clotho. It is common for user communities to form around an application like Clotho, and this benefits users and developers alike. Packaging Apps on a unified platform promotes a better user experience and eases the learning curve. Clotho’s Java-based code makes it portable across major operating systems.

1.3.1. Data model overview
This section provides a brief introduction to the objects that make up Clotho’s data model. For more detailed information, please see http://www.clothohelp.org. Figure 5.2 provides an illustration of data model object relationships.

1.3.1.1. Composition objects
NucSeq—A NucSeq object represents a nucleic acid sequence of a DNA or an RNA molecule. It also provides various utility methods for operating on it such as calculating the reverse complement and translating the DNA sequence to protein sequence. In the sequence-based composition objects (Feature, Part, Vector, Oligo, and Plasmid), the sequence itself is held through a link to a NucSeq.
[Figure 5.2: diagram linking Sample, Plasmid, Vector, Part, basic Parts, Annotations, Feature, Notes, Factoids, and Reference.]
Figure 5.2 Data linkage in the Clotho data model. Information managed by Clotho is organized according to a referential model. This enables detailed information about a biological material, such as a literature reference about its function, to be indirectly linked to the physical vessel containing that material through a structured representation of its composition. In Clotho, physical aliquots of biological material are represented by Samples. Physical DNAs are more specifically represented by PlasmidSamples that have a Plasmid composition. The Plasmid is composed of one Vector and one Part. The Part may be composite, and if so, it can be further subdivided into basic Parts. The sequence of a Part or Vector can contain Annotations that link it to a Feature that encapsulates information about a biological primitive such as a coding sequence or promoter. Features are linked to information about them in the form of Notes. Notes can link to Factoids that contain specific information related to a literature Reference. Further details on how to encapsulate different scenarios of describing a genetic composition are provided at http://wiki.bu.edu/ece-clotho/index.php/Plasmid_composition.
Oligo—An Oligo object represents an oligonucleotide composition. It is theoretical: you can have many physical Samples with the composition of an Oligo at various concentrations and qualities.

Part—A Part object provides a composition primitive analogous to the concept of a “standard biological part.” Parts do not necessarily have an associated physical DNA or Sample. Clotho distinguishes between two types of parts: “basic Parts” and “composite Parts.” A composite Part is described by an ordered list of Parts and a Format that validates the composition.

Feature—A Feature object represents functional information about a biological primitive such as a promoter, coding sequence, or origin of replication. Like a Part object, it has a NucSeq representing its sequence, and it does not map to a physical object directly. Often, a single Part will exactly coincide with a single Feature, but this is not necessary. A “basic Part” can be associated with multiple Feature objects.

Annotation—An Annotation object links biological primitives (Features) and compositional primitives (Part, Vector, and Plasmid) via their NucSeq object. NucSeq objects can have multiple Annotations, each associating a Feature to a portion of the sequence represented by the NucSeq object. Since Parts, Vectors, and Plasmids (as well as Oligos and Features) all are associated with NucSeq objects, they may inherit associations to multiple Features.

Vector—A Vector object is also a composition primitive, like a Part object, except that the data model disallows a composite Vector. One Part (basic or composite) and one Vector together make up a “Plasmid.” We define a Vector broadly: it represents the backbone portion of a plasmid DNA, but could also represent linear DNAs or regions of a genome. This allows the description of both genomic modifications and episomal DNAs using Plasmid compositions.

Plasmid—A Plasmid object is a combination of a Vector and a Part. Like Part and Vector, Plasmid objects have a defined sequence and composition, but may not necessarily exist physically.

Format—Format objects embody the standards that define how Parts and Vectors may be composed into larger constructs. RFC 10 (http://bbf.openwetware.org/RFC.html), the Biofusion standard, BglBricks (Anderson et al., 2010), “FreeForm”, and “Golden Gate” (Engler et al., 2008) are all examples of Formats. Parts and Vectors must have a Format, but the definition of a Format broadly encapsulates the rules for combining DNAs according to all BioBrick, PCR-based, and ad hoc methods. A Format object points to a Clotho App that describes the rules for composing Parts and Vectors into composite Parts and Plasmids in code. In this way, any composition standard that can be reduced to a rigorous set of rules can be encapsulated by a Format.

Family—A Family object describes the biochemical similarities between Features (biological primitives). Families can have subfamilies. For example, the J23119 Part has a Feature called “Strong Pcon promoter” which is a member of an “Anderson Pcon promoters” Family, which is a subfamily of “constitutive promoter,” which is a subfamily of the “promoter” Family. Currently, Family is a lightweight object for describing relationships between Features or other Families. In future versions of Clotho, Family will be used to abstract the biological behavior of its member Features.

Strain—A Strain object represents the theoretical genetic composition of a cell including its chromosome(s) and episomal DNAs. Like a Part, Strains can be basic or composite. A basic Strain is any organismal composition that cannot be described explicitly as some parent Strain modified with Clotho Plasmids. “DH10B,” “MC1061,” and “HeLa” are all examples of basic Strains. A composite Strain is one that has various genome modifications and added DNAs. Since genome modifications and episomal DNA are all handled by the Plasmid/Vector/Part framework, a composite Strain is defined as some parent Strain with a list of added Plasmids.
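The relationships among the composition objects can be pictured with a minimal Java sketch. Everything named below (CompositionSketch, its nested Part and Plasmid classes, and reverseComplement) is a hypothetical stand-in written for illustration and is not the Clotho API. The sketch also assumes that a composite Part's sequence is the plain concatenation of its children and that a Plasmid's sequence is the Vector sequence followed by the Part sequence; in Clotho the Format decides how sequences are joined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative stand-ins only; these are NOT the Clotho API classes.
final class CompositionSketch {

    // Reverse complement of a DNA string made of A, C, G, T (case-insensitive).
    static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            char base = Character.toUpperCase(dna.charAt(i));
            switch (base) {
                case 'A': out.append('T'); break;
                case 'T': out.append('A'); break;
                case 'C': out.append('G'); break;
                case 'G': out.append('C'); break;
                default:
                    throw new IllegalArgumentException("Not a DNA base: " + base);
            }
        }
        return out.toString();
    }

    // A Part is either basic (it holds a sequence) or composite (an ordered list of Parts).
    static final class Part {
        final String name;
        private final String basicSequence;  // null for a composite Part
        private final List<Part> children;   // empty for a basic Part

        Part(String name, String sequence) {
            this.name = name;
            this.basicSequence = sequence;
            this.children = Collections.emptyList();
        }

        Part(String name, List<Part> orderedParts) {
            this.name = name;
            this.basicSequence = null;
            this.children = new ArrayList<Part>(orderedParts);
        }

        boolean isComposite() {
            return basicSequence == null;
        }

        // Assumption: children are concatenated in order, ignoring any Format rules.
        String sequence() {
            if (!isComposite()) {
                return basicSequence;
            }
            StringBuilder joined = new StringBuilder();
            for (Part child : children) {
                joined.append(child.sequence());
            }
            return joined.toString();
        }
    }

    // A Plasmid pairs one Vector (modelled here as a bare backbone sequence) with one Part.
    static final class Plasmid {
        final String vectorSequence;
        final Part part;

        Plasmid(String vectorSequence, Part part) {
            this.vectorSequence = vectorSequence;
            this.part = part;
        }

        String sequence() {
            return vectorSequence + part.sequence();
        }
    }

    public static void main(String[] args) {
        Part promoter = new Part("promoter", "TTGACA");
        Part cds = new Part("cds", "ATGAAA");
        Part device = new Part("device", Arrays.asList(promoter, cds));
        Plasmid plasmid = new Plasmid("GGCC", device);
        System.out.println(plasmid.sequence());
        System.out.println(reverseComplement(plasmid.sequence()));
    }
}
```

Modelling the composite Part as a list of child Parts mirrors the statement that a composite Part is an ordered list of Parts, and it lets a composite contain other composites without special cases.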
1.3.1.2. Physical instantiation objects
Sample—A Sample object represents an aliquot of liquid containing biological material. That material is believed to be the physical instantiation of a Plasmid, an Oligo, or a Strain. The class of Sample objects is the basis for StrainSample, PlasmidSample, and OligoSample classes. Sample objects have a volume measured in microliters, a quality that describes the confidence that the Sample manifests its composition, and a link to a Container object that represents the physical object that contains the liquid.

PlasmidSample—A PlasmidSample object represents a sample of plasmid DNA such as one obtained from a miniprep procedure. PlasmidSamples link to a Plasmid that describes the composition of the DNA inside it as well as a parent Strain from which the plasmid was purified.

StrainSample—A StrainSample represents a sample of cells, such as a −80 °C stock. StrainSamples link to a Strain that describes the composition of the cell inside it.

OligoSample—An OligoSample is a Sample of oligonucleotide DNA. OligoSample objects link to an Oligo object.

Container—A Container object represents a physical container holding liquid. Samples always have containers. Physical containers can be in plates and boxes, but they can also be things like microcentrifuge tubes that have no fixed location. A Container therefore may or may not link to a Plate object. Containers link to at most one Sample; a container can be empty, in which case its Sample link is null. Every Sample links to one Container.

Plate—A Plate object represents an array of placeholders for containers. A 96-well PCR plate is an example of a plate. Every Plate links to a PlateType.

PlateType—A PlateType encodes the configuration of a Plate. It holds information on the number of wells and columns, and on whether it has fixed containers (like a PCR plate) or mobile containers (such as 2D barcode tubes or a paper box with microcentrifuge tubes). It also has information about its physical appearance and dimensions.
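The linkage rules between Samples and Containers stated above (every Sample has exactly one Container, a Container holds at most one Sample, and a Container may or may not belong to a Plate) can be sketched as follows. The class names, fields, and the use of a plain plate name instead of a Plate object are assumptions made for this illustration; they are not taken from the Clotho source code.

```java
// Illustrative stand-ins only; these are NOT the Clotho API classes.
final class SampleSketch {

    // A physical container such as a tube or a well.
    static final class Container {
        final String plateName;  // null when the container has no fixed plate location
        private Sample sample;   // null while the container is empty

        Container(String plateName) {
            this.plateName = plateName;
        }

        boolean isEmpty() {
            return sample == null;
        }

        Sample getSample() {
            return sample;
        }
    }

    // An aliquot of liquid believed to contain a given composition.
    static final class Sample {
        final String compositionName;
        final double volumeMicroliters;
        final Container container;

        Sample(String compositionName, double volumeMicroliters, Container container) {
            if (container == null) {
                throw new IllegalArgumentException("A Sample needs a Container");
            }
            if (!container.isEmpty()) {
                throw new IllegalStateException("Container already holds a Sample");
            }
            this.compositionName = compositionName;
            this.volumeMicroliters = volumeMicroliters;
            this.container = container;
            container.sample = this;  // keep both directions of the link consistent
        }
    }

    public static void main(String[] args) {
        Container tube = new Container(null);  // a loose microcentrifuge tube
        Sample miniprep = new Sample("pExample", 50.0, tube);
        System.out.println(tube.getSample() == miniprep);
    }
}
```

Setting the Container's link inside the Sample constructor is one way to make sure the two objects never disagree about which Sample a Container holds.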
1.3.1.3. Authoring objects
Institution—An Institution object represents an institution such as “UC Berkeley” and some data about that institution.

Lab—A Lab object represents a laboratory within an institution. The “Densmore Lab” is the name of a Lab object that links to the Institution whose name is “Boston University.”

Person—A Person object represents information about members of Labs. Person objects are linked by other object types to refer to the author of the object. A Person links to a Collection object that can be used to organize other objects of interest to the user. A Person also links to a WikiText object used as a biography that might include data such as a photo of the user.
1.3.1.4. Literature objects
WikiText—A WikiText object represents a section of MediaWiki-style wiki text. It contains methods for converting the wiki text to HTML, handling images and other file attachments (Attachment objects) linked from the wiki text, and recognizing some new tags such as <evidence>.

Attachment—Any file that needs to be embedded in some other object is wrapped as an Attachment object. These objects are used throughout the data model, most prominently in WikiText objects, where they hold the images and other embedded files such as spreadsheets or PDF files.

Factoid—A piece of information linked to a specific Reference is encoded as a Factoid object. For example, a Feature encoding fast-folding YFP might link to a Note that contains a Factoid that points to PMID 11753368, which describes the original discovery of the sequence. The WikiText linked to the Factoid might contain clips from the paper that describe how the sequence was identified. The Reference in the Factoid provides access to the document’s PDF and bibliographic information.

Reference—A Reference object is defined by a DOI, PMID, URL, or Patent accession number and encodes a reference to external literature. From this link, Clotho can retrieve such documents and, in the case of a PMID, parse the various fields of the Medline entry.

Note—A Note object comprises a collection of Factoids, one additional WikiText, and a title. Notes are hierarchical and may be appended to Strains, Families, and Features. A Note is used to describe the theoretical properties of these biological primitives in a human-readable form and link them to the literature that describes their discovery, characterization, or use.
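The chain from a Note through its Factoids to literature References can be sketched in a few lines of Java. The types below (LiteratureSketch, ReferenceKind, and the nested classes) are hypothetical and only mirror the relationships described above; the Note title and Factoid text are invented for the example, and the PMID is the one quoted in the Factoid entry.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only; these are NOT the Clotho API classes.
final class LiteratureSketch {

    // The four identifier kinds by which a Reference is defined in the text.
    enum ReferenceKind { DOI, PMID, URL, PATENT }

    static final class Reference {
        final ReferenceKind kind;
        final String identifier;

        Reference(ReferenceKind kind, String identifier) {
            this.kind = kind;
            this.identifier = identifier;
        }
    }

    // A piece of information tied to one specific Reference.
    static final class Factoid {
        final String text;
        final Reference reference;

        Factoid(String text, Reference reference) {
            this.text = text;
            this.reference = reference;
        }
    }

    // A titled collection of Factoids attached to a Feature, Family, or Strain.
    static final class Note {
        final String title;
        private final List<Factoid> factoids = new ArrayList<Factoid>();

        Note(String title) {
            this.title = title;
        }

        void add(Factoid factoid) {
            factoids.add(factoid);
        }

        // All literature References reachable from this Note, in insertion order.
        List<Reference> references() {
            List<Reference> result = new ArrayList<Reference>();
            for (Factoid factoid : factoids) {
                result.add(factoid.reference);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Note note = new Note("Fast-folding YFP");
        note.add(new Factoid("Original description of the sequence",
                new Reference(ReferenceKind.PMID, "11753368")));
        for (Reference ref : note.references()) {
            System.out.println(ref.kind + " " + ref.identifier);
        }
    }
}
```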
1.3.1.5. Experimental data objects
SampleData—A SampleData object is the basis for all pieces of data about a Sample. There are currently three types of SampleData objects: Comment, SequenceRead, and ExpData. Only Samples have SampleData objects, as data may only be associated with real-world instances of biological objects.

Comment—A Comment object links to a WikiText object containing data about a sample. Comments are used for free-form descriptions of the outcome of an experiment or to embed pieces of data such as Excel files.

SequenceRead—A SequenceRead object is a SampleData object that stores cycle sequencing data. It holds the Oligo primer used for sequencing, the date the sample was submitted, potentially a barcode, and the “.abi” file for the read.

ExpData—The ExpData object stores all other types of hard data that are intended to be machine interpretable. It is not fully implemented in Clotho v2.0.

1.3.1.6. Organizational objects
Collection—A Collection object holds a list of other data model objects (including other Collections). It is used to bundle together objects such as all the Parts associated with a user, or objects associated with one project or task. Collections allow Apps to access subsets of data from the database without the need for accessing or querying the entire database.
1.3.2. App types
Clotho currently supports six types of Apps, each characterized by the type of ClothoPlugin interface it uses to communicate with Clotho:
ClothoWidget—An App that is launched when Clotho is started. Often Widgets will encode a graphical user interface (GUI).

ClothoViewer—An App that is launched on data model objects of a specific type. Installed Viewers are indexed by the ClothoCore and are made available through Clotho’s right-click popup menu from other Apps.

ClothoTool—An App that can be launched without referring to a specific data model object. Many tools provide a front end for choosing a specific object and then launching a Viewer.

ClothoAlgorithm—An App used for automating a modular task in the background. Algorithms usually will not contain a GUI, as they are intended to be connected together into longer workflows or iterated over a list of objects.

ClothoConnection—An App that maps a particular database to the Clotho API. Since databases are not required to obey one specific schema in Clotho, the standardization of the data between the database and the API is handled through a Connection.

ClothoFormat—An App that is called by a Format object to validate the creation of composite Parts and Plasmids and calculate their sequences.
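To make the idea of Viewers being indexed by object type more concrete, here is a minimal Java sketch of such an index. It is only an assumption about how a registry of this kind could work: the ViewerRegistrySketch class, its Viewer interface, and the register and menuFor methods are hypothetical names invented for this example and do not reproduce the ClothoCore implementation or the real ClothoViewer interface.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; NOT the ClothoCore plugin index.
final class ViewerRegistrySketch {

    // Hypothetical stand-in for a viewer-type App.
    interface Viewer {
        String name();
    }

    // Viewers grouped by the object type they can be launched on.
    private final Map<Class<?>, List<Viewer>> viewersByType =
            new LinkedHashMap<Class<?>, List<Viewer>>();

    void register(Class<?> objectType, Viewer viewer) {
        List<Viewer> list = viewersByType.get(objectType);
        if (list == null) {
            list = new ArrayList<Viewer>();
            viewersByType.put(objectType, list);
        }
        list.add(viewer);
    }

    // Names of all viewers applicable to the target, e.g. to fill a popup menu.
    List<String> menuFor(Object target) {
        List<String> names = new ArrayList<String>();
        for (Map.Entry<Class<?>, List<Viewer>> entry : viewersByType.entrySet()) {
            if (entry.getKey().isInstance(target)) {
                for (Viewer viewer : entry.getValue()) {
                    names.add(viewer.name());
                }
            }
        }
        return names;
    }

    // Convenience factory for a viewer that only carries a display name.
    static Viewer named(final String name) {
        return new Viewer() {
            public String name() {
                return name;
            }
        };
    }

    public static void main(String[] args) {
        ViewerRegistrySketch registry = new ViewerRegistrySketch();
        // String stands in for a data model type such as a sequence object.
        registry.register(String.class, named("Sequence View"));
        registry.register(Integer.class, named("Number View"));
        System.out.println(registry.menuFor("ATGC"));
    }
}
```

Keying the index on the object type means an App that offers a right-click menu only needs the selected object to find the applicable viewers; it does not need to know which viewers are installed.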
1.4. Resources
We recommend the following information resources for the reader who wishes to learn the details of using and contributing to Clotho:
1. http://www.clothocad.org—This is the flagship site for Clotho. Here you can download the software, learn more about Clotho, and interact with other users in the forum.
2. http://www.clothoapps.org—This is where users can download and share Apps. This store is currently under development.
3. http://www.clothohelp.org—This site contains information helpful to users and developers. If you do not find your answer in this document, we recommend you check out this site.
4. http://sourceforge.net/projects/clothocad/—This site hosts the Clotho code. Developers can check out this site for source code to begin App development.
1.5. Article organization
The rest of this chapter is organized around the two main types of users of Clotho. The first group is termed “developers.” These are individuals who will actively contribute to the Clotho community via the creation of new Clotho tools. The second group is termed “users.” These are individuals who will use Clotho on a daily basis in a laboratory environment.
2. Developers
This section caters to developers. It describes the process of setting up a development environment for creating new Apps for Clotho. This guide is not exhaustive, but we provide pointers to resources where further details may be found.
2.1. Getting started (Windows version)
1. The first item required is Java 6. We recommend that you download Java 6 with Netbeans. Older versions of Netbeans should be upgraded to the latest version. (The current version we use is 6.9.1; refer to Clotho’s help Web site for updates.) Both Java 6 and Netbeans may be downloaded from the following location: http://www.oracle.com/technetwork/java/javase/downloads/index.html
2. Netbeans is an integrated development environment (IDE) built on the Netbeans plugin environment, which is also the basis for Clotho’s plugin environment. Therefore, Clotho Apps are built using the Netbeans IDE, and a portion of the Netbeans code is also embedded within a Clotho distribution. Clotho “Apps” are created as Netbeans “Modules.” We provide a Netbeans project with the Clotho source code. It is possible that other Java IDEs (e.g., Eclipse) will work for Clotho development. However, this is not officially supported. If you get other IDEs to work, please share your solution with us and other users through the Clotho users’ forum.
3. You will also need a Subversion client for communicating with our Sourceforge code repository. For developers new to Subversion, we recommend the Tortoise SVN client, which can be found at the following location: http://tortoisesvn.net/downloads. For users new to Subversion, we also recommend the following tutorial: http://svnbook.red-bean.com/en/1.5/svn-book.pdf.
4. Clotho and its source code are shared through a repository called “Sourceforge.” The Clotho Sourceforge page is at http://sourceforge.net/projects/clothocad. The next task is to “check out” Clotho. (“Check out” is a specific action defined in the Subversion client protocol; see the SVN tutorial for details.) Follow the instructions below to check out the Clotho code:
   • Create three folders under a separate folder, set aside exclusively for Clotho development. Label them as “ClothoProject,” “ClothoApps,” and “ClothoDevelopment.”
   • The location of the Clotho Subversion repository is https://clothocad.svn.sourceforge.net/svnroot/clothocad/trunk/ClothoProj. (Check the Clotho help Web site for changes in this address.) You will need this for checking out the Clotho source code using any Subversion client.
   • If you choose to use the Tortoise client, then right click on the “ClothoProject” folder and select the “SVN Checkout . . .” option. A window should appear. In the text box labeled “URL of Repository,” paste the repository location.
   • It is important to have these specific folder names to ensure that the Netbeans build files provided with the source code work correctly. (Check for capitalization, punctuation, and typing errors.)
   The above steps provide you access to the Clotho core source code, which can be built and run. The core contains the core API, but does not contain any of the existing Apps. To obtain Apps:
   • Right click the “ClothoApps” folder and select the “SVN Checkout . . .” option. In the “URL of Repository” box, paste the following line: https://clothocad.svn.sourceforge.net/svnroot/clothocad/trunk/ClothoApps.
   This provides you access to the source code of all the existing Apps. There may be other Apps under development. To obtain their source code:
   • Right click the “ClothoDevelopment” folder, select the “SVN Checkout . . .” option, and paste in the following line: https://clothocad.svn.sourceforge.net/svnroot/clothocad/trunk/ClothoDevelopment.
5. Your next task is to “build” Clotho in Netbeans. This step translates and integrates the source code into a single executable application.
   • Start Netbeans and choose the “Open Project” option under the “File” menu.
   • Navigate to the ClothoProject folder and choose the Netbeans project file named ClothoProject. (It should appear beside an icon that looks like a pair of pieces from a jigsaw puzzle.)
   • Enable the “Open as Main Project” and “Open all required projects” options and click “OK.” A list of Clotho “modules” should appear in the left Netbeans Project pane.
   • Right click on ClothoProject and choose “build.” At the end of the process, you should see a “Build successful” message in one of the bottom panes. Otherwise, you should retrace your steps and check each step carefully.
   • After a successful build, right click on the ClothoProject and choose “Run.” A new Clotho window should appear on the screen: this is the Clotho Dashboard, the launch pad for other Apps.
6. You will notice that when you run Clotho, many Apps are already installed. If you want to add other Apps, expand the Clotho project on the left-hand side of Netbeans and right click on the “Modules” folder. Select the “Add Existing” option. Go to the folder you created earlier labeled “ClothoApps” or “ClothoDevelopment” and select tools to add to Clotho. Once those are added, build Clotho again and run it. The Clotho Dashboard should appear on the screen, but this time with an icon for each ClothoTool App that you added.
7. For more information on setting up Clotho for Linux- or MacOS-based machines, see: http://wiki.bu.edu/ece-clotho/index.php/DeveloperSetup.
2.2. Writing your first App
This section provides enough information for you to make your first “Hello World” application and to see how a new App is launched from the Dashboard. To make a fully functioning App, you will need to learn more about the Clotho API at http://www.clothohelp.org.
2.2.1. Create a Netbeans Module
1. Begin by going through the steps in Section 2.1. Create a new folder in “ClothoDevelopment” with the name of the App that you are creating. Capital and lower case letters do make a difference: remember the names of files and folders exactly. For our example, call the folder “TestApp.”
2. Under the “ClothoProject” directory in the left panel in your Netbeans project, right click on the “Modules” folder and select “Create New.”
Bing Xia et al.
3. A window will pop up asking you to provide the name of the module you are creating and the project location on your file system. Give the project the name “TestApp” and set the project location to the folder you created under ClothoDevelopment. Click “Next.” All plugins have an extended address that should be unique among both your own projects and those of other developers. You should choose a name that is unique to your domain, such as “smithlab,” for all your Apps and then also a name, such as “testapp,” specific to your new App. Set the “Code Name Base” to org.smithlab.tool.testapp and provide TestApp as the “Module Display Name.” Click the check box to “create XML Layer.” You have created the foundation for your new module (which will become a Clotho App).
4. Your module should now appear under “Modules” in ClothoProject. Expand the folder of your module, right click on the project, and select “Properties.” Select “Libraries” on the left hand side of the new window. Select “Add Dependency.” When the list of dependencies appears, select “ClothoCore.” This will give you access to the Clotho API.
2.2.2. App source code
Next, you have to create the source code for your App. This will depend on the functionality provided by your App, and will be the primary contribution made by you, the developer. The primary class for your App, however, must implement one of the Clotho App types. App-type information is available at http://www.clothohelp.org. In this example, we shall use ClothoTool as the type. So, the class definition will look like this:
public class TestApp implements ClothoTool
Your App code must implement the required methods. In particular, it must implement the launch( ) method. This is the method that will be called when your tool icon is clicked on the Dashboard. In addition to the code, you should also provide your tool icon file.
2.2.3. App XML files
There are two XML files you must provide for the App to function. The first is layer.xml. This file will already be part of the source files group. You need to modify it as follows:
The second XML file, which you will need to create, provides auxiliary information about your App. Add a file named “TestApp.xml” to the source code directory. The file should look like this:
ClothoTool 1.0 org-smithlab-tool-testapp-TestApp <description>Basic Test App Test App org.smithlab.tool.testapp <screenshotpath>none Your name org/smithlab/tool/testapp/ NAMEOFYOURICON none For more information please see.. Command1 Command2
2.2.4. Configuring and running your App
The Netbeans IDE will dynamically check your code and mark potential problems and errors with exclamation symbols in the code margins. You should check such flagged lines and resolve the errors before proceeding further. When all errors are resolved, right click on your module in the left Projects pane of Netbeans and select the “Clean and Build” option. Once this process is completed, you are ready to launch your application. Right click on your new App module and select “Run.” Clotho will launch, your App icon should appear on the Dashboard, and the code in your launch method will execute when you double click on the icon.
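The shape of the App class from Section 2.2.2 can be sketched as below. This is only an illustration under stated assumptions: the real ClothoTool interface ships with ClothoCore and is not reproduced in this chapter, so the sketch declares a hypothetical stand-in interface with a single launch() method. The real interface may require further methods or a different signature, and the greeting() and getLaunchCount() helpers are invented here for demonstration.

```java
// Hypothetical stand-in for the ClothoTool App type. In a real module this
// interface comes from the ClothoCore dependency and may declare more methods.
interface ClothoTool {
    void launch();
}

// Minimal "Hello World" App. launch() is the method the Dashboard calls
// when the tool icon is clicked.
class TestApp implements ClothoTool {
    private int launchCount = 0;

    @Override
    public void launch() {
        launchCount++;
        System.out.println(greeting());
    }

    // Kept separate from launch() so the message can be checked without a GUI.
    String greeting() {
        return "Hello World from TestApp";
    }

    int getLaunchCount() {
        return launchCount;
    }

    // Stand-alone entry point for trying the sketch outside Clotho.
    public static void main(String[] args) {
        new TestApp().launch();
    }
}
```

Inside a Clotho module, the stand-in interface would be dropped in favor of the one provided by ClothoCore, and the class would be declared public as in Section 2.2.2.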
3. Users
This section is a tutorial for users of Clotho. It takes you through a variety of exercises that will familiarize you with the basic tools and their capabilities.
3.1. General remarks
All references to items on the “Dashboard” assume that you are using the Dashboard provided with the Clotho Starter version 2.0 available at the http://www.clothocad.org Web site. Other Dashboards, which may be available in future distributions, may implement this functionality differently. We strongly suggest that you use this Dashboard when going through this document. This tutorial uses specific Apps for many tasks (e.g., the Spreadit family of Apps, Person Editor, Sledrunner, etc.). If you are using a later version of Clotho, other Apps may be included or these Apps may be replaced. We highly recommend that you use the Apps referenced in this tutorial. If you are using a later version of Clotho and wish to get the Clotho v2.0 Starter version, please see http://sourceforge.net/projects/clothocad/files/. The Format you select may restrict the way Parts are created with Apps. If you experience trouble creating Parts or Vectors in certain formats, see http://wiki.bu.edu/ece-clotho/index.php/App_Information#Clotho_Formats.
3.2. Managing your Apps
Clotho provides a platform to build new Apps and for Apps to interact with each other by sharing data. The “Plugin Manager” allows you to install, remove, enable, and disable Apps; register preferred Viewers; change the default Connection; and jump to the App store. We provide an overview of its capabilities below:
1. Launch the Plugin Manager by clicking on “Manage plugins” or double clicking on the Plugin Manager icon on the Clotho Dashboard as shown in Fig. 5.3.
2. The “shopping cart” takes you to the Web site of the Clotho App Store. The Help buoy takes you to the help page. (Clicking on any of the management tools will center them and clicking again will run them.)
3. In the Starter version, Clotho is connected to a database local to your computer. We call this a “local connection.” The “Manage Database” tool (shown in Fig. 5.4) allows you to change this to a “Configurable
Figure 5.3 Starting the Plugin Manager. There are two ways to start the Plugin Manager from the Clotho Dashboard. The first is to simply double click on the Plugin Manager icon. The second is to click on the Plugin Manager text. The Plugin Manager allows the user to perform many functions such as installation and removal of Apps, showing and hiding Widget Apps, choosing the default Connection, and registering preferred Viewers.
Connection”: this allows Clotho to connect to any remote database based on a particular MySQL schema.
4. Close the “Manage Database” window to return to the Plugin Manager.
5. Clicking on the Manage Widgets icon should show you a list of manageable Widgets. You can enable or disable a Widget by right clicking on it. (The specific widget icon should appear brightened or dimmed, respectively.)
6. The “Manage Viewers” tool allows you to access help pages about the different types of objects in Clotho and register your preferred Viewers for these object types. Navigate to the Collection icon, and right click. You will get a list of all available viewers for a Collection. Choosing one will set it as the preferred Collection viewer.
Figure 5.4 Setting the default Connection. One Connection is always registered as the preferred App for communications with Clotho. The Starter Edition of Clotho comes with two such Connections: a “local connection” and a “configurable connection.” To change the default, click on the name and then click the button.
7. The “Remove plugins” tool allows you to uninstall Apps of your choice. Just click what you want to uninstall, and click “Uninstall.” Be careful with this option as you may not be able to revert this step easily if the App is not available in the App store. (To install a Clotho App, you first need to download the App’s .clo file. You could get those by e-mail from someone, or you could download them from the Clotho App store. Then, drag the file either onto the Dashboard or the Plugin Manager to install.)
3.3. Adding a New Institution, Lab, and User
The data that you add and create when using Clotho are associated with a particular user, and this user is associated with a particular Lab and Institution. On your first use, we recommend that you create a Person for yourself and associate it with a Lab and an Institution. If the latter are missing, we
recommend that you create them too. Entering this information requires the following steps:
1. Start Clotho by clicking on the executable.
2. If you are not using the Clotho Starter version and you are not connected to a database, then refer to Section 3.12 for instructions on connecting to a database.
3.3.1. Add New Institution
3. Launch the “Institution Editor” from the Dashboard. You can choose an existing Institution or choose “New Institution” from the drop-down list. Here, we assume that your institution is not in the list, so choose “New Institution.”
4. Follow the prompts to add the Institution name, city, state, and country. When you are done, select “Save Changes.” Close the window.
5. You can test if this worked by rerunning the Institution Editor. The newly entered institution should appear in the list. (Information associated with it can be edited.)
3.3.2. Add New Lab
6. Double click on the “Lab Editor” tool on the Dashboard. Choose your Lab if it exists in the list; otherwise choose “New lab.”
7. Fill in the prompts with the information for your Lab. You will also be able to assign the Lab an Institution from the list of available Institutions. (You will not be able to add a Lab PI until that user’s information is entered into the database. If you type in an unrecognized name, it will simply disappear. You can add the PI after step 9, once you have added a Person.) Save this information and verify it by running the Lab Editor again and checking that your Lab appears in the list.
3.3.3. Add New Person
8. Click on the “Person Editor” and add a new username. Fill in the details for the user and associate the user with the correct Lab. You will need to create a password. (Be sure to use a long and strong password that is independent of your other passwords. While we are taking all necessary precautions to protect your password (all passwords are encrypted in the database), security is not our primary focus in this version of Clotho.)
9. Filling in the Person details also allows you to add biographical information about yourself via a WikiText. If you double click on the panel to the right (or hit Ctrl-E), you will get a text editor window you can edit.
Figure 5.5 Institution, Lab, and Person Editors. These Apps allow you to create or modify Institution, Lab, or Person objects. They also function as Viewers of these objects and enable the user to modify specific data contained within them.
You can toggle back using Ctrl-P (see http://wiki.bu.edu/ece-clotho/index.php/WikiText_Editor for more). Save the user information.
You should now have an Institution, Lab, and Person created in the database. The final results after each step should look like the windows shown in Fig. 5.5. Future users in the same Lab will only need to go through steps 8 and 9.
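The associations set up in this section (a Person belongs to a Lab, a Lab belongs to an Institution, and a Lab PI must already exist as a Person) can be pictured with a small sketch. All class, field, and method names here are hypothetical and chosen for illustration; they are not Clotho's data model classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration classes, not part of the Clotho API.
final class Institution {
    final String name;
    Institution(String name) { this.name = name; }
}

final class Lab {
    final String name;
    final Institution institution;
    Person pi; // stays null until the PI exists as a Person
    Lab(String name, Institution institution) {
        this.name = name;
        this.institution = institution;
    }
}

final class Person {
    final String username;
    final Lab lab;
    Person(String username, Lab lab) {
        this.username = username;
        this.lab = lab;
    }
}

final class Directory {
    private final Map<String, Person> people = new HashMap<>();

    Person addPerson(String username, Lab lab) {
        Person person = new Person(username, lab);
        people.put(username, person);
        return person;
    }

    // Mirrors step 7: a name that is not yet in the database is not accepted as PI.
    boolean assignPi(Lab lab, String username) {
        Person person = people.get(username);
        if (person == null) {
            return false;
        }
        lab.pi = person;
        return true;
    }
}
```

In this sketch the PI assignment simply fails until the Person has been added, which corresponds to the order of steps recommended above.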
3.4. Creating a new Feature
In the Clotho data model, biological primitives such as coding sequences, promoters, terminators, and origins of replication are called “Features.”
1. Spreadit Features, like all Apps in the Spreadit Series, launches with and views a Collection. A Collection is a set of objects affiliated with a specific user and can include Parts, Vectors, Features, etc. The Starter version of Clotho comes with some Collections preinstalled. For example, to view the Collection of the dummy user “jjenn,” ensure that you are connected to a database, type “jenn feat” into the text box on the Clotho Dashboard, and press Enter. A window similar to the one shown in Fig. 5.6 should appear.
2. Right click and navigate to Spreadit Features to launch the viewer.
3. This produces a list of Features inside the jjenn Features Collection. It shows each Feature’s name, its sequence, its risk group, and the forward and reverse colors used for annotating it. The Family and Note columns are currently not implemented.
4. There is additional information about Features that is not displayed here.
5. One piece of nondisplayed data is whether the Feature is a coding sequence or not. If a sequence is a coding sequence, Clotho stores it without
Figure 5.6 A user’s Collection viewed with the Collection View App. Each user can view their objects organized into Collections. Here, the user has objects further organized into subcollections categorized by object type.
start and stop codons. Later when you auto-annotate, the starts and stops will be restored.
6. You can edit the name, sequence, risk group, or author of a Feature. The new name must be unique to the database. The risk group can only be increased (to integers 1 through 5), and the author you put in must be in the database. All other changes will be rejected.
7. To change colors, first click on a row to highlight it. Then you have two options. You can right click on the colored cell, which pulls up a color chooser, and click the color you want on that chooser. Alternatively, you can double click on the cell you want to edit and type in the integer code for the color you want. You can also type in various HTML color names such as “plum.”
8. As with all the Apps provided with the Starter edition of Clotho, objects are not committed to the database unless you actively save them, but Clotho will warn you about modified objects when you shut down. To save a modified Feature, click to highlight it, right click outside the color cells, and choose “Save to database.”
We now show the procedure for creating new Features. (More information about the role of a Feature in Clotho can be obtained from http://wiki.bu.edu/ece-clotho/index.php/Feature.)
1. Spreadit Features allows you to create new Features and add them to the Collection that you are viewing. If you need to upload a long list of Features, we recommend using BullTrowell instead.
Figure 5.7 Information needed to add a new Feature. A new Feature is created in Spreadit Features by specifying the name, sequence, and whether or not it is a coding sequence (CDS).
2. To add a new Feature, go under File > Add Feature or type Ctrl-N. You will see a window asking for the name of the Feature, its sequence, and the coding status of the sequence, as shown in Fig. 5.7. Enter the information and click “submit.” Your Feature should now appear in the table. (Remember that all changes are saved locally until you actively save them to the database.)
3. For example, suppose you enter the DNA sequence of a short peptide. Since it encodes protein, you should enable the “is a CDS” option. For things other than peptides or full-length ORFs, disable it. For CDS Features, the sequence supplied must be in frame and encode protein with no internal stop codons. The Feature will otherwise be rejected. Start and stop codons are acceptable, but they will be deleted during the creation of the Feature.
4. You can change the colors associated with a Feature by right clicking on the color entries and using the color chooser that presents itself. You can also manually enter the color values if you choose.
5. A Collection can become long and tedious to search. You can search for a specific entry by typing Ctrl-F, typing in the query, and clicking “Find.” You can also search more specifically by name or by sequence. Navigate to Edit > Search Name, type in the query, and click OK. A new window with a new Collection containing all the Features whose name contains the query will pop up.
6. Similarly, you can search for all the Features that contain a query sequence. Navigate to Edit > Search Sequence, type in the query sequence, and click OK. You will get a new Collection with Features whose sequence contains the queried sequence.
7. Like many Apps, Spreadit Features implements a right-click popup. It does this for each Feature as well as for the Collection currently in view. From the right-click popup, you can execute a variety of actions such as launching other Viewers. To get the right-click popup on a Feature, highlight the row with the Feature, then right click on one of the cells that is not colored. To get the right-click popup on the Collection, right click on one of the blue border regions of the window.
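The rules for CDS Features given in step 3 (in frame, no internal stop codons, start and stop codons removed on creation) can be sketched as follows. This is an illustration under stated assumptions, not Clotho's implementation: the class and method names are hypothetical, only ATG is treated as a start codon, and only the three standard stop codons are recognized.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating the CDS rules; not part of the Clotho API.
final class CdsFeatureCheck {
    private static final Set<String> STOPS =
            new HashSet<>(Arrays.asList("TAA", "TAG", "TGA"));

    // Returns the sequence as it would be stored (start and stop codons
    // removed), or null if the sequence would be rejected.
    static String normalizeCds(String raw) {
        String seq = raw.trim().toUpperCase(Locale.ROOT);
        // Must be plain DNA and in frame (length divisible by 3).
        if (seq.isEmpty() || seq.length() % 3 != 0 || !seq.matches("[ACGT]+")) {
            return null;
        }
        // Remove a terminal stop codon if present.
        if (STOPS.contains(seq.substring(seq.length() - 3))) {
            seq = seq.substring(0, seq.length() - 3);
        }
        // Remove a leading start codon if present (assumption: ATG only).
        if (seq.startsWith("ATG")) {
            seq = seq.substring(3);
        }
        // Any remaining in-frame stop codon is internal, so reject.
        for (int i = 0; i + 3 <= seq.length(); i += 3) {
            if (STOPS.contains(seq.substring(i, i + 3))) {
                return null;
            }
        }
        return seq.isEmpty() ? null : seq;
    }
}
```

For example, under these assumptions normalizeCds("ATGAAATTTTAA") yields "AAATTT", while a sequence of length 5 or one with an in-frame TAA in the middle yields null.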
3.5. Creating a new Part
Clotho Parts are standardized basic and composite biological parts. Typically, one basic Part will map to one Feature (a biological primitive). Parts are like BioBricks. They have a sequence and a format. The format is a standard like “RFC 10.” The sequence must obey the format or Clotho will reject the data. In Clotho, Parts are different from Features: they are not necessarily biological primitives. Parts are composition primitives. The document “Clotho_for_users.pptx” available at clothocad.org provides a more in-depth explanation of the distinction between Parts and Features. A detailed comparison of Registry of Standard Biological Parts “parts” and Clotho Parts is available in the App developer FAQ.
The Starter edition of Clotho contains an App called “BullTrowell” that allows you to copy data from an Excel or Google Docs spreadsheet, paste it, and parse the data into Clotho objects. We describe the procedure below.
1. Ensure that you are connected to the database and then double click BullTrowell.
2. You will see a control panel window from which you can choose the objects that you would like to parse. You can generate basic Parts, composite Parts, Vectors, Plasmids, Oligos, or Features. In this tutorial, we will generate basic Parts, but all other parsers work similarly.
3. Click “basic Parts” and it will create a table into which you can paste the number of parts indicated in the text box at top. The default limit of 50 can be changed via the text box.
4. You minimally need to fill in the Nickname, Short Description, and Sequence fields for each Part that you want parsed. (See instructions in the top left of the window.) You can type that information in, or copy and paste from Excel, or paste text copied from, say, ApE into an individual cell after double clicking on it.
5. All Parts in Clotho must have a Format. If you are a BioBrick user, several of the known standards archived by the BioBricks Foundation come preloaded with your Starter edition of Clotho. A detailed look at the “rules” for each format installed with the Starter edition is available under “Clotho Formats” at http://wiki.bu.edu/ece-clotho/index.php/App_Information.
6. Consider the RFC10 standard (the original XbaI/SpeI-based Knight standard). In Clotho, RFC10 is split between two Formats: RFC10 and RFC10–CDS. RFC10–CDS is for coding sequences starting with ATG, and RFC10 is for other sequences. In the example shown in Fig. 5.8, three of our Parts are coding sequences with an ATG prefix, and one does not have that prefix because it is a promoter part. Since most of them are RFC10–CDS, select RFC10–CDS as the default Format for parsing.
7. You can use multiple Formats in one round of parsing. Just type or paste the Format’s name (here, RFC10) into the Format column to override the default.
Figure 5.8 Selecting Objects for parsing in BullTrowell. The BullTrowell App lets you import external data from spreadsheets. You must specify certain details (e.g., Nickname, Short Description, and Sequence for a Part). However, if other values (e.g., Author and Format of a Part) are not supplied, the values in the corresponding drop-down boxes will be used.
8. You also need to choose an author. Just as with the Formats before, you can override the default author for individual Part lines by typing the author’s name into the Author column. When you are finished entering information, click “submit.”
9. In the example in Fig. 5.8, three of the four Parts were successfully parsed, but BBA_FGK14 failed. Parts parsed successfully:
a. Show up in the popup window of Collection View.
b. Show up in the text box in the top right corner of the parser.
c. Have their lines deleted from the parser’s table.
The failure to parse could be due to a misspelled Format or Author name. It is also possible that the Part’s sequence disobeys the Format. In the example in the figure, the sequence of the P_yfiD promoter has an XbaI site in it, and so it disobeys the RFC10 standard. It was therefore rejected. Once you have identified your error, you can edit your data in the table and click “submit” again.
10. The window that pops up is a viewer (in this case, Collection View) launched on a transient Collection object containing your new Parts. Note that although none of the Parts have been saved to the database, you can manipulate them with Clotho’s various tools. If you right click on one, you will see a popup that will allow you to choose and run a Part viewer like Sledrunner.
11. To store your Parts in your own personal Collection, we could open jjenn’s Parts by typing “jenn part” into the Dashboard. Alternatively, we can right click on the blue region of the Sledrunner window to get a right-click popup for this Part, navigate from the Author field to its popup, to the Collection field to its popup, and then choose the viewer
“Collection View.” This shows us jjenn’s personal collection in which we see several Collections that jjenn is using to organize data. Double clicking on “jjenn Parts” will open the default viewer for Collection on that Collection.
12. We now have two Collection View windows: one for the destination Collection and the other with our new Parts. Typing Ctrl-A and Ctrl-C in the new Parts window will select and copy all the Parts in that Collection. Clicking on jjenn’s Parts Collection and typing Ctrl-V will paste our new Parts into jjenn’s Collection.
13. You can save your changes to the jjenn Parts Collection and all its new contents by clicking “Save Changes.” When you rerun Clotho, your new Parts will be in the database and available in jjenn Parts. Remember that all your changes are local until you actively save them to the database.
14. We can look at a Collection of Parts using a different tool called Spreadit Parts. You can launch this tool by right clicking in the blue areas of the Collection View window and choosing Spreadit Parts from the right-click popup.
15. Spreadit Parts provides a different look-and-feel and different functionality. It is functionally similar to Spreadit Features. It allows you to add basic and composite Parts; search by name, description, or sequence; and manually edit the fields with input validation. For example, editing a sequence and typing in TCTAGA (an XbaI site) will get the change rejected because it violates the Format requirements.
16. There are two special functions unique to Spreadit Parts: the “Status” column and the Plasmid Calculator (discussed in Section 3.7) at the top of the window. The user can add search tags such as “works” or “fails” into the Status column. (All Clotho objects, not just Parts, accept search tags.) You can also add and look at these search tags from the right-click popup menu.
17. Spreadit Parts can also auto-annotate a Part’s NucSeq with Features.
For example, to auto-annotate a Part with Features in jjenn’s Collection, right click on Bjn1392 and launch the “Sledrunner” viewer. This should display the Bjn1392 sequence in a new window. There are two options for auto-annotating with Features. You can use all the Features in the database (Ctrl-Alt-K) or just all the Features in any Collection within jjenn’s personal Collection (Ctrl-K). This sequence possesses the “OmpX” Feature and this Feature is in jjenn’s Collection of Features. Therefore, the search should find it when you type Ctrl-K.
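The idea behind auto-annotation, locating known Feature sequences within a Part's sequence, can be sketched as a plain substring search on both strands. This is a simplified illustration and not Sledrunner's algorithm: the class and method names are hypothetical, Features are passed in as a name-to-sequence map, and the restoration of start and stop codons for CDS Features mentioned in Section 3.4 is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of sequence-based annotation; not the Clotho API.
final class AutoAnnotateSketch {

    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (seq.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append('N'); // anything else becomes N
            }
        }
        return sb.toString();
    }

    // Returns one "name:start:strand" entry per match, with 0-based start
    // positions on the target and '+' or '-' for the matching strand.
    static List<String> annotate(String target, Map<String, String> features) {
        String t = target.toUpperCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : features.entrySet()) {
            String forward = e.getValue().toUpperCase(Locale.ROOT);
            if (forward.isEmpty()) {
                continue;
            }
            collect(t, forward, e.getKey(), '+', hits);
            String reverse = reverseComplement(forward);
            if (!reverse.equals(forward)) { // avoid reporting palindromes twice
                collect(t, reverse, e.getKey(), '-', hits);
            }
        }
        return hits;
    }

    private static void collect(String t, String query, String name,
                                char strand, List<String> hits) {
        for (int i = t.indexOf(query); i >= 0; i = t.indexOf(query, i + 1)) {
            hits.add(name + ":" + i + ":" + strand);
        }
    }
}
```

Restricting the map to the Features of one user's Collections, or filling it with every Feature in the database, would correspond loosely to the Ctrl-K and Ctrl-Alt-K options described above.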
So far we have seen what Spreadit Parts and BullTrowell can do to existing parts. We now describe how to create a new Part using Spreadit Parts.
1. Open the “Spreadit Parts” tool.
2. From the File menu, select “Add a basic Part.” Fill in the information that you want for the part. To tie pieces of this tutorial together, enter somewhere in your part sequence the Feature’s sequence for the feature
you created earlier. This will be useful when you want to see how Features are annotated onto sequences.
3. Make another part the same way. Make it with the same Format so that you can then make a third composite Part. If the names for your Parts already exist, then you will be prompted to give them new names.
4. Save the Parts by choosing “Select all” under “Edit” and then choosing “Save selection” under “Selection.”
5. Make a composite Part by selecting File > “Add Composite Part.” Enter the information for the Parts that you just made, as shown in Fig. 5.9. As a safety measure, if the two Parts that you are using are incompatible according to their Formats, a message window will notify you that these basic Parts cannot form a composite Part.
6. You can see all the objects that you have made thus far by launching “Collector Viewer.” You may be prompted to select a user’s collection.
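The Format checks described in this section (RFC10–CDS for sequences starting with ATG, RFC10 for other sequences, and rejection of sequences containing an XbaI site) can be sketched as below. The sketch rests on assumptions: the class, enum, and method names are hypothetical, and the list of forbidden sites is taken from the published BioBrick RFC 10 standard (EcoRI, XbaI, SpeI, PstI, NotI) rather than from Clotho's source, so the rules Clotho actually enforces may differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical Format check; not the Clotho Format implementation.
final class Rfc10FormatSketch {
    enum Format { RFC10, RFC10_CDS }

    // Restriction sites disallowed inside an RFC 10 part. All five are
    // palindromic, so scanning one strand covers both.
    private static final Map<String, String> FORBIDDEN = new LinkedHashMap<>();
    static {
        FORBIDDEN.put("EcoRI", "GAATTC");
        FORBIDDEN.put("XbaI", "TCTAGA");
        FORBIDDEN.put("SpeI", "ACTAGT");
        FORBIDDEN.put("PstI", "CTGCAG");
        FORBIDDEN.put("NotI", "GCGGCCGC");
    }

    // Returns the name of the first forbidden site found, or null if none.
    static String firstForbiddenSite(String sequence) {
        String s = sequence.toUpperCase(Locale.ROOT);
        for (Map.Entry<String, String> e : FORBIDDEN.entrySet()) {
            if (s.contains(e.getValue())) {
                return e.getKey();
            }
        }
        return null;
    }

    static boolean obeys(Format format, String sequence) {
        if (firstForbiddenSite(sequence) != null) {
            return false;
        }
        boolean startsWithAtg = sequence.toUpperCase(Locale.ROOT).startsWith("ATG");
        // Assumption: only the CDS Format insists on the ATG prefix.
        return format == Format.RFC10 || startsWithAtg;
    }
}
```

A promoter sequence that contains TCTAGA fails obeys() for either Format in this sketch, which matches the rejection of the P_yfiD Part in the BullTrowell example.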
3.6. Creating a new Vector
Vectors are similar in their Clotho properties to Parts. All the tools applicable to Vectors (BullTrowell, Spreadit Vectors, Sledrunner, and Collection View) are equivalent to their Parts cousins. The significant differences between a Vector and a Part are the following:
1. Whereas Composite Parts are well defined in Clotho, “composite Vectors” do not exist.
2. The Formats have different rules about the sequence of a Vector. For example, an RFC10 Vector must start with CTAGT and end in T to be considered valid.
3. Plasmids are composed of a Part and a Vector.
Next, we describe the procedure for creating a Vector.
1. Launch “Spreadit Vectors.”
2. Choose the “Add Vector” option from the “File” menu. Fill in the required information. Create a Vector with a Format compatible with the Parts that you created while adhering to the requirements of the Format.
3. Highlight the newly created Vector in the Spreadit Vectors table and choose “Save selection” under the “Selection” menu.
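The example rule quoted above for RFC10 Vectors (the sequence starts with CTAGT and ends in T) is simple enough to express directly. The class and method names are hypothetical, and Clotho may apply additional checks that are not described in this chapter.

```java
import java.util.Locale;

// Hypothetical check for the RFC10 Vector rule stated in the text.
final class Rfc10VectorSketch {
    static boolean isValidRfc10Vector(String sequence) {
        String s = sequence.trim().toUpperCase(Locale.ROOT);
        // Plain DNA only, beginning with CTAGT and ending in T.
        return s.matches("[ACGT]+") && s.startsWith("CTAGT") && s.endsWith("T");
    }
}
```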
3.7. Creating a new Plasmid
1. Launch “Spreadit Parts.” This will show the Parts you have created. Launch “Spreadit Vectors.” This will show the Vectors you have created.
2. Type the nicknames of the Part and the Vector which you want to use in the Plasmid Calculator fields of Spreadit Parts to create a Plasmid.
Figure 5.9 Making a composite Part. Creating a composite Part with Spreadit Parts involves specifying a Name, Short Description, Format, and a “Lefty” and “Righty” Part. The lefty Part is the 5′ Part, and the righty Part is the 3′ Part. The Format selected must be compatible with the Formats of the lefty and righty Parts.
The Format of the Part and the Vector must be compatible. Push the “Tab” button (not “Enter”) in the final field to create the Plasmid.
3. A Plasmid icon (and restriction site selector) should appear if the Plasmid was created successfully (see Fig. 5.10).
Figure 5.10 Creating a new Plasmid. With Parts and Vectors, you can create Plasmids using the Plasmid Calculator. In Spreadit Parts, you can type in the Vector Nickname and the Part Nickname and an icon for a Plasmid will appear. You can then save that Plasmid to a Collection by copying it to the clipboard and then pasting it, or by dragging and dropping the icon onto a Collection. Also, you can use the restriction tool to calculate the expected fragment sizes for cleavage by the selected restriction enzymes.
4. You can then use other tools to view the Plasmid object as well as its sequence (this is called a NucSeq in the Clotho data model).
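The fragment-size calculation mentioned for the restriction tool in Fig. 5.10 can be illustrated for a circular Plasmid sequence. This is a sketch under simplifying assumptions and not the Plasmid Calculator's code: the names are hypothetical, each enzyme is represented only by its recognition sequence, only the given strand is searched (sufficient for palindromic sites), and the cut is placed at the first base of the site rather than at the enzyme's true cut position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical digest calculation on a circular sequence.
final class CircularDigestSketch {

    // Returns fragment sizes in descending order; empty if nothing cuts.
    static List<Integer> fragmentSizes(String plasmid, List<String> recognitionSites) {
        String seq = plasmid.toUpperCase(Locale.ROOT);
        int n = seq.length();
        TreeSet<Integer> cuts = new TreeSet<>();
        for (String site : recognitionSites) {
            String query = site.toUpperCase(Locale.ROOT);
            if (query.isEmpty() || query.length() > n) {
                continue;
            }
            // Append the first bases again so sites spanning the origin are found.
            String wrapped = seq + seq.substring(0, query.length() - 1);
            for (int i = wrapped.indexOf(query); i >= 0; i = wrapped.indexOf(query, i + 1)) {
                cuts.add(i); // i is always below n because of the wrap length
            }
        }
        List<Integer> sizes = new ArrayList<>();
        if (cuts.isEmpty()) {
            return sizes;
        }
        // Start from the last cut shifted back one full circle so that the
        // fragment spanning the origin is measured like any other.
        int previous = cuts.last() - n;
        for (int cut : cuts) {
            sizes.add(cut - previous);
            previous = cut;
        }
        sizes.sort(Collections.reverseOrder());
        return sizes;
    }
}
```

A single cut linearizes the circle and gives one fragment of the full length; with several enzymes whose cut positions sit at different offsets within their sites, real fragment sizes would differ from this sketch by a few bases.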
3.8. Looking at DNA sequences
A DNA sequence is associated with objects as a NucSeq object. We now describe the procedure for viewing a NucSeq object.
1. Open the Collection View tool.
2. Right click on a Part and choose the “Sledrunner” tool as the viewer. The DNA sequence for the Part will be displayed in a Sledrunner window, as shown in Fig. 5.11.
3. If the Part whose sequence is viewed also contains a Feature, then you can highlight the Feature by typing Ctrl-K.
Figure 5.11 Viewing NucSeq objects with the Sledrunner tool. Sledrunner is a straightforward App to view NucSeq objects and annotate them. Here, we show two sequences. One is in a Feature; the other is in a Part. When Ctrl-K is typed in the Sledrunner window for the Part, the NucSeq becomes linked to the Feature via a new Annotation.
3.9. Adding Notes and “Factoids” to your data
We use the Grapevine App to create Notes, Factoids, and References and link them to Strains and Features. Clotho’s Notes and Factoids are used for capturing and linking information about the function of biological primitives. You attach Notes to Strains, Features, and Families. A Note can contain three types of objects:
1. Notes, since Notes are hierarchical;
2. WikiText, which includes formatted text, images, and embedded files; and
3. Factoids, which are statements of fact from a specific digital reference.
We now describe the procedure for creating a Note and linking it to a Feature.
1. Ensure that Clotho is connected to a database and then launch Grapevine.
2. Select “Create a new Note,” then click OK. This creates a new Note. The string “New Note” is the default title of the Note. Click on it to edit it, and click away from it to retain the change.
3. To link it to a Feature, say “ffGFP,” for example, click on the “Link to” box, type in “ffGFP” without quotes, and then click off the box. If the linking succeeds, then a little Feature icon representing the Feature “ffGFP” will appear. The icon is right clickable.
4. To save any changes made to a Note or anything else accessible from this Note Editor, click “Save Everything.” Remember all your changes are local until you actively save them to the database.
We now describe the procedure for creating a Factoid within a Note.
5. Double click the Dashboard icon for Grapevine again, or go under the File menu and select New Note, or type Ctrl-N. Type in a title and attach the Note to the Feature “ffGFP.” Then, click “Add new Factoid.”
6. From the Note Editor, you can add and remove Factoids, edit their WikiText, save them to the database, right click on them and use the popup menu, and launch a more detailed Factoid Editor on them. If you want to remove the Factoid, click the trash can icon above the Factoid.
7. To launch the Factoid Editor, click the right-most icon or double click the title bar for the Factoid.
8. You can change the title, the author, or enter a digital reference link.
9. For example, in creating a Factoid about the discovery of “ffGFP,” one may want to link to the original paper titled “Engineering and characterization of a superfolder green fluorescent protein,” Nat Biotech 24 (1) January 2006. This, however, need not all be typed in. The Factoid Editor, given a PMID such as 16369541 in the Reference Link box, can fetch the relevant reference itself. The Editor also accepts the following types of pointer information:

Type                     Example
Web sites                http://andersonlab.qb3.berkeley.edu
DOI                      1721.1/45143
Papers                   PMID: 1636954
Books                    ISBN: 0879696125
United States Patents    US_Patent:7,003,515
International Patents    Intl_Patent: 7354370

If a reference is valid, then the Factoid Editor will retrieve the abstract and title of the reference and add them to the window, as shown in the example in Fig. 5.12.
10. You can directly access the PDF document of papers linked through their PMID from the Note Editor by double clicking the PMID displayed on the title bar of the Factoid.
Developer’s and User’s Guide to Clotho v2.0
Figure 5.12 Adding PMID references to a Factoid. Using the Grapevine App, you can create Factoid objects to add to your Notes. Here, the Factoid WikiText can be changed along with adding a reference link (in this case to a PMID number).
11. The WikiText object in a Factoid can be used to summarize a paper. Clotho’s WikiText objects are based on the MediaWiki syntax, like the one used in the Registry of Standard Biological Parts, OpenWetWare, and Wikipedia. To view the WikiText in HTML mode, type Ctrl-P. To go back to edit mode, either double click or type Ctrl-E.
12. In addition to normal MediaWiki syntax, Clotho WikiText supports some extra formatting tags. For example, we can mark direct quotes from the paper using <claim> tags. This results in a color block behind the tag-enclosed text. You can use tags to highlight specific statements from the paper or specific experimental information about
Bing Xia et al.
their system. A <claim> tag may be followed by an <evidence> tag. This tag is intended for providing information supporting a claim.
13. WikiText also supports the addition of images. There are two ways to add images to a WikiText in edit mode. You can drag and drop a PNG file into the editor window, which will insert the tag at the end of your text. You can also copy an image from other sources like a Web page or a PDF file and paste it into the editor by typing Ctrl-Shift-V into the editor. Images require a name; enter it and click OK. The inserted image tag can be moved around the WikiText.
14. Like images, WikiText objects accept a variety of other objects such as Excel spreadsheets. They can be embedded into the WikiText just like images and result in a link to the file. Clicking the link launches the independent application registered for handling an object of that type (e.g., Excel for Excel spreadsheets). You can also directly edit the WikiText objects in Factoids from the Note Editor by double clicking on them. A Note also has a WikiText area editable by double clicking.
15. You can save changes by choosing “Save Everything” under the “File” menu. The final note will look similar to the example shown in Fig. 5.13.
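To illustrate how claim and evidence tags might be consumed by a tool, the sketch below pulls tagged statements out of a WikiText string. It assumes paired opening and closing tags of the form `<claim>...</claim>` and `<evidence>...</evidence>`; the closing-tag syntax and the function name `extract_claims` are assumptions, not part of the Clotho API.

```python
import re

# Assumed syntax: <claim>text</claim>, optionally followed by
# <evidence>text</evidence>. Clotho's real tag grammar may differ.
CLAIM_WITH_EVIDENCE = re.compile(
    r"<claim>(?P<claim>.*?)</claim>"
    r"(?:\s*<evidence>(?P<evidence>.*?)</evidence>)?",
    re.DOTALL,
)


def extract_claims(wikitext):
    """Return (claim, evidence) pairs; evidence is None when absent."""
    pairs = []
    for match in CLAIM_WITH_EVIDENCE.finditer(wikitext):
        evidence = match.group("evidence")
        pairs.append((
            match.group("claim").strip(),
            evidence.strip() if evidence is not None else None,
        ))
    return pairs
```

A claim that is not immediately followed by an evidence tag is still returned, paired with `None`, so unsupported claims are easy to spot.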
3.10. Using the right-click menu
Much of the functionality provided by Clotho is encoded in its Apps. The right-click menu, however, presents operations that are implemented in the core of the Clotho API and are therefore inherited by many objects in the Clotho data model. Because of its centrality in the Clotho architecture, we describe the right-click menu capabilities below. The menu shown in Fig. 5.14 offers the following capabilities:
Save to database. The default behavior for most Apps is to only make local changes to an object. An App ought to provide a means to save its changes to the database, but this action can also be invoked from the right-click menu.
Delete. The delete action, which removes an object from the database, is currently unimplemented in Clotho v2.0, and has no effect.
Update. This will update the data in an object to include any recent changes in the database. If your information is in conflict with someone else’s changes, it will warn you.
Revert. This will replace a local object with the version saved in the database. It will remove any unsaved changes you made to the object.
Undo. Every change you make to an object is tracked by Clotho’s Core. If you choose undo, it will revert to the previous version of the object. You can undo all your recent changes for the session through successive undo operations.
Redo. You can undo your undos with redo. You can redo your modifications successively through multiple redo operations.
Figure 5.13 Adding an image to a Note. Notes can contain Factoids. Factoids have WikiText that lets you add images to them. Here, you see a complete Note and Factoid complete with text and images.
Copy to clipboard. The core stores one object at a time on its clipboard. Use this to put the object there. It will also put some relevant data on your system clipboard. For example, if you copy a part, the sequence of the part can then be pasted into a word processor.
Paste from clipboard. You can paste whatever is on Clotho’s clipboard. This operation will function only when the item being pasted onto accepts the type of object being pasted. Any object can be pasted onto a Collection;
Figure 5.14 Using the right-click menu. The right-click menu is available in many Apps and appears when you right click on an object. Selection of a command from the popup menu then performs a function on the clicked object. Some of these functions (e.g., delete) are not fully functioning in Clotho v2.0.
Factoids can be pasted onto Notes; and Notes can be pasted onto Features, Families, or Strains, to name a few valid operations.
Export XML. Create an XML representation of the object. This can be used for external tools or data exchange.
Search Tags. All Clotho objects can be linked to search tags that later can be queried from the database. From the right-click menu, you can add or remove these search tags.
Preferred Viewer. View the object in the currently set Viewer for this object type.
Choose Viewer. Select from the list of Viewers available for this object type. Upon selection, the Viewer will launch on the selected object.
The right-click menu is particularly useful when viewing a Collection.
1. Open a Collection with the Collection View App.
2. Select one of the objects in the Collection View list.
3. Right click on a Part. A list of options should appear. Navigate to and click on “Preferred Viewer.” This should open a viewer that will allow you to see more information about the part.
4. You can also use the right-click menu to “Copy to clipboard” objects such as Parts or Features. These can be pasted into the WheelBarrow App window. Objects in the WheelBarrow can be dragged back into a Collection, possibly belonging to a different user. In this way, one can migrate objects from one user or Collection to another using the right-click menu and the WheelBarrow.
For more information on the Collection view and WheelBarrow, see http://wiki.bu.edu/ece-clotho/index.php/Collection_View and http://wiki.bu.edu/ece-clotho/index.php/Wheelbarrow.
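The Undo, Redo, Save, and Revert operations of the right-click menu can be pictured as two stacks of object versions plus a saved copy. The sketch below is only an illustration of that idea; the class `ObjectHistory` is hypothetical and does not reflect how Clotho's Core stores changes. Two behaviors are assumptions: a fresh change clears the redo stack, and Revert clears the history.

```python
class ObjectHistory:
    """Hypothetical change tracker for one object (not Clotho's Core)."""

    def __init__(self, saved_value):
        self._saved = saved_value   # stands in for the database version
        self._undo = []             # earlier local versions
        self._redo = []             # versions that were undone
        self.current = saved_value

    def change(self, new_value):
        self._undo.append(self.current)
        self._redo.clear()          # assumption: a new edit discards redos
        self.current = new_value

    def undo(self):
        if self._undo:
            self._redo.append(self.current)
            self.current = self._undo.pop()

    def redo(self):
        if self._redo:
            self._undo.append(self.current)
            self.current = self._redo.pop()

    def save(self):
        self._saved = self.current  # "Save to database"

    def revert(self):
        # Replace the local object with the saved version.
        self.current = self._saved
        self._undo.clear()          # assumption: history is dropped
        self._redo.clear()
```

Successive calls to `undo()` walk back through the session's changes one version at a time, and `redo()` walks forward again, mirroring the menu descriptions above.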
3.11. Predicting PCR products
1. To start, double click PCR Predictor from the Dashboard.
2. When running a PCR prediction, you can retrieve Parts, Vectors, Oligos, Features, or Plasmids by name to serve as the template for the PCR reaction by typing the name into the Template box in the PCR Predictor App.
3. You can similarly choose the Oligos by typing their names into the For and Rev boxes. Alternatively, you can copy and paste from another program or type in the DNA sequences. When you have all three fields on the left filled, click calculate. The PCR product will be displayed. If nothing occurs, check your Oligos and template sequence.
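The idea behind a PCR prediction can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm used by the PCR Predictor App: it assumes a linear template, exact full-length matches of both oligos, and no 5' extensions on the oligos. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def predict_pcr_product(template, forward, reverse):
    """Return the predicted amplicon, or None if an oligo does not anneal.

    Simplifying assumptions: linear template, exact matching only,
    forward oligo written 5'->3' on the top strand, reverse oligo
    written 5'->3' on the bottom strand.
    """
    template = template.upper()
    forward = forward.upper()
    reverse = reverse.upper()

    start = template.find(forward)
    if start == -1:
        return None

    # On the top strand, the reverse oligo site reads as its reverse complement.
    reverse_site = reverse_complement(reverse)
    site_start = template.find(reverse_site, start)
    if site_start == -1:
        return None

    return template[start:site_start + len(reverse_site)]
```

A return value of `None` corresponds to the case in step 3 where nothing is displayed and the Oligos and template sequence should be checked.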
3.12. Connecting to a database
We now describe the process of connecting to a remote database. This is not needed for the “Clotho Starter” version and is provided only for those using the “Power User” version. Clotho operates on biological data through Apps. Clotho organizes this biological data in a database, and this database can be either a local database or a remote database. The Starter version (which we have used in our discussion so far) works with a local database. The local database is stored as a simple flat file on the same storage device as Clotho, requires no configuration, and, although fully functional, is only intended to be used in illustrative exercises. The remote database is intended for advanced users who wish to share their data with other users distributed across different locations. Clotho needs to be configured to connect to a remote database, and we describe this procedure below.
1. Start Clotho and open the “Plugin Manager” by clicking on “Manage Plugins” on the Dashboard or double clicking on the “Plugin Manager” app. (Future versions of Clotho may differ here depending on the Plugin Manager provided.)
2. Navigate to the “Manage Database” icon in the “Plugin Manager” and click on it.
3. Select “Configurable Connection” and then click “Set as default.” This prepares Clotho to connect to a remote database, instead of interacting with a local file.
4. Clotho does, however, need to know the location of the remote database. Choose the “Configure Connection” App from the Clotho Dashboard and fill in the following data:
   a. Server: the address of the remote database (this could be a URL or an IP address).
   b. Database: the name of the remote database (e.g., LabDatabase1).
   c. Login: the login name for the remote database.
   d. Password: the password for logging into the remote database.
5. Click save changes; the window will disappear. (If you do not know the details of any remote database, you can leave the data filled in by default as it is; this will connect Clotho to a default remote database. You can learn how to set up a remote database from the Clotho help Web site under the “Database setup” section.)
6. To connect to the remote database, click the Clotho logo in the upper left of the Dashboard. Clotho will indicate that it is attempting to connect to the remote database, and upon success, the small circle at the bottom of the Dashboard (above the search box) should turn green.
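The four fields requested in step 4 can be thought of as a small settings record. The sketch below shows one way to represent and sanity-check such a record before attempting a connection; the class `ConnectionSettings` and the example server address are hypothetical, and Clotho stores these settings through its own “Configure Connection” App rather than through code like this.

```python
from dataclasses import dataclass


@dataclass
class ConnectionSettings:
    """Hypothetical record of the four fields from step 4."""
    server: str     # URL or IP address of the remote database
    database: str   # e.g., "LabDatabase1"
    login: str
    password: str

    def missing_fields(self):
        """Return the names of fields that were left blank."""
        return [name for name, value in vars(self).items()
                if not value.strip()]


# Example with a made-up server address and an empty login field.
settings = ConnectionSettings("db.example.org", "LabDatabase1", "", "secret")
```

Here `settings.missing_fields()` reports the blank login, the kind of incomplete entry that would prevent a successful connection.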
3.13. Common errors
Setting up Clotho for developers
Incorrectly naming the folders into which the code is checked out. This will cause a problem during the build process if you are using the existing Netbeans projects as this tutorial suggests.
Not adding the Dashboard Module to the ClothoProject. This will result in nothing showing up when the tool is run. This will occur if you remove this module from your Netbeans project.
Connect to database
Not typing in the correct information into the “Configure Connection” tool.
Not being connected to the internet when trying to connect to a database.
Not clicking on the Clotho logo to actually connect.
Already being connected to a database when you try to connect (this would happen if you have connected before and are trying to establish a new connection; you must restart Clotho and then try again).
Adding Institutions, Labs, and Users
You cannot add a PI to the Lab until that person has been created.
Adding Parts
Not saving the basic Parts before you try to make a composite Part with them.
Not typing in the exact Nicknames of the basic Parts when trying to make a composite.
Trying to create basic Parts which do not adhere to the Format requirements.
Trying to create a composite Part with Parts of incompatible Formats.
Adding Vectors
Failing to adhere to the standards of the Format will prevent the creation of the Vector.
Adding Plasmids
Using a Part and Vector that are not compatible types.
Adding Factoids
Typing in incorrect name information when attempting to link to another object.
3.14. Biosafety check
The powerful design Apps in Clotho allow the user to design any biological object, including components that may be dangerous to researchers or the community. Currently, there is no community standard for describing genetic compositions in a technically rigorous way, though the NIH Guidelines present a rubric for ascertaining some types of risk. Though much additional discussion by the synthetic biology community is needed to establish technical standards for describing risk, Clotho 2.0 implements a simplified biosafety framework for evaluating, recording, and presenting the risks associated with biological primitives that include virulence factors relevant to animal or plant health (NIH Guidelines, http://oba.od.nih.gov/rdna/nih_guidelines_oba.html). Figure 5.15 provides an overview of the biosafety framework. All composition objects in Clotho (Vectors, basic and composite Parts, basic and composite Strains, Plasmids, Features, and Families) have the risk group field for assigning these flags. For Parts, Vectors, and Features, these risk groups are automatically assigned upon their creation. Clotho BLASTs the sequence against a databank of known virulence factors and returns the risk group of the organism that contained the gene with the highest match. This framework ensures that every DNA sequence in Clotho has a risk group value associated with it. Clotho’s Strain
Figure 5.15 An overview of Clotho’s biosafety protocol. When any Clotho App generates a new object with a NucSeq (e.g., a Part), the process of creating that object triggers a BLAST search against known virulence factors. The result of that search is returned and stored along with the NucSeq object as its “Risk Group.” Risk group is then examined in all subsequent operations involving this object such as composite Part or Plasmid generation.
and Family fields similarly contain risk groups, but currently they must be assigned manually. In addition, when compositions such as Plasmids, composite Parts, or composite Strains are created, their risk group is calculated from their components. In each case, the calculated risk group is the maximum of its components. For example, a Plasmid composed of an RG1 Vector and an RG2 Part would be an RG2 Plasmid. A composite Strain composed of an RG1 basic Strain and an RG2 Plasmid would be an RG2 Strain. In this way, risk groups are recursively passed through the acts of composition to help guarantee that the inclusion of some virulence factor at an earlier stage of development is not overlooked in future stages. Similarly, a Family can contain a minimal risk group, and any Feature that belongs to that Family will inherit the elevated risk group of the Family. In all cases, the risk group of a Clotho object can be raised if a user is aware of specific dangers associated with that component. However, the risk group can never be lowered to a value inconsistent with the BLAST-generated result or the value calculated from its composition. If more than one component of a composition contains a risk group higher than 1, the user is alerted that they are potentially generating a dangerous object.
Additionally, risk group 4 components (from CDC rated RG4 organisms or viruses) and Select Agents (denoted as RG5 in Clotho) carry special warnings whenever such objects are created. A significant aspect of this feature is that the Clotho API enforces this check. Whenever any software tool makes a genetic composition, Clotho checks it before it is saved to the database.
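The risk group rules described in this section are simple enough to state in code. The following Python sketch restates them under the assumption that risk groups are plain integers from 1 to 5 (with 5 standing for Select Agents, as in Clotho); the function names are hypothetical and are not part of the Clotho API.

```python
def composite_risk_group(component_groups):
    """A composition takes the maximum risk group of its components."""
    return max(component_groups)


def needs_multi_component_alert(component_groups):
    """True when more than one component has a risk group above 1."""
    return sum(1 for rg in component_groups if rg > 1) > 1


def set_risk_group(calculated, requested):
    """A user may raise the risk group but never lower it below the
    BLAST-generated or composition-derived value."""
    if requested < calculated:
        raise ValueError("risk group cannot be lowered below the calculated value")
    return requested


def carries_special_warning(risk_group):
    """RG4 components and Select Agents (RG5 in Clotho) carry warnings."""
    return risk_group >= 4
```

For the Plasmid example in the text, `composite_risk_group([1, 2])` gives 2, and applying the same function again to an RG1 basic Strain and that RG2 Plasmid gives an RG2 composite Strain, which is the recursive propagation described above.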
4. Concluding Remarks
We presented a broad overview of Clotho for users and developers. We hope that this will encourage its use and growth. This tutorial is not intended to be exhaustive; we invite you to keep abreast of Clotho via http://www.clothocad.org and http://www.clothohelp.org. Both of these sites will be updated regularly and should prove a valuable resource for up-to-date material on Clotho.
ACKNOWLEDGMENTS
The authors would like to thank the members of the UC Berkeley 2008 iGEM team (Anne Van Devender, Matthew Johnson, Nade Sritanyaratana); the UC Berkeley 2009 iGEM team (Lesia Bilitchenko, Joanna Chen, Adam Liu, Richard Mar, Thien Nguyen, Nina Revko); Josh Kittleson, Michal Galdzicki, Tim Hsiau, Tim Ham, Carlos Olguin, and Cesar Rodriguez for their help and support in getting Clotho development, infrastructure, and publicity to the point at which it is today. Financial support was provided by SynBERC NSF ERC, Autodesk, and Quintara Biosciences.
REFERENCES
Anderson, J. C., et al. (2010). BglBricks: A flexible standard for biological part assembly. J. Biol. Eng. 4(1), 1.
Boldt, J., and Müller, O. (Eds.). (2009). What’s in a name? [News feature]. Nat. Biotechnol. 27(12), 1071–1073.
Densmore, D., et al. (2009). A platform-based design environment for synthetic biological systems. The Fifth Richard Tapia Celebration of Diversity in Computing Conference: Intellect, Initiatives, Insight, and Innovations, Portland, Oregon, ACM.
Engler, C., Kandzia, R., and Marillonnet, S. (2008). A one pot, one step, precision cloning method with high throughput capability. PLoS One 3(11), e3647.
Keutzer, K., et al. (2000). System-level design: Orthogonalization of concerns and platform-based design. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 19(12), 1523–1543.
Purnick, P. E., and Weiss, R. (2009). The second wave of synthetic biology: From modules to systems. Nat. Rev. Mol. Cell Biol. 10(6), 410–422.
Smolke, C. D. (2009). Building outside of the box: iGEM and the BioBricks Foundation. Nat. Biotechnol. 27(12), 1099–1102.
CHAPTER SIX
SynBioSS-Aided Design of Synthetic Biological Constructs
Yiannis N. Kaznessis

Contents
1. Introduction
2. SynBioSS Components
3. Simulations of Biologic AND Gates
4. Advantages and Disadvantages of SynBioSS
5. Concluding Remarks
Acknowledgments
References
Abstract
We present walkthrough examples of using SynBioSS to design, model, and simulate synthetic gene regulatory networks. SynBioSS stands for Synthetic Biology Software Suite, a platform that is publicly available with Open Licenses at www.synbioss.org. An important aim of computational synthetic biology is the development of a mathematical modeling formalism that is applicable to a wide variety of simple synthetic biological constructs. SynBioSS-based modeling of biomolecular ensembles that interact away from the thermodynamic limit and not necessarily at steady state affords a theoretical framework that is generally applicable to known synthetic biological systems, such as bistable switches, AND gates, and oscillators. Here, we discuss how SynBioSS creates links between DNA sequences and targeted dynamic phenotypes of these simple systems.
Department of Chemical Engineering and Materials Science, and Digital Technology Center, University of Minnesota, Minneapolis, Minnesota, USA
Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00006-1. © 2011 Elsevier Inc. All rights reserved.

1. Introduction
Synthetic biology is a quest to engineer new functions in living organisms. The scale of these new functions varies, from autonomous logical and informational architectures, such as bistable switches, oscillators,
and logic gates (Alon, 2003; Anderson et al., 2007; Andrianantoandro et al., 2006; Drubin et al., 2007; Elowitz and Leibler, 2000; Gardner et al., 2000; Kærn et al., 2003; Lutz and Bujard, 1997; Ramalingam et al., 2009; Tigges et al., 2009), to modular cascades of metabolic reactions (Fung et al., 2005; Ro et al., 2006; Zhang et al., 2010), to engineered ecosystems of multicellular systems (Basu et al., 2004; Bulter et al., 2004; You et al., 2004), all the way up to minimal genomes and whole synthetic cells (Gibson et al., 2008, 2010; Glass et al., 2006; Pennisi, 2010). Synthetic biology may be viewed as the flip side of systems biology: synthetic biology is a forward engineering approach, whereas systems biology is a reverse engineering one. The former attempts to assemble components into a new whole. The latter attempts to capture the behavior of existing biological systems in a holistic way. Their paths are complementary: systems biology generates information on components and interactions that can be used in synthetic biology applications. Synthetic biology can be employed to probe mechanisms and provide mechanistic insight on how phenotypic complexity emerges from interacting molecules. Importantly, their ultimate goals overlap: efforts in both synthetic and systems biology are aimed at understanding and engineering the complexity of biomolecular systems. Indeed, the goal to provide mechanistic explanations of complex biological phenomena in terms of biomolecular interactions is commonly shared by both systems biology and synthetic biology. As a forward engineering approach, synthetic biology may benefit from mathematical models. Modeling can assist synthetic biology the same way modeling helps in aircraft engineering or architecture design: models and computer simulations can relatively quickly provide a clear picture of how different components influence the behavior of the whole.
They can provide mechanistic insight that may guide design choices and engineering implementations. In other words, they may help in probing the relationship between DNA sequences designed by the synthetic biologists and the observed behavior of the synthetic biological system. An important challenge in computational synthetic biology is thus to derive mathematical modeling formalisms that are fit for analysis and design of synthetic constructs. Numerous software packages have been developed to address this challenge, such as CellDesigner (Funahashi et al., 2003), GenNetDes (Rodrigo et al., 2007), COPASI (Hoops et al., 2006), and TinkerCell (Chandran et al., 2009) among others (de Jong, 2002; Kaznessis, 2007; Marchisio and Stelling, 2008). One important and widely used viewpoint associates biological systems with computer programs: the DNA sequence encodes a set of instructions which is implemented in series and produces well-prescribed outputs. Although useful as a metaphor, by viewing synthetic constructs as sets of serial instructions, in a vein similar to computer programs, one resorts to assumptions regarding the system that may not be pertinent, or accurate.
Models of Synthetic Biological Systems
Important assumptions may be the following: (1) the system is context-free and environment-independent. That is, the system behaves in precisely the same manner, regardless of the biological context, or the cellular and extracellular environments; (2) the system is at equilibrium, or at best, at a steady state. In other words, time is not an independent variable and instead of time derivatives, simple algebraic equations may describe the system outputs as a function of system inputs; (3) the system behavior is determinate. Noise, whether intrinsic to the system, or extrinsic due to environmental factors, does not influence the system. Epigenetic studies pose hard questions regarding the validity of the first assumption (Costa, 2008; Russo et al., 1996), where observed phenotypic variation is ascribed to nongenetic factors. The second assumption is arguably of limited use in synthetic biology, where the goal is to engineer time-dependent responses of outputs, such as the flipping of a bistable switch, to well-defined time-profiles of inputs. The third assumption has been unequivocally proven wrong for biomolecular systems, with the study of cellular populations (Bagh et al., 2008; Blake et al., 2003; Rosenfeld et al., 2005). Instead of viewing synthetic biological constructs as computer programs, an alternative is to view them as soups of chemicals. These chemicals interact according to physicochemical principles dictated by statistical mechanics. When at equilibrium, they interact minimizing the free energy of the system. When away from the thermodynamic limit, they interact in a probabilistic manner, with time a dominant independent variable. These chemicals and their interactions may be subject to their environment and to the biological context of the organism that carries them. We have developed the Synthetic Biology Software Suite to implement this modeling approach (Hill et al., 2008; Weeding et al., 2010).
An important aim for the continued development of SynBioSS is the link between synthetic DNA sequences and targeted biological functions. Herein, we present examples of using SynBioSS to build models of synthetic biological constructs, conduct computer simulations to study the dynamic behavior, and guide the experimental construction and testing of modular logical architectures in bacteria.
2. SynBioSS Components
There are three components in SynBioSS: Designer, Wiki, and Simulator (Fig. 6.1). With SynBioSS Designer, gene network models are created automatically after the user enters molecular components and their
Figure 6.1 Schematic representation of SynBioSS components.
relationships. SynBioSS Designer is a web-based tool, available at http://www.synbioss.org, with a user-friendly interface which uses biological rules to build a network of biomolecular interactions. The software automatically generates a kinetic model from a construct composed entirely of biological “parts,” such as promoters and terminators. While these parts can be hypothetical, chosen at will by the user, Designer is especially efficient at creating models for devices composed of BioBrick parts. BioBricks are synthetic DNA sequences catalogued in the Registry of Standard Biological Parts, a repository of synthetic biological constructs (Weeding et al., 2010). A BioBrick standard biological part has “a nucleic acid-encoded biological function (e.g., turn on/off gene expression), along with associated information defining and describing the part” (Shetty et al., 2008). The sequential ordering of these BioBricks (or “bricks”) therefore describes a sequence of DNA by its intended function within a cell. SynBioSS Designer now has a database that is populated using information extracted from the official Parts Registry, but organized in a way that is machine-readable, allowing for structured queries. At present, this data is hosted locally at the Minnesota Supercomputing Institute. Designer has a tabbed interface, making the complete sequence of BioBricks visually accessible and easily manipulated. Clicking on a tab
pulls up properties of that individual brick and allows the user to add, edit, and delete said properties. Properties are also easy to edit; clicking directly on an editable field causes a text input field or drop-down menu to appear, allowing the user to make appropriate changes. A user can enter biological components, including BioBricks, in SynBioSS Designer, and receive as an output a file with a reaction network that models the synthetic construct. Every reaction in the model has a corresponding kinetic rate that describes the rate of association of its reactant molecules and the formation or destruction of any covalent bonds or stable noncovalent interactions. SynBioSS Wiki has been specifically created to store and recall just this sort of kinetic data. SynBioSS Wiki has two components: (i) a web interface based on the MediaWiki package and (ii) a database for storing molecular components, their interactions, and pertinent biological information. SynBioSS Wiki goes beyond the MediaWiki software in storing kinetic information in a structured (and therefore machine-searchable) format. The database of kinetic constants is easily searchable by participating species, reaction type, etc. Users can search or browse the Web site and select reactions to interactively build a model that can be exported in SBML format. Each kinetic constant entered in the database is correlated with a reference field in the database as well as type-specific reference information (PDB ID for proteins, CAS ID for small molecules, PubMed ID for everything, etc.). Given the vast and varied nature of biochemical reaction data, no single person or research group is best suited for the task of curating such a database, thus necessitating this distributed approach, in spite of the accompanying challenges faced by any open Wiki approach, such as Wikipedia. To avoid abuse and vandalism, SynBioSS users are asked to register with a valid email address in order to make changes.
The third component, the SynBioSS Desktop Simulator, is a package that is currently available for Windows platforms. It includes multiscale algorithms appropriate for modeling reaction networks with multiple time scales away from the thermodynamic limit (Canton, 2008; Salis and Kaznessis, 2005a,b; Sotiropoulos and Kaznessis, 2008; Sotiropoulos et al., 2009). SynBioSS Desktop has a Windows GUI, which can be used for constructing and editing gene network models, choosing simulation parameters, and conducting numerical simulations. Version 1.0.2 of SynBioSS is currently available with Open Licenses on http://www.synbioss.org. SynBioSS has evolved from HySSS (Salis et al., 2006), a software package for modeling reaction networks, which has a Matlab GUI and FORTRAN codes available for UNIX and Linux platforms. HySSS is also available with Open Licenses at hysss.sourceforge.net. In what follows, we will discuss the process of building a model of a synthetic gene network, conducting simulations, and guiding the design of synthetic biological systems.
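To give a flavor of what simulating a reaction network "in a probabilistic manner" means, the sketch below applies the plain Gillespie direct method to a two-reaction birth and death model of a single protein. This is a deliberately simpler, swapped-in technique: it is not the hybrid multiscale algorithm implemented in SynBioSS Desktop, and the function name and rate constants are hypothetical.

```python
import random


def gillespie_birth_death(k_make, k_deg, t_end, seed=0):
    """Stochastic trajectory of a protein count n under two reactions:
    production (propensity k_make) and degradation (propensity k_deg * n).

    Plain Gillespie direct method; illustrative only.
    """
    rng = random.Random(seed)
    t, n = 0.0, 0
    trajectory = [(t, n)]
    while True:
        a_make = k_make
        a_deg = k_deg * n
        a_total = a_make + a_deg
        if a_total == 0:
            break                        # nothing can fire
        t += rng.expovariate(a_total)    # waiting time to the next event
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_make:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory
```

Each run with a different seed gives a different trajectory, which is the kind of noise that a deterministic, steady-state description of the same two reactions would not show.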
3. Simulations of Biologic AND Gates
As a demonstration, we present the steps in Designer for building and simulating a logic AND gate, a synthetic biological construct we have also built and tested experimentally (Ramalingam et al., 2009). The desired behavior of the AND gate gene network is to produce green fluorescent protein (GFP) if and only if two signal molecules are present. To achieve this, we have constructed a promoter sequence both in vivo and in silico that combines elements from the tetracycline operon and the lactose operon in prokaryotes (Ramalingam et al., 2009). The synthetic DNA promoter sequence consists of three operator sites, each approximately 20 base pairs in length, placed sequentially and adjacently upstream of the gene coding for GFP. In an ideal situation, if any of the operator sites are occupied, RNA polymerase cannot bind to the promoter region, and GFP is not produced. Operator sites can be selected so as to bind TetR protein (T), LacI protein (L), or neither (N). Additionally, TetR can be bound by the small molecule inducer anhydrotetracycline (aTc), and LacI can be bound by isopropyl β-D-1-thiogalactopyranoside (IPTG). In this induced state, both TetR and LacI undergo conformational changes that cause them to unbind from their respective DNA operator sites. An AND gate can be constructed by selecting a promoter region containing both T and L operator sites: only in the presence of both aTc and IPTG are both TetR and LacI induced, causing them to unbind from the promoter region, thereby allowing RNA polymerase to bind, resulting in the expression of the GFP reporter. If only one of the two inducers is present, the uninduced repressor protein will remain bound, and GFP expression will be repressed. This behavior is shown graphically in Fig. 6.2. The following few steps will result in a model of the AND gate:
1. Go to www.synbioss.org and click on Designer.
2. Enter parts in order (i.e., Promoter → RBS → DNA → Terminator). From a drop-down menu, characterize each of these parts, one at a time (see Fig. 6.3 for screenshots of Designer during these steps). These parts may be user-defined, or existing BioBricks. In this example, we will combine both, starting with promoter K091101 (a promoter dually repressed by TetR and LacI), adding a user-defined ribosome binding site, adding protein E0040 (the reporter gene of GFP), and finally adding BioBrick terminator sites.
3. Provide the name of the protein for each coding DNA region (e.g., Registry part E0040 is GFP), add and characterize other proteins (activator; repressor [TetR and LacI]; reporter [GFP]; enzyme; other). There is no need to use Registry names. Any arbitrary name will be sufficient.
4. Specify promoters as constitutively ON or OFF (all ON in this example).
[Figure 6.2 panels: “OFF state” and “ON state.” Labels: TetR, LacI, TetO, LacO, –35, –10, gfp, IPTG, Tetracycline, GFP.]
Figure 6.2 Schematic representation of the synthetic logic-AND gate promoter.
5. If the operators were not prespecified in the promoter BioBrick, in this step add operators to promoters and specify their relative position. This is an important step for defining regulatory relationships. The synthetic AND-gate promoters have tetO and lacO, and are dually repressed by TetR and LacI.
6. Enter any proteins constitutively expressed. In the AND-gate example, these are TetR and LacI.
7. Specify protein oligomeric structure (monomer [GFP, RFP], dimer [TetR2], tetramer [LacI4]).
8. Specify where transcription factors bind (TetR2-tetO; LacI4-lacO).
9. Enter any relevant effector molecules (e.g., inducers) present in the system (aTc and IPTG).
10. Specify how many times each effector binds to a protein (two aTc can bind to TetR2; four IPTG to LacI4).
Finally, the user can click on a button to generate a reaction network with all the interactions and save the reaction network in SBML or NetCDF file format. Designer generates the reactions with default kinetic constants. These are taken from known biomolecular interactions, stored in SynBioSS Wiki, and applied to the various interaction types (e.g., RNAp binding on promoters, ribosome binding on RBS, protein dimerization, protein-operator, protein-effector). We have stored this file along with other example reaction networks in http://synbioss.sourceforge.net/simulator/examples/. The files are ready to upload on SynBioSS Desktop to run numerical simulations. The user can also upload the file on SynBioSS Wiki and carefully check all the reactions and parameters, searching in the Wiki
144
Yiannis N. Kaznessis
Figure 6.3 Screenshots of consecutive SynBioSS Designer web-pages depicting user inputs of a synthetic logic-AND gate. Top: the first step is to add the components of the synthetic sequence, either by searching for BioBricks or by adding user-defined components. Bottom left: in the second step, regulatory proteins are added as needed and protein–DNA binding events defined. Bottom right: Effector molecules are added and protein-effector interactions defined.
database for available information on any interaction. If there is no available information, the user can choose to retain the default value entered or conduct simulations over a range of parameter values for a sensitivity analysis. Uploading the SBML or NetCDF file with the reaction network in SynBioSS Desktop also allows the user to visually and carefully check the reactions, the kinetic constants, and the initial conditions and run numerical simulations (Fig. 6.4 shows a screenshot of the reaction network editor of SynBioSS Desktop.). Simulating gene regulatory networks is now simple. With the third component of SynBioSS, the Desktop Simulator, a user can run sophisticated numerical simulations of complex reaction networks quickly and seamlessly
on a PC. SynBioSS Desktop Simulator can be downloaded as an installation executable for Windows. The steps are:
1. Go to www.synbioss.org.
2. Click on “Simulator” in the upper left corner. This will take you to http://synbioss.sourceforge.net/simulator/.
3. Click on “Download” in the middle of the webpage. This will take you to the SourceForge file directory.
4. Click on SynBioSSDSInstaller-1.0.2.exe. This downloads the installation executable to your computer.
5. Run the executable. This will install the current version of SynBioSS on your computer.
6. Open the Start Menu and click the SynBioSS icon. This will launch SynBioSS.
Figure 6.4 Screenshot of SynBioSS Desktop. With a graphical user interface, users can manipulate and simulate reaction network models generated by SynBioSS Designer.
More than 60 reactions make up the network of components used to simulate AND gates. All reactions are modeled as initially occurring in a well-mixed volume of 10⁻¹⁵ L, which represents a cell. Cell growth is handled by allowing the reaction volume to double over a period of time (60 min on average), followed by an instantaneous halving of the volume to represent cytokinesis. In our earlier work (Ramalingam et al., 2009), we described the simulation results in detail. Importantly, we experimentally constructed and tested six different AND-gate designs, shuffling the operator positions in the promoter: TTL, TLT, LTT, LLT, LTL, and TLL. Each of these promoters was cloned in the backbone plasmid pGlow (Invitrogen) and transformed into a DH5αPRO E. coli strain, which
constitutively expresses TetR and LacI from the chromosome. In vivo GFP fluorescence was measured using a Becton Dickinson FACSCalibur flow cytometer. Details on materials and methods can be found in Ramalingam et al. (2009). To compare the simulation with the experimental results, we determined the average number of GFP molecules per cell at 6 h, averaging over 1000 stochastic simulation trajectories, and the average fluorescence strength at 6 h, averaging over 100,000 cytometry measurements. As an example, Fig. 6.5 presents the binary logic output of the TTL synthetic design, both simulated and experimentally measured, for the grid of 36 aTc/IPTG concentration pairs. Let us first focus on the experimental results (Fig. 6.5, right panel). A high-fidelity logic AND gate will have high GFP expression levels only at high concentrations of both aTc and IPTG. It is clear that the TTL biological gate is not of perfect digital fidelity. We find that this is the case for all tested designs, because of leakiness of the promoters. This was expected, since a biological dynamic response cannot be absolutely binary because of thermal noise. Although there are discernible differences between the modeling and the experimental results, the models generally capture the experimentally observed behavior well. The models can then explain the emergence of synthetic biological phenotypes in terms of biomolecular interactions that follow the molecular biology dogma and obey statistical thermodynamics. It is important to note that a single model with 63 reactions captured the dynamic behavior of GFP distributions for six designs and 36 aTc-IPTG
concentrations. The only parameters that changed between designs were the kinetic constants of the leakiness reactions. These were modeled as RNA polymerase binding to the promoter and initiating transcription even when the promoter was occupied by repressor molecules bound to their cognate operators, as discussed in Ramalingam et al. (2009). What this study illustrated was the need for bidirectional passing of information between the models and the experiments. As we stress in Ramalingam et al. (2009), the first models we constructed did not include leakiness dependent on promoter topology. After the first set of experiments, it became clear that reactions capturing and quantifying the leakiness were required, and that design-specific values of their kinetic parameters were needed for the models to correlate with the behavior of each designed promoter. What was gained was useful insight into the importance of leakiness. And although significant computational resources are demanded, SynBioSS tools lower the barriers and streamline the process of setting up, modeling, and analyzing the AND-gate systems.
Figure 6.5 Comparison of model and experimental results for the TTL AND gate. The x and y axes form a grid of inducer concentrations: aTc (0–200 ng/ml) and IPTG (0–1 mM); the vertical axis shows relative fluorescence (0–1). The color scheme reflects the average strength of fluorescence from the experiments or the average number of GFP molecules in the simulations, scaled by the maximum strength/number of GFP molecules. In all cases, behavior is depicted 6 h after induction. The plotted model values are the means of 1000 independent stochastic kinetic simulations, whereas experimental values are the means of 100,000 FACS observations.
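The model values in this comparison are means over many independent stochastic trajectories, read off at a fixed time. As a minimal sketch of that averaging procedure (not the SynBioSS algorithms and not the 63-reaction model), the Python code below runs a Gillespie-type simulation of a hypothetical two-reaction system: leaky GFP production at a constant rate and first-order GFP loss. The rate constants `k_leak` and `k_loss`, the function names, and the time units are illustrative assumptions; cell growth and division are deliberately left out.

```python
import math
import random


def simulate_gfp(k_leak, k_loss, t_end, rng):
    """Run one Gillespie trajectory and return the GFP count at time t_end.

    Hypothetical reactions (assumptions, not taken from the chapter):
      leaky production:  0   -> GFP, propensity k_leak
      first-order loss:  GFP -> 0,   propensity k_loss * gfp
    """
    t, gfp = 0.0, 0
    while True:
        a_prod = k_leak
        a_loss = k_loss * gfp
        a_total = a_prod + a_loss
        if a_total == 0.0:
            return gfp  # nothing can happen any more
        # The waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            return gfp  # the next reaction would fire after the read-out time
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_prod:
            gfp += 1
        else:
            gfp -= 1


def mean_gfp(k_leak, k_loss, t_end, n_traj, seed=0):
    """Average the GFP count at t_end over n_traj independent trajectories."""
    rng = random.Random(seed)
    total = sum(simulate_gfp(k_leak, k_loss, t_end, rng) for _ in range(n_traj))
    return total / n_traj


def expected_mean(k_leak, k_loss, t_end):
    """Analytical mean of this birth-death process when started from zero GFP."""
    return (k_leak / k_loss) * (1.0 - math.exp(-k_loss * t_end))
```

For this toy system the simulated mean can be checked against the closed-form mean returned by `expected_mean`, which is a convenient sanity test before moving to larger networks where no closed form exists.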
4. Advantages and Disadvantages of SynBioSS
Certainly, the modeling methodology adopted in SynBioSS has numerous disadvantages:
(1) There is a dearth of quantitative information on biomolecular interactions. Such information is hard to come by because it requires time-consuming and expensive experiments that involve isolating and purifying the interacting molecules in quantities large enough to measure accurately, as well as sophisticated experimental techniques such as surface plasmon resonance. Certainly, this is not the case for the tetracycline operon, thanks to enormous efforts expended by a large community of biochemists and molecular biologists. Other systems, such as the lactose, tryptophan, and arabinose operons, have been studied thoroughly, and a lot of information on them is available in the literature. But the absence of quantitative information on biomolecular interactions in other systems will hamper efforts to use a detailed mechanistic representation of very many synthetic biological constructs.
(2) Quantitative information on biomolecular interactions is more often available in the form of equilibrium constants. In such cases, we may assume that the forward rate of binding of a large protein to its DNA or RNA binding site is diffusion-limited, use the size of the protein to calculate its forward binding kinetic constant, and then use the equilibrium data to calculate the unbinding kinetic constant. An estimate may be all that is needed to obtain useful insight into the behavior of a biological system, but ultimately only experiments can reliably provide accurate values.
(3) The mechanism of expression may not precisely follow the molecular biology dogma, or the precise steps and biomolecular interactions may not be known. For example, antisense regulation has emerged as important even in bacteria. Nonspecific binding of proteins to DNA may also be important.
(4) The context and environment of some biomolecular interactions may differ between a control experiment and an actual bacterial cell. This may change the mechanism itself or, at the very least, alter the values of the kinetic constants. For example, in some contexts the folding and maturation time of GFP can become exceptionally long. A first-order reaction mechanism may not be appropriate, and even if it were, the kinetic constant could differ from the one found in the literature. Other variations may be present and important, such as cell size variation or the variable metabolic load of a synthetic system on the cells. It is therefore important to appreciate the context and environment in which experimental measurements are made, and the transferability of mechanisms and kinetic constants.
(5) We are currently limited to bacterial species. Synthetic biology efforts are being expended in more complex organisms, such as yeast or mammalian cells. But the knowledge of biomolecular mechanisms, although far from perfect, is more complete for bacterial species. Consequently, it would be far more challenging to link phenotypic complexity to biomolecular interactions in more complex organisms, where the molecular biology mechanisms are under vigorous investigation.
On the other hand, the SynBioSS approach has important advantages:
(1) It is a general method for constructing models of synthetic biological systems and is thus applicable to any synthetic gene regulatory network. It can then be written in algorithmic form and serve as the heart of software tools that assist synthetic biologists’ designs. The method is general because molecular interactions between a transcription or translation factor and its DNA or RNA binding site are universal and context-free; that is, the kinetics of the molecular interaction remain unchanged when the binding site is moved to a different location in the DNA sequence of the same organism.
(2) It provides a detailed mechanistic picture of the dynamic behavior of biological systems. To our knowledge, this is the first attempt at the systematic modeling of all the known biomolecular interactions involved in bacterial transcription, translation, regulation, and induction. This approach certainly challenges established molecular biology, and in the absence of agreement between models and experiments it poses new questions and requires new avenues of investigation.
(3) It has a strong predictive character, enabling rational engineering of regulatable gene transcription systems. Rational design principles are expressed in terms of molecular components and the kinetics and thermodynamics of their interactions. With simply built models, alternative designs can be tested and a detailed picture can emerge of how each piece of the construct influences the behavior of the synthetic network. Sensitivity analysis and optimization can be conducted to determine key components and decide on network topologies. Computer simulations make possible exhaustive searches of different network connectivities and molecular thermodynamic/kinetic parameters, greatly advancing the development of design principles through the mapping of interaction strengths onto specific DNA mutant sequences.
(4) This approach of constructing dynamic models of all the biomolecular interactions involved in gene expression and regulation pushes the limits of computational mathematics. Because of the large number of participating species and the complexity of their interactions, only sophisticated algorithms can accurately capture dynamic gene expression in a way fit for analysis and design.
(5) This approach is well suited to synthetic biology. Although numerically challenging, it always remains tractable, not hampered by the significant size and complexity of naturally occurring biological systems. Synthetic biology modeling efforts concentrate on systems that are not overwhelmingly large, under the assumption that they are independent of the bacterial expression and metabolism machinery. Importantly, since these systems are engineered, the synthetic biologist can choose to include molecular components for which there is ample quantitative information and refine the models in a careful, context-dependent manner (Salis and Kaznessis, 2005c, 2006; Tomshine and Kaznessis, 2006; Tuttle et al., 2005).
Returning to the AND-gate example, it quickly becomes clear that there are many possible designs that could achieve the desired behavior. The order and the number of tetO and lacO operator sites can be varied; for instance, one possible promoter consists of two tetO operator sites followed by a lacO operator site (designated TTL), and another consists of two lacO operators followed by a tetO operator (designated LLT). Another possible design parameter is the particular DNA sequence of the operator sites (tetO and lacO), which can be mutated to vary the strength of binding between the repressor proteins TetR and LacI and their cognate operator sites. With mechanistic models, these changes can be incorporated and tested in a simulation before the experiments begin.
(6) This approach also pushes the limits of quantitative biology, motivating the collection and use of quantitative information on molecular mechanisms, biomolecular interactions, and their kinetic and equilibrium constants that is currently scattered throughout the literature. For components of well-studied systems, such as the tetracycline, lactose, and arabinose operons, that are widely used as parts in synthetic biological systems, we are also building software tools that help the synthetic biology community collect this information in a publicly available repository.
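Several points in this section turn on kinetic and equilibrium constants, in particular the estimate of an unbinding constant from equilibrium data once a diffusion-limited forward constant has been assumed. A minimal worked sketch of that arithmetic is given below. It assumes the equilibrium data are expressed as a dissociation constant K_d, so that k_off = k_on × K_d (for an association constant K_a one would use k_off = k_on / K_a instead). The function names are hypothetical, and any numerical values passed in are illustrative rather than measured. The second function converts a bimolecular constant in M⁻¹ s⁻¹ into a per-molecule-pair constant for a stochastic simulation in a given reaction volume, such as the 10⁻¹⁵ L cell volume used in the AND-gate simulations.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def unbinding_constant(k_on, k_d):
    """Return k_off in 1/s from k_on in 1/(M*s) and a dissociation constant K_d in M."""
    return k_on * k_d


def stochastic_binding_constant(k_on, volume_l):
    """Return the per-molecule-pair binding constant in 1/s.

    Converts a macroscopic bimolecular constant k_on in 1/(M*s) for two
    distinct species reacting in a well-mixed volume of volume_l liters.
    """
    return k_on / (AVOGADRO * volume_l)
```

With an assumed k_on of 1e8 M⁻¹ s⁻¹ and an assumed K_d of 1e-9 M, `unbinding_constant` gives about 0.1 s⁻¹, and `stochastic_binding_constant(1e8, 1e-15)` gives about 0.166 s⁻¹ per molecule pair.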
5. Concluding Remarks
Synthetic biology has all the characteristic features of an engineering discipline: applying technical and scientific knowledge to design and implement devices, systems, and processes that safely realize a desired objective. Mathematical modeling has always been an important component of engineering disciplines. It can play an important role in synthetic biology in the same way that modeling helps in aircraft or architecture design: models and computer simulations can quickly provide a clear picture of how different components influence the behavior of the whole, so that objectives are reached sooner. Here, we discussed a modeling methodology that may help scientists and engineers to construct complex synthetic biological systems. We are developing sophisticated mathematical models of synthetic biological systems that connect the targeted biological phenotype (what we want the synthetic biological system to do) to the DNA sequence (that we need to physically construct to realize the synthetic biological system). Using these mathematical models, we can conduct simulations of many alternative designs to decide on the optimum set of components before synthesizing and testing the designs in the wet lab. Of course, identifying and constructing a few actual designs early in the synthesis process will better guide the model construction itself. Again, the importance of combining theory and experiment can hardly be overstated. The numerical methods and software tools presented herein are standardized, so that the process for generating models of synthetic gene regulatory networks is applicable to any synthetic construct and is suitable for automation. Consequently, SynBioSS represents a first step toward an automated design process, while assisting in the much broader objective: to develop theoretical and computational models that describe how the physical interactions of molecules lead to complex biological phenotypes.
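To make the idea of simulating many alternative designs concrete, the sketch below enumerates the design space of the AND-gate example in Python: every three-site promoter that contains at least one tetO (T) and one lacO (L) operator, crossed with a grid of inducer concentrations. The six designs and the concentration ranges (aTc 0–200 ng/ml, IPTG 0–1 mM, 36 pairs) come from the example; the even 6 × 6 spacing of the grid and the function names are assumptions made only for illustration.

```python
from itertools import product


def operator_designs(n_sites=3, operators=("T", "L")):
    """List promoter designs with n_sites operators that use every operator type."""
    designs = []
    for combo in product(operators, repeat=n_sites):
        if all(op in combo for op in operators):
            designs.append("".join(combo))
    return designs


def inducer_grid(atc_max=200.0, iptg_max=1.0, points=6):
    """Return (aTc in ng/ml, IPTG in mM) pairs on an evenly spaced grid (assumed spacing)."""
    atc = [atc_max * i / (points - 1) for i in range(points)]
    iptg = [iptg_max * i / (points - 1) for i in range(points)]
    return list(product(atc, iptg))


# Every (design, aTc, IPTG) combination is one simulation condition.
conditions = [(d, a, i) for d in operator_designs() for (a, i) in inducer_grid()]
```

Each entry of `conditions` would then be handed to a simulator; with 1000 trajectories per condition, as in the example, this amounts to 216,000 stochastic runs.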
ACKNOWLEDGMENTS
This work was supported by grants from the National Science Foundation (CBET-0425882 and CBET-0644792), the National Institutes of Health (American Recovery and Reinvestment Act grant R01GM086865), and the University of Minnesota Biotechnology Institute. Computational support from the Minnesota Supercomputing Institute (MSI) is gratefully acknowledged. This work was also supported by the National Computational Science Alliance under TG-MCA04N033. In addition, the author thanks Jon Tomshine, Ben Swiniarski, and Kostas Biliouris for help with the illustrations.
REFERENCES Alon, U. (2003). Biological networks: The tinkerer as an engineer. Science 301, 1866–1867. Anderson, J. C., Voigt, C. A., and Arkin, A. P. (2007). Environmental signal integration by a modular AND gate. Mol. Syst. Biol. 3, 133. Andrianantoandro, E., Basu, S., Karig, D. K., and Weiss, R. (2006). Synthetic biology: New engineering rules for an emerging discipline. Mol. Syst. Biol. 2, 0028. Bagh, S., et al. (2008). Plasmid-borne prokaryotic gene expression: Sources of variability and quantitative system characterization. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 77, 021919. Basu, S., Mehreja, R., Thiberge, S., Chen, M. T., and Weiss, R. (2004). Spatiotemporal control of gene expression with pulse-generating networks. Proc. Natl. Acad. Sci. USA 101, 6355–6360. Blake, W. J., Kærn, M., Cantor, C. R., and Collins, J. J. (2003). Noise in eukaryotic gene expression. Nature 422, 633–637. Bulter, T., et al. (2004). Design of artificial cell–cell communication using gene and metabolic networks. Proc. Natl. Acad. Sci. USA 101, 2299–2304. Canton, B., et al. (2008). Refinement and standardization of synthetic biological parts and devices. Nat. Biotech. 26, 787–793. Chandran, D., Bergmann, F. T., and Sauro, H. M. (2009). TinkerCell: Modular CAD tool for synthetic biology. J. Biol. Eng. 3, 19. Costa, F. F. (2008). Non-coding, RNAs, epigenetics and complexity. Gene 410, 9–17. de Jong, H. (2002). Modeling and simulation of genetic regulatory systems: A literature review. J. Comput. Biol. 9, 67–103. Drubin, D. A., Way, J. C., and Silver, P. A. (2007). Designing biological systems. Genes Dev. 21, 242–254. Elowitz, M. B., and Leibler, S. (2000). A synthetic oscillatory network of transcriptional regulators. Nature 403, 335–338. Funahashi, A., Morohashi, M., Kitano, H., and Tanimura, N. (2003). Cell Designer: A process diagram editor for gene-regulatory and biochemical networks. Biosilico 1, 159–162. Fung, E., Wong, W. W., Suen, J. K., Bulter, T., Lee, S. G., and Liao, J. 
C. (2005). A synthetic gene-metabolic oscillator. Nature 435, 118–122. Gardner, T. S., Cantor, C. R., and Collins, J. J. (2000). Construction of a genetic toggle switch in Escherichia coli. Nature 403, 339–342. Gibson, D. G., et al. (2008). Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 319, 1215–1220. Gibson, D. G., et al. (2010). Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56. Glass, J. I., et al. (2006). Essential genes of a minimal bacterium. Proc. Natl. Acad. Sci. USA 103, 425. Hill, A. D., Tomshine, J. R., Weeding, E. M., Sotiropoulos, V., and Kaznessis, Y. N. (2008). SynBioSS: The synthetic biology modeling suite. Bioinformatics 24, 2551–2553. Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., and Kummer, U. (2006). COPASI—A COmplex PAthway SImulator. Bioinformatics 22, 3067–3074.
Kærn, M., Blake, W. J., and Collins, J. J. (2003). The engineering of gene regulatory networks. Annu. Rev. Biomed. Eng. 5, 179–206. Kaznessis, Y. (2007). Models for synthetic biology. BMC Syst. Biol. 1, 47. Lutz, R., and Bujard, H. (1997). Independent and tight regulation of transcriptional units in Escherichia coli via the LacR/O, the TetR/O and AraC/I1-I2 regulatory elements. Nucleic Acids Res. 25, 1203. Marchisio, M., and Stelling, J. (2008). Computational design of synthetic gene circuits with composable parts. Bioinformatics 24, 1903–1910. Pennisi, E. (2010). Synthetic genome brings new life to bacterium. Science 328, 958. Ramalingam, K. I., Tomshine, J., Maynard, J. A., and Kaznessis, Y. N. (2009). Forward engineering of synthetic bio-logical AND gates. Biochem. Eng. J. 47, 38. Ro, D. K., et al. (2006). Production of the antimalarial drug precursor artemisinic acid in engineered yeast. Nature 440, 940–943. Rodrigo, G., Carrera, J., and Jaramillo, A. (2007). Genetdes: Automatic design of transcriptional networks. Bioinformatics 23, 1857–1858. Rosenfeld, N., Young, J. W., Alon, U., Swain, P. S., and Elowitz, M. B. (2005). Gene regulation at the single-cell level. Science 307, 1962–1965. Russo, V. E. A., Martienssen, R. A., and Riggs, A. D. (1996). Epigenetic Mechanisms of Gene Regulation. Cold Spring Harbor Laboratory Press, Plainview, NY. Salis, H., and Kaznessis, Y. N. (2005a). Accurate hybrid stochastic simulation of a system of coupled chemical or biochemical reactions. J. Chem. Phys. 122, 054103, 1–13. Salis, H., and Kaznessis, Y. N. (2005b). An equation-free probabilistic steady state approximation: Dynamic application to the stochastic simulation of biochemical reaction networks. J. Chem. Phys. 123, 214106. Salis, H., and Kaznessis, Y. N. (2005c). Stochastic simulations of gene regulatory modules. Comput. Chem. Eng. 29, 577–588. Salis, H., and Kaznessis, Y. N. (2006). Computer-aided design of modular protein devices: Boolean AND gene activation. Phys. 
Biol. 3, 295–310. Salis, H., Sotiropoulos, V., and Kaznessis, Y. N. (2006). Multiscale Hy3S: Hybrid stochastic simulations for supercomputers. BMC Bioinform. 7, 93. Shetty, R. P., Endy, D., and Knight, T. F., Jr. (2008). Engineering BioBrick vectors from BioBrick parts. J. Biol. Eng. 2, 5. Sotiropoulos, V., and Kaznessis, Y. N. (2008). An adaptive time step scheme for a system of SDEs with multiple multiplicative noise. Chemical Langevin equation, a proof of concept. J. Chem. Phys. 128, 014103. Sotiropoulos, V., Contou-Carrere, M.-N., Daoutidis, P., and Kaznessis, Y. N. (2009). Model reduction of multiscale chemical Langevin equations: A numerical case study. IEEE/ACM Trans. Comp. Biol. Bioinf. 6, 470. Tigges, M., Marquez-Lago, T., Stelling, J., and Fussenegger, M. (2009). A tunable synthetic mammalian oscillator. Nature 457, 309–312. Tomshine, J., and Kaznessis, Y. N. (2006). Optimization of a stochastically simulated gene network model via simulated annealing. Biophys. J. 91, 3196–3205. Tuttle, L., Salis, H., Tomshine, J., and Kaznessis, Y. N. (2005). Model-driven design principles of gene networks: The oscillator. Biophys. J. 89, 3873–3883. Weeding, E., Houle, J., and Kaznessis, Y. N. (2010). SynBioSS Designer: A web-based tool for the automated generation of kinetic models for synthetic biological constructs. Brief Bioinform. 11, 394–402. You, L., Cox, R. S., 3rd, Weiss, R., and Arnold, F. H. (2004). Programmed population control by cell–cell communication and regulated killing. Nature 428, 868–871. Zhang, K., Li, H., Cho, K., and Liao, J. C. (2010). Expanding metabolism for total biosynthesis of the nonnatural amino acid L-homoalanine. Proc. Natl. Acad. Sci. USA 107, 6234–6239.
CHAPTER SEVEN
The Eugene Language for Synthetic Biology
Lesia Bilitchenko,* Adam Liu,† and Douglas Densmore‡,§
Contents
1. Overview
2. Installation and Use
2.1. Syntax highlighting
3. Design and Implementation
3.1. Language definition
3.2. Implementation
4. Examples
Acknowledgments
Reference
Abstract
Synthetic biological systems are currently created by an ad hoc, iterative process of design, simulation, and assembly. These systems would greatly benefit from a more formalized and rigorous specification of the desired system components as well as constraints on their composition. To achieve this, the creation of robust and efficient design flows and tools is imperative. We present a human-readable language (Eugene) that allows for the specification of synthetic biological designs based on biological parts and provides a very expressive constraint system to drive the creation of composite devices from collections of parts. This chapter provides an overview of the language primitives as well as instructions on installation and use of Eugene v0.03b.
* Department of Computer Science, California State Polytechnic University, Pomona, California, USA
† Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA
‡ Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
§ Department of Electrical and Computer Engineering, Boston University, Boston, Massachusetts, USA
Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00007-3
© 2011 Elsevier Inc. All rights reserved.
1. Overview
Eugene is a language created specifically to define the composition of a composite biological part (called a Device) from individual Parts. Parts are defined by user-specified Properties. These Properties can come either from explicit declaration by a designer or from data extracted from standardized repositories of biological parts. Eugene is not meant to be a language in which synthetic biological systems are simulated or modeled; rather, it was created as a mechanism to specify the “topology” or “architecture” of a design and the rules dictating which alternative architectures are “valid.” This space of valid architectures can then be explored automatically to create a combinatorial set of Devices that the user wishes to investigate further. Because Eugene is lightweight, this information is easily exported to other tools for the automated creation of Device sets, or sent to other systems that perform simulation or modeling tasks of interest. In this way, Eugene is very much a “specification language” in which designers can quickly and efficiently (and to a large extent unambiguously) capture their designs so that they can be shared, reused, and modified by other users and tools. This document is NOT a tutorial. It is a straightforward explanation of how to install Eugene, along with details on the language elements and how the underlying data structures are implemented. It concludes with some examples to give the reader a sense of how Eugene may be used. The intended audience is readers who wish to understand Eugene’s inner workings and have a reference manual. For a “story” on Eugene, other languages for synthetic biology, and specific case studies, see (Lesia et al., 2011).
2. Installation and Use
To use Eugene, download the zipped folder from http://www.eugenecad.org. No installation is required, since the compiler is a stand-alone executable. However, the Java Development Kit (JDK) 6 is necessary. Once the files have been extracted from the folder, the Eugene compiler is ready for use. The executable eugene.jar is started by double-clicking it and browsing for a file with the extension .eug. The executable must be in the same directory as the lib folder. The compiler takes a *.eug file and creates a *eug.out file with the same name and directory. Errors and output are written to this file.
Additionally, Eugene can be run from the command line:
java -jar <eugene directory path>\eugene.jar <file path> <options>
or
java -jar <eugene directory path>\eugene.jar
where <eugene directory path> is the path where the jar file resides and <file path> is the path to the *.eug file to compile. Specify any number of <options> separated by spaces. The compiler takes many or no arguments. If a file is specified, that file will be compiled. If no file is specified, a file chooser will open to browse for a *.eug file to compile. The following options are supported:
1. -xml-default: generates XML and includes all properties
2. -xml-sbdtp: generates XML in SBDTP format*
3. -print-devices: prints all devices to the console and output file
* Synthetic Biology Data Transfer Protocol (under development; http://www.sbolstandard.org)
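When the compiler is driven from a script rather than by hand, the command line described above can be assembled programmatically. The Python helper below is a minimal sketch of that, not part of the Eugene distribution: the function name `build_eugene_command` is hypothetical, the placement of the *.eug file before the options is an assumption about argument order, and only the three options listed above are accepted.

```python
from pathlib import PureWindowsPath

# Options documented for the Eugene compiler.
SUPPORTED_OPTIONS = ("-xml-default", "-xml-sbdtp", "-print-devices")


def build_eugene_command(eugene_dir, eug_file=None, options=()):
    """Build the argument list for running eugene.jar from the command line.

    eugene_dir: directory that contains eugene.jar (and the lib folder).
    eug_file:   path to the *.eug file; if None, Eugene opens a file chooser.
    options:    any subset of SUPPORTED_OPTIONS.
    """
    for opt in options:
        if opt not in SUPPORTED_OPTIONS:
            raise ValueError(f"unsupported option: {opt}")
    jar = str(PureWindowsPath(eugene_dir) / "eugene.jar")
    command = ["java", "-jar", jar]
    if eug_file is not None:
        command.append(str(eug_file))  # assumed position: before the options
    command.extend(options)
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine with Java installed; the helper itself only builds the arguments and does not launch anything.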
2.1. Syntax highlighting
Syntax highlighting is available for Notepad++ (http://notepad-plus.sourceforge.net/). If %APPDATA% is being used in Windows systems, navigate to %APPDATA%\Notepad++. %APPDATA% is usually C:\Documents and Settings\%USERNAME%\Application Data on Windows machines. Otherwise, navigate to the installation directory, usually C:\Program Files\Notepad++ on Windows machines. If a user-defined language is already specified, merge userDefineLang.xml with the existing file. Otherwise, copy the file into the directory. If Eugene appears under the Language menu in Notepad++, syntax highlighting is enabled for *.eug files and *.h header files.
3. Design and Implementation
This section is divided into two primary subsections. The first subsection details the syntax and language constructs of Eugene. The second subsection details the implementation and data structures involved in Eugene’s design.
3.1. Language definition
In this section, we describe the elements of the language: primitive data types, Properties, Parts, Devices, rules, and conditional execution. The relationships between these language elements are shown in Fig. 7.1, where each subsequent category is built upon the previous one.
Figure 7.1 Relationship between Eugene data types. This figure illustrates how primitive Eugene data types are built upon to create Properties, Parts, and Devices.
3.1.1. Comments
Line comments and block comments are supported. Example:
// This is a line comment.
/* This is a block comment. */
3.1.2. Primitives
The language supports five predefined primitives: txt, num, boolean, txt[ ], and num[ ]. Strings (sequences of characters) are represented by the data type “txt,” where the actual text is specified in double quotes. Real numbers and integers are supported by the data type “num” and binary logical values by the data type “boolean.” Ordered lists of num and txt values can be created, and individual members of a list are accessed by specifying an integer index in the range from 0 to |list| - 1.
The following operations are supported:
1. +: addition for num, concatenation for txt, append for lists
2. -: subtraction for num
3. *: multiplication for num
4. /: division for num
Examples (1) and (2) are two real code snippets showing how primitives can be specified in Eugene. “listOfSequences” is simply a list of three arbitrary DNA sequences. “specificSequence” is the last element of “listOfSequences” (i.e., “ATCG”). Examples (3) and (4) show how the data type “num” can support integers and decimals.
txt[] listOfSequences = ["ATG", "TCG", "ATCG"]; (1)
txt specificSequence = listOfSequences[2]; (2)
num[] listOfNumbers = [2.5, 10, 3.4, 6]; (3)
num ten = listOfNumbers[1]; (4)
More examples:
txt txt1;
num num1;
txt[] txtLst1;
num[] numLst1;
txt txt2 = "Hello world";
num num2 = (((44.4 + 6.7) * 2) - 1.45) / 5;
txt[] txtLst2 = ["A", "list", "of", "strings"];
num[] numLst2 = [1, 2.0, 3.00, 4, 2.5 + 2.5];
num two = numLst2[1]; // two holds 2.0
num four = two + 2; // four holds 4.0
txt helloAgain = txt2 + " again"; // helloAgain holds "Hello world again"
txtLst3 = txtLst2 + ["with", "more", "strings"];
numLst3 = numLst2 + [6, 7.0, 8];
txt txt3 = "This", txt4 = "is", txt5 = "a", txt6 = "shortcut", txt7, txt8;
boolean yes;
boolean no = false;
yes = true;
3.1.3. Properties
Properties represent characteristics of interest; they are defined by primitives and associated with Parts. For example, a user could define a Property “Sequence” (the DNA sequence), “ID” (the UUID for a relational database that may hold the part), or “Orientation” (e.g., a forward or backward promoter). Examples 5-8 show how such Properties would be defined. Property definitions must use one of the five primitive types. In Part definitions, Properties are bound to that Part as placeholders for the instantiation of values in Part declarations. Properties have to be defined before Parts can use them. The user can create new Property labels or use those created by other users and captured in “header files” (Section 3.1.10). For example, the following Properties are predefined in the header file PropertyDefinition.h and do not need to be defined again if the header file is included in the main program:
Property ID(txt); // in header file (5)
Property Sequence(txt); // in header file (6)
Property Orientation(txt); // in header file (7)
Property RelativeStrength(num); // custom Property (8)
3.1.4. Parts
The data type Part represents a standard biological Part. A Part can be defined empty initially, with Property labels added later through the function addProperties(), or Properties can be bound to a Part during its definition. Part definitions do not construct any Parts, but rather specify which Parts can be constructed. This can be done in the header file or in the main program. When the header files PartDefinition.h and PropertyDefinition.h are included, the following Parts and their corresponding Property labels are predefined. For instance, the Part “Promoter” will have three Properties associated with it, and all instances of Promoter will inherit ID, Sequence, and Orientation:

    Part Promoter(ID, Sequence, Orientation);           (9)
    Part ORF(ID, Sequence, Orientation);                (10)
    Part RBS(ID, Sequence, Orientation);                (11)
    Part Terminator(ID, Sequence, Orientation);         (12)
    Part RestrictionSite(ID, Sequence, Orientation);    (13)
    Part PrimerSite(ID, Sequence, Orientation);         (14)
If some Properties are unknown when a Part is defined, the Part can be defined either empty or with only the known Properties. Property labels can be added later through the function addProperties(), provided the Property labels have been created beforehand. RBS will have four Property labels after statement (15):

    RBS.addProperties(RelativeStrength);    (15)
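The ordering constraint described above (a Property label must exist before a Part definition can use it) can be sketched as follows. This is a Python model under stated assumptions, not the Eugene implementation: the class PartDefinition, its method add_properties, and the choice to reject duplicate labels are all hypothetical.

```python
class PartDefinition:
    """A Part definition: a name plus an ordered list of Property labels."""

    def __init__(self, name, property_definitions, *labels):
        self.name = name
        # Shared table of defined Property labels (label -> primitive type).
        self._property_definitions = property_definitions
        self.labels = []
        self.add_properties(*labels)

    def add_properties(self, *labels):
        for label in labels:
            if label not in self._property_definitions:
                raise KeyError(f"Property {label!r} must be defined before use")
            if label in self.labels:
                # Assumption: binding the same label twice is an error.
                raise ValueError(f"Property {label!r} is already bound to {self.name}")
            self.labels.append(label)


# Labels (5)-(7), as if read from a header file.
property_definitions = {"ID": "txt", "Sequence": "txt", "Orientation": "txt"}
rbs = PartDefinition("RBS", property_definitions, "ID", "Sequence", "Orientation")

# Custom Property (8), then statement (15).
property_definitions["RelativeStrength"] = "num"
rbs.add_properties("RelativeStrength")
```

After the last call, rbs.labels holds four labels, matching the description of statement (15).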
Part declarations make instances of predefined Parts and assign values to their properties. If the declaration specifies a list of values, it is assumed that every property will be assigned a value, where the order of the values corresponds to the order of the properties in the Part Definition as shown in example (17). Otherwise, a “dot notation” followed by the name of the property can be employed, where the order becomes irrelevant as specified in the example below (16).
The Part instance BBa_K112234_rbs has three Properties associated with the Part RBS: ID, Sequence, and Orientation. The identification label of a particular Part from a database is stored in the ID placeholder to allow future access to the database. Sequence stores the DNA sequence of a Part, while Orientation specifies the direction of the Part. Since dot notation is used in (16), the ID value can be left out of the statement. Part declarations can be found in the header file PartDeclaration.h and are predefined if the header files are included in the main program.

    RBS BBa_K112234_rbs(.Sequence("gatcttaattgcggagacttt"), .Orientation("Forward"));    (16)
    RBS BBa_K112234_rbs("BBa_K112234_rbs", "gatcttaattgcggagacttt", "Forward");           (17)

More Examples:

    txt seq = "ATCG";
    Promoter p1(seq, 100, ["A", "B", "C"], [1, 2, 3], true);
    Promoter p2(.RelativeStrength(20), .Sequence("GCTA"));
    RBS rbs1("CGAT");
    RBS rbs2();
    txt p1Seq = p1.Sequence;    // p1Seq holds "ATCG"
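The two declaration styles, an ordered value list as in (17) and dot notation as in (16), can be contrasted in a short Python sketch. The function declare_part is hypothetical; requiring a full value list follows the statement that every Property is assumed to receive a value, and leaving unnamed Properties unset (None) is an assumption.

```python
RBS_LABELS = ["ID", "Sequence", "Orientation"]


def declare_part(labels, *values, **named_values):
    """Return a label -> value mapping for one Part instance."""
    if values and named_values:
        raise ValueError("use either an ordered value list or dot notation")
    if values:
        # Ordered list: values are matched to labels by position.
        if len(values) != len(labels):
            raise ValueError("an ordered list must give a value for every Property")
        return dict(zip(labels, values))
    unknown = [label for label in named_values if label not in labels]
    if unknown:
        raise KeyError(f"unknown Property label(s): {unknown}")
    # Dot notation: order is irrelevant; omitted Properties stay unset.
    return {label: named_values.get(label) for label in labels}


by_name = declare_part(
    RBS_LABELS, Orientation="Forward", Sequence="gatcttaattgcggagacttt"
)
by_position = declare_part(
    RBS_LABELS, "BBa_K112234_rbs", "gatcttaattgcggagacttt", "Forward"
)
```

Both calls yield the same Sequence and Orientation; only the positional form fills in ID.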
3.1.5. Devices
Devices represent a composite of standard biological Parts and/or other Devices. In a Device declaration, the same Part and/or Device can be used more than once. Property values of Devices can be accessed with the dot operator; however, the value is the union of the Property values of its members, returned as a list. If the Property is a txt or num, a txt[] or a num[] is returned. If the Property is a txt[] or a num[], a txt[] or a num[] is also returned that consists of the lists appended together. For example, the sequence of Device BBa_K112133 is the ordered union of the sequence of Part BBa_K112126 and the Device BBa_K112234.

    Device BBa_K112234(BBa_K112234_rbs, BBa_K112234_orf);    (18)
    Device BBa_K112133(BBa_K112126, BBa_K112234);            (19)
Individual Parts can be accessed through the use of square brackets and an index. The first member is indexed at zero. Square brackets can be stacked in the case of Devices within Devices. To access the first element BBa_K112234_rbs of Device BBa_K112234 through Device BBa_K112133, the following notation is supported:

    BBa_K112133[1][0]    // references BBa_K112234_rbs    (20)

More Examples:

    Device d1(p1, p1, rbs1);
    Device d2(Promoter p2, Device d1, RBS rbs2);    // optional type clarity
    Device d3();                                    // empty devices can be declared as placeholders
    txt[] d1Seq = d1.Sequence;                      // d1Seq holds ["ATCG", "ATCG", "CGAT"]
    txt rbs1Seq = d2[1][2].Sequence;                // rbs1Seq holds "CGAT"
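The two access patterns above, the dot operator returning an ordered union and stacked square brackets for nested Devices, can be modeled in a few lines of Python. This is an illustrative sketch only: the classes Part and Device and the method get are hypothetical stand-ins, the Sequence values are assigned by name to match the comments in the listing, and skipping members that have no value for a Property is an assumption.

```python
class Part:
    """A Part instance with named Property values (simplified stand-in)."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties

    def get(self, label):
        return self.properties.get(label)


class Device:
    """An ordered composite of Parts and/or other Devices."""

    def __init__(self, name, *components):
        self.name = name
        self.components = list(components)

    def __getitem__(self, index):
        # Zero-based access; brackets stack for Devices inside Devices.
        return self.components[index]

    def get(self, label):
        # Ordered union of the members' values, always returned as a list.
        values = []
        for component in self.components:
            value = component.get(label)
            if value is None:
                continue  # assumption: members without a value are skipped
            if isinstance(value, list):
                values.extend(value)  # list values are appended together
            else:
                values.append(value)
        return values


p1 = Part("p1", Sequence="ATCG")
p2 = Part("p2", Sequence="GCTA")
rbs1 = Part("rbs1", Sequence="CGAT")
rbs2 = Part("rbs2")
d1 = Device("d1", p1, p1, rbs1)
d2 = Device("d2", p2, d1, rbs2)

d1_seq = d1.get("Sequence")
rbs1_seq = d2[1][2].get("Sequence")
```

Because a nested Device already returns a list, its values are flattened into the outer list, which is one way to realize the “ordered union” described for BBa_K112133.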
3.1.6. Rules
The specification of rules provides the ability to validate Device declarations. Rule declarations themselves do not perform the validation. They have to be “noted,” “asserted,” or used as expressions inside an if-statement to affect program operation. Rule declarations are single statements consisting of a left operand, a right operand, and one rule operator. The rule operators BEFORE, AFTER, WITH, NOTWITH, NEXTTO, NOTCONTAINS, CONTAINS, and NOTMORETHAN can be applied to Part instances or Device instances. The rule operators in themselves are generic, in that you can apply them however you like. They only become problem specific when they are put into a biological context.
It is important to note that the operators WITH, NOTWITH, and NEXTTO are commutative, whereas BEFORE, AFTER, CONTAINS, and NOTCONTAINS are not. Thus, in the latter set, changing the order of the operands will not yield the same result. For example, in the case of a BEFORE b, all instances of a have to occur before b. Swapping the operands reverses the rule: all instances of b then have to appear before a to yield a true result.
Further, Property values of Part/Device instances, or primitives in relation with one Part/Device, can be operands in rule declarations when using the relational operators <, <=, >, >=, !=, ==. These operators are overloaded when evaluating text, and text is compared according to alphabetical precedence. They also follow the same mathematical properties of binary operators. Table 7.1 provides a summary of the operators for Eugene rules.
Example (21) illustrates a rule where all Parts BBa_K112234_rbs have to come before all Parts BBa_K11223_orf. Example (22) illustrates a rule where the Part BBa_K112234_rbs has to be contained together with BBa_K112234_orf inside a Device. Example (23) illustrates a rule where the Part BBa_K112126 has to be next to BBa_K112234 when a Device is declared. Example (24) illustrates a rule that checks whether the sequence of BBa_K112234_rbs is equivalent to the sequence of BBa_K112234_orf. Example (25) illustrates the comparison of Property values of Parts, where the “RelativeStrength” Property value for Part BBa_K112234_rbs has to be greater than the “RelativeStrength” Property value for Part BBa_B0032.
Table 7.1 Overview of the available operators and their categories for constructing rules in Eugene

Compositional operators
    BEFORE         operand 1 appears before operand 2 on devices
    AFTER          operand 1 appears after operand 2 on devices
    WITH (c)       operand 1 appears with operand 2 on devices
    NOTWITH (c)    operand 1 does not appear with operand 2 on devices
    NEXTTO (c)     operand 1 is adjacent to operand 2 on devices
    NOTCONTAINS    operand 1 is not contained in device
    CONTAINS       operand 1 is contained in device
    NOTMORETHAN    operand 1 does not occur more than operand 2 instances in device

Comparison operators
    <              less than
    <=             less than or equal to
    >              greater than
    >=             greater than or equal to
    !=             not equal to
    ==             equal to

Boolean operators
    AND            operand 1 AND operand 2
    OR             operand 1 OR operand 2
    NOT            NOT operand

(c) indicates that the operands are commutative.
Example (26) shows a similar comparison but uses the variable “relativeStr” for comparison.

    Rule r1(BBa_K112234_rbs BEFORE BBa_K11223_orf);                                  (21)
    Rule r2(BBa_K112234_rbs WITH BBa_K112234_orf);                                   (22)
    Rule r3(BBa_K112126 NEXTTO BBa_K112234);                                         (23)
    Rule r4(BBa_K112234_rbs.Sequence == BBa_K112234_orf.Sequence);                   (24)
    Rule r5(BBa_K112234_rbs.RelativeStrength > BBa_B0032.RelativeStrength);          (25)
    num relativeStr = BBa_B0032.RelativeStrength;
    Rule r6(p.RelativeStrength > relativeStr);                                       (26)
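To make the difference between commutative and noncommutative operators concrete, the following Python sketch evaluates a few compositional operators on a Device flattened into an ordered list of Part names. It is not the Eugene implementation: the function names are hypothetical, and treating BEFORE as satisfied when either operand is absent is an assumption the text does not settle.

```python
def positions(device, name):
    """Indices at which the named Part occurs in the flattened Device."""
    return [i for i, member in enumerate(device) if member == name]


def before(device, a, b):
    """a BEFORE b: every instance of a occurs before every instance of b."""
    pos_a, pos_b = positions(device, a), positions(device, b)
    if not pos_a or not pos_b:
        return True  # assumption: vacuously satisfied if an operand is absent
    return max(pos_a) < min(pos_b)


def after(device, a, b):
    """a AFTER b is b BEFORE a."""
    return before(device, b, a)


def with_(device, a, b):
    """a WITH b: both appear on the Device (order of operands is irrelevant)."""
    return bool(positions(device, a)) and bool(positions(device, b))


def next_to(device, a, b):
    """a NEXTTO b: some instance of a is adjacent to some instance of b."""
    return any(
        abs(i - j) == 1 for i in positions(device, a) for j in positions(device, b)
    )


# BBa_K112133 from (19), with the nested Device BBa_K112234 flattened.
device = ["BBa_K112126", "BBa_K112234_rbs", "BBa_K112234_orf"]
rbs, orf = "BBa_K112234_rbs", "BBa_K112234_orf"
```

Swapping the operands changes the result of before but not of with_ or next_to, which is the property marked (c) in Table 7.1.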
3.1.7. Asserting and noting rules
In order to take effect, rules need to be “asserted” or “noted” once they are declared. The scope of every Assert or Note statement encompasses every new Device: each time a new Device is declared, the validation process is performed on it against all existing Assert and Note statements. Rule instances can be combined with each other through the use of the logical operators AND, OR, and NOT in these statements. The difference between rule assertions and rule notes lies in the strength of the consequence once a violation is found. If no violation is found, the program continues running. Rule assertions are strong: the program terminates with an error once a Device composition violates the statement. Statement (27) checks that BBa_K112234_rbs is not contained together with BBa_K112234_orf in the Device and that their sequences are not equal. In this case an error will terminate the program, since both Parts are components of the Device (28), thereby violating the Assert statement.

    Assert((NOT r4) AND (NOT r2));                           (27)
    Device BBa_K112234(BBa_K112234_rbs, BBa_K112234_orf);    (28)

Notes issue warnings in the output when a violation occurs, but the program continues running. In the following example, Device BBa_K112133 (31) meets the first note’s condition (29) successfully. However, the next note (30) is violated and the program will issue a warning.

    Note(r2 AND r3);                                 (29)
    Note(NOT r1);                                    (30)
    Device BBa_K112133(BBa_K112126, BBa_K112234);    (31)
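The difference in consequence between Assert and Note can be sketched in Python as follows. This is a model under stated assumptions rather than the Eugene runtime: the exception RuleViolation, the function declare_device, and the two global lists are hypothetical, Devices are flattened to lists of Part names, and an exception stands in for program termination.

```python
class RuleViolation(Exception):
    """Raised when an asserted rule expression is violated (hypothetical name)."""


assertions = []  # (description, predicate) pairs applied to every new Device
notes = []       # same shape, but a violation only produces a warning


def declare_device(name, components):
    """Validate a new Device against all assertions and notes declared so far."""
    for description, holds in assertions:
        if not holds(components):
            raise RuleViolation(f"{name} violates Assert {description}")
    return [
        f"warning: {name} violates Note {description}"
        for description, holds in notes
        if not holds(components)
    ]


RBS, ORF = "BBa_K112234_rbs", "BBa_K112234_orf"


def r1(device):  # rbs BEFORE orf (single occurrences assumed)
    return RBS in device and ORF in device and device.index(RBS) < device.index(ORF)


def r2(device):  # rbs WITH orf
    return RBS in device and ORF in device


# Note (30): the Device is still created, but a warning is returned.
notes.append(("(NOT r1)", lambda device: not r1(device)))
warnings = declare_device("BBa_K112133", ["BBa_K112126", RBS, ORF])

# A strong assertion: the same composition now stops the program.
assertions.append(("(NOT r2)", lambda device: not r2(device)))
try:
    declare_device("BBa_K112234", [RBS, ORF])
    terminated = False
except RuleViolation:
    terminated = True
```

Since both lists are global, every later call to declare_device is checked, mirroring the statement that the scope of Assert and Note encompasses every new Device.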
3.1.8. Permute function
The permute function automates the specification of many Devices that share the same basic structure. It generates a Device for every combination of predefined Parts, maintaining the Part type of each component in the original Device. If a component of the Device is itself a Device, even one with only a single Part, that component is not changed and appears in every variation. Variations are named <name>_<x>, where <name> is the name of the argument Device and x is a number starting at 1. They can be accessed and manipulated like normally instantiated Devices.

    Promoter p1(.Sequence("atc"));                                    (32)
    Promoter p2(.Sequence("gcta"));
    RBS rbs1(.Sequence("gatct..."), .Orientation("Forward"));         (33)
    RBS rbs2(.Sequence("gatcttaatt"), .Orientation("Forward"));       (34)
    Device d2(p2, d1, rbs2);                                          (35)
    permute(d2);                                                      (36)
Statement (36) considers the pool of all predefined Promoters, p1 and p2, and all predefined ribosome binding sites, rbs1 and rbs2. Since d1 is a Device, it is not changed. The following variations are generated (37–40):

    d2_1: p1, d1, rbs1    (37)
    d2_2: p1, d1, rbs2    (38)
    d2_3: p2, d1, rbs1    (39)
    d2_4: p2, d1, rbs2    (40)
Permute can also accept two additional arguments. Statement (41) shows how an additional argument can limit the number of permutations made (in this case to 2). Statement (42) shows how the additional keyword strict can be used to restrict permutations only to those Devices that meet constraints which have been “asserted.” Statement (43) shows how “noted” constraints can be respected with the flexible argument and how all arguments can be combined.

    Permute(d2, 2);              (41)
    Permute(d2, strict);         (42)
    Permute(d2, 2, flexible);    (43)
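The enumeration behind permute amounts to a Cartesian product over the pools of same-typed Parts, with nested Devices held fixed. The Python sketch below illustrates this with itertools.product; the function signature, the (type, name) representation of components, and the choice to keep the first variations when a limit is given are assumptions, and the strict and flexible keywords are not modeled.

```python
from itertools import product


def permute(name, components, pools, limit=None):
    """Enumerate variations of a Device, one Part per slot from its type's pool.

    components: ordered (type, instance_name) pairs of the argument Device.
    pools: mapping from Part type to the names of all predefined instances.
    """
    choices = []
    for part_type, instance in components:
        if part_type == "Device":
            choices.append([instance])  # nested Devices appear unchanged
        else:
            choices.append(pools[part_type])
    variations = {}
    for x, combination in enumerate(product(*choices), start=1):
        if limit is not None and x > limit:
            break  # assumption: a limit keeps the first variations generated
        variations[f"{name}_{x}"] = list(combination)
    return variations


pools = {"Promoter": ["p1", "p2"], "RBS": ["rbs1", "rbs2"]}
d2 = [("Promoter", "p2"), ("Device", "d1"), ("RBS", "rbs2")]

all_variations = permute("d2", d2, pools)
first_two = permute("d2", d2, pools, limit=2)
```

With the pools above, the four entries of all_variations correspond to variations (37)–(40), in the same order.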
3.1.9. Conditional statements
The use of conditional statements breaks up the flow of execution and allows selected blocks of code to be executed. Eugene supports two kinds of if-statements to achieve this: the rule-validating if-statement and the standard if-statement. The three logical operators AND, OR, and NOT can combine expressions within either kind of if-statement, but expressions of the two kinds cannot be mixed.
Rules can be checked not just through Assert and Note statements but also in an if-statement. In this approach, only specific rules will be considered, as they might not apply to all Devices. The notation should specify a list of Devices and a logical combination of rule instances pertaining to that list. Suppose we would like to test a rule only on the specific Device instance BBa_K112133, where the Promoter BBa_K112126 comes before the Ribosome Binding Site BBa_K112234_rbs. The following conditional statement achieves such conditional evaluation. In this case, the if-statement will evaluate to true:

    Rule r7(BBa_K112126 BEFORE BBa_K112234_rbs);
    if(on (BBa_K112133) r7) {
        Block statement, in case of true evaluation
    } else {
        Block statement, in case of false evaluation
    }                                                    (44)

Expressions not pertaining to rules and Devices can be evaluated by the standard if-statement, which supports the relational operators <, <=, >, >=, !=, == as well as the logical operators AND, OR, NOT.

    boolean test = true;
    if(test) {
        Assert(ruleWith);
    } else {
        Assert(NOT ruleWith);
    }
    Device BBa_K112133(BBa_K112126, BBa_K112234);        (45)

More Examples:

    num x = 4;
    if (p1.RelativeStrength > x AND NOT(p1.Sequence == "ATCG")) {
        p1.Sequence = "TCGA";
    } else {
        p1.Sequence = "CGAT";
    }
    if (on(d1, d2, d3) (r1 AND r2) OR (NOT(r2 OR r3))) {
        RBS rbs3("GATC");
    } else {
        RBS rbs3("ATCG");
    }
Lists consisting of numbers or strings can be compared in if-statements as well. Each element in one list is compared to the element at the corresponding position in the other list. The following operators are supported:
1. ==: equal to
2. !=: not equal to

More Examples:

    num[] a9 = [ 1, 2, 3 ], a9a = [ 1, 2, 3 ];
    if(a9 == a9a) {
        print("a9 == a9a");
    } else {
        print("a9 != a9a");
    }
3.1.10. Header files
The inclusion of header files allows the use of predefined Properties, Parts, and Part instances in the program. Code in the main file becomes easier to manage because the low-level details of sequences and Parts are hidden. The user needs only to define Devices in the main file. At this level, programs can be written quickly and are less error prone. Header files also allow each lab to have its own header file libraries. At the same time, the option to change or declare other Properties, Parts, and Part instances exists in the language.
3.1.11. Image bindings
An image binding associates an image with a particular Part. An image binding for a Part must be made before any instances of the Part are declared. An image binding for an instance of a Part or Device can be made after that instance is declared, with the same syntax but replacing the Part name with the instance name. Examples:

    Image(Promoter, "C:\My Images\promoter.jpg");
    Image(RBS, "C:\My Images\rbs.jpg");
Image bindings can be used by other tools to associate icons with the Parts for the purposes of manipulation by a graphical program.
3.2. Implementation
After compilation, a data structure is created which can be used further by other tools to display the information visually.

3.2.1. Header file creation
Header files give the language access to the many Parts already defined in databases. For convenient data exchange over the Internet, XML could be used to read information from a database. The data is then converted into Eugene syntax to produce the header files. As a result, the language definitions are not just abstract statements but are tied to existing designs. There are three main header files: PropertyDefinition.h, PartDefinition.h, and PartDeclaration.h.

3.2.2. Eugene main file
The main file (using a .eug extension) can include the header files, which need to be specified at the beginning:

    include PropertyDefinition.h, PartDefinition.h, PartDeclaration.h;    (46)

The main file will generally consist of custom Part definitions/declarations, Device constructs, rule implementations, and control statements.

3.2.3. ANTLR
ANTLR is an LL(*) recursive-descent parser generator that accepts lexer, parser, and tree grammars (http://www.antlr.org). It is used as the parser generator for Eugene code, since ANTLR allows the reuse of a grammar with different semantic actions and the creation of parsers in another language. Both of these characteristics will be useful for integrating Eugene with other tools. After some preprocessing of the header files, a data structure is created, which can be applied directly to visual tools after conversion to XML from our internal data structure (Section 3.2.4).

3.2.4. Data structure
The data structure consists of four main classes, which directly relate to the Eugene syntax. Each instance of these classes is referenced by the user-defined name from the Eugene files and stored in global hash maps according to the class type.

3.2.4.1. Classes
The Primitive class acts as a container for numbers, text, lists, and Boolean values. The Part class stores the instance of a Part and its Property values as a hash map of Property labels referencing Primitive data types. Each Part instance points toward the Part definition it came from through its type field. An image path can be bound to a Part, and Part instances can have different images if the user specifies accordingly. The Device class stores the instance of a Device and the names of the ordered list of components. An image path can also be associated with a specific instance. The Rule class stores the instance of a Rule definition, where the rule statement is broken into three components: the left operand, the right operand, and the operator.

3.2.4.2. Global data structure
The data structure is divided between the storage of Part and Property definitions and the actual instances corresponding to their classes, for efficient and immediate access to the data. Every instance is referenced by name and stored in a hash map according to the class to which it belongs. Parts, Devices, rules, and primitives are kept in separate hash maps. Figure 7.2 illustrates the global data structures and custom classes. The hash map propertyDefinitions stores the defined Property labels and their type. For instance, the Sequence Property will be of type txt, as shown in Fig. 7.3. The hash map partDefinitions in Fig. 7.3 stores the defined Parts and their Property labels as well as any image associated with the specific Part.
For example, the Part Promoter will have the properties ID, Sequence, and Orientation. The hash map partDeclarations contains the declared Part instances. Each instance contains a list of Property values. For example, the Part instance BBa_I0500 is of type Promoter, having the properties ID, Sequence, and Orientation, where Sequence is of type “txt” and has “GATCTtta...” as its value, as shown in Fig. 7.3. Similarly, deviceDeclarations, ruleDeclarations, and primitiveDeclarations store the instance names referring to class instances of Device, Rule, and Primitive, respectively. The hash maps ruleAssertions and ruleNotes in Fig. 7.4 store the assert and note statements as keys which point toward lists containing the individual elements of the statement in reverse Polish (postfix) notation. Postfix notation is used to help evaluate the truth-value of each assert or note statement. The statements have a global scope. Therefore, every time a Device is created, the program goes through each list and applies these statements to the Device.

Figure 7.2 Eugene interpreter data structures and classes. These illustrations show how Eugene objects are stored in the interpreter’s data structures. In addition, it outlines the elements of the Primitive, Part, Device, and Rule classes.
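Storing each assert or note statement in postfix form allows it to be evaluated with a single pass and a stack. The following Python sketch shows the idea for the logical operators AND, OR, and NOT; the function name evaluate_postfix and the representation of tokens as strings are assumptions, and rule truth-values are supplied directly rather than computed on a Device.

```python
def evaluate_postfix(tokens, rule_values):
    """Evaluate a rule expression given in reverse Polish notation.

    tokens: rule names and the operators "AND", "OR", "NOT" in postfix order.
    rule_values: mapping from rule name to its truth-value on the new Device.
    """
    stack = []
    for token in tokens:
        if token == "NOT":
            stack.append(not stack.pop())
        elif token in ("AND", "OR"):
            right = stack.pop()
            left = stack.pop()
            stack.append((left and right) if token == "AND" else (left or right))
        else:
            stack.append(rule_values[token])  # operand: look up the rule's value
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


# Statement (27), Assert((NOT r4) AND (NOT r2)), in postfix order.
assert_27 = ["r4", "NOT", "r2", "NOT", "AND"]

# Illustrative values: the sequences differ (r4 false), both Parts present (r2 true).
holds = evaluate_postfix(assert_27, {"r4": False, "r2": True})
```

Here holds is False, so under these illustrative values the assertion would be reported as violated for the new Device.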
Figure 7.3 Eugene Property, Part, and Device relationships. This figure illustrates more details on how Primitive Parts, Properties, and Devices are related. The examples provide insight on how the data types build on one another and what type of information is stored in a Eugene design.
Figure 7.4 Eugene rule relationships. The relationship between Eugene’s rule data structures and rule declaration and enforcement syntax is shown here.
4. Examples
The first example provides a very basic overview of two Part definitions, three Part instances, and one Device built from those instances. It primarily illustrates how Properties and Parts are defined and later instantiated. It concludes with a very basic illustration of the “print” function.

    //Demo example
    //Author: Douglas Densmore
    Property sequence(txt);
    Property name(txt);
    Property relativeStrength(num);
    Property Neg10Neg35(txt[]);
    Part Promoter(name, sequence, relativeStrength);
    Part SpecialPromoter(name, sequence, Neg10Neg35);
    Promoter p1("PromoterType1", "ATC", 10);
    Promoter p2(.sequence("ATCCCCC"));
    SpecialPromoter p3("PromoterType2", "CAT", ["CAT", "TAG"]);
    p2.name = "Test";
    Device d1(p2, p3, p1);
    print(p1.sequence);       //this will produce "ATC"
    print(p2.name);           //this will produce "Test"
    print(p3.name);           //this will produce "PromoterType2"
    print(d1[0].sequence);    //this will produce "ATCCCCC"
This next example illustrates an alternate Device construction syntax as well as conditional execution with an “if statement.” It also shows the use of header files.

    //Device syntax example
    //Author: Lesia Bilitchenko
    include PropertyDefinition.h, PartDefinition.h, PartDeclaration.h;
    //Rule testRule(BBa_I0500 BEFORE BBa_K112805);
    Rule testRule(BBa_I0500 BEFORE BBa_K112);
    Rule testRule2(BBa_I0500 AFTER BBa_K112805);
    Rule testRule3(BBa_J23116 NOTWITH BBa_K112807);
    Note(testRule);
    Note(testRule2);
    Note(testRule AND testRule3);
    Device BBa_K112809(
        Promoter BBa_I0500,
        ORF BBa_K112805,
        ORF BBa_K112806,
        Terminator BBa_B0010,
        Terminator BBa_B0012,
        Promoter BBa_J23116,
        ORF BBa_K112807,
        Terminator BBa_B0010
    );
    /*
    Demonstrates a rule validating if statement
    */
    if(on (BBa_K112809) NOT testRule) {
        print("BBa_K112809.Sequence: ", BBa_K112809.Sequence);
    } else {
        print("Violation on BBa_K112809.");
    }
The final example is an “XOR” function. Here you will notice that Eugene is not currently equipped to express small molecule reactions explicitly. Instead, a user-defined Property is used to indicate how the promoter is regulated.

    //XOR example
    //Author: Douglas Densmore
    Property sequence(txt);
    Property smallMoleculeInteraction(txt);
    Property type(num);
    //1 - neg regulated by lacI, pos regulated by tetR
    //2 - neg regulated by tetR, pos regulated by lacI
    Part ConstitutivePromoter(sequence);
    Part RegulatedPromoter(sequence, type);
    Part ORF(sequence, smallMoleculeInteraction);
    //Sequences here are only used as placeholders
    ConstitutivePromoter cp("ACGT...");
    RegulatedPromoter rpType1("ACGT...", 1);
    RegulatedPromoter rpType2("ACGT...", 2);
    ORF gfp("ACGT...", "none");
    ORF lacI("ACGT...", "IPTG");
    ORF tetR("ACGT...", "aTc");
    Device xor(cp, lacI, tetR, rpType1, gfp, rpType2, gfp);
    print(xor[1].smallMoleculeInteraction);    //produces "IPTG"
More examples can be found at http://www.eugenecad.org or in Bilitchenko et al. (2011).

ACKNOWLEDGMENTS
The authors would like to thank the 2009 UC Berkeley iGEM Team (Joanna Chen, Richard Mar, Thien Nguyen, Nina Revko, and Bing Xia). We also would like to thank Josh Kittleson, Mariana Leguia, Cesar Rodriguez, and J. Christopher Anderson for discussions related to the development of Eugene.

REFERENCE
Bilitchenko, L., Liu, A., Cheung, S., Weeding, E., Xia, B., Leguia, M., Anderson, J. C., and Densmore, D. (2011). Eugene – A domain specific language for specifying and constraining synthetic biological parts, devices, and systems. PLoS ONE. (In Revision).
CHAPTER EIGHT

A Step-by-Step Introduction to Rule-Based Design of Synthetic Genetic Constructs Using GenoCAD
Mandy L. Wilson, Russell Hertzberg, Laura Adam, and Jean Peccoud

Contents
1. Introduction
2. Overview of GenoCAD
3. Requesting an Account on GenoCAD.org
4. Browsing the Parts Catalog
5. Searching for Parts
6. Using My Cart to Create Libraries
7. My Libraries
8. My Parts
9. Designing Sequences
10. Installing GenoCAD
11. Anticipated Evolutions
Acknowledgments
References
Abstract
GenoCAD is an open source web-based system that provides a streamlined, rule-driven process for designing genetic sequences. GenoCAD provides a graphical interface that allows users to design sequences consistent with formalized design strategies specific to a domain, organization, or project. Design strategies include limited sets of user-defined parts and rules indicating how these parts are to be combined in genetic constructs. In addition to reducing design time to minutes, GenoCAD improves the quality and reliability of the finished sequence by ensuring that the designs follow established rules of sequence construction. GenoCAD.org is a publicly available instance of GenoCAD that can be found at www.genocad.org. The source code and latest build are available from SourceForge to allow advanced users to install and customize GenoCAD for their unique needs.
This chapter focuses primarily on how the GenoCAD tools can be used to organize genetic parts into customized personal libraries, then how these libraries can be used to design sequences. In addition, GenoCAD’s parts management system and search capabilities are described in detail. Instructions are provided for installing a local instance of GenoCAD on a server. Some of the future enhancements of this rapidly evolving suite of applications are briefly described.

Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
Methods in Enzymology, Volume 498, ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00008-5
© 2011 Elsevier Inc. All rights reserved.
1. Introduction
The vision of rationally designing synthetic biological systems has proved more elusive than anticipated (Kwok, 2010). The complexity of artificial gene networks has not made significant progress since 2006 (Purnick and Weiss, 2009), which may indicate that the ad hoc processes used to develop proof-of-concept systems do not scale up well. The field still lacks a suitable framework to design more complex systems. Several authors have proposed to approach DNA sequences as a language to program biological systems (Clancy and Voigt, 2010; Goler et al., 2008). This idea may provide the foundation upon which it will be possible to develop computer-assisted design software applications for synthetic biology.
A fast growing ecology of software tools to assist synthetic biologists in the development of new biological systems has been reviewed recently (Marchisio and Stelling, 2009). Gene Designer (Villalobos et al., 2006) is a stand-alone application with a smooth graphical editor allowing users to drag and drop genetic parts into a larger DNA sequence. TinkerCell is another desktop application allowing users to design genetic constructs from standard parts and simulate the dynamics of the gene network they encode (Chandran et al., 2009). SynBIOSS is a web-based alternative to TinkerCell (Hill et al., 2008; Weeding et al., 2010). GEC (Pedersen and Philipps, 2009) and Clotho (www.clothocad.org) are programming environments specifically designed for synthetic biology.
Like TinkerCell or Gene Designer, GenoCAD has a graphical user interface accessible to users without any programming experience. Instead of being a stand-alone application, GenoCAD is a database-driven web-based application (Czar et al., 2009). Like Clotho and GEC, GenoCAD relies on a solid foundation derived from the theory of computer languages (Cai et al., 2007, 2009). GenoCAD is an open source application distributed under the Apache software license. An instance of GenoCAD is available at www.genocad.org and is referred to as GenoCAD.org in this chapter to differentiate it from the GenoCAD software itself.
2. Overview of GenoCAD
Before building sequences in GenoCAD, it is helpful to understand the overall structure of the application and how the various pieces fit together to provide the user with a safe and streamlined design experience. DNA sequences are made up of smaller standardized genetic DNA segments such as promoters, transcription terminators, genes, protein domains, and others. Within GenoCAD, these segments are referred to as “parts.” GenoCAD.org has a library with thousands of distinct basic parts (Cai et al., 2010; Peccoud et al., 2008). Users are not limited to the parts included in the global GenoCAD database. They can add new sequences in their personal workspace without having to make them available to other GenoCAD users.
Design strategies, composed of rules describing how parts can be combined, are called grammars in GenoCAD. The concept of a design strategy within GenoCAD is similar to the role a grammar plays within language. A writer may use a series of words that include a subject, a predicate, indirect objects, and prepositions, but if they fail to use the prescribed grammar for the language in question, the words may not come together to form a meaningful sentence. Design strategies in GenoCAD work much the same way. A design strategy uses rules to define which classes of parts, called categories, can be used to design a DNA sequence, and in what order they may appear. For example, the design of an E. coli gene expression cassette requires, at minimum, a Promoter, a Ribosome Binding Site (RBS), a Gene, and a Terminator. In GenoCAD, Promoters, RBS, Genes, and Terminators are categories, and the rules that ensure parts from each of the categories above are included define the E. coli design strategy. The design strategy ensures that categories may only be used in the appropriate order, so, in the example above, the RBS and Gene can only be inserted between a Promoter and Terminator. The rules of the design strategy also prevent parts from categories external to the E. coli design strategy from being included in the design. Design strategies are currently coded within the GenoCAD database.
That’s where personal libraries come in. Libraries are named lists of parts that include only the parts the user wants to have available when creating DNA sequences for a specific project. Libraries are always design strategy-specific (e.g., an E. coli library may not contain Yeast parts), and work in conjunction with their design strategies to prevent user error during the design of a sequence. Personal libraries in GenoCAD can contain a combination of user-defined and global parts. GenoCAD Designs are DNA sequences that have been constructed using design strategies, libraries, and parts.
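How a design strategy and a library jointly constrain a design can be illustrated with a deliberately simplified Python sketch. It is not GenoCAD code: GenoCAD’s grammars are richer than a single fixed category order, and the function check_design, the constant CASSETTE_ORDER, and all part names below are hypothetical.

```python
# Minimal category order for the E. coli expression cassette described above.
CASSETTE_ORDER = ["Promoter", "RBS", "Gene", "Terminator"]


def check_design(design, library, required_order=CASSETTE_ORDER):
    """Return a list of problems; an empty list means the design is accepted.

    design: ordered part names chosen by the user.
    library: mapping from part name to category for one design strategy.
    """
    errors = [f"{part} is not in the library" for part in design if part not in library]
    if errors:
        return errors  # parts outside the library are rejected outright
    categories = [library[part] for part in design]
    if categories != required_order:
        errors.append(
            "category order "
            + " > ".join(categories)
            + " does not follow "
            + " > ".join(required_order)
        )
    return errors


# Hypothetical personal library for an E. coli project.
library = {
    "prom_A": "Promoter",
    "rbs_A": "RBS",
    "gene_A": "Gene",
    "term_A": "Terminator",
}

accepted = check_design(["prom_A", "rbs_A", "gene_A", "term_A"], library)
misordered = check_design(["rbs_A", "prom_A", "gene_A", "term_A"], library)
foreign = check_design(["prom_A", "rbs_A", "gene_A", "yeast_term"], library)
```

The library check plays the role of the design strategy-specific library (no parts from elsewhere), while the order check stands in for the rules of the design strategy.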
3. Requesting an Account on GenoCAD.org
When accessing GenoCAD.org for the first time, the first page presented is the Parts tab. Although most of the available features of GenoCAD.org may be viewed without logging in, many of them are disabled or have limited functionality for the unauthenticated user. To take full advantage of the features GenoCAD.org has to offer, a user account is required. The link to apply for an account is located on the Log In tab. After loading the Log In page, the applicant clicks on the link “Don’t have an account? Request one.” At this point, the Application for Account page is loaded in the browser (Fig. 8.1).
The use of GenoCAD.org is free. However, in order to minimize the risk that this resource may be used to develop biological weapons, applications for accounts are reviewed to verify that the applicants can be identified and have legitimate needs to use the GenoCAD.org Web site. More information on the guidelines GenoCAD.org uses for this validation is available from the GenoCAD Privacy Policy referenced at the top of the Application for Account page. Accordingly, the more information the applicant provides on the application, the more quickly the account can be validated; if the request can be verified from the information submitted in the initial application, the turn-around time for approval is less than 48 hours. Once the account is approved, the applicant receives an e-mail that includes a link to log into the system. Users can change their personal information or password by clicking on the My Profile link available in the submenu at the top of most of the pages in the system.
This user registration process is specific to GenoCAD.org because the resource is publicly available. Organizations installing GenoCAD on their own servers could most probably link GenoCAD to an existing user directory (LDAP, Active Directory).
4. Browsing the Parts Catalog When logging into GenoCAD, the Parts tab, or parts listing, is the default page (Fig. 8.2). The tabs along the top of the page guide the user to different features of the GenoCAD application, while the navigational menu on the left side of the screen contains functionality pertaining only to the Parts tab. The default navigational tab selected is Public Libraries. GenoCAD.org has thousands of public, or global, parts spread across four design strategies and a number of public libraries that users may choose from in developing their own personal libraries. GenoCAD offers two different options to aid users in finding parts.
Step by Step Introduction to GenoCAD
Figure 8.1 Application for Account. In order to take full advantage of the site features, users of GenoCAD.org must request an account. The identity of the applicant is verified before the account is approved. Users are encouraged to fill out the application as completely as possible to expedite the review process.
Below the Public Libraries tab, there is a collapsible hierarchical menu that is made up of Design Strategies, Libraries, and Categories, although initially only design strategies (often called grammars) and libraries are displayed. To see which categories are represented beneath a specific library, the user can click on the small arrow next to that library to expand its list of categories. Alternatively, the Expand All button expands the menu to show the categories beneath all of the libraries, and the Collapse All button collapses the tree structure back to the design strategy/library levels. To view the parts available under a given library or category, the user can click on the library or category name in the hierarchical menu on the left. The parts that fall under the selected library or category are displayed on the
Figure 8.2 GenoCAD Parts Catalog. The Parts Catalog page is the first page users see upon logging into GenoCAD. In addition to providing a list of the parts available within the system, the Parts Catalog allows users to add their own parts to the system and to create libraries of parts that can be used to constrain the design process for specific projects.
right side of the screen, sorted by category. The default number of parts shown is 25 parts per page; to see more of the parts for the selected library or category, the user can page through the parts list using the First, Prev, Next, Last, or page number links at the bottom of the page. The number of parts displayed per page can be changed by changing the value in the Show Entries dropdown at the top of the page, where the options are 25, 50, 100, and All parts per page. Once the parts are loaded on the right side of the page, they can be sorted by clicking on the column labels at the top of the screen. Additional information on some of the parts may be retrieved by hovering the mouse over the tool tip where applicable. Part details for a particular part may be viewed by clicking on the desired part’s Part ID; the View Part screen displays a description of the part, what categories it falls under, what libraries it belongs to, and its DNA sequence. The Filter text box at the top of the listing allows the user to search for a text string contained within the Part ID, Name, or Description fields.
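The Filter box and the Show Entries paging described above can be sketched in a few lines of Python. This is only an illustration of the behavior as described, not GenoCAD's implementation: the `Part` class, the function names, and the choice of case-insensitive matching are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Part:
    # Hypothetical record holding only the three fields the Filter box searches.
    part_id: str
    name: str
    description: str


def filter_parts(parts: List[Part], text: str) -> List[Part]:
    """Keep parts whose Part ID, Name, or Description contains `text`.

    Case-insensitive matching is an assumption made for this sketch.
    """
    needle = text.lower()
    return [
        p for p in parts
        if needle in p.part_id.lower()
        or needle in p.name.lower()
        or needle in p.description.lower()
    ]


def page_of(parts: List[Part], page: int,
            per_page: Union[int, str] = 25) -> List[Part]:
    """Return one page of the listing (pages are numbered from 1).

    `per_page` mirrors the Show Entries options: 25, 50, 100, or "All".
    """
    if per_page == "All":
        return list(parts)
    start = (page - 1) * per_page
    return parts[start:start + per_page]
```

With 60 parts and the default of 25 per page, pages 1 and 2 are full and page 3 holds the remaining 10 parts.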
5. Searching for Parts To look for parts by attribute rather than by browsing within an existing library, the user may do a site search for a part. The textbox above the menu bars can be used for a quick text search; for example, if
“Promoter” is entered in the text box, the search returns all of the global promoters, along with any promoters the requestor has entered into the system and owns as a user. Any text attribute of a part can be searched using the Quick Search, including portions of the part DNA segment. Quick Search results are sorted by design strategy, library, and category, just as the Public Libraries parts are sorted. For a more complex text search, GenoCAD provides an Advanced Search option, available by clicking on a link at the top of the page. On the Advanced Search page, the user is able to build a query by selecting from a variety of attributes to limit the search results to more likely candidates. For example, to search for parts that contain the sequence “AGGA” and that are also Promoters, the dropdown lists and text boxes may be used to create a query for parts where the sequence CONTAINS AGGA AND category IS EQUAL to Promoters. As in the Quick Search, results are sorted by design strategy, library, and category. The Advanced Search also allows the user to do Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990) sequence homology searches against the parts in the catalog to identify sequences that have regions of local similarity to the search sequence, including those that share functional or evolutionary relationships or are members of the same gene family. In this case, GenoCAD restricts the BLAST search to the GenoCAD parts catalog. The Advanced Search also supports combined searches that apply a text search to the results of a BLAST search. To the user, a combined search appears to be done in a single step; the searching algorithm recognizes a combined search and handles the BLAST processing first, then applies the text filter.
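The example query above (sequence CONTAINS AGGA AND category IS EQUAL to Promoters) and the order of a combined search can be sketched as composable predicates. This is a minimal sketch under stated assumptions: the `Part` class, the function names, and the part IDs and sequences are hypothetical, and BLAST itself is not run here. The `blast_hits` argument simply stands in for the set of part IDs that a separate BLAST run would return.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class Part:
    # Hypothetical part record for illustration only.
    part_id: str
    name: str
    sequence: str
    categories: Set[str] = field(default_factory=set)


Condition = Callable[[Part], bool]


def sequence_contains(fragment: str) -> Condition:
    """Condition: the part's DNA sequence CONTAINS the fragment."""
    fragment = fragment.upper()
    return lambda part: fragment in part.sequence.upper()


def category_is(name: str) -> Condition:
    """Condition: category IS EQUAL to the given name."""
    return lambda part: name in part.categories


def all_of(*conditions: Condition) -> Condition:
    """Combine conditions with AND."""
    return lambda part: all(cond(part) for cond in conditions)


def search(parts: Iterable[Part], condition: Condition,
           blast_hits: Optional[Set[str]] = None) -> List[Part]:
    """Combined search: restrict to BLAST hits first, then apply the text filter."""
    if blast_hits is None:
        candidates = list(parts)
    else:
        candidates = [p for p in parts if p.part_id in blast_hits]
    return [p for p in candidates if condition(p)]


# Illustrative data: these IDs and sequences are made up, not catalog entries.
catalog = [
    Part("demo-1", "example promoter", "TTGACAGGATCC", {"Promoters"}),
    Part("demo-2", "example terminator", "CCAGGATTTT", {"Terminators"}),
    Part("demo-3", "another promoter", "TATAATGC", {"Promoters"}),
]
query = all_of(sequence_contains("AGGA"), category_is("Promoters"))
matches = search(catalog, query)
```

Here only `demo-1` satisfies both conditions: `demo-2` contains AGGA but is not a promoter, and `demo-3` is a promoter without the motif.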
6. Using My Cart to Create Libraries As users find parts they are interested in including in their libraries, they may add them to their Cart (Fig. 8.3). The paradigm here is similar to that of an online shopping cart: whereas on Amazon customers add books to a shopping cart and then order them all at the same time, the GenoCAD Cart serves as a temporary repository where parts of interest may be saved; when users are done looking for parts, the Cart may be used to create or append to personal libraries. To add a part from the parts listing or search results to the Cart, the mouse is used to select the checkbox beside the part or parts to include; to check all the parts on the page, the checkbox at the top of the list may be checked. When all the parts of interest on a particular page are checked, the Add Selected Parts to My Cart button is used to add the parts to the user’s cart.
Figure 8.3 My Cart. The user can use the cart as a temporary listing of parts of interest that they may wish to use in designs. When the user is done selecting parts, they can merge parts in My Cart directly into personal libraries.
When the user is ready to create a library, he clicks on the My Cart tab from the left navigation menu. The menu hierarchy shown below the My Cart tab is divided into design strategies and categories, but not libraries; parts are design strategy-specific, which means that parts from one design strategy cannot be used for a library from another design strategy. Before parts can be added to a library, the library must already exist. The New Library button pops up a window to allow the specifics of the new library to be defined—the design strategy this library adheres to, the name of the library, and the description. Once saved, the new library is added to the dropdown list at the top of the screen as long as there are parts in My Cart that belong to the selected design strategy. To merge parts from My Cart into the new library, the user starts out by selecting the design strategy or category from the hierarchical menu on the left that contains the parts to assign first. When the parts have loaded, as before, the user checks the specific parts to include from the right side of the screen; clicking the checkbox on the top of the page selects all the parts on that page. When finished selecting parts from this page, the target library is selected from the dropdown on the top of the page and the Merge to Library button adds the parts to the selected library. If the Remove from My Cart? checkbox is checked, the selected parts are removed from My Cart as they are merged to the target library; if this checkbox is not checked, then the parts remain available for assignment to other libraries. Users can remove unwanted parts from My Cart in a couple of ways. The Empty Cart button removes all of the parts from My Cart, and the Remove Selected button removes only the checked parts.
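The merge behavior described above can be modeled compactly. The following is a sketch only, assuming hypothetical `Cart` and `Library` classes rather than GenoCAD's actual data model. It captures two rules from the text: parts can only be merged into a library of the same design strategy, and the Remove from My Cart? checkbox decides whether merged parts stay in the cart.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Library:
    # Hypothetical stand-in for a personal library.
    name: str
    design_strategy: str
    part_ids: Set[str] = field(default_factory=set)


@dataclass
class Cart:
    # Maps each part ID in the cart to the design strategy it belongs to.
    parts: Dict[str, str] = field(default_factory=dict)

    def merge_to_library(self, selected: Iterable[str], library: Library,
                         remove_from_cart: bool = False) -> None:
        """Add the selected cart parts to `library`.

        All parts are validated before anything is changed, so a design
        strategy mismatch leaves both the cart and the library untouched.
        """
        selected = list(selected)
        for part_id in selected:
            if self.parts[part_id] != library.design_strategy:
                raise ValueError(
                    f"part {part_id!r} does not belong to design strategy "
                    f"{library.design_strategy!r}"
                )
        library.part_ids.update(selected)
        if remove_from_cart:
            # Mirrors the "Remove from My Cart?" checkbox being checked.
            for part_id in selected:
                del self.parts[part_id]

    def remove_selected(self, selected: Iterable[str]) -> None:
        """Mirrors the Remove Selected button."""
        for part_id in list(selected):
            self.parts.pop(part_id, None)

    def empty(self) -> None:
        """Mirrors the Empty Cart button."""
        self.parts.clear()
```

Leaving `remove_from_cart` at its default keeps the parts available for assignment to other libraries, which matches the unchecked-checkbox case in the text.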
7. My Libraries The My Libraries tab in the left navigation bar allows users to view their own personal libraries, either for editing library information, removing parts, or adding existing parts from their personal libraries to their Cart for use in a different library (Fig. 8.4). The My Libraries view is very similar to the Public Libraries view, except that the libraries displayed there are the logged-in user’s personal libraries, and My Libraries has additional functionality to allow users to manage their libraries and parts.
New Library: As on My Cart, the user may add a new (empty) library from the left navigation bar under My Libraries.
Management Console: Over the parts listing on the right side of the screen, there is a box that lists the name of the selected library and a Management Console link. Clicking this link exposes several additional options that can be used for managing libraries:
○ Add Selected to My Cart: This is identical to the corresponding button on the Public Libraries tab; it takes the selected parts and adds them to My Cart for assignment into other libraries.
○ Remove Selected from Library: This removes the selected parts from the currently selected library.
○ Edit Library: This allows the name or description of the library to be changed. Although the design strategy dropdown is displayed on the Edit Library screen, it cannot be changed at this point. This is because
Figure 8.4 My Libraries View: This view allows the user to modify their libraries. From here they can manage their libraries, add new parts, and reassign orphaned parts.
parts, like libraries, are design strategy-specific, so if a library were able to change design strategies after it was established, there could be a large number of parts in that library which would not be available for designs.
○ Delete Library: The Delete Library button can be used to remove libraries that were created by mistake or that are no longer needed. The Delete Library button assumes that the currently selected library is the one to delete. After prompting the user to verify the deletion, the library is deleted. The parts under the library, however, are not deleted, and are still available for assignment to other libraries. This is discussed in more detail under Orphaned Parts.
○ Add New Part: If the necessary part cannot be found within the GenoCAD parts repository, the user can add his or her own part. The Add New Part button allows the user to enter a part name, a DNA sequence, and a description (Fig. 8.5). After a design strategy/grammar is selected, the libraries available are limited to those that use that design strategy; new parts must be assigned to at least one library. The design strategy selection also limits which categories may be selected for the new part; at least one category must be selected, but all categories that apply may be selected. Something to consider is that GenoCAD does not allow duplicate DNA segments (parts) per library, so if one part (sequence) can be used under multiple categories, then it should be mapped to all appropriate categories rather than loading multiple parts, one for each category. It should also be noted that users may only view global parts and the parts they have added, so a
Figure 8.5 View Part. The user may view a part from any of the Parts Registry views. Users may also edit their own parts from either the My Parts or My Libraries views.
particular user’s search results can include his/her own parts that meet the criteria, but cannot include parts added by other users.
Orphaned Parts: At the bottom of the left navigation bar is an entry called “Orphaned Parts,” which is indicated with a Recycling icon. Orphaned parts are always personal parts, and parts become orphaned when they are removed from all of their parent libraries. A part is removed from the Orphaned Parts library by assigning it to another library. Parts that remain orphaned for a configurable period of time (2 months by default) are deleted by a cleanup process.
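The orphaning rule and the cleanup process described above can be illustrated with a small in-memory model. This is a sketch under assumptions, not the GenoCAD cleanup job: the `PartStore` class and its method names are hypothetical, and the 2-month default is approximated here as 60 days.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set

# Assumption: the 2-month default is approximated as 60 days.
DEFAULT_RETENTION = timedelta(days=60)


class PartStore:
    """Hypothetical in-memory model of personal parts and their libraries."""

    def __init__(self) -> None:
        self.libraries_of: Dict[str, Set[str]] = {}
        self.orphaned_since: Dict[str, datetime] = {}

    def add_to_library(self, part_id: str, library: str) -> None:
        """Assign a part to a library; this also rescues an orphaned part."""
        self.libraries_of.setdefault(part_id, set()).add(library)
        self.orphaned_since.pop(part_id, None)

    def remove_from_library(self, part_id: str, library: str,
                            now: datetime) -> None:
        """Remove a part from one library; it is orphaned once no library remains."""
        libraries = self.libraries_of[part_id]
        libraries.discard(library)
        if not libraries:
            self.orphaned_since.setdefault(part_id, now)

    def cleanup(self, now: datetime,
                retention: timedelta = DEFAULT_RETENTION) -> List[str]:
        """Delete parts orphaned for at least `retention`; return their IDs."""
        expired = [part_id for part_id, since in self.orphaned_since.items()
                   if now - since >= retention]
        for part_id in expired:
            del self.orphaned_since[part_id]
            del self.libraries_of[part_id]
        return expired
```

A part removed from only one of its two libraries is not orphaned, and an orphaned part that is reassigned before the retention period ends is never deleted.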
8. My Parts The My Parts tab allows users to view the parts they have added to the system. The navigation on this tab is only by design strategy and category, and library is not included in the hierarchy; this is because a part can belong to multiple libraries, but the libraries are all referencing the same part. For example, if a shared part is edited, the changes appear in all the libraries that use that part. The following options are available from the My Parts tab:
Edit Existing Parts: In GenoCAD, users may edit their own parts, but not global parts added to their private libraries. The user edits one of his own parts by clicking on the Part ID of the part he wishes to change. This pops up the Edit Part screen where the part’s name, sequence, description, libraries, and categories may be changed. As with libraries, parts are design strategy/grammar-specific and cannot be moved to other design strategies/grammars because it could affect existing libraries or designs.
New Part: The New Part button is on the left navigation bar, and it works the same way as the Add New Part button on the My Libraries tab.
Add Selected to My Cart: This button adds the selected parts to My Cart in the same way that parts can be added from Public Libraries or Global Libraries.
9. Designing Sequences Once users have finished assembling their personal libraries, they are ready to create design sequences using a process described elsewhere (Cai et al., 2007, 2010; Czar et al., 2009). Briefly, to begin creating sequences, the user clicks on the Design tab. When the Design page initially loads, the first step is to select a design strategy/grammar and a library; the defaults are “E. coli Expression Grammar” and “Public Parts Library (E. coli
Expression Grammar).” If a different design strategy/grammar is selected, the library dropdown automatically updates to include only libraries that match the selected design strategy. Once the design strategy and library have been selected, the user may begin building his design. Starting from the start symbol (usually S), the user iteratively selects a number of rewriting rules of the selected grammar, transforming each category into subcategories and then into parts (Fig. 8.6). When a part from the library has been assigned to each category that composes the design, the sequence is complete, and an alert appears on the upper-right-hand side of the screen (“Your sequence is ready!”). The Download button displays the designed sequence, which can be saved to a file. Since this sequence has been built using standard design strategies, and is made up of standardized genetic segments that are easier to check, there should be fewer errors in this sequence than would be encountered if it had been edited at the sequence level using traditional sequence editing software. At any point during the design process, the Save Design link at the upper-right-hand corner of the Design page may be selected to save the design for future viewing and editing. When saving a design, the user is prompted to enter the design name and a description of that design. If a design is saved before it is finished, then when the design is reviewed it displays its state at the step where the user left off, but it cannot be downloaded as a finished sequence. As with libraries and parts, it is possible to edit saved designs, or even make copies of designs so the copy can be modified without losing the original design. To view saved designs, the user clicks on My Designs from the submenu at the top of the screen.
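The design process described above, in which rewriting rules are applied from the start symbol S until only part categories remain and a library part is then assigned to each category, can be sketched as follows. The grammar, library, part IDs, and sequences below are toy placeholders invented for this illustration; they are not the E. coli Expression Grammar or real catalog parts, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

# Toy grammar: each rewritable category maps to its alternative rewriting rules.
RULES: Dict[str, List[List[str]]] = {
    "S": [["Cassette"], ["Cassette", "Cassette"]],
    "Cassette": [["Promoter", "RBS", "Gene", "Terminator"]],
}

# Toy library: category -> {part ID: DNA sequence}. Placeholder sequences only.
LIBRARY: Dict[str, Dict[str, str]] = {
    "Promoter": {"p1": "TTGACA"},
    "RBS": {"r1": "AGGAGG"},
    "Gene": {"g1": "ATGAAATAA"},
    "Terminator": {"t1": "TTTTTT"},
}


def derive(start: str, rules: Dict[str, List[List[str]]],
           rule_choices: Sequence[int]) -> List[str]:
    """Apply the chosen rule to the leftmost rewritable category until none remain."""
    symbols = [start]
    choices = iter(rule_choices)
    while True:
        index = next((i for i, s in enumerate(symbols) if s in rules), None)
        if index is None:
            return symbols  # only part categories are left
        try:
            choice = next(choices)
        except StopIteration:
            raise ValueError("design is incomplete: more rule choices needed") from None
        symbols[index:index + 1] = rules[symbols[index]][choice]


def assemble(categories: Sequence[str], part_ids: Sequence[str],
             library: Dict[str, Dict[str, str]]) -> str:
    """Assign one library part to each category and concatenate the sequences."""
    if len(categories) != len(part_ids):
        raise ValueError("every category needs exactly one part")
    pieces = []
    for category, part_id in zip(categories, part_ids):
        if part_id not in library[category]:
            raise ValueError(f"{part_id!r} is not a {category} in this library")
        pieces.append(library[category][part_id])
    return "".join(pieces)


categories = derive("S", RULES, [0, 0])
sequence = assemble(categories, ["p1", "r1", "g1", "t1"], LIBRARY)
```

Because parts can only be drawn from the category they are assigned to, a sequence built this way always conforms to the grammar, which is the property the text credits for reducing design errors.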
Figure 8.6 Design sequence. Using GenoCAD’s design strategies (also called grammars) and personalized libraries, the user can develop quality sequences in a very short period of time using a point-and-click user interface.
10. Installing GenoCAD After experimenting with GenoCAD on GenoCAD.org, advanced users will want to install GenoCAD on their own servers. This solution allows organizations to protect their intellectual property by keeping sensitive information behind their firewall. It also makes it possible to customize the GenoCAD database content to the specific needs of an organization instead of relying on generic grammars and parts libraries. GenoCAD is developed using the PHP Zend framework. This section describes how GenoCAD 1.5.4 may be installed on a local server. If installing a future version of GenoCAD, it is recommended that the instructions in the INSTALL.txt file be followed. To run GenoCAD, the latest stable version of PHP (5.0 or greater) and the latest stable release of MySQL are recommended.
1. Download the source code. To begin with, the source code should be downloaded from SourceForge. Although the source code is version-controlled using SourceForge’s instance of Subversion, it is possible to download the latest stable release as a package by clicking on the Download Now! button provided by SourceForge. If the latest stable release is downloaded as a package, it needs to be unzipped before proceeding to the next step.
2. Copy the genocad directory to the server’s webroot directory. To simplify the coordination of the Zend parts of the application, it is strongly recommended that the GenoCAD instance have its own domain URL, rather than installing it as a subdirectory under another root; for example, having the index.php file in https://mygenocadinstance.myinstitution.edu/ is better than having index.php in https://www.myinstitution.edu/mygenocadinstance.
3. Edit the php.ini file to set short_open_tag to “On.”
4. Create a Virtual Host for Zend. Zend works best if it behaves as if it were in the root directory of the domain, but because there is some non-Zend-compliant legacy code interfacing with parts of the application written using Zend, a Virtual Host must be created for the genocad/zend directory. The exact technique may vary depending on the web server being used, but there is a good description of what needs to be done in this Zend article: http://framework.zend.com/manual/en/learning.quickstart.create-project.html; the primary focus of this article is how to create a Zend project, but it includes a section on creating a virtual host. As an example of how to create a virtual host on an Apache server, the relevant portions of the public website’s httpd.conf are displayed below:
<Directory "/srv/www/vhosts/www.genocad.org/htdocs/no-ssl/zend/public/">
    DirectoryIndex index.php
    AllowOverride All
    Order allow,deny
    Allow from all
</Directory>
Alias /zend/ "/srv/www/vhosts/www.genocad.org/htdocs/no-ssl/zend/public/"
5. Create a MySQL database instance for the local GenoCAD database. Import genocad.sql to create a seed database. As an example, the command for importing this database would probably be something like this:
mysql -u <username> -p <name of genocad repository created in previous step> < genocad.sql
6. Modify the following files to set database connection settings to point to the database just created:
○ Common.php (set variables “Database”, “Host”, “Port”, “User”, and “Password” under the $CCConnectionSettings array)
○ common_genocad.php (set variables $server, $user, $pwd, and $db)
○ zend\application\configs\application.ini (set variables resources.db.params.host, resources.db.params.username, resources.db.params.password, and resources.db.params.dbname)
7. Edit the genocad/includes/top_nav.inc file to change all references to www.genocad.org to the local GenoCAD instance’s URL. If running genocad as a subdirectory off the webroot URL (instead of as a URL that routes directly to the genocad directory), the “/genocad” suffix needs to be included on the URL references (i.e., http://localhost/genocad).
8. Restart the webserver.
11. Anticipated Evolutions Since GenoCAD is an active research project, it is already possible to give an insight into some of the upcoming enhancements. In the current version of GenoCAD, users may view their own designs, parts, and libraries, but may not share them with other users. This level of granularity of the security model is adequate if the user is working alone, but
it is limiting in situations where different users need to collaborate on a project. The collaboration features will allow users to grant read or read/ write access to their libraries, parts, and designs. The permissions will be granular, so users could continue to keep some of their work private, while collaborating on other projects with others. Managers will also be able to create teams, so different people in an organization could share responsibility over different projects. Users can create their own parts and libraries, but there is no interface to allow them to create and modify their own design strategies. It is possible to customize existing grammars or create new grammars directly in the backend database. It is desirable to progressively give users more control over the design strategies they use in their project. It is, however, fairly challenging to formalize the grammar development process and develop user interfaces that can successfully guide users having no previous experience with formal grammars through that process (Oliveira et al., 2009). GenoCAD currently saves designs as the series of rewriting rules that produces a DNA sequence. This creates a number of potential problems. For instance, the sequence of a part used in a design may be edited in the database. In that case, the change of the part sequence would propagate to the design sequence unbeknownst to the design owner. It is anticipated that future designs will be saved as DNA sequences. The consistency of the design sequence with the latest version of the design strategy and parts library will be periodically checked using previously described parsing algorithms (Cai et al., 2007). Another upcoming feature is the possibility to translate DNA sequences into dynamic models of the molecular interactions encoded in the sequence. It was recently proposed to translate synthetic DNA sequences into SBML files using attribute grammars, an extension of the grammars currently used in GenoCAD. 
GenoCAD’s attribute grammars are grammars augmented with a semantic model describing the biological function of parts in the context in which they are used. Attribute grammars include part attributes and semantic actions associated with rules. Together they make it possible to translate the DNA sequence of a construct into equations describing the construct dynamics (Cai et al., 2009). Implementing such a translator in GenoCAD will make it possible to embed an existing SBML simulator in GenoCAD (Bergmann and Sauro, 2008). Once this feature has been implemented, it will be possible to automate the exploration of the design space generated by a grammar with the goal of finding optimal solutions to a design problem (Ball et al., 2010; Cai et al., 2009).
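To make the idea of part attributes and semantic actions concrete, the toy sketch below attaches numeric attributes to parts and uses one semantic action, tied to a single hypothetical rule, to synthesize an equation string. Everything here is an assumption made for illustration: the rule, the attribute names, the values, and the simple production/degradation form of the equation are invented, and this is not the model of Cai et al. (2009) or a planned GenoCAD feature.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PartNode:
    # A leaf of the derivation tree: a part with its (hypothetical) attributes.
    category: str
    name: str
    attributes: Dict[str, float]


def cassette_action(children: List[PartNode]) -> str:
    """Semantic action for the toy rule Cassette -> Promoter RBS Gene Terminator.

    It reads attributes from the child parts and synthesizes a single
    production/degradation equation for the gene product, as a string.
    """
    by_category = {child.category: child for child in children}
    k_tx = by_category["Promoter"].attributes["transcription_rate"]
    k_tl = by_category["RBS"].attributes["translation_rate"]
    gene = by_category["Gene"]
    k_deg = gene.attributes["degradation_rate"]
    return f"d[{gene.name}]/dt = {k_tx}*{k_tl} - {k_deg}*[{gene.name}]"


cassette = [
    PartNode("Promoter", "pDemo", {"transcription_rate": 0.5}),
    PartNode("RBS", "rDemo", {"translation_rate": 2.0}),
    PartNode("Gene", "GFP", {"degradation_rate": 0.1}),
    PartNode("Terminator", "tDemo", {}),
]
equation = cassette_action(cassette)
```

The point of the sketch is the division of labor: the grammar rule fixes the structure, the parts carry the parameters, and the action attached to the rule turns both into a model fragment that a simulator could consume.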
ACKNOWLEDGMENTS The development of GenoCAD is supported by NSF Award EF-0850100. Laura Adam is supported by a fellowship from SAIC.
REFERENCES
Altschul, S. F., Gish, W., et al. (1990). Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410.
Ball, D. A., Lux, M. W., et al. (2010). Co-design in synthetic biology: A system-level analysis of the development of an environmental sensing device. Pac. Symp. Biocomput. 385–396.
Bergmann, F. T., and Sauro, H. M. (2008). Comparing simulation results of SBML capable simulators. Bioinformatics 24(17), 1963–1965.
Cai, Y., Hartnett, B., et al. (2007). A syntactic model to design and verify synthetic genetic constructs derived from standard biological parts. Bioinformatics 23(20), 2760–2767.
Cai, Y., Lux, M. W., et al. (2009). Modeling structure-function relationships in synthetic DNA sequences using attribute grammars. PLoS Comput. Biol. 5(10), e1000529.
Cai, Y., Wilson, M. L., et al. (2010). GenoCAD for iGEM: A grammatical approach to the design of standard-compliant constructs. Nucleic Acids Res. 38(8), 2637–2644.
Chandran, D., Bergmann, F. T., et al. (2009). TinkerCell: Modular CAD tool for synthetic biology. J. Biol. Eng. 3(1), 19.
Clancy, K., and Voigt, C. A. (2010). Programming cells: Towards an automated ’Genetic Compiler’. Curr. Opin. Biotechnol. 21(4), 572–581.
Czar, M. J., Cai, Y., et al. (2009). Writing DNA with GenoCAD. Nucleic Acids Res. 37(Web Server Issue), W40–W47.
Goler, J. A., Bramlett, B. W., et al. (2008). Genetic design: Rising above the sequence. Trends Biotechnol. 26(10), 538–544.
Hill, A. D., Tomshine, J. R., et al. (2008). SynBioSS: The synthetic biology modeling suite. Bioinformatics 24(21), 2551–2553.
Kwok, R. (2010). Five hard truths for synthetic biology. Nature 463(7279), 288–290.
Marchisio, M. A., and Stelling, J. (2009). Computational design tools for synthetic biology. Curr. Opin. Biotechnol. 20(4), 479–485.
Oliveira, N., Henriques, P., et al. (2009). VisualLISA: Visual Programming Environment for Attribute Grammars Specification. Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 691–698.
Peccoud, J., Blauvelt, M. F., et al. (2008). Targeted development of registries of biological parts. PLoS ONE 3(7), e2671.
Pedersen, M., and Philipps, A. (2009). Toward programming languages for synthetic biology. J. R. Soc. Interface R. Soc. 6(Suppl. 4), S437–S450.
Purnick, P. E. M., and Weiss, R. (2009). The second wave of synthetic biology: From modules to systems. Nat. Rev. Mol. Cell Biol. 10(6), 410–422.
Villalobos, A., Ness, J. E., et al. (2006). Gene Designer: A synthetic biology tool for constructing artificial DNA segments. BMC Bioinform. 7, 285.
Weeding, E., Houle, J., et al. (2010). SynBioSS designer: A web-based tool for the automated generation of kinetic models for synthetic biological constructs. Brief. Bioinform. 11(4), 394–402.
CHAPTER NINE
Methods for Open Innovation on a Genome-Design Platform Associating Scientific, Commercial, and Educational Communities in Synthetic Biology
Tetsuro Toyoda
Contents
1. Introduction
1.1. Open optimization and OTR model
1.2. Freemium model for open-innovation platform
2. Various Platforms for Rational Genome Design
2.1. BIO bricks based on a DEM
2.2. CAD bricks based on a CEM
3. Research Use of Patents
4. Copyrights on Designed Sequences in Genetically Modified Organisms
5. Auditability of Designed Sequences for Safety Guidelines
6. Infrastructure for GenoCon
7. Cultivating Young Specialists for Genome Design
8. Perspectives
References
Abstract Synthetic biology requires both engineering efficiency and compliance with safety guidelines and ethics. Focusing on the rational construction of biological systems based on engineering principles, synthetic biology depends on a genome-design platform to explore the combinations of multiple biological components or BIO bricks for quickly producing innovative devices. This chapter explains the differences among various platform models and details a methodology for promoting open innovation within the scope of the statutory exemption of patent laws. The detailed platform adopts a centralized evaluation model (CEM), computer-aided design (CAD) bricks, and a freemium model. It is also important for the platform to support the legal aspects of copyrights as well as patent and safety guidelines because intellectual work including DNA sequences designed rationally by human intelligence is basically copyrightable. An informational platform with high traceability, transparency, auditability, and security is required for copyright proof, safety compliance, and incentive management for open innovation in synthetic biology. GenoCon, which we have organized and explained here, is a competition-styled, open-innovation method involving worldwide participants from scientific, commercial, and educational communities that aims to improve the designs of genomic sequences that confer a desired function on an organism. Using only a Web browser, a participating contributor proposes a design expressed with CAD bricks that generate a relevant DNA sequence, which is then experimentally and intensively evaluated by the GenoCon organizers. The CAD bricks that comprise programs and databases as a Semantic Web are developed, executed, shared, reused, and well stocked on the secure Semantic Web platform called the Scientists’ Networking System or SciNetS/SciNeS, based on which a CEM research center for synthetic biology and open innovation should be established.
Bioinformatics and Systems Engineering division, RIKEN, Yokohama, Kanagawa, Japan
Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00009-7. © 2011 Elsevier Inc. All rights reserved.
1. Introduction Open innovation is a paradigm that assumes that firms, as outsourcers, can and should use external ideas as well as internal ideas, and internal and external paths to the market, as the firms look to advance their technology (Chesbrough, 2003). For open innovation to succeed in the translational research phase, it is necessary to create incentives for individual contributors or participants, and to establish a platform where external participants are freely allowed to propose, test, and find better combinations among multiple biological components and technologies. The beneficiaries of open innovation are organizers, sponsors, and outsourcers, including not only for-profit organizations but also academics who are funded by nonprofit organizations. A small lab in industry or at a university (the outsourcer) with a desire to develop a practical use for its patented invention should be matched with potential partners (the contributors) that have their own original technologies or ideas. A successful match can help propel an invention toward practical applications. Here, we introduce a new framework for open innovation called “open optimization research (OOR)” or “open translational research (OTR),” in which the process for optimizing an invention is carried out by numerous participating contributors in an open manner, rather than by a few members of a closed group of inventors.
1.1. Open optimization and OTR model In an early stage of invention, it is often the case that a DNA sequence is initially synthesized by inventors in order to verify a given idea in the easiest way possible. However, this initially tested DNA sequence may not be
optimal among the patented DNA sequences that are claimed to confer a desired function to a living organism. Finding a more optimal design of the DNA sequences for the different genetic backgrounds of commercially valuable organisms than the initially tested sequence requires that researchers conduct enough tests to find optimal examples among the patented DNA sequences, which demands securing research financing. Designing optimal sequences also requires bioinformatics specialists with the skills to make use of the existing knowledge and data, while keeping in mind the genetic background of each organism. As a result of these requirements, many DNA sequences, while protected by patents, are never actually put into practical use by the inventors due to a lack of necessary resources (this phenomenon is referred to as the “Valley of Death”). OOR/OTR involves a competition to overcome this problem by providing opportunities to optimize the design of DNA sequences with the help of external participants in the research process (Fig. 9.1). For this purpose, the platform must offer incentives for outsourcers, contributors, and sponsors that make it attractive for them to join the competition (Fig. 9.2).
Figure 9.1 Process flow for OOR/OTR. [Diagram: the outsourcer/organizer (premium user of the platform) calls for designs within the scope of his/her patented invention, whose initial example DNA sequence is not yet an optimal design; contributors/contestants (free users of the platform) submit their designs to the contest; evaluators (funded by the beneficiary) evaluate the submitted designs and rank them experimentally (first, second, third); the outsourcer/organizer (the beneficiary) obtains better designs protected within the scope of the original patent. Contestants enjoy the contest and the outsourcer/organizer obtains better examples of DNA sequences.]
Figure 9.2 Incentives for participants in OOR/OTR. [Diagram: around OOR/OTR, the outsourcer/organizer/sponsor (the beneficiary) calls for designs within the scope of his/her patented invention and obtains optimal designs; researchers/technical partners (evaluators) obtain numerous examples for scientific studies and publications, with funding from the beneficiary; contestants/researchers/students (contributors) enjoy the contest experience, learn from it, receive awards, and contribute to the community; the PR sponsor (the beneficiary) gains education-related advertisement and learning-material business; the GenoCon platform (SciNetS) provides the infrastructure for the contest and shares designs and programs from researchers around the world.]
1.2. Freemium model for open-innovation platform
A free platform is necessary for most contributors, who want to participate without incurring costs such as securing a laboratory for experiments. Anyone who has an idea and minimum equipment, such as a Web browser, should be able to participate. Many people in the educational and academic community can easily engage, and even enjoy the learning experience, if they are allowed to use the platform for free. The costs should not be passed down to the individual contributors; rather, the beneficiary who will profit from the competition should bear most of the cost. The organizers, sponsors, and outsourcers who pose a problem call for external contributors through the platform to supply technological information that solves the problem, and they share the cost of the experiments. They are responsible for selecting the contributors’ proposals for the experimental phase, so that the cost of the experiments can be adjusted to their budget. In our cost model for an experimental open-innovation platform, the cost load should be concentrated on the beneficiary rather than distributed among the contributors. Freemium, one of the most common Web business models, is such a concentrated model (Anderson, 2009). A freemium platform takes different forms, with tiers ranging from free to premium services, hence the term freemium. A free version of the service needs to be provided to contributors such as scientific and educational communities, while a premium, paid version of the service is limited to those who receive the benefits of open innovation on the platform. For digital products, free users greatly outnumber paying users. A typical online service follows the rule that a small percentage of users support all the rest. In the freemium model, this means that the beneficiary pays for the premium version to support the platform, while
many external contributors receive free access to the online services. The reason this works is that the cost of providing the online services is close enough to zero to be considered negligible. Thus, in a suitable freemium model, only a premium user can organize an open-innovation project on the information platform, while free users cannot do so, but can participate as contributors to the project.
2. Various Platforms for Rational Genome Design
“Rational design” in biology is a methodology for finding the optimal structure of substances introduced into target organisms for the purpose of conferring new functions on them, by using an algorithm that determines the optimal structure logically from data obtained from the organism. It is widely accepted that “rational drug design” is a methodology for finding the optimal chemical structure of a compound that confers a desired function on an organism (Gane et al., 2000), while “rational genome design” is a new methodology for finding the optimal DNA sequences that confer new functions on a target organism (RIKEN, 2010). Although the de novo synthesis of the whole genome in a bacterium has become possible, the design of the genome sequence information depends on preexisting templates of highly accurate genome sequences (Gibson et al., 2010). In this chapter, “rational genome design” does not mean a de novo design of whole genome sequences, but rather the integration of designed genetic parts into the genome of an organism of the same or a different genetic background (Table 9.1). Most genetic parts, or BIO bricks, are stocked for the purpose of using them in a genome of the same species. BIO bricks are distributed to contributors around the world, who are asked to evaluate them on their own. Hence, the BIO bricks system adopts a distributed evaluation model (DEM). After rapid testing of a genetic device with the BIO bricks, a cross-species translation of the sequence, optimizing it for a genome of a different genetic background, is required for commercial application purposes (thus called “translational research”), and this is where the open-innovation strategy comes in. CAD bricks, as defined in Section 2.2, are the programs that generate improved sequences of BIO bricks so that they are optimized for the genetic background of other species. CAD bricks are openly distributed through the internet to worldwide contributors, who are asked by the organizers to improve them computationally so that the problem can be solved experimentally. The DNA sequences generated by the improved CAD bricks are experimentally evaluated intensively by the organizers. Hence, the CAD bricks system adopts a centralized evaluation model (CEM).
Table 9.1 Comparison between BIO brick and CAD brick
Category     Content     Model   Examples
BIO brick    Materials   DEM     BioBrick™, BIOFAB
CAD brick    Programs    CEM     GenoCon
2.1. BIO bricks based on a DEM
BIOFAB (an acronym for International Open Facility Advancing Biotechnology) aims to produce thousands of free, standardized DNA parts to shorten development time and to lower the cost of synthetic biology for academic or biotech laboratories so that they can quickly produce innovative devices that are not yet ready for mass production (Sanders, 2010). The BioBricks™ Foundation (BBF) is a nonprofit organization that supports and promotes the use of BIO bricks (or BioBricks™) for synthetic biology. The International Genetically Engineered Machine competition (iGEM; Smolke, 2009) is a DEM-style competition that values contributors’ donations of BIO brick materials and information, which the contributors evaluate through their own experimental efforts. The organizers of these BIO bricks adopt a DEM, and deliver to participants not only the BIO brick materials but also legal contracts that allow every participant to use the materials with respect to intellectual property, because an OOR/OTR must be carried out without infringing the intellectual property rights of others. It is desirable for a DEM to create a common BIO brick pool of patented research tools that can be freely used by any participant, because the evaluation process may well require some research-tool BIO bricks, and sharing the same tools for evaluation is necessary to make the evaluated results comparable with each other. A DEM organizer must hold the agreements for material transfer and nonassertion of intellectual property rights for each patented BIO brick prior to the assignment of materials (The BioBrick™, 2010).
2.2. CAD bricks based on a CEM
By analogy with an electronic part, synthetic biology assumes that a genetic part or BIO brick has a fixed function by itself. The function of a genetic part, however, can be well characterized and defined as an engineering part only if the part is always used in the same, fixed genetic background. The function of a genetic part is not absolute but relative to the genome, and depends deeply on the rest of the genome (the genetic background) and on the other genetic parts that are simultaneously integrated into the same genome (Kwok, 2009).
Figure 9.3 Equations of genome design with CAD brick. [Diagram: $x' = X(x \mid g)$ and $g = h + x'$, relating the BIO brick $x$ and the host genetic background $h$.]
A CAD brick $X(x \mid g)$ (i.e., a function of BIO brick $x$ under the condition of the genome $g$ of the transgenic organism) generates the sequence $x'$ that has been optimized for $g$, and is denoted as follows:
$$x' = X(x \mid g)$$
$$g = h + x'$$
where $x'$ is an improved form of $x$ under $g$, which is the new genome containing both $x'$ and $h$, which is the genetic background of the original host organism. Solving the following equation is equivalent to designing the genome $g$. Hence, optimizing the DNA sequence fragment $x'$ is virtually regarded as the “genome design” of $g$ (Fig. 9.3):
$$g = h + X(x \mid g)$$
The operator “+” indicates the operation of ligating the two adjacent terms or sequences. For example, the operation of ligating two BIO bricks $x$ and $y$ is represented as
$$z = x + y$$
while that of CAD bricks is represented as
$$z' = Z(X(x \mid g) + Y(y \mid g) \mid g)$$
$$g = h + z'$$
where $z'$ is the designed optimal DNA sequence and $h$ is the genetic background of the host organism. A typical problem to be solved in a competition would be given as, for example, “to design cross-species CAD bricks that generate the improved sequence of a pathway ($z = x + y$) that has already been confirmed to work in a bacterium through rapid engineering with BIO bricks so that the improved sequence works more effectively in a different commercially valuable organism and confers a more desired function to the organism
than the sequence that was initially designed with BIO bricks.”
A CAD brick is composed of a Semantic Web of information resources. Implemented as a program file assigned a uniform resource identifier (URI) that indicates the location of the file on the Semantic Web, a CAD brick contains a program and semantic links that include other CAD bricks and/or import data files containing BIO brick sequences. The Semantic Web is necessary to ensure the traceability and reusability of the CAD bricks. Traceability is necessary for auditing compliance with safety guidelines, while reusability is necessary for engineering efficiency. BIO bricks as DNA materials and CAD bricks as program modules that improve the BIO brick sequences complement each other to form public commons for synthetic biology. In the near future, however, the distribution of material BIO bricks could be replaced by the distribution of informational CAD bricks because of the exponential decrease in the cost of synthesizing DNA sequences and genomes (Carlson, 2009).
A CEM competition with CAD bricks is composed of the following four steps. In step 1, the organizers ask the contributors to try to design CAD bricks that generate a DNA sequence conferring on an organism a biological function requested by the organizers. In step 2, the contributors freely create CAD bricks and execute their programs, which generate DNA sequences, on the information platform (Fig. 9.4). In step 3, the contributors submit the designed CAD bricks, the generated DNA sequences, and the reports explaining their original design work. The platform must provide each contributor with a Web interface where the contributor can create all three items by using only a Web browser. In step 4, the organizer and technical partners select qualified DNA sequences for experiments by screening the submissions based on scientific knowledge and their originality. The organizers intensively evaluate the designs by synthesizing the selected DNA sequences, transferring them into genomes, and assaying the phenotype quantitatively.
Figure 9.4 Snapshot showing design process of a CAD brick in a browser-based programming environment provided by SciNetS.
The competition must be organized with appropriate safety guidelines. The activities of the participating contributors are restricted to bioinformatics work (steps 2 and 3), that is, to designing DNA sequences from genome-related information in a browser-based programming environment provided by the platform. “Design freely and test safely” should be the control criterion for the competition. The organizers will screen the submitted DNA sequences in the light of current safety and ethical standards, and select candidate sequences for experiments. Because a DNA sequence that exists only as information does not in itself cause safety or ethical problems, the platform accepts any freely designed CAD bricks and DNA sequences. GenoCon is a CEM-style competition that values the contributors’ improvements on CAD bricks to solve a given problem (Cyanoski, 2009).
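The CAD brick notation of this section ($x' = X(x \mid g)$, $g = h + x'$) can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the class name `CadBrick`, the `urn:example:` URIs, the sequences, and the arginine-codon rule are all invented for illustration and are not taken from GenoCon or SciNetS. Ligation is modeled as string concatenation, and the self-referential equation $g = h + X(x \mid g)$ is approached by simple repeated substitution, which is one possible strategy rather than the method used on the platform.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CadBrick:
    """A program module with a URI and links to the CAD bricks it includes."""
    uri: str                              # hypothetical identifier, not a real SciNetS URI
    optimize: Callable[[str, str], str]   # X(x | g): (BIO brick x, genome g) -> x'
    includes: List["CadBrick"] = field(default_factory=list)

    def trace(self) -> List[str]:
        """List the URIs of this brick and of every brick it includes."""
        uris = [self.uri]
        for brick in self.includes:
            uris.extend(brick.trace())
        return uris


def ligate(*sequences: str) -> str:
    """The '+' operator of the text: join adjacent sequences end to end."""
    return "".join(sequences)


def design_genome(h: str, x: str, brick: CadBrick, max_rounds: int = 10) -> Tuple[str, str]:
    """Iterate x' = X(x | g), g = h + x' until x' no longer changes."""
    x_prime = x
    g = ligate(h, x_prime)
    for _ in range(max_rounds):
        candidate = brick.optimize(x, g)
        if candidate == x_prime:
            break
        x_prime = candidate
        g = ligate(h, x_prime)
    return x_prime, g


def prefer_host_arginine_codon(x: str, g: str) -> str:
    """Toy rule: write arginine as AGA or CGT, whichever is more common in g."""
    preferred = "CGT" if g.count("CGT") >= g.count("AGA") else "AGA"
    codons = [x[i:i + 3] for i in range(0, len(x), 3)]
    return "".join(preferred if codon in ("AGA", "CGT") else codon for codon in codons)


# The 'includes' link is declarative here: it only feeds the audit trace.
codon_usage = CadBrick("urn:example:cad-brick/codon-usage", lambda x, g: x)
arginine = CadBrick("urn:example:cad-brick/arginine", prefer_host_arginine_codon, [codon_usage])

host = "CGTCGTCGTAAA"        # stand-in for the host background h
bio_brick = "ATGAGAAGATAA"   # stand-in for the BIO brick x
x_prime, genome = design_genome(host, bio_brick, arginine)
print(x_prime, genome, arginine.trace())
```

In this toy run the two AGA codons of the BIO brick are rewritten as CGT because the stand-in host uses CGT more often, and `trace()` returns the URIs that an auditor would need to follow.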
3. Research Use of Patents
It is expected that the CEM will compensate for some disadvantages of the DEM, and will enhance new discoveries or inventions by maintaining the researchers’ freedom to test various combinations among multiple biological components and technologies without requiring the participants to donate their patents to an independent organization, to put them in a common pool, or to grant an unlimited license to anyone. Such requirements may force a person or a company holding a valuable patented technology to give it up in order to participate in open innovation. Our proposed CEM must be carried out within the scope of the statutory exemption of patent use for research purposes. A number of countries have exemptions for the experimental or research use of patents. The OECD countries having statutory exemptions include Iceland (Patents Act 1993 s 3(3)), Japan (Section 69(1) of the Patent Law), Korea (Section 96(1) of the Patent Law), Mexico (Article 22 of the Industrial Property Law), Norway (Patents Act s 3), and Turkey (Section 75 of the Patents Decree-Law). Most EU countries have statutory exemptions that implement Article 27(b) of the Community Patent Convention (CPC). However, patent laws differ from country to country, and a careful legal examination of the patent law of the country where the OOR/OTR is to be held is necessary (Dent et al., 2006). Some countries, including the USA, do not have a statute granting a general research exemption or only have a very limited statutory research exemption (Nakayama, 2003). U.S. case law indicates that the exemption at common law is very narrow (Dent et al., 2006). The Japanese experimental-use exception is much broader than that of the United States. Japanese case law and prevailing theories have developed to require that the experimental use of patents be directed toward the advancement of technology (Someno, 1988). This is one of the reasons why we do not conduct intensive evaluation work of technologies in the USA, but rather do it in Japan or in a country that has a broader research exemption. A DEM in which evaluation work is conducted by each contributor in different countries may be more liable to infringe a patent in a local country because of the differences in research exemptions. Our CEM, in which the evaluation phase is carried out only in a country with a broad research exemption, may protect each contributor from infringement. Furthermore, the CEM allows us to evaluate each performance quantitatively under the same experimental conditions.
“Experiment or research,” as referred to in Article 69(1) of the Japanese Patent Law, is specified from both the object and purpose viewpoints. The object of the experiment or research must be the material or method of the patent itself, and the purpose of its use must fall into one of three categories: (1) patentability research, (2) function research, and (3) experiments for the purposes of improvement and/or development (Someno, 1988). While the use of research tools does not fall into any of these categories, as shown in the previous section, the development of a CAD brick that aims to improve the BIO brick falls into the third category.
For example, a CAD brick developed to generate an improved design of a patented BIO brick, so that it works more effectively than the original BIO brick in an organism of a different genetic background, falls into the third category. It is highly probable that comparing the performance of the improved designs generated by CAD bricks with that of the original BIO brick is regarded as use for a research purpose and can be carried out within the scope of the statutory exemption of their patents, while legal licensing is required prior to the commercial use of the patents.
4. Copyrights on Designed Sequences in Genetically Modified Organisms
A natural DNA sequence is not a work of authorship (Cooper, 1982). Although genetically modified organisms as conventionally defined may not clearly fall into the category of copyrightable work, the rapid growth of technology for synthesizing a long DNA molecule of any human-designed sequence is enabling us to create an organism whose genome carries a DNA fragment as the tangible record of the design as a work of authorship. The seminal article of Professor Irving Kayton proposes that copyright subsists in virtually all original works of genetic scientists when they are created (Kayton, 1982). We recommend assuming that anything designed on the platform is potentially copyrightable, in case such a copyright becomes more widely recognized and strengthened in the future. It is safe to say that a CAD brick can be protected by copyrighting the program it contains. The DNA sequence generated automatically by a program that designs it rationally may also be copyrightable as a scientific work, depending upon the contribution of the researcher to such automatic generation. Organizers are advised to establish clear license agreements, such as Creative Commons Attribution-ShareAlike licenses or GNU General Public Licenses, on CAD bricks and their generated sequences before starting the competition, so that genome-design theories and programs submitted by contestants from all over the world can be freely shared and reused for open-innovation purposes. The correspondence between the design expressed with CAD bricks and the sequence generated from it strengthens the traceability and transparency of the entire design process and provides proof that the sequence was actually designed by the originators of the CAD bricks. The platform needs to support the recording functions necessary for proving the entire history of all the design work.
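One way to picture the recording functions mentioned above is an append-only log in which every design step commits to the step before it. The Python sketch below is a hypothetical illustration, not a description of how SciNetS records design history: the field names, the `urn:example:` URI, the author label, and the use of SHA-256 hash chaining are all assumptions made for the example.

```python
import hashlib
import json
from typing import Dict, List, Optional


def _digest(entry: Dict[str, Optional[str]]) -> str:
    """Hash every field of an entry except the stored hash itself."""
    body = {key: value for key, value in entry.items() if key != "entry_sha256"}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def record_step(log: List[Dict], brick_uri: str, author: str, sequence: str) -> Dict:
    """Append one design step that links the CAD brick, its author, and its output."""
    entry = {
        "brick_uri": brick_uri,   # hypothetical identifier of the CAD brick that ran
        "author": author,
        "sequence_sha256": hashlib.sha256(sequence.encode("ascii")).hexdigest(),
        "previous": log[-1]["entry_sha256"] if log else None,
    }
    entry["entry_sha256"] = _digest(entry)
    log.append(entry)
    return entry


def verify(log: List[Dict]) -> bool:
    """Check that no entry was altered and that the chain is unbroken."""
    previous = None
    for entry in log:
        if entry["previous"] != previous or entry["entry_sha256"] != _digest(entry):
            return False
        previous = entry["entry_sha256"]
    return True


history: List[Dict] = []
record_step(history, "urn:example:cad-brick/arginine", "contestant-1", "ATGAGAAGATAA")
record_step(history, "urn:example:cad-brick/arginine", "contestant-1", "ATGCGTCGTTAA")
intact = verify(history)

# Editing an earlier entry afterwards breaks verification.
history[0]["author"] = "someone-else"
tampered_detected = not verify(history)
```

A log of this kind only shows that the recorded history is internally consistent; establishing who actually authored a design would still depend on how the platform authenticates its users.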
5. Auditability of Designed Sequences for Safety Guidelines
Compliance with safety guidelines is required in each of the following three phases: (1) design phase, (2) DNA synthesis phase, and (3) transgenic phase. For the design phase, the required auditability over the entire design process covers the origins of all sequence information used for the design, the functions of the sequence, hazard and toxicity information on the sequence, the organism from which the sequence originates, and the computational processes applied to the sequences in the cascade of design flow using CAD bricks. The designed genome, or a long DNA containing many DNA parts, may well become so complex that it is desirable to automate the checking of each resource used for the design. The information on biological parts, materials, or technologies needs to be checked in the light of biosecurity (Biosecurity Working Group, RNAAS, 2008). It is suitable to prepare the above-mentioned information as a Semantic Web, so that artificial-intelligence technology can be applied to check each design logically and reliably, without the human errors that occur during audit processes.
For the DNA synthesis phase, a notice has been issued by the U.S. Department of Health and Human Services (US Department of HHS, 2009): prior to synthesizing a DNA sequence, the sequence must be compared with those of biohazards. The U.S. Government recommends the use of a software tool that utilizes both a global and a local sequence-alignment technique; the most popular algorithm that meets both requirements is the BLAST search tool (Johnson et al., 2008). The sequence-screening approach covers both similarity over the full length of the sequence being screened and the identification of similar regions within longer segments that are otherwise not alike. Specific criteria for the statistical significance of a hit (BLAST’s e-values) or for percentage identity are not recommended because these details depend on the specific screening protocol. In the “best match” approach, the sequence with the greatest percentage identity over the entire 66-amino-acid sequence should be considered the “best match,” regardless of the statistical significance or percentage identity (US Department of HHS, 2009).
For the transgenic phase, compliance with the law ensuring the precise and smooth implementation of the Cartagena Protocol, which concerns the conservation and sustainable use of biological diversity through regulation of the use of living modified organisms, is required (SCBD, 2000). Auditability over the design phase is indispensable for determining the category or class level of containment measures for genetic recombination experiments. The traceability and transparency of every single process in the complex design must be supported by the information platform.
Because the design for a genome is so complex, the forms for regulatory applications will need to be standardized into more computer-processable formats with markup language based on the Semantic Web standard.
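The “best match” idea described for the DNA synthesis phase can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the recommended tool: it uses plain ungapped percentage identity instead of BLAST, it assumes the query has already been translated to protein, and the reference names and sequences are made up. Only the 66-residue window length comes from the text.

```python
from typing import Dict, List, Optional, Tuple

WINDOW = 66  # window length in amino acids, taken from the text


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(p == q for p, q in zip(a, b)) / len(a)


def best_match(window: str, references: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return the reference with the highest ungapped identity to the window.

    The name is None when no reference shares a single position with the window.
    """
    best_name, best_identity = None, 0.0
    for name, reference in references.items():
        for start in range(len(reference) - len(window) + 1):
            identity = percent_identity(window, reference[start:start + len(window)])
            if identity > best_identity:
                best_name, best_identity = name, identity
    return best_name, best_identity


def screen(protein: str, references: Dict[str, str]) -> List[Tuple[int, Optional[str], float]]:
    """Report the best match for each consecutive 66-residue window of the query.

    A trailing partial window is ignored in this sketch.
    """
    report = []
    for offset in range(0, len(protein) - WINDOW + 1, WINDOW):
        name, identity = best_match(protein[offset:offset + WINDOW], references)
        report.append((offset, name, identity))
    return report


# Made-up reference set; a real screen would use curated sequences of concern.
references = {"concern_A": "M" + "K" * 70, "benign_B": "A" * 80}
query = "K" * 66 + "A" * 65 + "K"
report = screen(query, references)
```

Here the first window is reported against `concern_A` and the second against `benign_B`, each with its identity value; deciding what to do with a best match is left to the screening protocol, as the guidance itself does.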
6. Infrastructure for GenoCon
Built upon Semantic Web technology, GenoCon offers contestants a free bioinformatics platform on which to compete in technologies for rational genome design. To succeed, contestants must make effective use of the genomic and protein data contained in the Semantic Web database clusters to design DNA sequences that improve plant physiology. The RIKEN Scientists’ Networking System (SciNetS) is a Web system built upon the Semantic Web. Making use of the latest cloud-computing technology, the system is capable of simultaneously hosting thousands of virtual laboratories, providing every individual user with access control to each data item. The system is being developed and operated by the Bioinformatics and Systems Engineering (BASE) division of RIKEN. Virtual labs on the RIKEN SciNetS system enable each lab owner (premium user) to easily create and publish databases without the need to maintain individual Web servers. This information infrastructure can be beneficial for users in implementing international collaborative research. In addition, each virtual lab can also organize competitions or initiate open-innovation research. GenoCon is organized using one of the virtual labs on RIKEN SciNetS. Each contestant in GenoCon establishes a workspace (to create CAD bricks, as shown in Fig. 9.4) in the virtual lab (named “GenoCon”). Virtual labs on RIKEN SciNetS are used as an integrated life science information infrastructure, incorporating functions including a programming environment, a digital lab-notes system, and information resources for biomass-engineering research activities. By providing some of these functions, GenoCon offers an environment for open-innovation research in which contestants can submit better DNA-sequence designs that potentially confer on the plant Arabidopsis the target function assigned by the GenoCon organizers.
7. Cultivating Young Specialists for Genome Design
Contributors’ incentives can be considered from both the academic and commercial viewpoints. Some contributors, especially young scientists and students, want their design skills to be recognized by academic communities as scientific art. The organizer must clearly present their designs on the platform as work originally created by the contributor. Their contributions are evaluated based on the experimental performance improved by their designs. The CAD bricks originally created for the design also serve as their contribution to the genome-designers’ community. The more their CAD bricks are reused by other designers, the greater the reputation they will gain. Thus, prior to the contest, the organizer must ask the contributors to publish their designs as reusable commons. It may also be possible to entice contributors to participate in the competition by giving them incentives, such as the chance to include their invention in the design for a patent. However, the academic honor of being the original author of the designed organism should be the main incentive for the contributors. GenoCon also offers, in addition to categories for international researchers and university students, a category specifically for high-school students. Similar to ROBOCON (a robot contest), GenoCon provides opportunities for young people to learn about the most cutting-edge science with a sense of pleasure, bringing intellectual excitement to the field of life science, and supporting a future generation of scientists.
8. Perspectives
To strengthen open innovation in synthetic biology, especially toward bio-based green innovations, CEM research centers will need to be established at RIKEN and at institutes worldwide, with a strong bioinformatics platform, such as SciNetS, as well as biological laboratories that experiment intensively and uphold biosecurity guidelines. So far, the authorship of a genetically designed organism has yet to be recognized. In the future, however, if the authorship is recognized by copyright law and the contribution to each designed part in a genome becomes referable from scientific papers, it should become a strong incentive for academic researchers to contribute to OOR/OTR competitions. To create commons in synthetic biology, it may also become necessary to establish a new copyright license that is applicable not only to artistic work and programs but also to genetic work, including DNA sequences and their relevant organisms, because the current Creative Commons license and GNU General Public License have been developed to apply only to artistic work and programs, respectively.
REFERENCES
Anderson, C. (2009). Free: The Future of a Radical Price. New York: Hyperion Books, p. 26.
Carlson, R. (2009). The changing economics of DNA synthesis. Nat. Biotechnol. 27, 1091–1094.
Chesbrough, H. W. (2003). Open Innovation: The New Imperative for Creating and Profiting from Technology. Boston: Harvard Business School Press, p. xxiv.
Cooper, I. P. (1982 to date). Other Forms of Protection for Biotechnology. Biotechnology and the Law. West Group, New York, Chapter 14, 14(3), pp. 14–21, 14–29.
Cyanoski, D. (2009). Synthetic-biology competition launches. Nat. News 457, 516. http://www.nature.com/news/2010/100602/full/news.2010.271.html.
Dent, C., et al. (2006). Research Use of Patented Knowledge: A Review. STI Working Paper 2006/2, p. 17.
Gane, P. J., et al. (2000). Recent advances in structure-based rational drug design. Curr. Opin. Struct. Biol. 10, 401–404.
Gibson, D. G., et al. (2010). Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52. http://www.sciencemag.org/cgi/content/full/329/5987/52.
Industrial Property Law, Article 22, Mexico.
Johnson, M., et al. (2008). NCBI BLAST: A better web interface. Nucleic Acids Res. 36, W5–W9. http://nar.oxfordjournals.org/content/36/suppl_2/W5.full.pdf+html.
Kayton, I. (1982). Copyright in living genetically engineered works. George Wash. Law Rev. 50, 191.
Kwok, R. (2009). Five hard truths for synthetic biology. Nature 463(7279), 288–290. http://www.nature.com/news/2010/100120/pdf/463288a.pdf.
Nakayama, I. (2003). Relationship between ‘research freedom’ and patent rights seen from a Japan and U.S. comparison: Developments and issues in ‘the experimental or research use exemption’. AIPPI 48(6), 2ff.
Patent Law, Section 69(1), Japan.
Patent Law, Section 96(1), Korea.
Patents Act 1993 s 3(3), Iceland.
Patents Act s 3, Norway.
Patents Decree-Law, Section 75, Turkey.
RIKEN (2010). GenoCon: International Science and Technology Competition Supporting Future Scientists in Rational Genome Design for Synthetic Biology. http://www.riken.go.jp/engn/r-world/info/info/2010/100524/index.html.
Sanders, R. (2010). NSF grant to launch world’s first open-source genetic parts production facility. UC Berkeley News. http://www.berkeley.edu/news/media/releases/2010/01/20_biofab_synthetic_biology.shtml.
Secretariat of the Convention on Biological Diversity (2000). Cartagena Protocol on Biosafety to the Convention on Biological Diversity. http://bch.cbd.int/protocol/publications/cartagenaprotocol-en.pdf.
Smolke, C. D. (2009). Building outside of the box: iGEM and the BioBricks Foundation. Nat. Biotechnol. 27(12), 1099–1102.
Someno, K. (1988). Exploitation of patented invention for experimentation and research. AIPPI Jpn. Group J. 33(3), 5.
The BioBrick™ (2010). The BioBrick™ Public Agreement Draft Version 1a. http://dspace.mit.edu/bitstream/handle/1721.1/50999/BPA_draft_v1a.pdf?sequence=1.
The Biosecurity Working Group, Royal Netherlands Academy of Arts and Sciences (2008). A Code of Conduct for Biosecurity. http://www.knaw.nl/publicaties/pdf/20071092.pdf.
US Department of Health and Human Services (2009). Screening Framework Guidance for Synthetic Double-Stranded DNA Providers. http://www.gpo.gov/fdsys/pkg/FR-2009-1127/pdf/E9-28328.pdf.
CHAPTER TEN
Recursive Construction and Error Correction of DNA Molecules and Libraries from Synthetic and Natural DNA
Tuval Ben Yehezkel,* Gregory Linshiz,†,‡ Shai Kaplan,†,‡ Ilan Gronau,‡ Sivan Ravid,‡ Rivka Adar,† and Ehud Shapiro*,§
* Department of Biological Chemistry, Weizmann Institute of Science, Rehovot, Israel
† Weizmann Institute of Science, Biological Chemistry Department, Rehovot, Israel
‡ Weizmann Institute of Science, Computer Science and Mathematics Department, Rehovot, Israel
§ Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00010-3 © 2011 Elsevier Inc. All rights reserved.
Contents
1. Introduction
1.1. Error correction of DNA
1.2. The biochemistry of recursive DNA construction
1.3. Algorithms for recursive DNA construction
1.4. Error correction in recursive DNA construction
2. Protocols
2.1. PCR primer phosphorylation
2.2. Overlap extension elongation between two ssDNA fragments
2.3. PCR amplification of the above elongation product with two primers, one of which is phosphorylated
2.4. Lambda exonuclease digestion of the above PCR product to generate ssDNA
2.5. DNA purifications
References
Abstract
Making error-free, custom DNA assemblies from potentially faulty building blocks is a fundamental challenge in synthetic biology. Here, we show how recursion can be used to address this challenge using a recursive procedure that constructs error-free DNA molecules and their libraries from error-prone synthetic oligonucleotides and naturally existing DNA. Specifically, we describe how divide and conquer (D&C), the quintessential recursive problem-solving technique, is applied in silico to divide target DNA sequences into overlapping, albeit error-prone, oligonucleotides, and how recursive construction is applied in vitro to combine them to form error-prone DNA molecules. To correct DNA sequence errors, error-free fragments of these molecules are then identified, extracted, and used as new, typically longer and more accurate, inputs to another iteration of the recursive construction procedure; the entire process repeats until an error-free target molecule is formed. The method allows combining synthetic and natural DNA fragments into error-free designer DNA libraries, thus providing a foundation for the design and construction of complex synthetic DNA assemblies.
1. Introduction
Making faultless DNA assemblies from potentially faulty building blocks is a fundamental challenge in synthetic biology (Carr et al., 2004; Forster and Church, 2006). Complex mathematical objects such as functions (Rogers, 1967), fractals (Mandelbrot, 1982), natural and formal languages (Chomsky, 1964; Hopcroft and Ullman, 1979), and computer data structures (Aho et al., 1983) are typically described using recursion. Although the promise of recursion for physical construction has been recognized (Merkle, 1997), its application in engineering has been scarce (Knight, 2003; http://www.sloning.de/). Here, we present a recursive procedure for constructing faultless DNA molecules and libraries from potentially faulty short synthetic oligonucleotides and existing DNA fragments.
Long DNA molecules encoding novel genetic elements are in broad demand (Forster and Church, 2006; Heinemann and Panke, 2006; Ryu and Nam, 2000; Tian et al., 2004); however, only short oligonucleotides (< 100 nt) are made quickly and cheaply by machines (Caruthers, 1985). Such oligonucleotides are used as building blocks to construct longer DNA molecules using one of two basic construction strategies, namely polymerase cycling assembly (PCA) of multiple overlapping synthetic oligonucleotides (Stemmer et al., 1995) and ligation of synthetic oligonucleotides (Au et al., 1998).
1.1. Error correction of DNA
The utility of synthetic DNA constructs in biology depends on their being free of sequence errors (Carr et al., 2004; Forster and Church, 2006; Tian et al., 2004), yet the synthetic oligonucleotides serving as their building blocks are error prone (about one sequence error per 160 nt; Forster and Church, 2006; Tian et al., 2004). Therefore, all DNA construction protocols struggle with the labor-intensive, time-consuming task of cloning and sequencing synthetic DNA fragments, seeking an error-free one. If none is found, a clone with sufficiently few errors that can be patched without undue effort using site-directed mutagenesis (Hutchison et al., 1978) is used. The problem is exacerbated for longer synthetic DNA, as the probability that a molecule, and hence a clone, is error free decreases exponentially with its length. To partially address this problem, a two-step assembly process is commonly applied in which 300- to 500-bp fragments are constructed, cloned, sequence-validated, and then assembled into the desired target molecule (Xiong et al., 2004). Other methods enrich error-free DNA molecules with the use of special mismatch-binding proteins (Forster and Church, 2006; Tian et al., 2004) or improve site-directed mutagenesis (Xiong et al., 2006) to address this fundamental problem in de novo DNA construction.
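To make the exponential decay concrete, the following minimal sketch assumes that errors occur independently at the quoted rate of about one per 160 nt. The function names and the 95% target for finding at least one error-free clone are illustrative assumptions, not part of the chapter's method.

```python
import math

ERROR_RATE_PER_NT = 1 / 160  # about one error per 160 nt, as quoted in the text


def p_error_free(length_nt, error_rate=ERROR_RATE_PER_NT):
    """Probability that a molecule of this length carries no error,
    assuming errors hit each nucleotide independently (an assumption)."""
    return (1.0 - error_rate) ** length_nt


def clones_needed(length_nt, confidence=0.95, error_rate=ERROR_RATE_PER_NT):
    """Smallest n such that at least one of n clones is error free
    with probability >= confidence."""
    p = p_error_free(length_nt, error_rate)
    if p >= 1.0:
        return 1
    # Solve 1 - (1 - p)^n >= confidence for n; log1p keeps precision for tiny p.
    return max(1, math.ceil(math.log(1.0 - confidence) / math.log1p(-p)))


for length in (100, 500, 1000, 3000):
    print(length, round(p_error_free(length), 4), clones_needed(length))
```

Under these assumptions the required number of clones grows roughly as the inverse of the error-free probability, which is why screening alone becomes impractical for targets of a kilobase or more.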
1.2. The biochemistry of recursive DNA construction
Our procedure for constructing error-free DNA molecules integrates recursive construction and error correction. It uses divide and conquer (D&C; Aho et al., 1983; Alsuwaiyel, 1999), the quintessential recursive problem-solving technique, to construct long DNA molecules from short oligonucleotides and then to error-correct the resulting molecules, until an error-free molecule is obtained. D&C solves a problem (in our case, the construction of a particular ssDNA molecule) by dividing it in silico into two smaller subproblems (in our case, the construction of two shorter ssDNA molecules, as shown in Fig. 10.1, top); solving each subproblem recursively, using D&C; and combining in vitro the solutions to the subproblems into a solution to the original problem (in our case, combining the two ssDNA molecules into the desired longer ssDNA molecule, as shown in Fig. 10.1). If the problem is small enough (in our case, the ssDNA molecule is short enough), it is not divided further but is solved directly (in our case, synthesized as an oligo).
A fundamental prerequisite of a recursive procedure is that its output be of the same type as its inputs. Examples of DNA composition procedures that do not comply with this input–output compatibility requirement include overlap extension, which takes two ssDNA fragments that overlap at their 3′ ends as input and produces the corresponding elongated dsDNA molecule as output, and PCA, mentioned above, which takes two or more overlapping DNA molecules as input and produces a mixture of the input molecules and some elongated dsDNA molecules as output. Our construction procedure (shown in Fig. 10.1B and Figs. 10.A1 and 10.A2) is thus designed so that it accepts two overlapping ssDNA molecules as input and produces an elongated ssDNA molecule as its output (Fig. 10.1B), utilizing three known enzymatic reactions: overlap extension between ssDNAs, PCR with 5′ phosphate labeling, and Lambda exonuclease-mediated ssDNA generation. It can be applied recursively since its input and output are of the same type (ssDNA). In principle, a recursive construction procedure that uses dsDNA as its input and output can also be devised. We chose ssDNA rather than dsDNA because the extension of overlapping ssDNA molecules can be performed in quasi-equilibrium (i.e., denaturation and then very slow cooling to annealing temperature), thereby greatly improving control, yield, and specificity (see results for CE fragment analysis of composition reactions) of elongation products. This is in contrast to the rapid thermal cycling conditions commonly used when elongating two or more dsDNA molecules, which often result in low elongation yield and in nonspecific elongated products (see Fig. 10.A3).
[Figure 10.1, graphic not reproduced. Panel A: workflow for the 768-nt target GFP sequence: recursive division in silico (0.5 h) into fragments 1–440 and 411–768, then 1–242, 219–440, 411–590, and 587–768, down to basic oligo sequences; chemical synthesis (4–8 h); recursive construction in vitro (~14 h); cloning and sequencing (~24 h); computing the minimal cut in silico (0.1 h); amplification of error-free fragments from clones (~1 h); recursive reconstruction in vitro (~7 h); recursive reconstruction of a 3-kb fragment from synthetic GFP and a natural fragment (~7 h). Panel B: core step of recursive construction. Input: two overlapping ssDNA molecules. Output: one elongated ssDNA molecule. Implementation: elongation, PCR with phosphorylated primer, Lambda exonuclease.]
Figure 10.1 Recursive construction of error-free DNA molecules from error-prone oligonucleotides. (A) Recursive construction of the GFP DNA. The divide and conquer procedure, as applied to the construction of the 768-nt GFP, is illustrated from top to bottom. The target sequence is recursively divided in silico into overlapping oligonucleotide sequences (16 oligos of average size 75 bp for the synthesis of GFP). The specified oligos are synthesized by conventional means and serve as inputs (in blue) for recursive construction, performed in vitro. Construction proceeds by recursively combining pairs of overlapping ssDNA molecules into ever longer ssDNA molecules, as described in (B) until the target molecule is formed. Target molecules thus produced typically have the same error rate as their source oligos and hence are subject to recursive error correction as follows. A certain number of target molecules are cloned and sequenced (this number is optimized as described in the text, seven in the case of GFP). Errors (marked in red) are identified. Error-free segments found in the clones are then amplified from the clones and used as inputs to another recursive reconstruction of the target molecule (one half molecule and two quarter molecules in this case). The error-free segments are chosen to correspond to nodes in the recursive construction tree, so that they can be amplified using the same primers used in the initial procedure and are further optimized, using the mathematical notion of minimal cut in a graph (explained in the text) so as to minimize the number of reactions needed for reconstruction (only two reactions out of the total of 15 in this case). This second iteration of the procedure typically (as in this case and all our experiments to date) results in an error-free clone. However, if errors remain, another error-correcting iteration of the procedure can be performed. The figure further demonstrates the construction of a 3-kb DNA fragment by combining, using the same construction procedure, the synthetically produced GFP molecule and DNA from a natural source as input (bacterial plasmid, in green), which yielded an error-free molecule. Expected optimal times for each step using state-of-the-art standard equipment are shown on the left. The cloning step could potentially be replaced by single molecule PCR. (B) The core step of recursive construction receives two overlapping ssDNA molecules as inputs and produces the elongated ssDNA molecule as output, as follows: the overlapping ssDNA molecules hybridize and prime each other for an overlap extension elongation reaction to form a dsDNA molecule (elongation), which is then amplified by PCR with one of the two primers phosphorylated at its 5′ end (PCR with phosphorylated primers). The phosphate-labeled PCR strand is then degraded with Lambda exonuclease, yielding an elongated ssDNA molecule as output (Lambda exonuclease).
1.3. Algorithms for recursive DNA construction
The D&C recursive algorithm receives a user-specified target sequence as its input and returns as output a list of oligos to be synthesized and a protocol in the form of a robot control program that can be used to construct the desired DNA molecule using the specified set of oligos. The basic recursive subroutine of the algorithm takes as input the sequence of a target molecule and returns as output a recursive construction protocol and its associated cost. This subroutine divides the target sequence into two overlapping sequences and calls itself recursively with these subtarget sequences as new input. The cost of constructing the target molecule by this protocol is computed by adding the cost of assembling the two overlapping subfragments to the cost of constructing these two individual subfragments. The computed cost accounts for the various features of the construction process, including the number and length of oligos, number of reactions, and the total number of levels in the protocol (see Appendix). The recursive division ends if the subroutine’s target is short enough to be synthesized directly as an oligonucleotide. If any fragment of the target sequence is already available as existing DNA (say in a plasmid or in previously constructed DNA), then the algorithm can take this information into account and use these fragments as input to the construction process instead of synthesizing them from basic oligos (Fig. 10.1).
Division points are not chosen so that oligos are of equal length, as usually practiced in PCA methods (Smith et al., 2003). Instead, division points are selected to minimize the cost of constructing the target and to respect a set of constraints, including whether good PCR primers exist for each of the subtargets and whether the two subtargets can be elongated together efficiently and specifically in the elongation reaction described in Fig. 10.1B. Validation of specificity and affinity of elongation overlaps and PCR primers is performed using sequence alignment algorithms and Tm calculations, respectively (see Appendix).
The optimized recursive protocol is then transformed into a robot control program that instructs the robot to construct the molecule bottom-up. It starts with the leaves of the recursive construction tree and iteratively executes the basic chemical step (Fig. 10.1B) all the way up to the root of the tree until the target molecule is constructed. The hierarchical structure of our procedure, induced by the use of recursion, enables DNA construction by pairwise composition reactions that are performed independently of each other and in equilibrium, which greatly increases the predictability (and hence amenability to automation) of the core biochemical reactions of our procedure.
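The recursive division and the bottom-up execution order described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works on coordinates only, uses a fixed overlap length and a coarse grid of division points, and omits the primer and overlap specificity checks. The cost weights, `OVERLAP_LEN`, and `STEP` are hypothetical; the oligo length limits are taken from the Appendix parameter list.

```python
from functools import lru_cache

MAX_OLIGO_LEN = 80        # from the Appendix parameter list
MIN_OLIGO_LEN = 30        # from the Appendix parameter list
OVERLAP_LEN = 20          # assumption: fixed overlap; the chapter selects overlaps by Tm
STEP = 10                 # assumption: coarse grid of division points, for speed
OLIGO_CONSTANT_COST = 10  # hypothetical integer cost weights
OLIGO_NUC_COST = 1
REACTION_COST = 20
LEVEL_COST = 5


@lru_cache(maxsize=None)  # cache of previously solved subtargets
def plan(start, end):
    """Cheapest (cost, height, tree) for building target[start:end].

    A tree is ('oligo', start, end) or ('join', start, end, left, right).
    Ties in cost are broken in favor of the shallower tree.
    """
    length = end - start
    if length <= MAX_OLIGO_LEN:  # short enough: synthesize directly
        cost = OLIGO_CONSTANT_COST + OLIGO_NUC_COST * length
        return cost, 0, ("oligo", start, end)
    best = None
    # Left part is [start, cut); right part is [cut - OVERLAP_LEN, end).
    last_cut = end - (MIN_OLIGO_LEN - OVERLAP_LEN)
    for cut in range(start + MIN_OLIGO_LEN, last_cut + 1, STEP):
        left_cost, left_height, left = plan(start, cut)
        right_cost, right_height, right = plan(cut - OVERLAP_LEN, end)
        cost = left_cost + right_cost + LEVEL_COST + REACTION_COST
        height = 1 + max(left_height, right_height)
        if best is None or (cost, height) < best[:2]:
            best = (cost, height, ("join", start, end, left, right))
    return best


def leaves(tree):
    """Oligo spans in left-to-right order."""
    if tree[0] == "oligo":
        return [(tree[1], tree[2])]
    return leaves(tree[3]) + leaves(tree[4])


def schedule(tree, levels):
    """Fill levels[h] with the joins of height h and return the tree height.
    Joins within one level are independent of each other."""
    if tree[0] == "oligo":
        return 0
    height = 1 + max(schedule(tree[3], levels), schedule(tree[4], levels))
    levels.setdefault(height, []).append((tree[1], tree[2]))
    return height


cost, height, tree = plan(0, 768)  # 768 nt, the GFP target length in Fig. 10.1
levels = {}
schedule(tree, levels)
print(len(leaves(tree)), "oligos,", height, "levels, cost", cost)
```

Executing `levels[1]`, then `levels[2]`, and so on corresponds to running the core step from the leaves of the construction tree up to its root.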
1.4. Error correction in recursive DNA construction
The hierarchical structure of the recursive construction tree is also at the foundation of our error-correction procedure. The molecules produced by the first iteration of our recursive construction procedure are error prone (see Table 10.A1) and have the same error rate as the oligos used to produce them. Our recursive construction procedure enables a novel error-correction strategy that employs the very same construction methodology and reagents to produce error-free molecules.
Table 10.A1 This table summarizes the errors found in our constructed DNA as revealed by the sequencing of 98 clones of the p53 library, each containing 880–883 nucleotides, totaling over 86 thousand sequenced base pairs
Error rate statistics
Number of clones                       98
Length of each variant (bp)            880 or 883
Deletion                               0
Insertion                              1
Substitution                           14
Total errors (bp)                      15
Total error-free clones                86
Total error-free bp                    86421
Total bp                               86436
Error rate (Total bp/Total errors)     5762.4
Like previous DNA construction protocols (Tian et al., 2004), our error-correction procedure uses cloning and sequencing to identify faults, but unlike previous protocols, it does not require additional or external methods or reagents to turn the error-prone DNA into error-free DNA. The overall strategy is described in Fig. 10.1: short oligos are used as error-prone basic components and composed as described above until the target DNA molecule is constructed. However, unlike other methods, if no error-free molecules are found by cloning and sequencing, then error-free parts of the erroneous target DNA molecules are identified and used as new, typically longer, inputs to the same recursive construction procedure. Since this construction starts from typically larger DNA-building blocks that are error free, the number of errors in the resulting reconstructed DNA is expected to decrease, possibly down to zero, eschewing additional screening of clones.
Specifically, the error-prone clones from the initial construction are analyzed to find a minimal cut in the recursive construction tree, defined as follows (see also mathematical definitions in Appendix). A node in the tree is said to be covered by a set of clones if its sequence occurs error free in at least one of the clones. A set of clones induces a minimal cut on the tree, defined to be the set of the most shallow (closest to the root) nodes in the tree that are covered by the clones. If some leaf is not covered, it means that the oligo is erroneous in all clones. In such a case, we can either analyze additional clones in the hope of finding that leaf error free and recompute the minimal cut or, if we reason that a systematic error has occurred in the synthesis of an oligo (i.e., the same error is represented uniformly in all clones), then there is no reason to analyze additional clones and we simply resynthesize that oligo and try again. Mathematically, we simply assume that the newly ordered oligo would cover the leaf node and proceed with the computation of the minimal cut. Since the boundaries of the error-free DNA fragments that constitute the minimal cut coincide with boundaries of fragments of the initial recursive construction tree, they can be extracted from their respective clones using PCR and the same primers used in their corresponding composition step (Fig. 10.1B). As a result, no additional methods or reagents are needed to obtain error-free molecules beyond those used in the initial construction.
Moreover, based on the known rate and distribution of errors, we can predict the number of times error-free components will occur in a given number of constructed objects. Further, we can calculate the probability that a certain number of error-free components would collectively span the entire target object. Conversely (and more importantly), we can calculate the number of object copies (clones) required so that their error-free components span the entire target object with a desired probability (chosen to be 95% in this work, see Appendix).
An important feature of our error-correction procedure is that it bypasses a major obstacle in constructing synthetic DNA, namely the exponential decrease in the fraction of error-free molecules with the length of the molecule, as seen in naïve approaches to DNA synthesis (Fig. 10.2A, blue plot). This is possible since our error-correction procedure avoids the difficult task of finding complete error-free molecules. Instead, it efficiently utilizes small error-free parts and combines them back into an error-free target molecule. The probability of finding an error-free fragment of a fixed small size is high and (more importantly) fixed regardless of the overall length of the target molecule. Hence, the number of clones needed to construct increasingly larger error-free target molecules grows only slowly and linearly (Fig. 10.2A, purple plot), compared to the exponential increase in the number of clones needed when constructing DNA without any error correction (Fig. 10.2A, blue plot). Even if some sort of building block (oligo) purification is applied, for example, PAGE purification (Fig. 10.2A, green plot), the number of clones still becomes overwhelming in the construction of DNA several kilobase pairs long.
Other methods for DNA synthesis also employ a hierarchical strategy in construction and error correction.
For example, fragments of 500 bp are constructed by PCA, cloned, and screened for error-free molecules, which are then combined into larger fragments by different methodologies (Xiong et al., 2004). Such a two-step construction strategy is compared to ours in Fig. 10.2A (red plot). Although we are not aware of evidence that PCA works with automation level robustness at 500 bp, for this plot, we assumed it does and that cloning of PCA products occurs uniformly at this length. The purification of initial building blocks by PAGE (Fig. 10.2A, green plot) or even an improved building block purification technology (Tian et al., 2004) combined with a two-step assembly process (Fig. 10.2A, cyan plot) still does not avoid the large number of molecules that need to be screened to construct molecules several kilobase pairs long. Other error-correction methods not presented in Fig. 10.2A include those which enrich error-free DNA molecules with the use of special mismatch-binding or -cleaving proteins (Bang and Church, 2008; Carr et al., 2004; Forster and Church, 2006) or improve site-directed mutagenesis (Xiong et al., 2006). The former requires the use of special mismatch-binding proteins and is limited to relatively short fragments with only a few errors. The latter performs corrective PCR with corrective primers for each error, which requires both the retrospective synthesis of new PCR primers for each such error and that the newly corrected PCR fragments be combined back into the target sequence. The fact that the identity of the new PCR fragments and the resulting structure of the construction protocol are dictated by the random distribution of errors and not by engineering considerations impairs robustness and hence amenability to automation. This is also why we do not choose any error-free fragments from our clones or design new primers which span them, but only the ones that coincide with fragments from our construction plan.
[Figure 10.2, graphic not reproduced. Panel A: plot "Number of clones versus target length"; x-axis: fragment length (1000–5000); y-axis: required number of clones (0–90); curves: construction from unpurified oligos, construction from gel-purified oligos, two-step DNA construction from unpurified oligos, construction from DNA chip with hybridization purification, recursive construction and error correction. Panel B: library graph with variable sites 1–4, each with variants .1–.4 (nodes 1.1 to 4.4), shown for the full library (×256) and for a representative set (×4).]
Figure 10.2 Comparative analysis of error-correction methodologies. (A) Error correction of a single molecule. The required number of clones that have to be sequenced to obtain an error-free synthetic DNA molecule as a function of its length is shown for different methods of construction: naïve construction from synthetic oligos with no error correction (blue); construction from gel-purified oligos (green); a two-step DNA construction, where in the first step, molecules of length 500 are constructed, cloned, sequenced, and in the second step, these error-free molecules are used as building blocks for larger molecules (red); a two-step construction from oligos purified by hybridization (Tian et al., 2004; cyan); and recursive construction with iterative error correction (purple; see Appendix for mathematical analysis). (B) Error correction of libraries: a graph representing a DNA library with four variable sites, each containing four variants, totaling 256 possible library members (top). Using recursive construction, one can first construct and error correct a representative set of only four library members, which constitute a minimal cut through the construction graph of the entire library. A subsequent iteration of the protocol can use error-free fragments obtained from these four library members to efficiently construct the entire 256-strong library. This dramatically economizes the error correction of libraries compared to the correction of each library member separately, as presented in (A).
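The restriction to fragments that coincide with nodes of the construction plan is what makes the minimal cut easy to compute. The sketch below is one possible formulation, under stated assumptions: the tree is a nested tuple `(span, children)` with 1-based inclusive coordinates, each clone is represented by the set of positions at which sequencing found an error, and the error positions in the example are invented for illustration. Only the fragment coordinates are taken from Fig. 10.1A.

```python
def covering_clone(span, clone_errors):
    """Index of the first clone with no error inside span, else None."""
    first, last = span
    for index, errors in enumerate(clone_errors):
        if not any(first <= pos <= last for pos in errors):
            return index
    return None


def minimal_cut(node, clone_errors):
    """Return (cut, uncovered_leaves) for a construction tree.

    cut lists (span, clone_index) for the shallowest covered nodes;
    uncovered_leaves lists oligos that are erroneous in every clone and
    must be resynthesized or searched for in additional clones.
    """
    span, children = node
    index = covering_clone(span, clone_errors)
    if index is not None:          # covered: stop here, do not descend
        return [(span, index)], []
    if not children:               # an uncovered leaf (oligo)
        return [], [span]
    cut, uncovered = [], []
    for child in children:
        child_cut, child_uncovered = minimal_cut(child, clone_errors)
        cut += child_cut
        uncovered += child_uncovered
    return cut, uncovered


# Top levels of the GFP tree in Fig. 10.1A; quarters are treated as leaves here.
gfp_tree = ((1, 768), [
    ((1, 440), [((1, 242), []), ((219, 440), [])]),
    ((411, 768), [((411, 590), []), ((587, 768), [])]),
])
clones = [{300, 600}, {100}]  # hypothetical error positions in two clones
cut, uncovered = minimal_cut(gfp_tree, clones)
print(cut, uncovered)
```

With these invented errors the cut consists of two quarter molecules and one half molecule, the same shape as the GFP example in Fig. 10.1, and each fragment can be amplified with the primers already used for that node.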
The basic principles used to construct DNA molecules can also be applied to construct DNA libraries. DNA libraries are an important source for selecting molecules encoding novel genetic sequences for use in medicine, research, and industry (Heinemann and Panke, 2006). Recursive construction can be extended to produce error-free combinatorial DNA libraries with prespecified and/or randomized members. Our library construction protocol delivers each library member separately, say in a separate well of a plate, which may facilitate a richer set of screening methods. The error correction of large libraries can be further economized. For example, in the construction of a library with 256 members (Fig. 10.2B top), a subset of only four clones containing all library components (Fig. 10.2B bottom) should be initially constructed and error corrected. Only then should all 256 members of the library be constructed from these four error-corrected clones. The principle of first constructing and error correcting a minimal set of DNA fragments from which the entire library can later be generated improves on the efficiency of our error-correction method for single sequences (shown in Fig. 10.2A).
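The counting argument behind the 256-member example can be illustrated with a short sketch. It assumes four variable sites with four variants each, labeled as in Fig. 10.2B, and picks the representative members by a simple diagonal rule. The function name and the rule are illustrative choices, not the chapter's algorithm, which selects a minimal cut through the library construction graph.

```python
from itertools import product

# Four variable sites with four variants each, labeled as in Fig. 10.2B.
sites = [[f"{site}.{variant}" for variant in range(1, 5)] for site in range(1, 5)]


def covering_members(sites):
    """A small set of library members that together contain every variant.

    Member k takes variant k of each site, wrapping around for sites
    that have fewer variants than the largest site.
    """
    size = max(len(variants) for variants in sites)
    return [tuple(v[k % len(v)] for v in sites) for k in range(size)]


library = list(product(*sites))   # every combination of one variant per site
seed = covering_members(sites)    # members to construct and error correct first
covered = {variant for member in seed for variant in member}
print(len(library), len(seed), len(covered))
```

Four members suffice here because each one contributes a different variant at every site, so all 16 components are available error free for the second iteration.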
2. Protocols
The entire method, as presented in this chapter, can be performed either manually or using an automated robotic system that we have specifically customized for this method. The protocols in this chapter describe the manual execution of the method, not the automated one.
2.1. PCR primer phosphorylation
Each PCR in the recursive composition step needs to have one of its 5′ termini (the one that overlaps with the other PCR product of that composition step) phosphorylated. This phosphorylation is used in a subsequent step for generating ssDNA using Lambda exonuclease. Lambda exonuclease recognizes a phosphorylated 5′ end in dsDNA as substrate and can be used to specifically target one of the strands for degradation. Phosphorylation of all PCR primers used in the recursive construction protocol is performed beforehand, simultaneously and with the appropriate polarity, according to the following protocol: 5′ DNA termini (300 pmol) in a 50 µl reaction containing 70 mM Tris–HCl, 10 mM MgCl2, 7 mM dithiothreitol, pH 7.6 at 37 °C, 1 mM ATP, and 10 U T4 polynucleotide kinase (NEB). Incubation is at 37 °C for 30 min, and inactivation is at 65 °C for 20 min.
2.2. Overlap extension elongation between two ssDNA fragments
Each pair of overlapping ssDNA fragments produced by Lambda exonuclease digestion (Section 2.4) is combined in an overlap elongation reaction with a thermostable DNA polymerase. In this reaction, pairs of ssDNA molecules that are complementary at their overlap are elongated bidirectionally, using an overlap designed to a Tm = 60 °C, forming elongated dsDNA molecules. Each elongation reaction consists of 1–5 pmol of each ssDNA progenitor in a reaction containing 25 mM TAPS pH 9.3 at 25 °C, 2 mM MgCl2, 50 mM KCl, 1 mM β-mercaptoethanol, 200 µM each dNTP, and 4 U Thermo-Start DNA Polymerase (ABgene). The thermal cycling program is as follows: enzyme activation at 95 °C for 15 min, slow annealing at 0.1 °C/s from 95 to 55 °C, and elongation at 72 °C for 10 min. The slow annealing down to 55 °C and the elongation at 72 °C can be repeated two to three times in the elongation of long fragments (> 1 kb), but this is not necessary for shorter fragments.
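As a rough planning aid, the run time of this program can be estimated from the stated parameters. The sketch below is only an approximation: it assumes that the heating ramps between steps are instantaneous, and the function name and argument names are illustrative.

```python
def elongation_program_seconds(cycles=1, activation_min=15.0, start_c=95.0,
                               anneal_c=55.0, ramp_c_per_s=0.1,
                               elongation_min=10.0):
    """Approximate run time of the elongation program of Section 2.2.

    Assumption: heating steps (55 -> 72 -> 95 C) take no time; only the
    activation, the slow cooling ramp, and the elongation hold are counted.
    """
    ramp_s = (start_c - anneal_c) / ramp_c_per_s   # slow annealing ramp
    per_cycle_s = ramp_s + elongation_min * 60.0
    return activation_min * 60.0 + cycles * per_cycle_s


for cycles in (1, 3):
    minutes = elongation_program_seconds(cycles) / 60.0
    print(cycles, "cycle(s):", round(minutes, 1), "min")
```

The slow ramp alone accounts for (95 − 55)/0.1 = 400 s per cycle, which is the price paid for annealing in quasi-equilibrium.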
2.3. PCR amplification of the above elongation product with two primers, one of which is phosphorylated
The products of the elongation reactions (Section 2.2) are diluted 1:100 in DDW and used as template in PCRs. They are amplified using a 5′-phosphorylated primer at the overlapping terminus and a nonphosphorylated primer at the nonoverlapping terminus. Each PCR consists of template (0.1–5 fmol per reaction) and 10 pmol of each primer in a 25 µl reaction containing 25 mM TAPS pH 9.3 at 25 °C, 2 mM MgCl2, 50 mM KCl, 1 mM β-mercaptoethanol, 200 µM each dNTP, and 1.9 U AccuSure DNA Polymerase (BioLINE). The thermal cycler program is enzyme activation at 95 °C for 10 min, followed by cycles of denaturation at 95 °C for 30 s, annealing at the Tm of the primers minus 5 °C, and extension at 72 °C for 1.5 min/kb. The PCRs are monitored in a real-time PCR machine to avoid over-cycling: in some cases, continuing the thermal cycling after the reactions have reached the amplification plateau may generate nonspecific amplification products, so we recommend terminating the thermal cycling once the plateau is reached.
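The two amplicon-dependent settings of this program, the annealing temperature and the extension time, can be derived as in the following sketch. It assumes that when the two primers have different melting temperatures the lower one is used, a choice the protocol does not specify; the function and key names are illustrative.

```python
def pcr_program(primer_tms_c, amplicon_bp):
    """Cycle settings for Section 2.3 as a list of (step, temperature C, seconds).

    Assumption: annealing is set 5 C below the lower of the primer Tm values.
    The annealing hold time is not given in the protocol and is left as None.
    """
    annealing_c = min(primer_tms_c) - 5.0
    extension_s = 1.5 * 60.0 * amplicon_bp / 1000.0   # 1.5 min per kb
    return [
        ("activation", 95.0, 10 * 60.0),
        ("denaturation", 95.0, 30.0),
        ("annealing", annealing_c, None),
        ("extension", 72.0, extension_s),
    ]


for step in pcr_program((62.0, 65.0), amplicon_bp=768):
    print(step)
```

The number of cycles is deliberately absent: the protocol ends cycling when the real-time signal reaches its plateau rather than after a fixed count.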
2.4. Lambda exonuclease digestion of the above PCR product to generate ssDNA
Pairs of overlapping PCR products are treated with Lambda exonuclease, an enzyme that recognizes 5′-phosphorylated ends of dsDNA as substrate, degrades the phosphorylated strand, and leaves the nonphosphorylated strand as ssDNA. One PCR product of each overlapping pair is phosphorylated on the minus strand, while the other is phosphorylated on the plus strand. This ensures that, once the phosphorylated strands are degraded by Lambda exonuclease, the resulting ssDNA molecules are complementary at the overlap. The degradation reaction consists of 5′-phosphorylated DNA termini (1–5 pmol) in a reaction containing 25 mM TAPS pH 9.3 at 25 °C, 2 mM MgCl2, 50 mM KCl, 1 mM β-mercaptoethanol, 5 mM 1,4-dithiothreitol, and 5 U Lambda exonuclease (Epicentre). The thermal cycler program is 37 °C for 30 min, 42 °C for 2 min, and enzyme inactivation at 70 °C for 10 min.
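Why opposite strands must be phosphorylated in the two PCR products can be checked with a small string-level model of the core step. This is a toy sketch under simplifying assumptions: sequences are plain strings written 5′ to 3′, the enzymes are modeled as ideal, and the example sequence and function names are arbitrary.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def lambda_exo(plus, phosphorylated):
    """Strand (5'->3') that survives digestion of a dsDNA PCR product.

    plus is the plus-strand sequence; phosphorylated names the strand
    ('plus' or 'minus') that carries the 5' phosphate and is degraded.
    """
    return revcomp(plus) if phosphorylated == "plus" else plus


def core_step(left_plus, right_plus, overlap):
    """Combine two overlapping fragments; return the product plus strand."""
    ss_left = lambda_exo(left_plus, "minus")    # plus strand of left survives
    ss_right = lambda_exo(right_plus, "plus")   # minus strand of right survives
    # The 3' ends of the two single strands must anneal to prime each other.
    if ss_left[-overlap:] != revcomp(ss_right[-overlap:]):
        raise ValueError("3' ends are not complementary at the overlap")
    # Overlap extension fills in both strands; report the plus strand.
    return ss_left + revcomp(ss_right)[overlap:]


target = "ATGGTGAGCAAGGGCGAGGAGCTGTTCACC"   # arbitrary example sequence
left, right = target[:20], target[12:]      # fragments sharing 8 nt
print(core_step(left, right, overlap=8) == target)
```

If both products were phosphorylated on the same strand, the surviving single strands would be identical rather than complementary at the overlap and could not prime each other.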
2.5. DNA purifications
DNA is purified according to the protocol described below following the PCR (Section 2.3) and Lambda exonuclease (Section 2.4) steps but not after elongation (Section 2.2) reactions. Elongation reactions are not purified since they are diluted 1:100 in DDW prior to their use as template in the subsequent PCR steps. The purification can be performed manually using single purification columns or in 96-column plates using an automated robotic setup.
2.5.1. Automated DNA purification
Automated DNA purification is performed using the ZR-96 DNA Clean-up Kit™ according to the manufacturer’s protocols, adapted to work with a Tecan Freedom EVO® robot and an automated centrifuge. Residual ethanol is evaporated from the 96-column plate prior to the elution step using a SpeedVac for plates for 15 min, and the DNA is subsequently eluted using an elution volume of 50 µl DDW.
2.5.2. Manual DNA purification
Manual DNA purification is performed using the Qiagen MinElute PCR purification kit according to the manufacturer’s protocols. Residual ethanol is evaporated from the column prior to elution using a SpeedVac for 15 min, and the DNA is subsequently eluted using an elution volume of 20 µl DDW.
2.5.3. Quality controls
Each substep of the recursive composition operation (PCR, Lambda, elongation) is quality controlled by gel electrophoresis, capillary electrophoresis (ABI), or a microchip electrophoresis system (MultiNA by Shimadzu) to verify that fragments are of the correct length and that the reaction does not contain additional nonspecific products.
2.5.4. Cloning of composite DNA fragments
In our experiments, target PCR fragments are cloned into the pGEM-T Easy Vector System 1 from Promega. This requires the PCR products to be A-tailed using a nonproofreading Taq polymerase. The vectors containing cloned fragments are transformed into JM109 competent cells from Promega using heat shock and sequenced.
Appendix
A.1. Divide and Conquer Algorithm for DNA Molecule Synthesis
A.1.1. Goal
The algorithm searches for an optimal protocol for the construction of the target molecule under a set of constraints and a set of cost parameters.
A.1.2. General description
The algorithm starts with the target molecule given by the user and searches for an optimal and valid division point. Such a point fulfills a set of constraints, such as the existence of specific primers and overlap specificity for the division point, in addition to the existence of two valid protocols that can build the two divided parts. Once a valid division point is found, the algorithm recursively searches for a protocol to build each of its two parts. The recursion stops when the target molecule is short enough to be produced by an oligo synthesizer. A cost function is computed for each subprotocol based on the number of oligos and their lengths, the number of reactions, and the number of protocol tree levels required to build its target, and the lowest-cost protocol is selected as the optimal protocol. As the protocol space is very large, dynamic programming is used: previously computed subprotocols are kept in a cache and reused when needed in a different search path. In addition, branch and bound is used to trim the search space when the intermediate cost shows that the current best cost for a protocol cannot be improved.
A.1.3. Algorithm
INPUT: TARGET_SEQUENCE and CONFIGURATION with parameter values.
OUTPUT: A protocol to construct the target molecule including:
SEQUENCE LIST: a list of all oligos and intermediate target products. REACTION LIST: a list of reactions, each reaction has two sequences from the sequence list as its input and one sequence as its product thus describing the protocol tree. OLIGOS and PRIMERS list: a list of oligos that are needed to be synthesized as building blocks for the protocols.
A.1.3.1. Algorithm pseudo-code
1. Preprocessing
1.1. Find best overlap for each point and compute its range of specificity.
1.2. Find best primer for each point and compute its range of specificity.
2. Divide&Conquer (TARGET_SEQUENCE)
2.1. IF TARGET_SEQUENCE is in Cache, THEN return the protocol from cache.
2.2. IF TARGET_SEQUENCE is shorter than MAX_OLIGO_SIZE, THEN return the OLIGO as the current protocol.
2.3. Set CURRENT_BEST_PROTOCOL to NO_PROTOCOL and CURRENT_BEST_COST to Inf.
2.4. For each Division Point, check the following:
2.4.1. Valid overlap exists and complies with overlap constraints.
2.4.2. Primers for division exist and comply with primer constraints.
2.4.3. Check primer dimers for each subtarget.
2.4.4. IF any of the checks 2.4.1–2.4.3 failed, THEN continue to next point.
2.4.5. Compute lower bound on the cost of the protocol. IF LOWER_BOUND_COST > CURRENT_BEST_COST, THEN continue to next point.
2.4.6. Divide&Conquer (LEFT_SUBTARGET).
2.4.7. Divide&Conquer (RIGHT_SUBTARGET).
2.4.8. Merge protocols and compute the CURRENT_COST.
2.4.9. IF CURRENT_COST < CURRENT_BEST_COST, THEN set CURRENT_BEST_PROTOCOL = CURRENT_PROTOCOL and CURRENT_BEST_COST = CURRENT_COST (update cache).
2.5. Return CURRENT_BEST_PROTOCOL.
A.1.3.2. Pseudo-code of recursive cost function
In case of division: CURRENT_TARGET_COST = LEFT_SUBTARGET_COST + RIGHT_SUBTARGET_COST + LEVELCOST + REACTIONCOST
In case of oligo: CURRENT_TARGET_COST = oligo_constant_cost + oligo_nuc_cost * oligo_length
The following parameters are used to evaluate the specificity and affinity of primers and elongation overlaps and to compute the cost function.
○ max_oligo_len: 80, maximal oligo length
○ min_oligo_len: 30, minimal oligo length
○ max_primer_Tm: 70, maximal primer melting temperature
○ min_primer_Tm: 60, minimal primer melting temperature
○ min_primer_len: 14, minimal primer length
○ max_primer_len: 30, maximal primer length
○ min_overlap_Tm: 60, minimal overlap melting temperature
○ min_overlap: 15, minimal overlap length
○ max_overlap: 70, maximal overlap length
○ levelcost: 50, cost of an additional level in the protocol
○ reactioncost: 10, cost of an additional reaction in the protocol
○ oligo_constant_cost: 2, constant cost for a single oligo
○ oligo_nuc_cost: 0.2500, length-dependent cost for an oligo
Specificity of fragments in elongation reactions and of PCR primers was evaluated using sequence alignment algorithms and Tm formulas from the MatLab bioinformatics toolbox.
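The search in A.1.3.1 and the cost function in A.1.3.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation (which used MatLab): the function names plan and is_valid_division are hypothetical, the validity checks 2.4.1–2.4.3 are replaced by a placeholder that accepts every point, and a single fixed overlap length (min_overlap) is assumed instead of an overlap chosen per division point. The cost parameters are the values listed above.

```python
import math

# Cost and size parameters copied from the parameter list above.
MAX_OLIGO_LEN = 80
MIN_OLIGO_LEN = 30
LEVEL_COST = 50
REACTION_COST = 10
OLIGO_CONSTANT_COST = 2
OLIGO_NUC_COST = 0.25
# Assumption: one fixed overlap length; the real search selects it per point.
OVERLAP = 15


def is_valid_division(seq, point):
    """Hypothetical placeholder for checks 2.4.1-2.4.3 (overlap, primers, dimers)."""
    return True


def plan(seq, cache=None):
    """Return (cost, protocol_tree) for building seq; results are memoized in cache."""
    if cache is None:
        cache = {}
    if seq in cache:                      # step 2.1: reuse a cached subprotocol
        return cache[seq]
    n = len(seq)
    if n <= MAX_OLIGO_LEN:                # step 2.2: short enough to be an oligo
        result = (OLIGO_CONSTANT_COST + OLIGO_NUC_COST * n, ("oligo", seq))
        cache[seq] = result
        return result
    best_cost, best_tree = math.inf, None
    # Any division costs at least one level, one reaction and two oligos
    # that together cover the target plus the overlap.
    lower_bound = (LEVEL_COST + REACTION_COST + 2 * OLIGO_CONSTANT_COST
                   + OLIGO_NUC_COST * (n + OVERLAP))
    # Both parts must be at least MIN_OLIGO_LEN long and shorter than seq.
    for point in range(MIN_OLIGO_LEN, n - MIN_OLIGO_LEN + OVERLAP + 1):
        if not is_valid_division(seq, point):
            continue                      # step 2.4.4
        if lower_bound >= best_cost:
            break                         # branch and bound: no point can improve
        left_cost, left_tree = plan(seq[:point], cache)
        right_cost, right_tree = plan(seq[point - OVERLAP:], cache)
        cost = left_cost + right_cost + LEVEL_COST + REACTION_COST
        if cost < best_cost:              # step 2.4.9
            best_cost, best_tree = cost, ("join", left_tree, right_tree)
    cache[seq] = (best_cost, best_tree)
    return cache[seq]
```

Under these assumptions, a 100-nt target is split once into two oligos that share a 15-nt overlap, so the cost is 2 × 2 + 0.25 × 115 + 50 + 10 = 92.75. Because the lower bound here is the same for every division point of a given target, the search stops as soon as one division reaches it; a tighter, point-specific bound would be needed once real validity checks reject some points.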
A.1.4. Discussion of algorithm
Division points are not chosen so that oligos are of equal length, as usually practiced in PCR methods (Smith et al., 2003). Instead, for each possible division point, the algorithm validates that a certain set of constraints is fulfilled, including whether good PCR primers exist for each of the subtargets and whether the two subtargets can be elongated together efficiently and specifically in the elongation reaction described in Fig. 10.2B. Validation of specificity and affinity of elongation overlaps and PCR primers is performed using sequence alignment algorithms and Tm calculations, respectively. The sequences of all oligos, primers, construction intermediates, and full-length products reported in this work are available online (see Appendix). To improve the performance of the optimization algorithm, a dynamic programming approach (Alsuwaiyel, 1999) was implemented by storing the protocol of each subtarget (also called memoization), thereby avoiding computing a protocol for a subtarget more than once. In addition, a "branch-and-bound" approach (Alsuwaiyel, 1999) was implemented to trim the search space in cases where the cost function cannot improve on a previously found division.
A.1.4.1. General description of error correction In general, a composite object constructed from potentially faulty basic components is expected to have a higher number of errors than each of its components. However, if errors are randomly distributed among the basic components and occur randomly during construction, and if sufficiently many copies of an object are constructed, it is expected that some of the copies may contain error-free composite parts. If such parts could be identified and extracted from the faulty objects, they could be reused as inputs to recursively reconstruct the object. Since the reconstruction starts from larger parts that are error free, the number of errors in the resulting object is expected to decrease, possibly down to zero. Even if the objects produced this way have errors,
they are expected to have fewer errors than their predecessors and hence to have even larger error-free parts, which can be reused in another iteration of the recursive construction process, until an error-free object is formed.
A.2. Divide and Conquer Algorithm for DNA Combinatorial Library Synthesis
A.2.1. Goal
Using a D&C approach, the algorithm searches for a protocol to construct a combinatorial library described by the user while making efficient use of the sequences shared across the library.
A.2.2. General description
The algorithm receives a library description with variable regions separated by shared regions. Each variable region may have two or more variants of different sequence and size. Using a D&C approach, the algorithm finds the optimal library protocol to construct the library from its shared and variable regions with a minimal number of reactions, taking into account that the number of intermediate products may be the product of the numbers of variants of two variable regions. The algorithm then finds a specific valid overlap within the shared regions that is suitable for synthesizing the two adjacent regions with all their variants. The overlap defines the building blocks of the library, both the shared and variant fragments. Each building block is then planned using the D&C algorithm for a single molecule (described above). The protocols of the building blocks are merged, and additional reactions are added according to the library's optimal construction protocol previously calculated.
A.2.3. Minimal cut
A cut in a tree is a set of nodes that includes exactly one node on every path from the root to a leaf. Let T be a recursive construction protocol tree and S a set of strings. We say that S covers T if there is a set of strings C such that every string in C is a substring of some string in S and C is a cut of T. In such a case, we also say that S covers T with C.
Claim: If S covers T, then there is a unique minimal set C such that S covers T with C.
Proof: Easy.
Error-free reconstruction algorithm: Given an RC protocol T and a set of sequences (of molecular clones) S, find a minimal C such that S covers T with C. Then we lift C with PCR and do the recursive construction starting with C.
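The error-free reconstruction step can be sketched as a top-down search over the protocol tree, here in Python. This is an illustrative sketch rather than the authors' program: the Node class and the minimal_cut function are hypothetical names, clones are represented simply as sequenced strings, "error free" is taken to mean that the product occurs as an exact substring of a clone, and an oligo with no error-free copy is assumed to be synthesized again.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of the protocol tree: a product and its two precursors."""
    product: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def minimal_cut(node: Node, clones: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Return the cut as (product, clone index) pairs, searching from the root down.

    The clone index names the clone that can serve as PCR template; it is None
    for an oligo that has no error-free copy and must be synthesized again
    (an assumption of this sketch).
    """
    for index, clone in enumerate(clones):
        if node.product in clone:
            # An error-free copy exists: this product becomes a building block.
            return [(node.product, index)]
    if node.left is None or node.right is None:
        return [(node.product, None)]
    # No error-free copy: keep the reaction and recurse into both precursors.
    return minimal_cut(node.left, clones) + minimal_cut(node.right, clones)
```

Because the search stops at the first (highest) node that is found error free, the returned set contains the largest reusable fragments, which is what makes the cut minimal in the sense of the claim above.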
A.2.4. Computing the minimal cut
We use a recursive approach for computing the minimal cut of a protocol tree. Each node in the tree represents a biochemical process with a product and two precursors. The algorithm starts with the root of the tree (target molecule) and for each node checks whether its product sequence exists with no errors in one of the clones. If such a clone exists, then this product is marked as a new basic building block for reconstruction of the target molecule, and its primer pair and the relevant clone (as template) are registered as its generating PCR. If no clone contains an error-free sequence of the node product, then the reaction is registered as an existing reaction in the new protocol and the algorithm is recursively executed on the two precursors of the product. The output of this procedure is a tree of reactions that comprises a minimal cut of the original tree. Its leaves are products for which an error-free clone exists, whereas no error-free clone was found for any of its internal nodes. An automated program that utilizes these new error-free building blocks for recursive construction of the target molecule is generated for the robot.
A.2.5. Computing the required number of clones
For a fragment of size L under mutation rate R, the probability of having an error-free fragment in a single clone is taken from a Poisson distribution with LAMBDA = L*R (the probability of having 0 errors when the expected number of errors is L*R). To find the smallest number of clones required to get an error-free fragment with probability larger than 95%, we use a binomial distribution and compute the probability of having at least one error-free fragment out of N clones. In the D&C approach, the length of the pure fragment can be reduced to the size of an oligo (80 bp) at the expense of having to perform more steps during reconstruction.
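The calculation can be written out as a short Python sketch. The function names are hypothetical, and the error rate used in the usage note below (0.001 errors per base) is an assumed illustrative value, not one reported here. The optional n_fragments argument raises the single-fragment probability to a power, for targets that need several error-free fragments at the same time.

```python
import math


def p_error_free(length, error_rate):
    """Poisson probability of 0 errors when the expected number is length * error_rate."""
    return math.exp(-length * error_rate)


def p_success(length, error_rate, n_clones, n_fragments=1):
    """Probability that each of n_fragments fragments is error free in at least one clone."""
    p_single = 1.0 - (1.0 - p_error_free(length, error_rate)) ** n_clones
    return p_single ** n_fragments


def smallest_n_clones(length, error_rate, n_fragments=1, target=0.95):
    """Smallest number of clones whose success probability is larger than target."""
    n = 1
    while p_success(length, error_rate, n, n_fragments) <= target:
        n += 1
    return n
```

With the assumed error rate of 0.001 per base, smallest_n_clones(768, 0.001) gives 5 clones for an error-free copy of a whole 768-bp molecule, while smallest_n_clones(80, 0.001) gives 2 clones for a single 80-bp fragment. Shorter fragments need fewer clones, at the price of more reconstruction steps.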
Thus, in order to guarantee full error-free coverage of the target molecule, the probability of having a pure fragment of size L in N clones, P_Success(L,N), is raised to the power of the number of fragments of size L that are required to construct the target molecule (the first part is error free and the second part is error free, etc.). We compute this number after considering the overlap, which reduces the contribution of each oligo to less than its actual size (55 bp). Then, we find the smallest number of clones which satisfies the requirement that the total probability of having a minimal cut exceeds 95%.
A.2.6. Quasi-equilibrium processes
The elongation step shown in Fig. 10.1b is performed in quasi-equilibrium. A quasi-equilibrium process is one in which the intermediate steps in the process are all close to equilibrium. In the context of overlapping ssDNA molecules, this means that the molecules are not reacted in rapid thermal cycling conditions
(nonequilibrium, like in PCR) but are first heated to denaturation and then very slowly cooled to annealing and elongation conditions.
A.2.6.1. Example script of the robot programming language
For a detailed description of the language, go to www.weizmann.ac.il/udi/papers/rpl.pdf
Annotations to the script are given in square brackets.
[A preset definition of the way our working deck is organized]
TABLE table_PIE1000.gem
[Verbal documentation for the program]
DOC
PCR ON PLASMID FOR Fragments A, B, C & D.
PURIFYING PCR SAMPLES
MEASURING CONCENTRATION OF PCRS
PREPARING C.E. AND G.E. ANALYSIS
ENDDOC
[Defining e-mail and SMS number for notification of errors during preparation]
ADDRESS [email protected]
SMS_NUMBER 0528681034
[Defining the reagents we are working with and their locations]
REAGENT LB_SYBR T2 5 LCWAUTOBOT 4
REAGENT DDW T4 1 LCWAUTOBOT 4
REAGENT PCR_dNTP_Mix_x5 T1 1 LCWAUTOBOT 1
REAGENT TEMP_PCR_A T1 5 LCWAUTOBOT 1
REAGENT TEMP_PCR_B T1 6 LCWAUTOBOT 1
REAGENT TEMP_PCR_C T1 7 LCWAUTOBOT 1
REAGENT TEMP_PCR_D T1 8 LCWAUTOBOT 1
LOAD GFP_primers_6 P2 LCWAUTOBOT
[List of reactions to assemble: specifies reagents and volume of each reagent. Each line is one reaction]
LIST Reaction_200
GFP1F_1p_FAM 10 GFP_A_R_1p_phos 10 TEMP_PCR_A 5 PCR_dNTP_Mix_x5 6.25
GFP_B_F_1p_phos 10 GFP_B_R_1p 10 TEMP_PCR_B 5 PCR_dNTP_Mix_x5 6.25
GFP_C_F_1p 10 GFP_C_R_1p_phos 10 TEMP_PCR_C 5 PCR_dNTP_Mix_x5 6.25
GFP_D_F_1p_phos 10 GFP_D_R_1p 10 TEMP_PCR_D 5 PCR_dNTP_Mix_x5 6.25
# NTC
GFP1F_1p_FAM 10 GFP_A_R_1p_phos 10 TEMP_PCR_D 5 PCR_dNTP_Mix_x5 6.25
GFP_B_F_1p_phos 10 GFP_B_R_1p 10 TEMP_PCR_C 5 PCR_dNTP_Mix_x5 6.25
GFP_C_F_1p 10 GFP_C_R_1p_phos 10 TEMP_PCR_B 5 PCR_dNTP_Mix_x5 6.25
GFP_D_F_1p_phos 10 GFP_D_R_1p 10 TEMP_PCR_A 5 PCR_dNTP_Mix_x5 6.25
ENDLIST
The following figures (Figs. 10.A1–10.A16) show a highly detailed description of the recursive construction of the molecules reported in the paper through capillary electrophoresis fragment analysis of each DNA fragment that occurred during the construction. The fragment analysis provides a single base-pair resolution of the construction process and is accompanied by a description of the expected sizes of these fragments. Gel electrophoresis is provided for fragments that are too large to be analyzed properly using C.E. Additionally, we provide examples that demonstrate how we perform our analysis and Q.C. at different stages of the basic recursive step, namely gel electrophoresis of ssDNA and real-time analysis of PCRs. The sizes of the size marker (orange peaks) in all capillary electrophoresis runs presented here are, from left to right, in base pairs: 35, 50, 75, 100, 139, 150, 160, 200, 247, 300, 340, 350, 400, 450, 490, and 500.
Figure 10.A1 C.E. fragment analysis of an example of an elongation reaction. One strand is labeled by HEX (green) and the second by FAM (blue). The elongated product is thus labeled by two fluorophores (far right). An excess of the strand labeled by HEX (green peak on the left) does not elongate.
Figure 10.A2 Example of lambda exonuclease activity: generating ssDNA from dsDNA by lambda exonuclease. Lane 1: dsDNA labeled with a 5′ phosphate; lane 2: same DNA after treatment with lambda exonuclease shows a lower ssDNA band and the disappearance of the dsDNA band, reflecting the generation of ssDNA from dsDNA by the enzyme.
Figure 10.A3 Gel electrophoresis of assembly reactions using dsDNA instead of ssDNA in the construction of the GFP gene (lanes A–F; size marker bands labeled at 1000, 400, 200, and 100 bp). All quarters have nonspecific amplification products on top of the correct size. This problem is exacerbated as assembly proceeds (F). Construction of the same fragments but using our method is described by C.E. in Fig. 10.A5 and shows that the same fragments are built accurately. (A) 100-bp size marker, (B) GFP quarter1, (C) GFP quarter2, (D) GFP quarter3, (E) GFP quarter4, (F) Further assembly of quarters A–D leads to faulty construction.
Figure 10.A4 C.E. fragment analysis of PCRs from the construction of the GFP DNA. (A) The construction protocol of the GFP molecule is divided to levels of construction. Each of the C.E. runs (B–E) contains two fragments that are to be elongated together in the next level. (B) GFP construction level 1. Expected sizes in base pairs are in pairs from top to bottom: 133 and 134, 135 and 119, 95 and 110, 134 and 126. (C) GFP construction level 2. Expected sizes in base pairs are in pairs from top to bottom: 242 and 222, 180 and 202. (D) GFP construction level 3. Expected sizes in base pairs are in pairs from top to bottom: 440 and 358. (E) GFP construction level 4 and control PCR on pGFP-N1. Expected size is 768 bp.
Figure 10.A5 C.E. fragment analysis of PCRs during the reconstruction of GFP from two quarters and one half. (A) GFP reconstruction level 1. Expected sizes in base pairs are in pairs from top to bottom: 242 and 222. (B) GFP reconstruction level 2. Expected sizes in base pairs are in pairs from top to bottom: 440 and 358. (C) GFP reconstruction level 3. Expected size is 768 bp.
Figure 10.A6 C.E. fragment analysis of PCRs from the reconstruction of GFP from two halves. (A) GFP reconstruction level 1. Expected sizes in base pairs are in pairs from top to bottom: 440 and 358. (B) GFP reconstruction level 2. Expected size is 768 bp.
Figure 10.A7 C.E. fragment analysis of PCRs from the composition of natural and synthetic fragments. The final PCR product is 1.363 kb. C.E. 1 shows the natural fragment, C.E. 2 shows the synthetic GFP, and C.E. 3 shows that the composed (PCR amplified) product has the correct size of 1363 bp long.
[Figure 10.A8 gel labels: panels A and B; bands labeled at 3054, 2036, 1636, 1018, and 517 bp, and at ~700, 1363, 1700, and 3000 bp.]
Figure 10.A8 Composition of natural and synthetic fragments. (A) PCRs of two 700 bp fragments. (B) A PCR of 1363 bp composed from the fragments shown in (A) and a 1700-bp fragment from a natural plasmid. (C) PCR of a 3000-bp fragment composed of the fragments from (B).
[Figure 10.A9 protocol tree labels: target DT-19-1-823; intermediate fragments DT-19-1-544, DT-19-529-823, DT-19-1-321, DT-19-306-544, DT-19-529-717, DT-19-1-201, DT-19-1-89, DT-19-74-201, DT-19-186-321, DT-19-306-432, DT-19-417-544, DT-19-591-717, and DT-19-702-823; oligos DT-19-O-1 to DT-19-O-15.]
Figure 10.A9 Construction protocol of the Tachylectin II molecule with E. coli codons (823 bp).
Figure 10.A10 C.E. fragment analysis of PCRs during Tachylectin II construction. (A) Tachylectin II construction level 1. (B) Tachylectin II construction level 2. (C) Tachylectin II construction level 3. (D) Tachylectin II construction level 4. (E) Tachylectin II construction level 5.
[Figure 10.A11B plot: "Protocol DotPlot for sequence: DT19"; axis labels "Synthesized fragment" and "Overlap", ticks from 100 to 800. Fragments by level: L1: 1–89, 74–201, 186–321, 306–432, 417–544, 591–717, 702–823; L2: 1–201, 306–544, 529–717; L3: 1–321, 529–823; L4: 1–544; L5: 1–823.]
Figure 10.A11 (A) Crystal structure of Tachylectin II with its five-bladed beta-propeller fold. (B) Dot plots of the Tachylectin II molecule with E. coli codons showing how our protocol design avoids mispriming during construction. The target DNA sequence is aligned against itself and the five repetitive sequences are visualized by the red lines. Fragments that we designed to occur during construction are shown in green, and the height of each green fragment shows where the overlap region in elongation reactions is located.
Figure 10.A12 C.E. fragment analysis of PCRs from the construction of the P53 library variants. (A) P53 library construction level 1. Expected sizes in base pairs are from top to bottom: 99, 99, 116, 137, 127, 131, 127, and 136. (B) P53 library construction level 2. Expected sizes in base pairs are 247, 171, and 239. (C) P53 library construction level 3. Expected sizes in base pairs are 283, 280, 183, and 270. (D) P53 library construction level 4. Expected sizes in base pairs are 351 and 351. (E) P53 library construction level 5. Expected sizes in base pairs are 619 and 619. (F) P53 library construction level 6, showing all six variants of P53 library. Expected sizes in base pairs are 883, 883, 883, 883, 880, and 880.
Figure 10.A13 C.E. fragment analysis of PCRs from the P53 Library reconstruction. The expected sizes are identical to the sizes for the corresponding levels in the reconstruction shown in Fig. 10.A4 above. (A) P53 library reconstruction level 1. (B) P53 library reconstruction level 2. (C) P53 library reconstruction level 3. (D) P53 library reconstruction level 4, showing all six variants of P53 library.
Figure 10.A14 Two variant regions of the P53 library are marked at the variant bases. Variant region 1 has three variants; Variant region 2 has two variants; black dots (bottom) mark differences in sequence between variants.
[Figure 10.A15 plot: y-axis, PCR baseline-subtracted CF RFU (−10 to 130); x-axis, Cycle (0 to 22).]
Figure 10.A15 Real-time PCR (RT-PCR) amplification curve. RT-PCR is used routinely for detecting the end point of the amplification process to avoid potentially harmful over-cycling of PCR products.
[Figure 10.A16 plot: y-axis, −d(RFU)/dT (−1 to 13); x-axis, Temperature (°C), 46 to 100.]
Figure 10.A16 RT-PCR melt curve. A post RT-PCR melt curve provides us with an initial pre-electrophoresis indication of the quality of our amplified DNA.
[List of sample names for export to the C.E. analysis machine.]
LIST CE_200
lambda1_5 lambda1_6 lambda1_7 lambda1_8
ENDLIST
SCRIPT
[On screen textual notification for user during robot operation]
PROMPT remove tube covers!!!!!
% PREPARING PCR REACTIONS
[Command for preparing the list that was specified in the list section]
PREPARE_LIST Reaction_200 P3 A1+8 DEFAULT MIX:LCWMXSLOW:10x8,LOG:R200
% TRANSFERRING PCR PLATE FROM ITS POSITION TO THE PCR BLOCK AND BACK
[Commands for using the robot's arm to move plates and plate covers and for operating the PCR block]
MOVE_PLATE P3 PCR
MOVE_OBJECT COVER HA7 PCR
RUN_PCR 3 1 1 1
MOVE_OBJECT COVER PCR HA7
MOVE_PLATE PCR P3
% PCR PURIFICATION
[Command for purifying samples with a vacuum based purification scheme]
PCR_PURE P3 A1+8 V1 A1+8 P3 A2+8 DDW 31 60 PCR_PURE
% PREPARING MEASUREMENT OF DNA BY PICO-GREEN FLUORESCENCE
[Command for measuring DNA concentration using the picogreen reagent and a table-top fluorimeter]
PG_PREPARE_STD P4 A1+8
PG_PREPARE_SAMPLE P3 A2+8 P4 A2+8 3
% GENERATING A FRAGMENT ANALYSIS C.E. RUN
[Command for generating a C.E. analysis experiment for the samples specified in the list before (CE_200)]
CEPLATE Gr_GFP_React_200 CE_200 A1 Gr_Data FA_50_POP4_Time3500_Temp60_InjVolt1.0_InjTime20
% PREPARING SAMPLES FOR GEL ELECTROPHORESIS
[Command for distributing one reagent to different destinations (loading buffer)]
DIST_REAGENT LB_SYBR P6 A1+8 5 DEFAULT LOG:SYBR
[Command for transferring a volume of liquid from source wells to destination wells]
TRANSFER_WELLS P3 A2+8 P6 A1+8 6 LCWBOT
[End of program]
ENDSCRIPT
In the capillary electrophoresis runs, the fragments from our construction process are shown on top of the size marker in blue and green, representing the FAM and HEX fluorescence of our PCR primers, respectively. Note that a certain shift in the size calling occurs due to the FAM and HEX fluorophores.
This paper is based on a paper with a similar title published in Molecular Systems Biology with the following reference: Mol. Syst. Biol. 2008;4:191.
REFERENCES
Aho, A. V., Hopcroft, J. E., and Ullman, J. D. (1983). Data Structures and Algorithms. Addison-Wesley, Reading, MA/London.
Alsuwaiyel, M. H. (1999). Algorithms: Design Techniques and Analysis. World Scientific, Singapore/New Jersey.
Au, L. C., Yang, F. Y., Yang, W. J., Lo, S. H., and Kao, C. F. (1998). Gene synthesis by a LCR-based approach: High-level production of leptin-L54 using synthetic gene in Escherichia coli. Biochem. Biophys. Res. Commun. 248, 200–203.
Bang, D., and Church, G. M. (2008). Gene synthesis by circular assembly amplification. Nat. Methods 5, 37–39.
Carr, P. A., Park, J. S., Lee, Y. J., Yu, T., Zhang, S., and Jacobson, J. M. (2004). Protein-mediated error correction for de novo DNA synthesis. Nucleic Acids Res. 32, e162.
Caruthers, M. H. (1985). Gene synthesis machines: DNA chemistry and its uses. Science 230, 281–285.
Chomsky, N. (1964). Syntactic Structures. Mouton, The Hague.
Forster, A. C., and Church, G. M. (2006). Towards synthesis of a minimal cell. Mol. Syst. Biol. 2, 45.
Heinemann, M., and Panke, S. (2006). Synthetic biology—Putting engineering into biology. Bioinformatics 22, 2790–2799.
Hopcroft, J. E., and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA.
Hutchison, C. A., III, Phillips, S., Edgell, M. H., Gillam, S., Jahnke, P., and Smith, M. (1978). Mutagenesis at a specific position in a DNA sequence. J. Biol. Chem. 253, 6551–6560.
Knight, T. (2003). Idempotent Vector Design for Standard Assembly of Biobricks. MIT Synthetic Biology Working Group, Boston.
Mandelbrot, B. B. (1982). The fractals book. Observatory 102, 151.
Merkle, R. C. (1997). Convergent assembly. Nanotechnology 8, 18–22.
Rogers, H. (1967). Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York.
Ryu, D. D., and Nam, D. H. (2000). Recent progress in biomolecular engineering. Biotechnol. Prog. 16, 2–16.
Smith, H. O., Hutchison, C. A., III, Pfannkoch, C., and Venter, J. C. (2003). Generating a synthetic genome by whole genome assembly: phiX174 bacteriophage from synthetic oligonucleotides. Proc. Natl. Acad. Sci. USA 100, 15440–15445.
Stemmer, W. P., Crameri, A., Ha, K. D., Brennan, T. M., and Heyneker, H. L. (1995). Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides. Gene 164, 49–53.
Tian, J., Gong, H., Sheng, N., Zhou, X., Gulari, E., Gao, X., and Church, G. (2004). Accurate multiplex gene synthesis from programmable DNA microchips. Nature 432, 1050–1054.
Xiong, A. S., Yao, Q. H., Peng, R. H., Li, X., Fan, H. Q., Cheng, Z. M., and Li, Y. (2004). A simple, rapid, high-fidelity and cost-effective PCR-based two-step DNA synthesis method for long gene sequences. Nucleic Acids Res. 32, e98.
Xiong, A.-S., Yao, Q.-H., Peng, R.-H., Duan, H., Li, X., Fan, H.-Q., Cheng, Z.-M., and Li, Y. (2006). PCR-based accurate synthesis of long DNA sequences. Nat. Protoc. 1, 791–797.
CHAPTER ELEVEN
Industrial Scale Gene Synthesis
Frank Notka,* Michael Liss,* and Ralf Wagner*,†
* Life Technologies Inc./GeneArt AG, Regensburg, Germany
† Institute of Medical Microbiology and Hygiene, Molecular Microbiology and Gene Therapy, University of Regensburg, Regensburg, Germany
Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00011-5. © 2011 Elsevier Inc. All rights reserved.
Contents
1. Brief History of Gene Synthesis
2. Applications of Synthetic Genes
2.1. Availability and safety
2.2. Origin and reliability
2.3. Expression efficiency
2.4. Protein performance
2.5. Cost, capacity, and speed
2.6. Flexibility of design: artificial genes, operons, and genomes
3. State-of-the-Art Gene Synthesis
4. Gene Synthesis and Synthetic Biology-From Genes to Genomes
4.1. Information
4.2. Modularity
4.3. Standardization
4.4. Technological developments
5. Industrial Gene Synthesis—From Bench to Manufacturing
5.1. Process features
5.2. Biosafety/biosecurity
5.3. Optimization rational
5.4. Optimizer software
6. Design Tool—GeneOptimizer
6.1. Project design
6.2. Sequence design
6.3. Construction design
7. Production Processing—LIMS
7.1. Steering process
7.2. Process control
7.3. Process expansion
7.4. Order entry
7.5. Order processing
7.6. Oligonucleotide production
7.7. Subfragment production
7.8. Assembly
8. Case Study: Large-Scale Gene Production
9. Conclusion
References
Abstract The most recent developments in the area of deep DNA sequencing and downstream quantitative and functional analysis are rapidly adding a new dimension to understanding biochemical pathways and metabolic interdependencies. These increasing insights pave the way to designing new strategies that address public needs, including environmental applications and therapeutic inventions, or novel cell factories for sustainable and reconcilable energy or chemicals sources. Adding yet another level is building upon nonnaturally occurring networks and pathways. Recent developments in synthetic biology have created economic and reliable options for designing and synthesizing genes, operons, and eventually complete genomes. Meanwhile, high-throughput design and synthesis of extremely comprehensive DNA sequences have evolved into an enabling technology already indispensable in various life science sectors today. Here, we describe the industrial perspective of modern gene synthesis and its relationship with synthetic biology. Gene synthesis contributed significantly to the emergence of synthetic biology by not only providing the genetic material in high quality and quantity but also enabling its assembly, according to engineering design principles, in a standardized format. Synthetic biology on the other hand, added the need for assembling complex circuits and large complexes, thus fostering the development of appropriate methods and expanding the scope of applications. Synthetic biology has also stimulated interdisciplinary collaboration as well as integration of the broader public by addressing socioeconomic, philosophical, ethical, political, and legal opportunities and concerns. The demand-driven technological achievements of gene synthesis and the implemented processes are exemplified by an industrial setting of large-scale gene synthesis, describing production from order to delivery.
1. Brief History of Gene Synthesis
For about three decades, the top-down approach of manipulating living organisms by breeding and crossbreeding has been largely augmented by the novel bottom-up techniques of direct genetic manipulation. In 1978, the Nobel Prize in physiology or medicine was awarded to Werner Arber, Daniel Nathans, and Hamilton O. Smith for discovering restriction enzymes and their application in molecular genetics. At the time, an editorial comment in Gene stated “. . . The work on restriction nucleases not only permits us easily to construct recombinant DNA molecules and to analyze
individual genes but also has led us into the new era of synthetic biology where not only existing genes are described and analyzed but also new gene arrangements can be constructed and evaluated” (Szybalski and Skalka, 1978). This cornerstone in molecular biology gave birth to the success story of genetic engineering we have witnessed over the past 30 years. Other important milestones during this period were certainly the invention of the polymerase chain reaction (PCR) (Saiki et al., 1985), cheap automated production of oligonucleotides, and high-throughput DNA sequencing systems. The systematic genetic manipulation and redesign of novel strains and genetically modified organisms (GMOs) are based on the removal of cross-species boundaries, the rearrangement of natural genetic building blocks, and the introduction of minor modifications into natural DNA sequences. Still today, most attempts to generate organisms with novel phenotypes rely on a trial-and-error approach due to the fact that living systems are extremely complex by nature and far from being fully understood. This is somewhat unsatisfying, since true construction and genuine design of machines or other man-made items aim to be as flexible, yet as standardized and predictive as possible. The emerging field of synthetic biology aims to apply the standardized process of engineering disciplines to biological sciences: working with standardized parts, combining these elements according to given syntax rules, and finally, being able to predict the effect of an assembly as precisely as possible. The prime requirement for this task is the actual availability of genetic elements that do not exist in nature. As such, de novo gene synthesis is considered the key enabling technology for synthetic biology. In 1970, the first example of a synthetically produced gene was demonstrated by Khorana and coworkers (Agarwal et al., 1970). 
In an effort taking several years, they assembled a 77-bp gene encoding yeast alanine transfer RNA using short oligonucleotides obtained by organic chemistry methods. While in those days gene synthesis was still restricted by the limited availability of synthetic oligonucleotides, the development of automated oligo synthesizers and subsequent decline in prices of related services motivated the emergence of novel gene synthesis methods, for example, using a T4 DNA ligase (Edge et al., 1981), heat stable ligases (Barany and Gelfand, 1991), and the ligase chain reaction (LCR) (Young and Dong, 2004). With the invention of the PCR by Kary B. Mullis in 1985 (Saiki et al., 1985), de novo gene synthesis became accessible to a broad market. Several PCR oligonucleotide assembly methods emerged based on one or more primer extension steps with subsequent amplification. Their application crossed the 1000 bp size barrier in 1990 with the synthesis of a 2.1-kb fully synthetic plasmid by Mandecki and colleagues (Mandecki et al., 1990). Since then, ever larger synthetic DNA molecules have been constructed, although usually put together from smaller de novo synthesized 1–2 kb modules by classical ligation and/or recombination, for example, an infectious
approximately 7.5 kb poliovirus cDNA (Cello et al., 2002), or a contiguous 32 kb polyketide synthase gene cluster (Kodumal et al., 2004; Menzella et al., 2006). The current pinnacle of this advance is the compilation of an entirely synthetic bacterial genome. The group around J. Craig Venter designed, synthesized, and assembled the 1.08-Mbp Mycoplasma mycoides JCVI-syn1.0 genome starting from digitized genome sequence information. Synthetic building blocks of approximately 1 kb were first assembled from oligonucleotides and then recombined into approximately 10 kb fragments in yeast. In a next step, these were likewise recombined into approximately 100 kb intermediates, and then into the complete bacterial genome, which was subsequently transplanted into recipient Mycoplasma capricolum cells. This resulted in the first self-replicating organism derived from a fully synthetic genome (Gibson et al., 2010).
2. Applications of Synthetic Genes
The first examples of genes constructed from synthetic oligonucleotides were primarily motivated by the relative complexity of attaining these molecules using alternative molecular techniques (Itakura et al., 1977; Koster et al., 1975). The ensuing rapid progress of genetic manipulation, in particular the invention of PCR, later offered much faster access to genetic material from natural sources. Thus, for some years, the potential of synthetic genes fell into oblivion, until the coverage of sequence databases and limited flexibility and performance of natural genes stimulated a new need for synthetic genes.
2.1. Availability and safety
Today, the conversion of electronic sequence data into actual bioactive molecules is a vital tool in biotechnology. In many cases, the natural source material for isolating genes is simply not available, or the necessary steps required to attain a full-length gene are too labor intensive. Biosafety may also be an issue for choosing artificial genes, since working with isolated genes removed from the context of the complete organism is classified as level 1 (no risk) in most cases. Another protective measure of synthetic genes using alternative codons is their decreased ability to recombine with otherwise homologous wild-type sequences, which may be an issue with viral sequences or human oncogenes.
2.2. Origin and reliability
Particularly industrial projects require most steps in research and production to be well documented and certified for regulatory reasons. This also includes the audit trail of the research reagents’ origin. It is sometimes
challenging to retrace a gene’s laboratory history, or it may derive from sources or collections that do not meet regulatory demands. Obtaining a physical gene from an ISO-certified provider circumvents this problem and is a straightforward strategy for gapless documentation. It also assures full sequence fidelity according to project design requirements, since, based on experience, many constructs derived from in-house, public, and commercial gene collections are not identical to the documented sequence.
2.3. Expression efficiency
To date, most experiments in biotechnology include the recombinant expression of proteins, either to change the host’s phenotype or to directly obtain and purify the overproduced polypeptide. The dissimilar genetic and biochemical setup of different species usually causes nonoptimal transcription, processing, stability, and translation of the extrinsic gene or mRNA. Employing multiparameter optimization allows adaptation of a coding sequence to the requirements of the host, so that it performs like a native gene. Moreover, since most natural genes have not evolved for maximum expression, optimization can introduce this feature. With an overall effect on protein production yields ranging from a 10% increase to obtaining high expression of a previously undetectable gene product, optimization not only improves cross-species performance but also autologous expression, for example, the production of human genes in mammalian cells (Fath et al., 2011).
2.4. Protein performance
Not only are genes in suboptimal shape for technological and industrial purposes, but so are their products. Increasing numbers of recombinant proteins are being employed in healthcare, chemical and food industries, agriculture, and everyday household products. Here, they must perform under conditions that are substantially different from their previous natural environment. Viral antigens for immunization ought to be highly immunogenic, humanized antibodies for cancer therapy must recognize distinct cellular targets, and enzymes in laundry detergents have to perform under the harsh conditions of a washing machine, to name just a few. Proteins need to be engineered in order to be of commercial use. However, rational computation and prediction of necessary alterations is extremely difficult, and in most cases unachievable, since we still lack sufficient knowledge to deduce three-dimensional protein structures from the amino acid sequence. Here, it is common practice to involve methods of directed evolution, that is, the generation and selection of many protein variants. While earlier methods to produce gene collections or gene libraries for this purpose involved tedious targeted or random mutagenesis, gene synthesis provides much faster access
to these collections and on a more rational basis. During gene fabrication, the use of oligonucleotides carrying controlled impurities (degeneracies) at defined positions allows the production of libraries encoding proteins in which only the relevant amino acids are subject to substitution. This confines the desired variability to the areas of interest and dramatically increases the success rate of protein improvement through directed evolution.
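To make the idea of controlled degeneracy concrete, the following minimal Python sketch expands a degenerate codon written in IUPAC nucleotide codes and tallies the residues it can encode under the standard genetic code. The NNK scheme mentioned in the usage note is a common library design choice and is not taken from this chapter; the function names are illustrative only.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, built from the compact TCAG-ordered string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand(degenerate_codon):
    """Return every concrete codon matched by a degenerate codon."""
    choices = (IUPAC[base] for base in degenerate_codon.upper())
    return ["".join(codon) for codon in product(*choices)]


def encoded_residues(degenerate_codon):
    """Count how many concrete codons encode each residue ('*' = stop)."""
    return Counter(CODON_TABLE[codon] for codon in expand(degenerate_codon))


if __name__ == "__main__":
    for scheme in ("NNN", "NNK"):
        residues = encoded_residues(scheme)
        print(scheme, len(expand(scheme)), "codons,",
              len(residues) - ("*" in residues), "amino acids,",
              residues["*"], "stop codon(s)")
```

Under these assumptions, NNK covers 32 codons that encode all 20 amino acids with a single stop codon (TAG), whereas fully random NNN includes all three stop codons.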
2.5. Cost, capacity, and speed
The considerable decline in prices for synthetic genes has today created a source of biological DNA sequences that economically outcompetes classic genetic engineering methods. Molecular cloning steps, necessary in many projects as groundwork, can be outsourced and internal resources focused on genuine research goals. Relocating the manual DNA manipulation to an automated industrial manufacturing process also dramatically increases the processable unit size (more genes can be obtained in a shorter time), a vital necessity in the competitive domains of commercial and scientific biotechnology.
2.6. Flexibility of design: artificial genes, operons, and genomes
The freedom to access any imaginable DNA sequence allows not only the modification and adaptation of naturally occurring molecules but also the manifestation of some very new visions in synthetic biology (Heinemann and Panke, 2006). A major goal within this field is to design and construct new metabolic pathways within a producer cell. This must address three major obstacles. First, for a stable and efficient series of reactions, the enzymes involved must be expressed in a highly concerted manner. Very much like other engineering technologies, this demands the availability of standardized regulatory parts and elements. Ideally, promoters, ribosome binding sites, terminators, DNA-binding proteins, corresponding protein landing sites, etc. should be available with various well-characterized potencies and specificities. Together with sophisticated computer-aided design and simulation tools, these elements ought to be combined a priori to compile novel pathways. Second, fast and efficient formation of new gene clusters or operons requires the simultaneous assembly of such parts in a robust, yet flexible way. Classical restriction sites do not allow for arbitrary combination of multiple elements simultaneously. Novel in vitro recombination technologies in conjunction with artificial modular junction sites can offer solutions in this direction. Third, establishing an extrinsic biochemical pathway within a living cell must always be perceived in the context of its entire metabolism. The availability of only one diffusion space does not allow efficient spatial separation of distinct reaction steps, and participating
intermediates can always interfere with both the projected pathway and total cell fitness. Therefore, one aim is to construct simplified “chassis” strains with genomes reduced to the lowest number of genes necessary for cellular survival and growth (Gibson et al., 2008). Here again, knocking out dispensable genes one by one by conventional methods is likely to be a highly tedious strategy. More likely, the in vitro synthesis of complete genomes, designed from scratch, will provide a much faster and more flexible way to bring these organisms to life. It is reasonable to assume that the cornerstone of the complete synthesis and transplantation of a 1.08-Mbp M. mycoides genome will drive further developments toward modular gene cluster construction kits in conjunction with compatible host strains, allowing for true engineering strategies in biological sciences.
3. State-of-the-Art Gene Synthesis
Gene synthesis has emerged as a new application of genetic engineering, utilizing oligonucleotides and different methods of assembling these to generate stretches of double-stranded DNA usually cloned into a plasmid vector. The numerous methods employed today vary widely based on the length and complexity of the DNA, and depending on other factors such as intellectual property rights or high throughput and automation capability. Although synthetic genes can readily be ordered via the internet, the methods applied are usually basic genetic engineering methods and hence gene synthesis can be performed at the molecular biology bench using typical reagents and procedures. In general, DNA synthesis relies on assembling individually presynthesized oligonucleotides, typically by PCR-based reactions (e.g., SCR: sequential chain reaction), or by ligation of predefined reusable oligos (Slonomics technology) (Van den Brulle et al., 2008), followed by standard cloning procedures and final quality control. Oligonucleotides are usually ordered from commercial providers, since the process of oligonucleotide synthesis has been automated and oligos can be produced very economically. Standard chemical oligo synthesis is a cyclical process that elongates a chain of nucleotides from the 3′- to the 5′-end. The phosphoramidite four-step process, developed in the early 1980s, couples an acid-activated deoxynucleoside phosphoramidite to a deoxynucleoside on a solid support. Although this is the method of choice currently used by most commercial oligonucleotide synthesizers, the specifications for oligo usage in gene synthesis require adjustments to quantity and quality. Since PCR-based methods are intrinsically error-prone due to the high error rate associated with oligonucleotide synthesis and sequence mutations introduced during PCR amplification (Xiong et al., 2008), gene synthesis greatly depends on oligonucleotides with maximum sequence accuracy.
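Why sequence accuracy matters so much can be illustrated with a simple estimate. The sketch below assumes, as a simplification not stated in the chapter, that errors occur independently at a constant per-base rate e, so that the fraction of error-free molecules of length L is (1 − e)^L. The error rates in the demonstration loop are placeholders, not measured values.

```python
def error_free_fraction(per_base_error, length_bp):
    """Fraction of molecules without any error, assuming independent
    errors at a constant per-base rate (a simplifying assumption)."""
    return (1.0 - per_base_error) ** length_bp


def expected_clones_to_screen(per_base_error, length_bp):
    """Mean number of clones to sequence until one correct clone is found
    (mean of a geometric distribution with success probability p)."""
    return 1.0 / error_free_fraction(per_base_error, length_bp)


if __name__ == "__main__":
    # Placeholder error rates (errors per base) and fragment lengths.
    for rate in (1 / 200, 1 / 1000, 1 / 5000):
        for length in (500, 1000, 2000):
            p = error_free_fraction(rate, length)
            n = expected_clones_to_screen(rate, length)
            print(f"e=1/{round(1 / rate)}  L={length} bp  "
                  f"error-free={p:.3f}  clones~{n:.1f}")
```

Because the error-free fraction falls exponentially with length, even modest improvements in oligonucleotide accuracy translate into markedly fewer clones that need to be sequenced for a given fragment size.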
In addition, a balanced ratio of oligo quality and quantity is desired, because for gene synthesis only low amounts are needed compared to other conventional oligonucleotide-based applications. Scaling down oligonucleotide synthesis is one of the most efficient ways to reduce gene synthesis costs, although reducing production volume and chemicals consumption is limited with current phosphoramidite-based synthesis processes and the need to preserve a high quality level. Continuous development of oligonucleotide synthesis specifically for use in gene synthesis is progressing, both in terms of production scales and by bringing in new methods developed for different applications, for example, chip synthesis technology. Combining these basic components into the first-step synthetic DNA fragment is usually limited to approximately 500–2000 bp, depending on the technology used. Accordingly, subsequent assembly steps are required for larger genes or higher order complexes. Again, the arsenal of assembly technologies is large and still evolving. Today, most users apply techniques that are PCR-based (e.g., PCA: polymerase chain assembly), ligase-based (LCR), mixtures of both, or homology-based methods (SLIC: sequence- and ligation-independent cloning; RED recombination, etc.). However, commercial gene synthesis performing large-scale DNA fabrication at base-level precision is transforming genetic engineering from a laborious art to an industrial, information, and technology-driven discipline (for a review see Czar et al., 2009). It is expected that synthetic biology driving demands for synthetic genes will challenge existing gene synthesis capabilities. Thus it is not surprising that current chemical DNA synthesis and gene assembly methods are being supplemented with new engineering tools, technologies, and trends aiming at providing or extending gene synthesis capacities, and at the same time cutting production costs.
Some of these developments include oligonucleotide synthesis from DNA microarrays or the use of microfluidics and multiplex gene synthesis technologies (reviewed by Tian et al., 2009). Recent developments target reliability and process stability as well as simplifying existing processes, for example, by introducing smart error correction methods, reducing and improving oligo assembly, or providing assembly devices (Cheong et al., 2010; Gordeeva et al., 2010; Huang et al., 2009; TerMaat et al., 2009).
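As an illustration of how a target sequence might be broken down into oligonucleotides for an overlap-based assembly such as PCA, the following Python sketch splits a sequence into overlapping oligos on alternating strands and checks in silico that they reconstitute the target. The oligo length, the overlap, and all function names are assumptions chosen for illustration; real designs additionally balance melting temperatures and avoid repeats, which this sketch ignores.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def design_oligos(target, oligo_len=40, overlap=20):
    """Tile the target with overlapping oligos on alternating strands.

    Returns (start, strand, sequence) tuples; '-' oligos are given as the
    reverse complement of the covered top-strand region. The target must
    be longer than the overlap.
    """
    step = oligo_len - overlap
    oligos = []
    for index, start in enumerate(range(0, len(target) - overlap, step)):
        fragment = target[start:start + oligo_len]
        strand = "+" if index % 2 == 0 else "-"
        seq = fragment if strand == "+" else reverse_complement(fragment)
        oligos.append((start, strand, seq))
    return oligos


def reassemble(oligos):
    """Rebuild the top strand in silico, checking that overlaps agree."""
    assembled = ""
    for start, strand, seq in oligos:
        top = seq if strand == "+" else reverse_complement(seq)
        shared = len(assembled) - start
        assert assembled[start:] == top[:shared], "overlap mismatch"
        assembled += top[shared:]
    return assembled


if __name__ == "__main__":
    # Arbitrary demonstration sequence, not a gene from the chapter.
    target = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCAC"
    oligos = design_oligos(target)
    for start, strand, seq in oligos:
        print(start, strand, seq)
    print("reassembles:", reassemble(oligos) == target)
```

Alternating strands means that neighboring oligos anneal through their overlaps and can be extended by a polymerase, which is the principle behind the primer extension steps described earlier.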
4. Gene Synthesis and Synthetic Biology: From Genes to Genomes
Synthetic biology is a truly interdisciplinary development with many different scientific, commercial, social and political aspects, interests, and implications. Much has been initiated and already accomplished; some aspects, however, are still at an infant or developmental stage. For example,
simple provision of synthetic DNA as elementary components is rather advanced and a complete industry has evolved within the past decade (Graf et al., 2009). Still, in order to efficiently exploit the potential that gene synthesis offers synthetic biology for developing new applications, some issues need specific attention.
4.1. Information
Synthetic biology comprises different layers of networks associated with information. Function, regulation, flux, and genetic information are just a selection of relevant categories where information input is needed to generate specific processes. Starting from (i) assignable single functions, such as specific catalytic activities or unique binding sites, to (ii) more complex but still defined regulation interrelations, as experienced in operons or logical function devices, and (iii) complex higher order processes found in metabolic pathways, cells, or organisms, the information needed gains complexity to a comparable extent and becomes increasingly difficult to provide. Nevertheless, or precisely for this reason, the first requirement in synthetic biology is knowledge and, more importantly, access to it. A comprehensive database for synthetic biology, should it exist, must, in addition to technological details, provide information beyond function and how a product can be used. Depending on the goals and application field, it is mandatory to be informed about community standards, for example, on assembly, quality control, quantification, intellectual property rights, etc., associated with a component or device and on other legal regulations such as biosafety and biosecurity issues. At present, a comprehensive data collection is not available, although the information exists. Therefore, one central demand for synthetic biology to prosper is without a doubt generating and maintaining an information database. Since such a project reflects the broader public interest in addition to its scientific significance, and since it will require substantial resources, it will most likely require public funding. One example pointing in the right direction has been developed by the Spanish National Cancer Research Center supported by various national and international funding agencies.
Bionemo (Biodegradation Network Molecular Biology Database) is an online data collection that stores manually curated information, extracted from published articles, about proteins and genes directly implicated in biodegradation metabolism (http://bionemo.bioinfo.cnio.es/Run.cgi).
4.2. Modularity
Information such as the properties associated with a part (protein) or a subpart (domain) is readily available and can be used to design and produce new functions by combining what is known. The engineering and
modification of individual proteins was traditionally dominated by directed evolution methods, providing appropriate pools of proteins with partly randomized sequence and methods to select the desired variation (Bershtein and Tawfik, 2008). More recently, computational protein design methods are becoming increasingly successful with structure-based engineering of protein folds, interactions, and activities (Van der Sloot et al., 2009). Apart from the design and engineering by manipulation of residues, combination and fusion of whole protein domains is gradually becoming more popular (Heyman et al., 2007; Parmeggiani et al., 2008) and might be further boosted by the concept of accessing standardized parts and subparts. The smallest design entity in the vocabulary of synthetic biology refers to a subpart that characterizes the discrete minimal sequence requirement associated with a function performing specific tasks independently of the other subparts (one defined segment of a more complex whole). The subpart comprises (i) a domain with respect to structural sequences (being translated into proteins) and (ii) a functional motif, for example, a transcription factor binding site, with respect to functional sequences (e.g., regulatory sequences as represented by a promoter). Starting with subparts and increasing the complexity via assembly into parts, devices, systems, and genomes, the basic requirements for a circuit diagram-based construction that can be characterized by interchangeable, functionally well-defined, and ultimately normalized components become obvious. Subparts and all subsequent constructions need to be freely linked and arranged in order to combine the intended functions. The connection sites also need to be flexible, providing the potential to introduce additional motifs, such as linkers, restriction sites, or protease cleavage sites. The assembly process thus needs to be highly flexible and the system requires a high degree of modularity. 
Scar-free assembly of sequences resulting in the exact input sequence depends on sophisticated bioinformatics tools for sequence modulation and optimization. These tools are available, and in addition to gluing the exact sequences together, most of the developed tools provide optimization algorithms to improve expression characteristics in the selected host system (Raab et al., 2010). Apart from the technological feasibility of domain and circuit assembly, apparent biological complexity appears to impede the rational design of sophisticated protein circuitry. However, progress in this direction is evident and fusion of individual domains to new functional entities seems possible (Grünberg and Serrano, 2010).
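The hierarchy of subparts, parts, devices, and systems described in this section can be pictured as a nested data structure. The following Python sketch is one possible, purely illustrative representation; the class names, fields, and example sequences are assumptions and do not correspond to any existing registry format or software tool. An empty linker models scar-free joining, while a non-empty linker models the optional introduction of an additional motif at each junction.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Subpart:
    """Smallest design entity: a protein domain or a functional motif."""
    name: str
    kind: str       # "domain" or "motif"
    sequence: str


@dataclass
class Assembly:
    """A part, device, or system built from subparts or other assemblies."""
    name: str
    level: str      # e.g., "part", "device", "system"
    components: List[Union["Subpart", "Assembly"]] = field(default_factory=list)

    def sequence(self, linker: str = "") -> str:
        """Concatenate all components; an empty linker gives a scar-free join."""
        pieces = [
            c.sequence if isinstance(c, Subpart) else c.sequence(linker)
            for c in self.components
        ]
        return linker.join(pieces)


if __name__ == "__main__":
    # Placeholder sequences for illustration only.
    binding_site = Subpart("tf_site", "motif", "TTGACA")
    rbs = Subpart("rbs", "motif", "AGGAGG")
    domain = Subpart("domain_a", "domain", "ATGAAA")

    regulatory_part = Assembly("regulatory", "part", [binding_site, rbs])
    device = Assembly("expression_unit", "device", [regulatory_part, domain])

    print(device.sequence())           # scar-free concatenation
    print(device.sequence("GGATCC"))   # same design with a motif at each junction
```

Because assemblies can contain other assemblies, the same interface serves every level of the hierarchy, which is the kind of interchangeability the circuit diagram analogy calls for.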
4.3. Standardization
Conceptual frameworks and related international collaboration opportunities are sparse. Standards underlie most aspects of the modern world, especially when it comes to engineering principles that rely on the exact description of individual elements used for a construction plan-based
design. Considering the complexity of biological systems, an adequate process of standardization seems inordinately more difficult in the science of biology (De Lorenzo and Danchin, 2008). Still, a number of useful standards have already been described, and the number is increasing, partly in response to the development of widely practiced methods that generate significant amounts of data (exterior impulse), and partly due to initiatives aiming at transforming biology into an engineering discipline (interior impulse). Existing standards include information at different levels, ranging from one-dimensional descriptions (e.g., enzyme nomenclature, endonuclease activities, DNA sequence data, and genetic features) to complex data handling and description (e.g., microarray data, protein crystallographic data, and systems biology models) (Endy, 2005). These standards have to be supplemented by accurate technical standards for most classes of basic biological functions and experimental measurements, as well as by standards beyond technology that facilitate cooperation, sharing, and public acceptance (e.g., common language, IP regulation, and biosecurity guidance). This is essential for a prospering and responsible synthetic biology community. In principle, two technological standardization categories have to be considered when designing a device or a system: (i) the physical assembly of parts within a construction plan, based on cloning/assembly rules and (ii) function, based on consistent characterization and score classification of reusable standard biological parts. Fueling the synthetic biology idea, the Registry of Standard Biological Parts, which started at MIT as a first practical example, now maintains and distributes thousands of BioBrick biological parts (Canton et al., 2008). However, BioBrick parts are only standardized in terms of how individual parts are physically assembled into multicomponent systems, and most parts remain uncharacterized.
Therefore, scientists have started to develop measurements and processes to characterize certain functions, within a defined environment, based on reference activities. In anticipation of global acceptance and use of synthetic biology standards, researchers started to assemble kits for lab use, as exemplified by the definition of the Relative Promoter Unit as a measurement of promoter activity (Kelly et al., 2009). The initial BioBrick limitations have also prompted scientists to develop their own standards, providing different avenues to overcome these shortcomings. The lack of compatibility between independently proposed standards has significantly increased the complexity of assembling constructs from standardized parts. These problems have recently also been recognized and addressed, especially by means of computer-aided design concepts. Computer tools have been developed to provide a framework for the precise description of part assembly in the context of a stimulated progression of physical construction methods and rules. In addition, these tools provide methods for assembly from large libraries of genetic parts, as well as simulation functions to model different biological systems and for testing predicted functions in silico. In analogy to
providing standardization kits, these programs are available online to be accessible to a large community of synthetic biologists (Cai et al., 2010; Cooling et al., 2010; Marchisio and Stelling, 2009).
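The idea of characterizing a function relative to a reference activity, as with the Relative Promoter Unit mentioned in this section, can be sketched as a simple ratio. The Python example below assumes, as a simplification, that promoter activity is proxied by some reporter readout (for example, a per-cell reporter synthesis rate) measured for the test and reference promoters under identical conditions. The function names and numbers are hypothetical, and the full measurement procedure of Kelly et al. (2009) is not reproduced here.

```python
from statistics import mean


def relative_promoter_units(test_activity, reference_activity):
    """Express a promoter activity relative to a reference promoter
    measured under the same conditions (both in the same arbitrary units)."""
    if reference_activity <= 0:
        raise ValueError("reference activity must be positive")
    return test_activity / reference_activity


def rpu_from_replicates(test_replicates, reference_replicates):
    """Average replicate readouts first, then take the ratio."""
    return relative_promoter_units(mean(test_replicates),
                                   mean(reference_replicates))


if __name__ == "__main__":
    # Hypothetical reporter readouts in arbitrary units.
    test = [310.0, 295.0, 305.0]
    reference = [205.0, 198.0, 197.0]
    print(f"{rpu_from_replicates(test, reference):.2f} RPU")
```

Reporting a ratio rather than a raw readout is what makes values comparable across instruments and laboratories, provided both promoters are measured in the same defined environment.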
4.4. Technological developments
The list of requirements can be continued and further depends on the perspective and individual position, situation, or objective. A public spokesman has different concerns and needs from those of a government representative, a synthetic biology user, or a basic material provider, although overlaps are obvious. One major and common requirement is the ability to provide the raw material for developing environmental, energy, medical, material, and other applications, that is, the technological competence to produce devices, systems, and even genomes in a usable, economical as well as ethical and legally justifiable manner. While gene synthesis technologies are rapidly advancing, the assembly of readily fabricated fragments for producing genetic metabolic networks or even genomes is at the moment practically a manual process. However, the potential of assembling genomes using recombination technologies in yeast has been acknowledged (Gibson et al., 2008) and technological progress is evident (Shao et al., 2009). Therefore, to be able to satisfy the anticipated demand for large gene constructs, the scale and cost of assembly technologies need further improvement. Equally, it is absolutely essential to integrate the option of providing a defined amount of variation at a certain position, meaning that computer and wet-lab tools for the design and implementation of gene libraries in synthetic biology projects need to be advanced.
5. Industrial Gene Synthesis: From Bench to Manufacturing
Over the past three decades, the ability to amplify DNA dramatically boosted the availability of natural templates otherwise inaccessible in sufficient amounts for genetic manipulation. In conjunction with easy and cheap availability of oligonucleotide synthesis, PCR also allowed direct and flexible manipulation of amplified DNA fragments, although introduction of larger mutations and/or rearrangements of DNA fragments remained possible only through consecutive rounds of alterations, which was time consuming and expensive. Furthermore, automated fluorescence-based sequencing techniques significantly accelerated molecular cloning and facilitated easy examination of intermediate steps. High-throughput sequencing also led to the exponential growth of available sequence information in publicly available databases, with a doubling rate of approximately
18 months. This in turn motivated the development of sophisticated algorithms and web applications to manage and use this vast amount of data. By the mid-1990s, the records of DNA and protein sequences, structural data, protein interaction networks, expression profiles, etc. became comprehensive enough to substitute for real-life experiments. Today, it is difficult to perform a BLAST analysis with a sequence that has not been previously identified, and such a search typically also returns numerous related sequences from many different species, alive or extinct. Moreover, modern next-generation high-throughput sequencing of complete genomes or even metagenomes predominantly stores the data electronically on hard drives, rather than in tangible genomic or cDNA libraries. Ideally, this could free the experimenter from genetic source material, which is often difficult, impossible, or sometimes dangerous to obtain. What remains is the problem of the fundamental difference between electronic sequence data and its physical counterpart preserved in a tangible gene. A “translation” machine or process capable of quickly converting an ASCII input sequence into a cloned DNA molecule in a copy/paste manner was needed.
5.1. Process features
These promises were so tempting that, around the year 2000, the first companies appeared on the market offering such services. The gene synthesis business started out with a relatively high price for artificial genes, amounting to US$12 per base pair or US$10,000–20,000 for an average sized gene. The application of synthetic genes in scientific projects involved careful preparation and budgeting and was still far from being widespread. However, during the following 10 years, the price declined exponentially. Today, gene synthesis costs are about 3% of their original figure and have reached a level that is highly competitive with any alternative cloning method. This remarkable price drop was due to challenging competition between gene synthesis providers, not only at the level of product prices but also in service coverage, quality, capacity, and delivery time. While at first pricing was the deciding factor for the customer, the falling market price for synthetic genes forced providers to drive technological and administrative developments toward being cost effective in a tight market and coping with an exponential increase in demand. Since nowadays related costs are no longer the vital or limiting factor for deciding to work with synthetic genes, providers concentrate more on total synthesis capacity and the reduction and reliability of delivery time. With the growing market and the beginning of the era of synthetic biology, the business model for gene synthesis companies changed from a high-priced low-quantity niche market provider to a high-throughput supplier of a common research reagent. For the scientist, the order process needs to be easy and intuitive. Ideally, the electronic sequence can be submitted online
within a web interface similar to current sequence manipulation software. It must provide straightforward tools for submitting bulk orders of many different sequences and turnaround times for generating quotes need to be short. As such, many of the involved data processing steps—order entry, sequence optimization, quote generation—must be automated to minimize the level of human involvement in order to secure the scalability of the service.
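The price figures quoted in this section (US$12 per base pair initially, about 3% of that figure after roughly 10 years) can be turned into a back-of-the-envelope estimate. The Python sketch below assumes a purely exponential decline over the whole period, which is an idealization of the trend described, and the function name is illustrative.

```python
import math


def halving_time(elapsed_years, remaining_fraction):
    """Time for the price to halve, assuming a purely exponential decline
    that leaves `remaining_fraction` of the price after `elapsed_years`."""
    return elapsed_years * math.log(2) / math.log(1.0 / remaining_fraction)


# Figures quoted in the text.
initial_price_per_bp = 12.0   # US$ per base pair at market entry
remaining_fraction = 0.03     # about 3% of the original figure
elapsed_years = 10

current_price_per_bp = initial_price_per_bp * remaining_fraction
t_half = halving_time(elapsed_years, remaining_fraction)

print(f"current price: about US${current_price_per_bp:.2f} per bp")
print(f"implied halving time: about {t_half:.1f} years")
```

Under this assumption, the quoted numbers correspond to a price of roughly US$0.36 per base pair and a halving of the price about every 2 years.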
5.2. Biosafety/biosecurity
Synthetic biology is generally believed to have beneficial environmental, biomedical, and commercial potential; at the same time, potential “high-risk” factors and applications cannot be neglected. Gene synthesis is a typical “dual-use” technology. It can be applied for the greater good, providing research material for therapeutics and vaccine development, but the very same genes can be misused for nefarious purposes to cause considerable harm. In particular, the possibility of synthesizing pathogens and using these as biological weapons is palpable. The successful synthesis or reproduction of a poliovirus accomplished by online ordering of oligonucleotides (Cello et al., 2002), the reconstruction of the 1918 “Spanish Flu” virus (Tumpey et al., 2005), and many more examples fuel this notion. At the moment, legally binding regulations for screening do not exist, but awareness of the dual-use problem inherent in the widely promoted potential of synthetic biology, and of the resulting need for appropriate directives, is high (Samuel et al., 2009). The gene synthesis industry, represented by five major companies joined within the International Gene Synthesis Consortium (IGSC), has taken on its responsibility for a secure and fair supply of genetic material. The IGSC has developed and presented a harmonized best practice screening protocol in compliance with draft guidelines released by the US government. The member companies committed themselves to comply with the developed protocol and the maxim behind it, and implemented or adjusted screening processes accordingly. The second risk factor in large-scale gene synthesis concerns the biosafety evaluation of hundreds of sequences.
The possibility that GMOs escape from a research laboratory or containment facility, with the potential to proliferate out of control and cause environmental damage or threaten public health, is a long known concern, coinciding with the very early advances in recombinant DNA technology. As early as 1975, a group of professionals attending the Asilomar conference defined voluntary guidelines to ensure the safe handling of recombinant DNA. These guidelines were in general adopted by the scientific community and brought forth stringent regulation in many biosafety (genetic engineering) laws. Accordingly, one integral component of the ordering process concerns the authentication of a sequence request in order to provide security that the generated GMO has no potential to harm
lab personnel, that the sequence cannot be misused for hostile or malicious purposes, and that a customer is ordering with the intention to promote legitimate research (Fig. 11.1). In accordance with the U.S. governmental guidelines (Screening Framework Guidance for Providers of Synthetic Double-Stranded DNA, effective on October 13, 2010), the biosecurity evaluation process is divided into two tasks: first, the identity and legitimacy of a customer are assessed. Second, the sequences for all ordered gene products are identified and screened against specific databases to determine whether they match a sequence related to an existing hazardous or controlled agent or toxin. Regulation protocols have been implemented into the order process to provide decision guidance for safety officers in the case that a sequence or a customer raises concerns. Problematic sequence requests are processed in absolute compliance with national and international regulations and laws. These include export control regulations and the guidelines and lists established by the Australia Group, an informal forum of member countries with the goal to strengthen global security through harmonization of export controls to prevent illegitimate supply of compounds for chemical or biological weapons (http://www.australiagroup.net/en/index.html). In addition, customers located in “Countries of Concern” as determined by official authorities are informed that due to compliance with all export controls, sanctions, and related laws and regulations, their order cannot be accepted. The relevant information is condensed, and an internal summary export control document has to be completed before shipment of goods.
Figure 11.1 Schematic overview of Life Technologies’ biosafety and biosecurity screening practice integrated into the ordering process. [Flowchart not reproduced; labels recoverable from the figure: BioSafety (local biosafety classification, sequence identification, sequence host, sequence function); BioSecurity sequence check (critical sequence lists: AG list, CDC; NCBI BLAST); country check; customer check (legitimacy; BioSecurity customer evaluation lists); approval of summary export control document; production; shipment.]
5.3. Optimization rationale
The first step in gene synthesis is specifying the sequence itself. Given the flexibility of synthesizing any conceivable string of nucleotides, it is reasonable to alter a natural gene to ensure its best performance in the required application or experiment. The second rationale for gene optimization is of a practical nature. Since the synthesis of genes relies on the correct assembly of short oligonucleotides, copious motif repeats and inverted repeats need to be avoided. This, again, is a beneficial feature in the final sequence regarding genetic stability. The most commonly employed modification of protein-coding genes is adapting codon usage. With the rapidly growing size of natural sequence databases, numerous sequenced genes are listed for many species, up to fully sequenced genomes of the most studied organisms. This information is compiled into codon usage databases, reflecting the relative frequency of alternative codons in each organism. Different schemes and algorithms have been developed to best adapt a coding gene to the codon usage of the host organism. The most common optimization strategy to date is completely avoiding rare codons and aiming for maximum saturation with the most frequent ones. It has been demonstrated that the most frequent codons correlate with the most abundant tRNA pools, while the relative tRNA levels do not change with expression or cellular growth and are available for the translational machinery (Emilsson et al., 1993; Ikemura, 1985). Codon choice, however, is not the only parameter to consider when designing a gene. Other variables include adjusting GC content and avoiding direct and reverse repeats, restriction sites, ribosomal entry sites, cryptic splice motifs, polyadenylation signals, sequences controlling mRNA half-life, RNA secondary structures, etc. (Fig. 11.2). Conversely, it may be desirable to introduce certain DNA motifs or to avoid similarities to naturally occurring sequences.
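The most common strategy described above (avoid rare codons, saturate with the most frequent ones) and two of the additional checks (GC content, unwanted motifs) can be sketched in a few lines of Python. This is a minimal illustration only, not the GeneOptimizer implementation: the codon frequencies in `CODON_USAGE` are invented placeholder values rather than entries from a codon usage database, the table covers only a few amino acids, and all function names are hypothetical.

```python
# Placeholder codon usage table: amino acid -> {codon: relative frequency}.
# The numbers are invented for illustration and cover only a few amino acids.
CODON_USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.74, "AAG": 0.26},
    "L": {"CTG": 0.47, "TTA": 0.14, "TTG": 0.13, "CTT": 0.12, "CTC": 0.10, "CTA": 0.04},
    "E": {"GAA": 0.68, "GAG": 0.32},
    "*": {"TAA": 0.61, "TGA": 0.30, "TAG": 0.09},
}


def most_frequent_codon(amino_acid):
    """Return the codon with the highest relative frequency for one amino acid."""
    table = CODON_USAGE[amino_acid]
    return max(table, key=table.get)


def back_translate(protein):
    """Encode a protein using only the most frequent codon at every position."""
    return "".join(most_frequent_codon(aa) for aa in protein)


def gc_content(dna):
    """Fraction of G and C nucleotides in a DNA string."""
    return (dna.count("G") + dna.count("C")) / len(dna)


def find_motifs(dna, motifs):
    """Map each motif that occurs in dna to the list of its start positions."""
    hits = {}
    for motif in motifs:
        positions = [i for i in range(len(dna) - len(motif) + 1)
                     if dna.startswith(motif, i)]
        if positions:
            hits[motif] = positions
    return hits
```

Such a per-codon approach ignores the interplay between parameters: always choosing the most frequent codon can shift GC content or create repeats and motifs, which is why the checks have to be evaluated together with codon choice.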
5.4. Optimizer software
Taken together, these requirements result in a multiparameter optimization. The challenge is to find the sequence that represents the best compromise between different and sometimes conflicting requirements. Without doubt, the best solution would be to generate all possible combinations of codons representing a given amino acid sequence, assess all of them with the help of a quality function, and finally choose the one with the highest quality score regarding all necessary parameters (Fig. 11.3A). Unfortunately, even for a rather small protein of 100 amino acids, the number of possible combinations (assuming an average of about three synonymous codons per amino acid) is in the range of $3^{100} \approx 5 \times 10^{47}$, making the outlined approach impossible to perform in practice. The high-throughput processing of several hundred sequences per day calls for an algorithm capable of optimizing a gene in a matter of minutes.
Figure 11.2 GeneOptimizer multiparameter gene optimization: parallel processing of performance-relevant sequence parameters. (Diagram labels: wild-type gene sequence; codon usage; GC content; sequence repeats; RNA secondary structures; splice sites; poly(A) sites, killer motifs; GeneOptimizer®; optimized gene.)
Figure 11.3 Schematic overview of potential optimization strategies. (Panels: A, test all possible; B, iterative; C, sliding window.)
Another strategy is the serial optimization of each sequence feature. Here, a first round could optimize the codon usage, a second cycle would adapt GC content, a third iteration would eliminate repetitive sequences, and so on (Fig. 11.3B). Obviously, with each iteration, the quality of the parameters optimized earlier decreases and undesired motifs may arise.
In order to still find an optimal solution, it is necessary to reduce the search space by performing an exhaustive search for the best solution only inside a small sequence window, which is moved along the whole sequence from the 5′- to the 3′-end of the reading frame. In each iteration, all codon combinations of the current window are calculated and ranked according to the desired parameters, also taking the already optimized part of the sequence into account. The best 5′ codon of the window is fixed, and the window for the next calculation round is slid one codon toward the 3′-end (Fig. 11.3C). This strategy considers both local and global sequence traits and can find an optimized sequence without human interaction in 1–3 min on a standard personal computer. This logic has been implemented in the GeneOptimizer® sequence suite, developed by GeneArt (Raab et al., 2010).
The following passages describe the developments and processes that have been implemented in Life Technologies/GeneArt’s technology platforms in order to realize the transition from bench-style gene synthesis to industrial-scale DNA manufacturing. The production process chain is illustrated by the data content and the information flow embodied within the GeneOptimizer® and the laboratory information and management system (LIMS), which represent the company’s most fundamental IT groundwork.
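The sliding-window strategy of Section 5.4 can be illustrated with a small Python sketch. It is a simplified stand-in under stated assumptions, not the published GeneOptimizer algorithm: the codon table `CODONS` holds invented frequencies for four amino acids, the quality function and its weights (codon frequency, deviation from a 50% GC target, a penalty of 2.0 per forbidden motif) are hypothetical choices, and only the last 30 nt of the already fixed sequence plus the window are scored.

```python
from itertools import product

# Invented codon frequencies for a handful of amino acids (illustration only).
CODONS = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.7, "AAG": 0.3},
    "G": {"GGC": 0.4, "GGT": 0.3, "GGG": 0.2, "GGA": 0.1},
    "F": {"TTT": 0.6, "TTC": 0.4},
}
FREQ = {codon: f for table in CODONS.values() for codon, f in table.items()}
FORBIDDEN = ("AAAAAA", "GGGGGG")  # hypothetical unwanted motifs


def search_space_size(protein):
    """Number of distinct coding sequences for the protein (exhaustive search)."""
    size = 1
    for aa in protein:
        size *= len(CODONS[aa])
    return size


def quality(context, combo, target_gc=0.5):
    """Hypothetical quality score for one codon combination of the window.

    The score is evaluated on the tail of the already fixed sequence plus the
    window, so previously fixed codons influence the current choice.
    """
    region = (context + "".join(combo))[-30:]
    usage = sum(FREQ[c] for c in combo) / len(combo)
    gc = (region.count("G") + region.count("C")) / len(region)
    hits = sum(region.count(m) for m in FORBIDDEN)
    return usage - abs(gc - target_gc) - 2.0 * hits


def sliding_window_optimize(protein, window=3):
    """Fix one codon per step: enumerate the window, keep only its 5' codon."""
    fixed = ""
    for i in range(len(protein)):
        best_score, best_codon = None, None
        for combo in product(*(CODONS[aa] for aa in protein[i:i + window])):
            s = quality(fixed, combo)
            if best_score is None or s > best_score:
                best_score, best_codon = s, combo[0]
        fixed += best_codon  # the window then slides one codon toward the 3' end
    return fixed
```

With this toy table, the exhaustive search space grows as the product of the codon counts per position, whereas the sliding window evaluates only a small, bounded number of combinations per step. For comparison, `3 ** 100` evaluates to roughly 5.15 × 10^47 in Python, consistent with the estimate given in the text.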
6. Design Tool: GeneOptimizer
6.1. Project design
Gene optimization is an optional process usually applied in biotechnological applications that involve protein expression in a specific host system. There is strong evidence that optimization in general has a beneficial influence on production rates in different expression systems (Gustafsson et al., 2004; Maertens et al., 2010) as well as on expression level and duration in vivo (Kosovac et al., 2010). However, optimization can have advantages other than influencing expression. Whenever it seems favorable to avoid sequence homology, this can be achieved by gene optimization. Potential applications include (i) prevention of homology to host chromosomal sequences for enhanced plasmid stability and reduced integration events, (ii) reducing homologous recombination events to enhance safety in gene therapy or genetic vaccination approaches (Wagner et al., 2000), (iii) rescue experiments with modified genes that are, in contrast to the natural gene, not affected by siRNA-mediated silencing targeting the wild-type gene (Fath et al., 2011), or (iv) introducing silent mutations to eliminate specific DNA motifs (e.g., restriction endonuclease recognition sites).
The GeneOptimizer® offers solutions for individual requests. Gene sequences are initially subjected to a multiparameter analysis. Subsequent modifications usually span the following options: (i) change only a specific parameter, (ii) perform complete optimization or optimization of defined sequence stretches, or (iii) process sequences in their original wild-type form. Thus, the GeneOptimizer® represents a valuable tool for project design. For example, it has been used to convert a commonly used reporter gene RNA (gfp gene; green fluorescent protein) into a quasi-lentiviral message, strictly following complex lentiviral regulation, by adapting the gfp reporter gene to HIV codon bias (Graf et al., 2006).
Gene synthesis in general contributes significantly to project design independently of any optimization process. Since natural templates are not required for gene synthesis, there is a high degree of freedom for sequence design. For example, any fusion or chimeric gene construct can be freely designed. There is no restriction in designing higher order complexes with alternating coding and noncoding regions, up to the in silico design of a complete plasmid or even a genome (Gibson et al., 2008).
6.2. Sequence design
The GeneOptimizer® tool has two independent but nevertheless integrated optimization functions. A given sequence is first optimized mainly at the RNA level to improve expression characteristics as described above. In a second optimization process, the defined sequence is processed for production, with computational segmentation and refinement cycles generating optimal production parameters. Depending on the length and complexity (e.g., sequence repeats, motif stretches, GC content, etc.), the sequence can be divided into subfragments of variable length (usually between 200 and 1800 nucleotides (nt)). Each subfragment is divided into overlapping oligonucleotides following a defined pattern: the sense strand sequence is split into sequential L-oligos of 50–60 nt in length. The antisense strand is split into shorter M-oligos of approximately 40 nt in length that partially overlap the corresponding, complementary L-oligos. This process is automated in such a way that a given subfragment length is divided into a calculated number of L-oligos and corresponding M-oligos, matching a predefined oligo length interval. The complementary terminal overlap sequences are evaluated for potential mismatches (alternative pairing, self-assembly), and if a certain threshold is exceeded, the process reenters the cycle with a slightly changed starting parameter. The whole cycle can be repeated until the predefined limit is reached. In an analogous process step, additional terminal amplification primers (providing cloning sites) and sequencing primers are automatically calculated.
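The segmentation pattern described in Section 6.2 can be sketched as follows. This is a simplified illustration under assumptions that go beyond the text: L-oligos are made as equal in length as possible, each M-oligo is the reverse complement of a 40-nt stretch centered on the junction between two adjacent L-oligos, and the mismatch evaluation and retry cycle are omitted. The function names are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def split_into_l_oligos(seq, min_len=50, max_len=60):
    """Split the sense strand into sequential L-oligos of near-equal length."""
    n = math.ceil(len(seq) / max_len)  # fewest oligos that respect max_len
    base, extra = divmod(len(seq), n)
    lengths = [base + 1] * extra + [base] * (n - extra)
    if min(lengths) < min_len:
        # A real pipeline would retry with changed parameters; the sketch stops.
        raise ValueError("length cannot be split into oligos of the allowed size")
    oligos, pos = [], 0
    for length in lengths:
        oligos.append(seq[pos:pos + length])
        pos += length
    return oligos


def design_m_oligos(l_oligos, m_len=40):
    """Design antisense M-oligos that bridge each junction between L-oligos."""
    seq = "".join(l_oligos)
    m_oligos, boundary = [], 0
    for oligo in l_oligos[:-1]:
        boundary += len(oligo)
        start = boundary - m_len // 2  # center the M-oligo on the junction
        m_oligos.append(reverse_complement(seq[start:start + m_len]))
    return m_oligos
```

For a 200-nt subfragment this yields four 50-nt L-oligos and three 40-nt M-oligos, each M-oligo pairing with the last 20 nt of one L-oligo and the first 20 nt of the next.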
6.3. Construction design
If a sequence requires breakdown into subfragments, the program calculates all necessary subcloning steps and provides a cloning strategy. This process applies top-down assembly tree computation. Starting from the final specifications (e.g., a 10-kb gene cloned into a plasmid harboring kanamycin resistance), the program defines (i) the cloning steps (e.g., step 1: 10 subfragments; step 2: combining five fragments at a time; and step 3: fusion of the two resulting fragments) and (ii) the cloning strategy, including the choice of vectors for each subfragment and, more importantly, the respective antibiotic resistance (e.g., a kanamycin vector for the subfragments, an ampicillin vector for the first cloning step, and a kanamycin vector for the last cloning step). The alternating resistance markers facilitate convenient cloning by resistance switch.
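The top-down assignment of resistance markers in Section 6.3 can be sketched in Python. The function below is a hypothetical illustration, not the program's actual logic: it assumes a fixed group size per cloning step and exactly two alternating markers, and it reproduces the 10-kb example from the text.

```python
import math


def plan_assembly(n_subfragments, group_size,
                  final_marker="kanamycin", alternate_marker="ampicillin"):
    """Return (step, number_of_constructs, marker) tuples for an assembly tree.

    Construct counts are derived bottom-up; markers are assigned top-down so
    that the final construct carries final_marker and each level below it
    switches to the other marker.
    """
    if n_subfragments < 1 or group_size < 2:
        raise ValueError("need at least one subfragment and a group size >= 2")
    counts = [n_subfragments]
    while counts[-1] > 1:
        counts.append(math.ceil(counts[-1] / group_size))
    steps = []
    for level, count in enumerate(counts):
        distance_from_top = len(counts) - 1 - level
        marker = final_marker if distance_from_top % 2 == 0 else alternate_marker
        steps.append((level + 1, count, marker))
    return steps
```

Calling `plan_assembly(10, 5)` gives ten subfragments in kanamycin vectors, two intermediate constructs in ampicillin vectors, and one final construct in a kanamycin vector, so that every cloning step can be selected by a change of antibiotic.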
7. Production Processing: LIMS
7.1. Steering process
The LIMS has been developed to virtually mirror and very specifically steer the gene synthesis process from ordering to shipment. It contains all production-relevant operational tasks, rules, and information. The workflow engine provides the basis for steering and tracking the production status of any order started within the system. Fundamental tools such as bioinformatics sequence design or analysis tools are integrated, and the system can be extended with further plug-ins. The specific functions include (i) informational sequence processing; (ii) support of production logistics by generating work lists, linking the lab staff to automated pipetting stations, or barcode-aided sample tracking; (iii) an information database for accurate production monitoring, statistical process evaluation, and customer feedback; (iv) control of and data acquisition from individual automated lab instruments, such as liquid handling robots, oligo synthesizers, or analytical instruments; and (v) control of integrated and fully automated assembly modules.
7.2. Process control
One of the most dominant functions of the LIMS is the provision and monitoring of in-process controls. Each production step requires a release entry in the system. For some steps, quality control demands visual inspection of a process product (appearance of colonies on selection plates, PCR band(s) in gel electrophoresis, restriction analysis, etc.); for others, analytical measurements have to be evaluated (e.g., optical density, high-performance liquid chromatography [HPLC] results). Results are reported back to the system, and depending on whether the result is positive or negative, the next task is generated and displayed (for positive results: the next step; for negative results: repeating the step or applying an alternative route). This system is well suited to handling a large number of parallel operations. Providing task lists that contain all orders designated for a specific operation (ligation, transformation, PCR1, PCR2, etc.) guarantees that each order is automatically shifted to the next operation step.
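The release logic of Section 7.2 (a positive result advances the order, a negative result triggers a repeat or an alternative route) can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not the actual LIMS: the step names in `STEPS`, the repeat limit `MAX_REPEATS`, and the status strings are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sequence of operations and repeat limit (not from the source).
STEPS = ["ligation", "transformation", "colony_pcr", "plasmid_prep", "sequencing"]
MAX_REPEATS = 2


@dataclass
class Order:
    order_id: str
    step_index: int = 0
    attempts: int = 0
    status: str = "in_progress"
    history: list = field(default_factory=list)


def report_result(order, passed):
    """Record an in-process control result and derive the next task."""
    order.history.append((STEPS[order.step_index], passed))
    if passed:
        order.attempts = 0
        order.step_index += 1  # positive result: release to the next step
        if order.step_index == len(STEPS):
            order.status = "released"
    else:
        order.attempts += 1  # negative result: repeat the same step ...
        if order.attempts > MAX_REPEATS:
            order.status = "alternative_route"  # ... or switch strategy
    return order.status


def task_lists(orders):
    """Cluster all active orders by the operation they are waiting for."""
    lists = {step: [] for step in STEPS}
    for order in orders:
        if order.status == "in_progress":
            lists[STEPS[order.step_index]].append(order.order_id)
    return lists
```

Because every order carries its own step index, the task list for each operation can be regenerated at any time from the current state of all orders, which is what allows many orders to move through the same operations in parallel.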
7.3. Process expansion
Understanding how the LIMS operates highlights certain prerequisites that need to be addressed in order to enable the high degree of parallelization that the LIMS can theoretically achieve, the most important being standardization and automation. The capacity of the LIMS, like the capacity for carrying out manual or automated operations, is limited. Therefore, it is mandatory to restrict the number of potential operations by introducing standards. Standards can be defined and applied for protocols, cloning/assembly methods, reagents, or operating conditions. These are defined in SOPs (standard operating procedures) and can be implemented in parallel within the LIMS. By defining operational standards, the production process can be dissected into defined, manageable, self-contained, and self-controlled operation steps. Dividing the production process into small and defined operation steps additionally provides optimal conditions for automation. Automated process solutions not only require precisely described standard protocols but are also needed to manage the handling of a large number of parallel operations. Thus, the automation of manual tasks embedded in an LIMS environment is a logical step toward, and at the same time a necessary prerequisite for, high-throughput gene synthesis. Automation in gene synthesis is employed from oligonucleotide synthesis to fragment production, gene assembly, and sequencing, as well as in operational tasks such as sequence analysis and optimization and the evaluation of sequencing results. The implementation of process automation modules allows for specific and directed targeting of automatable operations and fast progress. The process chain from customer request to delivery comprises two interdependent strands: the information flow mapped inside the LIMS and the material flow controlled by the LIMS.
7.4. Order entry
The process starts with entering customer and project information into the customer portal. These data are directed to different registries: for example, the customer data into a customer relationship management (CRM) system, the project cost and sales figures into an enterprise resource planning (ERP) system, and the project data into a production monitoring system. The production monitoring system contains all relevant data on sequence, optimization, source organism, target organism, biosafety, biosecurity, required documents, etc. and feeds the production-relevant information into the LIMS, while the sequence is loaded into the GeneOptimizer® and amended according to the project specifications.
7.5. Order processing
The GeneOptimizer® defines the final sequence, the cloning strategy, and the fragment and oligo breakdown as described above and supplements the information already contained within the LIMS. The LIMS dissects the provided information into process tasks, clusters the tasks of all contained projects, and creates task lists. The majority of the process steps are automated (e.g., preparation of the sequencing reaction) or semiautomated (e.g., evaluation of sequence results), and all of the processes are managed within the LIMS. In order to match the information flow with the material flow, the LIMS contains additional modules (e.g., material or plasmid registries), and the physical containers are labeled with barcodes, which enable accurate assignment of any sample to the correct order, its current status, and the next task.
7.6. Oligonucleotide production
The first task list covers ordering the oligonucleotides defined by the GeneOptimizer®. The oligo sequences are transferred to a task list and allocated to an oligo synthesizer. Oligonucleotide synthesis at Life Technologies/GeneArt is based on a technology platform called Cerberus. This platform was developed to provide a synthesis format customized for large-scale gene synthesis specifications, which are mainly (i) parallel synthesis (operates in a 4 × 96-well format), (ii) low consumables consumption (production in 96-well format), and (iii) high-quality output (error rate <0.1%). In addition to the actual oligo synthesizer, this platform comprises devices associated with the preparation of the synthesis plates, oligo deprotection and cleavage, central supply and waste management, exhaust air cleaning, and technical monitoring and alarm systems. A standard synthesis run can be completed within 10 h, and the daily production capacity of 10,000 oligos is sufficient to produce approximately 5 Mbp of dsDNA per month. Automated photometric evaluation of the oligo concentration is implemented, and random-sampling HPLC is used for quality control of each synthesis run. The next task within the process chain concerns the first fragment assembly step: all oligonucleotides belonging to a specific fragment (L- and M-oligos) are mixed using liquid handling robots, and the concentrations are adjusted to a predefined range. The respective plates containing the oligo mixes are transferred to a robot-based production module that performs the initial gene synthesis steps.
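The relation between the stated oligo capacity and the monthly dsDNA output in Section 7.6 can be checked with a back-of-the-envelope calculation. The source does not give its own calculation, so the sketch below rests entirely on assumptions: about 20 production days per month, an average L-oligo length of 55 nt, and roughly two oligos (one L-oligo and one M-oligo) per L-oligo length of double-stranded product, with no allowance for primers, failed syntheses, or repeats.

```python
def monthly_dsdna_capacity_bp(oligos_per_day, production_days,
                              l_oligo_nt, oligos_per_l_oligo=2.0):
    """Rough estimate of dsDNA output (bp per month) from oligo throughput.

    Assumes every stretch of l_oligo_nt base pairs consumes
    oligos_per_l_oligo oligonucleotides (one L-oligo plus one M-oligo).
    """
    bp_per_oligo = l_oligo_nt / oligos_per_l_oligo
    return oligos_per_day * production_days * bp_per_oligo


# 10,000 oligos per day, 20 assumed production days, 55-nt L-oligos.
estimate = monthly_dsdna_capacity_bp(10_000, 20, 55)
```

Under these assumptions the estimate is 5.5 Mbp per month, which is of the same order as the approximately 5 Mbp quoted in the text; different assumptions about working days or oligo overhead would shift the number accordingly.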
7.7. Subfragment production
The first post-chemistry step in gene synthesis is assembling the oligonucleotides to yield longer contiguous sequences. The maximal final length of these constructs must be considered carefully to limit the likelihood of errors in the product as well as the number of transformants to screen. Currently, the most cost-effective size of these synthetic building blocks is between 1 and 2 kb. The assembly process is basically a multiplex primer extension reaction, taking place under controlled temperature cycling conditions. In the first cycling round, overlapping primers anneal to each other and are filled in by polymerase to form short double strands. These can again anneal to each other in the subsequent cycle and are extended to fragments bridging four oligonucleotides. This progression continues until fragments arise that span the complete length of the intended product. Once this is achieved, terminal primers, present in excess, take effect and amplify the full-length product exponentially. In the next step, the linear DNA molecule is ligated into a minimal cloning vector using classical restriction endonuclease techniques. After transformation into E. coli and bacterial cultivation, some colonies are selected for evaluation by colony PCR sequencing. Again, the results are reported back to the LIMS, positive clones are further analyzed by DNA plasmid preparation, and the accuracy of the synthesized DNA construct is verified by sequencing (in-process QC).
7.8. Assembly
Altogether, conditions for mass production are chosen to give a more than 95% chance of picking at least one correct fragment in a single screen. This sets a limit on the total size of the initial product, since the probability that a molecule is free of mutations falls exponentially with its length, which increases the screening effort needed to find correct fragments. Therefore, synthetic gene constructs exceeding 1–2 kb are assembled from sequence-verified first building blocks. Since assembling DNA elements is the nuts and bolts of biotechnology, several techniques exist to do so efficiently, although not all of them are equally suited to sequence-independent gene synthesis. The straightforward approach for DNA fragment linkage is classical manipulation with restriction enzymes and ligase (Maniatis et al., 1982). This method, however, is quite inflexible in terms of junction sequence design and involves well-known problems regarding availability and uniqueness of appropriate restriction sites. Type IIS restriction enzymes can eliminate scar sequences at the boundaries. These enzymes produce sticky ends outside their recognition sequence, so that the nucleotides of the adjacent cohesive stretch can be chosen freely and represent a common part of the intended compound product for ligation (Padgett and Sorge, 1996).
Designing this common part to have a length of approximately 20 bp allows flexible and specific attachment of two or more DNA fragments by fusion PCR, but this approach is limited to moderate overall size and introduces an additional source of sequence errors (Mullinax et al., 1992). The DISEC-TRISEC and LIC-PCR methods employ the exonuclease activity of Klenow or T4 DNA polymerase to generate compatible single-stranded overhangs, which are then combined with or without ligase, respectively (Aslanidis and de Jong, 1990; Dietmaier et al., 1993). In vitro recombination extends this technology by annealing the overhangs under more stringent conditions at elevated temperatures and then filling and closing gaps with a heat-stable polymerase and ligase.
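The 95% screening criterion stated at the start of Section 7.8 can be made concrete with a simple probability model. The sketch assumes independent errors at a constant per-base rate, which is a simplification, and the error rate of 1 in 1000 used in the usage note is a hypothetical value chosen for illustration, not a figure for the final product taken from the source.

```python
import math


def p_correct(length_bp, error_rate_per_base):
    """Probability that a fragment of the given length contains no error."""
    return (1.0 - error_rate_per_base) ** length_bp


def clones_needed(length_bp, error_rate_per_base, confidence=0.95):
    """Smallest number of clones n with P(at least one correct) >= confidence.

    Uses 1 - (1 - p)^n >= confidence, i.e. n >= ln(1 - confidence) / ln(1 - p).
    """
    p = p_correct(length_bp, error_rate_per_base)
    if p <= 0.0:
        raise ValueError("probability of a correct fragment is effectively zero")
    if p >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

With the assumed error rate of 0.001 per base, a 1-kb fragment requires 7 clones and a 2-kb fragment 21 clones to reach 95% confidence, which illustrates why the initial building blocks are kept short and longer constructs are assembled from sequence-verified fragments.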
8. Case Study: Large-Scale Gene Production
Directed evolution strategies aim to improve protein or enzyme functions toward novel nonnatural properties. Since the potential sequence space of molecular variants is so vast, it is a common strategy to limit variation to those positions of a protein that are known to be related to function. In many cases, however, it turns out that substitutions of unexpected residues are responsible for advancing the molecule. A straightforward approach to obtain a complete data matrix of all beneficial, adverse, and neutral single amino acid substitutions is to actually generate all these mutants and test them. For a regular 300 amino acid protein, this involves screening $300 \times 19 = 5700$ variants, which is manageable even with low-throughput screening assays. The functional analysis of these mutants generates a data matrix containing information about the importance of each protein position regarding overall function, as well as which non-wild-type amino acid contributes to adapting the protein toward the technical demands (Geddie and Matsumura, 2004; Tan et al., 2008). While the real challenge lies in establishing a good screening system, the actual production of the necessary DNA constructs is tedious and is an excellent example of where high-throughput gene synthesis can be of great support. The described automated gene synthesis workflow together with the associated LIMS-guided data management (see Fig. 11.4) allows for the systematic replacement of single oligonucleotides during the assembly of synthetic fragments. In practice, for a given amino acid position, 19 oligonucleotides are synthesized, each containing a non-wild-type codon. Instead of one oligonucleotide mix, as needed for the construction of one gene, 19 parallel reactions are set up, with only one particular oligonucleotide differing between them. This is an ideal prerequisite for automation and parallel processing in the production of large quantities of similar DNA constructs and makes directed evolution projects accessible and feasible.
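The size of the complete single-substitution matrix described in Section 8 follows directly from enumerating the variants. The short Python sketch below is illustrative only; the function name and the variant naming scheme (wild-type residue, position, new residue) are assumptions made here, not part of the described workflow.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_substitution_variants(protein):
    """List every single amino acid substitution of a protein sequence.

    Returns (name, sequence) tuples, e.g. ("M1A", "A...") for replacing the
    methionine at position 1 with alanine.
    """
    if any(aa not in AMINO_ACIDS for aa in protein):
        raise ValueError("protein contains a non-standard residue")
    variants = []
    for pos, wild_type in enumerate(protein):
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                name = f"{wild_type}{pos + 1}{aa}"
                variants.append((name, protein[:pos] + aa + protein[pos + 1:]))
    return variants
```

For any protein of 300 residues this yields 300 × 19 = 5700 variants, 19 per position, matching the number of parallel oligonucleotide mixes that would have to be set up per position in the workflow described above.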
Figure 11.4 Schematic overview of gene synthesis production flow and interconnections to controlling LIMS. (Diagram labels: order entry; CRM; biosecurity; status; design and optimization; oligo sequences; assembly rules; cloning strategy; oligonucleotide synthesis; oligonucleotide assembly and amplification; cloning; identification of correct clone; clone ID; sequence data; plasmid preparation; subfragment assembly; final QC; export; export data; statistics; controlling; LIMS.)
9. Conclusion
The complete process of gene synthesis, from sequence submission to shipping the final plasmid, involves many different disciplines. Sales, bioinformatics, organic chemistry, molecular biology, export, and logistics must all work hand in hand to shift the entire workflow from a small-scale to an industrial high-throughput operation. The LIMS is essential to track every intermediate in the multistep production when dealing with hundreds and thousands of syntheses in parallel. Equally, an increasing degree of automation is mandatory to prevent exponential growth in production volume from necessitating an equivalent increase in manpower. Pipetting robots communicate flawlessly with an LIMS network, and vice versa. Some steps are also simply no longer manageable by humans, such as the move from 96- to 384-well plates, or the decrease of reaction volumes below 1 μl. It is the interplay between LIMS, automation, and miniaturization that creates the prerequisites necessary for a smooth and robust production platform enabling cheap and fast production of synthetic genes. The differences between lab-scale and industry-scale gene synthesis are thus not based upon novel technologies or innovative synthesizers, as one might expect. The state-of-the-art technology has proven to be sufficient to satisfy the current gene synthesis demand. This does not imply that we can abstain from novel developments and even technological leaps forward in order to satisfy future demands in scale and cost. However, today’s technology, if employed in a reasonable and dedicated way, can provide the output demanded by the scientific community. The differences, therefore, originate rather from a very straightforward adaptation of each single operational step to the process specifications, which are mainly (i) low consumables consumption, (ii) a high degree of parallel sample processing, and (iii) low error rates in connection with reliability and reproducibility.
Implementation of these specifications resulted in concrete measures related to method adaptation, standardization, in-process QC, information handling and flow, automation and machine development, and quality management compliance, in combination enabling industrial scale gene synthesis.
REFERENCES
Agarwal, K. L., Buchi, H., Caruthers, M. H., Gupta, N., Khorana, H. G., Kleppe, K., Kumar, A., Ohtsuka, E., Rajbhandary, U. L., Van de Sande, J. H., Sgaramella, V., Weber, H., et al. (1970). Total synthesis of the gene for an alanine transfer ribonucleic acid from yeast. Nature 227, 27–34.
Aslanidis, C., and de Jong, P. J. (1990). Ligation-independent cloning of PCR products (LIC-PCR). Nucleic Acids Res. 18(20), 6069–6074.
Barany, F., and Gelfand, D. H. (1991). Cloning, overexpression and nucleotide sequence of a thermostable DNA ligase-encoding gene. Gene 109, 1–11.
Bershtein, S., and Tawfik, D. S. (2008). Advances in laboratory evolution of enzymes. Curr. Opin. Chem. Biol. 12, 151–158.
Cai, Y., Wilson, M. L., and Peccoud, J. (2010). GenoCAD for iGEM: A grammatical approach to the design of standard-compliant constructs. Nucleic Acids Res. 38(8), 2637–2644.
Canton, B., Labno, A., and Endy, D. (2008). Refinement and standardization of synthetic biological parts and devices. Nat. Biotechnol. 26, 787–793.
Cello, J., Paul, A. V., and Wimmer, E. (2002). Chemical synthesis of poliovirus cDNA: Generation of infectious virus in the absence of natural template. Science 297(5583), 1016–1018.
Cheong, W. C., Lim, L. S., Huang, M. C., Bode, M., and Li, M. H. (2010). New insights into the de novo gene synthesis using the automatic kinetics switch approach. Anal. Biochem. 406(1), 51–60.
Cooling, M. T., Rouilly, V., Misirli, G., Lawson, J., Yu, T., Hallinan, J., and Wipat, A. (2010). Standard virtual biological parts: A repository of modular modeling components for synthetic biology. Bioinformatics 26(7), 925–931.
Czar, M. J., Anderson, J. C., Bader, J. S., and Peccoud, J. (2009). Gene synthesis demystified. Trends Biotechnol. 27(2), 63–72.
De Lorenzo, V., and Danchin, A. (2008). Synthetic biology: Discovering new worlds and new words. EMBO Rep. 9(9), 822–827.
Dietmaier, W., Fabry, S., and Schmitt, R. (1993). DISEC-TRISEC: Di- and trinucleotide-sticky-end cloning of PCR-amplified DNA. Nucleic Acids Res. 21(15), 3603–3604.
Edge, M. D., Green, A. R., Heathcliffe, G. R., Meacock, P. A., Schuch, W., Scanlon, D. B., Atkinson, T. C., Newton, C. R., and Markham, A. F. (1981). Total synthesis of a human leukocyte interferon gene. Nature 292, 756–762.
Emilsson, V., Naslund, A. K., and Kurland, C. G. (1993). Growth-rate-dependent accumulation of twelve tRNA species in Escherichia coli. J. Mol. Biol. 230, 483–491.
Endy, D. (2005). Foundations for engineering biology. Nature 438, 449–453.
Fath, S., Bauer, A. P., Liss, M., Spriestersbach, A., Maertens, B., Hahn, P., Ludwig, C., Schäfer, F., Graf, M., and Wagner, R. (2011). Multiparameter RNA and codon optimization: A standardized tool to assess and enhance autologous mammalian gene expression. PLoS ONE 6(3), e17596, 1–14.
Geddie, M. L., and Matsumura, I. (2004). Rapid evolution of beta-glucuronidase specificity by saturation mutagenesis of an active site loop. J. Biol. Chem. 279(25), 26462–26468.
Gibson, D. G., Benders, G. A., Andrews-Pfannkoch, C., Denisova, E. A., Baden-Tillson, H., Zaveri, J., Stockwell, T. B., Brownley, A., Thomas, D. W., Algire, M. A., Merryman, C., Young, L., et al. (2008). Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 319, 1215–1220.
Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., Benders, G. A., Montague, M. G., Ma, L., Moodie, M. M., Merryman, C., Vashee, S., et al. (2010). Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329(5987), 52–56.
Gordeeva, T. L., Borschevskaya, L. N., and Sineoky, S. P. (2010). Improved PCR-based gene synthesis method and its application to the Citrobacter freundii phytase gene codon modification. J. Microbiol. Methods 81(2), 147–152.
Graf, M., Ludwig, C., Kehlenbeck, S., Jungert, K., and Wagner, R. (2006). A quasilentiviral green fluorescent protein reporter exhibits nuclear export features of late human immunodeficiency virus type 1 transcripts. Virology 352, 295–305.
Graf, M., Schoedl, T., and Wagner, R. (2009). Rationales of gene design and de novo gene construction. In “Systems Biology and Synthetic Biology,” (P. Fu and S. Panke, eds.), John Wiley & Sons, Inc., Hoboken, NJ. 10.1002/9780470437988.ch12.
Grünberg, R., and Serrano, L. (2010). Strategies for protein synthetic biology. Nucleic Acids Res. 38(8), 2663–2675.
Gustafsson, C., Govindarajan, S., and Minshull, J. (2004). Codon bias and heterologous protein expression. Trends Biotechnol. 22, 346–353.
Heinemann, M., and Panke, S. (2006). Synthetic biology-putting engineering into biology. Bioinformatics 22(22), 2790–2799.
Heyman, A., Barak, Y., Caspi, J., Wilson, D. B., Altman, A., Bayer, E. A., and Shoseyov, O. (2007). Multiple display of catalytic modules on a protein scaffold: Nano-fabrication of enzyme particles. J. Biotechnol. 131, 433–439.
Huang, M. C., Ye, H., Kuan, Y. K., Li, M. H., and Ying, J. Y. (2009). Integrated two-step gene synthesis in a microfluidic device. Lab Chip 9(2), 276–285.
Ikemura, T. (1985). Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2, 13–34.
Itakura, K., Hirose, T., Crea, R., Riggs, A. D., Heyneker, H. L., Bolivar, F., and Boyer, H. W. (1977). Expression in Escherichia coli of a chemically synthesized gene for the hormone somatostatin. Science 198, 1056–1063.
Kelly, J. R., Rubin, A. J., Davis, J. H., Ajo-Franklin, C. M., Cumbers, J., Czar, M. J., de Mora, K., Glieberman, A. L., Monie, D. D., and Endy, D. (2009). Measuring the activity of BioBrick promoters using an in vivo reference standard. J. Biol. Eng. 20, 3–4.
Kodumal, S. J., Patel, K. G., Reid, R., Menzella, H. G., Welch, M., and Santi, D. V. (2004). Total synthesis of long DNA sequences: Synthesis of a contiguous 32-kb polyketide synthase gene cluster. Proc. Natl. Acad. Sci. USA 101, 15573–15578.
Kosovac, D., Wild, J., Ludwig, C., Meissner, S., Bauer, A. P., and Wagner, R. (2010). Minimal doses of a sequence-optimized transgene mediate high-level and long-term EPO expression in vivo: Challenging CpG-free gene design. Gene Ther. 18(2), 189–198. 10.1038/gt.2010.134.
Koster, H., Blocker, H., Frank, R., Geussenhainer, S., and Kaiser, W. (1975). Total synthesis of a structural gene for the human peptide hormone angiotensin II. Hoppe Seylers Z. Physiol. Chem. 356, 1585–1593.
Maertens, B., Spriestersbach, A., von Groll, U., Roth, U., Kubicek, J., Gerrits, M., Graf, M., Liss, M., Daubert, D., Wagner, R., and Schafer, F. (2010). Gene optimization mechanisms: A multi-gene study reveals a high success rate of full-length human proteins expressed in Escherichia coli. Protein Sci. 19(7), 1312–1326.
Mandecki, W., Hayden, M. A., Shallcross, M. A., and Stotland, E. (1990). A totally synthetic plasmid for general cloning, gene expression and mutagenesis in Escherichia coli. Gene 94, 103–107.
Maniatis, T., Fritsch, E. F., and Sambrook, J. (1982). Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
Marchisio, M. A., and Stelling, J. (2009). Computational design tools for synthetic biology. Curr. Opin. Biotechnol. 20(4), 479–485.
Menzella, H. G., Reisinger, S. J., Welch, M., Kealey, J. T., Kennedy, J., Reid, R., Tran, C. Q., and Santi, D. V. (2006). Redesign, synthesis and functional expression of the 6-deoxyerythronolide B polyketide synthase gene cluster. J. Ind. Microbiol. Biotechnol. 33, 22–28.
Mullinax, R. L., Gross, E. A., Hay, B. N., Amberg, J. R., Kubitz, M. M., and Sorge, J. A. (1992). Expression of a heterodimeric Fab antibody protein in one cloning step. Biotechniques 12(6), 864–869.
Padgett, K. A., and Sorge, J. A. (1996). Creating seamless junctions independent of restriction sites in PCR cloning. Gene 168(1), 31–35.
Parmeggiani, F., Pellarin, R., Larsen, A. P., Varadamsetty, G., Stumpp, M. T., Zerbe, O., Caflisch, A., and Plückthun, A. (2008). Designed armadillo repeat proteins as general peptide-binding scaffolds: Consensus design and computational optimization of the hydrophobic core. J. Mol. Biol. 376, 1282–1304.
Raab, D., Graf, M., Notka, F., Schoedl, T., and Wagner, R. (2010). The GeneOptimizer Algorithm: Using a sliding window approach to cope with the vast sequence space in multiparameter DNA sequence optimization. Syst. Synth. Biol. 4(3), 215–225.
Saiki, R. K., Scharf, S., Faloona, F., Mullis, K. B., Horn, G. T., Erlich, H. A., and Arnheim, N. (1985). Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science 230(4732), 1350–1354.
Samuel, G. N., Selgelid, M. J., and Kerridge, I. (2009). Managing the unimaginable. Regulatory responses to the challenges posed by synthetic biology and synthetic genomics. EMBO Rep. 10(1), 7–11.
Shao, Z., Zhao, H., and Zhao, H. (2009). DNA assembler, an in vivo genetic method for rapid construction of biochemical pathways. Nucleic Acids Res. 37(2), e16.
Szybalski, W., and Skalka, A. (1978). Nobel prizes and restriction enzymes. Gene 4, 181–182.
Tan, L., Wiesler, S., Trzaska, D., Carney, H. C., and Weinzierl, R. O. (2008). Bridge helix and trigger loop perturbations generate superactive RNA polymerases. J. Biol. 7(10), 40.
TerMaat, J. R., Pienaar, E., Whitney, S. E., Mamedov, T. G., and Subramanian, A. (2009). Gene synthesis by integrated polymerase chain assembly and PCR amplification using a high-speed thermocycler. J. Microbiol. Methods 79(3), 295–300.
Tian, J., Ma, K., and Saaem, I. (2009). Advancing high-throughput gene synthesis technology. Mol. Biosyst. 5(7), 714–722.
Tumpey, T. M., Basler, C. F., Aguilar, P. V., Zeng, H., Solórzano, A., Swayne, D. E., Cox, N. J., Katz, J. M., Taubenberger, J. K., Palese, P., and García-Sastre, A. (2005). Characterization of the reconstructed 1918 Spanish influenza pandemic virus. Science 310(5745), 77–80.
Van den Brulle, J., Fischer, M., Langmann, T., Horn, G., Waldmann, T., Arnold, S., Fuhrmann, M., Schatz, O., O’Connell, T., O’Connell, D., Auckenthaler, A., and Schwer, H. (2008). A novel solid phase technology for high-throughput gene synthesis. Biotechniques 45(3), 340–343.
Van der Sloot, A. M., Kiel, C., Serrano, L., and Stricher, F. (2009). Protein design in biological networks: from manipulating the input to modifying the output. Protein Eng. Des. Sel. 22, 537–542.
Wagner, R., Graf, M., Bieler, K., Wolf, H., Grunwald, T., Foley, P., and Uberla, K. (2000). Rev-independent expression of synthetic gag-pol genes of human immunodeficiency virus type 1 and simian immunodeficiency virus: implications for the safety of lentiviral vectors. Hum. Gene Ther. 11(17), 2403–2413.
Xiong, A. S., Peng, R. H., Zhuang, J., Liu, J. G., Gao, F., Chen, J. M., Cheng, Z. M., and Yao, Q. H. (2008). Non-polymerase-cycling-assembly-based chemical gene synthesis: Strategies, methods, and progress. Biotechnol. Adv. 26, 121–134.
Young, L., and Dong, Q. (2004). Two-step total gene synthesis method. Nucleic Acids Res. 32, e59.
CHAPTER TWELVE

Gene Synthesis: Methods and Applications

Randall A. Hughes,*,† Aleksandr E. Miklos,*,† and Andrew D. Ellington*,†

* Applied Research Laboratories, The University of Texas at Austin, Austin, Texas, USA
† Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, USA

Methods in Enzymology, Volume 498, ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00012-7. © 2011 Elsevier Inc. All rights reserved.

Contents
1. Introduction
2. Oligonucleotide Synthesis
2.1. Solid phase phosphoramidite synthesis
2.2. Microchip-based oligonucleotide synthesis
3. Gene Assembly Methodologies
3.1. Ligation-mediated assembly
3.2. PCR-mediated assembly
4. Gene Design Software
5. Synthesis Fidelity/Error Correction Methods and Considerations
5.1. Oligonucleotide purification
5.2. Reading frame selection
5.3. Mismatch binding and cleavage
5.4. Correcting errors in synthetic genes by site-directed mutagenesis
6. Applications of Gene Synthesis
6.1. Codon optimization
6.2. Synthetic biology
7. Example of High-Throughput Gene Synthesis Using Protein Fabrication Automation
8. Conclusions
References

Abstract

DNA synthesis techniques and technologies are quickly becoming a cornerstone of modern molecular biology and play a pivotal role in the field of synthetic biology. The ability to synthesize whole genes, novel genetic pathways, and even entire genomes is no longer the dream it was 30 years ago. Using little more than a thermocycler, commercially synthesized oligonucleotides, and DNA polymerases, a standard molecular biology laboratory can synthesize several kilobase pairs of synthetic DNA in a week using existing techniques. Herein, we review the techniques used in the generation of synthetic DNA, from the chemical synthesis of oligonucleotides to their assembly into long, custom sequences. Software and websites to facilitate the execution of these approaches are explored, and applications of DNA synthesis techniques to gene expression and synthetic biology are discussed. Finally, an example of automated gene synthesis from our own laboratory is provided.
1. Introduction

The ability to design and synthesize custom DNA sequences is at the heart of many current transformative research efforts in biotechnology and in the emerging field of synthetic biology. Interlocking technological improvements and research efforts have driven the development of cost-effective, high-throughput, and high-fidelity methods for the synthesis of de novo DNA sequences of ever-increasing length and complexity. While beginning-to-end chemical synthesis of very long (>500 nt) DNA sequences remains elusive, techniques that combine chemical oligonucleotide synthesis with enzymatic assembly of oligonucleotides into longer DNA duplexes have undergone extensive optimization. Using existing methods, a standard molecular biology laboratory can routinely synthesize DNA sequences up to 2 kbp within a week using just a thermocycler, DNA polymerases, and commercially synthesized oligonucleotides. This is a far cry from the 5 years it took to synthesize the first synthetic gene (a 77 nt tRNA gene) some 40 years ago (Agarwal et al., 1970). In the past decade alone, researchers have gone beyond the mere synthesis of genes to the synthesis of multiple gene operons and even the synthesis of the genomes for entire organisms (Cello et al., 2002; Gibson et al., 2008a, 2010a; Kodumal et al., 2004; Smith et al., 2003). Herein, we review the relevant methodologies for gene synthesis from synthetic oligonucleotides and give an example of high-throughput gene synthesis from our own laboratory.
2. Oligonucleotide Synthesis

2.1. Solid phase phosphoramidite synthesis

All gene synthesis technologies rely on the chemical synthesis of oligonucleotides to supply the building blocks for enzymatic assembly. The most commonly used method for oligonucleotide synthesis is the cyclical four-step phosphoramidite synthesis method developed in the 1980s (Caruthers et al., 1983, 1987; Fig. 12.1).

Figure 12.1 Oligonucleotide synthesis from phosphoramidites. The initial nucleoside is tethered to controlled pore glass via its 3′ hydroxyl. The synthetic cycle involves deprotection, coupling, oxidation, and capping. Synthesis proceeds in a 3′–5′ direction.

DNA oligonucleotides are synthesized in a 3′–5′ manner by coupling acid-activated deoxynucleoside phosphoramidites to an initial deoxynucleoside attached to a solid support (usually controlled pore glass (CPG) or polystyrene (PS) beads) through its 3′ hydroxyl group. In most commercial DNA synthesizers, the solid support matrix is packed into a flow-through column and the reagents necessary for the phosphoramidite synthesis cycle flow past the solid support matrix. The addition of each nucleotide monomer to the growing oligonucleotide chain is carried out in four steps (Fig. 12.1): (1) deprotection: a weak acid is used to remove the dimethoxytrityl (DMT) ether group from the 5′-end of the growing oligonucleotide chain, leaving a reactive 5′-hydroxyl (-OH) for the next coupling step; (2) coupling: the reactive 5′-OH reacts with a tetrazole-activated monomer by simultaneous addition of the monomer and the activator solutions; (3) capping: any uncoupled 5′-OH groups are blocked by acylation to minimize the formation of deletion products; and finally (4) oxidation: the relatively unstable phosphite triester internucleotide linkages are oxidized into more stable phosphotriester linkages. This cyclic process is repeated until the oligonucleotide is complete. The oligonucleotide is then cleaved from the solid support, and the remaining protecting
groups are removed by treatment with a strong base such as ammonium hydroxide. Synthesis of oligonucleotides with lengths <100 nt is commonplace using solid phase phosphoramidite chemistry, with coupling efficiencies that can approach 99% (Rayner et al., 1998). However, during acid deprotection, depurination side reactions limit the quality and yields of oligonucleotides above 100 nt (Hall et al., 2009). A recent modification of the traditional reaction conditions using a novel detritylation process has been reported to remedy this problem and can lead to the improved synthesis of oligonucleotides up to 150 nt in length (LeProust et al., 2010).

The robustness of solid phase phosphoramidite synthesis makes it easily amenable to automation, and this method is now used in almost all commercially available DNA synthesizers (Caruthers, 1985; Caruthers et al., 1983, 1987). Different machines can generate oligonucleotides in a range of synthesis scales, although gene synthesis requires only relatively small amounts (picomoles). Automated synthesizers have a parallel throughput of up to 1536 oligonucleotides (Cheng et al., 2002; Horvath et al., 1987; Lashkari et al., 1995; Rayner et al., 1998; Sindelar and Jaklevic, 1995). While the amount of time necessary to synthesize a given oligonucleotide will vary between automated synthesizer platforms, the synthesis of up to 1536 20-mer oligonucleotides can now be carried out in a matter of hours (Cheng et al., 2002; Horvath et al., 1987; Lashkari et al., 1995; Rayner et al., 1998; Sindelar and Jaklevic, 1995).

Further improvements in throughput and concomitant reductions in cost may be achieved by miniaturization. Quake and coworkers have developed microfluidic reaction devices in which the distribution of minute amounts of synthesis reagents into gated microfluidic reaction vessels leads to the synthesis of oligonucleotides at a 100-pmol scale. This scale is compatible with traditional microliter-scale gene synthesis reactions while reducing reagent consumption 100-fold relative to traditional techniques (Lee et al., 2010).

An alternative two-step version of the phosphoramidite synthesis cycle has been published that could also potentially reduce reagent consumption (Sierzchala et al., 2003). In this method, the DMT protecting group on the 5′-OH of each phosphoramidite is replaced with a carbonate group. A peroxy anion then serves as a nucleophile to simultaneously remove the 5′-carbonate protecting group and oxidize the phosphite triester internucleotide linkage to the phosphotriester linkage. This procedure should also reduce the mutations observed in synthetic DNA that arise from depurination upon acid deprotection (Septak, 1996).
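The link between per-step coupling efficiency and the practical length limit of an oligonucleotide can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that every coupling succeeds independently with the same efficiency and that the first nucleoside is already on the support (so an n-mer needs n − 1 couplings). It ignores depurination and other side reactions, and the function name `full_length_fraction` is hypothetical.

```python
def full_length_fraction(length_nt: int, coupling_efficiency: float) -> float:
    """Estimate the fraction of chains that reach full length.

    Simplifying assumptions (not from the chapter): each coupling is
    independent, all couplings share one efficiency, and the first
    nucleoside is pre-attached to the solid support.
    """
    if length_nt < 1:
        raise ValueError("length_nt must be at least 1")
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("coupling_efficiency must be between 0 and 1")
    # An n-mer requires n - 1 successful coupling steps.
    return coupling_efficiency ** (length_nt - 1)


if __name__ == "__main__":
    for n in (20, 60, 100, 150):
        frac = full_length_fraction(n, 0.99)
        print(f"{n:>3}-mer at 99% coupling: {frac:.1%} full length")
```

Under these assumptions, a 100-mer made at 99% coupling efficiency yields roughly 37% full-length product and a 150-mer roughly 22%, which is one way to see why yields fall off for longer oligonucleotides even before side reactions are considered.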
2.2. Microchip-based oligonucleotide synthesis

Emerging DNA synthesis technologies may eventually reduce the waste, cost, and errors associated with current automated solid phase oligonucleotide synthesizers, while further increasing throughput. In particular, DNA synthesis on microarrays or microfluidic devices is becoming increasingly commonplace (Barone et al., 2001; Fodor et al., 1991; Hughes et al., 2001; Zhou et al., 2004). Many of the microarray synthesis methods rely on light-directed synthesis. The extraordinary spatial control available through illumination allows the parallel synthesis of many thousands of individual strands on a single microchip (Barone et al., 2001; Fodor et al., 1991). In one implementation, photolithography masks combined with photolabile nucleotide monomers direct the light-mediated reactions at particular positions on a microchip. Areas not covered by the mask are exposed to light, resulting in the removal of the photolabile (or acid labile) protecting group on the 5′-OH, which in turn activates that position for participation in the next round of nucleotide coupling. As with other solid phase synthetic methods, reagents for chain extension flow past the surface of the chip and react with the previously deprotected positions. Excess reagents are washed away, and the process of masking, photoremoval of protecting groups, and coupling is repeated until the oligonucleotide sequence is completed.

The primary drawback of the photolithography-based methods is the need for large numbers of unique prefabricated photomasks for each synthesis step. These methods are therefore primarily practical for the large-scale, large-volume synthesis of DNA chips. A relatively low-cost alternative to these methods is to direct light to particular positions via digital photolithography with a micromirror device such as the NimbleGen device available from Roche. Digital photolithography methods are amenable to different photochemistries, including cleaving photolabile protecting groups such as (R,S)-1-(3,4-(methylenedioxy)-6-nitrophenyl)ethyl chloroformate (MeNPOC) or 2-(2-nitrophenyl)propoxy-carbonyl (NPPOC) on the phosphoramidite monomers or linkers (Gao et al., 2001; Richmond et al., 2004; Singh-Gasson et al., 1999; Tian et al., 2004; Zhou et al., 2004). Alternatively, photo-generated acids can be produced by these methods, which allows deprotection using standard phosphoramidites (Gao et al., 2001).

While the multiplex synthesis capabilities of chip-based devices (up to 10⁵ different sequences per chip) are almost unparalleled, the total synthetic yield of any given oligonucleotide (10⁷–10⁸ molecules) is too low to use these oligonucleotides directly in many conventional molecular biology reactions. However, George Church and colleagues may have found a possible solution to this problem by using postsynthesis PCR amplification to increase the quantities of synthesized oligonucleotides to the concentrations required for standard DNA assembly methods (Tian et al., 2004). While both of these technologies (along with several others; reviewed in Tian et al., 2009) show substantial promise in eventually helping to reduce the cost of oligonucleotide synthesis and thus the cost of gene synthesis itself, further refinement and commercialization will likely be necessary
before these technologies see enough market penetration to displace traditional automated solid phase synthesizers.
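To see why chip-synthesized oligonucleotides usually need amplification before assembly, it helps to convert molecule counts into molar amounts. The sketch below does that conversion and estimates how many ideal PCR doublings would be needed to reach a target amount. The 1 pmol target is a hypothetical figure chosen for illustration, perfect doubling in every cycle is an idealization, and the function names are invented for this example.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_to_femtomoles(n_molecules: float) -> float:
    """Convert a molecule count to femtomoles."""
    return n_molecules / AVOGADRO * 1e15


def doublings_needed(start_molecules: float, target_pmol: float) -> int:
    """Ideal PCR cycles needed to reach target_pmol from start_molecules.

    Assumes perfect doubling in every cycle, which real reactions
    do not achieve, so this is a lower bound.
    """
    target_molecules = target_pmol * 1e-12 * AVOGADRO
    if start_molecules >= target_molecules:
        return 0
    return math.ceil(math.log2(target_molecules / start_molecules))


if __name__ == "__main__":
    for n in (1e7, 1e8):
        print(f"{n:.0e} molecules = {molecules_to_femtomoles(n):.3f} fmol; "
              f"{doublings_needed(n, 1.0)} ideal doublings to reach 1 pmol")
```

By this arithmetic, 10⁷ molecules is about 0.017 fmol and 10⁸ molecules about 0.17 fmol, several orders of magnitude below the picomole quantities used in conventional assembly reactions.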
3. Gene Assembly Methodologies

Chemical synthesis is typically used to synthesize oligonucleotides of up to 120–150 nt in length. While the synthesis of oligomers of lengths up to 600 nt has been reported (Ciccarelli et al., 1991), the yields are exceedingly poor and the synthesis error rate increases as a function of oligonucleotide length (because of the multiple acid deprotection steps). Therefore, it is not generally recommended that very long pieces be used for gene synthesis. Consequently, a number of methods have been developed in the past three decades to assemble relatively short synthetic oligonucleotides into longer gene sequences. These methods can be roughly grouped into ligation-mediated assembly and PCR-mediated assembly methodologies.
3.1. Ligation-mediated assembly

The joining of shorter DNA sequences into longer DNA sequences using DNA ligase represents the earliest example of gene synthesis. The first chemically synthesized gene encoded a 30 bp fragment of the yeast alanine tRNA and was constructed in the late 1960s using rudimentary oligonucleotide synthesis techniques and DNA ligase-mediated strand joining (Gupta et al., 1968). Over a decade later, the first chemically synthesized gene for a protein, human insulin A (63 bp), was similarly assembled using DNA ligase (Hsiung et al., 1979). A more modern version of this ligation-mediated assembly method, called “Shotgun Ligation,” involves splitting the desired gene product into multiple fragments composed of overlapping, phosphorylated oligonucleotides (Eren and Swenson, 1989; Grundstrom et al., 1985) (Fig. 12.2). Each of the gene fragments is assembled by annealing the pooled oligonucleotides for that fragment together and then ligating them en masse. After subfragments of a gene are individually assembled, the full-length gene is created by pooling and ligating the subfragments together.

Figure 12.2 Ligase-based gene synthesis methods. Top, ligase may be used to ligate assembled (gap-free) overlapping oligonucleotides into a single strand. Bottom, ligase may also be used in a stepwise synthesis approach where the growing product is tethered to a solid support and oligonucleotide pairs are washed over and sequentially ligated to the tethered product.

The discovery of thermostable ligases has since resulted in the synthesis of even longer DNA sequences (reviewed in Xiong et al., 2008). For example, the ligase chain reaction (LCR) (Au et al., 1998) involves a temperature cycling reaction similar to PCR. In this method, 5′-phosphorylated oligonucleotides with carefully designed overlap sequences that span both strands of a desired DNA duplex are mixed together, heated to denature the oligonucleotides, and then cooled to promote both annealing of the oligonucleotides and ligation by the thermostable ligase. The newly formed products can serve as the templates for additional ligation events, and the temperature cycling process is continued until the desired gene or gene fragment has been assembled. Depending upon the length of the desired gene product, assembly of gene fragments by LCR is often accompanied by a final PCR step to amplify or stitch together the shorter gene fragments produced by the ligation reactions (Au et al., 1998).

One of the advantages of ligation-mediated assembly over PCR-mediated methods is the potential cost savings from reductions in oligonucleotide synthesis. For example, longer oligonucleotides that encode one strand of the desired duplex can be bridged by small complementary oligonucleotides that align ligation junctions (Adams et al., 1988; Chen et al., 1990; Mehta et al., 1997). By first creating only a single strand of the eventual DNA duplex, the cost of oligonucleotide synthesis can be roughly halved. Once the full-length, single-stranded DNA sequence has been assembled by ligation, the complementary strand can be generated by primer extension with a DNA polymerase by PCR. For example, one of the first demonstrations of PCR-mediated assembly was with a 924-bp gene coding for horseradish peroxidase isozyme (Smith et al., 1990). In this study, the authors broke the gene into double-stranded oligonucleotide pairs that were 30–60 nt in length and that contained 6–9 nt overlaps. The authors allowed the oligonucleotides to anneal based upon the sequence complementarity of the overlap regions, ligated them together, and then amplified
the assembled gene by PCR using flanking primers (Jayaraman et al., 1991). A version of this protocol that relied on enzymatically phosphorylated oligonucleotides was used to assemble the entire 5.4-kb genome of the bacteriophage φX174 (Smith et al., 2003). A slightly different version of this protocol used a known, largely complementary reference sequence to guide the assembly of a mismatched DNA duplex and led to the assembly of a 1.9-kb Bacillus thuringiensis δ-endotoxin gene that had been altered for optimized expression in transgenic alfalfa and tobacco (Strizhov et al., 1996).

Ligation-mediated assembly has found a home in some commercial gene synthesis operations due to its inherently low mutagenesis rate (no errors due to DNA polymerase) and its relative ease of use (Mulligan et al., 2002, 2007). For example, Blue Heron Biotechnology uses a solid support-based, ligation-mediated oligonucleotide assembly process to synthesize customer-supplied DNA sequences (Fig. 12.2). The Blue Heron technology assembles a DNA duplex sequence on a solid support by iterative annealing and ligation of oligonucleotide pairs. In each round of synthesis, the next section of the DNA duplex anneals to the previously assembled DNA through designed sequence overlaps. Once annealed, the new section of DNA duplex can be joined to the growing product by DNA ligase. This process is repeated until the entire gene sequence is sequentially assembled. This technology has been fully automated, allowing for the efficient, high-throughput synthesis of DNA sequences at a commercial scale.
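The gap-free, overlapping layout that ligation-based assembly depends on can be sketched in a few lines. The example below tiles a target sequence into top-strand oligonucleotides and offsets the bottom-strand junctions by half an oligonucleotide, so that every nick on one strand is bridged by an intact oligonucleotide on the other. The 40-nt oligonucleotide length, the half-length offset, and the function names are illustrative assumptions rather than parameters from the studies cited above; a practical design would also balance oligonucleotide lengths and overlap melting temperatures and would require 5′-phosphorylated oligonucleotides.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_for_ligation(seq: str, oligo_len: int = 40) -> Tuple[List[str], List[str]]:
    """Split a duplex into gap-free top and bottom oligos (all written 5'->3').

    Bottom-strand junctions are shifted by half an oligo relative to the
    top-strand junctions, so no two nicks coincide. Oligo length and the
    offset are illustrative choices, not values from the cited work.
    """
    seq = seq.upper()
    if oligo_len < 2 or len(seq) < oligo_len:
        raise ValueError("sequence must be at least one oligo long")
    offset = oligo_len // 2
    top_cuts = list(range(0, len(seq), oligo_len)) + [len(seq)]
    bottom_cuts = [0] + list(range(offset, len(seq), oligo_len)) + [len(seq)]
    top = [seq[a:b] for a, b in zip(top_cuts, top_cuts[1:])]
    # Bottom oligos are stored 5'->3', i.e. as reverse complements of the
    # top-strand segment they pair with.
    bottom = [reverse_complement(seq[a:b])
              for a, b in zip(bottom_cuts, bottom_cuts[1:])]
    return top, bottom


if __name__ == "__main__":
    target = ("ACGTTGCA" * 13)[:100]
    top, bottom = tile_for_ligation(target, oligo_len=40)
    print("top lengths:", [len(o) for o in top])
    print("bottom lengths:", [len(o) for o in bottom])
```

In this toy layout the terminal oligonucleotides can be shorter than the rest, which is acceptable for illustration but would normally be adjusted in a real design.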
3.2. PCR-mediated assembly

The most commonly used gene synthesis techniques currently rely on the polymerase chain reaction (PCR) to mediate assembly of a desired DNA sequence from short oligonucleotides. These methodologies can be characterized as “one-pot” or single-step assemblies, where the desired gene product is assembled (often as a mixture of PCR products of varying lengths) in a single enzymatic reaction, or as multiple-step assemblies, where the gene is first divided into separate subassembly reactions. In these latter methods, the various subassemblies are then mixed and “stitched” together in a series of hierarchical thermal cycling reactions to yield the fully assembled gene products.

A ligase-independent, one-pot PCR-mediated gene assembly from oligonucleotides has been reported by Stemmer et al. (1995). In this method, a DNA polymerase was used to stitch and extend 56 40-nt oligonucleotides designed to cover both strands of a 1.1-kb beta-lactamase gene fragment (Stemmer et al., 1995). All of the oligonucleotides encoding the gene were pooled together and assembled in a one-pot polymerase chain assembly (PCA) reaction, leading to a mixture of elongated gene fragments that included the full-length desired product. The desired product was then amplified from the PCA reaction in a standard PCR using the outermost primers (Fig. 12.3). To demonstrate the utility of this method, a 2.7-kb plasmid was assembled from overlapping oligonucleotides (Stemmer et al., 1995), and this method was also successfully used to synthesize a 2.1-kb gene from Plasmodium falciparum (pfsub-1) (Stemmer et al., 1995). The one-pot PCR assembly methodology has since been used to synthesize a number of long gene sequences (Kodumal et al., 2004; Reisinger et al., 2006; Xiong et al., 2004a, 2006a).

Variations of the one-pot PCA synthesis of genes from oligonucleotides have also been employed to reduce the amount (and thus the cost) of oligonucleotide synthesis. These variations typically design the oligonucleotides such that the entire sequence of the DNA duplex to be synthesized is not represented in the synthesized oligonucleotides. Instead, the oligonucleotides encoding each strand of the DNA duplex are staggered, with the gaps between the overlap sequences of adjacent oligonucleotides being filled in during polymerase elongation, a process known as overlap extension (Chen et al., 1994; Horton et al., 1989).

Figure 12.3 The Stemmer method for gene assembly. A gapless series of overlapping oligonucleotides extend one another sequentially to produce a mixture of products, up to and including the full-length gene. This mixture is resolved by PCR with flanking primers to selectively amplify the desired full-length product.

As the number of oligonucleotides needed to assemble increasingly complex DNA sequences increased, alternative methodologies for oligonucleotide design and PCR assembly were devised to increase the rates of successful assembly, reduce error rates, and increase throughput. Most of these methods required multiple PCRs that built a gene in sections prior to splicing the sections together. The overlap extension process is frequently utilized to stitch multiple subsections (subassemblies) of genes together in an ordered fashion through designed overlap sequences. Along these lines, Young and Dong combined dual asymmetrical PCR (DA-PCR) (Sandhu et al., 1992) and overlap extension PCR (OE-PCR) (Horton et al., 1989) to develop a PCR assembly method that did not require optimization of the reaction conditions and could be performed without the need for
phosphorylated or gel-purified oligonucleotides (Young and Dong, 2004). In this method, the oligonucleotides that encoded the gene of interest were designed such that the sequence overlaps between each subassembly (composed of four overlapping oligonucleotides) were longer than the overlaps between adjacent oligonucleotides. When combined with a fivefold excess of the outermost primer pair, this method led to the preferential extension and amplification of longer subassemblies over shorter products. The products of each DA-PCR (primary subassemblies) could then be stitched together into the full-length gene product via an OE-PCR step. The use of multiple subassembly reactions reduced the likelihood of nonspecific annealing and thereby increased the purity of the assembled product. This methodology also reduced the quantity of synthesized oligonucleotide bases necessary to synthesize the full-length, double-stranded gene product to 1.2 times the length of the gene, versus the two times the length previously required (Stemmer et al., 1995).

Xiong et al. (2004a) described another PCR-based, two-step DNA synthesis (PTDS) method for the synthesis of long gene sequences based on the sequential OE-PCR assembly of genes from overlapping synthetic oligonucleotides (Fig. 12.4). In this method, fragments of the gene of interest were synthesized by mixing 10–12 60-mer oligonucleotides with 20 bp overlaps in a single PCR to assemble each 400–500 bp gene fragment. The full-length gene was then assembled from the gene fragments in a secondary OE-PCR with the outermost primers (Xiong et al., 2004a). The authors demonstrated that this method was a relatively high-fidelity (average error rate 0.12%) and cost-effective way to assemble larger gene sequences regardless of high G/C content, repetitive sequences, or complex secondary structures.
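The PTDS figures quoted above are mutually consistent, and the arithmetic is worth spelling out. The sketch below computes the block length produced by a chain of overlapping oligonucleotides and, separately, estimates how many errors to expect in a finished gene. The error estimate assumes that the reported 0.12% is a per-base rate and that errors occur independently; both are simplifying assumptions, and the function names are hypothetical.

```python
def block_length(n_oligos: int, oligo_len: int = 60, overlap: int = 20) -> int:
    """Length of the block assembled from n_oligos chained by fixed overlaps."""
    if n_oligos < 1:
        raise ValueError("need at least one oligonucleotide")
    return n_oligos * oligo_len - (n_oligos - 1) * overlap


def expected_errors(gene_len_bp: int, error_rate: float = 0.0012) -> float:
    """Expected number of errors, assuming error_rate is per base."""
    return gene_len_bp * error_rate


def error_free_fraction(gene_len_bp: int, error_rate: float = 0.0012) -> float:
    """Fraction of clones with no errors, assuming independent per-base errors."""
    return (1.0 - error_rate) ** gene_len_bp


if __name__ == "__main__":
    for n in (10, 12):
        print(f"{n} x 60-mers with 20-bp overlaps -> {block_length(n)} bp block")
    gene = 1230  # length of the phytase gene mentioned in the text
    print(f"{gene} bp: {expected_errors(gene):.2f} expected errors, "
          f"{error_free_fraction(gene):.1%} of clones error-free")
```

Ten 60-mers with 20-bp overlaps span 420 bp and twelve span 500 bp, matching the 400–500 bp fragment size given for the method. Under the stated assumptions, the error estimate also suggests why several clones typically need to be sequenced to find a correct one.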
Figure 12.4 PCR-based two-step DNA synthesis (PTDS) strategy used by Xiong et al. The product is constructed by first assembling 400–500 bp blocks from overlapping oligonucleotides. Higher concentrations of the outermost flanking primers (example concentrations are at right) facilitate the assembly of individual blocks. These blocks are then assembled into a full-length gene in a second PCR step. [Figure labels: Block 1, Block 2, Second step, Product; example amounts shown: 1.5 pmol each per reaction and 30 pmol each per reaction.]

Using this method, the authors have synthesized the genes for a 657-bp rice transcription factor, a 1230-bp Peniophora lycii phytase gene (Xiong et al., 2006b), a 1245-bp HBV PRS-S1S2S large
surface antigen (Lou et al., 2007), and 2382-bp vip3aI and 5367-bp CrtEBWY genes (Xiong et al., 2004a). An improved version of this methodology has also been described that includes additional gel purification steps for the oligonucleotide substrates, and an OE-PCR-based error correction step to improve the overall fidelity of the synthesized product (Xiong et al., 2006a).

Figure 12.5 Thermodynamically balanced, inside-out (TBIO) assembly. Overlaps between primers are balanced to obtain uniform annealing temperatures across the assembly. Primers are added in a concentration gradient with the innermost pair having the lowest concentration and the outermost pair having the highest concentration. This results in selective pressure for amplification of ever longer products and ultimately favors the full-length gene.

Gao et al. (2003) introduced the concept of thermodynamically balanced, inside-out (TBIO) nucleation PCR-mediated gene synthesis (Fig. 12.5). The TBIO method involves a five-step design and synthesis protocol and is especially notable for its unique primer design and assembly scheme. In the TBIO assembly scheme, sense-strand primers encode the N-terminal half of a gene sequence, while antisense-strand primers encode the C-terminal half. The sense-strand primers and the antisense primers meet at a point in the center of the gene (or gene fragment). During assembly, a nucleation event initially happens at the center of the gene sequence in which the first pair of overlapping sense and antisense primers is extended by PCR. This is followed by the annealing and extension of the next pair of sense and antisense oligonucleotide primers to the previously created duplex DNA, which anneal through overlap sequences designed to have identical annealing temperatures (i.e., thermodynamically balanced). This serial extension scheme builds the gene up in stages and aids in the assembly of genes that are otherwise difficult to assemble. Once the oligonucleotides necessary to assemble the gene fragments have been made, the second step in the assembly process is to use from four to six pairs of 60 nt TBIO primers for inside-out, bidirectional PCRs that will yield
Randall A. Hughes et al.
400–500 bp gene fragments upon assembly. The primers in a TBIO reaction are added in a gradient of concentrations, with the innermost primers at the lowest concentration and the outermost primers at the highest concentration, which encourages the inside-out extension reaction. Next, the gene fragment is gel-purified to remove any spurious products and is further extended by inside-out, bidirectional PCRs with additional sets of sense and antisense primers. The inside-out, bidirectional assembly process is repeated with additional sets of primers until the full-length gene is obtained. The authors compared the TBIO method to “conventional” thermodynamically balanced (TBC) gene assembly methods for the synthesis of the human protein kinase genes PKB2 (1.5 kb), S6K1 (1.6 kb), and PDK1 (1.7 kb) (Gao et al., 2003). Of the 15 synthetic genes sequenced, the error rate for the TBIO PCR-based synthesis method ranged from 0% to 0.3% (Gao et al., 2003), making this one of the higher-fidelity gene synthesis methods available. Other TBC methods often have error rates between 0.1% and 1% (Binkowski et al., 2005; Hoover and Lubkowski, 2002; Smith et al., 2003; Xiong et al., 2004a). A modification of the original TBIO method, called sequential TBIO or seqTBIO, has been published in which the authors performed the PCR assembly using small volumes (12 μL) and few cycles (seven) with one TBIO primer pair at a time, starting with the centermost primer pair (Marsic et al., 2008). After the first primer pair in the center of the desired gene is extended in a short seven-cycle PCR, another seven-cycle extension reaction is set up with the next pair of TBIO primers, rather than with a mixture of four to six primer pairs at a time. The resulting assembly is much cleaner, which eliminates the time-consuming need to purify products between reactions.
Using this methodology, the authors report the successful synthesis of the 449 bp gene encoding a PAZ (Piwi/Argonaute/Zwille) domain and the 1826 bp gene encoding a polA DNA polymerase mutant (Marsic et al., 2008). Another strategy for gene assembly that is very similar to the TBIO assembly method is named successive extension PCR (Peng et al., 2003; Xiong et al., 2004b). Peng et al. used this method to synthesize the gene encoding the B. thuringiensis δ-endotoxin, cryIA(c), from 26 oligonucleotides. The first 13 primers encoded the N-terminal half of the gene and the last 13 primers encoded the C-terminal half of the gene, with 20 bp overlaps throughout (Peng et al., 2003). The primer assembly of the gene fragments was almost identical to that of the TBIO method except that the authors initially used a one-pot PCR synthesis (although mechanistically assembly still occurred in an inside-out, bidirectional manner). It was observed that equal concentrations of each oligonucleotide resulted in relatively low yields and extensive “smearing” of DNA fragments over 1.0 kb in length (Peng et al., 2003; Xiong et al., 2004a,b). However, these problems could be corrected by adjusting the amounts of the primers added to the one-pot
Gene Synthesis: Methods and Applications
assembly reaction (1.5 pmol for the innermost primers and 30 pmol for the outermost primers) followed by amplifying the full-length gene product in a secondary PCR using the outermost primers (Xiong et al., 2004a,b). A version of this procedure, in which a gene was first subdivided into multiple fragments for one-pot TBIO PCR assembly from oligonucleotides and these gene fragments were then stitched together into the full-length gene in a secondary OE-PCR, has been reported to improve the synthesis yield of the fully assembled product (Xiong et al., 2004a). Additionally, a variation of this procedure has been successfully automated for high-throughput assembly of genes encoding several hundred mutant proteins at a scale and cost savings heretofore unseen in laboratory-scale gene synthesis applications (Cox et al., 2007) (Fig. 12.6). Recently, an improved PCR-based gene synthesis (IPS) method was described that may lower the cost of gene synthesis and eliminate the need for the multiple gel purification steps associated with other assembly procedures (Gao et al., 2003; Xiong et al., 2004a, 2006a; Young and Dong, 2004), while simultaneously improving fidelity (Gordeeva et al., 2010).
Figure 12.6 Gene synthesis as described by Cox et al. Oligonucleotide assembly scheme (top) for a three-pair per fragment by two fragments per gene assembly, similar to that used in Reddy et al. Primary reactions assemble oligonucleotides from inside-out due to a concentration gradient of material (top left) in 16 cycles of PCR (conditions right). The primary reactions are diluted and used as templates for a secondary (bottom) reaction that assembles and amplifies the final product via another 16 cycles of PCR (conditions right). The entire process can be automated on a liquid handling robot. The formula from Cox et al. used to calculate oligonucleotide concentrations for the primary reactions, and the concentrations actually used for this sample “3 × 2” reaction, are given in the lower left. The recipe for the amplification master mix is given in the lower right.
Panel contents:
Primary reactions 1 and 2: 95 °C, 2 min; 94 °C, 15 s; 60 °C, 30 s; 68 °C, 30 s; 16 cycles.
Dilute primary reactions 3:1, add 1.5 μL as template for secondary.
Secondary reaction: 95 °C, 2 min; 94 °C, 15 s; 60 °C, 30 s; 72 °C, 30 s; 16 cycles.
Primer concentrations:
\[ [P_i] = [P]_T \, \frac{c^{\,n-i}}{\sum_{j=0}^{n-1} c^{\,j}} \]
where [P_i] = concentration of the ith primer pair, c = constant (0.65 ≤ c ≤ 0.75), and [P]_T = total [primer] (≈600 nM). Concentrations used: 67.1 nM, 95.9 nM, 137 nM.
Master mix: each 50 μL reaction uses 13 μL of an amplification master mix comprising materials from the Novagen KOD Hot-Start polymerase kit: 500 μL 10× KOD buffer, 500 μL dNTP mix, 300 μL 25 mM MgSO4, 100 μL Hot-Start KOD.
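As a rough illustration of the concentration formula in Figure 12.6, the sketch below assumes that each primer pair i (i = 1 innermost, i = n outermost) receives a share of the total primer concentration proportional to c^(n−i), and that each primer in a pair is present at half the pair concentration. The function name, the indexing convention, and the choice c = 0.7 are assumptions made for illustration; with a total of 600 nM and three pairs they give per-primer values close to those listed in the figure.

```python
def primer_pair_concentrations(n_pairs, total_nM=600.0, c=0.7):
    """Return primer-pair concentrations in nM, innermost pair first.

    Assumed reading of the figure's formula:
        [P_i] = [P]_T * c**(n - i) / sum(c**j for j in 0..n-1)
    so the outermost pair (i = n) gets the largest share.
    """
    if n_pairs < 1:
        raise ValueError("need at least one primer pair")
    if not 0.0 < c < 1.0:
        raise ValueError("c is expected to lie between 0 and 1")
    denominator = sum(c ** j for j in range(n_pairs))
    return [total_nM * c ** (n_pairs - i) / denominator
            for i in range(1, n_pairs + 1)]


# Three pairs per fragment, as in the "3 x 2" example.
pair_nM = primer_pair_concentrations(3)
# Assumption: the two primers of a pair split the pair concentration equally.
per_primer_nM = [p / 2.0 for p in pair_nM]
```

Because the shares are normalized, the pair concentrations always sum to the chosen total, whatever value of c is used within the stated range.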
The primary cost savings afforded by the IPS method arise during the design of the oligonucleotide primers (Fig. 12.7). One strand of the desired DNA duplex sequence is synthesized in whole via end-to-end 60 nt oligonucleotide primers without overlaps. On the opposing strand, 30 nt primers are designed that have 15 bp of complementary overlap with each of the abutted oligonucleotides and also a uniform melting temperature (Tm). These shorter oligonucleotides act like splints to orient the longer 60 nt oligonucleotides and provide overlap sequences for annealing and assembly of adjacent oligonucleotides. The IPS method then synthesizes a gene using a three-step process. In the first two steps, the gene sequence is divided into 300–400 bp gene fragments, and the oligonucleotides necessary to assemble each fragment are divided into multiple two-primer (one 60 nt primer and its complementary 30 nt “splint” primer) 5-cycle PCR-mediated extension reactions. The 15 bp extensions in turn form overlaps for adjacent fragments (now 75 nt long) to anneal to one another. Following this first, brief extension step, the extended oligonucleotide pairs are pooled into assembly PCRs that lead to the synthesis of 300–400 bp gene fragments. The third step of the IPS method is the joining of these 300–400 bp gene fragments into a full-length gene, which is then amplified using the outermost primers. Due to the unique design of the oligonucleotide primers used in the IPS method, any errors in the assembled sequence can be corrected by an OE-PCR step without the need for the synthesis of additional oligonucleotides (Gordeeva et al., 2010) (Fig. 12.7). This is because any 60 nt or longer fragments containing a mutation can be reamplified and reassembled using existing oligonucleotides and then stitched back together using sequence-verified segments of a previously cloned (and sequenced) variant.
This is a decided improvement over other OE-PCR-mediated error correction methodologies that require the synthesis of additional primers (Xiong et al., 2006a).
Figure 12.7 The improved PCR synthesis (IPS) method. This method simplifies the correction of synthetic errors by starting from mini-fragments that are formed by briefly extending primer pairs. These pairs then build larger fragments, which are in turn assembled into a final product. Errors discovered in the final product can be readily corrected using original primer pairs that comprise the region containing the error. (Panel labels: 5 cycles of PCR to make mini-fragments; 20 cycles of PCR to make main fragments; PCR for final product.)
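The IPS primer layout can be sketched in a few lines. The example below is only an illustration of the geometry described in the text (abutting 60 nt oligonucleotides on one strand, 30 nt splints on the other strand covering 15 nt on each side of every junction); the function names are hypothetical, and the sketch makes no attempt to equalize the melting temperatures of the splints, which the published method does.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def ips_oligos(top_strand, long_len=60, splint_len=30):
    """Split one strand into abutting long oligos and design junction splints.

    Returns (long_oligos, splints). Each splint is the reverse complement of
    the window centered on a junction between two adjacent long oligos.
    """
    half = splint_len // 2
    long_oligos = [top_strand[i:i + long_len]
                   for i in range(0, len(top_strand), long_len)]
    splints = []
    for junction in range(long_len, len(top_strand), long_len):
        window = top_strand[junction - half:junction + half]
        splints.append(reverse_complement(window))
    return long_oligos, splints


# A 180 nt toy fragment gives three long oligos and two splints.
fragment = ("ACGTTGCA" * 23)[:180]
long_oligos, splints = ips_oligos(fragment)
```

Extending a long oligonucleotide on its splint adds the 15 nt that the splint shares with the neighboring oligonucleotide, which is the source of the 75 nt intermediates mentioned above.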
As we have seen, most PCR assembly methods rely on at least two separate PCRs to generate clean, full-length genes: one to assemble the oligonucleotides into gene fragments, and the other to either amplify the fully assembled gene from a population of assembly products or to stitch together smaller gene fragments. However, Wu et al. (2006) have described a one-pot method that is at root a simplification of the traditional two-step method originally reported by Stemmer et al. (1995). In the traditional PCA methodology, 40-mer oligonucleotides with 18–20 bp of sequence overlap between adjacent oligonucleotides encode both strands of a DNA duplex with no sequence gaps between the oligonucleotides. The oligonucleotides encoding the entire DNA duplex are then mixed at equimolar concentrations, and the mixture is assembled into the full-length gene using PCR. This process, if successful, results in the assembly of the desired sequence along with many intermediate and spurious products. The desired gene is then amplified from the mixture by adding an excess of the outermost primers and carrying out a secondary PCR. Wu et al. (2006) simplified this procedure by adding at least a 10-fold excess of the outermost primers to the primary assembly PCR from the start, thereby encouraging the amplification of the desired full-length product from the outset. However, the authors noted that this methodology was highly dependent on the DNA polymerase used for the PCR: KOD HiFi (Novagen) and KOD XL (Novagen) could be used to assemble a 777 bp Gene A fragment and a 936 bp Gene C fragment, while Taq and Pfu (Stratagene) DNA polymerases could not be used to assemble the genes under the same reaction conditions. Although DNA products of other sizes are still produced by this method, it is nonetheless quick, simple, and convenient, and the unwanted products can be removed by gel purification.
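The gapless, Stemmer-style oligonucleotide layout described above can be illustrated with a short sketch. It assumes 40-mers with a fixed 20 nt overlap (the text gives 18–20 bp) and simply tiles both strands; the function names are hypothetical, and real design tools also adjust oligonucleotide boundaries to balance melting temperatures.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def pca_oligos(seq, oligo_len=40):
    """Tile both strands of seq with oligos offset by half an oligo length.

    Top-strand oligos abut one another; bottom-strand oligos are shifted by
    oligo_len // 2, so each one bridges two adjacent top-strand oligos.
    """
    offset = oligo_len // 2
    top = [seq[i:i + oligo_len] for i in range(0, len(seq), oligo_len)]
    bottom = [reverse_complement(seq[i:i + oligo_len])
              for i in range(offset, len(seq), oligo_len)]
    return top, bottom


# A 120 nt toy sequence: three top-strand and three bottom-strand oligos
# (the last bottom-strand oligo is only 20 nt because the sequence ends).
target = "ACGTTGCAAG" * 12
top_oligos, bottom_oligos = pca_oligos(target)
```

In the one-pot variant, the same oligonucleotide set would be used, with the outermost primers supplied in excess rather than added in a second reaction.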
4. Gene Design Software
While the technical considerations accompanying gene synthesis are daunting, the design phase is perhaps the most important. There are software packages, such as GeneDesigner (https://www.dna20.com/genedesigner2/; Villalobos et al., 2006) and GenoCAD (http://www.genocad.org/; Czar et al., 2009), that allow users to drag and drop genetic elements based on “parts databases” such as the Registry of Standard Biological Parts (http://www.partsregistry.org/). However, this type of software is beyond the scope of this review, as it uses preassembled DNA parts that are not necessarily derived from synthetic starting materials. Instead, we are more concerned with the handful of packages that enable users to design overlapping oligonucleotides for gene synthesis.
One of the easiest-to-use design programs is DNAWorks (http://helixweb.nih.gov/dnaworks/; Hoover and Lubkowski, 2002). This program has the advantage that it has been under development for quite some time. In its current version, it is capable of setting up overlap extension PCR assembly reactions as well as TBIO assemblies with user-specified Tm windows and oligonucleotide lengths. It will reverse-translate an amino acid sequence with a variety of preexisting codon-usage tables as well as user-entered codon frequency data and permits a cutoff value for low-use codons to be specified. Assembly PCR Oligo Maker (http://publish.yorku.ca/pjohnson/AssemblyPCRoligomaker.html; Rydzanicz et al., 2005) and Gene2Oligo (http://berry.engin.umich.edu/gene2oligo; Rouillard et al., 2004) both automate the design of oligonucleotides for use in Stemmer-style assembly methods. GeneDesign (http://genedesign.thruhere.net/gd/; Richardson et al., 2006, 2010) has been under active development and includes a variety of tools beyond mere gene synthesis design. The synthesis design module will break a gene into 500 bp fragments joined by unique restriction sites and then further break each of those fragments into overlapping oligonucleotides (Fig. 12.8). This hierarchical approach can facilitate the synthesis of very large stretches of DNA. The additional features include a “Codon Juggling” module that will produce not only optimized sequences (based on standard frequency tables) but also next-most optimized sequences and most different sequences (iso-coding sequences with as little DNA sequence identity as possible). Also present are features to locate and remove restriction sites and user-specified arbitrary sequences from genes. It is worth mentioning that this website is particularly self-explanatory and easy to use.
TmPrime (http://prime.ibn.a-star.edu.sg; Bode et al., 2009) also designs oligonucleotides for gapless PCR or LCR assembly and breaks large sequences into smaller segments for multipart assembly. However, it is noteworthy for carrying out a pairwise sequence alignment to identify and
Figure 12.8 Gene synthesis as implemented by GeneDesign. The product is synthesized in 500 bp segments with unique restriction sites at each segment junction to facilitate final assembly by ligation. Each segment is assembled essentially by the Stemmer method but the overlaps contain gaps, and the overlap regions are globally normalized to have similar Tm values. (Panel labels: Block 1; Block 2; overlaps adjusted to match Tms across assemblies; restriction sites to facilitate product assembly.)
remove hetero- or homodimers, potential mishybridization events, and hairpins from the design. GeneComposer (http://www.genecomposer.net/; Lorimer et al., 2009) is a software package with a mixture of high-level design features and the capability to break down a gene into oligonucleotide sequences for assembly. The oligonucleotide design modules attempt to balance Tms and permit user-specified minimum and maximum oligonucleotide lengths, but the assembly strategy is limited to gapless overlaps. This software can be used to create scripts that can be used by TECAN Evo liquid handling robotic workstations to set up gene assembly reactions, a boon for assembling multiple genes. Going beyond design and considering the use of laboratory automation for synthesizing genes, GeneFab and FabMgr are programs that can be used to first design and implement a two-step gene assembly process (where smaller segments are assembled using inside-out nucleation PCR) and then join these fragments into a final product using conventional overlap extension PCR (Cox et al., 2007). In addition to generating code to facilitate the use of liquid handling robots in gene construction, this software package will construct entire families of variant genes while reusing as many oligonucleotides as possible. This is an extremely beneficial and thrifty feature for protein engineering projects that require the construction of numerous variants of a single parental gene. This software has been used to explore the sequence space around protein-coding genes (Allert et al., 2010) and has also been adapted to efficiently construct ensembles of antibody genes (Reddy et al., 2010).
5. Synthesis Fidelity/Error Correction Methods and Considerations
One of the primary problems with current chemical gene synthesis methods is the need to reduce or correct introduced sequence errors. Errors are introduced into the synthesis process during both chemical synthesis steps and enzymatic assembly steps. During chemical oligonucleotide synthesis, the coupling efficiency for the addition of each successive monomer is typically 98% (Hall et al., 2009; Zhou et al., 2004). Insertions and deletions are the most common errors that arise. Deletions can occur as a result of incomplete capping or deprotection and can occur at 0.5% per position. Insertions are typically caused by unwanted DMT cleavage by tetrazole and can occur at a frequency of up to 0.4% per position (Hall et al., 2009). The accumulation of these deleterious by-products can lead to only about 30% of any synthesized oligonucleotide (100-mer) being the desired sequence (Hall et al., 2009; Tian et al., 2009).
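The way small per-step losses compound over a long oligonucleotide can be made concrete with a very simple model. The sketch below assumes that an n-mer requires n − 1 coupling steps and that each step succeeds independently with the same efficiency; it ignores capping, deprotection, and insertion side reactions, so it is an idealization rather than a description of any particular synthesizer, and the function name is hypothetical.

```python
def full_length_fraction(length_nt, coupling_efficiency):
    """Fraction of chains that complete every coupling step.

    Assumes length_nt - 1 independent couplings, each succeeding with
    probability coupling_efficiency.
    """
    if length_nt < 1:
        raise ValueError("length must be at least 1 nt")
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return coupling_efficiency ** (length_nt - 1)


# Compare a few assumed efficiencies for a 100-mer.
for efficiency in (0.98, 0.99, 0.995):
    fraction = full_length_fraction(100, efficiency)
    print(f"{efficiency:.3f} per step -> {fraction:.1%} full length")
```

Under this model the result is very sensitive to the assumed efficiency, which is one reason short oligonucleotides are preferred as building blocks.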
The assembly of relatively short oligonucleotides into longer DNA sequences requires enzymatic extension by DNA polymerases that can also introduce errors into synthetic genes, albeit at a much lower rate than those introduced by oligonucleotide synthesis. A premium is placed on the use of high-fidelity DNA polymerases in virtually all PCR-based assembly methods, though errors still can and do occur (Andre et al., 1997; Cline et al., 1996). Taken together, all of these sources of potential error in gene synthesis lead to error rates of 1–10 mutations per kilobase of DNA (Binkowski et al., 2005; Hoover and Lubkowski, 2002; Xiong et al., 2004a). These introduced errors must generally be detected by screening, and multiple clones must be sequenced to obtain the desired DNA sequence. Consequently, several methods have been developed to eliminate errors associated with chemical DNA synthesis and to increase the throughput of the synthesis process itself.
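To see why these error rates force the sequencing of multiple clones, it helps to estimate the chance that a single clone is error-free. The sketch below assumes that errors are independent and Poisson-distributed along the gene, which is a simplification (real errors cluster around oligonucleotide quality and sequence context); the function names and the 95% confidence target are choices made for illustration.

```python
import math


def p_error_free(length_bp, errors_per_kb):
    """Probability that one clone carries no errors (Poisson assumption)."""
    expected_errors = errors_per_kb * length_bp / 1000.0
    return math.exp(-expected_errors)


def clones_needed(p_correct, confidence=0.95):
    """Clones to sequence so that at least one is correct with given confidence."""
    if not 0.0 < p_correct < 1.0:
        raise ValueError("p_correct must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_correct))


# A 1 kb gene at the low and high ends of the quoted error range.
for rate in (1, 10):
    p = p_error_free(1000, rate)
    print(f"{rate} errors/kb: P(correct clone) = {p:.2e}, "
          f"clones for 95% confidence = {clones_needed(p)}")
```

The steep growth in the number of clones at higher error rates or longer gene lengths is the practical motivation for the error-reduction methods that follow.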
5.1. Oligonucleotide purification
The quality and purity of the oligonucleotides used in chemical gene synthesis are the largest source of potential error. To increase the purity of input oligonucleotides, many gene synthesis protocols rely on purification of full-length oligonucleotides from capped, truncated variants by denaturing polyacrylamide gel electrophoresis (PAGE) (Sambrook and Russell, 2001), and this can indeed reduce error rates in the final assembled products by severalfold (Xiong et al., 2006a). However, removing synthesis by-products by PAGE is a costly, low-throughput, labor-intensive process. It is also difficult to separate full-length oligonucleotides from single deletion variants, even though these are the predominant cause of synthesis-related errors in any gene assembly process (Temsamani et al., 1995; Tian et al., 2004). Purification of gene fragments by agarose gel electrophoresis is often used in multiple two-step assembly PCR protocols to reduce error and promote the successful assembly of a desired DNA product (Gao et al., 2003; Kodumal et al., 2004; Xiong et al., 2006a; Young and Dong, 2004). Again, gel purification steps add to the time, cost, and labor involved in gene assembly. Therefore, several alternative, higher-throughput methods have been devised to reduce or eliminate introduced synthetic errors.
5.2. Reading frame selection
If the desired gene is a protein-coding sequence, then functional selection can be used to prescreen and enrich for mutation-free sequences. For synthetic protein-coding sequences whose function cannot be readily assayed, it may instead be possible to screen for clones that have the proper
reading frame. This is especially useful, as many of the errors that accumulate are single nucleotide deletions from the n−1 by-products of chemical oligonucleotide synthesis (Temsamani et al., 1995; Tian et al., 2004). Many of these single nucleotide deletions will cause reading frame shifts within a protein-coding sequence that will in turn lead to premature translation termination (Cho et al., 2000). To remove these variants, the synthetic genes can be cloned into a frame-shift selection vector that will express a fusion with a reporter protein, usually an antibiotic resistance element (Cox et al., 2007; Gerth et al., 2004; Lutz et al., 2002; Seehaus et al., 1992). The cells expressing genes with deletions will be sensitive to the antibiotic and only cells with genes that lack these deletions will survive on selective media. This method of presequencing has been reported to enrich for correct sequences by four- to fivefold and virtually eliminates single and double deletion variants (Cox et al., 2007).
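The logic of reading frame selection can be mimicked in silico. The sketch below is a simplified stand-in for the biological selection, not a model of any specific vector: it treats an insert as "selectable" if its length is a multiple of three and it contains no in-frame stop codon, so that a downstream reporter fused to it would still be translated. The function name and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def keeps_reporter_in_frame(insert):
    """Return True if a C-terminal reporter fusion would be translated.

    Requires the insert length to be a multiple of three and forbids any
    in-frame stop codon when reading from the first base.
    """
    insert = insert.upper()
    if len(insert) % 3 != 0:
        return False
    codons = (insert[i:i + 3] for i in range(0, len(insert), 3))
    return not any(codon in STOP_CODONS for codon in codons)


designed = "ATGGCTAAAGGT"          # ATG GCT AAA GGT
single_deletion = "ATGGTAAAGGT"    # one base lost: frame is shifted
internal_stop = "ATGTAAGGT"        # ATG TAA GGT

results = {
    "designed": keeps_reporter_in_frame(designed),
    "single_deletion": keeps_reporter_in_frame(single_deletion),
    "internal_stop": keeps_reporter_in_frame(internal_stop),
}
```

As with the biological selection, a check of this kind says nothing about substitutions or about insertions and deletions that happen to preserve the frame, so sequencing of the surviving clones is still required.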
5.3. Mismatch binding and cleavage
If the gene to be synthesized does not encode a protein, more general methods for error correction will have to be applied. Many methods take advantage of the mismatch repair proteins responsible for maintaining sequence fidelity during replication of genomic DNA (Modrich, 1991). By denaturing (heating) the synthesized DNA duplexes (some strands with mutations, some without) and then reannealing (cooling) the mixture, new duplexes will form, some of which contain mismatches (generally due to errors), while others do not (and are generally correct) (Fig. 12.9). The heteroduplexes that contain mismatches or bulges due to deletions or insertions can then be detected and/or cleaved by mismatch repair proteins. In one such method, the Escherichia coli mismatch repair MutHLS proteins locate, bind, and cleave mismatched duplexes. The MutS and MutL proteins are known to bind the mismatched position and then recruit the MutH endonuclease to scan and cleave the DNA duplex at the first d(GATC) sequence 3′ to the mutation site (Smith and Modrich, 1996, 1997). The uncleaved (largely mutation-free) duplexes that survive the cleavage can be purified by agarose gel electrophoresis based on their size and then subsequently cloned and sequenced. The authors demonstrated that this method can effectively enrich for sequences that lack G-T, A-C, G-G, or A-A mismatches, as well as deletions or insertions. There was up to a 10-fold enrichment of correct sequences over nontreated control sequences (Smith and Modrich, 1996, 1997). A similar enrichment technique relies on a thermostable version of the MutS protein from Thermus aquaticus. In this instance, the MutS protein is immobilized on beads or another solid support, binds to heteroduplex DNA sequences containing mismatches, and thereby selectively removes them from the pool of sequences, leaving mutation-free duplexes to be eluted from a column (Carr et al., 2004; Fig. 12.9).
This method has been shown to enrich for
Figure 12.9 Error correction schemes based on heteroduplex removal. Heteroduplexes are formed through melting and reannealing the synthetic products. Any resultant mismatch-bearing heteroduplexes are removed by degradation with a mismatch-specific endonuclease (top right) or by binding to immobilized MutS (middle right and bottom right). Iterative cycles of removal and resynthesis can be accomplished by shuffling the fragments (bottom right). Shuffling has an efficiency advantage when error-free transcripts are relatively rare, which is frequently the case when synthesizing genes greater than 1 kb in length. By removing only short fragments containing errors, the error-free fragments comprising the remainder of each synthetic gene can be used to reassemble error-free genes. (Panel labels: heteroduplex formation; mismatch-specific endonuclease; gel purify correct (full-length) product; immobilized MutS captures heteroduplexes; fragment, deplete heteroduplex fragments with immobilized MutS, reassemble.)
correct genes by as much as 15-fold and has an error rate of only 1 per 10 kb (Carr et al., 2004). Binkowski et al. have further modified this technique to include DNA shuffling. Heteroduplex containing mixtures are digested with an endonuclease, and immobilized MutS is used to remove fragments containing mismatches. Following this purification, reassembly of the cleaved genes will enrich the correct consensus sequence (Binkowski et al., 2005; Fig. 12.9). This is useful because the purification of mismatches by MutS is not fully efficient, and those mismatches that make it past the first screen will be further eliminated when the shuffled genes are again passed over the MutS column. The authors demonstrated that two rounds of shuffling could significantly reduce the errors associated with chemical synthesis of a green fluorescent protein variant (GFPuv); only one error per 3500 bp was observed, a further 3.5- to 4.3-fold decrease relative to no iterative binding and shuffling (Binkowski et al., 2005).
Several other methods for enriching mutation-free sequences from assembled DNAs use other mismatch endonucleases. Pincas et al. used Endo V endonuclease from Thermotoga maritima to cleave heteroduplexes one base 3′ to the mismatch followed by thermostable ligase from Thermus species AK16D to proofread and repair any nonspecific nicks (Huang et al., 2002; Pincas et al., 2004). Other more specific mismatch-detecting endonucleases, including phage T4 endonuclease VII, T7 endonuclease I, and E. coli endonuclease V, have been shown to be useful for identifying and cleaving synthetic heteroduplex DNAs containing mutations (Fuhrmann et al., 2005; Fig. 12.9). The use of these mismatch-cleaving endonucleases increases the quality of the primary PCR assembly reactions to the point that mutation-free synthetic genes over 1 kb can be made with relative ease (Fuhrmann et al., 2005; Young and Dong, 2004). The T4 and E. coli endonucleases were shown to be superior in this regard and reduced errors by 400-fold relative to controls that were not treated with endonuclease (Fuhrmann et al., 2005).
5.4. Correcting errors in synthetic genes by site-directed mutagenesis
Finally, the direct correction of known errors can be used to generate desired DNA sequences following assembly (Gordeeva et al., 2010; Xiong et al., 2006a). Xiong et al. (2006a) design new oligonucleotide primers that contain the targeted correction and reassemble the gene via overlap extension PCR. Gordeeva et al. employ a similar methodology except that they do not employ new oligonucleotides (as the original oligonucleotides are presumably correct, and the mutation is just a statistical anomaly; Gordeeva et al., 2010). They demonstrate that one can simply rebuild the section of the gene that contained the error by using the original synthesis primers and then reassemble the gene by overlap extension PCR (Gordeeva et al., 2010). In practice, any oligonucleotide-based, site-specific mutagenesis protocol is likely to work for correcting errors in synthetic DNA sequences. However, those that can manage error correction without the need for subcloning, such as the widely used QuikChange method from Stratagene, may be preferable to others (Salerno et al., 2005; Wang and Malcolm, 1999, 2002).
6. Applications of Gene Synthesis
6.1. Codon optimization
Synonymous codon usage varies by organism; this phenomenon has been reviewed elsewhere both in terms of evolutionary implications (Hershberg and Petrov, 2008) and in terms of implications for heterologous expression
of proteins (Gustafsson et al., 2004). The process by which an amino acid sequence is rendered as a DNA sequence with codon usage suitable to a given organism is known as codon optimization. At the most basic level, an amino acid sequence can be reverse-translated using highly utilized codons for an expression host, and this is almost automatically implemented for E. coli in many DNA editing programs. More sophisticated decision trees regarding codon use can also be implemented; for example, next-best codons may be introduced to avoid creating undesirable elements in the DNA sequence (restriction sites, transcription terminators, inverted repeats). For example, the JCat Web site (www.jcat.de; Grote et al., 2005) uses codon adaptation index (CAI) values (Sharp and Li, 1987) to optimize codon usage for a wide range of prokaryotic hosts and a handful of eukaryotes, but can also take into account these additional DNA sequence features. Similarly, the OPTIMIZER website (http://genomes.urv.es/OPTIMIZER/; Puigbo et al., 2007) uses information from a database of genes that are predicted to be highly expressed (Puigbo et al., 2008) to optimize codon usage. Several groups have subjected sequence-based expression optimization strategies to extensive experimental tests. Kudla et al. generated a synthetic library of 154 iso-coding GFP genes with random silent mutations. There proved to be no correlation between CAI and GFP fluorescence; instead, the greatest fluorescence stemmed from the occurrence of weak RNA structures in the first 28 bases of the open reading frame (Kudla et al., 2009). Welch et al. (2009) constructed 40 iso-coding variants for two genes using a Monte Carlo approach that should have explored a wide range of parameters thought to affect expression (secondary structure, GC content, codon frequency).
Based on the observed expression levels, they concluded that the use of codons which corresponded to tRNAs which were well charged during amino acid depletion led to optimal expression, regardless of CAI values, AT content, or secondary structure, although they also noted some effect of 5′ RNA structures. These authors have developed a codon-usage model based on this empirical optimization, and this has been commercialized through the company DNA2.0. Finally, Allert et al. (2010) examined 816 complete bacterial genomes for CAI values, AT content, and RNA secondary structure at several positions within open reading frames. They found that AT content at the 5′ and 3′ ends of genes was significantly higher than in the middle and used a Monte Carlo approach to explore whether this skewing impacted the expression of 285 synthetic variants of three genes. Indeed, increasing the AT content at the extremes (particularly at the 5′ end) was shown to increase expression levels of targeted protein sequences (Allert et al., 2010). Taken together, these studies clearly indicate that merely recoding a gene using commonly used host codons will not maximize expression. Removing RNA structure (often by enhancing AT content) is also likely
to be important. That said, the derivation of rules for the expression of synthetic genes is still in its infancy. Calculating and comparing the codon usages of the high-expressing variants from Allert et al. with the high-expressing codon usages from Welch et al. would be illuminating, at least for E. coli.
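One of the sequence features discussed above, AT content at the ends of an open reading frame relative to its middle, is easy to compute for a candidate design. The sketch below is only a descriptive check: the 30 nt window is an arbitrary choice made for illustration (it is not taken from the cited studies), the function names are hypothetical, and a higher AT fraction at the 5′ end is at best a rough proxy for reduced RNA structure.

```python
def at_fraction(seq):
    """Fraction of A and T bases in a DNA string."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("A") + seq.count("T")) / len(seq)


def at_profile(orf, window=30):
    """AT fraction in windows at the 5' end, the middle, and the 3' end."""
    if len(orf) < 3 * window:
        raise ValueError("ORF is too short for three separate windows")
    mid_start = (len(orf) - window) // 2
    return {
        "five_prime": at_fraction(orf[:window]),
        "middle": at_fraction(orf[mid_start:mid_start + window]),
        "three_prime": at_fraction(orf[-window:]),
    }


# Toy sequence with AT-rich ends and a GC-rich middle.
toy_orf = "AT" * 15 + "GC" * 15 + "AT" * 15
profile = at_profile(toy_orf)
```

A profile of this kind could be compared across iso-coding variants of the same gene before choosing which ones to synthesize.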
6.2. Synthetic biology The raison d’eˆtre of synthetic biology is the desire to engineer and understand biological components in a modular, scalable, and (most importantly) engineerable way. To invoke the late Richard Feynman: “What I cannot create, I do not understand.” Overall, despite multiple competing definitions and sects, synthetic biology seeks to demonstrate an understanding of biology by rebuilding it (or at least parts of it). Irrespective of school of thought, at the center of this discipline is the synthesis of synthetic DNA. Indeed, it can be reasonably said that advances in synthesis and assembly techniques have led to the emergence and definition of the discipline. Applications for gene synthesis within the burgeoning field of synthetic biology are largely up to the imagination of the people doing the experiments. Most importantly, the ability to make synthetic DNA means that researchers are no longer limited to natural sources or natural encodings. As a result, the construction of synthetic reporters, enzymes, regulatory elements, and even entire pathways has been undertaken (Burbelo et al., 2010; Khalil and Collins, 2010; Young and Alper, 2010). These peregrinations have been abetted by attempting to modularize biology using standardized “parts” such as Biobricks (Kelly et al., 2009; Shetty et al., 2008) or BglBricks (Anderson et al., 2010). Recently, the technological frontier in DNA synthesis has been expanded beyond the modular assembly of biological components into the synthesis of entire genomes. The synthesis of complete genomes has become a possible (if not yet practical) reality. The assembly of the 7.5 kb poliovirus cDNA from synthetic oligonucleotides into subfragment assemblies followed by iterative cloning of assembled synthons into the full-length genome was both a technical and controversial advance when it was completed in 2002 (Cello et al., 2002). 
Next came the synthesis of the 5.3 kb genome of bacteriophage φX174 by ligation of synthetic oligonucleotides into subassemblies followed by PCR assembly into the full-length chromosome (Cello et al., 2002; Smith et al., 2003). The viral genome for the 1918 “Spanish” influenza A strain has even been assembled from synthetic oligonucleotides (Neumann et al., 1999). Most remarkably, though, J. Craig Venter and colleagues at the Venter Institute have greatly expanded the limits of chemically synthesized genomes and have developed a number of techniques to synthesize, assemble, passage, and transplant entire bacterial genomes.
While novel techniques such as genome transplantation (Lartigue et al., 2007), passaging synthetic genomic DNA (Lartigue et al., 2009), and the cloning of whole bacterial genomes using yeast vector systems (Benders et al., 2010) have so far proved most relevant to manipulating genomic DNA, some of the assembly methods are also relevant to laboratory-scale DNA synthesis. For example, the enzymatic assembly of DNA fragments up to several hundred kilobases in vitro opens the way to the construction of novel operons and biosynthetic pathways. In this regard, a combination of T5 exonuclease, Taq ligase, and Phusion DNA polymerase can be used to assemble large overlapping subassemblies of DNA (Gibson et al., 2009). Overlapping DNA duplexes are first chewed back at their 5′ ends by T5 exonuclease to yield long single-stranded 3′ overhangs. Once the complementary overhangs of adjacent fragments have annealed, each annealed 3′ end serves as a primer for a fill-in reaction with Phusion DNA polymerase. Finally, thermostable Taq ligase seals the remaining nicks in the assembled sequence. Using this method, the authors have synthesized sections of the bacterial Mycoplasma genitalium genome (Gibson et al., 2009) and the complete 16.3-kb mouse mitochondrial genome from synthetic oligonucleotides (Gibson et al., 2010b).
Another clever assembly method takes advantage of yeast’s ability to readily perform homologous recombination with DNA fragments (Gibson et al., 2008b). Gibson et al. (2008a,b) showed that 25 overlapping DNA fragments could be transformed into yeast to produce the 592-kb synthetic genome for the bacterium M. genitalium. Recently, this group applied these same techniques to synthesize the 1.08-Mbp genome of the bacterium Mycoplasma mycoides (Gibson et al., 2010a).
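The sequence bookkeeping behind overlap-based assembly can be illustrated in silico. The sketch below is a simplified model, not the published protocol: it assumes the fragments are supplied in order and on the same strand, looks for the longest terminal overlap between neighbours, and keeps each overlap only once in the product. The function names and the default minimum overlap of 20 nt are illustrative assumptions.

```python
def terminal_overlap(left: str, right: str, min_len: int = 20) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def assemble(fragments: list[str], min_len: int = 20) -> str:
    """Join ordered, same-strand fragments, keeping each shared overlap once."""
    if not fragments:
        raise ValueError("at least one fragment is required")
    joined = fragments[0]
    for frag in fragments[1:]:
        n = terminal_overlap(joined, frag, min_len)
        if n == 0:
            raise ValueError("adjacent fragments lack a sufficient overlap")
        joined += frag[n:]
    return joined
```

A check of this kind is useful before ordering fragments, since a missing or too-short overlap between two neighbours would prevent them from annealing after the exonuclease step.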
7. Example of High-Throughput Gene Synthesis Using Protein Fabrication Automation
Our own version of high-throughput gene synthesis is called protein fabrication automation (PFA), derived from advances made by Homme Hellinga at Duke and largely implemented by Dr Colin Cox. We are currently equipped with a Mermade 192 oligonucleotide synthesizer and a TECAN Evo 200 liquid handling robotic platform to synthesize genes from oligonucleotides in a high-throughput manner. The TECAN robot is equipped with integrated automated thermocyclers to perform assembly PCR with little operator intervention. The first step in this synthetic method (and in most others) is to break down the target sequence into a set of overlapping primers compatible with the general assembly strategy. In the case of PFA, the strategy is to assemble primary fragments of approximately 200–600 bp using inside-out nucleation PCR and then to concatenate these fragments into a final product via
overlap extension PCR (Fig. 12.6). This process was designed to eliminate intermediate purification steps, so that the primary products may be diluted and added directly to the secondary reactions, enabling the overall workflow to be entirely automated on a robotic liquid handling workstation. In short, this is a turnkey operation that goes from oligonucleotide synthesis to full-length genes.
Two software components facilitate the implementation of PFA: GeneFab and FabMgr. The GeneFab software will render a DNA sequence as overlapping fragments and oligonucleotide pairs based on user-provided overlaps and oligonucleotide length constraints. It has a graphical interface that provides a view of the sequence, its translation, the location of each primer, and a mispriming graph to aid in manually adjusting derived assembly schemes. It is also capable of limited point-and-click DNA editing and of filtering sequences for unwanted restriction sites. FabMgr is used to manage variant sequences and maintains a database to enable efficient reuse of shared oligonucleotides. The combination of these two programs ultimately produces scripts for TECAN robotic workstations.
During our use of PFA, we have observed that 80–100 nt oligonucleotides with 30-bp overlaps within fragments and 35-bp overlaps between fragments produce uniformly excellent results for most gene sequences up to approximately 2 kb. Once a suitable scheme is devised in GeneFab, the assembly strategy is set up in FabMgr, which takes into account the number of oligonucleotides in each primary fragment and the number of secondary fragments that will be assembled. 
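To make the idea of rendering a sequence as overlapping oligonucleotides concrete, the following sketch tiles a fragment with a fixed oligonucleotide length and a fixed overlap, alternating strands so that neighbours can anneal. It is a hypothetical illustration and does not reproduce GeneFab's algorithm, which also weighs user constraints and mispriming; the function names and defaults (90 nt, 30-nt overlap) are assumptions taken from the ranges quoted above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_oligos(seq: str, oligo_len: int = 90,
                overlap: int = 30) -> list[tuple[int, int, str, str]]:
    """Tile `seq` into oligos of at most `oligo_len` nt sharing `overlap` nt.

    Returns (start, end, strand, oligo) tuples. Every second tile is emitted
    as its reverse complement so that adjacent oligos are complementary in
    their shared region. The last tile may be shorter than the others; a real
    design tool would rebalance lengths rather than accept a short oligo.
    """
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than the oligo")
    step = oligo_len - overlap
    tiles = []
    start = 0
    while True:
        end = min(start + oligo_len, len(seq))
        top = seq[start:end]
        strand = "+" if len(tiles) % 2 == 0 else "-"
        oligo = top if strand == "+" else reverse_complement(top)
        tiles.append((start, end, strand, oligo))
        if end == len(seq):
            break
        start += step
    return tiles
```

For a 200-bp fragment with these defaults, the tiles start at positions 0, 60, and 120, so each pair of neighbours shares 30 nt.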
In addition to the oligonucleotide assembly scheme, the concentration of each component is specified: in the inside-out nucleation reactions where the primary fragments are assembled, oligonucleotide concentrations are set in a gradient, with low concentrations for the innermost pairs and high concentrations for the outermost pairs.
Being able to quickly and reliably manufacture numerous, related genes is a transformative capability for protein engineering. The ability to efficiently reuse primers is an important and powerful feature of this software package, and once the assembly scheme has been established, additional mutation lists may be imported to create variants of a parental sequence. This capability even extends beyond lists of mutations, as additional, arbitrary sequences may be specified for inclusion in a gene, provided that the additional sequences are of identical length; this preserves the reading frame of the synthetic genes and facilitates oligonucleotide reuse (example below). As an example, we have recently used the PFA approach to construct numerous single-chain antibodies (scFvs) based on sequence analyses of antibody abundances in immune repertoires (Reddy et al., 2010). As the antibody sequences were destined for protein expression in E. coli, we recoded each murine antibody gene using a simplified codon table to ensure that only one common E. coli codon would be used for each amino acid, thus negating differences in codon usage at a given position and thereby
avoiding the generation of new primers. After aligning the antibody sequences around a common poly-glycine-serine linker, we were able to reuse on average three oligonucleotides out of the 12 required for each gene constructed. Oligonucleotide reuse was further facilitated by making the sequences of otherwise different genes more uniform. For example, the antibody genes naturally ranged from 740 to 824 bp due to the variability of the antibody complementarity-determining regions (Reddy et al., 2010). Somewhat paradoxically, all of these variants were padded to a uniform 824 bp in length by adding stuffer sequences onto their 5′ ends to enable construction with a uniform overlapping oligonucleotide scheme. By employing this design strategy, we were ultimately able to further reduce the total number of oligonucleotides needed to assemble the antibody variants by one-third.
As each specified mutant or additional sequence is parsed by FabMgr, any additional oligonucleotides needed are dynamically created and databased. The software also automatically generates an oligonucleotide order. We have found that the control provided by in-house synthesis allows for long (>70 bp) oligonucleotides to be produced with a suitable degree of purity for successful assembly, although commercially sourced oligonucleotides can also be used (and indeed the software will output comma-separated lists suitable for input into a synthesizer or uploading as a commercial order). After the oligonucleotides are synthesized, they are concentration-normalized to 1 mM and placed on the TECAN robot deck. Scripts to direct robotic assembly are also generated by FabMgr. In addition to TECAN-specific code, FabMgr outputs a human-readable file which enumerates each liquid handling operation, allowing smaller numbers of assemblies to be carried out manually, if desired. 
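The effect of a simplified codon table on oligonucleotide reuse can be sketched as follows. This is a hypothetical illustration rather than FabMgr's implementation: the codon choices are assumed picks of codons that are frequent in E. coli (not the table actually used for the scFv work), and fixed-position windows stand in for the oligonucleotides of a uniform assembly scheme. All function names are illustrative.

```python
# One codon per amino acid. These picks are illustrative assumptions
# (codons that are frequent in E. coli), not the authors' actual table.
ONE_CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def recode(protein: str) -> str:
    """Back-translate a protein using exactly one codon per amino acid."""
    return "".join(ONE_CODON[aa] for aa in protein.upper())


def windows(seq: str, size: int, step: int) -> list[str]:
    """Fixed-position windows standing in for the oligos of one scheme."""
    overlap = size - step
    return [seq[i:i + size] for i in range(0, max(len(seq) - overlap, 1), step)]


def oligo_reuse(genes: list[str], size: int = 90, step: int = 60) -> tuple[int, int]:
    """Return (unique oligos to synthesize, oligos required in total).

    An oligo counts as shared only if it has the same sequence at the same
    position in the scheme, which is why uniform gene lengths help.
    """
    unique: set[tuple[int, str]] = set()
    total = 0
    for gene in genes:
        for position, oligo in enumerate(windows(gene, size, step)):
            unique.add((position, oligo))
            total += 1
    return len(unique), total
```

Because identical residues always yield identical codons, two variants that differ at a single position share every window that does not span that position.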
The primary fragment-generating PCRs are set up and run in an automated fashion on the robot deck, provided that a plate-handling arm and a suitable PCR machine (controllable through the TECAN software and equipped with a motorized hot top) are available. Upon completion of this reaction, the robot will retrieve the plate from the thermocycler, dilute the reactions, and combine the fragments in another plate for assembly of the final sequence. In the current configuration, the number of 96-well plates of oligonucleotides that can be drawn from is limited to the number of spots available on a plate carrier (9-position carriers are common). The total number of gene variants that may be constructed per run is in turn limited by the number of fragments required per gene, because the number of primary fragments is capped at 96: the software currently processes only one plate at a time. Two- and three-fragment assemblies are common for genes <2 kb, and therefore on the order of 48 or 32 gene variants, respectively, are created per run.
The overall rate of any gene synthesis operation is primarily limited by the time it takes to synthesize the oligonucleotides necessary to construct the genes. We typically assemble genes from 80 to 100 nt oligonucleotides. Using the Mermade 192 DNA synthesizer, the synthesis of 192 oligonucleotides takes approximately 3 days, with an additional half day for the robotically
controlled PCR assembly of the oligonucleotides into full-length genes. We now routinely synthesize up to 50 unique 1–2 kb gene sequences (or several hundred 1–2 kb protein variants) per week at a current cost of between $0.14 and $0.25 per base pair synthesized. These prices are comparable to those charged by commercial operations that enjoy much larger economies of scale, and our experience strongly suggests that gene fabrication facilities can be successfully developed for many different research settings.
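The capacity and cost figures quoted in this section follow from simple arithmetic, sketched below. The function names are illustrative; the 96-fragment cap and the price bounds are the values stated in the text, and the calculation ignores failed assemblies and resynthesis.

```python
def variants_per_run(fragments_per_gene: int, wells: int = 96) -> int:
    """Genes per run when one 96-well plate of primary fragments is processed."""
    return wells // fragments_per_gene


def cost_range(length_bp: int, low: float = 0.14,
               high: float = 0.25) -> tuple[float, float]:
    """Lower and upper cost in dollars at the quoted per-base-pair prices."""
    return length_bp * low, length_bp * high


two_fragment = variants_per_run(2)    # 96 // 2
three_fragment = variants_per_run(3)  # 96 // 3
low_1kb, high_1kb = cost_range(1000)  # bounds for a 1-kb gene
```

At these prices a 1-kb gene costs roughly $140 to $250, and a 2-kb gene twice that.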
8. Conclusions
Thirty years ago, the synthesis of DNA sequences multiple kilobases in length would have seemed like an impossibility, yet recently, we have seen the complete combined chemical-enzymatic synthesis of the 1.08 Mbp genome of the bacterium M. mycoides (Gibson et al., 2010a). This impressive feat demonstrates the power of existing gene synthesis techniques for assembling large DNA sequences from chemically synthesized oligonucleotides. However, this extreme example is currently more the exception than the rule. The specialized technical skills, equipment, and cost required to build a synthetic DNA sequence of this size put it out of reach of most research laboratories at the current time. That said, oligonucleotide synthesis and enzymatic gene assembly have been optimized to the point where small operons, plasmids, and viruses can be constructed from scratch with relative ease. Given that the techniques reviewed herein are quickly becoming the cornerstone of modern molecular and synthetic biology methods, it will be interesting to see whether they remain the province of individual researchers and facilities, or whether they are largely outsourced to commercial enterprises, as previously occurred with the synthesis of short chemical oligonucleotides and as is currently occurring with the serial sequencing of single DNA templates. It can be argued that the commoditization of gene synthesis will put all researchers on the same footing; it can equally well be argued that such outsourcing may stymie innovation in this area. This review and method therefore stands at a crossroads and may soon be regarded either as an anachronism or as a call to arms.
REFERENCES
Adams, S. E., Johnson, I. D., Braddock, M., Kingsman, A. J., Kingsman, S. M., and Edwards, R. M. (1988). Synthesis of a gene for the HIV transactivator protein TAT by a novel single stranded approach involving in vivo gap repair. Nucleic Acids Res. 16, 4287–4298. Agarwal, K. L., Buchi, H., Caruthers, M. H., Gupta, N., Khorana, H. G., Kleppe, K., Kumar, A., Ohtsuka, E., Rajbhandary, U. L., Van de Sande, J. H., Sgaramella, V.,
Weber, H., et al. (1970). Total synthesis of the gene for an alanine transfer ribonucleic acid from yeast. Nature 227, 27–34. Allert, M., Cox, J., and Hellinga, H. (2010). Multifactorial determinants of protein expression in prokaryotic open reading frames. J. Mol. Biol. 402(5), 905–918. Anderson, J. C., Dueber, J. E., Leguia, M., Wu, G. C., Goler, J. A., Arkin, A. P., and Keasling, J. D. (2010). BglBricks: A flexible standard for biological part assembly. J. Biol. Eng. 4, 1. Andre, P., Kim, A., Khrapko, K., and Thilly, W. G. (1997). Fidelity and mutational spectrum of Pfu DNA polymerase on a human mitochondrial DNA sequence. Genome Res. 7, 843–852. Au, L. C., Yang, F. Y., Yang, W. J., Lo, S. H., and Kao, C. F. (1998). Gene synthesis by a LCR-based approach: High-level production of leptin-L54 using synthetic gene in Escherichia coli. Biochem. Biophys. Res. Commun. 248, 200–203. Barone, A. D., Beecher, J. E., Bury, P. A., Chen, C., Doede, T., Fidanza, J. A., and McGall, G. H. (2001). Photolithographic synthesis of high-density oligonucleotide probe arrays. Nucleosides Nucleotides Nucleic Acids 20, 525–531. Benders, G. A., Noskov, V. N., Denisova, E. A., Lartigue, C., Gibson, D. G., Assad-Garcia, N., Chuang, R. Y., Carrera, W., Moodie, M., Algire, M. A., Phan, Q., Alperovich, N., et al. (2010). Cloning whole bacterial genomes in yeast. Nucleic Acids Res. 38, 2558–2569. Binkowski, B. F., Richmond, K. E., Kaysen, J., Sussman, M. R., and Belshaw, P. J. (2005). Correcting errors in synthetic DNA through consensus shuffling. Nucleic Acids Res. 33, e55. Bode, M., Khor, S., Ye, H., Li, M., and Ying, J. (2009). TmPrime: Fast, flexible oligonucleotide design software for gene synthesis. Nucleic Acids Res. 37, W214. Burbelo, P. D., Ching, K. H., Han, B. L., Klimavicz, C. M., and Iadarola, M. J. (2010). Synthetic biology for translational research. Am. J. Transl. Res. 2, 381–389. Carr, P. A., Park, J. S., Lee, Y. J., Yu, T., Zhang, S., and Jacobson, J. M. (2004). 
Protein-mediated error correction for de novo DNA synthesis. Nucleic Acids Res. 32, e162. Caruthers, M. H. (1985). Gene synthesis machines: DNA chemistry and its uses. Science 230, 281–285. Caruthers, M. H., Beaucage, S. L., Becker, C., Efcavitch, J. W., Fisher, E. F., Galluppi, G., Goldman, R., deHaseth, P., Matteucci, M., McBride, L., et al. (1983). Deoxyoligonucleotide synthesis via the phosphoramidite method. Gene Amplif. Anal. 3, 1–26. Caruthers, M. H., Barone, A. D., Beaucage, S. L., Dodds, D. R., Fisher, E. F., McBride, L. J., Matteucci, M., Stabinsky, Z., and Tang, J. Y. (1987). Chemical synthesis of deoxyoligonucleotides by the phosphoramidite method. Methods Enzymol. 154, 287–313. Cello, J., Paul, A. V., and Wimmer, E. (2002). Chemical synthesis of poliovirus cDNA: Generation of infectious virus in the absence of natural template. Science 297, 1016–1018. Chen, H. B., Weng, J. M., Jiang, K., and Bao, J. S. (1990). A new method for the synthesis of a structural gene. Nucleic Acids Res. 18, 871–878. Chen, G.-Q., Choi, I., Ramachandran, B., and Gouaux, J. E. (1994). Total gene synthesis: Novel single-step and convergent strategies applied to the construction of a 779 base pair bacteriorhodopsin gene. J. Am. Chem. Soc. 116, 8799–8800. Cheng, J. Y., Chen, H. H., Kao, Y. S., Kao, W. C., and Peck, K. (2002). High throughput parallel synthesis of oligonucleotides with 1536 channel synthesizer. Nucleic Acids Res. 30, e93. Cho, G., Keefe, A. D., Liu, R., Wilson, D. S., and Szostak, J. W. (2000). Constructing high complexity synthetic libraries of long ORFs using in vitro selection. J. Mol. Biol. 297, 309–319.
Ciccarelli, R. B., Gunyuzlu, P., Huang, J., Scott, C., and Oakes, F. T. (1991). Construction of synthetic genes using PCR after automated DNA synthesis of their entire top and bottom strands. Nucleic Acids Res. 19, 6007–6013. Cline, J., Braman, J. C., and Hogrefe, H. H. (1996). PCR fidelity of Pfu DNA polymerase and other thermostable DNA polymerases. Nucleic Acids Res. 24, 3546–3551. Cox, J., Lape, J., Sayed, M., and Hellinga, H. (2007). Protein fabrication automation. Protein Sci. 16, 379–390. Czar, M., Cai, Y., and Peccoud, J. (2009). Writing DNA with GenoCAD™. Nucleic Acids Res. Epub. Eren, M., and Swenson, R. P. (1989). Chemical synthesis and expression of a synthetic gene for the flavodoxin from Clostridium MP. J. Biol. Chem. 264, 14874–14879. Fodor, S. P., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T., and Solas, D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science 251, 767–773. Fuhrmann, M., Oertel, W., Berthold, P., and Hegemann, P. (2005). Removal of mismatched bases from synthetic genes by enzymatic mismatch cleavage. Nucleic Acids Res. 33, e58. Gao, X., LeProust, E., Zhang, H., Srivannavit, O., Gulari, E., Yu, P., Nishiguchi, C., Xiang, Q., and Zhou, X. (2001). A flexible light-directed DNA chip synthesis gated by deprotection using solution photogenerated acids. Nucleic Acids Res. 29, 4744–4750. Gao, X., Yao, P., Keith, A., Ragan, T. J., and Harris, T. K. (2003). Thermodynamically balanced inside-out (TBIO) PCR-based gene synthesis: A novel method of primer design for high-fidelity assembly of longer gene sequences. Nucleic Acids Res. 31, e143. Gerth, M. L., Patrick, W. M., and Lutz, S. (2004). A second-generation system for unbiased reading frame selection. Protein Eng. Des. Sel. 17, 595–602. Gibson, D. G., Benders, G. A., Andrews-Pfannkoch, C., Denisova, E. A., Baden-Tillson, H., Zaveri, J., Stockwell, T. B., Brownley, A., Thomas, D. W., Algire, M. A., Merryman, C., Young, L., et al. (2008a). 
Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 319, 1215–1220. Gibson, D. G., Benders, G. A., Axelrod, K. C., Zaveri, J., Algire, M. A., Moodie, M., Montague, M. G., Venter, J. C., Smith, H. O., and Hutchison, C. A., 3rd (2008b). One-step assembly in yeast of 25 overlapping DNA fragments to form a complete synthetic Mycoplasma genitalium genome. Proc. Natl. Acad. Sci. USA 105, 20404–20409. Gibson, D. G., Young, L., Chuang, R. Y., Venter, J. C., Hutchison, C. A., 3rd, and Smith, H. O. (2009). Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345. Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., Benders, G. A., Montague, M. G., Ma, L., Moodie, M. M., Merryman, C., Vashee, S., et al. (2010a). Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56. Gibson, D. G., Smith, H. O., Hutchison, C. A., 3rd, Venter, J. C., and Merryman, C. (2010b). Chemical synthesis of the mouse mitochondrial genome. Nat. Methods 7(11), 901–903. Gordeeva, T. L., Borschevskaya, L. N., and Sineoky, S. P. (2010). Improved PCR-based gene synthesis method and its application to the Citrobacter freundii phytase gene codon modification. J. Microbiol. Methods 81, 147–152. Grote, A., Hiller, K., Scheer, M., Munch, R., Nortemann, B., Hempel, D., and Jahn, D. (2005). JCat: A novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acids Res. 33, W526. Grundstrom, T., Zenke, W. M., Wintzerith, M., Matthes, H. W., Staub, A., and Chambon, P. (1985). Oligonucleotide-directed mutagenesis by microscale ‘shot-gun’ gene synthesis. Nucleic Acids Res. 13, 3305–3316.
Gupta, N. K., Ohtsuka, E., Sgaramella, V., Buchi, H., Kumar, A., Weber, H., and Khorana, H. G. (1968). Studies on polynucleotides, 88. Enzymatic joining of chemically synthesized segments corresponding to the gene for alanine-tRNA. Proc. Natl. Acad. Sci. USA 60, 1338–1344. Gustafsson, C., Govindarajan, S., and Minshull, J. (2004). Codon bias and heterologous protein expression. Trends Biotechnol. 22, 346–353. Hall, B., Micheletti, J. M., Satya, P., Ogle, K., Pollard, J., and Ellington, A. D. (2009). Design, synthesis, and amplification of DNA pools for in vitro selection. Curr. Protoc. Nucleic Acid Chem. Chapter 9, Unit 9.2. Hershberg, R., and Petrov, D. (2008). Selection on codon bias. Annu. Rev. Genet. 42, 287–299. Hoover, D., and Lubkowski, J. (2002). DNAWorks: An automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 30, e43. Horton, R. M., Hunt, H. D., Ho, S. N., Pullen, J. K., and Pease, L. R. (1989). Engineering hybrid genes without the use of restriction enzymes: Gene splicing by overlap extension. Gene 77, 61–68. Horvath, S. J., Firca, J. R., Hunkapiller, T., Hunkapiller, M. W., and Hood, L. (1987). An automated DNA synthesizer employing deoxynucleoside 3′-phosphoramidites. Methods Enzymol. 154, 314–326. Hsiung, H. M., Brousseau, R., Michniewicz, J., and Narang, S. A. (1979). Synthesis of human insulin gene. Part I. Development of reversed-phase chromatography in the modified triester method. Its application in the rapid and efficient synthesis of eight deoxyribooligonucleotides fragments constituting human insulin A DNA. Nucleic Acids Res. 6, 1371–1385. Huang, J., Kirk, B., Favis, R., Soussi, T., Paty, P., Cao, W., and Barany, F. (2002). An endonuclease/ligase based mutation scanning method especially suited for analysis of neoplastic tissue. Oncogene 21, 1909–1921. Hughes, T. R., Mao, M., Jones, A. R., Burchard, J., Marton, M. J., Shannon, K. W., Lefkowitz, S. M., Ziman, M., Schelter, J. M., Meyer, M. 
R., Kobayashi, S., Davis, C., et al. (2001). Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol. 19, 342–347. Jayaraman, K., Fingar, S. A., Shah, J., and Fyles, J. (1991). Polymerase chain reaction-mediated gene synthesis: Synthesis of a gene coding for isozyme c of horseradish peroxidase. Proc. Natl. Acad. Sci. USA 88, 4084–4088. Kelly, J. R., Rubin, A. J., Davis, J. H., Ajo-Franklin, C. M., Cumbers, J., Czar, M. J., de Mora, K., Glieberman, A. L., Monie, D. D., and Endy, D. (2009). Measuring the activity of BioBrick promoters using an in vivo reference standard. J. Biol. Eng. 3, 4. Khalil, A. S., and Collins, J. J. (2010). Synthetic biology: Applications come of age. Nat. Rev. Genet. 11, 367–379. Kodumal, S. J., Patel, K. G., Reid, R., Menzella, H. G., Welch, M., and Santi, D. V. (2004). Total synthesis of long DNA sequences: Synthesis of a contiguous 32-kb polyketide synthase gene cluster. Proc. Natl. Acad. Sci. USA 101, 15573–15578. Kudla, G., Murray, A., Tollervey, D., and Plotkin, J. (2009). Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255. Lartigue, C., Glass, J. I., Alperovich, N., Pieper, R., Parmar, P. P., Hutchison, C. A., 3rd, Smith, H. O., and Venter, J. C. (2007). Genome transplantation in bacteria: Changing one species to another. Science 317, 632–638. Lartigue, C., Vashee, S., Algire, M. A., Chuang, R. Y., Benders, G. A., Ma, L., Noskov, V. N., Denisova, E. A., Gibson, D. G., Assad-Garcia, N., Alperovich, N., Thomas, D. W., et al. (2009). Creating bacterial strains from genomes that have been cloned and engineered in yeast. Science 325, 1693–1696.
Lashkari, D. A., Hunicke-Smith, S. P., Norgren, R. M., Davis, R. W., and Brennan, T. (1995). An automated multiplex oligonucleotide synthesizer: Development of high-throughput, low-cost DNA synthesis. Proc. Natl. Acad. Sci. USA 92, 7912–7915. Lee, C. C., Snyder, T. M., and Quake, S. R. (2010). A microfluidic oligonucleotide synthesizer. Nucleic Acids Res. 38, 2514–2521. LeProust, E. M., Peck, B. J., Spirin, K., McCuen, H. B., Moore, B., Namsaraev, E., and Caruthers, M. H. (2010). Synthesis of high-quality libraries of long (150mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522–2540. Lorimer, D., Raymond, A., Walchli, J., Mixon, M., Barrow, A., Wallace, E., Grice, R., Burgin, A., and Stewart, L. (2009). Gene Composer: Database software for protein construct design, codon engineering, and gene synthesis. BMC Biotechnol. 9, 36. Lou, X. M., Yao, Q. H., Zhang, Z., Peng, R. H., Xiong, A. S., and Wang, H. K. (2007). Expression of the human hepatitis B virus large surface antigen gene in transgenic tomato plants. Clin. Vaccine Immunol. 14, 464–469. Lutz, S., Fast, W., and Benkovic, S. J. (2002). A universal, vector-based system for nucleic acid reading-frame selection. Protein Eng. 15, 1025–1030. Marsic, D., Hughes, R. C., Byrne-Steele, M. L., and Ng, J. D. (2008). PCR-based gene synthesis to produce recombinant proteins for crystallization. BMC Biotechnol. 8, 44. Mehta, D. V., DiGate, R. J., Banville, D. L., and Guiles, R. D. (1997). Optimized gene synthesis, high level expression, isotopic enrichment, and refolding of human interleukin-5. Protein Expr. Purif. 11, 86–94. Modrich, P. (1991). Mechanisms and biological effects of mismatch repair. Annu. Rev. Genet. 25, 229–253. Mulligan, J. T., Tabone, J. C., and Brickner, R. G. Method and system for polynucleotide synthesis (2002). United States Patent. 7,164,992. Mulligan, J. T., Tabone, J. C., and Brickner, R. G. Method and system for polynucleotide synthesis (2007). 
United States Patent. 7,164,992. Neumann, G., Watanabe, T., Ito, H., Watanabe, S., Goto, H., Gao, P., Hughes, M., Perez, D. R., Donis, R., Hoffmann, E., Hobom, G., and Kawaoka, Y. (1999). Generation of influenza A viruses entirely from cloned cDNAs. Proc. Natl. Acad. Sci. USA 96, 9345–9350. Peng, R., Xiong, A., Li, X., Fuan, H., and Yao, Q. (2003). A delta-endotoxin encoded in Pseudomonas fluorescens displays a high degree of insecticidal activity. Appl. Microbiol. Biotechnol. 63, 300–306. Pincas, H., Pingle, M. R., Huang, J., Lao, K., Paty, P. B., Friedman, A. M., and Barany, F. (2004). High sensitivity EndoV mutation scanning through real-time ligase proofreading. Nucleic Acids Res. 32, e148. Puigbo, P., Guzman, E., Romeu, A., and Garcia-Vallve, S. (2007). OPTIMIZER: A web server for optimizing the codon usage of DNA sequences. Nucleic Acids Res. 35, W126. Puigbo, P., Romeu, A., and Garcia-Vallve, S. (2008). HEG-DB: A database of predicted highly expressed genes in prokaryotic complete genomes under translational selection. Nucleic Acids Res. 36, D524. Rayner, S., Brignac, S., Bumeister, R., Belosludtsev, Y., Ward, T., Grant, O., O’Brien, K., Evans, G. A., and Garner, H. R. (1998). MerMade: An oligodeoxyribonucleotide synthesizer for high throughput oligonucleotide production in dual 96-well plates. Genome Res. 8, 741–747. Reddy, S., Ge, X., Miklos, A., Hughes, R., Kang, S., Hoi, K., Chrysostomou, C., Hunicke-Smith, S., Iverson, B., and Tucker, P. (2010). Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells. Nat. Biotechnol. 28(9), 965–969. Reisinger, S. J., Patel, K. G., and Santi, D. V. (2006). Total synthesis of multi-kilobase DNA sequences from oligonucleotides. Nat. Protoc. 1, 2596–2603.
Richardson, S., Wheelan, S., Yarrington, R., and Boeke, J. (2006). GeneDesign: Rapid, automated design of multikilobase synthetic genes. Genome Res. 16, 550. Richardson, S. M., Nunley, P. W., Yarrington, R. M., Boeke, J. D., and Bader, J. S. (2010). GeneDesign 3.0 is an updated synthetic biology toolkit. Nucleic Acids Res. 38, 2603–2606. Richmond, K. E., Li, M. H., Rodesch, M. J., Patel, M., Lowe, A. M., Kim, C., Chu, L. L., Venkataramaian, N., Flickinger, S. F., Kaysen, J., Belshaw, P. J., Sussman, M. R., et al. (2004). Amplification and assembly of chip-eluted DNA (AACED): A method for high-throughput gene synthesis. Nucleic Acids Res. 32, 5011–5018. Rouillard, J., Lee, W., Truan, G., Gao, X., Zhou, X., and Gulari, E. (2004). Gene2Oligo: Oligonucleotide design for in vitro gene synthesis. Nucleic Acids Res. 32, W176. Rydzanicz, R., Zhao, X., and Johnson, P. (2005). Assembly PCR oligo maker: A tool for designing oligodeoxynucleotides for constructing long DNA molecules for RNA production. Nucleic Acids Res. 33, W521. Salerno, J. C., Jones, R. J., Erdogan, E., and Smith, S. M. (2005). A single-stage polymerase-based protocol for the introduction of deletions and insertions without subcloning. Mol. Biotechnol. 29, 225–232. Sambrook, J., and Russell, D. W. (eds.), (2001). Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, Cold Spring Harbor. Sandhu, G. S., Aleff, R. A., and Kline, B. C. (1992). Dual asymmetric PCR: One-step construction of synthetic genes. Biotechniques 12, 14–16. Seehaus, T., Breitling, F., Dubel, S., Klewinghaus, I., and Little, M. (1992). A vector for the removal of deletion mutants from antibody libraries. Gene 114, 235–237. Septak, M. (1996). Kinetic studies on depurination and detritylation of CPG-bound intermediates during oligonucleotide synthesis. Nucleic Acids Res. 24, 3053–3058. Sharp, P., and Li, W. (1987). 
The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281. Shetty, R. P., Endy, D., and Knight, T. F., Jr. (2008). Engineering BioBrick vectors from BioBrick parts. J. Biol. Eng. 2, 5. Sierzchala, A. B., Dellinger, D. J., Betley, J. R., Wyrzykiewicz, T. K., Yamada, C. M., and Caruthers, M. H. (2003). Solid-phase oligodeoxynucleotide synthesis: A two-step cycle using peroxy anion deprotection. J. Am. Chem. Soc. 125, 13427–13441. Sindelar, L. E., and Jaklevic, J. M. (1995). High-throughput DNA synthesis in a multichannel format. Nucleic Acids Res. 23, 982–987. Singh-Gasson, S., Green, R. D., Yue, Y., Nelson, C., Blattner, F., Sussman, M. R., and Cerrina, F. (1999). Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nat. Biotechnol. 17, 974–978. Smith, J., and Modrich, P. (1996). Mutation detection with MutH, MutL, and MutS mismatch repair proteins. Proc. Natl. Acad. Sci. USA 93, 4374–4379. Smith, J., and Modrich, P. (1997). Removal of polymerase-produced mutant sequences from PCR products. Proc. Natl. Acad. Sci. USA 94, 6847–6850. Smith, A. T., Santama, N., Dacey, S., Edwards, M., Bray, R. C., Thorneley, R. N., and Burke, J. F. (1990). Expression of a synthetic gene for horseradish peroxidase C in Escherichia coli and folding and activation of the recombinant enzyme with Ca2+ and heme. J. Biol. Chem. 265, 13335–13343. Smith, H. O., Hutchison, C. A., 3rd, Pfannkoch, C., and Venter, J. C. (2003). Generating a synthetic genome by whole genome assembly: phiX174 bacteriophage from synthetic oligonucleotides. Proc. Natl. Acad. Sci. USA 100, 15440–15445. Stemmer, W. P., Crameri, A., Ha, K. D., Brennan, T. M., and Heyneker, H. L. (1995). Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides. Gene 164, 49–53.
Strizhov, N., Keller, M., Mathur, J., Koncz-Kalman, Z., Bosch, D., Prudovsky, E., Schell, J., Sneh, B., Koncz, C., and Zilberstein, A. (1996). A synthetic cryIC gene, encoding a Bacillus thuringiensis delta-endotoxin, confers Spodoptera resistance in alfalfa and tobacco. Proc. Natl. Acad. Sci. USA 93, 15012–15017. Temsamani, J., Kubert, M., and Agrawal, S. (1995). Sequence identity of the n-1 product of a synthetic oligonucleotide. Nucleic Acids Res. 23, 1841–1844. Tian, J., Gong, H., Sheng, N., Zhou, X., Gulari, E., Gao, X., and Church, G. (2004). Accurate multiplex gene synthesis from programmable DNA microchips. Nature 432, 1050–1054. Tian, J., Ma, K., and Saaem, I. (2009). Advancing high-throughput gene synthesis technology. Mol. Biosyst. 5, 714–722. Villalobos, A., Ness, J., Gustafsson, C., Minshull, J., and Govindarajan, S. (2006). Gene Designer: A synthetic biology tool for constructing artificial DNA segments. BMC Bioinform. 7, 285. Wang, W., and Malcolm, B. A. (1999). Two-stage PCR protocol allowing introduction of multiple mutations, deletions and insertions using QuikChange site-directed mutagenesis. Biotechniques 26, 680–682. Wang, W., and Malcolm, B. A. (2002). Two-stage polymerase chain reaction protocol allowing introduction of multiple mutations, deletions, and insertions, using QuikChange site-directed mutagenesis. Methods Mol. Biol. 182, 37–43. Welch, M., Govindarajan, S., Ness, J. E., Villalobos, A., Gurney, A., Minshull, J., and Gustafsson, C. (2009). Design parameters to control synthetic gene expression in Escherichia coli. PLoS ONE 4, e7002. Wu, G., Wolf, J. B., Ibrahim, A. F., Vadasz, S., Gunasinghe, M., and Freeland, S. J. (2006). Simplified gene synthesis: A one-step approach to PCR-based gene construction. J. Biotechnol. 124, 496–503. Xiong, A. S., Yao, Q. H., Peng, R. H., Li, X., Fan, H. Q., Cheng, Z. M., and Li, Y. (2004a). A simple, rapid, high-fidelity and cost-effective PCR-based two-step DNA synthesis method for long gene sequences. 
Nucleic Acids Res. 32, e98. Xiong, A. S., Yao, Q. H., Peng, R. H., Li, X., Fan, H. Q., Guo, M. J., and Zhang, S. L. (2004b). Isolation, characterization, and molecular cloning of the cDNA encoding a novel phytase from Aspergillus niger 113 and high expression in Pichia pastoris. J. Biochem. Mol. Biol. 37, 282–291. Xiong, A. S., Yao, Q. H., Peng, R. H., Duan, H., Li, X., Fan, H. Q., Cheng, Z. M., and Li, Y. (2006a). PCR-based accurate synthesis of long DNA sequences. Nat. Protoc. 1, 791–797. Xiong, A. S., Yao, Q. H., Peng, R. H., Zhang, Z., Xu, F., Liu, J. G., Han, P. L., and Chen, J. M. (2006b). High level expression of a synthetic gene encoding Peniophora lycii phytase in methylotrophic yeast Pichia pastoris. Appl. Microbiol. Biotechnol. 72, 1039–1047. Xiong, A. S., Peng, R. H., Zhuang, J., Liu, J. G., Gao, F., Chen, J. M., Cheng, Z. M., and Yao, Q. H. (2008). Non-polymerase-cycling-assembly-based chemical gene synthesis: Strategies, methods, and progress. Biotechnol. Adv. 26, 121–134. Young, E., and Alper, H. (2010). Synthetic biology: Tools to design, build, and optimize cellular processes. J. Biomed. Biotechnol. 2010, 130781. Young, L., and Dong, Q. (2004). Two-step total gene synthesis method. Nucleic Acids Res. 32, e59. Zhou, X., Cai, S., Hong, A., You, Q., Yu, P., Sheng, N., Srivannavit, O., Muranjan, S., Rouillard, J. M., Xia, Y., Zhang, X., Xiang, Q., et al. (2004). Microfluidic PicoArray synthesis of oligodeoxynucleotides and simultaneous assembling of multiple DNA sequences. Nucleic Acids Res. 32, 5409–5417.
CHAPTER THIRTEEN
Assembly of BioBrick Standard Biological Parts Using Three Antibiotic Assembly
Reshma Shetty,*,1 Meagan Lizarazo,† Randy Rettberg,*,† and Thomas F. Knight†,1

Contents
1. Introduction
2. Construction of New BioBrick Standard Biological Parts
3. 3A Assembly of BioBrick Standard Biological Parts
4. Verification of Correct Assembly of BioBrick Standard Biological Parts
5. Preparation of Linearized Destination Vector by PCR to Improve 3A Assembly
6. Available BioBrick Destination Vectors
7. Preparation of Chemically Competent Cells
8. Conclusions
Acknowledgments
References
Abstract An underlying goal of synthetic biology is to make the process of engineering biological systems easier and more reliable. In support of this goal, we developed BioBrick assembly standard 10 to enable the construction of systems from standardized genetic parts. The BioBrick standard underpins the distributed efforts by the synthetic biology research community to develop a collection of more than 6000 standard genetic parts available from the Registry of Standard Biological Parts. Here, we describe the three antibiotic assembly method for physical composition of BioBrick parts and provide step-by-step protocols.
* Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
† Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
1 Current address: Ginkgo BioWorks, Inc., Boston, Massachusetts, USA
Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00013-9
© 2011 Elsevier Inc. All rights reserved.
The method relies on a combination of positive and negative selection to eliminate time- and labor-intensive steps such as column cleanup and agarose gel purification of DNA during part assembly.
1. Introduction
A fundamental goal of synthetic biology is to make the process of designing, building, and testing biological systems easier and more reliable. Drawing inspiration from other engineering fields, we have been working to develop methods that support the design and construction of multicomponent, synthetic biological systems from standardized biological parts. In 2002, Knight proposed and implemented the BioBrick standard for assembly of standard genetic parts (Knight, 2003). Parts that have been refined to adhere to the BioBrick assembly standard are called BioBrick standard biological parts. The key innovation of the BioBrick standard is that assembly of any two BioBrick parts yields a composite object that is itself a BioBrick part that can be further combined with any other BioBrick parts. The idempotent nature of the BioBrick standard has helped to support the distributed development of a collection of over 6000 standard biological parts in the Registry of Standard Biological Parts (http://partsregistry.org). Parts in the Registry are cataloged, describing and conveying units of biological function rather than simply arbitrary DNA sequences. Participants in the international Genetically Engineered Machines competition (iGEM, http://igem.org) as well as many academic laboratories have made use of the BioBrick assembly standard in construction of genetic parts, devices, and systems (Afonso et al., 2010; Agapakis et al., 2010; Ajo-Franklin et al., 2007; Canton et al., 2008; Du et al., 2009; Grünberg et al., 2010; Haynes et al., 2008; Huang et al., 2010; Kelly et al., 2009; Levskaya et al., 2005; Shetty et al., 2008; Tabor et al., 2009). Several additional physical composition standards have been proposed that extend or build upon the BioBrick assembly standard (Anderson et al., 2010; Ellison et al., 2009; Grünberg et al., 2009; Knight, 2008; Peisajovich et al., 2009; Phillips and Silver, 2006).
The BioBricks Foundation (BBF) now manages an open standards setting process for the synthetic biology community via the submission and publication of requests for comment (RFCs). In accordance with BBF RFC 29, the BioBrick assembly standard is now referred to as BioBrick assembly standard 10 in the Registry of Standard Biological Parts, and other assembly standards are similarly numbered (Shetty and Rettberg, 2009). For brevity, here we use the term BioBrick parts to refer to parts that adhere to BioBrick assembly standard 10. In addition to supporting the distributed development of a collection of standard biological parts, the BioBrick standard also enables the iterative and pairwise hierarchical assembly of genetic parts into systems containing many
parts. The standard prescribes specific sequences, termed the prefix and suffix sequences, that must flank the 5′ and 3′ ends of a genetic part, respectively (Fig. 13.1A and B). The prefix sequence encodes an EcoRI and XbaI restriction enzyme recognition site, and the suffix sequence encodes a SpeI and PstI recognition site. In addition, the standard prescribes that the sequence of the genetic part itself must not contain recognition sites for EcoRI, XbaI, SpeI, or PstI enzymes. During assembly of any two BioBrick parts, the 3′ end of the upstream part is digested with SpeI and the 5′ end of the downstream part is digested with XbaI. As SpeI and XbaI produce compatible cohesive ends, ligation of the upstream and downstream parts produces an 8-bp “mixed” or “scar” sequence between the two parts that is not recognized by either enzyme (Fig. 13.1C). As the resulting composite part is still flanked by the prefix and suffix sequences, it can again be digested with either XbaI or SpeI during a subsequent assembly step. Hence, a biological engineer can use the same, iterative process of restriction digestion and ligation to assemble multiple parts into a synthetic biological system. Although the BioBrick standard constrains the sequence of the genetic parts as well as the overall assembly approach, it does not specify the methods used during the assembly process. As a result, there has been
A. BioBrick assembly standard 10 prefix and suffix (sites, 5′ to 3′: EcoRI, NotI, XbaI, [part sequence], SpeI, NotI, PstI)
5′ GAATTC GCGGCCGC T TCTAGA G [part sequence] T ACTAGT A GCGGCCG CTGCAG 3′
3′ CTTAAG CGCCGGCG A AGATCT C [part sequence] A TGATCA T CGCCGGC GACGTC 5′
B. BioBrick assembly standard 10 CDS prefix and suffix (sites, 5′ to 3′: EcoRI, NotI, XbaI, [coding sequence], SpeI, NotI, PstI)
5′ GAATTC GCGGCCGC T TCTAG [ATG....TAATAA] T ACTAGT A GCGGCCG CTGCAG 3′
3′ CTTAAG CGCCGGCG A AGATC [TAC....ATTATT] A TGATCA T CGCCGGC GACGTC 5′
C. Composite part with BioBrick assembly standard 10 prefix and suffix (sites, 5′ to 3′: EcoRI, NotI, XbaI, [upstream part], mixed, [downstream part], SpeI, NotI, PstI)
5′ GAATTC GCGGCCGC T TCTAGA G [upstream part] T ACTAGA G [downstream part] T ACTAGT A GCGGCCG CTGCAG 3′
3′ CTTAAG CGCCGGCG A AGATCT C [upstream part] A TGATCT C [downstream part] A TGATCA T CGCCGGC GACGTC 5′
Figure 13.1 (A) Parts conforming to BioBrick assembly standard 10 are flanked on the 5′ end by a prefix sequence encoding EcoRI, NotI, and XbaI restriction enzyme recognition sites and on the 3′ end by a suffix sequence encoding SpeI, NotI, and PstI recognition sites. (B) BioBrick parts encoding protein coding sequences generally begin with an ATG start codon and end with two TAA stop codons. The prefix sequence is truncated by 2 bp on the 3′ end relative to other types of BioBrick parts. The truncation ensures that, when assembled, the Shine–Dalgarno sequence of a ribosome binding site part and the ATG start codon of a protein coding sequence part have correct, intervening spacing for expression in E. coli and related hosts. (C) Assembly of two BioBrick parts results in a new BioBrick part with appropriate prefix and suffix sequences. An 8-bp “mixed” or “scar” sequence is formed between the upstream and downstream parts, which is not recognized by any of the BioBrick enzymes. Thus, the composite BioBrick part can be reused in another assembly step.
some variation and experimentation on assembly methods by different research groups (Knight, 2003; Sleight et al., 2010). Here, we describe a method for assembly of BioBrick parts called three antibiotic assembly (3A assembly, Fig. 13.2). We designed the 3A assembly method to make use of negative and positive selection to reduce the number of possible incorrect assembly products. The inputs to the process are plasmids propagating an upstream part, a downstream part, and a destination vector into which the two parts will be assembled. The destination vector must have a different antibiotic resistance marker from the plasmids encoding the upstream and
[Figure 13.2 schematic: upstream part plasmid (R0010), cut E + S; downstream part plasmid (I13507), cut X + P; destination plasmid (P1010), cut E + P; the fragments are ligated with the destination vector to give the composite part plasmid (R0010, mixed site M, I13507).]
Figure 13.2 Schematic overview of the 3A assembly process for assembly of BioBrick parts. The destination plasmid must have a different antibiotic resistance marker from both the upstream and downstream part plasmids. To perform a 3A assembly, digest the upstream part plasmid with EcoRI and SpeI. Digest the downstream part plasmid with XbaI and PstI. Digest the destination plasmid with EcoRI and PstI. Ligate the resulting DNA fragments to form a circular plasmid composed of the destination vector and composite part and then transform. E, EcoRI site; X, XbaI site; S, SpeI site; P, PstI site; M, mixed XbaI/SpeI site; 1, pUC19-derived high copy replication origin; A, ampicillin resistance marker; K, kanamycin resistance marker; C, chloramphenicol resistance marker; R0010, BBa_R0010; I13507, BBa_I13507; and P1010, BBa_P1010. BBa_R0010 is a BioBrick genetic part encoding a LacI-repressible promoter. BBa_I13507 is a composite BioBrick genetic part composed of a ribosome binding site, monomeric red fluorescent protein-coding sequence, and a transcriptional terminator. BBa_P1010 encodes the ccdB positive selection marker.
downstream parts. (The name 3A assembly derives from the maximum of three antibiotic resistance markers needed for any assembly step.) The upstream part is digested with EcoRI and SpeI, the downstream part is digested with XbaI and PstI, and the destination vector is digested with EcoRI and PstI. The three digested plasmids are then mixed, ligated, and transformed with no intervening purification steps. Given the variety of DNA fragments present in the ligation reaction, an assortment of ligation products is possible in 3A assembly (Fig. 13.3). Fortunately, transformation
[Figure 13.3 schematic of possible ligation products. Panel labels: composite part plasmid, plasmid with two origins, part concatemer plasmid, destination plasmid, ligation product without a replication origin, and part plasmid.]
Figure 13.3 An overview of selective pressures for or against possible ligation products from the 3A assembly process. (A) The desired ligation product is composed of a composite part in the destination vector. (B) Colony growth usually selects against ligation products with two or more identical replication origins. Alternatively, transformed colonies can be screened for resistance to only the antibiotic corresponding to the destination vector. (C) Part concatemers are possible but occur with significantly lower frequency than the desired ligation product. (D) Expression of ccdB positive selection marker is toxic to growth of E. coli cloning strains preventing propagation of uncut or recircularized destination plasmid. (E) Transformation and subsequent growth selects against both linear ligation products and circular ligation products without a replication origin. (F) Growth on selective media selects for presence of destination vector in the ligation product. There is no selection for propagation of the part plasmid. Abbreviations are identical to Fig. 13.2.
and subsequent colony growth on selective media constitute strong selective pressure for circular ligation products that include the destination vector. In addition, propagation of the uncut or recircularized destination plasmid is selected against by inclusion of the ccdB positive selection marker in the BioBrick cloning site of the destination vector (Bernard, 1995, 1996; Bernard et al., 1994). Ligation products composed of part concatemers, though possible, occur with significantly lower probability than the desired ligation product. Finally, ligation products with two or more identical replication origins usually fail to propagate in growing Escherichia coli cells. Hence, there is significant positive and negative selective pressure for the transformation and propagation of the desired assembly product. As an additional benefit of the 3A assembly method, it is very amenable to scale up and automation because it avoids difficult-to-automate steps such as column cleanup and agarose gel purification of DNA. We have successfully used the method to assemble parts ranging in length from 12 bp to 3–4 kb. Below, we provide detailed protocols for the construction and 3A assembly of BioBrick parts. All protocols are adapted from standard molecular biology techniques and/or the manufacturer’s directions, unless otherwise specified (Sambrook and Russell, 2001).
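The composition rule described earlier in this section (SpeI digestion of the upstream suffix, XbaI digestion of the downstream prefix, and ligation of the compatible ends) can be illustrated in silico. The following is a minimal sketch rather than part of the original protocol: the helper names `flank` and `assemble` are hypothetical, only the top strand is modeled, and each input is assumed to carry exactly one SpeI and one XbaI site, as the standard requires.

```python
# Prefix and suffix sequences as quoted in this chapter
# (BioBrick assembly standard 10).
PREFIX = "GAATTCGCGGCCGCTTCTAGAG"    # prefix for non-coding parts
CDS_PREFIX = "GAATTCGCGGCCGCTTCTAG"  # prefix for parts starting with ATG
SUFFIX = "TACTAGTAGCGGCCGCTGCAG"
XBAI = "TCTAGA"  # cuts T^CTAGA
SPEI = "ACTAGT"  # cuts A^CTAGT


def flank(part: str) -> str:
    """Hypothetical helper: add the standard prefix and suffix to a part."""
    part = part.upper()
    prefix = CDS_PREFIX if part.startswith("ATG") else PREFIX
    return prefix + part + SUFFIX


def assemble(upstream: str, downstream: str) -> str:
    """Hypothetical helper: join two flanked parts (top strand only).

    The upstream part is cut at its SpeI site and the downstream part
    at its XbaI site; the CTAG overhangs are then joined.
    """
    spe = upstream.rfind(SPEI)
    xba = downstream.find(XBAI)
    if spe < 0 or xba < 0:
        raise ValueError("missing SpeI or XbaI site")
    left = upstream[: spe + 1]     # keep through the A of A^CTAGT
    right = downstream[xba + 1:]   # keep from the C of T^CTAGA
    return left + right
```

For two non-coding parts, `assemble(flank(a), flank(b))` returns the prefix, part `a`, the 8-bp mixed sequence TACTAGAG, part `b`, and the suffix, so the product can enter another round of assembly. When the downstream part is a coding sequence, the 2-nt truncation of its prefix makes the junction correspondingly shorter in this model.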
2. Construction of New BioBrick Standard Biological Parts
To adhere to BioBrick assembly standard 10, genetic parts must be designed with specific prefix and suffix sequences (Fig. 13.1A). The suffix sequence for all BioBrick genetic parts is 5′-TACTAGTAGCGGCCGCTGCAG-3′. The prefix sequence for genetic parts, with the exception of protein coding sequence parts, is 5′-GAATTCGCGGCCGCTTCTAGAG-3′. For protein coding sequences beginning with an ATG start codon, the prefix sequence is truncated by 2 nt at the 3′ end (Fig. 13.1B). The 2 nt truncation ensures that when assembled together, two genetic parts encoding a ribosome binding site and a protein coding sequence have a suitable spacing between the Shine–Dalgarno sequence and the ATG start codon to promote efficient translation in E. coli and related organisms (Chen et al., 1994). Hence, the prefix sequence for protein coding sequence parts beginning with an ATG start codon is 5′-GAATTCGCGGCCGCTTCTAG-3′. An additional constraint on the sequence of parts conforming to BioBrick assembly standard 10 is that recognition sites for the following restriction enzymes must be absent: EcoRI, XbaI, SpeI, and PstI. New BioBrick parts can be constructed by either PCR amplification from template DNA or de novo DNA synthesis. For PCR-based construction of parts, we routinely use the primer sequences listed in Table 13.1 in
Table 13.1 Primer sequences used during construction and 3A assembly of BioBrick parts
Primer description (name) | Primer sequence
Forward amplification primer for protein coding sequence parts | 5′-GTTTCTTCGAATTCGCGGCCGCTTCTAG [ATG + 15–21 nt coding sequence]-3′
Forward amplification primer for all other parts | 5′-GTTTCTTCGAATTCGCGGCCGCTTCTAGAG [18–24 nt part sequence]-3′
Reverse amplification primer | 5′-GTTTCTTCCTGCAGCGGCCGCTACTAGTA [18–24 nt part sequence (reverse complement)]-3′
Forward colony verification primer (BioBricks-f) | 5′-GTTTCTTCGAATTCGCGGCCGCTTCTAG-3′
Reverse colony verification primer (BioBricks-r) | 5′-GTTTCTTCCTGCAGCGGCCGCTACTAGTA-3′
Forward sequencing primer (VF2) | 5′-TGCCACCTGACGTCTAAGAA-3′
Reverse sequencing primer (VR) | 5′-ATTACCGCCTTTGAGTGAGC-3′
Vector amplification primer (SB-prep-3P) | 5′-GCCGCTGCAGTCCGGCAAAAAAACG-3′
Vector amplification primer (SB-prep-2Eb) | 5′-ATGAATTCCAGAAATCATCCTTAGCGAA-3′
which the prefix and suffix sequences are encoded on the forward and reverse primers, respectively. Both the forward and reverse amplification primers are designed to have an extra 8 nt of sequence on the 5′ end. The 8-nt spacer sequence serves two purposes. First, it enables efficient cleavage of the linear PCR product by EcoRI or PstI, as most restriction enzymes cleave less efficiently near DNA termini (Moreira and Noren, 1995). Second, the selected sequence GTTTCTT was found to promote the addition of a 3′ A overhang by Taq polymerase thereby facilitating downstream TA cloning of the PCR-amplified part (Brownstein et al., 1996). Optimal PCR conditions will vary based on part length, sequence, and amplification primers. When constructing a new part by PCR amplification, any EcoRI, XbaI, SpeI, or PstI recognition sites present in the part sequence must be eliminated by site-directed mutagenesis. For parts that are constructed by de novo DNA synthesis, we recommend eliminating the restriction sites listed in Table 13.2 when possible to facilitate conversion to other BioBrick assembly standards, part reuse, and downstream cloning operations. We also recommend cloning constructed parts into a standard BioBrick vector prior to 3A assembly.
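The two design checks in this section, screening a candidate part for the four forbidden recognition sites and attaching the amplification primer tails of Table 13.1, lend themselves to a short script. The sketch below is illustrative only: the function names and the default 20-nt annealing length (chosen from within the 18–24 nt range given in Table 13.1) are assumptions, and sites created at the junction between a part and its flanking sequence are not checked.

```python
# Recognition sites that must be absent from a BioBrick part.
# All four are palindromic, so scanning one strand is sufficient.
ILLEGAL_SITES = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
}

# Primer tails from Table 13.1 (8-nt spacer followed by prefix or suffix).
FWD_TAIL = "GTTTCTTCGAATTCGCGGCCGCTTCTAGAG"
FWD_TAIL_CDS = "GTTTCTTCGAATTCGCGGCCGCTTCTAG"
REV_TAIL = "GTTTCTTCCTGCAGCGGCCGCTACTAGTA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_illegal_sites(part: str) -> dict:
    """Return {enzyme: [0-based positions]} for every forbidden site found."""
    part = part.upper()
    hits = {}
    for name, site in ILLEGAL_SITES.items():
        positions = [
            i for i in range(len(part) - len(site) + 1)
            if part[i:i + len(site)] == site
        ]
        if positions:
            hits[name] = positions
    return hits


def design_primers(part: str, anneal_len: int = 20) -> tuple:
    """Hypothetical helper: build forward and reverse amplification primers."""
    part = part.upper()
    if find_illegal_sites(part):
        raise ValueError("part contains a forbidden restriction site")
    fwd_tail = FWD_TAIL_CDS if part.startswith("ATG") else FWD_TAIL
    forward = fwd_tail + part[:anneal_len]
    reverse = REV_TAIL + reverse_complement(part[-anneal_len:])
    return forward, reverse
```

A part that fails `find_illegal_sites` would first need site-directed mutagenesis, as described above, before primers are ordered. Melting temperature and secondary structure are deliberately left out of this sketch.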
Table 13.2 Restriction enzymes whose sites are recommended for removal when constructing BioBrick parts by de novo DNA synthesis
Restriction enzyme | Explanation
EcoRI, XbaI, SpeI, PstI | Required by BioBrick assembly standard 10
NheI, PsrI, PpiI, AscI, PacI, MabI, BglII, BamHI, XhoI, NgoMIV, AgeI, AarI, XmaI, BspEI, BpuEI, BseRI, HindIII | Compatibility with BioBrick assembly standards 12, 15, 21, 25, 28, 37, 44, and 45
ApoI, MfeI, AvrII, NsiI, SbfI | Produce compatible cohesive ends to BioBrick assembly standard 10 enzymes
BbsI, BbvI, BfuAI, BsmAI, BsmBI, BsmFI, BsaI, BspMI, BtgZI, EarI, FokI, HgaI, SapI, SfaNI | Offset cutters that produce arbitrary 3–4 nt overhangs
Nt.BsmAI, Nt.BspQI, Nt.BstNBI, Nt.AlwI | Offset nicking enzymes

3. 3A Assembly of BioBrick Standard Biological Parts
To assemble two BioBrick parts together into a destination vector via 3A assembly (Fig. 13.2), we recommend the following protocol. Enzymes and reagents are available from New England Biolabs, Ipswich, MA, in the BioBrick Assembly Kit (Catalog # E0546S).
1. Digest the upstream part with EcoRI-HF and SpeI, the downstream part with XbaI and PstI, and the destination vector with EcoRI-HF and PstI. The destination vector must have a different antibiotic resistance marker from both the upstream and downstream parts. Prepare all three digestion reactions with 500 ng plasmid DNA, 1× NEBuffer 2, 100 μg/mL bovine serum albumin, and 1 μL each enzyme in a total volume of 50 μL. Note that issues in the digest step arise if nearly all of the reaction volume consists of plasmid DNA from miniprep. Residual ethanol and/or guanidine hydrochloride contamination in miniprep eluate is common and can inhibit subsequent digestion. A significant fraction of the digest reaction volume should consist of water to dilute these impurities.
2. Incubate all three reactions at 37 °C for 15 min to digest the DNA and then at 80 °C for 20 min to heat inactivate the restriction enzymes. For convenience, we recommend performing this step in a programmable thermocycler. If troubleshooting, confirm digestion by analyzing 20 μL of reaction by agarose gel electrophoresis.
3. Prepare the ligation reaction with 2 μL from each digest reaction, 1× T4 DNA ligase reaction buffer, and 1 μL T4 DNA ligase in a total volume of 20 μL. It is not necessary to purify the linearized part DNA prior to ligation. Ligation reactions are done at low DNA concentrations to favor production of circular DNA rather than concatemers. Quick ligation buffers are not desirable for the cohesive end ligations used here.
4. Incubate the ligation reaction at room temperature for 10 min. Store the ligation reaction at −20 °C or proceed immediately to the chemical transformation step. We prefer chemical transformation to electroporation, because it is easier to do a large number of transformations in parallel.
5. Thaw the chemically competent cells on ice. Competent cells can be purchased commercially (e.g., NEB 10-beta competent E. coli (high efficiency)) or can be prepared via the protocol below. Regardless, it is important to verify experimentally that the competent cells used have a high transformation efficiency (>10^8 cfu per μg of plasmid DNA is desirable).
6. Prepare a transformation reaction by adding 1 μL of ligation reaction to 15–50 μL competent cells in a prechilled 2-mL Eppendorf tube. Increasing the amount of ligation product in the transformation reaction is usually not effective in increasing the number of transformants. Less than 5% of the transformation reaction volume should consist of ligation product. A 2-mL rather than a 1.7-mL Eppendorf tube provides better mixing and aeration of the cell culture during outgrowth in a rotator (step 11).
7. Incubate the transformation reaction on ice for 30 min.
8. Immerse the transformation reaction in a 42 °C water bath for 60 s to heat shock the cells.
9. Incubate the transformation reaction on ice for 2 min.
10. Add 200 μL sterile SOC media to the transformation reaction to begin cell outgrowth.
11. Incubate the transformation reaction at 37 °C for 2 h with gentle shaking. Although a 1-h incubation is sufficient for destination vectors with ampicillin or kanamycin resistance markers, we have found that a 2-h incubation results in an increased number of colonies for destination vectors with chloramphenicol or tetracycline markers.
12. Using sterile 3-mm glass beads, spread 200 μL of the transformation reaction on an LB agar plate supplemented with the antibiotic corresponding to the destination vector. Sterile glass beads are preferable to a glass spreader because multiple transformation reactions can be plated simultaneously.
13. Incubate transformation plates overnight at 37 °C for colony growth.
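The digest set-up in step 1 reduces to simple volume arithmetic, which is worth scripting when many assemblies are prepared in parallel. The sketch below is not part of the original protocol. It assumes a 10× buffer stock and a 100× bovine serum albumin stock; neither stock concentration is stated in the text, so adjust both to the reagents at hand. The function name `digest_setup` is hypothetical.

```python
def digest_setup(dna_ng_per_ul: float,
                 dna_ng: float = 500.0,        # from step 1
                 total_ul: float = 50.0,       # from step 1
                 n_enzymes: int = 2,           # two enzymes per digest
                 enzyme_ul: float = 1.0,       # 1 uL each enzyme (step 1)
                 buffer_stock_x: float = 10.0, # ASSUMPTION: 10x buffer stock
                 bsa_stock_x: float = 100.0    # ASSUMPTION: 100x BSA stock
                 ) -> dict:
    """Return component volumes (uL) for one restriction digest."""
    dna_ul = dna_ng / dna_ng_per_ul
    buffer_ul = total_ul / buffer_stock_x
    bsa_ul = total_ul / bsa_stock_x
    enzymes_ul = n_enzymes * enzyme_ul
    water_ul = total_ul - (dna_ul + buffer_ul + bsa_ul + enzymes_ul)
    if water_ul < 0:
        raise ValueError("DNA stock too dilute for this reaction volume")
    return {
        "dna_ul": dna_ul,
        "buffer_ul": buffer_ul,
        "bsa_ul": bsa_ul,
        "enzymes_ul": enzymes_ul,
        "water_ul": water_ul,
        # Fraction of the reaction that is water; step 1 advises keeping
        # this substantial to dilute miniprep contaminants.
        "water_fraction": water_ul / total_ul,
    }
```

With a 100 ng/μL miniprep, the DNA occupies 5 μL and water makes up most of the reaction. A very dilute miniprep leaves little or no room for water, which is the situation step 1 warns about.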
An attractive, automation-friendly alternative to plating the transformation reaction on LB agar plates is to perform serial dilution of the outgrowth cultures. Outgrowth can be done in the first, leftmost column of a 96-deep well plate in which each well is filled with 316 μL of SOC medium. Fill remaining wells with 684 μL of LB medium supplemented with the appropriate antibiotic and 0.002% cresol red. Serial dilutions, transferring 316 μL each time, result in 10-fold dilutions every two columns. Incubate plates overnight at 37 °C for cell growth. A red to yellow color change in a well indicates cell growth. Pick the last, rightmost (most dilute) yellow well and test for correct assembly as described below. This approach is only viable given the high likelihood and strong selection for correctly assembled parts in 3A assembly.
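The dilution arithmetic behind this scheme is easy to verify: transferring 316 μL into 684 μL dilutes by a factor of 0.316 per column, and 0.316 squared is approximately 0.1. The short sketch below tabulates the cumulative dilution across a plate. The function name and the choice of 12 columns (one row of a standard 96-well layout) are illustrative assumptions.

```python
def dilution_series(transfer_ul: float = 316.0,
                    diluent_ul: float = 684.0,
                    columns: int = 12) -> list:
    """Cumulative dilution factor in each column; column 1 is undiluted."""
    step = transfer_ul / (transfer_ul + diluent_ul)
    return [step ** i for i in range(columns)]


if __name__ == "__main__":
    for column, factor in enumerate(dilution_series(), start=1):
        print(f"column {column:2d}: {factor:.2e}")
```

Every second column is close to a power of ten, so the rightmost well that turns yellow gives a rough order-of-magnitude estimate of the number of viable transformed cells.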
4. Verification of Correct Assembly of BioBrick Standard Biological Parts
Based on our empirical observations, 3A assembly usually yields more than 80% correct clones, assuming that the assembled parts confer no significant growth disadvantage to the host strain. To confirm that two parts have been correctly assembled together via 3A assembly, we routinely
screen colonies for sensitivity to the antibiotic(s) corresponding to the two part vectors and verify the length of the assembled part by colony PCR followed by agarose gel electrophoresis as described below (Güssow and Clackson, 1989). Clones that pass these two checks are then analyzed by DNA sequencing.
1. Use a sterile tip to pick, inoculate and mix a colony into 20 μL of sterile H2O. We generally pick four or eight colonies from each transformation plate. For this and all subsequent steps, we use 8-strip PCR tubes and/or 96-well plates to facilitate use of a multichannel pipettor for the transfer operations.
2. Transfer approximately 2 μL of the colony suspension onto four LB agar plates each supplemented with one of ampicillin, kanamycin, chloramphenicol, and tetracycline. Incubate the plates either at room temperature overnight or at 37 °C for 6–8 h to check for the expected antibiotic resistance and sensitivity.
3. For each colony suspension, prepare 9.5 μL PCR mix using primers BioBricks-f and BioBricks-r that anneal to the BioBrick prefix and suffix, respectively (Table 13.1). Transfer 0.5 μL colony suspension to each reaction tube and run PCR in a programmable thermocycler. Optimal PCR conditions will vary based on part length and sequence, but generally an annealing temperature of 56–58 °C works well with these primers. We prefer to use colony verification primers BioBricks-f and BioBricks-r rather than sequencing primers VF2 and VR in this step to test for the presence of a BioBrick prefix and suffix in the assembled clones.
4. Analyze the resulting reactions by agarose gel electrophoresis. Confirm that the PCR product has the expected length by comparison to a molecular weight standard (DNA ladder).
5. Sequence plasmid DNA from one to two colonies per assembly from those colonies that pass both the antibiotic resistance/sensitivity test and length verification test. We routinely use primers VF2 and VR for sequence verification of parts assembled in BioBrick vectors (Table 13.1).
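The two screening checks above can be expressed as simple predicates. The sketch below is illustrative rather than part of the original protocol. It assumes that the colony PCR product spans the full prefix and suffix and also incorporates the 8-nt 5′ tail carried by each of BioBricks-f and BioBricks-r; both function names are hypothetical.

```python
PRIMER_TAIL_NT = 8   # 5' spacer on BioBricks-f and BioBricks-r (Table 13.1)
PREFIX_NT = 22       # GAATTCGCGGCCGCTTCTAGAG
CDS_PREFIX_NT = 20   # GAATTCGCGGCCGCTTCTAG
SUFFIX_NT = 21       # TACTAGTAGCGGCCGCTGCAG


def expected_colony_pcr_length(insert_len: int,
                               cds_prefix: bool = False) -> int:
    """Expected product length (bp) for an insert between prefix and suffix.

    insert_len is the assembled part length, including any mixed
    XbaI/SpeI sequences between its component parts.
    """
    prefix = CDS_PREFIX_NT if cds_prefix else PREFIX_NT
    return PRIMER_TAIL_NT + prefix + insert_len + SUFFIX_NT + PRIMER_TAIL_NT


def passes_antibiotic_screen(grew_on, destination_marker: str) -> bool:
    """True if the colony grew only on the destination vector's antibiotic."""
    return set(grew_on) == {destination_marker}
```

A clone that grows on chloramphenicol alone passes the screen for a chloramphenicol-resistant destination vector, whereas growth on an additional plate suggests that a part plasmid has been carried over.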
5. Preparation of Linearized Destination Vector by PCR to Improve 3A Assembly
Our driving goal in designing the 3A assembly process was to limit the number of possible incorrect assembly products as much as possible by leveraging a combination of positive and negative selection (Fig. 13.3). Hence, incorrect clones arising from 3A assembly are of significant interest
because they inform further improvements to the process. In our experience, the three most common failure modes of 3A assembly are the following. First, colonies on the transformation plate are resistant to the antibiotic corresponding to not only the destination vector but also one of the parent vectors. Such a result generally arises from a cotransformation, or possibly a ligation followed by transformation, of the part and destination vectors. Second, a contaminating E. coli genomic DNA fragment ligates to the destination vector backbone and is transformed. In this error case, the genomic DNA fragment is often flanked by EcoRI and PstI sites and can ligate directly to the destination vector. Such a two-fragment ligation event competes with the desired three-fragment ligation event needed for correct part assembly. Third, uncut or recircularized destination vector propagates in the host strain as a result of a loss-of-function mutation in the ccdB positive selection marker. While the verification tests described above can generally detect these three error cases prior to DNA sequencing, we prefer to reduce or eliminate these failure cases if possible. In particular, both the second and third failure modes described above stem from issues arising during preparation and digestion of the destination vector. To further reduce the number of possible incorrect assembly products in 3A assembly, we recommend preparing the destination vector by PCR instead of by plasmid DNA purification. Such an approach nearly eliminates contaminating genomic DNA in the destination vector digestion reaction, thereby reducing competition from EcoRI/PstI linearized genomic DNA fragments in the ligation reaction (second failure mode). Similarly, preparation of the destination vector by PCR also reduces contaminating uncut destination plasmid from the transformation reaction (third failure mode). 
As destination vectors can be prepared in batches in advance of any assembly steps, the duration and automation of the 3A assembly process are unaffected by these improvements.
1. Amplify the pSB1A3, pSB1K3, pSB1C3, and pSB1T3 destination vectors by PCR using primers SB-prep-3P and SB-prep-2Eb (Table 13.1). Prepare each amplification reaction with 1× PCR SuperMix High Fidelity (Life Technologies, Carlsbad, CA), 60 pmol each primer, and 2–5 ng template destination plasmid in a 100-μL total volume. We designed the primers to amplify the destination vector with only EcoRI and PstI recognition sites, leaving only the minimal number of bases 5′ to the sites to allow efficient DNA cleavage of the vector by these enzymes. The resulting short fragments produced from digestion of the PCR-amplified destination vector are too short for ligation during assembly, increasing the efficiency of the assembly step by limiting competing ligation products.
2. Thermocycle the reactions using the following conditions.
   1. Initial denaturation at 94 °C for 1 min.
   2. Denaturation step at 94 °C for 30 s.
   3. Annealing step at 58 °C for 30 s.
   4. Extension step at 68 °C for 3 min.
   5. Cycle 40 times to step 2.
   6. Final extension at 68 °C for 10 min.
3. Add 1 μL DpnI to each reaction.
4. Incubate all reactions at 37 °C for 1 h to digest the template DNA and then at 80 °C for 20 min to heat inactivate the restriction enzyme. For convenience, we recommend performing this step in a programmable thermocycler.
5. Confirm by agarose gel electrophoresis that the full-length destination vector is the primary PCR product.
6. Purify the PCR product using the QIAquick PCR Purification Kit from QIAGEN or similar kit to remove enzymes and dNTPs.
7. The resulting PCR-amplified destination vector can be used directly in the EcoRI and PstI digestion of a 3A assembly step. We do not recommend storage of digested linearized plasmid backbones.
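For batch planning, it can help to estimate how long the amplification program occupies a thermocycler and to convert the primer amount into a concentration. The sketch below does both under stated assumptions: "cycle 40 times" is read as 40 passes in total through the denaturation, annealing, and extension steps, and ramp times between temperatures are ignored. The names used are illustrative.

```python
# (temperature in deg C, hold time in s), transcribed from step 2.
INITIAL_DENATURATION = (94, 60)
CYCLE_STEPS = [(94, 30), (58, 30), (68, 180)]
FINAL_EXTENSION = (68, 600)
CYCLES = 40  # ASSUMPTION: 40 passes in total through CYCLE_STEPS


def total_hold_seconds(cycles: int = CYCLES) -> int:
    """Sum of programmed hold times, ignoring ramping between steps."""
    per_cycle = sum(seconds for _, seconds in CYCLE_STEPS)
    return INITIAL_DENATURATION[1] + cycles * per_cycle + FINAL_EXTENSION[1]


def primer_micromolar(pmol: float, volume_ul: float) -> float:
    """pmol per uL is numerically equal to micromolar."""
    return pmol / volume_ul
```

Under these assumptions the hold times add up to 10,260 s, a little under three hours, and 60 pmol of primer in 100 μL corresponds to 0.6 μM. The real run time will be longer by the instrument's ramping overhead.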
6. Available BioBrick Destination Vectors
Several standard BioBrick assembly vectors are available from the Registry of Standard Biological Parts including high copy vectors pSB1A3, pSB1K3, pSB1C3, pSB1T3, pSB1AK3, pSB1AC3, and pSB1AT3. In addition, a selection of low and medium copy BioBrick vectors is also available from the Registry including pSB4A5, pSB4K5, pSB4C5, pSB4T5, pSB3K5, pSB3C5, and pSB3T5 (Shetty et al., 2008). All of these vectors contain a BioBrick cloning site and primer binding sites for verification primers VF2 and VR. The VF2 and VR primer binding sites are positioned at a sufficient distance from the cloning site to allow high quality sequence reads of BioBrick parts from Sanger sequencing. Use of part plasmids and destination plasmids with different replication origins can increase the likelihood that ligation products containing two replication origins will propagate during cell growth (first failure mode described above).
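Because the destination vector must carry a different antibiotic resistance marker from both part plasmids, choosing a vector for each assembly step is a small lookup problem. The sketch below covers only the four single-marker high copy vectors named above. The mapping of the letters A, K, and C follows the abbreviations in Fig. 13.2; T is taken to denote tetracycline on the basis of the vector naming pattern, which is an inference. The function name is hypothetical.

```python
# Single-marker high copy destination vectors listed in this section.
# Marker assignments follow the A/K/C abbreviations of Fig. 13.2;
# T = tetracycline is inferred from the naming pattern (assumption).
DESTINATION_VECTORS = {
    "pSB1A3": "ampicillin",
    "pSB1K3": "kanamycin",
    "pSB1C3": "chloramphenicol",
    "pSB1T3": "tetracycline",
}


def choose_destination(upstream_marker: str, downstream_marker: str) -> list:
    """Return vectors whose marker differs from both part plasmids."""
    used = {upstream_marker, downstream_marker}
    return [
        name for name, marker in DESTINATION_VECTORS.items()
        if marker not in used
    ]
```

For the example in Fig. 13.2, with part plasmids carrying ampicillin and kanamycin markers, the candidates are pSB1C3 and pSB1T3. The composite then carries the destination marker, so the next round of assembly has to select a destination vector that differs from it as well.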
7. Preparation of Chemically Competent Cells
A key factor in the success of 3A assembly of BioBrick parts is the transformation efficiency of the competent cells used. Low efficiency competent cells can result in few or no colonies on the transformation plate. We have tested several different protocols for preparation of chemically
competent cells and have found that the following protocol yields competent cells with high transformation efficiency. It builds on extensive prior work on bacterial transformation (Hanahan et al., 1991). We have used this protocol successfully with E. coli strains TOP10 and Mach1.
1. Detergent residue inhibits competent cell growth and transformation. Remove detergents from all glassware by autoclaving glassware filled 3/4 full with DI water prior to use. Prepare all media and buffers in detergent-free glassware. Similarly, grow cultures in detergent-free glassware.
2. Prepare CCMB80 buffer:
   10 mM KOAc, pH 7.0 (10 mL of a 1 M stock solution per liter)
   80 mM CaCl2·2H2O (11.8 g/L)
   20 mM MnCl2·4H2O (4.0 g/L)
   10 mM MgCl2·6H2O (2.0 g/L)
   10% glycerol
   Adjust pH down to 6.4 with 0.1 N HCl if necessary. Filter sterilize and store at 4 °C.
3. Streak the E. coli strain on an SOB agar plate and grow at 23 °C until single colonies appear.
4. Prepare a set of seed cultures by picking single colonies into several culture tubes, each with 2 mL SOB medium. Shake culture tubes at 23 °C overnight.
5. Prepare frozen seed stocks by adding sterile glycerol to a final concentration of 15% to each seed culture. Aliquot 1 mL seed culture into cryogenic vials. Freeze seed stocks in a dry ice/ethanol bath and store at −80 °C.
6. Inoculate 250 mL SOB medium with a 1-mL vial of seed stock and grow at 20 °C to an OD600 of 0.3. This culture may be grown at room temperature if necessary.
7. Chill glassware, plasticware, and reagents prior to use.
8. Centrifuge the culture at 3000 × g at 4 °C for 10 min in a flat bottom centrifuge tube. The cell pellet is easier to resuspend in a flat bottom tube.
9. Gently resuspend the cell pellet in 80 mL of ice cold CCMB80 buffer. To make cell resuspension easier, initially add a small volume of buffer, resuspend, and then add the rest of the buffer.
10. Incubate the resuspended culture on ice for 20 min.
11. Centrifuge the resuspended culture at 3000 × g at 4 °C for 10 min and gently resuspend the cell pellet in 10 mL ice cold CCMB80 buffer.
12. Measure the OD600 of a mixture of 200 μL SOC and 50 μL resuspended culture. Add sufficient chilled CCMB80 buffer to the resuspended culture to yield a final OD600 of 1.0–1.5 in this measurement.
13. Incubate the resuspended culture on ice for 20 min.
14. Aliquot competent cells into prechilled tubes and store at −80 °C.
15. Measure the transformation efficiency of the prepared competent cells using plasmid DNA. The target transformation efficiency is >10^8 cfu per μg of plasmid DNA.
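The efficiency measurement in step 15 reduces to a short calculation. The sketch below is only an illustration: the function name, the plated-fraction parameter, and the example numbers are assumptions introduced here, not values from the protocol.

```python
def transformation_efficiency(colonies, dna_ng, plated_fraction=1.0):
    """Return colony-forming units per microgram of plasmid DNA.

    colonies        -- colonies counted on the plate
    dna_ng          -- nanograms of plasmid DNA added to the transformation
    plated_fraction -- fraction of the recovery culture actually plated
                       (hypothetical parameter; 1.0 means everything was plated)
    """
    if dna_ng <= 0:
        raise ValueError("dna_ng must be positive")
    if not 0 < plated_fraction <= 1:
        raise ValueError("plated_fraction must be in (0, 1]")
    dna_ug_on_plate = dna_ng / 1000.0 * plated_fraction
    return colonies / dna_ug_on_plate


if __name__ == "__main__":
    # Hypothetical test transformation: 10 pg of plasmid, one tenth plated.
    eff = transformation_efficiency(colonies=150, dna_ng=0.01, plated_fraction=0.1)
    print(f"{eff:.2e} cfu per ug; meets 1e8 target: {eff > 1e8}")
```

Using a small, known amount of plasmid keeps colony counts in a countable range while still allowing comparison with the 10^8 cfu per μg target.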
8. Conclusions
3A assembly is a method for the assembly of standard biological parts that conform to BioBrick assembly standard 10. The 3A assembly method relies on a combination of positive and negative selection to achieve a high frequency of correctly assembled clones (>80% in our experience). It avoids labor-intensive, difficult-to-automate steps such as column cleanup and agarose gel purification, and we have used it successfully with parts ranging in length from 12 bp to 3–4 kb. To use 3A assembly, the destination vector, into which two BioBrick parts will be assembled, must have a different antibiotic resistance marker from the plasmids encoding the two input parts. The 3A assembly method is extensible to other BioBrick assembly standards, assuming that the restriction enzymes used can be heat inactivated.
ACKNOWLEDGMENTS
We thank the MIT Synthetic Biology Working Group for valuable discussions and advice as we developed the 3A assembly method. BioBrick is a trademark of the BioBricks Foundation (http://biobricks.org).
REFERENCES Afonso, B., Silver, P. A., and Ajo-Franklin, C. M. (2010). A synthetic circuit for selectively arresting daughter cells to create aging populations. Nucleic Acids Res. 38, 2727–2735. Agapakis, C. M., Ducat, D. C., Boyle, P. M., Wintermute, E. H., Way, J. C., and Silver, P. A. (2010). Insulation of a synthetic hydrogen metabolism circuit in bacteria. J. Biol. Eng. 4, 3. Ajo-Franklin, C. M., Drubin, D. A., Eskin, J. A., Gee, E. P., Landgraf, D., Phillips, I., and Silver, P. A. (2007). Rational design of memory in eukaryotic cells. Genes Dev. 21, 2271–2276. Anderson, J. C., Dueber, J. E., Leguia, M., Wu, G. C., Goler, J. A., Arkin, A. P., and Keasling, J. D. (2010). BglBricks: A flexible standard for biological part assembly. J. Biol. Eng. 4, 1. Bernard, P. (1995). New ccdB positive-selection cloning vectors with kanamycin or chloramphenicol selectable markers. Gene 162, 159–160. Bernard, P. (1996). Positive selection of recombinant DNA by CcdB. Biotechniques 21, 320–323. Bernard, P., Gabant, P., Bahassi, E. M., and Couturier, M. (1994). Positive-selection vectors using the F plasmid ccdB killer gene. Gene 148, 71–74.
Brownstein, M. J., Carpten, J. D., and Smith, J. R. (1996). Modulation of non-templated nucleotide addition by Taq DNA polymerase: Primer modifications that facilitate genotyping. Biotechniques 20, 1004–1010. Canton, B., Labno, A., and Endy, D. (2008). Refinement and standardization of synthetic biological parts and devices. Nat. Biotechnol. 26, 787–793. Chen, H., Bjerknes, M., Kumar, R., and Jay, E. (1994). Determination of the optimal aligned spacing between the Shine-Dalgarno sequence and the translation initiation codon of Escherichia coli mRNAs. Nucleic Acids Res. 22, 4953–4957. Du, L., Gao, R., and Forster, A. C. (2009). Engineering multigene expression in vitro and in vivo with small terminators for T7 RNA polymerase. Biotechnol. Bioeng. 104, 1189–1196. Ellison, M., Ridgway, D., Fedor, J., Garside, E., Robinson, K., and Lloyd, D. (2009). BBF RFC 47: BioBytes Assembly Standard. DOI: 1721.1/49518. Grünberg, R., Arndt, K., and Müller, K. (2009). Fusion Protein (Freiburg) Biobrick Assembly Standard. DOI: 1721.1/45140. Grünberg, R., Ferrar, T. S., van der Sloot, A. M., Constante, M., and Serrano, L. (2010). Building blocks for protein interaction devices. Nucleic Acids Res. 38, 2645–2662. Güssow, D., and Clackson, T. (1989). Direct clone characterization from plaques and colonies by the polymerase chain reaction. Nucleic Acids Res. 17, 4000. Hanahan, D., Jessee, J., and Bloom, F. R. (1991). Plasmid transformation of Escherichia coli and other bacteria. Methods Enzymol. 204, 63–113. Haynes, K. A., Broderick, M. L., Brown, A. D., Butner, T. L., Dickson, J. O., Harden, W. L., Heard, L. H., Jessen, E. L., Malloy, K. J., Ogden, B. J., Rosemond, S., Simpson, S., et al. (2008). Engineering bacteria to solve the Burnt Pancake Problem. J. Biol. Eng. 2, 8. Huang, H. H., Camsund, D., Lindblad, P., and Heidorn, T. (2010). Design and characterization of molecular tools for a Synthetic Biology approach towards developing cyanobacterial biotechnology. Nucleic Acids Res.
38, 2577–2593. Kelly, J. R., Rubin, A. J., Davis, J. H., Ajo-Franklin, C. M., Cumbers, J., Czar, M. J., de Mora, K., Glieberman, A. L., Monie, D. D., and Endy, D. (2009). Measuring the activity of BioBrick promoters using an in vivo reference standard. J. Biol. Eng. 3, 4. Knight, T. F. (2003). Idempotent Vector Design for Standard Assembly of Biobricks. DOI: 1721.1/21168. Knight, T. F. (2008). Draft Standard for Biobrick BB-2 Biological Parts. DOI: 1721.1/ 45139. Levskaya, A., Chevalier, A. A., Tabor, J. J., Simpson, Z. B., Lavery, L. A., Levy, M., Davidson, E. A., Scouras, A., Ellington, A. D., Marcotte, E. M., and Voigt, C. A. (2005). Synthetic biology: Engineering Escherichia coli to see light. Nature 438, 441–442. Moreira, R. F., and Noren, C. J. (1995). Minimum duplex requirements for restriction enzyme cleavage near the termini of linear DNA fragments. Biotechniques 19, 56–59. Peisajovich, S. G., Horwitz, A., Hoeller, O., Rhau, B. and Lim, W. (2009). BBF RFC 28: A method for combinatorial multi-part assembly based on the Type IIs restriction enzyme AarI. DOI: 1721.1/46721. Phillips, I., and Silver, P. (2006). A New Biobrick Assembly Strategy Designed for Facile Protein Engineering. DOI: 1721.1/32535. Sambrook, J., and Russell, D. W. (2001). Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York. Shetty, R., and Rettberg, R. (2009). Naming of standards of physical composition of BioBrick parts. DOI: 1721.1/45137. Shetty, R. P., Endy, D., and Knight, T. F. (2008). Engineering BioBrick vectors from BioBrick parts. J. Biol. Eng. 2, 5. Sleight, S. C., Bartley, B. A., Lieviant, J. A., and Sauro, H. M. (2010). In-Fusion BioBrick assembly and re-engineering. Nucleic Acids Res. 38, 2624–2636. Tabor, J. J., Salis, H. M., Simpson, Z. B., Chevalier, A. A., Levskaya, A., Marcotte, E. M., Voigt, C. A., and Ellington, A. D. (2009). A synthetic genetic edge detection program. Cell 137, 1272–1281.
CHAPTER FOURTEEN

Genetic Assembly Tools for Synthetic Biology

Billyana Tsvetanova, Lansha Peng, Xiquan Liang, Ke Li, Jian-Ping Yang, Tony Ho, Josh Shirley, Liewei Xu, Jason Potter, Wieslaw Kudlicki, Todd Peterson, and Federico Katzen

Contents
1. Overview
2. Yeast-Based Homologous Recombination
2.1. Yeast–Escherichia coli shuttle vector
2.2. DNA assembly using fragments with identical ends
2.3. DNA assembly using fragments without identical ends
2.4. Screening for the positive clone
2.5. Yeast–E. coli transfer
2.6. Yeast plasmid conversion cassette
2.7. Typical results
3. In Vitro Recombineering—DNA Assembly
3.1. Materials needed
3.2. Assembly reaction
3.3. Typical results
4. In Vitro Recombineering—Site-Directed Mutagenesis
4.1. Primer specifications
4.2. Materials needed
4.3. Mutagenesis protocol
4.4. Typical results
5. Concluding Remarks
6. Disclosure
References
Abstract
With the completion of myriad genome sequencing projects, genetic bioengineering has expanded into many applications, including the integrated analysis of complex pathways, the construction of new biological parts, and the redesign of existing, natural biological systems. All these areas require the precise and concerted assembly of multiple DNA fragments of various sizes, including chromosomes, and the fine-tuning of gene expression levels and protein activity. Current commercial cloning products are not robust enough to support the assembly of very large or very small genetic elements or a combination of both. In addition, current strategies are not flexible enough to allow further modifications to the original design without resorting to complicated cloning strategies. Here, we present a set of protocols that allow the seamless, simultaneous, flexible, and highly efficient assembly of genetic material, designed for a wide dynamic range of sizes (tens to hundreds of thousands of base pairs). The assembly can be performed either in vitro or within living cells, and the DNA fragments may or may not share homology at their ends. A novel site-directed mutagenesis approach enhanced by in vitro recombineering is also presented.

Life Technologies Corporation, Carlsbad, California, USA
Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00014-0
© 2011 Elsevier Inc. All rights reserved.
1. Overview
The consensus definition of the term “synthetic biology” refers to the design and construction of new biological parts, devices, and systems, and the redesign of existing, natural biological systems for useful purposes (www.syntheticbiology.org). The challenges of this endeavor include the standardization of parts and devices, the modeling of intricate metabolic circuits, and the seamless assembly of fairly large pieces of DNA in a predetermined order to program those circuits. The ultimate goal would be to convert a digitized chromosome sequence into its biochemical counterpart in a single step. This ambition implies that classical recombinant DNA technology, based on restriction enzymes, ligase, and PCR, needs to evolve to cover a wide range of DNA sizes and fragment numbers. Significant progress has been made during the past few years, and advanced systems for assembling multiple DNA fragments have been developed. Some of them, such as RecA-independent recombination, Red/ET recombination, Gateway, or loxP-based systems, are practical and fast but work with a limited number of fragments (Bubeck et al., 1993; Cheo et al., 2004; Datsenko and Wanner, 2000; Hartley et al., 2000; Liu et al., 1998; Zhang et al., 1998). Other approaches, such as ligation-independent cloning, overlap PCR, SLIC, or the use of Type IIS restriction enzymes, require considerable design and/or preparation time (Aslanidis and de Jong, 1990; Gao et al., 2003; Lebedenko et al., 1991; Li and Elledge, 2007). In our experience, the most viable strategies are those that use single-step homologous recombination, either yeast-based or in vitro. In this chapter, we provide a set of protocols based on homologous recombination that are useful for assembling and editing small, intermediate, and large DNA fragments.
2. Yeast-Based Homologous Recombination
The approach described in this section relies on the powerful ability of yeast to take up and recombine DNA fragments. This property was first described over 30 years ago (Hinnen et al., 1978; Orr-Weaver et al., 1981), and since then it has been applied to the generation of plasmids and yeast artificial chromosomes under the name of transformation-associated recombination (TAR; Ebersole et al., 2005; Larionov et al., 1994, 1996; Ma et al., 1987; Marykwas and Passmore, 1995). At the same time, it has been shown that any given DNA sequence can be joined to a vector using short synthetic linkers that bridge the ends (Raymond et al., 1999, 2002). More recently, the technology has been used to build an artificial bacterial chromosome and to assemble genes from oligonucleotides (Gibson, 2009; Gibson et al., 2008a,b, 2010; Lartigue et al., 2009). All these approaches use linearized yeast vectors and fragments with overlapping ends. Because linearized plasmids lacking telomeres fail to replicate (Szostak and Blackburn, 1982), only molecules in which recombination has restored the circular topology and functionality of the plasmid are propagated.
2.1. Yeast–Escherichia coli shuttle vector
A fundamental requirement for the TAR approach is a Saccharomyces cerevisiae replicating vector with a selectable marker. Replication elements routinely used in yeast plasmids consist of either an autonomously replicating sequence (ARS) and a centromere (CEN) or replication and partitioning sequences from the endogenous 2-μm plasmid (Clarke and Carbon, 1980; Murray, 1987; Struhl et al., 1979). The most common selectable markers are those that restore prototrophy for an essential metabolite in auxotrophic cells. We designed a yeast plasmid vector that has ARS4 and CEN5 as replication elements plus the trp1 gene, which encodes the yeast phosphoribosylanthranilate isomerase, as a selectable marker. This marker complements strains deficient in the synthesis of tryptophan, such as those harboring homozygous trp1-901 alleles (Fig. 14.1). Because genetics in yeast is hampered by genetic instability and low yields of purified DNA products, we cloned a bacterial replication origin into the plasmid, effectively converting it into a yeast–E. coli shuttle vector. To maximize DNA capacity, the F′ replication origin and accessory genes from the mini-F plasmid (reviewed by Kline, 1985) were cloned. Other elements, such as a spectinomycin resistance gene (SpnR), an origin of transfer (oriT) for plasmid mobilization, and the cos site from bacteriophage λ, were also included (Fig. 14.1).
[Figure 14.1: plasmid map of pYES1L (9380 bp). Labeled features: I-SceI, oriT, NotI, sopB, sopA, PacI, I-CeuI, trp1, repE, cos, ARS4, CEN5, SpnR.]
Figure 14.1 Map of the linearized BAC/YAC vector pYES1L. Circularization of the plasmid results in a functional E. coli–S. cerevisiae shuttle episome. Unique rare restriction sites such as PacI, I-CeuI, I-SceI, and NotI were added for mapping or further linearization or cloning. For additional details and abbreviations see text.
2.2. DNA assembly using fragments with identical ends
The method presented here is similar to that described earlier (Raymond et al., 1999), with the following modifications: (i) we use the yeast strain MaV203 (MATα leu2-3,112 trp1-901; his3Δ200; ade2-101; cyh2R; can1R; gal4Δ; gal80Δ; GAL1::lacZ SPAL10::URA3 HIS3UASGAL1::HIS3@LYS2), a derivative from a cross between two nonisogenic strains, PCY2 and MaV99 (Chevray and Nathans, 1992; Vidal et al., 1996); (ii) we use chemically competent yeast cells; and (iii) we recommend 30 bp of fragment overlap for constructs <60 kbp, and 50 bp of fragment overlap for constructs >60 kbp. End homology can be added manually to the fragments by PCR-amplifying the elements with oligonucleotides bearing additional sequences at their 5′ ends, or can be exposed by excising the fragments with restriction endonucleases from larger DNA entities. Residual nucleotides derived from the original restriction site do not interfere and are readily eliminated during the recombination process. The recombinogenic properties of S. cerevisiae strongly promote DNA recombination between any pair of highly homologous sequences 50 bp or larger. Therefore, we recommend verifying that the fragment sequences do not share patches of DNA identity other than those at the ends of adjacent fragments.
To avoid using fragments with unwanted similarity, we BLAST our sequences against one another using the Oligo Designer Web tool (www.invitrogen.com/DesignDNAassembly), a free online software tool that detects potential patches of internal homology, designs oligonucleotides for TAR recombination, and provides a GenBank-annotated sequence of the assembled construct.
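For readers who want a quick local check before turning to the Web tool, the following minimal sketch scans an ordered set of fragments for exact shared stretches of k bases (50 by default, following the recommendation above) on either strand, ignoring the intended overlaps between the ends of adjacent fragments. It is not the algorithm used by the Oligo Designer tool. The function names, the exact-match criterion, and the assumption that the assembly is circular with the vector included in the list are choices made for this illustration, and near-identical (mismatched) repeats are not detected.

```python
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def shared_kmers(a, b, k):
    """Yield (i, j) for every exact match a[i:i+k] == b[j:j+k]."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            yield i, j


def unwanted_homology(fragments, k=50, overlap=30):
    """Report shared k-mers that are not part of an intended terminal overlap.

    fragments -- list of (name, sequence) in assembly order; the list is
                 treated as circular, so include the vector as one entry.
    Returns a list of (name_a, pos_a, name_b, pos_b, strand) tuples.
    """
    n = len(fragments)
    hits = []
    for x, y in combinations(range(n), 2):
        name_a, a = fragments[x][0], fragments[x][1].upper()
        name_b, b = fragments[y][0], fragments[y][1].upper()
        for i, j in shared_kmers(a, b, k):
            # Intended overlap: tail of the upstream part, head of the next one.
            a_then_b = y == (x + 1) % n and i >= len(a) - overlap and j + k <= overlap
            b_then_a = x == (y + 1) % n and j >= len(b) - overlap and i + k <= overlap
            if not (a_then_b or b_then_a):
                hits.append((name_a, i, name_b, j, "+"))
        for i, j in shared_kmers(a, revcomp(b), k):
            # Convert the position on the reverse complement back to b's coordinates.
            hits.append((name_a, i, name_b, len(b) - k - j, "-"))
    return hits


if __name__ == "__main__":
    # Toy sequences (hypothetical), scanned with a short k for demonstration.
    frag_a = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATT"
    frag_b = "CCCAATTGGATCCAAAGGAGAAGAACTTTTCTAACTCGAGTTT"
    for hit in unwanted_homology([("A", frag_a), ("B", frag_b)], k=12, overlap=7):
        print(hit)
```

Any hit returned by such a scan marks a region where yeast could recombine at an unintended position, so the affected fragment would need to be redesigned or split.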
2.2.1. Materials needed
DNA fragments to assemble
Stitching oligonucleotides (only if necessary; see above and below)
pYES1L linear cloning vector or your own yeast-adapted cloning vector
MaV203 competent yeast cells (Life Technologies) or equivalent
PEG/LiAc solution (Life Technologies) or any other DNA condensing reagent
0.9% NaCl solution (sterile)
DNase-, RNase-free water
CSM-Trp agar plates (Life Technologies) or equivalent
30 and 42 °C water baths
30 °C incubator
Microcentrifuge
SpeedVac® (Thermo Scientific, Waltham, MA; optional)

In addition, the following materials are needed for screening for the positive clone:
Pair of diagnostic primers for each DNA junction including the cloning vector
Yeast lysis buffer (Life Technologies) or equivalent
Thermocycler and sterile PCR tubes or plates
Platinum® PCR SuperMix (Life Technologies) or similar

Finally, the materials required for DNA transfer from yeast to E. coli are:
Plates with yeast colonies containing the plasmid of interest
Yeast lysis buffer (Life Technologies) or similar
Glass beads (Life Technologies) or similar
SOC medium at room temperature
One Shot® TOP10 electrocompetent E. coli cells (Life Technologies) or similar
Electroporation cuvettes, chilled on ice
Electroporator
LB plates with the appropriate selection antibiotics, prewarmed to 37 °C
2.2.2. Performing the DNA assembly
1. Add the following components to a microcentrifuge tube and mix:
   Linearized vector    100 ng
   DNA fragments        100 ng each (if final construct is ≤25 kbp)
                        200 ng each (if final construct is >25 kbp)
   If the total volume of the DNA mix does not exceed 10 μL, proceed to Step 2. If the total volume of the DNA mix is larger than 10 μL, reduce the total volume to 5–10 μL using a SpeedVac® or a centrifugal filter device. Do not let the liquid dry out completely.
2. Add 100 μL of MaV203 cells, thawed at 30 °C, to the DNA mix (the volume of the DNA mix should be ≤10 μL). Mix well by tapping the tube.
3. Add 600 μL of the PEG/LiAc solution to the DNA/competent cell mixture. Mix by inverting the tube five to eight times until the mix is homogeneous.
4. Incubate the mixture in the 30 °C water bath for 30 min. Invert the tube occasionally (every 10 min) to resuspend the components.
5. Add 35.5 μL of DMSO to the tube. Mix by inverting the tube five to eight times.
6. Heat-shock the cells by incubating the tube in the 42 °C water bath for 20 min. Invert the tube occasionally to resuspend the components.
7. Centrifuge the tube at 1800 rpm (200–400 × g) for 5 min.
8. Carefully discard the supernatant from the tube and resuspend the cell pellet in 1 mL of sterile 0.9% NaCl by gentle pipetting.
9. Plate 100 μL of the transformed cells onto CSM-Trp agar plates. For final constructs >60 kbp, we recommend centrifuging the remaining 900 μL of the transformation mixture, removing approximately 750 μL of the supernatant, resuspending the cell pellet in the remaining 100–150 μL of supernatant, and plating all cells onto another CSM-Trp agar plate to ensure that you have a sufficient number of colonies to screen.
10. Incubate the cells at 30 °C for 3 days and proceed to screening for the correct clone (see below).
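The size-dependent choices in this protocol (amount of each fragment, overlap length from Section 2.2, and whether to plate the whole transformation) can be collected in a small helper. This is a minimal sketch under stated assumptions: the function and key names are hypothetical, the construct size is taken as the sum of the insert lengths without the vector, and a construct of exactly 60 kbp is treated like a smaller one because the text gives no rule for that boundary.

```python
def plan_assembly(fragment_kbp, fragment_ng_per_ul=None, vector_ng_per_ul=None):
    """Summarize DNA amounts and handling for a yeast assembly reaction.

    fragment_kbp       -- list of fragment lengths in kbp (vector excluded)
    fragment_ng_per_ul -- optional list of stock concentrations, one per fragment
    vector_ng_per_ul   -- optional stock concentration of the linearized vector
    """
    total_kbp = sum(fragment_kbp)
    ng_each = 100 if total_kbp <= 25 else 200    # Step 1 of the protocol
    overlap_bp = 30 if total_kbp <= 60 else 50   # Section 2.2, item (iii)
    plan = {
        "total_kbp": total_kbp,
        "vector_ng": 100,
        "fragment_ng_each": ng_each,
        "overlap_bp": overlap_bp,
        "plate_entire_transformation": total_kbp > 60,  # Step 9
    }
    if fragment_ng_per_ul is not None and vector_ng_per_ul is not None:
        if len(fragment_ng_per_ul) != len(fragment_kbp):
            raise ValueError("one concentration is needed per fragment")
        volume_ul = 100 / vector_ng_per_ul + sum(ng_each / c for c in fragment_ng_per_ul)
        plan["dna_mix_ul"] = volume_ul
        plan["concentrate_first"] = volume_ul > 10  # reduce to 5-10 uL before Step 2
    return plan


if __name__ == "__main__":
    # Hypothetical example: three 10-kbp fragments at 50 ng/uL, vector at 50 ng/uL.
    print(plan_assembly([10, 10, 10], [50, 50, 50], 50))
```

For large or complicated assemblies, Section 2.7 reports better results with 80 bp of overlap, so the overlap returned here is best read as a lower bound.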
2.3. DNA assembly using fragments without identical ends
The TAR approach has been applied to recombine adjacent fragments that do not share end homology (DeMarini et al., 2001; Raymond et al., 2002). In this case, the necessary homology is provided in trans by complementary oligonucleotides that overlap both fragments, thereby serving as recombination linkers (stitching oligonucleotides). The method is particularly well
suited for reusing fragments in a new sequence context, or for cloning DNA targets that cannot be readily amplified by PCR. The linker-mediated recombinational feature (oligonucleotide stitching) allows editing the fragment junctions, thus generating required imperfections. This is particularly useful when the removal of end sequences such as restriction sites or primer tails is required. It also permits the opposite type of alteration, such as the addition of foreign sequences to insert restriction sites, small tag coding regions, or small watermarks. We successfully inserted up to 20 bp and effectively deleted up to 12 nucleotides from up to three fragment junctions.

2.3.1. Guidelines for designing the stitching oligonucleotides
Each junction between adjacent fragments requires two oligonucleotides for oligonucleotide stitching: a sense and an antisense oligonucleotide.
Up to five fragments plus a vector can be assembled using stitching oligonucleotides, provided that no more than three junctions are formed by the stitching oligonucleotides and the remaining junctions are produced by shared end-terminal homology.
Oligonucleotides used for oligonucleotide stitching of up to three nonhomologous fragments of <5 kbp must be 80-mers (i.e., they must have a 40-bp overlap with each adjacent fragment).
Prepare stitching oligonucleotide stocks at a final concentration of 100 μM in 1× TE buffer, pH 8 (10 mM Tris–HCl, 1 mM EDTA, pH 8).
Stitching oligonucleotides used for insertion editing must have a 30-nucleotide overlap with each adjacent fragment in addition to the insertion bases (for a total length of up to 80 nucleotides, including up to 20 insertion bases).
Stitching oligonucleotides used for deletion editing must have a 40-nucleotide overlap with each adjacent fragment, annealing up to six nucleotides from the junction into each fragment, thus leaving up to 6 bp at the end of each fragment to be deleted during TAR.
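The length rules above can be expressed as a short design routine. The sketch below is illustrative only: the function name and arguments are hypothetical, it handles either an insertion or a deletion at a junction but not both, and it checks lengths only (it does not evaluate secondary structure, synthesis constraints, or internal homology).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stitching_oligos(upstream, downstream, insertion="", delete_up=0, delete_down=0):
    """Return (sense, antisense) stitching oligonucleotides for one junction.

    upstream, downstream -- sequences of the two fragments, 5' to 3', same strand
    insertion            -- bases to insert at the junction (up to 20)
    delete_up            -- bases to remove from the 3' end of upstream (up to 6)
    delete_down          -- bases to remove from the 5' end of downstream (up to 6)
    """
    if len(insertion) > 20:
        raise ValueError("insertions of more than 20 bases are outside the guidelines")
    if delete_up > 6 or delete_down > 6:
        raise ValueError("deletions of more than 6 bases per fragment end are outside the guidelines")
    if insertion and (delete_up or delete_down):
        raise ValueError("this sketch handles an insertion or a deletion, not both")
    arm = 30 if insertion else 40  # overlap with each adjacent fragment
    kept_up = upstream[:len(upstream) - delete_up].upper()
    kept_down = downstream[delete_down:].upper()
    if len(kept_up) < arm or len(kept_down) < arm:
        raise ValueError("fragment too short for the required overlap")
    sense = kept_up[-arm:] + insertion.upper() + kept_down[:arm]
    return sense, revcomp(sense)


if __name__ == "__main__":
    # Hypothetical fragment ends; insert an EcoRI site (GAATTC) at the junction.
    up = "ACGT" * 25
    down = "TTGCA" * 20
    sense, antisense = stitching_oligos(up, down, insertion="GAATTC")
    print(len(sense), sense)
    print(len(antisense), antisense)
```

A perfect or deletion junction yields 80-mers, and an insertion junction yields 60 nucleotides plus the inserted bases, in line with the guidelines above.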
2.3.2. Materials needed
Required materials are the same as those listed in Section 2.2.1 plus the corresponding double-stranded stitching oligonucleotides (up to three pairs per assembly reaction) as described in Section 2.3.1.

2.3.3. Performing the DNA assembly
1. Add the following components to a microcentrifuge tube and mix:
   pYES1L or your own linearized vector    100 ng
   DNA fragments                           100 ng each (if final construct is ≤25 kbp)
                                           200 ng each (if final construct is >25 kbp)
   Stitching oligonucleotides              500 ng each (20 pmol each)
   If the total volume of the DNA mix does not exceed 10 μL, proceed to Step 2 in Section 2.2.2. If the total volume of the DNA mix is larger than 10 μL, reduce the total volume to 5–10 μL using a SpeedVac® or a centrifugal filter device. Do not let the liquid dry out completely.
2. Proceed to Step 2 of Section 2.2.2.
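The mass and molar amounts given for the stitching oligonucleotides can be related with the usual average-mass approximation. The sketch below assumes about 330 g/mol per nucleotide for single-stranded DNA and about 660 g/mol per base pair for double-stranded DNA; these averages are textbook approximations, not values from this chapter. Under that assumption, 500 ng of a single 80-nucleotide strand is roughly 19 pmol, close to the 20 pmol quoted above.

```python
SS_G_PER_MOL_PER_NT = 330.0  # assumed average mass of one nucleotide (ssDNA)
DS_G_PER_MOL_PER_BP = 660.0  # assumed average mass of one base pair (dsDNA)


def ng_to_pmol(ng, length, double_stranded=False):
    """Convert a DNA mass in ng to pmol of molecules of the given length."""
    per_unit = DS_G_PER_MOL_PER_BP if double_stranded else SS_G_PER_MOL_PER_NT
    # ng * 1e-9 g / (g/mol) * 1e12 pmol/mol simplifies to ng * 1000 / molar mass
    return ng * 1000.0 / (length * per_unit)


if __name__ == "__main__":
    print(f"500 ng of an 80-nt oligonucleotide: {ng_to_pmol(500, 80):.1f} pmol")
    print(f"100 ng of a 5-kbp fragment: {ng_to_pmol(100, 5000, double_stranded=True):.3f} pmol")
```

The comparison shows that the stitching oligonucleotides are present in large molar excess over kilobase-sized fragments at the recommended amounts.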
2.4. Screening for the positive clone
The fastest way to screen for yeast colonies containing the correctly assembled construct is to perform colony-PCR assays using pairs of diagnostic primers that amplify each expected junction. For example, for a five-fragment (plus vector) assembly configuration, six pairs of diagnostic primers should be used. We recommend designing oligonucleotide pairs (forward and reverse) at a distance of 100–250 bp from the ends of each DNA fragment (including the cloning vector) so that the colony-PCR products are 200–500 bp in size and span the junctions between the fragments.
1. Aliquot 15 μL of lysis buffer into PCR tubes or plates.
2. Pick individual yeast colonies one at a time using a sterile 20 μL pipette tip. Leave the tip in the PCR tube or the well until all the colonies have been picked.
3. Resuspend the cells by pipetting up and down three times.
4. Transfer 5 μL of each cell suspension into fresh PCR tubes and store at 4 °C until the colony has been verified as positive (see below).
5. Heat the remaining cells (10 μL) at 95 °C for 5 min in a thermocycler and cool them down at 4 °C or on ice. Briefly centrifuge the PCR tubes or plates to bring down condensed water.
6. Add 40 μL of nuclease-free water to each lysate and pipette up and down three to five times to mix.
7. Set up a PCR master mix for each junction and aliquot 49.5 μL of it into fresh PCR tubes or plates.
8. Add 0.5 μL of each diluted yeast lysate (from Step 6) into each PCR tube or well. Do not exceed 0.5 μL of lysed yeast cells per 50 μL of PCR volume.
9. Vortex to mix the contents and briefly centrifuge to bring down all liquid.
10. Perform PCR cycling in a thermocycler.
11. Load 10 μL onto an agarose gel to visualize the PCR products. Sequencing these PCR products is recommended.
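The bookkeeping behind the diagnostic primers can be sketched as follows. The helper names are hypothetical, and the size formula rests on an assumption made here: each primer distance is measured from the outer (5′) end of the primer to the end of its own fragment, so any homology shared by the two fragments is counted once in the assembled product.

```python
def junctions(parts):
    """List the junctions of a circular assembly.

    parts -- part names in assembly order, cloning vector included.
    Each part is joined to the next one, and the last back to the first.
    """
    return [(parts[i], parts[(i + 1) % len(parts)]) for i in range(len(parts))]


def expected_amplicon(dist_up, dist_down, overlap=0):
    """Expected colony-PCR product size (bp) across one junction.

    dist_up   -- distance of the forward primer from the end of the upstream part
    dist_down -- distance of the reverse primer from the start of the downstream part
    overlap   -- homology shared by the two parts (0 for stitched junctions)
    """
    return dist_up + dist_down - overlap


if __name__ == "__main__":
    parts = ["vector", "F1", "F2", "F3", "F4", "F5"]
    for up, down in junctions(parts):
        size = expected_amplicon(150, 150, overlap=30)
        print(f"{up} / {down}: {size} bp expected")
```

For five fragments plus the vector, the listing contains six junctions, which matches the six primer pairs recommended above.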
2.5. Yeast–E. coli transfer
Plasmid DNA preparations from yeast cells usually result in very low yield and poor DNA quality. Therefore, it is common laboratory practice to retrieve these shuttle plasmids from yeast and transfer them back into E. coli for additional analysis or manipulation. Established protocols for the retrieval procedure, however, are sometimes more cumbersome and time consuming than the initial plasmid transfer into yeast, and in some cases the transfer efficiency is not high enough to actually obtain E. coli colonies (Gunn and Nickoloff, 1995; Marcil and Higgins, 1992). We have developed a highly efficient protocol with a few modifications to an earlier strategy (Summers and Withers, 1990) that streamline the process.
1. Aliquot four to five glass beads into a fresh PCR tube and add 10 μL of lysis buffer.
2. Add 5 μL of the stored cell suspension (Section 2.4) to the lysis buffer/glass bead mix. Pipette up and down three to five times to mix.
3. Vortex the cells at room temperature for 5 min. Do not heat the lysed cells.
4. Add 1 μL of the lysed cells (from Step 3, above) to a vial of electrocompetent cells and mix gently. Do not add more than 1 μL of the lysed cells, to avoid arcing during electroporation.
5. Transfer the cells to the chilled electroporation cuvette on ice.
6. Electroporate the cells following the manufacturer’s recommended protocol.
7. Add 250 μL of prewarmed SOC medium to each vial.
8. Transfer the solution to a 15-mL snap-cap tube and shake for at least 1 h at 37 °C to allow expression of the antibiotic resistance gene.
9. Spread 10–50 μL from each transformation on a prewarmed LB plate supplemented with the appropriate selection antibiotic.
10. Invert the selective plate(s) and incubate at 37 °C overnight.
2.6. Yeast plasmid conversion cassette
In order to extend the usability of this system to any E. coli vector, we designed a linear adaptation cassette with all the features necessary for yeast cloning and replication (Fig. 14.2). The fragment contains all the yeast-related features described for the plasmid pYES1L (see Section 2.1 and Fig. 14.1) plus the ura3 gene encoding the yeast orotidine 5′-phosphate decarboxylase (Boeke et al., 1984). This counter-selectable marker becomes lethal when 5-fluoroorotic acid (5-FOA) is added to the media, because the enzyme converts 5-FOA into the toxic compound 5-fluorouracil. During development, we learned that the use of this feature is not really necessary, as the frequency of plasmid recircularization is negligible. The cassette also contains the bacterial chloramphenicol acetyl transferase gene (chloramphenicol resistance gene, CamR), which facilitates plasmid adaptation by selecting recombinants on agar plates containing chloramphenicol and the antibiotic specific to the vector backbone. Convenient rare restriction sites were added in order to linearize the final adapted molecule for cloning in yeast.
2.7. Typical results
Standard assemblies were performed using different fragment numbers and sizes. We used (i) preexisting fragments that had been previously cloned into pACYC184, (ii) PCR-amplified fragments, or (iii) a combination of both. The recipient plasmid was pYES1L. Additional variables tested were (i) fragment overlap (80 or 30 bp) and (ii) amount of each insert (100 or 200 ng per reaction). For large or complicated assemblies, 200 ng and 80 bp overlap are recommended. However, for simple assemblies (e.g., one fragment), 100 ng of DNA and 30 bp overlap are sufficient for attaining virtually 100% cloning efficiency (Table 14.1).
[Figure 14.2: map of the vector conversion cassette (3834 bp). Labeled features: AscI, AsiSI, NotI, PacI, I-SceI, I-CeuI, URA3, URA3 promoter, CamR, trp1, ARS4/CEN5.]
Figure 14.2 Scheme of the yeast adaptation cassette. For details and abbreviations see text.
Table 14.1 Assemblies using fragments with end homology

Number of      Precloned fragments,        PCR-amplified fragments,   Total size   Overlap   Insert   Cloning
fragments(a)   number × length (kbp)(b)    number × length (kbp)      (kbp)(c)     (bp)      (ng)     efficiency (%)
3              3 × 10                      0                          30           80        100      100
5              5 × 10                      0                          50           80        100      100
10             10 × 10                     0                          100          80        100      50
20             8 × 10                      12 × 0.5–2.5               100          80        100      58
20             8 × 10                      12 × 0.5–2.5               100          80        200      83
1              0                           1 × 0.7                    0.7          30        100      100
1              0                           1 × 10                     10           80        200      100
1              0                           1 × 10                     10           30        200      100
10             0                           10 × 5                     50           30        200      92

(a) Vector not counted as a fragment.
(b) Fragments were initially cloned into pACYC184 and excised by NotI digestion.
(c) Vector size not considered.
For assemblies using stitching oligonucleotides, we tried several conditions, which included (a) the use of single- or double-stranded oligonucleotides, (b) the use of oligonucleotides of varying lengths, and (c) the use of different oligonucleotide:fragment molar ratios. The best results were obtained using 80-mer double-stranded oligonucleotides. However, for simple assemblies (one vector + one fragment), 60-mers exhibited satisfactory performance (Fig. 14.3A). Single-stranded oligonucleotides are not recommended. Up to five fragments plus vector plus three pairs of stitching oligonucleotides can be successfully assembled. We also performed assemblies with imperfect junctions using a combination of two 5 kbp fragments plus a vector (Fig. 14.3B). Recombination efficiency of imperfect junctions should be significantly higher if only one fragment plus vector are joined by stitching oligonucleotides. Oligonucleotides shorter than 60-mers did not perform satisfactorily (not shown).
Figure 14.3 (data shown in the figure panels)

Panel   Fragment size            Oligo size                  Cloning efficiency (%)
A       1 × 1 kbp                60-mer                      94
A       2 × 5 kbp                80-mer                      75
A       3 × 5 + 2 × 0.5 kbp      60-mer                      37
A       3 × 5 + 2 × 0.5 kbp      80-mer                      75
B       2 × 5 kbp                80-mer, 10 bp insertion     63
B       2 × 5 kbp                80-mer, 20 bp insertion     50
B       2 × 5 kbp                80-mer, 12 bp deletion      87

Figure 14.3 Bridging DNA fragments with stitching oligonucleotides. (A) One or more DNA fragments were bridged with the vector or adjacent fragment using double-stranded oligonucleotides perfectly complementary to the bridged molecules’ ends. One hundred to two hundred nanograms of each DNA fragment plus 40 pmol of double-stranded oligonucleotides were transformed into MaV203 yeast competent cells and processed as indicated in the text. (B) Adjacent DNA fragments were bridged with imperfect double-stranded 80 bp oligonucleotides that generate insertions or deletions. DNA fragments are represented by green lines, vector (ends only) by black lines, and oligonucleotides by red lines.

3. In Vitro Recombineering—DNA Assembly

Most of the in vitro methods for joining two or more DNA fragments require specific sequences that leave watermarks or scars at the end of the process. In addition, the number of fragments that can be simultaneously assembled is limited by either poor cloning efficiency or the limited number of available recognition elements. In the case of restriction–ligation approaches, further constraints are imposed by the availability of unique sites in the vector and fragments. Homologous recombination strategies have proven to overcome these issues. Two methods stand out, both of which are based on chewing back and repairing the DNA ends. The first one likely uses vaccinia polymerase (Hamilton et al., 2007; Willer et al., 2000; Zhu et al., 2007). The enzyme has both exonuclease and polymerase activities. It works quite efficiently for the assembly of one or two fragments, but requires two incubation steps at different temperatures, and cloning efficiencies for assembling more than three fragments are quite low (not shown). The second strategy uses a combination of a thermostable polymerase and ligase plus a heat-labile exonuclease and works in an isothermal context (Gibson et al., 2009). This approach efficiently assembles large numbers of fragments of considerable size; however, adjacent fragments must share at least 40 bp of end homology, which is sometimes difficult to accommodate in standard PCR oligonucleotides. It also requires electroporating the cells, which precludes the use of this method in a high-throughput context.

Our method combines the advantages of the strategies above: it requires only 15 bp of end homology, it efficiently assembles more than three fragments, and it readily works with standard transformation protocols. In addition, the experimental design of our approach is greatly facilitated by the Oligo Designer Web tool (www.invitrogen.com/DesignDNAassembly), free online software that designs the oligonucleotides with the required overlap tails and provides a GenBank annotated sequence of the assembled construct.
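The 15 bp end-homology requirement can be checked on a planned assembly before ordering oligonucleotides. The sketch below is not the Oligo Designer tool; it is a minimal, hypothetical check that assumes the fragments are given as plain sequence strings in assembly order (a linearized vector can be represented by placing its two arms at the start and end of the list).

```python
def end_homology(left: str, right: str, max_len: int = 100) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    left, right = left.upper(), right.upper()
    limit = min(len(left), len(right), max_len)
    for n in range(limit, 0, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


def weak_junctions(fragments, min_overlap: int = 15):
    """Return (junction index, homology length) for junctions below min_overlap."""
    problems = []
    for i in range(len(fragments) - 1):
        n = end_homology(fragments[i], fragments[i + 1])
        if n < min_overlap:
            problems.append((i, n))
    return problems
```

An empty list from `weak_junctions` means that every adjacent pair shares at least `min_overlap` bases of identical end sequence.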
3.1. Materials needed

DNA fragments for DNA assembly
Linearized E. coli vector
In vitro recombination buffer (Life Technologies or equivalent)
In vitro recombination enzyme mix (Life Technologies or equivalent)
Deionized, sterile water
One Shot® TOP10 chemically competent E. coli (Life Technologies or similar)
SOC medium at room temperature
LB plates with the appropriate selection antibiotics, prewarmed to 37 °C
3.2. Assembly reaction

1. In a microcentrifuge tube, add the components below in the order they are listed:

   Insert(s)                 20–200 ng each
   Linearized vector         100 ng
   5× reaction buffer^a      4 µL
   Deionized water           to 18 µL
   10× enzyme mix^a          2 µL

   ^a As an example, we list the in vitro recombination buffer and enzyme mix in the GENEART® Seamless Cloning and Assembly Kit (Life Technologies). Other kits are available from different commercial sources.

   For optimum results, use a 2:1 molar ratio of insert:vector.
2. Incubate at room temperature for 30 min.
3. Immediately transform competent E. coli cells, following standard protocols. Important: do not use electrocompetent cells.
4. Plate the cells on LB agar plates with the corresponding antibiotics.
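The 2:1 insert:vector molar ratio can be converted into a mass of insert with a short calculation. The sketch below assumes an average mass of 650 g/mol per base pair of double-stranded DNA, a common approximation that is not part of the protocol; the function names are illustrative.

```python
AVG_BP_MASS = 650.0  # g/mol per base pair of dsDNA (approximation, assumed)


def pmol_of_dsdna(ng: float, length_bp: int) -> float:
    """Convert a mass of dsDNA (ng) into an amount (pmol)."""
    return ng * 1000.0 / (length_bp * AVG_BP_MASS)


def insert_ng_for_ratio(vector_ng: float, vector_bp: int,
                        insert_bp: int, ratio: float = 2.0) -> float:
    """Mass of insert (ng) giving `ratio` moles of insert per mole of vector."""
    return vector_ng * ratio * insert_bp / vector_bp
```

For example, with 100 ng of a 2686 bp vector and a 1 kbp insert, `insert_ng_for_ratio(100, 2686, 1000)` gives about 74 ng, which falls within the 20–200 ng range listed above.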
3.3. Typical results

Fragments of different sizes were amplified and recombined into a linearized vector (Fig. 14.4A). Results showed that at least four fragments can be joined in a predetermined order into a vector with high cloning efficiencies. Results also indicated that fragments amplified with a proofreading polymerase produce significantly higher cloning efficiencies than those amplified with a standard polymerase (Fig. 14.4A).
[Figure 14.4. Panel A: cloning efficiency (%) versus number of fragments (1 or 4) for fragments of 0.3, 1, 1, 2, and 1 kbp amplified with PCR SuperMix or AccuPrime Pfx SuperMix. Panel B: cloning efficiency (%) versus end deletion span (2–200 bp).]

Figure 14.4 In vitro recombineering: DNA assembly. (A) Fragments of the indicated size were amplified using a standard PCR polymerase (PCR SuperMix, Life Technologies) or a proofreading thermostable DNA polymerase (AccuPrime Pfx SuperMix, Life Technologies) and recombined into PstI–KpnI linearized pUC19 plasmid. Identity between the fragments and the vectors was generated by the addition of 15 nucleotides to the 5′-end of the oligonucleotides, whereas identity between adjacent fragments was created with 8 bp primer extensions. (B) Junction editing. Two DNA fragments were simultaneously recombined into a vector. One of the fragments shared 15 bp of homology with the vector at the indicated distances from the vector’s end, generating constructs bearing up to 200 bp deletions.
Table 14.2 Effect of the topology of the end of the vector

Observed cloning efficiency (%)^b for 1, 2, 3, and 4 fragments^a. Vector ends tested: 3′ protruding, 5′ protruding, blunt ended, and PCR amplified; an average is also reported.

99.4 74.5 95.2 87.0
96.0 71.0 96.0 83.0
96.9 69.0 94.7 75.2
92.4 65.0 88.5 51.8
99.6 65.5 99.0 79.0

^a Fragments were PCR amplified with AccuPrime Pfx SuperMix (Life Technologies) using standard oligonucleotides.
^b Cloning efficiency determined by colony PCR and sequencing.
In addition, 1-, 2-, 3-, and 4-fragment assembly reactions were performed using a plasmid linearized with restriction enzymes that leave 3′ protruding, 5′ protruding, or blunt ends. A PCR-amplified vector was also used instead of a digested one. Results showed that the cloning efficiency does not depend on the topology of the vector ends (Table 14.2). An interesting feature of our approach is that recombination can occur not only at the ends of the fragments but also at least up to 200 bp away from their ends (Fig. 14.4B). This attribute is useful for generating cloning variants using a single linearized vector.
4. In Vitro Recombineering—Site-Directed Mutagenesis

During the past three decades, site-directed mutagenesis has become one of the most powerful tools in genetics. Its power lies in its ability, by chemical and/or enzymatic manipulation, to change a specific DNA target in a definable and predetermined way. With the advent of synthetic biology and rational design, the manipulation of genes to produce enzymes with subtle differences with respect to the natural ones and the modification of promoters to finely tune metabolic flows depend even more on reliable site-directed mutagenesis approaches. Commercially available site-directed mutagenesis kits use at least one of the following approaches: (i) the isolation of single-stranded template DNA and the generation of the mutation with one complementary primer (Hutchison et al., 1978); (ii) the design of two sets of PCR primers that overlap the mutation site, the amplification of the template by two PCR reactions, and the cloning of the two PCR fragments and the vector by three-piece ligation (Stemmer and Morris, 1992); (iii) the PCR amplification of a plasmid using complementary oligonucleotides and the subsequent elimination of the template molecule (Hemsley et al., 1989; Kunkel, 1985).

We applied our homologous recombination approach (Section 3) to join the ends of a single DNA molecule, thereby enabling a highly efficient site-directed mutagenesis strategy. The system relies on the inherent properties of a CpG methyltransferase, a high-fidelity thermostable DNA polymerase, recombination enzymes, and the E. coli McrBC restriction–modification system. The DNA methylation and amplification steps are combined into a single reaction followed by a 10-min recombination step. This short in vitro recombination reaction of PCR products increases the colony output by 3- to 10-fold. Finally, the products are transformed into a host strain that degrades the methylated DNA template (Fig. 14.5).
4.1. Primer specifications

Both primers (forward and reverse) should contain the desired mutation.
The mutation site should be centrally located on both primers and can be up to 12 bases (deletions, insertions, and/or any substitutions).
Both primers (forward and reverse) should be approximately 30–45 nucleotides in length, not including the mutation site. Primers longer than 45 nucleotides increase the likelihood of secondary structure formation, which may affect the efficiency of PCR amplification.
Primers should have an overlapping region of 15–20 nucleotides at the 5′ ends, for efficient end-joining of the mutagenesis product.
For most applications, DNA oligonucleotides purified by desalting are generally sufficient, although oligonucleotides purified by HPLC or PAGE may increase the mutagenesis efficiency.
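The length and overlap rules above lend themselves to a quick automated check. The sketch below is a simplification under stated assumptions: it measures total primer length (it does not subtract the mutation site), it treats the 5′ overlap as the longest stretch over which the 5′ end of one primer is exactly complementary to the 5′ end of the other, and all names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def five_prime_overlap(fwd: str, rev: str) -> int:
    """Longest n such that the first n nt of fwd pair with the first n nt of rev."""
    fwd, rev = fwd.upper(), rev.upper()
    best = 0
    for n in range(1, min(len(fwd), len(rev)) + 1):
        if fwd[:n] == reverse_complement(rev[:n]):
            best = n
    return best


def check_primer_pair(fwd: str, rev: str) -> list:
    """Return a list of rule violations (an empty list means the pair passes)."""
    issues = []
    for name, primer in (("forward", fwd), ("reverse", rev)):
        if not 30 <= len(primer) <= 45:
            issues.append(f"{name} primer is {len(primer)} nt (expected 30-45)")
    overlap = five_prime_overlap(fwd, rev)
    if not 15 <= overlap <= 20:
        issues.append(f"5' overlap is {overlap} nt (expected 15-20)")
    return issues
```

Secondary structure and melting temperature are not evaluated here; a dedicated primer analysis tool would be needed for that.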
4.2. Materials needed

Target plasmid DNA
Custom mutagenic oligonucleotide pair
AccuPrime™ Pfx DNA polymerase (Life Technologies) or similar
CpG methyltransferase
S-adenosyl methionine
PCR enhancer (Life Technologies or similar)
Thermocycler
In vitro recombination buffer (Life Technologies) or equivalent
In vitro recombination enzyme mix (Life Technologies) or equivalent
Deionized, sterile water
EDTA
One Shot® MAX Efficiency® DH5α-T1R competent cells (Life Technologies) or similar
SOC medium at room temperature
LB plates with the appropriate selection antibiotics, prewarmed to 37 °C

[Figure 14.5. Panel A: workflow (methylation and amplification, recombination, transformation). Panel B: mutagenesis efficiency (%) for a 12-bp deletion, a 12-bp insertion, a 12-bp mutation, and for 8.9-, 5.9-, and 2.8-kbp plasmids.]

Figure 14.5 In vitro recombineering: site-directed mutagenesis. (A) Workflow of the strategy. Template strands are shown in black. Methylated strands are shown as dotted lines. Oligonucleotides and new strands are shown in blue and red. For details see main text. (B) Plasmids with frameshift mutations in the lacZα gene were subjected to the mutagenesis protocol described in the text, transformed into DH5α-T1R cells, and then plated onto LB agar ampicillin X-gal plates. The mutagenesis efficiency was calculated as the ratio of blue to total colonies.
4.3. Mutagenesis protocol

1. In a PCR tube, add the components below:

   10× AccuPrime™ Pfx reaction mix     1×
   10× PCR enhancer                    1×
   Oligonucleotide pair                0.3 µM each
   Plasmid DNA                         20 ng
   CpG DNA methyltransferase           4 units
   S-adenosyl methionine               160 µM
   AccuPrime™ Pfx                      1 unit
   Deionized, sterile water            to 50 µL

2. Perform PCR using the following parameters:

   37 °C                                   12–20 min^a
   94 °C                                   2 min
   (a) 94 °C                               20 s
   (b) 57 °C                               30 s
   (c) 68 °C                               30 s/kbp of plasmid
   12–18 cycles of (a), (b), and (c)^b
   68 °C                                   5 min
   4 °C                                    As needed

   ^a Perform methylation of the plasmid at 37 °C for 12–20 min. We recommend 12 min for 2.8–4 kbp plasmids and 20 min for 4–14 kbp plasmids.
   ^b The cycling parameters specify a 30-s extension for each 1 kbp of DNA. For optimal mutagenesis efficiency, we recommend 12–15 cycles for 2.8–4 kbp plasmids and 18 cycles for 4–14 kbp plasmids.
3. Analyze 5 µL of the product on a 0.8% agarose gel.
4. Set up the recombination reaction as follows:

   10× reaction buffer          4 µL
   Deionized, sterile water     10 µL
   PCR sample                   4 µL
   10× enzyme mix               2 µL

5. Mix well and incubate at room temperature for 10 min.
6. Stop the reaction by adding 1 µL 0.5 M EDTA. Mix well and place the tubes on ice.
7. Immediately transform competent E. coli cells, following standard protocols. Important: do not use electrocompetent cells.
8. Plate the cells on LB agar plates with the corresponding antibiotics.
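The size-dependent settings of the PCR program in step 2 (methylation time, cycle number, and extension time) can be derived from the plasmid size. The sketch below is illustrative only; because the recommendations list 4 kbp in both size classes, it assumes that a plasmid of exactly 4 kbp is treated as belonging to the smaller class, and the function name is hypothetical.

```python
def mutagenesis_program(plasmid_kbp: float) -> dict:
    """Suggest size-dependent parameters for the methylation/PCR step.

    Assumption: plasmids up to and including 4 kbp use the settings
    recommended for the 2.8-4 kbp class.
    """
    if not 2.8 <= plasmid_kbp <= 14:
        raise ValueError("recommendations cover 2.8-14 kbp plasmids only")
    small = plasmid_kbp <= 4
    return {
        "methylation_min": 12 if small else 20,       # 37 C methylation step
        "cycles": (12, 15) if small else (18, 18),    # (min, max) cycles
        "extension_s": round(30 * plasmid_kbp),       # 30 s per kbp at 68 C
    }
```

For a 5 kbp plasmid this returns a 20-min methylation, 18 cycles, and a 150-s extension.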
4.4. Typical results

In our experiments, we used pUC19-derivative plasmid templates of sizes ranging from 2.8 to 8.9 kbp (Fig. 14.5B). All these plasmids encoded a lacZα gene derivative with a frameshift mutation that enabled us to readily calculate the mutagenesis efficiency by simply plating the cells onto LB agar plates supplemented with X-gal (the mutagenic primers were designed to revert the mutation to a wild-type lacZα gene). We also designed derivatives with insertions, deletions, and substitutions spanning multiple base pairs. Cognate primer pairs were designed to revert those mutants to a wild-type lacZα allele. Results revealed a remarkably high mutagenesis efficiency, which tended to decrease with larger plasmid sizes (Fig. 14.5B). In addition, we generated a 14 kbp plasmid, which was used as a template for a 1-bp transversion using an arbitrary mutagenic primer pair. All 10 independent clones sequenced carried the expected mutation.
5. Concluding Remarks

The emergence of the synthetic biology field clearly calls for reconsideration of the current paradigms in different life science and engineering areas. As this field continues to expand, an agile back-and-forth conversion process between digital information and “analog” biological systems will become a necessity. Our ultimate goal is to attain a comprehensive solution to generate any DNA molecule, from small assemblies up to high-level genetic systems, starting from digital sequences stored in a computer. The approaches presented here are relevant not only for the area of synthetic biology; they also have remarkable implications for current cloning standards.
6. Disclosure

Products are for research use only; not intended for any animal or human therapeutic or diagnostic use.
REFERENCES

Aslanidis, C., and de Jong, P. J. (1990). Ligation-independent cloning of PCR products (LIC-PCR). Nucleic Acids Res. 18, 6069–6074.
Boeke, J. D., LaCroute, F., and Fink, G. R. (1984). A positive selection for mutants lacking orotidine-5′-phosphate decarboxylase activity in yeast: 5-fluoro-orotic acid resistance. Mol. Gen. Genet. 197, 345–346.
Bubeck, P., Winkler, M., and Bautsch, W. (1993). Rapid cloning by homologous recombination in vivo. Nucleic Acids Res. 21, 3601–3602.
Cheo, D. L., Titus, S. A., Byrd, D. R., Hartley, J. L., Temple, G. F., and Brasch, M. A. (2004). Concerted assembly and cloning of multiple DNA segments using in vitro site-specific recombination: Functional analysis of multi-segment expression clones. Genome Res. 14, 2111–2120.
Chevray, P. M., and Nathans, D. (1992). Protein interaction cloning in yeast: Identification of mammalian proteins that react with the leucine zipper of Jun. Proc. Natl. Acad. Sci. USA 89, 5789–5793.
Clarke, L., and Carbon, J. (1980). Isolation of a yeast centromere and construction of functional small circular chromosomes. Nature 287, 504–509.
Datsenko, K. A., and Wanner, B. L. (2000). One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. USA 97, 6640–6645.
DeMarini, D. J., Creasy, C. L., Lu, Q., Mao, J., Sheardown, S. A., Sathe, G. M., and Livi, G. P. (2001). Oligonucleotide-mediated, PCR-independent cloning by homologous recombination. Biotechniques 30, 520–523.
Ebersole, T., Okamoto, Y., Noskov, V. N., Kouprina, N., Kim, J. H., Leem, S. H., Barrett, J. C., Masumoto, H., and Larionov, V. (2005). Rapid generation of long synthetic tandem repeats and its application for analysis in human artificial chromosome formation. Nucleic Acids Res. 33, e130.
Gao, X., Yo, P., Keith, A., Ragan, T. J., and Harris, T. K. (2003). Thermodynamically balanced inside-out (TBIO) PCR-based gene synthesis: A novel method of primer design for high-fidelity assembly of longer gene sequences. Nucleic Acids Res. 31, e143.
Gibson, D. G. (2009). Synthesis of DNA fragments in yeast by one-step assembly of overlapping oligonucleotides. Nucleic Acids Res. 37, 6984–6990.
Gibson, D. G., Benders, G. A., Andrews-Pfannkoch, C., Denisova, E. A., Baden-Tillson, H., Zaveri, J., Stockwell, T. B., Brownley, A., Thomas, D. W., Algire, M. A., Merryman, C., Young, L., et al. (2008a). Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 319, 1215–1220.
Gibson, D. G., Benders, G. A., Axelrod, K. C., Zaveri, J., Algire, M. A., Moodie, M., Montague, M. G., Venter, J. C., Smith, H. O., and Hutchison, C. A., 3rd (2008b). One-step assembly in yeast of 25 overlapping DNA fragments to form a complete synthetic Mycoplasma genitalium genome. Proc. Natl. Acad. Sci. USA 105, 20404–20409.
Gibson, D. G., Young, L., Chuang, R. Y., Venter, J. C., Hutchison, C. A., 3rd, and Smith, H. O. (2009). Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345.
Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., Benders, G. A., Montague, M. G., Ma, L., Moodie, M. M., Merryman, C., Vashee, S., et al. (2010). Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56.
Gunn, L., and Nickoloff, J. A. (1995). Rapid transfer of low copy-number episomal plasmids from Saccharomyces cerevisiae to Escherichia coli by electroporation. Mol. Biotechnol. 3, 79–84.
Hamilton, M. D., Nuara, A. A., Gammon, D. B., Buller, R. M., and Evans, D. H. (2007). Duplex strand joining reactions catalyzed by vaccinia virus DNA polymerase. Nucleic Acids Res. 35, 143–151.
Hartley, J. L., Temple, G. F., and Brasch, M. A. (2000). DNA cloning using in vitro site-specific recombination. Genome Res. 10, 1788–1795.
Hemsley, A., Arnheim, N., Toney, M. D., Cortopassi, G., and Galas, D. J. (1989). A simple method for site-directed mutagenesis using the polymerase chain reaction. Nucleic Acids Res. 17, 6545–6551.
Hinnen, A., Hicks, J. B., and Fink, G. R. (1978). Transformation of yeast. Proc. Natl. Acad. Sci. USA 75, 1929–1933.
Hutchison, C. A., 3rd, Phillips, S., Edgell, M. H., Gillam, S., Jahnke, P., and Smith, M. (1978). Mutagenesis at a specific position in a DNA sequence. J. Biol. Chem. 253, 6551–6560.
Kline, B. C. (1985). A review of mini-F plasmid maintenance. Plasmid 14, 1–16.
Kunkel, T. A. (1985). Rapid and efficient site-specific mutagenesis without phenotypic selection. Proc. Natl. Acad. Sci. USA 82, 488–492.
Larionov, V., Kouprina, N., Eldarov, M., Perkins, E., Porter, G., and Resnick, M. A. (1994). Transformation-associated recombination between diverged and homologous DNA repeats is induced by strand breaks. Yeast 10, 93–104.
Larionov, V., Kouprina, N., Graves, J., Chen, X. N., Korenberg, J. R., and Resnick, M. A. (1996). Specific cloning of human DNA as yeast artificial chromosomes by transformation-associated recombination. Proc. Natl. Acad. Sci. USA 93, 491–496.
Lartigue, C., Vashee, S., Algire, M. A., Chuang, R. Y., Benders, G. A., Ma, L., Noskov, V. N., Denisova, E. A., Gibson, D. G., Assad-Garcia, N., Alperovich, N., Thomas, D. W., et al. (2009). Creating bacterial strains from genomes that have been cloned and engineered in yeast. Science 325, 1693–1696.
Lebedenko, E. N., Birikh, K. R., Plutalov, O. V., and Berlin Yu, A. (1991). Method of artificial DNA splicing by directed ligation (SDL). Nucleic Acids Res. 19, 6757–6761.
Li, M. Z., and Elledge, S. J. (2007). Harnessing homologous recombination in vitro to generate recombinant DNA via SLIC. Nat. Methods 4, 251–256.
Liu, Q., Li, M. Z., Leibham, D., Cortez, D., and Elledge, S. J. (1998). The univector plasmid-fusion system, a method for rapid construction of recombinant DNA without restriction enzymes. Curr. Biol. 8, 1300–1309.
Ma, H., Kunes, S., Schatz, P. J., and Botstein, D. (1987). Plasmid construction by homologous recombination in yeast. Gene 58, 201–216.
Marcil, R., and Higgins, D. R. (1992). Direct transfer of plasmid DNA from yeast to E. coli by electroporation. Nucleic Acids Res. 20, 917.
Marykwas, D. L., and Passmore, S. E. (1995). Mapping by multifragment cloning in vivo. Proc. Natl. Acad. Sci. USA 92, 11701–11705.
Murray, J. A. (1987). Bending the rules: The 2-mu plasmid of yeast. Mol. Microbiol. 1, 1–4.
Orr-Weaver, T. L., Szostak, J. W., and Rothstein, R. J. (1981). Yeast transformation: A model system for the study of recombination. Proc. Natl. Acad. Sci. USA 78, 6354–6358.
Raymond, C. K., Pownder, T. A., and Sexson, S. L. (1999). General method for plasmid construction using homologous recombination. Biotechniques 26, 134–138, 140–141.
Raymond, C. K., Sims, E. H., and Olson, M. V. (2002). Linker-mediated recombinational subcloning of large DNA fragments using yeast. Genome Res. 12, 190–197.
Stemmer, W. P., and Morris, S. K. (1992). Enzymatic inverse PCR: A restriction site independent, single-fragment method for high-efficiency, site-directed mutagenesis. Biotechniques 13, 214–220.
Struhl, K., Stinchcomb, D. T., Scherer, S., and Davis, R. W. (1979). High-frequency transformation of yeast: Autonomous replication of hybrid DNA molecules. Proc. Natl. Acad. Sci. USA 76, 1035–1039.
Summers, D. K., and Withers, H. L. (1990). Electrotransfer: Direct transfer of bacterial plasmid DNA by electroporation. Nucleic Acids Res. 18, 2192.
Szostak, J. W., and Blackburn, E. H. (1982). Cloning yeast telomeres on linear plasmid vectors. Cell 29, 245–255.
Vidal, M., Brachmann, R. K., Fattaey, A., Harlow, E., and Boeke, J. D. (1996). Reverse two-hybrid and one-hybrid systems to detect dissociation of protein-protein and DNA-protein interactions. Proc. Natl. Acad. Sci. USA 93, 10315–10320.
Willer, D. O., Yao, X. D., Mann, M. J., and Evans, D. H. (2000). In vitro concatemer formation catalyzed by vaccinia virus DNA polymerase. Virology 278, 562–569.
Zhang, Y., Buchholz, F., Muyrers, J. P., and Stewart, A. F. (1998). A new logic for DNA engineering using recombination in Escherichia coli. Nat. Genet. 20, 123–128.
Zhu, B., Cai, G., Hall, E. O., and Freeman, G. J. (2007). In-fusion assembly: Seamless engineering of multidomain fusion proteins, modular vectors, and mutations. Biotechniques 43, 354–359.
CHAPTER FIFTEEN

Enzymatic Assembly of Overlapping DNA Fragments

Daniel G. Gibson^1

Contents
1. Introduction
2. Design and Preparation of the dsDNA for In Vitro Recombination
3. Two-Step Thermocycled Assembly of Overlapping dsDNA
4. One-Step Thermocycled Assembly of Overlapping dsDNA
5. One-Step ISO Assembly of Overlapping dsDNA
6. One-Step ISO DNA Assembly of Overlapping ssDNA
Acknowledgments
References
Abstract

Three methods for assembling multiple, overlapping DNA molecules are described. The methods share the same basic approach: (i) an exonuclease removes nucleotides from the ends of double-stranded (ds) DNA molecules, exposing complementary single-stranded (ss) DNA overhangs that are specifically annealed; (ii) the ssDNA gaps of the joined molecules are filled in by DNA polymerase, and the nicks are covalently sealed by DNA ligase. The first method employs the 3′-exonuclease activity of T4 DNA polymerase (T4 pol), Taq DNA polymerase (Taq pol), and Taq DNA ligase (Taq lig) in a two-step thermocycled reaction. The second method uses 3′-exonuclease III (ExoIII), antibody-bound Taq pol, and Taq lig in a one-step thermocycled reaction. The third method employs 5′-T5 exonuclease, Phusion® DNA polymerase, and Taq lig in a one-step isothermal reaction and can be used to assemble both ssDNA and dsDNA. These assembly methods can be used to seamlessly construct synthetic and natural genes, genetic pathways, and entire genomes and could be very useful molecular engineering tools.
J. Craig Venter Institute Inc., Synthetic Biology Group, La Jolla, California, USA
^1 Present address: Daniel G. Gibson, Science Center Drive, San Diego, CA

Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00015-2
© 2011 Elsevier Inc. All rights reserved.
1. Introduction

For nearly 40 years, scientists have had the ability to join DNA sequences to produce combinations that are not present in nature. “Recombinant DNA technology” was initiated soon after the discovery of DNA ligase (Gellert, 1967; Weiss and Richardson, 1967) and restriction endonucleases (Smith and Wilcox, 1970). In 1972, Berg and colleagues constructed the first recombinant DNA fragment, an SV40 hybrid DNA molecule (Jackson et al., 1972). In the following year, Cohen et al. (1973) combined two antibiotic resistance markers in a single plasmid and showed that when bacterial cells take up the recombined plasmid, they become resistant to both antibiotics. By then, a 77-nucleotide gene encoding a yeast alanine transfer RNA had been synthesized from 17 overlapping oligonucleotides (oligos) (Agarwal et al., 1970). It was not long before it was announced that the human genes for somatostatin (Itakura et al., 1977) and insulin (Crea et al., 1978; Goeddel et al., 1979) were synthesized and inserted into a vector for expression in Escherichia coli.

Polymerase chain reaction (PCR) allows specific genes to be amplified from a complex mixture of genomic DNA and cloned into a vector. Restriction sites can be added to PCR primers to allow amplified products to be inserted into vector cloning sites. Overlapping DNA fragments can be assembled during PCR (Horton et al., 1990) using primers specific to the ends of the assembly; however, the size of the product cannot exceed the length that can be readily amplified by PCR. Complementary vector sequence is added to PCR primers in a technique known as ligation-independent cloning (LIC) (Aslanidis and de Jong, 1990). An improvement on this method was made and is known as sequence and ligation-independent cloning (SLIC) (Li and Elledge, 2007), which removes the sequence constraints of LIC. Both methods utilize exonuclease activity to expose complementary ssDNA sequences that can be specifically annealed. In these two systems, ligation of the DNA molecules is not performed in vitro but instead performed within a host organism following transformation of the annealed molecules.

Type II restriction enzymes have traditionally been used to clone DNA fragments by inserting them into a vector. However, this approach is of limited value for assembling multiple pieces at once and for producing large DNA fragments that do not contain a restriction site “scar.” On the other hand, type IIS restriction enzymes, which cleave adjacent to but outside their recognition sites to produce sticky ends specific for the sequences to be joined, have become widely used for constructing seamless DNA molecules. For example, this scheme was used to assemble a 31.5-kb full-length infectious cDNA of the group II coronavirus mouse hepatitis virus strain A59 (Yount et al., 2002) and a completely synthetic 32-kb polyketide synthase gene cluster (Kodumal et al., 2004). However, as the recombined DNA molecules get larger, it becomes increasingly difficult to find a type IIS restriction enzyme that does not cut within the assembly.

Extensive reengineering of genetic elements relies on technology that enables the assembly of small synthetic oligos. A number of in vitro enzymatic strategies are available for the assembly of single-stranded (ss) oligos into larger double-stranded (ds) DNA constructs (Czar et al., 2008; Xiong et al., 2008a,b). For example, oligos can be joined by polymerase cycling assembly and subsequently amplified by PCR (Smith et al., 2003; Stemmer et al., 1995). These dsDNA fragments can then be assembled into a vector by any of the methods described earlier and cloned in E. coli.

Overlapping DNA molecules can also be joined by the action of three enzymes: (i) an exonuclease, which chews back the ends of the fragments and exposes ssDNA overhangs that are specifically annealed; (ii) a polymerase, which fills the gaps in the annealed products; and (iii) a ligase, which covalently seals the resulting nicks. A two-step thermocycled in vitro recombination method was used to join 101 overlapping DNA cassettes into four quarter molecules of the Mycoplasma genitalium genome, each between 136 and 166 kb in size (Gibson et al., 2008). Since then, two additional in vitro recombination methods were described that can join and clone DNA molecules larger than 300 kb, and can be carried out in a single step (Gibson et al., 2009). The simplest approach is a one-step isothermal (ISO) reaction that can be used to join both ssDNA and dsDNA (Gibson et al., 2009, 2010). All three assembly systems utilize the three enzymes above to produce DNA molecules that are covalently joined, and polyethylene glycol (PEG), a macromolecular crowding agent, to stimulate recombination. Because the DNA products are covalently joined, PCR or rolling-circle amplification (RCA) can be performed directly from the reactions (Gibson et al., 2009, 2010).

Although three recombination methods are described later, the one-step ISO system is typically used due to its simplicity. We found that all components of this assembly system can be premixed and kept frozen until needed. Thus, all that is required for DNA assembly is for input ssDNA or dsDNA to be added to this mixture, and the mixture to be briefly incubated at 50 °C.

These approaches could be very useful for cloning multiple inserts into a vector without relying on the availability of restriction sites and for rapidly constructing large DNA molecules. For example, regions of DNA too large to be amplified by a single PCR event can be divided into multiple overlapping PCR amplicons and then assembled into one piece. The ISO system is advantageous for assembling circular products, which accumulate because they are not substrates for any of the three enzymes. The one-step thermocycled method, however, can be used to generate linear assemblies because the exonuclease is inactivated during the reaction. Protocols for these recombination methods are provided later.
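The idea of dividing a region that is too large for a single PCR into overlapping amplicons can be sketched as follows. This is a simplified illustration rather than a published design procedure: it assumes that amplicon boundaries can be placed anywhere (primer quality is ignored), and the function name and the default 40 bp overlap are choices made for the example.

```python
def split_with_overlaps(seq: str, max_len: int, overlap: int = 40) -> list:
    """Split `seq` into pieces of at most `max_len` nt in which each piece
    shares its last `overlap` nt with the start of the next piece."""
    if max_len <= overlap:
        raise ValueError("max_len must be larger than the overlap")
    step = max_len - overlap
    pieces = []
    start = 0
    while True:
        end = min(start + max_len, len(seq))
        pieces.append(seq[start:end])
        if end == len(seq):
            break
        start += step
    return pieces
```

Joining the first piece with every later piece minus its first `overlap` bases reproduces the original sequence, which is a convenient sanity check on a design.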
352
Daniel G. Gibson
2. Design and Preparation of the dsDNA for In Vitro Recombination

DNA molecules are designed such that neighboring fragments contain at least 40 bp of overlapping sequence. However, as much as 500 bp can be used if the procedures are slightly modified as indicated below. If the DNA fragments will originate from PCR products, 40-bp overlapping sequences are introduced at the 5′ ends of the primers used in the amplification reactions. DNA molecules can also be assembled from overlapping restriction fragments. The noncomplementary partial restriction sites are removed during recombination, leaving a contiguous piece of DNA without intervening sequence. DNA fragments are often assembled with a vector to form a circular product. PCR amplification can be used to produce a unique vector containing terminal overlaps to the ends of the DNA fragments being joined. To produce these cloning vectors, each PCR primer includes an overlap with one end of the vector, a restriction site (e.g., NotI) not present within the insert to allow its release from the vector, and an overlap with one end of the DNA fragment assembly. An example of how to assemble a linear fragment into PCR-amplified pUC19 is shown in Fig. 15.1.

1. E. coli strains carrying overlapping restriction fragments, contained within a plasmid, are propagated in Luria Broth (LB) containing the appropriate antibiotic and incubated at 30 or 37 °C for 16 h. The cultures are harvested, and the DNA molecules are isolated using a commercially available kit (e.g., Qiagen's HiSpeed Plasmid Maxi Kit) or by following a standard alkaline-lysis procedure. Plasmid DNA is eluted or resuspended in Tris–EDTA buffer, pH 8.0 (TE buffer). Overlapping DNA fragments are then released from the vector by restriction digestion. These reactions are terminated by heat inactivation or phenol–chloroform–isoamyl alcohol (PCI) extraction and ethanol precipitation. DNA is dissolved in TE buffer and then quantified by gel electrophoresis with standards.
2. Overlapping PCR fragments are produced with a high-fidelity (HF) DNA polymerase such as the hot-start Phusion® polymerase (New England Biolabs, NEB) with the HF buffer. Results may be improved by gel purifying the PCR products prior to DNA assembly. However, this is not necessary; instead, reactions may be column purified (e.g., QIAquick PCR Purification Kit, Qiagen). For the one-step thermocycled and ISO systems, which are not inhibited by the presence of dNTPs, PCR products may be used directly in assembly reactions without additional purification. PCR products are quantified by gel electrophoresis with standards.

(A)
5′-GAAAATGAAGATTATGATGACTTTCTTGAAATCCCTTTACAAGCAGCTAACAAAATAAACAGTTCATT
GCAATTAGGTGATGTGTTGCGAAAACCAATCCCCTTAAAAAA………………
………………ACCACCTATTGTTACTATCATGGGTCATGTTGACCATGGT
AAAACTTCGCTTTTAGACACAATTAGAAAAACTAATGTAACTGCTAAGGAGTTTGGCGGAATTACCC-3′

(B)
Oligo 1 = 5′-GTAAAGGGATTTCAAGAAAGTCATCATAATCTTCATTTTCGCGGCCGCgatcctctagagtcgacctg-3′
Oligo 2 = 5′-AAAACTAATGTAACTGCTAAGGAGTTTGGCGGAATTACCCGCGGCCGCcgggtaccgagctcgaattc-3′

Figure 15.1 Assembly vector primer design. (A) A linear DNA sequence that is to be assembled into a vector. The first and last 40 bp of DNA sequence are underlined. (B) Two primers that could be used to PCR-amplify pUC19 to produce a vector containing overlaps to the sequence shown in (A), thus producing a circle. The primer sequences include regions that can anneal to pUC19 (nonbolded, lowercase), NotI restriction sites (bolded and italicized, uppercase) to release the insert from the vector, and 40-bp overlaps (underlined, uppercase) to the ends of the DNA sequence shown in (A).
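The primer layout of Fig. 15.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the default pUC19-annealing sequences and the NotI site are copied from the two oligos in Fig. 15.1B, and the sketch does not check that the restriction site is absent from the insert.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, preserving letter case."""
    pairs = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(pairs)[::-1]


def design_vector_primers(insert: str,
                          overlap: int = 40,
                          site: str = "GCGGCCGC",                # NotI site, as in Fig. 15.1B
                          anneal_1: str = "gatcctctagagtcgacctg",  # pUC19-annealing region of Oligo 1
                          anneal_2: str = "cgggtaccgagctcgaattc"): # pUC19-annealing region of Oligo 2
    """Build two vector primers: insert overlap + restriction site + vector-annealing region.

    Hypothetical helper. Oligo 1 carries the reverse complement of the first
    `overlap` bases of the insert; Oligo 2 carries the last `overlap` bases.
    """
    if len(insert) < 2 * overlap:
        raise ValueError("insert is shorter than two overlaps")
    oligo_1 = reverse_complement(insert[:overlap]) + site + anneal_1
    oligo_2 = insert[-overlap:] + site + anneal_2
    return oligo_1, oligo_2


if __name__ == "__main__":
    # Only the two 40-bp ends of the Fig. 15.1A sequence matter; the middle is a placeholder.
    ends_only = ("GAAAATGAAGATTATGATGACTTTCTTGAAATCCCTTTAC"
                 + "ACGT" * 10
                 + "AAAACTAATGTAACTGCTAAGGAGTTTGGCGGAATTACCC")
    for name, primer in zip(("Oligo 1", "Oligo 2"), design_vector_primers(ends_only)):
        print(name, primer)
```

Only the terminal 40 bp of the insert enter the primers, so the same function applies to any insert once its ends are known.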
3. Two-Step Thermocycled Assembly of Overlapping dsDNA

This two-step in vitro recombination method for assembling overlapping DNA molecules makes use of the 3′-exonuclease activity of T4 DNA polymerase (T4 pol) to produce ssDNA overhangs, and a combination of Taq DNA polymerase (Taq pol) and Taq DNA ligase (Taq lig) to repair the annealed joints (Gibson et al., 2008, 2009; Fig. 15.2). The reaction is carried out in a thermocycler, in two steps. In the first step, the 3′-ends of the DNA fragments are digested to expose the overlap regions using T4 pol in the absence of dNTPs. The T4 pol is then inactivated by incubation at 75 °C, followed by slow cooling to anneal the complementary overlap regions. In the second step, the annealed joints are repaired using Taq pol and Taq lig at 45 °C in the presence of all four dNTPs. Taq pol is generally used as the gap-filling enzyme in this system because it does not strand-displace, which would lead to disassembly of the joined DNA fragments. It also has inherent 5′ exonuclease activity (or nick translation activity) (Chien et al., 1976), which eliminates the need to phosphorylate the input DNA (a requirement for DNA ligation). This is because 5′-phosphorylated ends are created following nick translation. Further, this activity removes any noncomplementary sequences (e.g., partial restriction sites at the ends of overlapping restriction fragments), which would otherwise end up in the final joined product.

Figure 15.2 Two-step thermocycled in vitro recombination. Two adjacent DNA fragments (magenta and green) sharing terminal sequence overlaps (thickened black line) are joined into one covalently sealed molecule by a two-step thermocycled reaction. Steps shown: chew-back at 37 °C with T4 pol; anneal at 75 °C → 60 °C; repair at 45 °C with Taq pol and Taq lig.

1. A 4× chew-back and anneal (CBA) reaction buffer is prepared. This buffer consists of 20% PEG-8000 (United States Biochemical), 800 mM Tris–HCl, pH 7.5, 40 mM MgCl2, and 4 mM DTT and can be aliquoted and stored at −20 °C for several years.
2. The DNA fragments to be assembled are mixed in a volume not exceeding 10 µl. Approximately 10–100 ng of each DNA segment is used in equimolar amounts. For 5–8 kb DNA fragments, 25 ng substrate DNA is ideal. For larger assemblies, the amount of DNA is increased accordingly (e.g., for 20–32 kb DNA fragments, 100 ng DNA substrate is used). It is best to use 1 kb DNA fragments in 5- to 10-fold excess.
3. In a 0.2-ml PCR tube on ice, a 20-µl reaction is prepared and consists of 5 µl 4× CBA reaction buffer, 0.2 µl of 10 mg/ml BSA (NEB), 0.4 µl of 3 U/µl T4 pol (NEB), and the DNA prepared in step 2.
4. The tube is placed in a thermocycler and cycled as follows: 37 °C for 5 min (<80-bp overlaps) or 15 min (>80-bp overlaps), 75 °C for 20 min, 0.1 °C/s to 60 °C, 60 °C for 30 min, and then 0.1 °C/s to 4 °C.
5. Taq repair buffer (TRB) is prepared. This buffer consists of 5.83% PEG-8000, 11.7 mM MgCl2, 15.1 mM DTT, 311 µM each of the four dNTPs, and 1.55 mM β-nicotinamide adenine dinucleotide (NAD) and can be aliquoted and stored at −20 °C for up to 1 year.
6. Ten microliters of the CBA reaction, following step 4, is added to 25.75 µl of TRB in a tube on ice. In all, 4 µl of 40 U/µl Taq lig (NEB) and 0.25 µl of 5 U/µl Taq pol (NEB) are then added.
7. The reaction is incubated at 45 °C for 15 min.
8. Samples may be used in downstream applications such as PCR or transformed into E. coli. For transformation, 3 µl of the undiluted TRB reaction can be directly electroporated into 30 µl Epi300 cells (Epicentre) in a 1-mm cuvette (BioRad) at 1200 V, 25 µF, and 200 Ω using a Gene Pulser Xcell Electroporation System (BioRad). Cells are allowed to recover in 1 ml Super Optimal broth with Catabolite repression (SOC) medium and then plated onto LB medium containing the appropriate antibiotic.
9. Assembly reactions are analyzed by agarose gel electrophoresis for product formation.
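The volumes in steps 3 and 6 can be checked with a short volume-weighted mixing calculation. The sketch below is illustrative only: it assumes that the volumes in this protocol are microliters and that the dNTP concentration of TRB is micromolar (the unit prefixes are ambiguous in some copies of this text), it treats the CBA reaction as a 1× dilution of the 4× buffer, and it ignores whatever the enzyme storage buffers contribute. The function and dictionary names are hypothetical.

```python
def mix(components):
    """Combine (volume_ul, {solute: concentration}) parts.

    Returns the total volume and the volume-weighted final concentration
    of every solute (each solute keeps the unit it was given in).
    """
    total = sum(volume for volume, _ in components)
    final = {}
    for volume, solutes in components:
        for name, conc in solutes.items():
            final[name] = final.get(name, 0.0) + conc * volume / total
    return total, final


# 1x CBA reaction: 5 ul of 4x CBA buffer in a 20-ul reaction (assumed units)
cba_1x = {"PEG_pct": 5.0, "Tris_mM": 200.0, "MgCl2_mM": 10.0, "DTT_mM": 1.0}

# Taq repair buffer as listed in step 5 (dNTPs assumed to be micromolar)
trb = {"PEG_pct": 5.83, "MgCl2_mM": 11.7, "DTT_mM": 15.1,
       "dNTP_each_uM": 311.0, "NAD_mM": 1.55}

total_ul, final = mix([
    (10.0, cba_1x),   # CBA reaction carried over
    (25.75, trb),     # Taq repair buffer
    (4.0, {}),        # Taq lig; storage-buffer solutes ignored
    (0.25, {}),       # Taq pol; storage-buffer solutes ignored
])

if __name__ == "__main__":
    print(f"total volume: {total_ul} ul")
    for name, conc in sorted(final.items()):
        print(f"{name}: {conc:.2f}")
```

Under these assumptions the repair reaction comes to 40 µl, with the PEG, MgCl2, DTT, dNTP, and NAD concentrations landing close to round numbers, which suggests the odd-looking TRB concentrations were chosen to compensate for the carried-over CBA buffer.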
4. One-Step Thermocycled Assembly of Overlapping dsDNA

A DNA assembly method that requires the absence of dNTPs to achieve exonuclease activity, such as the T4 pol-based system described earlier, cannot be completed in one step. This is because dNTPs are required at a later point to fill in the gapped DNA molecules. ExoIII, which removes nucleotides from the 3′-ends of dsDNA, is fully functional even in the presence of dNTPs, so it is a candidate for a one-step reaction. However, it will compete with polymerase for binding to the 3′-ends. To eliminate this competition, and allow for one-step DNA assembly, antibody-bound Taq pol (Ab-Taq pol) is used in combination with ExoIII (Fig. 15.3). In this assembly method, overlapping DNA fragments and all components necessary to covalently join the DNA molecules (i.e., ExoIII, Ab-Taq pol, dNTPs, Taq lig, etc.) are added in a single tube and placed in a thermocycler (Gibson et al., 2009). At 37 °C, ExoIII is active (but Ab-Taq pol remains inactive) and recesses the 3′-ends of the dsDNA molecules. The reaction is then shifted to 75 °C, which inactivates ExoIII. Annealing of the DNA molecules commences and the antibody dissociates from Taq pol, thus activating this enzyme. Further annealing, extension, and ligation are then carried out at 60 °C.

Figure 15.3 One-step thermocycled assembly of overlapping DNA segments. Two adjacent DNA fragments sharing terminal sequence overlaps are joined into one covalently sealed molecule by a one-step thermocycled reaction. Steps shown: chew-back at 37 °C with ExoIII; anneal at 75 °C → 60 °C; repair at 60 °C with Taq pol and Taq lig.

1. A 4× chew-back, anneal, and repair (CBAR) reaction buffer is prepared. This buffer consists of 20% PEG-8000, 600 mM Tris–HCl, pH 7.5, 40 mM MgCl2, 40 mM DTT, 800 µM each of the four dNTPs, and 4 mM NAD and can be aliquoted and stored at −20 °C for up to 1 year.
2. ExoIII (NEB) is diluted 1:25 from 100 to 4 U/µl in its storage buffer (50% glycerol, 5 mM KPO4, 200 mM KCl, 5 mM β-mercaptoethanol, 0.05 mM EDTA, and 200 µg/ml BSA, pH 6.5). This enzyme dilution can be stored at −20 °C for up to 1 year.
3. The DNA fragments to be assembled are mixed in a volume not exceeding 10 µl. Approximately 10–100 ng of each DNA segment is used in equimolar amounts. For 5–8 kb DNA fragments, 25 ng substrate DNA is ideal. For larger assemblies, the amount of DNA is increased accordingly. It is best to use 1 kb DNA fragments in 5- to 10-fold excess.
4. In a 0.2-ml PCR tube on ice, a 40-µl reaction is prepared and consists of 10 µl 4× CBAR buffer, 0.35 µl of 4 U/µl ExoIII, 4 µl of 40 U/µl Taq lig, and 0.25 µl of 5 U/µl Ab-Taq pol (Applied Biosystems).
5. The tube is placed in a thermocycler and cycled as follows: 37 °C for 5 min (<80-bp overlaps) or 15 min (>80-bp overlaps), 75 °C for 30 min, 0.1 °C/s to 60 °C, and then 60 °C for 1 h.
6. Samples are diluted 1:5 with sterile water and used in downstream applications such as PCR or E. coli transformation as described earlier.
7. Assembly reactions are analyzed by agarose gel electrophoresis for product formation.
8. ExoIII is less active on 3′ protruding termini, which can result from digestion with certain restriction enzymes. This can be overcome by removing the ssDNA overhangs to form blunt ends, prior to assembly, with the addition of T4 pol and dNTPs, as described in the previous method.
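The one-step thermocycler program can be written down as data, which makes the hands-off run time easy to estimate. This is a minimal sketch with hypothetical class and function names. It assumes the shorter chew-back applies to overlaps below 80 bp and the longer one otherwise, counts only the programmed holds and the controlled 0.1 °C/s ramp, and ignores the instrument-dependent time needed to move between holds.

```python
from dataclasses import dataclass


@dataclass
class Hold:
    temp_c: float
    minutes: float


@dataclass
class Ramp:
    target_c: float
    rate_c_per_s: float


def one_step_program(overlap_bp: int):
    """One-step thermocycled program; the 80-bp threshold is an assumption."""
    chew_back_min = 5 if overlap_bp < 80 else 15
    return [
        Hold(37, chew_back_min),  # ExoIII chew-back
        Hold(75, 30),             # ExoIII inactivation, antibody release
        Ramp(60, 0.1),            # slow cooling to anneal
        Hold(60, 60),             # anneal, extend, ligate
    ]


def run_time_minutes(program) -> float:
    """Sum hold times plus controlled ramps; uncontrolled transitions are not counted."""
    total = 0.0
    current = None
    for step in program:
        if isinstance(step, Hold):
            total += step.minutes
            current = step.temp_c
        else:
            if current is None:
                raise ValueError("a ramp needs a preceding hold")
            total += abs(current - step.target_c) / step.rate_c_per_s / 60.0
            current = step.target_c
    return total


if __name__ == "__main__":
    for overlap in (40, 200):
        print(overlap, "bp overlap:", run_time_minutes(one_step_program(overlap)), "min")
```

The same two classes can describe the two-step method by changing the hold list, since both protocols use only holds and fixed-rate ramps.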
5. One-Step ISO Assembly of Overlapping dsDNA

Exonucleases that recess dsDNA from 5′-ends, and are not inhibited by the presence of dNTPs, are also candidates for a one-step DNA assembly reaction. Further, these exonucleases will not compete with polymerase activity. Thus, all activities required for DNA assembly can be simultaneously active in a single ISO reaction. A 50 °C ISO assembly system has been optimized using the activities of the 5′-T5 exonuclease (T5 exo), Phusion® DNA polymerase (Phusion® pol), and Taq lig (Gibson et al., 2009; Fig. 15.4). Taq pol can be used in place of Phusion® pol; however, Phusion® pol is preferred, as it has inherent proofreading activity for removing noncomplementary sequences from assembled molecules. In addition, Phusion® pol incorporates incorrect nucleotides at a significantly lower rate.

Figure 15.4 One-step isothermal assembly of overlapping DNA fragments. Two adjacent DNA fragments sharing terminal sequence overlaps are joined into one covalently sealed molecule by a one-step isothermal reaction. Steps shown: chew-back at 50 °C with T5 exo; anneal at 50 °C; repair at 50 °C with Phusion pol and Taq lig.

1. A 5× ISO reaction buffer is prepared. This buffer consists of 25% PEG-8000, 500 mM Tris–HCl, pH 7.5, 50 mM MgCl2, 50 mM DTT, 1 mM each of the four dNTPs, and 5 mM NAD and can be aliquoted and stored at −20 °C for up to 1 year. This mix can be prepared by combining 3 ml of 1 M Tris–HCl, pH 7.5, 150 µl of 2 M MgCl2, 60 µl of 100 mM dGTP, 60 µl of 100 mM dATP, 60 µl of 100 mM dTTP, 60 µl of 100 mM dCTP, 300 µl of 1 M DTT, 1.5 g PEG-8000, and 300 µl of 100 mM NAD. This will produce 6 ml of 5× ISO buffer.
2. An enzyme–reagent master mixture is prepared by combining 320 µl of 5× ISO reaction buffer, 0.64 µl of 10 U/µl T5 exo (Epicentre), 20 µl of 2 U/µl Phusion® pol, 160 µl of 40 U/µl Taq lig (NEB), and water up to a final volume of 1.2 ml. Fifteen-microliter aliquots of this enzyme–reagent mix can be stored at −20 °C for up to 2 years. This exonuclease amount is ideal for overlaps that are <80 bp. For overlaps that are >80 bp, 3.2 µl exonuclease is used in the mixture.
3. The DNA fragments to be assembled are mixed in a volume not exceeding 5 µl. Approximately 10–100 ng of each DNA segment is used in equimolar amounts. For 5–8 kb DNA fragments, 25 ng substrate DNA is ideal. For larger assemblies, the amount of DNA is increased accordingly. It is best to use 1 kb DNA fragments in 5- to 10-fold excess.
4. In a tube on ice, a 20-µl reaction consisting of 5 µl DNA and 15 µl enzyme–reagent master mixture is prepared, and the reaction is mixed by pipetting.
5. The reaction is incubated at 50 °C for 1 h.
6. Samples are diluted 1:5 with sterile water and used in downstream applications such as PCR or E. coli transformation as described earlier.
7. Assembly reactions are analyzed by agarose gel electrophoresis for product formation.
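The buffer recipe in step 1 and the dilutions in steps 2 and 4 can be cross-checked with the relation C1·V1 = C2·V2. The sketch below is a hedged worked example: it assumes the small recipe volumes are microliters (the unit prefix is ambiguous in some copies of this text) and that the buffer is brought to a 6-ml final volume; the variable and function names are illustrative.

```python
def diluted(stock_conc: float, stock_volume: float, total_volume: float) -> float:
    """Solve C1*V1 = C2*V2 for C2; both volumes must use the same unit."""
    return stock_conc * stock_volume / total_volume


TOTAL_UL = 6000.0  # 6 ml of 5x ISO buffer

# component: (stock concentration in mM, volume added in microliters)
additions = {
    "Tris-HCl": (1000.0, 3000.0),
    "MgCl2": (2000.0, 150.0),
    "dGTP": (100.0, 60.0),
    "dATP": (100.0, 60.0),
    "dTTP": (100.0, 60.0),
    "dCTP": (100.0, 60.0),
    "DTT": (1000.0, 300.0),
    "NAD": (100.0, 300.0),
}

iso_5x_mM = {name: diluted(conc, vol, TOTAL_UL) for name, (conc, vol) in additions.items()}
peg_percent_wv = 1.5 / 6.0 * 100.0  # 1.5 g PEG-8000 in 6 ml, as % (w/v)

# 320 ul of 5x buffer in a 1.2-ml master mix, then 15 ul of mix in a 20-ul reaction
master_mix_strength = diluted(5.0, 320.0, 1200.0)
reaction_strength = diluted(master_mix_strength, 15.0, 20.0)

if __name__ == "__main__":
    for name, conc in iso_5x_mM.items():
        print(f"{name}: {conc:g} mM in 5x buffer")
    print(f"PEG-8000: {peg_percent_wv:g}% (w/v)")
    print(f"buffer strength in the final reaction: {reaction_strength:g}x")
```

Under these assumptions the recipe reproduces the stated 5× composition, and the master mix is a 1.33× stock that reaches 1× once 5 µl of DNA is added.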
6. One-Step ISO DNA Assembly of Overlapping ssDNA

The one-step ISO reaction described earlier can also be used to directly assemble oligos into the pUC19 cloning vector (Gibson et al., 2010; Fig. 15.5). The assembled products are then cloned in E. coli, and the errors originating from the chemical synthesis of the oligos are weeded out by DNA sequencing. To ensure that error-free molecules are obtained at a reasonable efficiency, only eight to twelve 60-base oligos are assembled at one time. The resulting dsDNA molecules can then be assembled by any of the methods described earlier.

Figure 15.5 Isothermal assembly of overlapping oligonucleotides into pUC19. Eight 60-base oligos (red lines) are directly assembled into pUC19 (gray lines), in vitro, to produce a dsDNA fragment. N indicates the NotI restriction site (black line), which is added to release the fragment from the pUC19 vector.

1. The pUC19 assembly vector is prepared. To reduce the background of undesired vector-only clones following assembly and transformation, pUC19 plasmid DNA can be linearized by restriction digestion with BamHI and then extracted from an agarose gel following electrophoresis. This linearized vector can then be diluted to 2 ng/µl and used as template in a PCR with a forward primer having the sequence 5′-gatcctctagagtcgacctg-3′ and a reverse primer having the sequence 5′-cgggtaccgagctcgaattc-3′. Following a standard PCI extraction and ethanol precipitation, the DNA pellets are suspended in TE buffer and then diluted to 200 ng/µl.
2. An enzyme–reagent mixture is prepared as in step 2 in the above protocol, but 20 µl of PCR-amplified pUC19 (200 ng/µl) is included in the 1.2-ml mix.
3. The oligos to be assembled are designed. Adjacent oligos overlap by 20 bp. The oligos at each end of the assembly contain a 20-bp overlap to the termini of PCR-amplified pUC19 and restriction sites not present in the assembled insert (e.g., NotI sites) to allow the release of the synthesized product from pUC19. An example of how this method can be used to synthesize a 284-bp fragment from eight overlapping 60-base oligos is shown in Fig. 15.6. If larger constructs are to be made from a series of synthetic cassettes, 40-bp overlapping sequences can also be designed into these oligos.
4. The oligos to be assembled are pooled. Oligos are prepared without modification or additional purification and suspended to 50 µM in TE buffer. Equal volumes of each oligo are then pooled in groups of 8 or 12 and diluted in TE buffer to a per-oligo concentration of 180 or 75 nM, respectively.
5. In a tube on ice, a 20-µl reaction consisting of 5 µl DNA and 15 µl enzyme–reagent master mixture containing the pUC19 vector is prepared. The reaction is mixed by pipetting up and down.
6. The reaction is incubated at 50 °C for 1 h.
7. Samples are diluted 1:5 with sterile water and used in downstream applications such as PCR or E. coli transformation as described earlier.

(A)
5′-caggtcgactctagaggatc gcggccgcGAAAAAAAGAACCTTTCGGCTATATAGGAATAGTATGAGCAATAATGT
CTATTGGCTTTCTAGGCTTTATTGTATGAGCCCACCACATATTCACAGTAGGATTAGATGTAGACACACGA
GCTTACTTTACATCAGCCACTATAATTATCGCAATTCCTACCGGTGTCAAAGTATTTAGCTGACTTGCAAC
CCTACACGGAGGTAATATTAAATGATCTCCAGCTATACTATGAGCCTTAGGCTTTATTTTCTTATTTACAGT
TGGTGGTCTAACCGGAATTGTT gcggccgc cgggtaccgagctcgaattc -3′

(B)
Oligo 1 = 5′-caggtcgactctagaggatc gcggccgc GAAAAAAAGAACCTTTCGGCTATATAGGAATA-3′
Oligo 2 = 5′-CAATAAAGCCTAGAAAGCCAATAGACATTATTGCTCATACTATTCCTATATAGCCGAAAG-3′
Oligo 3 = 5′-TGGCTTTCTAGGCTTTATTGTATGAGCCCACCACATATTCACAGTAGGATTAGATGTAGA-3′
Oligo 4 = 5′-TGCGATAATTATAGTGGCTGATGTAAAGTAAGCTCGTGTGTCTACATCTAATCCTACTGT-3′
Oligo 5 = 5′-CAGCCACTATAATTATCGCAATTCCTACCGGTGTCAAAGTATTTAGCTGACTTGCAACCC-3′
Oligo 6 = 5′-CATAGTATAGCTGGAGATCATTTAATATTACCTCCGTGTAGGGTTGCAAGTCAGCTAAAT-3′
Oligo 7 = 5′-TGATCTCCAGCTATACTATGAGCCTTAGGCTTTATTTTCTTATTTACAGTTGGTGGTCTA-3′
Oligo 8 = 5′-gaattcgagctcggtacccg gcggccgc AACAATTCCGGTTAGACCACCAACTGTAAATA-3′

Figure 15.6 Overlapping oligonucleotide design for assembly into pUC19. (A) A 340-bp sequence, which includes 20-bp overlapping sequences to PCR-amplified pUC19 (nonbolded lowercase) and NotI restriction sites (bolded and underlined). Because 56 bp is used for assembly into and release from pUC19, only 284 bp of unique sequence (uppercase) is synthesized. (B) The sequence in (A) can be synthesized from the eight 60-mer oligos shown, which contain 20-bp overlaps.
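The oligo layout of Fig. 15.6 follows a regular tiling: 60-base windows advance by 40 bases along the target, and alternate windows are taken from the opposite strand so that neighbors share a 20-bp complementary overlap. A minimal sketch of that tiling, assuming a target whose length fits the tiling exactly (as the 340-bp example does); the function names are hypothetical, and no check for secondary structure or repeats is attempted:

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, preserving letter case."""
    pairs = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(pairs)[::-1]


def tile_oligos(sequence: str, oligo_len: int = 60, overlap: int = 20):
    """Split a target into overlapping oligos on alternating strands.

    Odd-numbered oligos (1, 3, 5, ...) are read from the given strand;
    even-numbered oligos are the reverse complement of their window.
    """
    step = oligo_len - overlap
    if len(sequence) < oligo_len or (len(sequence) - oligo_len) % step != 0:
        raise ValueError("sequence length does not fit the tiling")
    oligos = []
    for index, start in enumerate(range(0, len(sequence) - oligo_len + 1, step)):
        window = sequence[start:start + oligo_len]
        oligos.append(window if index % 2 == 0 else reverse_complement(window))
    return oligos


if __name__ == "__main__":
    # Arbitrary 340-base placeholder target, not the sequence of Fig. 15.6
    target = "".join("ACGT"[(i * i + i // 3) % 4] for i in range(340))
    for number, oligo in enumerate(tile_oligos(target), start=1):
        print(f"Oligo {number} = 5'-{oligo}-3'")
```

With the defaults, a 340-base target yields eight oligos, and the first and last oligos carry the vector overlaps and restriction sites simply because those bases sit at the ends of the target.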
ACKNOWLEDGMENTS The author would like to thank Synthetic Genomics, Inc. (SGI) for funding this work and the synthetic biology group at JCVI for helpful discussions.
REFERENCES Agarwal, K. L., Buchi, H., Caruthers, M. H., Gupta, N., Khorana, H. G., Kleppe, K., Kumar, A., Ohtsuka, E., Rajbhandary, U. L., Van de Sande, J. H., Sgaramella, V., Weber, H., et al. (1970). Total synthesis of the gene for an alanine transfer ribonucleic acid from yeast. Nature 227, 27–34. Aslanidis, C., and de Jong, P. J. (1990). Ligation-independent cloning of PCR products (LIC-PCR). Nucleic Acids Res. 18, 6069–6074. Chien, A., Edgar, D. B., and Trela, J. M. (1976). Deoxyribonucleic acid polymerase from the extreme thermophile Thermus aquaticus. J. Bacteriol. 127, 1550–1557. Cohen, S. N., Chang, A. C., Boyer, H. W., and Helling, R. B. (1973). Construction of biologically functional bacterial plasmids in vitro. Proc. Natl. Acad. Sci. USA 70, 3240–3244. Crea, R., Kraszewski, A., Hirose, T., and Itakura, K. (1978). Chemical synthesis of genes for human insulin. Proc. Natl. Acad. Sci. USA 75, 5765–5769. Czar, M. J., Anderson, J. C., Bader, J. S., and Peccoud, J. (2008). Gene synthesis demystified. Trends Biotechnol. 27, 63–72. Gellert, M. (1967). Formation of covalent circles of lambda DNA by E. coli extracts. Proc. Natl. Acad. Sci. USA 57, 148–155. Gibson, D. G., Benders, G. A., Andrews-Pfannkoch, C., Denisova, E. A., Baden-Tillson, H., Zaveri, J., Stockwell, T. B., Brownley, A., Thomas, D. W., Algire, M. A., Merryman, C., Young, L., et al. (2008). Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 319, 1215–1220. Gibson, D. G., Young, L., Chuang, R. Y., Venter, J. C., Hutchison, C. A., 3rd, and Smith, H. O. (2009). Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345.
Gibson, D. G., Smith, H. O., Hutchison, C. A., 3rd, Venter, J. C., and Merryman, C. (2010). Chemical synthesis of the mouse mitochondrial genome. Nat. Methods 7, 901–903. Goeddel, D. V., Kleid, D. G., Bolivar, F., Heyneker, H. L., Yansura, D. G., Crea, R., Hirose, T., Kraszewski, A., Itakura, K., and Riggs, A. D. (1979). Expression in Escherichia coli of chemically synthesized genes for human insulin. Proc. Natl. Acad. Sci. USA 76, 106–110. Horton, R. M., Cai, Z. L., Ho, S. N., and Pease, L. R. (1990). Gene splicing by overlap extension: tailor-made genes using the polymerase chain reaction. Biotechniques 8, 528–535. Itakura, K., Hirose, T., Crea, R., Riggs, A. D., Heyneker, H. L., Bolivar, F., and Boyer, H. W. (1977). Expression in Escherichia coli of a chemically synthesized gene for the hormone somatostatin. Science 198, 1056–1063. Jackson, D. A., Symons, R. H., and Berg, P. (1972). Biochemical method for inserting new genetic information into DNA of Simian Virus 40: circular SV40 DNA molecules containing lambda phage genes and the galactose operon of Escherichia coli. Proc. Natl. Acad. Sci. USA 69, 2904–2909. Kodumal, S. J., Patel, K. G., Reid, R., Menzella, H. G., Welch, M., and Santi, D. V. (2004). Total synthesis of long DNA sequences: synthesis of a contiguous 32-kb polyketide synthase gene cluster. Proc. Natl. Acad. Sci. USA 101, 15573–15578. Li, M. Z., and Elledge, S. J. (2007). Harnessing homologous recombination in vitro to generate recombinant DNA via SLIC. Nat. Methods 4, 251–256. Smith, H. O., and Wilcox, K. W. (1970). A restriction enzyme from Hemophilus influenzae. I. Purification and general properties. J. Mol. Biol. 51, 379–391. Smith, H. O., Hutchison, C. A., 3rd, Pfannkoch, C., and Venter, J. C. (2003). Generating a synthetic genome by whole genome assembly: phiX174 bacteriophage from synthetic oligonucleotides. Proc. Natl. Acad. Sci. USA 100, 15440–15445. Stemmer, W. P., Crameri, A., Ha, K. D., Brennan, T. M., and Heyneker, H. L. (1995). Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides. Gene 164, 49–53. Weiss, B., and Richardson, C. C. (1967). Enzymatic breakage and joining of deoxyribonucleic acid, I. Repair of single-strand breaks in DNA by an enzyme system from Escherichia coli infected with T4 bacteriophage. Proc. Natl. Acad. Sci. USA 57, 1021–1028. Xiong, A. S., Peng, R. H., Zhuang, J., Gao, F., Li, Y., Cheng, Z. M., and Yao, Q. H. (2008a). Chemical gene synthesis: strategies, softwares, error corrections, and applications. FEMS Microbiol. Rev. 32, 522–540. Xiong, A. S., Peng, R. H., Zhuang, J., Liu, J. G., Gao, F., Chen, J. M., Cheng, Z. M., and Yao, Q. H. (2008b). Non-polymerase-cycling-assembly-based chemical gene synthesis: strategies, methods, and progress. Biotechnol. Adv. 26, 121–134. Yount, B., Denison, M. R., Weiss, S. R., and Baric, R. S. (2002). Systematic Assembly of a Full-Length Infectious cDNA of Mouse Hepatitis Virus Strain A59. J. Virol. 76, 11065–11078.
CHAPTER SIXTEEN

Automated Assembly of Standard Biological Parts

Mariana Leguia,*,†,‡ Jennifer Brophy,* Douglas Densmore,§ and J. Christopher Anderson*,†,‡,¶

Contents
1. Introduction
   1.1. The BglBrick standard and 2ab assembly
   1.2. Software tools
   1.3. Robotics
2. Materials and Methods
   2.1. Materials
   2.2. Design and construction of basic parts in proper format
   2.3. Generation of methylated plasmid DNA
   2.4. High-throughput mini-preps
   2.5. 2ab reaction
   2.6. Transformation
   2.7. Plating
   2.8. Screening of transformants
   2.9. Competent cells
3. Troubleshooting
   3.1. No colonies
   3.2. Very few colonies
   3.3. Colonies of different size
   3.4. Too many colonies to pick cleanly
   3.5. Streaky colonies
   3.6. Lawns or lawny areas
   3.7. Uneven number of colonies on various different strips
4. Concluding Remarks
Acknowledgments
References
* Department of Bioengineering, University of California, Berkeley, California, USA
† QB3: California Institute for Quantitative Biological Research, University of California, Berkeley, California, USA
‡ Synthetic Biology Engineering Research Center, Berkeley, California, USA
§ Department of Electrical and Computer Engineering, Boston University, Boston, Massachusetts, USA
¶ Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00016-4
© 2011 Elsevier Inc. All rights reserved.
Mariana Leguia et al.
Abstract

The primary bottleneck in synthetic biology research today is the construction of physical DNAs, a process that is often expensive, time-consuming, and riddled with cloning difficulties associated with the uniqueness of each DNA sequence. We have developed a series of biological and computational tools that lower existing barriers to automation and scaling to enable affordable, fast, and accurate construction of large DNA sets. Here we provide detailed protocols for high-throughput, automated assembly of BglBrick standard biological parts using iterative 2ab reactions. We have implemented these protocols on a minimal hardware platform consisting of a Biomek 3000 liquid handling robot, a benchtop centrifuge and a plate thermocycler, with additional support from a software tool called AssemblyManager. This methodology enables parallel assembly of several hundred large error-free DNAs with a 96+% success rate.
Abbreviations
2ab   2 antibiotic
A     ampicillin
C     chloramphenicol
K     kanamycin
RT    room temperature
1. Introduction

The field of synthetic biology is experiencing growing demand for robust DNA fabrication technologies that can be automated and scaled. In theory, it should be possible to carry out high-throughput, automated, iterative assembly of standard biological parts using a variety of robotics, software, assembly standards, and assembly protocols. In practice, however, many of the existing assembly standards and protocols require steps that pose significant barriers to automation and scaling. Gel purification of DNA fragments, to name one example, cannot be performed using simple liquid handling robotics. We have developed and continue to optimize a series of biological and computational tools, including the BglBrick standard (Anderson et al., 2010), 2ab (2 antibiotic) assembly (manuscript in preparation), and an automated assembly algorithm (www.clothocad.org), that significantly lower current barriers to automation, allowing us to carry out high-throughput assemblies of standardized biological parts using liquid handling operations run on standard commercially available robotics.
Here, we specifically outline protocols for the construction of composite BglBrick parts by 2ab assembly using a Biomek 3000 liquid handler and a software tool called AssemblyManager. Software tools and sample files required to carry out these protocols are available for download at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html. As described, these methods are designed to be implemented within standard academic research labs at reasonable cost, and to rely on a minimal equipment setup consisting exclusively of a Biomek 3000, a plate-spinning benchtop centrifuge and a plate thermocycler. As such, they are only a template that can be either followed verbatim or modified to accommodate specific needs. We anticipate that given the similarities among various assembly standards, protocols, commercially available robotics, and BioCAD packages, these methods could be easily adjusted, with minor modifications, to build composite parts using alternate standards and protocols run on similar equipment. Where there are steps that can be automated using additional equipment, but not with this minimal setup, we have described available options. Throughout, we have made an effort to point out critical steps and provide troubleshooting guidance.
1.1. The BglBrick standard and 2ab assembly

The BglBrick standard (Anderson et al., 2010) is one of several assembly standards available for the iterative construction of composite genetic devices from standardized biological parts (Fig. 16.1). BglBrick parts are flanked by unique BglII and BamHI restriction sites on their 5′ and 3′ ends, respectively. To join together two parts, A and B, in that order, part A is cut 3′ to the part with BamHI, while part B is cut 5′ to the part with BglII. Given that the two restriction enzymes generate complementary sticky ends, parts can be ligated back together to generate new composite parts that are separated by a small six-nucleotide scar sequence (GGATCT). When translated in frame, the BglBrick scar encodes glycine-serine, a peptide linker innocuous for most protein fusion applications.

At its core, the BglBrick standard is a minimal standard that precisely describes the products of restriction enzyme-based manipulations, while imposing no rules on composition protocols. As such, it deals with the theoretical assembly of DNAs, rather than with their physical assembly. Although BglBrick parts can be assembled using a variety of techniques, such as PCR-based strategies, the methodology described here uses 2ab assembly (manuscript in preparation). 2ab assembly is a DNA fabrication strategy where iterative 2ab reactions are used to build composite parts starting from basic parts according to an assembly tree generated computationally (see Section 1.2) (Fig. 16.2). The 2ab reaction enables ligation by double antibiotic selection of 5′ and 3′ parts, designated as "lefty" and "righty," respectively, located on two different assembly vectors.

The core of 2ab assembly is a set of highly engineered vectors that allow assembly to proceed in iterative cycles of restriction digestion, ligation, transformation, and selection of desired products. Every assembly vector contains two different antibiotic resistance genes (from a total set of three: A, ampicillin; C, chloramphenicol; K, kanamycin) separated by an XhoI restriction site. Given these antibiotics, six different combinations of assembly vectors are possible: AC, CK, KA, AK, KC, and CA. The choice of assembly vector pair for any given 2ab reaction is predetermined such that, once two plasmids recombine to form new architectures, desired child products can be selected away from undesired products, as well as from parents, using differential antibiotic selection.

Though assembly vector pairs can be selected manually, we routinely generate assembly trees using AssemblyManager (see Section 1.2). This is particularly useful for large assemblies, where the number of junctions can be too big to handle manually without introducing human-generated errors. AssemblyManager considers specified standard assembly rule(s), as well as user-defined constraints, to generate one or more assembly trees. Though many trees can be generated to assemble a particular set of target composite parts, the ideal tree requires assembly of the smallest number of intermediate junctions. Once a particular tree is selected, AssemblyManager outputs the commands required to execute those assemblies on our robotics platform.

Figure 16.1 Standard assembly of BglBrick parts (originally published in Anderson et al., 2010). Unique BglII (in red) and BamHI (in blue) restriction sites flank BglBrick basic parts on their 5′ and 3′ ends, respectively. EcoRI and XhoI restriction sites employed in various protocols for part assembly are also shown. Cleavage of each DNA with the appropriate enzyme (color-coded arrowheads) generates compatible cohesive ends. These can be connected head-to-tail by ligation (black arrow) to generate composite parts separated by a 6-nucleotide scar sequence (ggatct). When translated in frame, the scar sequence between parts encodes glycine-serine, a peptide linker innocuous for most protein fusion applications.
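The origin of the six-nucleotide scar can be followed on the top strand alone. The sketch below is a simplified illustration, not part of the BglBrick tooling: it represents each part as a single-stranded string flanked by BglII (A^GATCT) and BamHI (G^GATCC) sites, uses hypothetical function names, and does not check parts for internal restriction sites. Only the two codons needed to read the scar are included in the lookup table.

```python
BGLII = "AGATCT"  # cuts A^GATCT, leaving a GATC overhang
BAMHI = "GGATCC"  # cuts G^GATCC, leaving a compatible GATC overhang

SCAR_CODONS = {"GGA": "Gly", "TCT": "Ser"}  # only the codons present in the scar


def as_bglbrick(part: str) -> str:
    """Flank a part with BglII (5') and BamHI (3') sites, top strand only."""
    return BGLII + part + BAMHI


def join_parts(left: str, right: str) -> str:
    """Join two flanked parts: BamHI end of `left` to BglII end of `right`."""
    for flanked in (left, right):
        if not (flanked.startswith(BGLII) and flanked.endswith(BAMHI)):
            raise ValueError("parts must be flanked by BglII and BamHI sites")
    left_fragment = left[:-5]   # BamHI cut after the first G: keeps ...G
    right_fragment = right[1:]  # BglII cut after the A: keeps GATCT...
    return left_fragment + right_fragment


if __name__ == "__main__":
    part_a, part_b = "ATGAAA", "CCCTAA"  # arbitrary placeholder parts
    composite = join_parts(as_bglbrick(part_a), as_bglbrick(part_b))
    scar = composite[len(BGLII) + len(part_a):][:6]
    print(composite)
    print(scar, [SCAR_CODONS[scar[i:i + 3]] for i in (0, 3)])
```

Because the composite is again flanked by an intact BglII site and an intact BamHI site, it can be passed back into `join_parts`, which mirrors the iterative nature of the standard.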
Figure 16.2 Sample 2ab assembly tree for DNA fabrication. Iterative 2ab reactions are used to build composite parts according to an assembly tree generated by AssemblyManager. Each 2ab assembly reaction proceeds through consecutive steps of restriction digestion, ligation, transformation, and selection of desired products. Selection is mediated by antibiotic resistance markers (C, chloramphenicol in yellow; K, kanamycin in green; A, ampicillin in red) expressed from highly engineered assembly vectors. Given that each assembly vector contains two different antibiotic resistance genes, six different types of vectors are possible: AC, CK, KA, AK, KC, and CA. The choice of assembly vector pair for any given 2ab reaction is predetermined such that, once two plasmids recombine to form new architectures, desired child products can be selected away from undesired products, as well as from parents, by growth selection.
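A pairwise tree like the one in Fig. 16.2 can be sketched by recursively halving an ordered list of parts. This is a deliberately naive illustration and not AssemblyManager's algorithm, which also weighs assembly rules, user constraints, and junctions shared between targets; the function names are hypothetical.

```python
def build_tree(parts):
    """Return a nested tuple describing a pairwise assembly of ordered parts."""
    if not parts:
        raise ValueError("at least one part is required")
    if len(parts) == 1:
        return parts[0]
    middle = len(parts) // 2
    return (build_tree(parts[:middle]), build_tree(parts[middle:]))


def count_junctions(tree) -> int:
    """Each internal node is one 2ab reaction joining a lefty and a righty."""
    if isinstance(tree, str):
        return 0
    left, right = tree
    return 1 + count_junctions(left) + count_junctions(right)


def count_stages(tree) -> int:
    """Number of sequential rounds if independent junctions run in parallel."""
    if isinstance(tree, str):
        return 0
    left, right = tree
    return 1 + max(count_stages(left), count_stages(right))


if __name__ == "__main__":
    tree = build_tree(["A", "B", "C", "D"])
    print(tree, count_junctions(tree), count_stages(tree))
```

For a single target built from n basic parts, any pairwise tree needs n − 1 junctions; what a balanced tree reduces is the number of sequential rounds, and savings in junction count arise only when several targets share intermediates.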
2ab standard assembly proceeds in iterative cycles of 3 steps. In the first step, input plasmids are digested with restriction enzymes to generate lefty and righty fragments; in the second step, fragments are ligated back together in different combinations to generate new parts; and in the last step, new plasmid combinations are transformed into Escherichia coli in order to identify and isolate desired composite parts by growth selection. Prior to the first step of assembly, lefty and righty parts are specified by methylation using E. coli strains specifically engineered to target BglII or BamHI restriction sites. Lefties are transformed into a BglII-methylating strain, while righties are transformed into a BamHI-methylating strain. Following isolation from the corresponding methylation strain, lefties and righties are combined together in one pot and digested with an enzyme cocktail containing BglII, BamHI, and XhoI (Fig. 16.3). Since restriction enzymes do not cut methylated DNA, the BglII restriction site in the lefty part and the BamHI restriction site in the righty part are blocked from digestion. Upon digestion of unprotected sites, a ligation reaction recombines fragments with complementary sticky ends to generate new plasmid architectures. Given that each new plasmid architecture contains a different antibiotic combination, desired child products can be selected away from undesired products, as
[Figure 16.3 diagram. Panel labels: Lefty parent; Righty parent; Digested lefty parent; Digested righty parent; Desired child; Undesired child. Feature labels: 5′ part, 3′ part, BglII, BamHI, XhoI, KnR, CmR, AmpR, Scar.]
Figure 16.3 The 2ab reaction. 2ab reactions are carried out in vitro and used to make junctions between parts located on different assembly vectors. The choice of assembly vector pair for each reaction is dictated by an assembly tree generated by AssemblyManager. Initially, lefty and righty elements are made by harvesting plasmids from strains of E. coli that specifically methylate BglII and BamHI restriction sites, respectively (methylated restriction sites are shown in red). Isolated plasmids are combined in one pot and digested with an enzyme cocktail containing BglII, BamHI, and XhoI. Given the methylation state of each plasmid, lefties are cut with BamHI and XhoI, while righties are cut with BglII and XhoI. Following digestion, a ligation reaction recombines fragments to generate a new composite part with a distinct permutation of antibiotic markers. The desired child product can be selected away from the undesired product, as well as from parents, by growing it in the appropriate combination of antibiotics. Each new composite part can now be used iteratively in additional 2ab reactions to create progressively more complex parts.
well as from parents, simply by growing them in the appropriate antibiotic combination. Further, each new composite part can be used iteratively in additional 2ab reactions to create progressively more complex parts because all the elements necessary for 2ab assembly have been preserved during the reaction: new parts are still flanked by BglII and BamHI sites on the 5′ and 3′ ends, respectively, and two antibiotic resistance genes, albeit in a new combination, are still present and separated by an XhoI site. In order for the next cycle of 2ab assembly to begin, new composite parts need to be specified as lefties or righties by methylation, and compatible assembly vector partners selected. In practice, these choices have already been determined by the assembly tree generated by AssemblyManager at the outset of the assembly.
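The marker bookkeeping behind this growth selection can be illustrated with a short sketch. It assumes, based on the labels in Fig. 16.3, that the desired child carries the first marker of the lefty vector and the second marker of the righty vector, while the undesired child carries the remaining two. The function names are hypothetical and are not part of AssemblyManager.

```python
# Hypothetical sketch of 2ab marker bookkeeping; not AssemblyManager code.
MARKERS = {"A": "ampicillin", "C": "chloramphenicol", "K": "kanamycin"}


def two_ab_products(lefty_vector, righty_vector):
    """Return the marker pairs of the plasmids present after a 2ab reaction.

    Assumption (read off Fig. 16.3): the desired child keeps the first marker
    of the lefty and the second marker of the righty; the undesired child
    keeps the other two.
    """
    for vector in (lefty_vector, righty_vector):
        if len(vector) != 2 or vector[0] == vector[1] or set(vector) - set(MARKERS):
            raise ValueError(f"not a valid assembly vector: {vector!r}")
    return {
        "lefty": lefty_vector,
        "righty": righty_vector,
        "desired": lefty_vector[0] + righty_vector[1],
        "undesired": righty_vector[0] + lefty_vector[1],
    }


def is_selectable(lefty_vector, righty_vector):
    """True if only the desired child grows on its two antibiotics."""
    products = two_ab_products(lefty_vector, righty_vector)
    selection = set(products["desired"])
    if len(selection) != 2:  # desired child would carry the same marker twice
        return False
    others = (products["lefty"], products["righty"], products["undesired"])
    # A competitor survives only if it carries both selection markers.
    return all(not selection <= set(other) for other in others)


pair = two_ab_products("KC", "CA")
print(pair["desired"], pair["undesired"], is_selectable("KC", "CA"))
```

For the KC lefty and CA righty of Fig. 16.3, the sketch gives a KA desired child and a CC undesired child, so growth on kanamycin plus ampicillin removes both parents and the undesired product.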
1.2. Software tools
The development of software tools for the automated design and construction of composite biological parts is an ongoing effort in the Anderson lab. Here we describe two specific tools, OligoDesigner and AssemblyManager, developed to assist with the design and fabrication of BglBrick basic parts, and the automated 2ab assembly of those basic parts into composite parts on a Biomek 3000 robot, respectively. Both of these tools are available for download at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html.
OligoDesigner is a software tool that automates design and fabrication strategies for BglBrick basic parts. It takes as input a list of target DNA sequences in raw format and outputs a list of oligos needed to construct them. OligoDesigner considers several parameters to select design and fabrication strategies, including whether a DNA template is available, whether the part contains a protein coding sequence, and whether additional restriction sites should be removed from the sequence. For each part, it chooses the most appropriate fabrication strategy from a list of methods that includes overlap extension, polymerase chain assembly, enzymatic inverse PCR, and PCR from a DNA template. If internal BamHI, BglII, EcoRI, or XhoI restriction sites are present in the sequence, it removes them with silent mutations. Once all necessary oligos have been designed, OligoDesigner outputs a series of files, including a list of parts and the oligos needed to make each one, human-readable construction files describing the steps needed to build each part, an order form for all needed oligos arrayed in a 96-well plate format, and finally, robot commands for implementing PCA and PCR-based construction using the arrayed oligos ordered.
AssemblyManager is a software tool that computes the robot commands needed for iterative 2ab assembly of BglBrick parts. It takes as input a list of target composite parts described in terms of their basic part composition and outputs an assembly tree. AssemblyManager uses a dynamic programming-based algorithm to perform an analysis on the collection of
specified target parts, and determines a minimal assembly tree that outlines all the reactions needed for construction (Densmore et al., 2010). Reactions are first sorted based on their association with a particular assembly vector and methylating strain, and then arrayed into 96-well plates for various rounds of assembly. Once the assembly tree has been specified, AssemblyManager outputs a list of files, including human-readable instructions for assembly and robot commands in the form of .csv files that can be sent directly to the Biomek 3000 (see Section 1.3). Although the current version of AssemblyManager is specific to BglBrick-based 2ab assembly, the software’s architecture is designed to be reusable for other types of assembly (Fig. 16.4). AssemblyManager describes a “project” as the ensemble of all the component parts, samples, plates, and reactions associated with a set of DNAs undergoing parallel fabrication. A project contains a library of items under fabrication and an ordered set of “stages” of fabrication. A stage is a temporal organizing object that corresponds to one round of BioBrick assembly. Stages are further subdivided into individual “reactions,” which describe mechanical operations applied to a set of samples, such as cherry picking to dispense DNA samples from one plate to another, or adding enzyme cocktails. Within each reaction, units of individual operations carried out in single reaction containers are called “steps.” Steps can have multiple inputs, but they will always have a single reaction container and a single product sample. For example, when two parts are combined into one reaction for ligation, that step has 2 input samples that correspond to separate input parts, but it has only one output reaction container in which the ligation takes place. Steps belong to “trajectories” that vertically integrate them throughout multiple stages of fabrication. 
Trajectories exist to permit redundant fabrication of parts and minimize the impact of failed steps. AssemblyManager allows composite parts to be built multiple times, in parallel, as independent trajectories. If a particular step fails, it can select samples from successful trajectories in order to continue. AssemblyManager contains two additional control structures known as “directors” and “commanders.” A director is the behind-the-scenes implementation of the wizard interface that walks the user through generating stages, reactions, samples, steps, and output files. Every time AssemblyManager is run on a new project, a single newly generated director that operates on the current project is created. A commander is responsible for the implementation of a reaction on a specific hardware platform. It generates specific commands needed to perform individual reaction steps in a form that is understandable by a particular robot. Upon completion of the design of a particular assembly project, AssemblyManager calls on the commanders of each reaction to generate all the files needed to carry out that particular assembly. At the moment, OligoDesigner and AssemblyManager are provided as standalone tools. There is neither direct communication between them nor
Figure 16.4 Architecture of AssemblyManager. AssemblyManager is a software tool for designing standard assembly projects and generating the commands needed to automate these processes on liquid handling robots. Organizationally, AssemblyManager subdivides the fabrication of a set of composite parts into individual “stages” for each round of assembly. Within each stage, “reactions” control the specific operations needed to perform a specific task within an assembly stage. Each individual sample within a reaction belongs to a “step,” and steps are vertically connected through the entire fabrication process by a “trajectory.”
methods to link them to databases or public resources such as the Registry of Standard Biological Parts (http://partsregistry.org). Connectivity functions will be provided in second-generation versions of the tools as apps for the Clotho BioCAD environment. Clotho is a Java-based software environment developed to create, deploy, and share BioCAD tools. It is designed to represent data objects that exist in synthetic biology in a
“hard” form that is sufficient for automation, while remaining agnostic about the mode in which the user will interact with those objects or the specific tasks that the user will want to perform with them. In developing Clotho, we have attempted to include every type of data object currently collected during routine research operations in a synthetic biology lab and to represent those objects in a form that is both machine-understandable and linked to other types of data objects. For example, objects can be used to describe the theoretical composition of DNAs and strains, including their nucleic acid sequences, parts, plasmids, features, and annotations. These theoretical compositional objects can be mapped onto other objects that handle physical samples, such as actual plasmid DNA samples, containers, and plates. These can be further mapped to a third group of experimental data objects, including sequencing reads or GFP measurements of specific samples. A fourth group of family and reference objects enables additional mapping to the published literature or to functional descriptions of the object. Clotho provides persistence to objects through relational databases that at least partially correspond to the Clotho data model. Apps provide more specific tasks to Clotho, such as viewing and editing data, running simulations, and automating various tasks. Detailed information on the Clotho environment is available at www.clothocad.org.
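Returning to the AssemblyManager organization described earlier in this section, the relationship between projects, stages, reactions, steps, and trajectories can be sketched as a small set of data classes. This is an illustrative model only: the class names, field names, and the helper function are assumptions and do not reflect the actual AssemblyManager or Clotho source code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One operation in a single reaction container with a single product."""
    inputs: List[str]                 # one or more input samples
    container: str                    # the single reaction container (a well)
    product: str                      # the single product sample
    trajectory: int                   # redundant build this step belongs to
    succeeded: Optional[bool] = None  # None until the outcome is known


@dataclass
class Reaction:
    """A mechanical operation applied to a set of samples."""
    name: str
    steps: List[Step] = field(default_factory=list)


@dataclass
class Stage:
    """One round of assembly."""
    round_number: int
    reactions: List[Reaction] = field(default_factory=list)


@dataclass
class Project:
    """Library of items under fabrication plus an ordered list of stages."""
    library: List[str]
    stages: List[Stage] = field(default_factory=list)


def successful_steps(stage, product):
    """Steps in this stage that produced `product` and are known to have worked."""
    return [s for r in stage.reactions for s in r.steps
            if s.product == product and s.succeeded]


ligation = Reaction("ligation", [
    Step(["partA", "partB"], "A1", "partA-partB", trajectory=1, succeeded=False),
    Step(["partA", "partB"], "B1", "partA-partB", trajectory=2, succeeded=True),
])
project = Project(library=["partA", "partB"], stages=[Stage(1, [ligation])])
print([s.container for s in successful_steps(project.stages[0], "partA-partB")])
```

In this toy project the same junction is built in two trajectories; when trajectory 1 fails, the next stage can draw its input from the successful step in well B1.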
1.3. Robotics
The protocols described here have been developed for automation with minimal investment in hardware. This places some constraints on the types of operations that can be carried out; thus, not all aspects of the workflow outlined herein are automated. Using a Biomek 3000 Laboratory Automation Workstation (Beckman Coulter) equipped with one P20 Single-Tip Tool and one MP200 Eight-Tip Tool, we automate only liquid handling operations. This “minimal” setup is completed by a table top centrifuge that has the ability to spin at >6000g (Beckman Coulter Allegra™ 25R, or similar equipment) and a thermocycler that accommodates 96-well plates (MJ Research PTC-200 Peltier Thermal Cycler, or similar instrument). Automated operations like heating, cooling, shaking, and filtration, as well as access to integrated equipment like plate readers, precision colony pickers, gridding robots, and dispensers, are possible provided the Biomek 3000 is fitted with additional software and hardware. Throughout this chapter, we provide suggestions for further equipment wherever steps can be readily automated using additional robotics. The Biomek Software that runs the robotics platform contains an intuitive icon-based interface that is easy to operate. We repeatedly use a handful of user-specified programs for various steps along the 2ab assembly cycle, like cherry picking input DNAs into destination wells, dispensing restriction enzyme and ligase cocktails, adding competent cells for transformation,
etc. Each program outlines a series of basic functions that do not change between assembly sets, like “load tool P20,” “aspirate,” “dispense,” or “mix.” These functions are basic commands required to carry out various steps of 2ab assembly. They are always performed in the same order, regardless of the number or complexity of parts being assembled. Each of these programs also contains at least one “transfer from file” function. This function allows the robot to retrieve information that does change between assembly runs, like the source position of input DNAs, or their destination in another plate, or the 2ab combination marking a specific junction, for example. This type of changing information is outlined in Excel spreadsheets saved as .csv files. At the outset of every robot run, the corresponding .csv file is imported into every “transfer from file” function. Currently .csv files are generated using AssemblyManager once a particular assembly tree has been specified. We should point out, however, that it is possible, albeit tedious and error-prone, to generate .csv files manually. The main advantages of software-generated .csv files are that human-generated errors are minimized, and that DNA fabrication is streamlined by reducing the time required to set-up the Biomek 3000 to the simple task of opening a protocol (which does not change) and importing a .csv file (which is generated automatically by AssemblyManager). Throughout this chapter, we provide sample screen shots of every robot program used to automate 2ab assembly. Actual robot files, along with their corresponding .csv files, can be downloaded at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html.
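To make the role of the .csv file concrete, the following is a minimal sketch of how such a file could be generated programmatically. It is not AssemblyManager's implementation. Column names follow the field list given in Section 2.5 (note that the source gives the ligation volume column as both Lig_Volume and LigPos_Volume; this sketch uses LigPos_Volume as an assumption), deck positions P1 to P4 follow the Fig. 16.5 caption, and the enzyme wells "A1" and "A2" and both function names are hypothetical.

```python
import csv
import io

# Column names taken from the field list in Section 2.5 of this chapter.
COLUMNS = [
    "Lefty_Source_Plate", "LeftyPos", "Righty_Source_Plate", "RightyPos",
    "Dest_Plate", "Dest_Well", "EnzSource", "BamBgl", "BamBgl_Volume",
    "LigPos", "LigPos_Volume", "Lefty", "Righty",
]


def build_rows(junctions, enz_plate="P3", dest_plate="P4",
               digest_well="A1", ligase_well="A2"):
    """One row per 2ab reaction; all volumes are in microliters.

    `junctions` holds (lefty_plate, lefty_well, righty_plate, righty_well,
    dest_well) tuples. The enzyme well defaults are hypothetical.
    """
    rows = []
    for lefty_plate, lefty_well, righty_plate, righty_well, dest_well in junctions:
        rows.append({
            "Lefty_Source_Plate": lefty_plate, "LeftyPos": lefty_well,
            "Righty_Source_Plate": righty_plate, "RightyPos": righty_well,
            "Dest_Plate": dest_plate, "Dest_Well": dest_well,
            "EnzSource": enz_plate,
            "BamBgl": digest_well, "BamBgl_Volume": 10,  # digestion cocktail
            "LigPos": ligase_well, "LigPos_Volume": 4,   # ligation cocktail
            "Lefty": 3, "Righty": 3,                      # plasmid inputs
        })
    return rows


def to_csv(rows):
    """Serialize rows to .csv text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(build_rows([("P1", "A1", "P2", "B3", "A1"),
                         ("P1", "A2", "P2", "B4", "A2")])))
```

Because the robot protocol itself never changes, only a table of this kind needs to be regenerated for each assembly run.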
2. Materials and Methods
2.1. Materials
Unless otherwise specified, all enzymes are purchased from NEB. Multiwell blocks and plates used for high-throughput bacterial growth, plating, and processing of samples are purchased from Analytical Sales & Services. 96-well PCR plates used as input and destination plates for plasmid DNA are from 4titude (cat# 4ti-0960/c). Disposable P20 (cat# 1061-2400) and P200 (cat# 1062-2400) tips for the Biomek 3000 are from USA Scientific.
2.2. Design and construction of basic parts in proper format
Basic parts can be constructed using a variety of standard molecular biology tools, including traditional cut-and-paste reactions and PCR-based cloning techniques. To select the “most appropriate” protocols for construction, a number of parameters are considered during the design stage, including the total number of parts in a set, the length of each target part, and its complexity. Part construction protocols can be implemented manually or using the Biomek 3000 (see Section 1.3). Regardless of the methodology utilized for construction, we ensure that parts are flanked by a BglII restriction site on their 5′ end and a BamHI restriction site on their 3′ end, as required by the BglBrick standard. Additional restriction sites used to directionally transfer parts between various vectors are usually also present flanking parts. Thus, common restriction sites, particularly BamHI, BglII, EcoRI, and XhoI, are removed from internal part sequences. Sequence adjustments are performed using OligoDesigner (see Section 1.2) and GeneDesign (http://baderlab.bme.jhu.edu/gd/). Depending on the nature of the part, we may use additional sequence manipulation tools, such as the RBS calculator to calculate target translation rates (Salis et al., 2009), or GeneDesign to codon optimize sequences. Upon completion of construction, parts are verified by sequencing. Prior to the start of 2ab assembly, basic parts need to be housed in the appropriate 2ab assembly vector, as specified by the assembly tree generated using AssemblyManager (see Section 1.2), and methylated as either lefties or righties (see Section 2.3). Parts housed in various vectors can be easily transferred into the correct 2ab assembly vector by directional cut-and-paste cloning using EcoRI and BamHI restriction sites flanking the 5′ and 3′ ends of parts, respectively.
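The check for internal restriction sites can be sketched in a few lines. This is an illustration only and not OligoDesigner code; the function name is hypothetical. The recognition sequences are the standard ones for the four enzymes, and because all four sites are palindromic, scanning one strand is sufficient. The sketch only reports sites; it does not design the silent mutations that remove them.

```python
# Standard recognition sequences for the enzymes named in the text.
SITES = {
    "BamHI": "GGATCC",
    "BglII": "AGATCT",
    "EcoRI": "GAATTC",
    "XhoI": "CTCGAG",
}


def find_internal_sites(sequence):
    """Return (enzyme, 0-based position) for every site found in a part.

    All four sites are palindromic, so a scan of the top strand also
    covers the bottom strand.
    """
    sequence = sequence.upper()
    hits = []
    for enzyme, site in SITES.items():
        start = sequence.find(site)
        while start != -1:
            hits.append((enzyme, start))
            start = sequence.find(site, start + 1)
    return sorted(hits, key=lambda hit: hit[1])


part = "ATGGGATCCAAACTCGAGTT"  # made-up sequence for illustration
print(find_internal_sites(part))
```

A part that returns an empty list carries none of the four sites internally and needs no sequence adjustment on that account.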
2.3. Generation of methylated plasmid DNA
Appropriate methylation of lefty and righty input parts is required for the 2ab reaction to proceed correctly. Lefty and righty parts are generated by transforming and harvesting plasmids from BglII- and BamHI-methylating strains, respectively. The methylating strains used are pir+ MC1061 derivatives capable of replicating R6K origins of replication found in all assembly vectors (see Sections 2.6 and 2.9). They were generated by moving parts J72007 and J72013 (Anderson et al., 2010) from donor cells to pir+ MC1061 via P1vir phage transduction, followed by removal of the resistance marker associated with the methylation devices.
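The protection logic can be summarized in a short sketch. It assumes that methylation fully blocks the protected site, and it uses the standard cut positions of BamHI (G^GATCC) and BglII (A^GATCT), which leave the same 5′-GATC overhang. The function names are hypothetical.

```python
COCKTAIL = ("BglII", "BamHI", "XhoI")
# Site blocked by the methylating strain each plasmid was harvested from.
PROTECTED = {"lefty": "BglII", "righty": "BamHI"}
SITES = {"BamHI": "GGATCC", "BglII": "AGATCT"}


def cutting_enzymes(role):
    """Enzymes from the one-pot cocktail that still cut a lefty or righty."""
    blocked = PROTECTED[role]
    return [enzyme for enzyme in COCKTAIL if enzyme != blocked]


def junction_scar():
    """Sequence formed when a BamHI-cut end is ligated to a BglII-cut end.

    BamHI leaves the first base of its site (G) on the upstream fragment;
    BglII leaves the last five bases of its site (GATCT) on the downstream
    fragment.
    """
    upstream = SITES["BamHI"][:1]
    downstream = SITES["BglII"][1:]
    return upstream + downstream


print(cutting_enzymes("lefty"), cutting_enzymes("righty"), junction_scar())
```

The resulting scar, GGATCT, matches neither recognition sequence, which is why a junction, once made, is not recut in later rounds of assembly.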
2.4. High-throughput mini-preps
Although it may not seem obvious at first, generating high-quality mini-prep DNA is the single most critical factor for the success of automated assembly as described herein. High-throughput mini-preps can be carried out using an assortment of equipment, methods, and reagents available from a variety of vendors. Differences among extraction chemistries are usually minor; thus, only small changes are needed to adapt one into another or to switch to different hardware. We have thoroughly investigated a variety of protocols, including automated and manual ones, with varying degrees of success. Our observations suggest that magnetic bead-based chemistries that can be implemented using the Biomek 3000, such as the Agencourt CosMCPrep System (Beckman Coulter Genomics), the Wizard MagneSil Plasmid Purification System (Promega), and MagAttract 96 Miniprep Core Kit (Qiagen), do not work well using our current setup. Disadvantages include high heterogeneity of samples, including preps that yield no DNA, and high star activity, which can significantly lower DNA quality. We have also tested vacuum manifold-based protocols with mixed results. The main disadvantages of these systems are that the setup has a tendency to clog and that samples are prone to cross-contamination. Given our existing minimal robotics platform and the importance of high-quality DNA, we currently mini-prep plasmid DNA by hand using P200 and P1000 multichannel pipettes. We use NucleoSpin plasmid columns in Multi-8 format (Macherey-Nagel) according to the manufacturer’s directions. However, we routinely substitute Qiagen buffers for Macherey-Nagel buffers (P1 resuspension buffer for A1, P2 lysis buffer for A2, N3 neutralization buffer for A3, PB binding buffer for AW, PE wash buffer for A4, and TE elution buffer for AE), though we never mix the two: all steps are carried out using either all Qiagen or all Macherey-Nagel buffers, with Macherey-Nagel Multi-8 format columns. Filtration steps are carried out by centrifugation using a table top centrifuge (Beckman Coulter Allegra™ 25R, or similar instrument) fitted with a plate rotor (type S57002). Multi-8 columns are kept in place atop collection blocks and plates during centrifugation steps using Macherey-Nagel metal column holder C attachments (ref# 740684). Others have reported alternate automated mini-prep setups (Grunberg et al., 2010) of sufficient quality for BioBrick-based assembly.
Before starting:
1. Ensure that fresh, single colony transformants of all parts are available for picking from LB plates supplemented with appropriate 2ab combinations (A = ampicillin at 100 μg/mL, C = chloramphenicol at 25 μg/mL, and K = kanamycin at 25 μg/mL).
   a. Plasmids need to be transformed into the appropriate methylation strains (see Sections 2.6 and 2.9).
2. Prepare saturated bacterial cultures.
   a. Aliquot 1 mL LB liquid medium supplemented with appropriate 2ab combinations into the appropriate well(s) of a sterilized (by autoclaving) 96-well block (Analytical Sales & Services, cat# 27P687).
   b. Use a sterile toothpick to pick a single colony growing on a freshly streaked plate and insert into the appropriate well to inoculate. Continue until all necessary input parts have been picked.
   c. Remove toothpicks, taking care not to contaminate adjacent wells.
   d. Cover block with AeraSeal film (Phenix Research Products, cat# B100).
   e. Grow at 37 °C, overnight, with shaking.
Mini-prep protocol
1. Pellet cells growing in 96-well blocks by centrifugation at 2000g for 5 min at RT.
2. Invert block over a container large enough to catch supernatants for appropriate disposal.
   a. Use a single continuous downward motion in order to remove as much supernatant as possible without disturbing pellets, which should remain intact and attached to the bottom of the block.
3. Use paper towels to dry the surface of the block.
4. Add 250 μL resuspension buffer with RNase to each well and carefully vortex the block until pellets are completely resuspended.
   a. Proper resuspension of pellets is essential to obtain high-quality DNA preps. Incomplete resuspension can lead to incomplete bacterial lysis and lower DNA yields. Though pellets should be completely resuspended, avoid vigorous vortexing, as this can lead to sheared genomic DNA that can contaminate preps, lowering both DNA quality and yield.
5. Add 250 μL lysis buffer and incubate at RT until suspension clears (usually within 5 min).
6. Add 350 μL neutralization buffer (solution will form a white proteinaceous precipitate).
7. Clear lysates by centrifugation at 6000g for 10 min at RT.
8. In the meantime, set up Multi-8 columns over a collection block (like the one used to grow the bacterial cultures) using metal column holder C attachments to keep them in place.
9. Transfer supernatants to Multi-8 columns using a multichannel pipette.
   a. Take care not to transfer any of the white pellets, which could clog columns.
   b. Although we routinely transfer cleared lysates into Multi-8 columns by hand using a multichannel pipette, this step can be automated using the Biomek 3000 if the user finds it easier to complete the transfer without disturbing the pellet. The robot file for this operation can be downloaded at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html.
10. Filter DNA onto column by centrifugation at 1000g for 1 min at RT.
11. Discard flow-through.
12. Add 600 μL binding buffer and filter through by centrifugation at 1000g for 1 min at RT.
   a. Although optional, this step significantly increases the quality of the prep and should not be skipped.
13. Discard flow-through.
14. Add 1 mL wash buffer and filter through by centrifugation at 1000g for 1 min at RT.
15. Discard flow-through.
16. Thoroughly dry columns by centrifugation at 2000g for 3 min at RT.
   a. This step should be performed carefully since even residual amounts of ethanol can significantly inhibit downstream applications like restriction digestion and ligation.
17. Transfer columns along with metal column holder C attachments onto a clean 96-well PCR plate.
   a. This will be the “stock” plate.
18. Add 50 μL of water and elute by centrifugation at 2000g for 2 min at RT.
19. Repeat elution and centrifugation steps with an additional 50 μL of water.
   a. Final prep volumes should be 100 μL.
20. A note about DNA concentration: although possible, we do not quantify or normalize DNA preps upon completion of extraction. Instead, we ensure that our extraction protocols are always run in the same way, starting from equivalent volumes of saturated cultures of comparable bacterial strains replicating similar plasmids.
2.5. 2ab reaction
The program that carries out the 2ab reaction is subdivided into three separate sections, also known as “groups” of commands. The first distributes a digestion cocktail containing BglII, BamHI, and XhoI into all destination wells used in a particular assembly; the second cherry picks methylated lefty and righty input parts from a source plate (or plates) into specific wells in a destination plate (or plates); the last distributes a ligation cocktail containing T4 DNA ligase. The second and third groups of commands are separated by a pause during which the reaction plate is manually transferred into a plate thermocycler (MJ Research PTC-200 Peltier Thermal Cycler, or similar instrument). The thermocycler executes a program consisting of an incubation step at 37 °C during which restriction enzymes digest plasmid input parts, followed by a heat-kill step at 65 °C during which restriction enzymes are inactivated in preparation for the subsequent addition of ligase. Manual transfers of plates between equipment pieces are currently necessary because our existing minimal Biomek 3000 setup allows us to automate liquid handling operations only. The robot can be fitted with additional tools and accessories to carry out further operations, including heating, cooling, shaking, etc. In addition, dispenser robots, which in some cases are fast and do not consume tips, are an alternative to our tip-based setup, which can get expensive with increased assembly throughputs. High-precision dispensers, like the BioRaptor (Beckman Coulter) and the Tempest (Formulatrix), have the advantage of handling small volumes and multiple inputs, but the disadvantage of requiring a significant up-front financial investment. Numerous low-precision dispensers are available at significantly more affordable prices; however, most are not able to handle volumes smaller than 1 μL. Given our current setup, we minimize pipetting errors by always keeping aliquoted volumes at 3 μL or above. Digestions are carried out in 16 μL volumes containing 10 μL of premixed digestion cocktail (1 μL 10× NEB Buffer 2, 0.5 μL BamHI, 0.5 μL BglII, 0.5 μL XhoI, 7.5 μL water) and 3 μL each of lefty and righty plasmid DNA. Ligations are carried out by adding 4 μL of premixed ligation cocktail (0.4 μL NEB Buffer 2, 2 μL 10 mM ATP, 0.5 μL T4 DNA ligase, 1.1 μL water) into each heat-killed 16 μL digestion reaction. The information required to run all three groups of commands specifying pipetting operations along the entire 2ab reaction protocol is listed in a single .csv file that is imported into every “transfer from file” step. “Transfer from file” steps appear four times along the 2ab reaction protocol. A screenshot of a sample robot program file for the 2ab reaction is shown in Fig. 16.5. The actual sample robot file and corresponding .csv file can be downloaded at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html.
Before starting:
1. Dilute the “stock” plate into an input plate referred to as the “dilution” plate.
   a. Mini-prepped input parts should be diluted 1:1 with 2× NEB Buffer 2 such that the “dilution” plate contains plasmids in 1× NEB2.
   b. Provided that the stock plate is already properly arrayed, it is usually faster to perform dilutions by hand using a multichannel pipette.
   c. If input parts need to be re-arrayed into new positions, it is usually faster to use the Biomek 3000 to cherry pick plasmids into new well positions already containing an appropriate volume of 2× NEB Buffer 2.
2. Prepare digestion cocktail.
   a. You will need 10 μL of digestion cocktail per reaction.
   b. Prepare amount = n + 2 reactions. For n = 1 you will need:
      i. 1 μL 10× NEB2 buffer
      ii. 0.5 μL BamHI
      iii. 0.5 μL BglII
      iv. 0.5 μL XhoI
      v. 7.5 μL water
   c. Keep on ice until ready to use.
3. Prepare ligation cocktail.
   a. You will need 4 μL of ligation cocktail per reaction.
   b. Prepare amount = n + 2 reactions. For n = 1 you will need:
      i. 0.4 μL 10× NEB2 buffer
      ii. 2 μL 10 mM ATP
      iii. 0.5 μL T4 ligase
      iv. 1.1 μL water
   c. Keep on ice until ready to use.
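The cocktail arithmetic above can be scripted so that master mixes scale with the size of an assembly run. The following is a minimal sketch, with all volumes in microliters and the per-reaction recipes taken from the lists above; the dictionary and function names are hypothetical, and the two-reaction excess follows the "n + 2" rule given in the text.

```python
# Per-reaction recipes in microliters, as listed in the text.
DIGESTION_UL = {
    "10x NEB Buffer 2": 1.0, "BamHI": 0.5, "BglII": 0.5,
    "XhoI": 0.5, "water": 7.5,
}
LIGATION_UL = {
    "10x NEB Buffer 2": 0.4, "10 mM ATP": 2.0,
    "T4 DNA ligase": 0.5, "water": 1.1,
}


def master_mix(recipe, n_reactions, extra=2):
    """Scale a per-reaction recipe to n_reactions plus `extra` spare reactions."""
    if n_reactions < 1:
        raise ValueError("need at least one reaction")
    scale = n_reactions + extra
    return {reagent: round(volume * scale, 2) for reagent, volume in recipe.items()}


# Sanity checks on the recipes: 10 uL digestion and 4 uL ligation per reaction.
assert abs(sum(DIGESTION_UL.values()) - 10.0) < 1e-9
assert abs(sum(LIGATION_UL.values()) - 4.0) < 1e-9

for reagent, volume in master_mix(DIGESTION_UL, 94).items():
    print(f"{reagent}: {volume} uL")
```

Each reaction then receives 10 μL of digestion cocktail plus 3 μL of each plasmid (16 μL), followed by 4 μL of ligation cocktail for a final volume of 20 μL.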
Figure 16.5 Screenshot of robot program used for the 2ab reaction. The “transfer from file” command is used four times, once to distribute the restriction enzyme cocktail, once to cherry pick lefty input parts, once to cherry pick righty input parts, and once to distribute the ligation cocktail. Part of the .csv file associated with this program is shown within the red box, along with additional specification points for the first “transfer from file” command, which dispenses the restriction enzyme cocktail. The other three “transfer from file” commands are associated with the same .csv file and contain similar information as related to the distribution of the other materials. The robot program and its corresponding .csv file are available for download at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html. At the outset of the run, the robot deck should be laid out as shown in the green box. As specified in the .csv file for this particular run, there are two input “dilution” plates located at positions P1 and P2. The destination “reaction” plate is located at P4. The source plate containing enzyme cocktails is located at P3.
4. Generate the robot file for the 2ab reaction. a. This program will be used to dispense digestion cocktail, cherry pick input parts into destination wells and dispense ligase cocktail every time 2ab assembly is carried out. b. Information detailing source location of all input reagents and parts, as well as destination plate(s) and wells for each, is specified in corresponding .csv file (see step 5 below). i. The “transfer from file” command is used four times, once to distribute the restriction enzyme cocktail, once to cherry pick lefty input parts, once to cherry pick righty input parts, and once to distribute the ligation cocktail, as specified by the corresponding .csv file. c. Robot program commands should appear in the following order: i. Start ii. Instrument Setup iii. Distribute restriction enzyme cocktail 1. Load Tool: P20 2. Transfer From File a. File specifies source position in column EnzSource b. File contains source well information in column BamBgl c. File specifies destination position in column Dest_Plate d. File contains destination well information in column Dest_Well e. File contains volume information in column BamBgl_Volume f. Skip zero volume transfers 3. Unload Tips 4. Unload Tool Step 5. End Group iv. Distribute lefties 1. Load Tool: P20 2. Transfer From File a. File specifies source position in column Lefty_Source_Plate b. File contains source well information in column LeftyPos c. File specifies destination position in column Dest_Plate d. File contains destination well information in column Dest_Well e. File contains volume information in column Lefty f. Skip zero volume transfers 3. Unload Tips 4. Unload Tool Step 5. End Group v. Distribute righties 1. Load Tool: P20 2. Transfer From File a. File specifies source position in column Righty_Source_Plate
Automated Assembly of Standard Biological Parts
381
b. File contains source well information in column RightyPos c. File specifies destination position in column Dest_Plate d. File contains destination well information in column Dest_Well e. File contains volume information in column Righty f. Skip zero volume transfers 3. Unload Tips 4. Unload Tool Step 5. End Group vi. Distribute Ligase 1. Load Tool: P20 2. Transfer From File a. File specifies source position in column EnzSource b. File contains source well information in column LigPos c. File specifies destination position in column Dest_Plate d. File contains destination well information in column Dest_Well e. File contains volume information in column Lig_Volume f. Skip zero volume transfers 3. Unload Tips 4. Unload Tool Step 5. End Group vii. Finish 5. Generate the .csv file that the robot will need to dispense enzyme cocktails and cherry pick plasmid input parts from source locations into destination wells. The file should have at least 13 columns. Columns containing additional information, like specification of the 2ab combination marking a new junction, for example, can be present in the .csv file, but are not required. The following fields are essential to carry out the work flow outlined in this section: a. Lefty_Source_Plate: specifies source position for “dilution” plate holding lefty input parts. b. LeftyPos: specifies source position for well containing lefty input parts in “dilution” plate. c. Righty_Source_Plate: specifies source position for “dilution” plate holding righty input parts. d. RightyPos: specifies source position for well containing righty input parts in “dilution” plate. e. Dest_Plate: specifies destination position for “reaction” plate(s). f. Dest_Well: specifies destination position for wells in “reaction” plate(s). g. EnzSource: specifies source position for plate holding enzyme cocktails. h. BamBgl: specifies source position for well containing restriction enzyme cocktail in source plate.
Mariana Leguia et al.
  i. BamBgl_Volume: specifies volume of restriction enzyme cocktail that will be aliquoted.
  j. LigPos: specifies source position for well containing ligation cocktail in source plate.
  k. LigPos_Volume: specifies volume of ligation cocktail that will be aliquoted.
  l. Lefty: specifies volume of lefty input part that will be aliquoted.
  m. Righty: specifies volume of righty input part that will be aliquoted.

2ab reaction protocol
1. Open robot program for 2ab reaction.
2. Import the .csv file that specifies the distribution pattern for enzyme cocktails and the cherry picking commands for lefty and righty input parts. The file will need to be imported into every “transfer from file” step.
3. Set up the robot deck by laying out prepared “dilution” plate(s), destination “reaction” plate(s), and enzyme cocktails on source plate according to the picture shown on “instrument setup.”
  a. We usually keep the ligation cocktail on ice until ready for use after the pause step.
4. Ensure that there are enough tips available for cherry picking.
  a. Initially, you will need one 20 µL tip, which will not be changed between aliquots, to distribute the restriction enzyme cocktail. Since you will change tips after every cherry pick thereafter, you will need two 20 µL tips per 2ab reaction (one for the righty part and one for the lefty). An additional 20 µL tip per 2ab reaction will be needed to distribute the ligation cocktail.
5. Begin running program. Into each reaction well, robot will:
  a. Distribute 10 µL of restriction enzyme cocktail.
  b. Cherry pick and distribute 3 µL of lefty input plasmid.
  c. Cherry pick and distribute 3 µL of righty input plasmid.
6. At the pause, transfer “reaction” plate into plate thermocycler. Thermocycler will:
  a. Incubate at 37 °C for 1 h to digest DNA.
  b. Incubate at 65 °C for 20 min to heat-inactivate restriction enzymes.
7. Return “reaction” plate to robot and lay out ligation cocktail into source position.
8. Resume program. Into each reaction well, robot will:
  a. Distribute 4 µL of ligation cocktail.
9. Following termination of liquid handling commands, incubate 30 min at RT to ligate fragments.
10. Proceed to transformation.
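The .csv file for the 2ab reaction can be produced by a short script rather than by hand. The following is a minimal sketch in Python, not the authors' AssemblyManager output. It assumes the thirteen column names from the field list above, per-well volumes of 10, 3, 3, and 4 µL as in the protocol, and hypothetical deck positions and enzyme wells (`P1`, `P2`, `P3`, `A1`, `B1`). Note that the robot program steps refer to a ligase volume column named `Lig_Volume`, while the field list names it `LigPos_Volume`; the header must match whatever column the "transfer from file" step is actually configured to read.

```python
import csv
import io

# Column names copied from the field list; adjust to match the robot program.
COLUMNS = [
    "Lefty_Source_Plate", "LeftyPos", "Righty_Source_Plate", "RightyPos",
    "Dest_Plate", "Dest_Well", "EnzSource", "BamBgl", "BamBgl_Volume",
    "LigPos", "LigPos_Volume", "Lefty", "Righty",
]


def build_2ab_worklist(reactions, dilution_plate="P1", reaction_plate="P2",
                       enzyme_plate="P3", digest_well="A1", ligase_well="B1"):
    """Return CSV text with one row per 2ab reaction.

    reactions: list of (lefty_well, righty_well, dest_well) tuples.
    Deck positions and enzyme wells are hypothetical defaults.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for lefty_well, righty_well, dest_well in reactions:
        writer.writerow({
            "Lefty_Source_Plate": dilution_plate, "LeftyPos": lefty_well,
            "Righty_Source_Plate": dilution_plate, "RightyPos": righty_well,
            "Dest_Plate": reaction_plate, "Dest_Well": dest_well,
            "EnzSource": enzyme_plate,
            "BamBgl": digest_well, "BamBgl_Volume": 10,   # uL digestion cocktail
            "LigPos": ligase_well, "LigPos_Volume": 4,    # uL ligation cocktail
            "Lefty": 3, "Righty": 3,                      # uL of each input part
        })
    return buf.getvalue()


def tips_needed(n_reactions):
    """One shared tip for the digestion cocktail, two per reaction for the
    input parts, and one per reaction for the ligation cocktail."""
    return 1 + 2 * n_reactions + n_reactions
```

For a full 96-well plate, `tips_needed(96)` gives 289 tips, and each row's four volumes add up to the 20 µL ligation volume used in the transformation step.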
2.6. Transformation
Ligations are transformed into one or two different E. coli strains, depending on whether new composite parts will be used as lefties, righties, or both, in the following round of 2ab assembly (see Section 2.3). “Reaction” plates are usually prearrayed such that new composite parts destined as lefties end up on the left side of the plate, while new composite parts destined as righties end up on the right side of the plate. Transformation reactions are carried out by adding 30 µL of chemically competent cells into each 20 µL ligation. Following a brief incubation on ice, cells are heat shocked at 42 °C using a plate thermocycler, and then rescued at 37 °C prior to plating. With the described minimal setup, plates have to be transferred manually from the robot to an ice bucket for cooling, and from the ice bucket to the plate thermocycler for heat shocking and rescuing. Thermocyclers, shakers, and incubators can be incorporated into the minimal setup given additional financial investment. A screenshot of a sample robot program file for transformation is shown in Fig. 16.6. The actual sample robot file and corresponding .csv file can be downloaded at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html.

Before starting:
1. Prepare righty and lefty competent cells (see Section 2.9).
  a. You will need 30 µL of competent cells per transformation.
  b. Prepare amount = n + 2 reactions.
    i. Thaw enough vials of frozen competent cells on ice, combine into one tube, and add KCM solution to 1× (see Section 2.9).
  c. Keep on ice until ready to use.
2. Generate the robot file for transformation.
  a. This program will be used to dispense lefty and righty competent cells every time transformations are carried out.
  b. Information detailing source location of lefty and righty competent cell stocks, as well as destination plate(s) and wells, is specified in corresponding .csv file (see step 3 below).
  c. The “transfer from file” command is used two times in this program, once to distribute lefties, and once to distribute righties, according to corresponding .csv file.
  d. Robot program commands appear in the following order:
    i. Start
    ii. Instrument Setup
    iii. Distribute lefty/righty cells
      1. Load Tool: P20
      2. Transfer From File (aliquots lefties)
        a. File specifies source position in column Cell_Source_Plate
        b. File contains source well information in column Cell_Source_Well
Figure 16.6 Screenshot of robot program used for transformation. The “transfer from file” command is used twice, once to distribute BglII-methylating competent cells and once to distribute BamHI-methylating competent cells. Part of the .csv file associated with this program is shown within the red box, along with additional specification points for the first “transfer from file” command, which dispenses BglII-methylating competent cells. The other “transfer from file” command is associated with the same .csv file and contains similar information as related to the distribution of BamHI-methylating competent cells. The robot program and its corresponding .csv file are available for download at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html. At the outset of the run, the robot deck is laid out as shown in the green box. As specified in the .csv file for this particular run, the source plate containing competent cells is located at P3, while the destination “reaction” plate is located at P2.
        c. File specifies destination position in column Reaction_Plate
        d. File contains destination well information in column Reaction_Well
        e. File contains volume information in column Is_Lefty
        f. Skip zero volume transfers
      3. Transfer From File (aliquots righties)
        a. File specifies source position in column Cell_Source_Plate
        b. File contains source well information in column Cell_Source_Well
        c. File specifies destination position in column Reaction_Plate
        d. File contains destination well information in column Reaction_Well
        e. File contains volume information in column Is_Righty
        f. Skip zero volume transfers
      4. Unload Tips
      5. Unload Tool Step
      6. End Group
    iv. Finish
3. Generate the .csv file that the robot will need to dispense lefties and righties from source location into destination wells. The following six fields are essential to carry out the work flow outlined in this section:
  a. Cell_Source_Plate: specifies source position for plate holding lefty and righty competent cell stocks to be aliquoted.
  b. Cell_Source_Well: specifies source position for well containing lefty and righty competent cell stocks on source plate.
  c. Reaction_Plate: specifies destination position for plate(s) where cells will be aliquoted.
  d. Reaction_Well: specifies destination position for wells in destination plate(s).
  e. Is_Lefty: specifies volume of lefties that will be aliquoted.
  f. Is_Righty: specifies volume of righties that will be aliquoted.

Transformation protocol
1. Open robot program for transformation.
2. Import the .csv file that specifies the distribution pattern for righties and lefties. The file will need to be imported twice, once for every “transfer from file” step.
3. Set up the robot deck by laying out prepared competent cell stocks and the “reaction” plate containing ligation products of the 2ab reaction according to the picture shown on “instrument setup.” Ideally, the “reaction” plate should be held within a cooling block (Bioscision, cat# Gray, BCS120) to prevent loss of cell competency while pipetting operations are carried out.
4. Ensure that there are enough tips available for cherry picking.
  a. Since you will change tips after every aliquot, you will need one 20 µL tip per transformation to aliquot competent cells.
5. Run program.
6. Transfer plate to ice bucket and incubate 10 min.
7. Transfer plate to plate thermocycler and heat shock at 42 °C for 3 min.
8. Transfer plate back to ice bucket to cool briefly.
9. Add 100 µL of liquid LB or 2YT media by hand using a P200 multichannel pipette.
10. Rescue at 37 °C for 1 h.
11. Proceed to plating.
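The six-column .csv file for transformation follows directly from the field list above. Because both "transfer from file" steps read the same file and skip zero-volume transfers, each row can carry 30 µL in one of `Is_Lefty` or `Is_Righty` and 0 in the other. The sketch below is an illustration under assumptions, not the authors' file generator: deck positions `P3` and `P2` are those given in the caption of Fig. 16.6, while the cell stock wells (`A1`, `A2`) and the function names are hypothetical.

```python
import csv
import io

FIELDS = ["Cell_Source_Plate", "Cell_Source_Well", "Reaction_Plate",
          "Reaction_Well", "Is_Lefty", "Is_Righty"]


def transformation_worklist(ligations, cell_plate="P3", reaction_plate="P2",
                            lefty_cells_well="A1", righty_cells_well="A2",
                            cells_ul=30):
    """Return CSV text for the transformation program.

    ligations: list of (reaction_well, role) with role "lefty" or "righty".
    Cell stock wells are hypothetical placeholders.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for well, role in ligations:
        if role not in ("lefty", "righty"):
            raise ValueError(f"unknown role for well {well}: {role!r}")
        is_lefty = role == "lefty"
        writer.writerow({
            "Cell_Source_Plate": cell_plate,
            "Cell_Source_Well": lefty_cells_well if is_lefty else righty_cells_well,
            "Reaction_Plate": reaction_plate,
            "Reaction_Well": well,
            # The zero entry is skipped by "Skip zero volume transfers".
            "Is_Lefty": cells_ul if is_lefty else 0,
            "Is_Righty": 0 if is_lefty else cells_ul,
        })
    return buf.getvalue()


def cells_to_prepare_ul(n_transformations, cells_ul=30, extra=2):
    """Volume of 1x KCM competent cells for n + 2 reactions."""
    return (n_transformations + extra) * cells_ul
```

For example, 48 lefty transformations call for `cells_to_prepare_ul(48)`, that is 1500 µL of lefty cells.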
2.7. Plating
High-throughput plating of transformations is carried out on 24-well strip plates (Analytical Sales & Services, cat# 47025) containing LB/agar strips supplemented with appropriate 2ab combinations. Each strip is 3.5 cm long by 0.8 cm wide and has a surface area of 2.8 cm². This surface area is large enough to accommodate several dozen colonies. The plating protocol outlined here routinely yields 5–50 colonies per strip, which is more than enough to provide sufficient coverage for screening. Given that our Biomek 3000 setup holds an 8-tip multichannel tool, our protocols are designed to maintain multiples of 8 grouped together. Ideally for our application, 24-well strip plates would be arranged with strips running horizontally in 3 columns of 8 strips each. Unfortunately plates of this sort, which were once available from Seahorse Bioscience, have been discontinued. Instead, we use 24-well strip plates where strips are arranged vertically in 2 rows of 12. Even though it is possible to use all 24 strips for plating, we routinely use only 16. This is done to simplify organization and avoid introducing human-generated errors down the line, particularly during the colony picking steps required to screen transformants. When using 2 × 12 24-well strip plates, we find that it is easier to use only the first 8 wells in every row to plate a total of 16 transformations (2 columns of 8 wells each from a standard 96-well plate of transformations) than to split sets of 8. To plate 96 transformations prepared in a 96-well plate, six 24-strip plates are necessary.

Plating begins with the preparation of LB/agar strips supplemented with different 2ab combinations. Briefly, the robot is used to spot small volumes of antibiotic stocks onto different wells, as specified by a .csv file. A screenshot of a sample robot program file for plating is shown in Fig. 16.7. The actual sample robot file and corresponding .csv file can be downloaded at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html. Each well receives two different antibiotics, corresponding to the antibiotic combination marking the new part junction. Once all antibiotics are dispensed,
Figure 16.7 Screenshot of robot program used for plating. The “transfer from file” command is used three times, once to aliquot chloramphenicol, once to aliquot ampicillin, and once to aliquot kanamycin. Part of the .csv file associated with this program is shown within the red box, along with additional specification points for the first “transfer from file” command, which dispenses chloramphenicol. The other two “transfer from file” commands are associated with the same .csv file and contain similar information as related to the distribution of ampicillin and kanamycin. The robot program and its corresponding .csv file are available for download at http://andersonlab.qb3.berkeley.edu/andersonSoftware.html. At the outset of the run, the robot deck should be laid out as shown in the green box. As specified in the .csv file for this particular run, there are four destination plates located at P5, P6, P7, and P8. The source plate containing antibiotic stocks is located at P3. At the completion of the robot run, and following addition of LB/agar, destination plate P5 will appear as shown at the top and bottom, respectively, of the blue box.
melted LB/agar is added to each well using a P1000 multichannel pipette. The plate is rocked back and forth to mix antibiotic stocks and agar, and allowed to solidify under a flame to keep it sterile. Once the agar solidifies, transformations are transferred onto the agar surface using a P200 multichannel pipette, rocked back and forth to coat the agar surface, and allowed to dry under a flame. Once dry, the plate is sealed with AeraSeal film (Phenix Research Products, cat# B100), inverted, and incubated overnight at 37 °C or until colonies appear.

As an alternative to plating, others have successfully used logarithmic serial dilutions (Randy Rettberg, personal communication). Although it is possible to identify liquid cultures derived from a single transformant using this approach, the large number of samples processed during a typical automated assembly run can make this method difficult to implement at a reasonable scale.

Before starting:
1. Prepare 100× antibiotic stocks: A at 10 mg/mL, C at 2.5 mg/mL, and K at 2.5 mg/mL.
  a. We usually prepare fresh 1:10 dilutions from frozen 1000× antibiotic stocks using water as a diluent.
  b. 1000× stocks of A and K are made in water, whereas C is made in ethanol.
  c. We usually color-code our 100× antibiotic stocks (A, red; C, blue; and K, green) using small amounts (1–2 µL) of stock solutions of rhodamine, bromophenol blue, and fluorescein. Food coloring can also be used for this purpose. Though color-coding of antibiotic stocks is not required, it provides a great visual aid to check that the robot has aliquoted antibiotic stocks correctly.
2. Prepare LB/agar.
  a. We maintain LB/agar in 500 mL glass bottles and microwave to melt prior to plating.
3. Ensure that you have clean, sterilized (by autoclaving) 24-strip plates.
4. Generate the robot file for plating.
  a. This program will be used every time plates need to be prepared.
  b. Information detailing source location of antibiotic stocks and destination plates and wells for each is specified in corresponding .csv file (see step 5 below).
  c. The “transfer from file” command is used three times in this program, once to aliquot A, once to aliquot C, and once to aliquot K.
  d. Robot program commands appear in the following order:
    i. Start
    ii. Instrument Setup
    iii. Antibiotic picking
      1. Load Tool: P20
      2. Transfer From File (aliquots C)
        a. File specifies source position in column Source_Plate
        b. File contains source well information in column CamPos
        c. File specifies destination position in column Dest_Plate
        d. File contains destination well information in column Dest_Well
        e. File contains volume information in column Cam
        f. Skip zero volume transfers
      3. Transfer From File (aliquots K)
        a. File specifies source position in column Source_Plate
        b. File contains source well information in column KanPos
        c. File specifies destination position in column Dest_Plate
        d. File contains destination well information in column Dest_Well
        e. File contains volume information in column Kan
        f. Skip zero volume transfers
      4. Transfer From File (aliquots A)
        a. File specifies source position in column Source_Plate
        b. File contains source well information in column AmpPos
        c. File specifies destination position in column Dest_Plate
        d. File contains destination well information in column Dest_Well
        e. File contains volume information in column Amp
        f. Skip zero volume transfers
      5. Unload Tool Step
      6. End Group
    iv. Finish
5. Generate the .csv file that the robot will need to dispense antibiotics from source location into destination wells. The following nine fields are essential to carry out the work flow outlined in this section:
  a. Source_Plate: specifies source position for plate holding antibiotic stocks.
  b. CamPos: specifies source position for well containing C stock on source plate.
  c. KanPos: specifies source position for well containing K stock on source plate.
  d. AmpPos: specifies source position for well containing A stock on source plate.
  e. Dest_Plate: specifies destination position for plate(s) where antibiotics will be aliquoted.
  f. Dest_Well: specifies destination position for wells in destination plate(s).
  g. Cam: specifies volume of C that will be aliquoted.
  h. Kan: specifies volume of K that will be aliquoted.
  i. Amp: specifies volume of A that will be aliquoted.
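The nine-field plating file can be derived from the list of 2ab combinations, one per transformation. The sketch below is only an illustration of that mapping. It assumes 16 strips are used per 24-strip plate (the first 8 wells of each of the 2 rows), that strip wells are named `A1`–`A8` and `B1`–`B8` (a hypothetical naming scheme), that antibiotic stocks sit in wells `A1`–`A3` of the source plate (also hypothetical), and that roughly 15 µL of each 100× stock is spotted per strip, a figure inferred here from diluting a 100× stock into 1.5 mL of agar rather than one stated in the protocol. Deck positions `P3` and `P5`–`P8` are those given in the caption of Fig. 16.7.

```python
import csv
import io

FIELDS = ["Source_Plate", "CamPos", "KanPos", "AmpPos",
          "Dest_Plate", "Dest_Well", "Cam", "Kan", "Amp"]

# Assumption: ~15 uL of 100x stock per 1.5 mL LB/agar strip.
STOCK_UL = 15


def plating_rows(combos, dest_plates=("P5", "P6", "P7", "P8"),
                 source_plate="P3", cam_pos="A1", kan_pos="A2", amp_pos="A3"):
    """Map each 2ab combination (e.g. "AK") to one strip well and volumes."""
    rows = []
    for i, combo in enumerate(combos):
        plate_index, within = divmod(i, 16)      # 16 strips used per plate
        if plate_index >= len(dest_plates):
            raise ValueError("not enough destination plates for these strips")
        strip_row, strip_col = divmod(within, 8)  # first 8 wells of each row
        letters = set(combo.upper())
        if len(letters) != 2 or not letters <= {"A", "C", "K"}:
            raise ValueError(f"not a 2ab combination: {combo!r}")
        rows.append({
            "Source_Plate": source_plate,
            "CamPos": cam_pos, "KanPos": kan_pos, "AmpPos": amp_pos,
            "Dest_Plate": dest_plates[plate_index],
            "Dest_Well": f"{'AB'[strip_row]}{strip_col + 1}",
            # Zero volumes are skipped by the robot program.
            "Cam": STOCK_UL if "C" in letters else 0,
            "Kan": STOCK_UL if "K" in letters else 0,
            "Amp": STOCK_UL if "A" in letters else 0,
        })
    return rows


def to_csv(rows):
    """Serialize rows with the nine required column headers."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

With the four default destination plates this covers up to 64 transformations; a full 96-well plate needs six plate positions passed in `dest_plates`.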
Plating protocol:
1. Open robot program for plating.
2. Import the .csv file that specifies destination wells in 24-strip plates for antibiotics. The file will need to be imported three times, once for every “transfer from file” step.
3. Set up the robot deck by laying out antibiotic stocks and destination plates according to the picture shown on “instrument setup.”
4. Ensure that there are enough tips available for cherry picking.
  a. You will need only 3, since the same tip will be reused for each antibiotic.
5. Run program.
6. Perform a quick visual inspection of the plate(s). If antibiotics have been dispensed correctly, each well will contain two spots of different colors in different locations.
7. After antibiotics have been dispensed, pour strips by adding 1.5 mL of LB/agar into each well.
  a. First transfer melted LB/agar onto a sterile, disposable reservoir (Corning Costar, cat# 4870) under a flame.
  b. Use a P1000 multichannel to transfer 2 × 750 µL of liquid into each well.
  c. Work quickly, as agar solidifies fast.
  d. Mix by gently rocking back and forth two or three times.
  e. Do not mix by pipetting, as this can create a lot of bubbles and a smooth flat surface is needed for successful plating.
  f. Remove any bubbles from the surface by flaming quickly. Avoid extensive flaming, which can melt well walls and degrade antibiotics, reducing their effective concentration.
8. Once agar solidifies, add LB containing transformations and rock back and forth to coat entire surface.
  a. Transformation volumes need to be determined empirically. In the work flow described here we usually plate 40–60 µL of each transformation (from a 150 µL total volume).
  b. Transformation volumes plated can be increased or decreased to adjust the number of transformants obtained. One should avoid plating volumes smaller than 20 µL because they are hard to spread over the entire agar surface. Similarly, volumes larger than 80 µL take too long to dry, causing colonies to form streaks or lawns (see Section 3).
9. Allow liquid to dry under a flame.
10. Once dry, cover with AeraSeal film.
11. Incubate at 37 °C, inverted, overnight or until colonies appear.
12. Proceed to screening of transformants.
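Step 8 leaves the plated volume to be tuned empirically within a 20–80 µL window. One simple way to pick the next volume is to scale the previous one by the ratio of desired to observed colonies and clamp the result to that window. The sketch below assumes colony number scales linearly with plated volume, which is an approximation, and the default target of 20 colonies is a hypothetical choice inside the 5–50 range the protocol reports.

```python
# Volumes (uL) from the protocol: 20 ligation + 30 cells + 100 rescue medium.
TOTAL_UL = 20 + 30 + 100
MIN_PLATE_UL, MAX_PLATE_UL = 20.0, 80.0


def next_plating_volume(plated_ul, colonies, target=20):
    """Suggest a plating volume for the next run, clamped to 20-80 uL.

    Assumes colony count is proportional to plated volume.
    """
    if colonies <= 0:
        # No colonies usually points to an upstream problem (Section 3.1);
        # the most this function can suggest is the maximum volume.
        return MAX_PLATE_UL
    scaled = plated_ul * target / colonies
    return max(MIN_PLATE_UL, min(MAX_PLATE_UL, scaled))
```

When the suggestion hits either limit, diluting or concentrating the transformation (Sections 3.2 and 3.4) is the remaining adjustment.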
2.8. Screening of transformants
Proper screening of transformants is essential in order to identify clones containing correctly assembled parts. Each transformant is biotyped to determine growth phenotype using two separate LB plates. The first contains two selection antibiotics (determined by the 2ab combination marking the new junction), and the second contains all three selection antibiotics (ACK) (A at 100 µg/mL, C at 25 µg/mL, and K at 25 µg/mL). Transformants with properly assembled parts will be able to grow on 2ab combinations, but not on triple antibiotic combinations. Transformants that are able to grow on triple combinations are usually cotransformed with multiple plasmids, including undesired child products and/or parent input plasmids. Segregating plasmids away from each other is a difficult and time-consuming task, so colonies showing triple antibiotic resistance should be eliminated. We routinely obtain cotransformation rates averaging about 5%. However, we have observed that rates can vary drastically, ranging from 2% to 26%, depending on which new antibiotic combination is being generated. In general, CK and KC pairs show higher rates of cotransformation than AC, CA, AK, and KA pairs when equivalent amounts of plasmid DNA are used. Despite the range, our protocols yield cotransformation rates that are low enough not to interfere with the screening process. We routinely pick only 2–3 transformants per junction, and at least one will contain properly assembled parts. If cotransformation rates become high enough to be problematic, the easiest way to reduce them is to reduce the amount of DNA used for transformation. In our experience, the following screening strategy will yield 96+% successful junctions after cotransformants have been eliminated (see Section 3). The minimal setup does not allow us to automate the process of colony picking and gridding on selection plates, so these operations are presently carried out manually. However, there are a number of automated solutions that can be integrated with the Biomek 3000 to carry out these functions, including the EasyPick (Hamilton Robotics) and the Biomek FX (Beckman Coulter).

Screening of transformants protocol:
1. Remove 24-strip plates from the incubator and visually inspect. Ideally all wells should have distinct, punctate, perfectly round colonies, ranging in number from 5 to 50, that can be easily and cleanly picked (see Section 3).
2. Using a sterile toothpick, pick a single colony, and spot onto LB plates supplemented with both double and triple antibiotic combinations.
  a. Use grid patterns to organize and easily identify clones growing on separate plates.
  b. We usually pick 2–3 colonies per junction, which is enough to yield at least one with the correct growth phenotype.
3. Grow at 37 °C overnight.
4. Eliminate clones that grow on triple antibiotic combinations. 5. Identify potential positives and proceed to isolate plasmid DNA (see Section 2.4).
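The biotyping rule and the reasoning behind picking only 2–3 colonies per junction can be written down in a few lines. The sketch below is illustrative: the function names are hypothetical, and the probability estimate assumes each picked colony is independently a cotransformant at the stated rate. It says nothing about whether a non-cotransformed clone carries the correct assembly, which still has to be confirmed from the isolated plasmid DNA.

```python
def classify_clone(grows_on_double, grows_on_triple):
    """Apply the growth-phenotype rule from the screening protocol."""
    if grows_on_triple:
        return "cotransformant"      # eliminate (step 4)
    if grows_on_double:
        return "potential positive"  # proceed to plasmid isolation (step 5)
    return "no growth"


def p_at_least_one_clean(cotransformation_rate, picks):
    """Chance that at least one of `picks` colonies is not a cotransformant,
    assuming independent picks."""
    return 1 - cotransformation_rate ** picks
```

Under the independence assumption, at the highest reported cotransformation rate (26%) three picks give `p_at_least_one_clean(0.26, 3)`, about 0.98, and at the average rate (5%) two picks give about 0.9975.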
2.9. Competent cells
Competent cells should be prepared from bacteria that have been freshly restreaked on LB plates, supplemented with antibiotics where appropriate, from a −80 °C stock. It is best not to restreak from competent cell stocks. The two methylation strains described here do not have antibiotic resistance genes, so antibiotic supplements should not be used. Currently we are not aware of any automated robotics platforms that can be used to carry out competent cell protocols.

Before starting:
1. Prepare a fresh plate of restreaked bacteria containing single colonies.
2. Prepare 1 L LB in a large flask (at least 2 L volume) to allow for proper aeration of culture during growth.
3. Prepare TSS solution:
  a. Autoclave individually:
    i. 85 mL LB
    ii. 10 g PEG-3350 in 5 mL water
    iii. 5 mL DMSO
    iv. 2 mL 1 M MgCl2
  b. Allow all solutions to cool and combine.
  c. Filter sterilize through 0.2 µm filter and aliquot into 50 mL conical tubes.
  d. Store at 4 °C until needed.
  e. TSS solution can be made in advance and stored at 4 °C for at least a year.
4. Prepare 5× KCM solution:
  a. 5×: 500 mM KCl, 150 mM CaCl2, 250 mM MgCl2
  b. Filter sterilize through 0.2 µm filter.
  c. Store at 4 °C until needed.
  d. 5× KCM solution can be made in advance and stored at 4 °C indefinitely.
5. Autoclave centrifuge bottles.
6. Prechill centrifuge rotor, bottles, a large serological pipette used to resuspend the bacterial pellet, and 500 µL Eppendorf tubes for −80 °C storage of aliquoted competent cells.

Competent cells protocol:
1. Inoculate 10 mL of LB (plus antibiotics, if appropriate) with a fresh single colony.
2. Grow to saturation, usually at 37 °C overnight, with shaking.
3. Inoculate prewarmed 1 L LB with entire 10 mL overnight culture.
4. Grow to OD600 = 0.5 (will usually take 2–3 h, but can take longer).
5. Once at OD600 = 0.5, place on ice and swirl a few times to stop the growth.
6. Transfer to chilled, sterilized centrifuge bottle(s).
7. Centrifuge at 5000 × g for 10 min at 4 °C to pellet.
8. Discard supernatant.
9. Completely resuspend pellet in 25–50 mL ice-cold TSS.
10. Prepare 200 µL aliquots.
11. Flash freeze in liquid nitrogen bath.
12. Store at −80 °C until needed.
13. Prior to use:
  a. Thaw 200 µL aliquot.
  b. Add 50 µL 5× KCM solution.
  c. Mix gently by inversion.
  d. Store on ice until needed for transformation.
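The dilution in step 13 and the n + 2 rule from Section 2.6 reduce to simple arithmetic, sketched below. Adding 50 µL of 5× KCM to a 200 µL aliquot gives 250 µL at 1× KCM, enough for eight 30 µL transformations if pipetting losses are ignored (an assumption of this sketch). The function names are hypothetical.

```python
import math

# 5x KCM composition (mM) as given in the recipe.
KCM_5X_MM = {"KCl": 500, "CaCl2": 150, "MgCl2": 250}


def kcm_final_mm(cells_ul=200, kcm_ul=50):
    """Final salt concentrations (mM) after adding 5x KCM to thawed cells."""
    total_ul = cells_ul + kcm_ul
    return {salt: conc * kcm_ul / total_ul for salt, conc in KCM_5X_MM.items()}


def aliquots_to_thaw(n_transformations, cells_per_rxn_ul=30, extra=2,
                     cells_ul=200, kcm_ul=50):
    """Number of frozen 200 uL aliquots needed for n + 2 reactions,
    ignoring dead volume in tubes and tips."""
    needed_ul = (n_transformations + extra) * cells_per_rxn_ul
    return math.ceil(needed_ul / (cells_ul + kcm_ul))
```

With these defaults, `kcm_final_mm()` returns 100 mM KCl, 30 mM CaCl2, and 50 mM MgCl2, and 48 transformations of one cell type require `aliquots_to_thaw(48)`, that is 6 aliquots.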
3. Troubleshooting Potential problems with 2ab assembly are most obvious post-plating of transformations. By far, the most common source of failure is mini-prep DNA of poor quality. The following describe various scenarios for the appearance of LB/agar strips on plates.
3.1. No colonies Of all the possible scenarios for the appearance of agar strips, a total lack of colonies is the most complicated to troubleshoot because it can be due to a variety of reasons affecting several different steps along the work flow. However, assuming that all reagents and equipment are working properly, the most common source of failure is human error. Common mistakes include: 1. Quality of input plasmid DNA is subpar because mini-preps are not performed carefully. Redoing mini preps from scratch is the easiest way to correct issues associated with poor plasmid DNA quality. To ensure that input plasmid DNA is of the highest quality possible, do not cut corners during mini-prep steps. Resuspend pellets well to ensure complete lysis. Wash columns well (with both binding and wash buffers) to eliminate nucleases and protect DNA from degradation after prepping. Ensure that trace amounts of ethanol are removed from columns prior to elution steps.
2. Stocks of digestion and ligation cocktails are prepared incorrectly. Ensure that the right amount and proportion of enzymes are added to the stock enzyme cocktails, and that sufficient cocktail is available for all reactions. Check that the robot is aliquoting the correct amount required and that the liquid sample being aliquoted is actually reaching its destination well and mixing with the rest of the well’s contents. At the beginning of every run, it is usually a good idea to watch the robot perform the first few liquid transfers to ensure that it is doing so correctly. At the end of the run, if possible, check that all volumes have been aliquoted (compare all wells, which should have roughly the same volume collected at the bottom) and that stock vials are not empty (which may indicate that you ran out of material and that some wells did not receive an aliquot).
3. Input parent parts are not matched correctly, are not placed in the correct well, or are not methylated properly. Ensure that lefty and righty parent assembly vectors are matched and can generate a viable junction. Ensure that enough input parent DNA is placed in the correct well in the correct input plate. Adjust volumes for parts that will be used as input multiple times in an assembly. Ensure that lefties and righties are derived from the appropriate methylation strain.

Other sources of failure can usually be traced back to reagents and/or equipment that are not working properly. Ensure that enzymes and antibiotic stocks are fresh and working as expected. Routinely perform visual checks on plates while the robot is in operation to ensure that it is running as expected. Try to keep an orderly, clean flow. Try to keep liquid volumes collected at the bottom of plates to avoid cross-contamination.
3.2. Very few colonies Unless the colonies that appear on the agar surface do not contain the desired part, just a few colonies per strip is preferred because it increases the chances of cleanly picking single colonies. If desired, the number of colonies per strip can be increased either by increasing the volume plated, or by plating a more concentrated sample. To concentrate the sample, first pellet transformants by centrifugation, then remove a portion of the supernatant and resuspend the pellet in the remaining volume.
3.3. Colonies of different size Slight differences in colony size are normal and due to minor variations in growth rates of cultures replicating plasmids with different 2ab combinations. We routinely observe that cells replicating AK and KA plasmids grow the fastest, while cells replicating CK and KC plasmids grow the slowest. Major
differences in colony size or general colony appearance are not normal and usually indicate either that the construct contains a toxic part, or that there is a contamination problem. Ensure that you do not have a contamination with antibiotic-resistant bacteria and/or check that the appropriate antibiotics, at the appropriate concentration, have been used on strips (see Section 3.4).
3.4. Too many colonies to pick cleanly Too many colonies growing on the surface of an agar strip indicates that too many transformants were present in the volume of suspension plated. To correct, either plate less volume or plate a dilution of the original transformation. It is also possible that the strip does not contain the correct antibiotic combination in the correct amount. Ensure that the robot aliquots antibiotics to all strips. Perform a visual inspection of the antibiotic pattern on plate prior to adding liquid LB/agar. When adding liquid LB/agar ensure that the liquid is not too hot (should not burn to the touch), as this can degrade antibiotic molecules, effectively reducing their final concentration. Ensure that antibiotic stocks are fresh and used at the correct concentration. Take care to mix liquid LB/agar and antibiotics by gently rocking plate back and forth several times. Ensure that your materials (competent cell stocks, media for rescuing, ligations, etc.) are not contaminated.
3.5. Streaky colonies The primary cause of streaks is an uneven drying surface. If the plate is placed down to dry on an uneven surface, or if the surface of a solidified agar strip itself is tilted, even slightly, liquid will pool and collect in the shallowest part of the strip. As the liquid dries it will continue to move down toward the shallowest part of the strip, dragging down cells that have begun to divide elsewhere. This imbalance will result in streaks that cannot be picked as single colonies. To solve, pour agar strips with care, ensure that the surface is even and free of bubbles. Allow plates to dry on a flat benchtop and pick a dry environment, if possible, so that LB dries as fast as possible. The goal is to have bacterial cells pulled straight down onto the agar strip as the liquid media containing transformants dries out, such that they form single, distinguishable colonies. If strips are taking too long to dry after transformants are plated, reduce the volume of LB added, which will reduce the time needed for strips to dry.
3.6. Lawns or lawny areas Lawns that evenly cover the entire surface of the strip usually indicate that too many transformants are present in the volume of suspension plated. To correct, either plate less volume or plate a dilution of the original
transformation. Lawny areas in some, but not all, of the strip’s surface usually indicate that the plate has been allowed to dry on an uneven drying surface. Ensure that the plate is resting on a flat surface and that agar strips are even and free of bubbles (see Section 3.5). It is also possible that the strip does not contain the correct antibiotic combination in the correct amount (see Section 3.4). Finally, ensure that you do not have a contamination problem.
3.7. Uneven number of colonies on various different strips Sometimes you can get plates where some strips have no colonies, others have a few, and others have too many to pick cleanly. The most common source of this problem is that input DNA plasmids are present in dramatically different concentrations. Provided that all input DNAs are miniprepped together and correctly, they should be of similar concentration and quality. If not, check concentration and DNA quality and adjust accordingly. If the problem is just one input DNA, this can be apparent when all the assemblies containing that particular input part fail. Check the pattern(s) of failure to see if you can identify potentially problematic input plasmids. Problems associated with a single input part usually indicate degradation of that DNA. Re-mini-prepping that specific input part is usually the easiest solution to the problem. If there is no obvious pattern for failures, check that input parent parts are matched correctly, placed in the correct input well, and methylated properly (see Section 3.1).
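Checking the pattern of failures for a problematic input plasmid amounts to asking which parts appear only in failed assemblies. The sketch below shows one way to do this from a list of per-junction colony counts. The data layout and the `min_uses` threshold are assumptions for illustration; a part used only once cannot be distinguished from its partner, so it is not flagged by default.

```python
from collections import defaultdict


def suspect_inputs(results, min_uses=2):
    """Return input parts for which every assembly using them failed.

    results: iterable of (lefty_part, righty_part, colony_count).
    A junction counts as failed when it gave zero colonies.
    """
    uses = defaultdict(int)
    fails = defaultdict(int)
    for lefty, righty, colonies in results:
        for part in (lefty, righty):
            uses[part] += 1
            if colonies == 0:
                fails[part] += 1
    return sorted(part for part in uses
                  if uses[part] >= min_uses and fails[part] == uses[part])
```

For example, if part `p1` is paired with `p2` and `p3` and both junctions give no colonies, while `p4` paired with the same two partners gives colonies, only `p1` is returned, which makes it the candidate for re-mini-prepping.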
4. Concluding Remarks The protocols outlined here describe an automation flow carried out using a bare-bones minimal robotics platform that can be affordably implemented in a small academic research lab. Several aspects of the automation process could be significantly improved to streamline automation without modifications to the underlying chemistry using additional tools. The first step is full integration of software and robotics tools. At the moment .csv files are used to translate AssemblyManager code into Biomek 3000 code. A better alternative would be to develop a single seamless piece of software. The second step is use of assembly lines equipped with multiple small robots dedicated to specific tasks. Bottlenecks along a particular construction trajectory can vary depending on the nature of the DNA fabrication task at hand. By adding modularity with dedicated robots that perform a single task optimally, bottlenecks can be easily streamlined. Where appropriate, we have pointed out alternatives to our setup using a variety of automated solutions, including mini-prepping robots, cherry picking and gridding
robots, dispensers of several kinds, etc. Our vision for the future is full automation of the entire DNA fabrication process within the context of a core facility staffed by a dedicated technician. Outsourcing construction to a core facility where DNA fabrication is consolidated into a single assembly line, at large scale, using small volume reagents and consumable-free robotics is a better alternative to “one-box” solutions executed in individual labs, particularly because the implementation of these technologies can be complicated despite the fact that the chemistries work well. In the interim, we continue to develop our assembly chemistries in order to further simplify and streamline construction. Currently, we are developing technologies that will enable cells to carry out various steps of the assembly process in-vivo, including restriction digestion and ligation reactions. We envision that these upgrades, along with fully integrated automated assembly lines within core facilities, will enable fast and economical solutions that will transform DNA fabrication from a technically intensive art into a purely design-based discipline.
ACKNOWLEDGMENTS We thank Martin Pollard of the Joint Genome Institute (www.jgi.doe.gov) for helpful discussions on assembly line automation design. We also thank Nina Revko and the 2009 Berkeley computational iGEM team for early work on the automated assembly software. This work was funded by SynBERC (www.synberc.org).
REFERENCES Anderson, J. C., Dueber, J. E., Leguia, M., Wu, G. C., Goler, J. A., Arkin, A. P., and Keasling, J. D. (2010). BglBricks: A flexible standard for biological part assembly. J. Biol. Eng. 4, 1. Densmore, D., Hsiau, T. H., Kittleson, J. T., DeLoache, W., Batten, C., and Anderson, J. C. (2010). Algorithms for automated DNA assembly. Nucleic Acids Res. 38, 2607–2616. Grunberg, R., Ferrar, T. S., van der Sloot, A. M., Constante, M., and Serrano, L. (2010). Building blocks for protein interaction devices. Nucleic Acids Res. 38, 2645–2662. Salis, H. M., Mirsky, E. A., and Voigt, C. A. (2009). Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 27, 946–950.
CHAPTER SEVENTEEN

MEGAWHOP Cloning: A Method of Creating Random Mutagenesis Libraries via Megaprimer PCR of Whole Plasmids

Kentaro Miyazaki

Contents
1. Introduction
2. Methods
2.1. MEGAWHOP protocol
2.2. Technical considerations
3. Applications of MEGAWHOP
3.1. Whole gene random mutagenesis
3.2. Site-directed mutagenesis of multiple sites
3.3. Domain-targeted mutagenesis
3.4. Gene fusion
References
Abstract
MEGAWHOP allows for the cloning of DNA fragments into a vector and can be used in place of conventional restriction digestion/ligation-based procedures. In MEGAWHOP, the DNA fragment to be cloned is used as a set of complementary primers that replace a homologous region in a template vector through whole-plasmid PCR. After synthesis of a nicked circular plasmid, the mixture is treated with DpnI, a dam-methylated DNA-specific restriction enzyme, to digest the template plasmid. The DpnI-treated mixture is then introduced into competent Escherichia coli cells to yield plasmids carrying replaced insert fragments. Plasmids produced by the MEGAWHOP method are virtually free of contamination by the parent plasmid and by species with no insert or with multiple inserts. Because the fragment is usually long enough that mutations do not interfere with its hybridization to the template, various types of fragments can be used with
Bioproduction Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Central 6, 1-1-1 Higashi, Tsukuba, Ibaraki, Japan Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00017-6
© 2011 Elsevier Inc. All rights reserved.
mutations at any site (either known or unknown, random, or specific). By using fragments having homologous sequences at the ends (e.g., adaptor sequence), MEGAWHOP can also be used to recombine nonhomologous sequences mediated by the adaptors, allowing rapid creation of novel constructs and chimeric genes.
1. Introduction
The classical method for cloning of a DNA fragment is to digest the fragment with restriction endonucleases and ligate it to a vector having compatible ends (Sambrook and Russell, 2001). Although this method is most commonly used, it is laborious and the resulting library is often contaminated with unwanted plasmids that have no inserts or multiple inserts. These undesirable plasmids are problematic especially when producing random mutagenesis libraries as they lead to false positives/negatives after subsequent functional screening. Here, we provide a ligation-independent cloning method, designated MEGAWHOP (megaprimer PCR of whole plasmid; Fig. 17.1). In MEGAWHOP, the DNA fragment to be cloned serves as a set of complementary primers, which are much longer than common oligonucleotide primers, hence the name, “megaprimer.” Using the fragment, the template vector is amplified by PCR to yield nicked circular DNA. The mixture is then treated with DpnI, a dam-methylated DNA-specific
[Figure 17.1 schematic: a megaprimer (a random fragment, a specific fragment, or a fragment having homologous ends) and a template vector enter MEGAWHOP, which proceeds through (1) megaprimer hybridization to a vector, (2) whole plasmid amplification by PCR, (3) DpnI treatment, and (4) transformation, yielding a recombinant vector.]
Figure 17.1 MEGAWHOP is based on the replacement of a homologous region between a fragment and vector. MEGAWHOP allows for the use of various types of fragments as long as they can hybridize to a template vector.
restriction enzyme, to digest the template plasmid. The DpnI-treated mixture is subsequently introduced into competent Escherichia coli cells to yield a plasmid with a mutated insert. Depending on the types of DNA fragments, this method can be used for various purposes, including the creation of random mutagenesis libraries, multiple site-directed mutagenesis, recombining nonhomologous sequences, and de novo cloning into a vector. Because MEGAWHOP does not require restriction digestion for insertion, it is particularly useful when targeting specific gene regions such as those encoding a particular protein domain or a signal peptide (or mature sequence) of secretory proteins. Additionally, because the method is based on hybridization, it can be used to fuse two nonhomologous sequences mediated by homologous ends as adaptors, allowing rapid recombination of gene segments.
2. Methods The MEGAWHOP cloning method consists of four steps: preparation of the DNA fragment to be cloned (megaprimer); whole-plasmid PCR using the megaprimer, which serves as a set of overlapping primers; DpnI treatment of the whole-plasmid PCR product to eliminate the template plasmid; and transformation of competent cells. During the described protocol, the DNA fragment to be cloned replaces a homologous region in the template plasmid. Libraries produced by the MEGAWHOP method are virtually free of contamination by plasmids without any insert or with multiple inserts, and also the parent. The protocol guarantees a positive outcome with a 50–3000-bp megaprimer and a 2–8-kb template plasmid.
2.1. MEGAWHOP protocol
1. Prepare a mutated gene fragment using a PCR-based procedure (e.g., error-prone PCR; Cadwell and Joyce, 1992), DNA shuffling (Stemmer, 1994) or StEP recombination (Zhao et al., 1998), or naturally occurring homologous genes. Dissolve the DNA in water (or in a nuclease-free Tris-based buffer).
2. Mix megaprimer and plasmid template (final volume of 50 μL): 0.5 μg megaprimer, 50 ng template plasmid, 0.2 mM of each dNTP, and 2.5 U of KOD-plus-Neo DNA polymerase (Toyobo, Tokyo, Japan) in 1× buffer.
3. Run whole-plasmid PCR. Incubate the reaction mixture at 68 °C for 5 min (this step is optional) and heat at 98 °C for 2 min. Perform 24–40 cycles of incubation at 98 °C for 10 s and 68 °C for x min (depending on the length of the whole plasmid). You can monitor the reaction by agarose gel electrophoresis as shown in Fig. 17.2.
[Figure 17.2 gel image: lane M (marker) followed by lanes for 0, 12, 18, and 24 cycles; bands are labeled synthesized plasmid, template plasmid, and megaprimer; the number of transformants listed under the lanes is 0, 48, 134, and 233, respectively.]
Figure 17.2 Agarose gel electrophoresis of the MEGAWHOP reaction. An aliquot (3 μL) of the whole-plasmid PCR products (50 μL) was loaded onto a 1% (w/v) agarose gel. As the number of thermal cycles increased, the intensity of the band corresponding to the synthetic plasmid became stronger. The number of transformants proportionally increased with the yield of the synthetic plasmid product.
4. Add 20 units of DpnI (1 μL) directly into the whole-plasmid PCR mixture. Incubate for 1–2 h at 37 °C or at room temperature overnight.
5. Use 1–2 μL of the DpnI-treated mixture to transform 100 μL of competent E. coli cells using standard protocols (Sambrook and Russell, 2001). Grow cells on agar-containing growth media supplemented with appropriate antibiotics.
2.2. Technical considerations
In MEGAWHOP, product yield depends on the amount of megaprimer and the number of cycles used during the whole-plasmid PCR, as in other oligonucleotide-based PCR. Unlike a restriction digestion/ligation-based method, one need not accurately adjust the concentrations of the vector and insert or their ratio for MEGAWHOP. Typically, 0.2 μg of megaprimer (750 bp) and 50 ng of template plasmid (~3.5 kb) with a PCR run of 24 cycles give satisfactory results. When error-prone PCR is performed in 50 μL, 2 μg of amplicon can be routinely produced; therefore, only 1/10 of the product is used for a subsequent MEGAWHOP reaction. Under these conditions, one can typically obtain 2000 transformants using 1/10 of the MEGAWHOP product and 100 μL of competent E. coli cells (~$10^8$/μg of pUC18). Virtually all (>99.9%) of the screened clones will contain a single insert. In some instances, the number of transformants is lower than expected. Should this happen, increasing the number of cycles and/or adding more megaprimer to the reaction should be attempted.
MEGAWHOP uses long-range PCR, and to minimize the incorporation of unintended mutations in the extending region, one must use a high-fidelity DNA polymerase, for example, KOD-plus-Neo DNA polymerase. Other types of high-fidelity DNA polymerases such as Pfu Turbo DNA polymerase (Stratagene) or Vent DNA polymerase (New England Biolabs) are also appropriate. When using KOD-plus-Neo DNA polymerase, the extension time can be calculated as 2 kb/min based on the length of the DNA template. For other DNA polymerases, the extension time will need to be optimized. PCR can be performed in either two or three steps. Because megaprimers are generally long enough to properly hybridize to the template at high temperatures, a two-step PCR should be used to shorten the reaction time.
As the quality of DNA polymerase is critical, the enzyme as well as the dNTPs should always be kept at −20 °C for storage and on ice when in use. Also, one should store the reagents in aliquots. Even though a polymerase preparation works for “normal” PCR (e.g., for amplification of a specific target), it may not work for whole-plasmid PCR. Therefore, always using a fresh aliquot of DNA polymerase may be worthwhile.
If one prepares a DNA fragment using Taq DNA polymerase, which is most commonly used in error-prone PCR, be sure to remove extra nucleotides at the 3′-end of products (Clark, 1988; Hu, 1993) to avoid introducing unintended mutations during whole-plasmid PCR. This can be done by adding a short incubation period (68 °C for 5 min) prior to whole-plasmid PCR that will allow the 3′→5′ exonuclease activity of high-fidelity DNA polymerases to remove extra nucleotides.
Most PCR buffers do not interfere with DpnI digestion. DpnI digests only methylated and hemimethylated DNA (McClelland and Nelson, 1992). Therefore, DNA newly synthesized from “standard” dNTPs will not be digested by DpnI. In contrast, template plasmid propagated in common E. coli is dam-methylated and will be digested by DpnI. Although most common E. coli strains (e.g., DH5α, JM109) contain a dam-methylation system, some (e.g., JM110, SCS110) lack the system and are not suitable for the preparation of a template vector.
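As a planning aid, the two-step cycling conditions of Section 2.1 and the 2 kb/min extension rule quoted for KOD-plus-Neo can be written down as a short Python sketch. The function names, the rounding of the extension time up to the next 30 s, and the dictionary layout are illustrative assumptions rather than part of the published protocol, and thermocycler ramp times are ignored.

```python
def extension_seconds(plasmid_bp, rate_bp_per_min=2000, step_s=30):
    """Per-cycle extension time, rounded up to the next `step_s` seconds.

    The 2 kb/min default is the rate given in the text for KOD-plus-Neo.
    Rounding up in 30 s steps is an assumption made here for convenience.
    """
    if plasmid_bp <= 0:
        raise ValueError("plasmid_bp must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    steps = -(-(plasmid_bp * 60) // (rate_bp_per_min * step_s))
    return steps * step_s


def two_step_program(plasmid_bp, cycles=24, pre_incubation=True):
    """Assemble the two-step whole-plasmid PCR program of Section 2.1.

    Each step is a (label, temperature in deg C, duration in s) tuple.
    """
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    hold_steps = []
    if pre_incubation:
        # Optional step that lets the proofreading polymerase trim 3' ends.
        hold_steps.append(("3'-end trimming (optional)", 68, 300))
    hold_steps.append(("initial denaturation", 98, 120))
    cycle_steps = [
        ("denaturation", 98, 10),
        ("annealing/extension", 68, extension_seconds(plasmid_bp)),
    ]
    return {"hold_steps": hold_steps, "cycle_steps": cycle_steps, "cycles": cycles}


def total_seconds(program):
    """Programmed run time in seconds (ramp times are not included)."""
    holds = sum(seconds for _, _, seconds in program["hold_steps"])
    per_cycle = sum(seconds for _, _, seconds in program["cycle_steps"])
    return holds + program["cycles"] * per_cycle


if __name__ == "__main__":
    program = two_step_program(plasmid_bp=3500, cycles=24)
    print(program)
    print("programmed time (min):", total_seconds(program) / 60)
```

For a polymerase other than KOD-plus-Neo, `rate_bp_per_min` would have to be replaced by an empirically optimized value, as the text notes.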
3. Applications of MEGAWHOP The MEGAWHOP method, first introduced in 2002 (Miyazaki and Takenouchi, 2002), has been successfully applied to the creation of mutant libraries of a wide range of genes, including the creation of random mutagenesis libraries, site-directed mutagenesis for multiple sites, domain mutagenesis, and gene fusion.
3.1. Whole gene random mutagenesis
We applied MEGAWHOP to improve the solubility of a fungal glycosyl hydrolase when overexpressed in E. coli (Yaoi et al., 2007). The whole gene (2.4 kb) was mutated by error-prone PCR or the StEP recombination method, and the amplicon was used as a megaprimer to clone back into the previously made pET-based expression vector (7.6 kb). The MEGAWHOP cycle was 68 °C for 5 min and 94 °C for 2 min, followed by 40 cycles of thermal cycling at 94 °C for 15 s, 55 °C for 30 s, and 68 °C for 5 min, and a final extension at 68 °C for 10 min. The reaction mixture was treated with DpnI and purified in a Qiagen spin column. The DNA was eluted in 50 μL of water and concentrated to 10 μL using a vacuum concentrator. Competent XL1-Blue cells were transformed with 5 μL of the solution by electroporation to yield 10,000 transformants. Three rounds of directed evolution converted the aggregation-prone enzyme to a fully soluble one. In this experiment, 24 cycles were initially used for whole-plasmid amplification, but this yielded only a small number of transformants. Increasing the number of cycles raised transformant yields enough to create large-scale libraries.
Baeyer–Villiger monooxygenases belong to a class of oxidoreductases and convert aliphatic, arylaliphatic, and cyclic ketones to esters and lactones, respectively, using molecular oxygen. The wild-type enzyme exhibits only moderate E values (E ≈ 55) and was subjected to directed evolution to improve the enantioselectivity (Kirschner and Bornscheuer, 2008). Error-prone PCR was used to randomize the gene and the resultant amplicon served as a megaprimer to generate a library. The random mutagenesis library consisted of over 3500 clones with two variants identified as demonstrating improved selectivity.
3.2. Site-directed mutagenesis of multiple sites
MEGAWHOP was used to create a green fluorescent protein variant GFPuv (or cycle-3 mutant; Crameri et al., 1996) from wild-type GFP. Because GFPuv contains three amino acid substitutions, F99S, M153T, and V163A, two gene segments that contained mutations, F99-M153 and M153-V163, were amplified using primers F99S+ (+ denotes the sense primer) and M153T− (− denotes the antisense primer), and M153T+ and V163A−. Fragments were gel-purified, combined, and fused using primers F99S+ and V163A− to produce a fragment carrying the F99S, M153T, and V163A mutations. This fragment was used to replace the GFP gene by MEGAWHOP. This method to introduce three mutations was completed in 2 days, which is considerably faster than performing three rounds of site-directed mutagenesis, which can take more than 4 days to complete.
3.3. Domain-targeted mutagenesis
Nguyen and Daugherty (2005) improved the sensitivity and dynamic range of the FRET (Förster resonance energy transfer) signal in a CFP–YFP pair. They first randomized the YFP sequence by error-prone PCR and the resultant fragment replaced the parent YFP by MEGAWHOP. Screening was performed by FACS-sorting variants that acquired improved ratiometric FRET signal change. Further optimization by random mutagenesis of CFP and YFP yielded a mutant, CyPet-YPet, which exhibited a 20-fold ratiometric FRET signal change, as compared to a threefold change for the parent pair.
Dileepan et al. (2005) applied MEGAWHOP to produce a series of chimeric CD18 genes of bovine and human origins. A portion of the target sequence was amplified from one of the parents and substituted for the corresponding region of the second parent, thereby systematically producing domain-shuffled constructs that were subsequently functionally screened to identify the leukotoxin-binding site. Similarly, Naylor et al. (2006) targeted the N-terminal region of the Mdv1 protein, which is part of the mitochondrial fission machinery in Saccharomyces cerevisiae, to investigate the interaction with Fis1. A segment of the full-length Mdv1 was amplified by error-prone PCR, and the mutant library created by the MEGAWHOP procedure was subjected to subsequent functional analysis to identify the region responsible for the interaction.
3.4. Gene fusion Tabor et al. (2009) applied MEGAWHOP to seamlessly combine gene fragments. A PCR amplicon containing overhangs homologous to a template plasmid was used as a megaprimer and was successfully incorporated into the plasmid sequence. In synthetic biology, recombining gene segments is often performed but is extremely labor intensive. Therefore, MEGAWHOP is a viable alternative.
REFERENCES
Cadwell, R. C., and Joyce, G. F. (1992). Randomization of genes by PCR mutagenesis. PCR Methods Appl. 2, 28–33.
Clark, J. M. (1988). Novel non-templated nucleotide addition reactions catalyzed by procaryotic and eucaryotic DNA polymerases. Nucleic Acids Res. 16, 9677–9686.
Crameri, A., Whitehorn, E. A., Tate, E., and Stemmer, W. P. (1996). Improved green fluorescent protein by molecular evolution using DNA shuffling. Nat. Biotechnol. 14, 315–319.
Dileepan, T., Kannan, M. S., Walcheck, B., Thumbikat, P., and Maheswaran, S. K. (2005). Mapping of the binding site for Mannheimia haemolytica leukotoxin within bovine CD18. Infect. Immun. 73, 5233–5237.
Hu, G. (1993). DNA polymerase-catalyzed addition of nontemplated extra nucleotides to the 3′ end of a DNA fragment. DNA Cell Biol. 12, 763–770.
Kirschner, A., and Bornscheuer, U. T. (2008). Directed evolution of a Baeyer-Villiger monooxygenase to enhance enantioselectivity. Appl. Microbiol. Biotechnol. 81, 465–472.
McClelland, M., and Nelson, M. (1992). Effect of site-specific methylation on DNA modification methyltransferases and restriction endonucleases. Nucleic Acids Res. 20, 2145–2157.
Miyazaki, K., and Takenouchi, M. (2002). Megawhop cloning: A method for creating random mutagenesis libraries by megaprimer PCR of whole plasmid. Biotechniques 33, 1033–1034, 1036–1038.
Naylor, K., Ingerman, E., Okreglak, V., Marino, M., Hinshaw, J. E., and Nunnari, J. (2006). Mdv1 interacts with assembled dnm1 to promote mitochondrial division. J. Biol. Chem. 281, 2177–2183.
Nguyen, A. W., and Daugherty, P. S. (2005). Evolutionary optimization of fluorescent proteins for intracellular FRET. Nat. Biotechnol. 23, 355–360.
Sambrook, J., and Russell, D. W. (2001). Molecular Cloning: A Laboratory Manual. 3rd edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
Stemmer, W. P. (1994). Rapid evolution of a protein in vitro by DNA shuffling. Nature 370, 389–391.
Tabor, J. J., Salis, H. M., Simpson, Z. B., Chevalier, A. A., Levskaya, A., Marcotte, E. M., Voigt, C. A., and Ellington, A. D. (2009). A synthetic genetic edge detection program. Cell 137, 1272–1281.
Yaoi, K., Kondo, H., Hiyoshi, A., Noro, N., Sugimoto, H., Tsuda, S., Mitsuishi, Y., and Miyazaki, K. (2007). The structural basis for the exo-mode of action in GH74 oligoxyloglucan reducing end-specific cellobiohydrolase. J. Mol. Biol. 370, 53–62.
Zhao, H., Giver, L., Shao, Z., Affholter, J. A., and Arnold, F. H. (1998). Molecular evolution by staggered extension process (StEP) in vitro recombination. Nat. Biotechnol. 16, 258–261.
CHAPTER EIGHTEEN

Multiplexed Genome Engineering and Genotyping Methods: Applications for Synthetic Biology and Metabolic Engineering

Harris H. Wang*,† and George M. Church*,†

Contents
1. Introduction
1.1. Iterative engineering of a single chromosomal site
1.2. Multiplexed engineering of multiple chromosomal sites
2. Design Protocol
2.1. Oligonucleotides: Design and procurement
2.2. Designing appropriately scoped MAGE experiments
2.3. Primer design for multiplex allele-specific colony (MASC) PCR
3. Experimental Protocol
3.1. Strains and media
3.2. Supplies/reagents
3.3. MAGE cycling
3.4. Genotyping by multiplex allele specific colony PCR verification
4. Concluding Remarks
Acknowledgments
References
Abstract Engineering at the scale of whole genomes requires fundamentally new molecular biology tools. Recent advances in recombineering using synthetic oligonucleotides enable the rapid generation of mutants at high efficiency and specificity and can be implemented at the genome scale. With these techniques, libraries of mutants can be generated, from which individuals with functionally useful phenotypes can be isolated. Furthermore, populations of cells can be evolved in situ by directed evolution using complex pools of
* Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA
† Wyss Institute for Biologically Inspired Engineering, Harvard University, Massachusetts, USA
Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00018-8
© 2011 Elsevier Inc. All rights reserved.
oligonucleotides. Here, we discuss ways to utilize these multiplexed genome engineering methods, with special emphasis on experimental design and implementation.
1. Introduction
Construction of genomes with highly engineered genetic components is a hallmark challenge and opportunity for synthetic biologists in the postgenomics era. Decreased cost and rising demand for DNA sequencing and oligonucleotide synthesis have created an entire service industry dedicated to reading and writing DNA material (Lipshutz et al., 1999; Shendure and Ji, 2008). DNA synthesized in vitro is now used efficiently to modify genomes (Yu et al., 2000; Zhang et al., 1998), plasmids (Swaminathan et al., 2001; Wang et al., 2009b; Warming et al., 2005), and phages (Marinelli et al., 2008; Thomason et al., 2009) of an expanding list of organisms (Shanks et al., 2009; Swingle et al., 2010; van Kessel et al., 2008) using homologous recombination-based genetic engineering, or recombineering, techniques (Sharan et al., 2009). Large libraries of DNA constructs can be combinatorially incorporated into the genome to test >$10^9$ genetic designs in a highly multiplexed fashion (Wang et al., 2009a). These techniques present opportunities to create organisms with optimally engineered metabolic pathways, regulatory and protein modules, as well as new genetic codes.
The λ-Red (Datsenko and Wanner, 2000) and the similar rac-encoded RecET (Muyrers et al., 2004) homologous recombineering systems have been widely used to introduce genomic modifications into Escherichia coli. The λ-Red system is based on three essential proteins, Exo, Beta, and Gam, from bacteriophage λ (Court et al., 2002). Exo is a 5′ to 3′ exonuclease that digests linear double-stranded DNA (dsDNA), leaving 3′ overhangs that then act as substrates for subsequent recombination events. Beta is a single-stranded DNA (ssDNA) binding protein that facilitates recombination via hybridization of the linear fragment to its genomic complement. Gam inhibits RecBCD activity in vivo, protecting foreign linear dsDNA fragments from degradation. Heterologous expression of other λ-Red protein homologs also leads to increased recombinagenicity in E. coli, suggesting the universality of this mode of genome integration (Datta et al., 2008). Numerous other modified λ-Red constructs have been described and are reviewed elsewhere (Datta et al., 2006; Sawitzke et al., 2007).
Both ssDNA and dsDNA can be used with the λ-Red system to insert novel genetic sequences, introduce mismatches, or delete genes. In dsDNA-based recombineering, which requires Exo, Beta, and Gam, a linear dsDNA cassette with at least 50 bp of flanking homology to the target site is used. The efficiency of double-stranded homologous
recombination can be as high as 0.01% among cells that survive transformation. Isolation of cells harboring a cassette with a selectable phenotype (i.e., antibiotic resistance) is done easily on agar plates to obtain modified mutants at >95% efficiency using a strong selection.
In ssDNA-based recombineering, where only Beta is required, the ssDNA integrates into the genome most efficiently by hybridizing to the exposed lagging strand at the replication fork (Wu et al., 2005; Yu et al., 2003). This manner of integration appears to mimic that of an Okazaki fragment of replicating DNA. Recent evidence suggests that linear dsDNA may be completely converted into an ssDNA intermediate prior to integration into the genome (Maresca et al., 2010; Mosberg et al., 2010). The leading strand can also be targeted with ssDNA, albeit at a 10- to 100-fold lower efficiency than for the lagging strand (Ellis et al., 2001). The incorporation efficiency is highest for ssDNA in the 70–90 bp range, but the ssDNA can be as short as 30 bp, which is the minimum binding size for Beta (Erler et al., 2009). The efficiency of ssDNA-based recombineering can be as high as 25% among cells that survive transformation when the native mismatch repair system is evaded (Costantino and Court, 2003).
Based on these advances, a cyclical and shotgun approach called Multiplex Automated Genome Engineering (MAGE) was developed to simultaneously introduce many chromosomal changes in a combinatorial fashion across a population of cells to generate up to 4 billion genetic variants per day (Wang et al., 2009a). This rapid chromosomal engineering method offers the opportunity both to construct highly modified genomes and to explore large sequence landscapes by directed evolution in a semirational fashion. The general MAGE process (Fig. 18.1) will be detailed extensively in the sections below to provide a useful guide for designing and performing MAGE experiments. While the potential of MAGE is fully realized through automated instrumentation, automation is not required to perform the MAGE protocols described here.
1.1. Iterative engineering of a single chromosomal site The first aspect of MAGE is the iterative application of the ssDNA (or oligo) recombineering protocol on a cell population without the intermediate step of colony isolation for genotyping or phenotyping. While the efficiency of replacing the chromosomal alleles with synthetic oligonucleotides may be high in certain instances (e.g., 1-bp mismatches), the efficiency decreases markedly with increase in size of the replacement. To overcome low efficiency, the oligo-recombineering protocol is iterated on the same cell population over multiple cycles using the same oligo species. In this fashion, the population is enriched for mutants containing the desired sequence conversions. Typically, each full cycle takes 2–3 h depending on the growth rate of the cells. The relative abundance of mutants in the
population M can be approximated by $M = 1 - (1 - RE)^N$, where N is the number of cycles and RE is the allelic replacement efficiency per cycle. RE is highly dependent on the type of target conversion (mismatch, insertion, deletion) and the size of the conversion. General exponential decay functions of empirically determined RE are shown in Table 18.1.
[Figure 18.1 schematic. Panel A: oligo–genome pairings for a mismatch, a deletion, and an insertion. Panel B: a genomic target sequence with its target sites, together with pools of mixed oligos and degenerate (N-containing) oligos directed at those sites. Panel C: the MAGE cycle (growth, replacement, recovery) applied to an input population with ss-oligos, leading to population enrichment and to output for screening, selection, or genotyping.]
Figure 18.1 (A) Recombineering can be used to generate mismatches, insertions, and deletions up to 30 or more bps using a 90 bps oligonucleotide. Larger deletions (kbs) can be achieved at lower efficiency (<$10^{-3}$). (B) Many targets can be multiplexed in the same recombineering reaction using degenerate or mixed oligo pools. (C) General schematic of MAGE process with input population being continually cycled with MAGE. Subpopulations can be removed for assay by genotyping or phenotyping and used as enrichment inoculum for subsequent MAGE cycles.

Table 18.1 Allelic replacement efficiency prediction function based on fitting empirically determined efficiencies from Wang et al. (2009a), where b is the base-pair size of the modification

Replacement type | Replacement size in base-pairs (b) | Multiplier^a ($RE_0$) | Predicted replacement efficiency (RE)
Mismatch | b = 1 to 30 bp | $RE_0 = 0.26$ | $RE = RE_0\, e^{-0.135(b-1)}$
Insertion | b = 1 to 30 bp | $RE_0 = 0.15$ | $RE = RE_0\, e^{-0.075(b-1)}$
Deletion | b = 1 to 30 bp | $RE_0 = 0.23$ | $RE = RE_0\, e^{-0.058(b-1)}$

^a The multiplier $RE_0$ may vary depending on the local contextual features of the target chromosomal site and the formation of secondary structures by the 90-bp oligonucleotide.
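The fitted functions of Table 18.1 and the enrichment expression $M = 1 - (1 - RE)^N$ are simple enough to evaluate directly. The Python sketch below is one way to do so; the dictionary and function names are illustrative choices, and the multipliers are the table values, which, as the footnote notes, may differ for a particular target site.

```python
import math

# (RE0 multiplier, decay constant per base pair) taken from Table 18.1.
RE_PARAMS = {
    "mismatch": (0.26, 0.135),
    "insertion": (0.15, 0.075),
    "deletion": (0.23, 0.058),
}


def replacement_efficiency(kind, size_bp):
    """Predicted per-cycle allelic replacement efficiency RE.

    Implements RE = RE0 * exp(-c * (b - 1)) for b = 1 to 30 bp, the range
    over which Table 18.1 gives the fit.
    """
    if not 1 <= size_bp <= 30:
        raise ValueError("Table 18.1 covers modifications of 1 to 30 bp")
    re0, decay = RE_PARAMS[kind]
    return re0 * math.exp(-decay * (size_bp - 1))


def mutant_fraction(re, cycles):
    """Fraction of the population converted after N cycles: 1 - (1 - RE)^N."""
    return 1 - (1 - re) ** cycles


if __name__ == "__main__":
    for kind in RE_PARAMS:
        re = replacement_efficiency(kind, 6)
        print(kind, round(re, 3), [round(mutant_fraction(re, n), 2) for n in (1, 5, 10)])
```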
Thus, the relative abundance of desired mutants in the population can be easily estimated by defining the number of iterative cycles and the size and type of the desired mutation. Often, the required number of cycles is dictated by the throughput of the genetic screen. Genetic screens can be in the form of direct genotypic methods such as PCR or DNA sequencing, or phenotypic screening or selection methods such as colorimetry, growth rate, or antibiotic resistance. The number of cycles N needed to produce a mutation of size b base-pairs at a frequency of at least F in the population can be estimated by

$$N = \frac{\log(1 - F)}{\log(1 - RE)}. \tag{18.1}$$
For example, the number of cycles needed to generate mutants with a 6 bp chromosomal mismatch to a frequency of 0.25 (i.e., 25%) in the population with an oligo folding energy of 5.4 kcal/mol (predicted through MFold; Markham and Zuker, 2005) is N ¼ log(1 0.25)/log (1 0.26 e 0.135 5) ¼ 2.0 cycles, and to a frequency of 0.50 (i.e., 50%) is N ¼ 4.9 cycles. Thus, one would expect from a PCR screen that at least one in four cells would show conversion after two cycles and one in two would show conversion after five cycles of oligo-recombineering. Another useful application is the generation of a large number of variants at one particular genomic site, such as to make promoter or ribosomal binding site (RBS) variants or to mutagenize the active site of an enzyme. Using oligos with the same flanking homology arms but different mutation sequences, the same chromosome site can be targeted across all cells in the population. At every MAGE cycle, the conversion frequency of the population to a new mutant genotype is determined by RE. For example, to introduce a 7-bp consecutive or nonconsecutive mismatch to a promoter region (RE ¼ 0.1), we could potentially generate 108 promoter variants in a population of 109 cells (a typical MAGE population size) every cycle. In this example, the actual oligo pool complexity is 47 ¼ 16, 384, so on average each variant is found in 6100 cells in the population after each cycle. After one cycle, however, 90% of the cells in the population still contain the wild-type promoter sequence. Iterative cycling of the same population with the degenerate oligo pool will reduce the abundance of the wild-type sequence, which is (1 RE)N. For high oligo pool complexities (> 109), the population should be cycled multiple times to generate all possible variants. 
It is important to note that because the population is constantly changing after each MAGE cycle, the total sequence space that can be explored is much greater than the carrying capacity ($10^9$) of the cycled population at any cycle. Therefore, the number of variants generated is dependent on the number of MAGE cycles. This feature of MAGE can be especially useful when simultaneously targeting different chromosomal sites, discussed in Section 1.2.
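The single-site estimates above can be checked numerically. The following is a minimal sketch, assuming the empirical fit RE = 0.26·exp(−0.135(b − 1)) that appears in the worked example; the function names are illustrative and not part of any published tool.

```python
import math

def replacement_efficiency(mismatch_bp: int) -> float:
    """Empirical fit quoted in the worked example: RE = 0.26 * exp(-0.135 * (b - 1))."""
    return 0.26 * math.exp(-0.135 * (mismatch_bp - 1))

def cycles_needed(target_frequency: float, re: float) -> float:
    """Eq. (18.1): cycles N needed to reach mutant frequency F at efficiency RE."""
    return math.log(1.0 - target_frequency) / math.log(1.0 - re)

def wild_type_fraction(re: float, cycles: int) -> float:
    """Fraction of the population that is still wild type after N cycles."""
    return (1.0 - re) ** cycles

def cells_per_variant(re: float, population: float, pool_complexity: int) -> float:
    """Average number of cells carrying each variant after a single cycle."""
    return re * population / pool_complexity

re_6bp = replacement_efficiency(6)
print(f"RE for a 6-bp mismatch: {re_6bp:.3f}")
print(f"Cycles to reach 25%: {cycles_needed(0.25, re_6bp):.1f}")
print(f"Cycles to reach 50%: {cycles_needed(0.50, re_6bp):.1f}")

# Degenerate 7-bp promoter pool: 4^7 variants in a population of 1e9 cells
complexity = 4 ** 7
print(f"Pool complexity: {complexity}")
print(f"Cells per variant per cycle: {cells_per_variant(0.1, 1e9, complexity):.0f}")
print(f"Wild type remaining after 1 cycle: {wild_type_fraction(0.1, 1):.2f}")
```

Calling `wild_type_fraction` with larger cycle counts shows how quickly repeated cycling depletes the wild-type sequence for a given RE.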
Harris H. Wang and George M. Church
1.2. Multiplexed engineering of multiple chromosomal sites

In Section 1.1, we described how to assess MAGE cycling to target one chromosomal site. More frequently, one would want to simultaneously target multiple chromosomal sites. Several advantages arise by multiplexing. First, many different variants can be combinatorially generated and screened/selected all at once from a single population. Second, the mechanism of oligo-mediated allelic replacement allows multiple sites to be simultaneously converted during each MAGE cycle. For this shotgun approach, a mixed pool of oligo species that target different chromosomal sites is used. Multiplex engineering of up to 40 chromosomal sites can be easily done, while at higher pool diversity (100s–1000s of different species) oligo–oligo interactions may potentially begin to inhibit the reaction.

Simultaneous allelic manipulation of $k \ge 1$ different genomic locations, each with an average efficiency of replacement of $RE_{av}$, can be modeled as a binomial process, assuming that replacement operates independently across all loci (no linkage association). Here, the probability of replacement at any one location is $p_N = 1 - (1 - RE_{av})^N$, and the probability of finding exactly m variants is

$$P(m\ \text{variants}) = \binom{k}{m}\, p_N^{\,m}\, (1 - p_N)^{k-m}.$$

Under typical conditions, this will be well approximated by the Gaussian distribution

$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2} \qquad (18.2)$$

where the average number of mutations is $\mu = k\,(1 - (1 - RE_{av})^N)$ and the variance of the distribution is $\sigma^2 = k\,(1 - RE_{av})^N\,(1 - (1 - RE_{av})^N)$. To estimate the frequency with which one can find cells with at least m mutations after N cycles, the Standard Normal Table or the Gaussian error function can be used to estimate the size of the tail to the right of m using the mean and variance above.
To determine the number of cycles N needed to produce m mutants at a particular abundance in the population, we need to analyze $\mu + Z\sigma$, which is

$$m = k\left[1 - (1 - RE_{av})^N\right] + Z\sqrt{k\,(1 - RE_{av})^N\left[1 - (1 - RE_{av})^N\right]} \qquad (18.3)$$

where Z is the Z-score based on the Standard Normal Table. If one finds the m for which the tail size is 1/20 (5%) of the entire distribution, one will on average find one cell among 20 in which there are at least m mutations. For a standard Gaussian distribution, the point at which the right tail of the distribution is 5% of the whole occurs at Z = 1.645. Therefore, using the Gaussian approximation, the value of m that meets this condition is estimated by $\mu + 1.645\sigma$.
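As an illustration of Eqs. (18.2) and (18.3), the sketch below computes the Gaussian threshold m for a chosen tail size and, for comparison, the exact binomial tail that the Gaussian is approximating. The function names are hypothetical, and the Z-score is taken from Python's `statistics.NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist

def per_locus_conversion(re_av: float, cycles: int) -> float:
    """p_N = 1 - (1 - RE_av)^N, the converted fraction at one locus after N cycles."""
    return 1.0 - (1.0 - re_av) ** cycles

def top_fraction_threshold(k: int, re_av: float, cycles: int, tail: float) -> float:
    """Eq. (18.3): m = mu + Z * sigma, with Z chosen so the right tail equals `tail`."""
    p = per_locus_conversion(re_av, cycles)
    mu = k * p
    sigma = math.sqrt(k * p * (1.0 - p))
    z = NormalDist().inv_cdf(1.0 - tail)
    return mu + z * sigma

def binomial_tail(k: int, p: float, m_min: int) -> float:
    """Exact probability of at least m_min converted loci out of k (binomial model)."""
    return sum(math.comb(k, j) * p**j * (1.0 - p) ** (k - j) for j in range(m_min, k + 1))

k, re_av = 10, 0.026
for cycles in (5, 10, 20):
    p = per_locus_conversion(re_av, cycles)
    m = top_fraction_threshold(k, re_av, cycles, tail=0.05)
    exact = binomial_tail(k, p, math.ceil(m))
    print(f"N={cycles:2d}  p_N={p:.2f}  m(top 5%)={m:.1f}  exact P(>= {math.ceil(m)}) = {exact:.3f}")
```

Because k is small, the exact tail printed in the last column can differ noticeably from the nominal 5%, which is one reason to treat the Gaussian threshold as an estimate.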
Multiplexed Genome Engineering and Genotyping Methods
We further illustrate these calculations in Table 18.2 for N = 5, 10, and 20 for a situation in which oligos are multiplexed to introduce stop-codon nonsense mutations into 10 target genes to knock out their function (k = 10). Here, the overall RE is 0.26 and we assume that RE per locus is $RE_{av} = RE/k = 0.26/10 = 0.026$ because of the shared 10-plex oligo pool. This illustration shows how m increases with N. We find that five MAGE cycles (N = 5) would be sufficient to produce mutants with at least 2.9 knockouts ($m \ge 2.9$) at an abundance of 5% in the population (corresponding to Z = 1.645). Twenty cycles would be sufficient to enrich for mutants with at least 6.7 knockouts at the same abundance of 5% (also illustrated in Fig. 18.2). Note that a tail size of 1/20 or 5% means that one can have 95% confidence of finding a cell with at least m mutations among 59 cells, as determined by P(not finding an m mutant among s cells) $= (1 - 0.05)^s < 0.05$, which implies $s > \log(0.05)/\log(0.95)$ or s > 58.4. Methods to screen for these mutants are discussed later.

Each locus in a multilocus-targeting reaction can also be multiplexed. For example, cells with multiple promoter variants for each gene of a multicomponent pathway can be combinatorially generated in the population. A mixture of knockouts, RBS changes, promoter modulation, and protein coding sequence modifications can be multiplexed through a single oligo pool. Economically, the cost of generating oligonucleotides with degenerate sequences by column-based DNA synthesis is the same as the cost of generating an oligo of a specific sequence.
Coupled with automation systems to continuously cycle populations of cells, MAGE holds the potential to turn genome engineering from a laboratory-based method into a scalable platform comparable in scale and throughput to large modern-day DNA synthesis and sequencing services.

Table 18.2 A list of variables to consider for a 10-target MAGE reaction to introduce single base-pair mutations ($RE_{av} = RE/k = 0.026$) as a function of the number of MAGE cycles, with $p_N$ = abundance level of each of the 10 target loci, $\mu$ = average number of accumulated mutations in each cell, $\sigma$ = standard deviation of the number of mutations, and m = the number of mutations in the top 5%, 2%, and 1% of cells in the population

| Number of MAGE cycles (N) | 5 | 10 | 20 |
|---|---|---|---|
| $p_N = 1 - (1 - RE_{av})^N = 1 - (1 - 0.026)^N$ | 0.12 | 0.23 | 0.41 |
| $\mu = k p_N = 10 p_N$ | 1.23 | 2.31 | 4.10 |
| $\sigma = \sqrt{k p_N (1 - p_N)} = \sqrt{10 p_N (1 - p_N)}$ | 1.04 | 1.33 | 1.56 |
| Top 5% clones, $m = \mu + 1.645\sigma$ (95% screening confidence: 59 cells to screen) | 2.9 | 4.5 | 6.7 |
| Top 2% clones, $m = \mu + 2.054\sigma$ (95% screening confidence: 149 cells to screen) | 3.4 | 5.1 | 7.3 |
| Top 1% clones, $m = \mu + 2.326\sigma$ (95% screening confidence: 298 cells to screen) | 3.7 | 5.4 | 7.7 |

The number of cells needed to isolate a clone with at least m mutations is provided at 95% confidence.

[Figure 18.2: plot of relative abundance in population (F, y-axis, 0.00–0.40) against the number of mutations per cell (j-mutants, x-axis, 0–10) for N = 5, N = 10, and N = 20, with $\mu$, $\mu + Z\sigma$, the screening specificity at 1 in 20, and the best clones marked.]

Figure 18.2 Relative abundance of cells containing j mutations in the population, where $0 \le j \le 10$, k = 10, and $RE_{av} = RE/k = 0.026$. For a cumulative distribution value of 0.95 (ability to screen and identify mutations in the top 5% of clones), the Z-score is 1.645, giving a likely isolation of clones containing 6–7 mutations after 20 MAGE cycles.
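The Z-scores and screening depths that accompany Table 18.2 follow from two short formulas, sketched here under the assumption that clones are sampled independently. The helper names are illustrative. Note that a strict ceiling for the 1% tail gives 299 cells, one more than the 298 listed in the table, which corresponds to just under 95% confidence.

```python
import math
from statistics import NormalDist

def z_for_tail(tail_fraction: float) -> float:
    """Z-score whose right tail under the standard normal equals tail_fraction."""
    return NormalDist().inv_cdf(1.0 - tail_fraction)

def cells_to_screen(tail_fraction: float, confidence: float = 0.95) -> int:
    """Smallest s with (1 - tail)^s < 1 - confidence, i.e. s > log(1 - conf) / log(1 - tail)."""
    miss_probability = 1.0 - confidence
    return math.ceil(math.log(miss_probability) / math.log(1.0 - tail_fraction))

for tail in (0.05, 0.02, 0.01):
    print(f"top {tail:.0%}: Z = {z_for_tail(tail):.3f}, screen {cells_to_screen(tail)} cells")
```

Raising `confidence` or shrinking `tail_fraction` shows how quickly the screening burden grows, which is the trade-off against running additional MAGE cycles.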
2. Design Protocol

2.1. Oligonucleotides: Design and procurement

Lagging strand targeting: Oligonucleotides should be designed to target the lagging strand of replicating DNA (Fig. 18.3a). Since replication in E. coli is bidirectional, care should be taken to ensure that the oligo sequence designed targets the lagging strand. The origin of replication (oriC) in E. coli is located at positions 3923767–3923998 (Blattner et al., 1997) and the dif terminus is at 1588774–1588801. If the target chromosomal position is on replichore 1 (>3923998 or <1588774), then the oligo sequence should be the complementary sequence to the (+) strand sequence. If the target chromosomal locus is on replichore 2 (>1588774 and <3923998), then the oligo sequence should be the same sequence as the (+) strand sequence (i.e., the complementary sequence to the (−) strand).
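The replichore rule can be expressed as a small helper. This is a minimal sketch that uses the coordinate boundaries exactly as quoted above; the function names are hypothetical, and it assumes that "complementary sequence" means the reverse complement once the oligo is written 5′ to 3′ for ordering.

```python
# Coordinate boundaries as quoted in the text for E. coli MG1655
ORIC_END = 3923998
DIF_START = 1588774

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence, written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def replichore(position: int) -> int:
    """Return 1 or 2 according to the position ranges given in the text."""
    if position > ORIC_END or position < DIF_START:
        return 1
    if DIF_START < position < ORIC_END:
        return 2
    # Positions exactly on a quoted boundary are not assigned by the rule
    raise ValueError(f"position {position} falls on a replichore boundary")

def lagging_strand_oligo(plus_strand_seq: str, position: int) -> str:
    """Oligo sequence (5' to 3') that targets the lagging strand at this position."""
    if replichore(position) == 1:
        return reverse_complement(plus_strand_seq)
    return plus_strand_seq

print(lagging_strand_oligo("ATGC", 100000))    # replichore 1
print(lagging_strand_oligo("ATGC", 2000000))   # replichore 2
```

Targets that lie inside the oriC or dif regions themselves would need manual inspection rather than this simple range test.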
[Figure 18.3: schematic with three panels. (A) Circular chromosome showing oriC at 3.9 Mb, dif at 1.6 Mb, replichores 1 and 2, and the oligo orientation relative to the (+) and (−) strands for replichore 1 and replichore 2 oligo design. (B) A 90 bp oligo with 4 phosphorothioated bases and >15 bp of homology at each end. (C) Predicted oligo secondary structure; if there is strong secondary structure, the sequence is shifted toward the 3′ terminus, changing the 42 bp/6 bp/42 bp layout to 30 bp/6 bp/54 bp.]

Figure 18.3 Optimal oligonucleotide design. (A) Design of oligos to target the lagging strand, based on the location of the target site on the chromosome. (B) Optimally efficient oligos should be 90 bps with at least 15 bps of homologous sequences to the target region on both the 5′ and 3′ ends. The target mutation sequence should be placed at the center of the oligo whenever possible. Four phosphorothioated bases should be used at the 5′ terminus of the oligo to reduce its degradation rate in vivo. (C) The secondary structure of the oligo should be assessed using MFold (Markham and Zuker, 2005) or another folding prediction algorithm. If the folding energy $\Delta G_{ss} < -12.5$ kcal/mol, redesign the oligo by shifting the sequence toward the 3′ or 5′ terminus (3′ is preferred due to higher error rates of 5′ sequences during oligo synthesis). A minimum of 15 bps of homology should be left at the ends. Mutations nearing the termini are less frequently incorporated.
Secondary structure optimization: Oligonucleotides can often form hairpin structures that inhibit allelic replacement because the homology arms are not available for hybridization. In general, we recommend ensuring that the oligo design has a folding energy that is no less than −12.5 kcal/mol as predicted by MFold (Markham and Zuker, 2005) on default values. If the folding energy reaches this prohibitive value, the oligo can be redesigned by shifting the mutation site toward the 3′ terminus of the oligo, thereby potentially disrupting the local hairpin structures (Fig. 18.3c). In general, a shift to the 3′ end is preferable to a shift to the 5′ end. Because oligo synthesis proceeds 3′ to 5′, the likelihood of retaining the mutation sequence is higher toward the 3′ end, where truncations and errors are less prevalent. At least 15 bps of homology should be left on each end of the oligo, as mutations at the distant arms are less likely to be incorporated into the chromosome due to chew back of the oligo ends during integration (H.H. Wang, unpublished results).

Mismatch repair evasion: When active, the mismatch repair (MMR) machinery converts mutations generated by the oligos back to the wild-type sequence. To avoid reversion, the EcNR2 or EcHW24 strain can be used, in which the MMR system is inactivated through a mutS knockout. The drawback of this approach is the higher background mutation rate ($10^{-8}$) of a ΔmutS strain versus the wild type ($10^{-10}$). Alternatively, in the presence of MMR, incorporation of silent mutations near the mutation site that are poorly recognized by mutS (e.g., C–C pairs) can increase efficiency (Costantino and Court, 2003). Furthermore, utilization of modified bases not recognized by mutS can also increase efficiency in the presence of an intact MMR system (Wang et al., 2011c). Since MMR can only repair short segments of mutations (<6 bps), larger mutations naturally evade repair.

Synthetic oligonucleotides: In general, 90-bp oligos appear to produce the highest allelic replacement efficiency. Longer oligos tend to form more inhibitory hairpin structures and are more costly to synthesize. Shorter oligos are less efficient due to lower hybridization energy to the chromosomal target. Up to four phosphorothioated bases should be used at the 5′ terminus of the oligo to prevent exonuclease degradation of the oligos inside the cell (Fig. 18.3b). Absence of phosphorothioate protection can lead to a two- to threefold decrease in efficiency. For most applications, oligos with standard purification should suffice, although in certain applications PAGE/HPLC-purified oligos may be needed. Typically, oligos can be obtained through a commercial oligonucleotide synthesis vendor in 2–3 days (e.g., Integrated DNA Technologies, USA).
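The design rules in this section (90 bp length, at least 15 bp of homology on each side, folding energy no lower than −12.5 kcal/mol, four phosphorothioated 5′ bases) can be collected into a simple checklist. The sketch below is illustrative only: the class and function names are hypothetical, the folding energy must be supplied by the user from MFold or a similar tool, and the `*` notation for phosphorothioate bonds is an assumed vendor-style convention that should be checked against the ordering system actually used.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OligoDesign:
    left_arm: str               # 5' homology arm
    mutation: str               # mutated bases to introduce
    right_arm: str              # 3' homology arm
    folding_energy_kcal: float  # predicted externally (e.g., MFold); not computed here

    @property
    def sequence(self) -> str:
        return self.left_arm + self.mutation + self.right_arm

def design_issues(design: OligoDesign, length: int = 90, min_arm: int = 15,
                  min_energy: float = -12.5) -> List[str]:
    """Return a list of rule violations; an empty list means the design passes."""
    issues = []
    if len(design.sequence) != length:
        issues.append(f"length is {len(design.sequence)}, expected {length}")
    if len(design.left_arm) < min_arm or len(design.right_arm) < min_arm:
        issues.append(f"each homology arm should be at least {min_arm} bases")
    if design.folding_energy_kcal < min_energy:
        issues.append("folding energy too low; shift the mutation toward the 3' end")
    return issues

def with_phosphorothioates(seq: str, n_bonds: int = 4) -> str:
    """Mark the first n_bonds 5' bases with '*' (assumed notation for phosphorothioate bonds)."""
    return "*".join(seq[:n_bonds]) + "*" + seq[n_bonds:]

centered = OligoDesign("A" * 42, "CCCCCC", "G" * 42, folding_energy_kcal=-5.4)
print(design_issues(centered))
print(with_phosphorothioates(centered.sequence)[:12])
```

A shifted 30/6/54 layout, as drawn in Fig. 18.3c, passes the same checks as long as its recomputed folding energy stays above the threshold.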
2.2. Designing appropriately scoped MAGE experiments

One needs to weigh several factors when determining the scope of a MAGE experiment. First, the size of the mutations determines the allelic replacement efficiency. Second, the number of targets determines the complexity of the oligo pool. These two factors affect the overall number of MAGE cycles that will be required. Third, the throughput of the genotyping/phenotyping method affects the degree to which mutants in the population need to be enriched before they can be successfully isolated. In general, 100–200 colonies can be easily screened by multiplex allele-specific PCR to query 10–20 target alleles simultaneously. Increased screening capacity decreases the number of MAGE cycles required for mutant enrichment.

Detailed example: Let us design an experiment in which we will attempt to explore 64 RBS variants (xxxNNNxxxATG) upstream of each of 5 genes of a biosynthesis pathway to tune gene expression. Our genotyping/phenotyping method is plating on agar and observing a colorimetric indicator change. We can therefore distinguish one mutant on a plate of 1000 cells. Using Table 18.1, we first estimate the allelic replacement efficiency RE of this 3-bp mismatch to be $RE = 0.26\,e^{-0.135(3-1)} = 0.198$, or 19.8%. To produce mutants in >50% of our population, we use Eq. (18.1) to find that we require $N = \log(1 - 0.5)/\log(1 - 0.198) \approx 3$ cycles. The full complexity of the oligo pool is $64^5 = 1.07 \times 10^9$, therefore we will have to explore about half of these sequences after three MAGE cycles. So, we want to isolate mutants that contain mutations in at least m out of the five possible RBS locations on our indicator plate. Our cumulative distribution value is (1 − 1/1000) = 0.999, so our Z-score is 3.08. Using Eq. (18.3), we find that with 10 cycles,

$$m = 5\left[1 - (1 - 0.198/5)^{10}\right] + 3.08\sqrt{5\,(1 - 0.198/5)^{10}\left[1 - (1 - 0.198/5)^{10}\right]} = 3.4,$$

meaning that 1 in 1000 cells that are found on the plate will contain mutations in three or four RBS positions out of a possible five. Under these considerations, 10–15 MAGE cycles may be required in total to fully explore the sequence space to produce successful mutants.
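The arithmetic of the detailed example can be retraced step by step. The sketch below assumes the same empirical RE fit and Eq. (18.3); the function names are illustrative. Evaluated directly with the stated inputs, Eq. (18.3) returns m ≈ 4.9 for Z = 3.08, whereas the quoted value of 3.4 is what the same expression returns for Z = 1.645 (the 1-in-20 threshold), so both are printed for comparison.

```python
import math

def replacement_efficiency(mismatch_bp: int) -> float:
    """Empirical fit used in the example: RE = 0.26 * exp(-0.135 * (b - 1))."""
    return 0.26 * math.exp(-0.135 * (mismatch_bp - 1))

def cycles_needed(target_frequency: float, re: float) -> float:
    """Eq. (18.1)."""
    return math.log(1.0 - target_frequency) / math.log(1.0 - re)

def mutations_at_z(k: int, re_total: float, cycles: int, z: float) -> float:
    """Eq. (18.3) with RE_av = RE / k."""
    re_av = re_total / k
    p = 1.0 - (1.0 - re_av) ** cycles
    return k * p + z * math.sqrt(k * p * (1.0 - p))

re_3bp = round(replacement_efficiency(3), 3)
print(f"RE for a 3-bp mismatch: {re_3bp}")
print(f"Cycles to convert >50% at one site: {cycles_needed(0.5, re_3bp):.1f}")
print(f"Pool complexity for 5 sites x 64 variants: {64 ** 5:.3g}")
print(f"m after 10 cycles, Z = 3.08:  {mutations_at_z(5, re_3bp, 10, 3.08):.1f}")
print(f"m after 10 cycles, Z = 1.645: {mutations_at_z(5, re_3bp, 10, 1.645):.1f}")
```

Which threshold applies depends on the real screening capacity: Z = 1.645 corresponds to picking the best of about 20 colonies, Z = 3.08 to the best of about 1000.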
2.3. Primer design for multiplex allele-specific colony (MASC) PCR

Allele-specific PCR can be used to directly query genotypes. Two forward primers are designed for each query target: one that is specific only to the wild-type sequence, primer_f(wt), and another that is specific only to the mutant sequence, primer_f(mut). Both forward primers share the same reverse primer (primer_r). The specificity is designed into the 3′ terminus of the forward primer. For single nucleotide polymorphism (SNP) detection, the 3′ base is either the wild-type base or the mutant base. Thus, for a colony containing the wild-type SNP, the f(wt)/r and not the f(mut)/r primer pair should amplify a PCR product. Conversely, for a colony containing the mutant SNP, the f(mut)/r and not the f(wt)/r primer pair should amplify a PCR product. This reaction can be multiplexed across more than 10 sites in a single PCR by designing primer pairs with amplicons of different lengths to distinguish each SNP. In general, we advise designing primer pairs that amplify at sizes of 100, 150, 200, 250, 300, 400, 500, 600, 700, and 850 bps, which can produce clearly distinguishable bands on a 1.5% agarose gel. The primer pairs should be designed to a Tm of 62 °C (Zhang), but the actual Tm of the MASC-PCR is determined through a gradient PCR.
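The primer layout described above can be sketched as follows. This is an illustration only: the fixed primer length of 20 bases is an assumption (the text specifies a target Tm of 62 °C instead, which this sketch does not compute), and the function names and target names are hypothetical.

```python
# Recommended amplicon sizes (bp) for distinguishing up to 10 targets on one gel
AMPLICON_SIZES_BP = [100, 150, 200, 250, 300, 400, 500, 600, 700, 850]

def allele_specific_forward_primers(template: str, snp_index: int,
                                    mutant_base: str, length: int = 20):
    """Return (f_wt, f_mut): two forward primers that differ only in the 3' base.

    template    : (+) strand sequence around the target, 5' to 3'
    snp_index   : 0-based index of the SNP within template
    mutant_base : base introduced by the MAGE oligo at that index
    length      : primer length in bases (assumed here; tune to the target Tm)
    """
    start = snp_index - length + 1
    if start < 0:
        raise ValueError("not enough upstream sequence for a primer of this length")
    stem = template[start:snp_index]
    return stem + template[snp_index], stem + mutant_base

def assign_amplicon_sizes(targets):
    """Pair each target with a distinct amplicon size so bands separate on the gel."""
    if len(targets) > len(AMPLICON_SIZES_BP):
        raise ValueError("more targets than recommended amplicon sizes")
    return dict(zip(targets, AMPLICON_SIZES_BP))

template = "ACGTACGTACGTACGTACGTACGTA"
f_wt, f_mut = allele_specific_forward_primers(template, snp_index=21, mutant_base="T")
print(f_wt)
print(f_mut)
print(assign_amplicon_sizes(["geneA", "geneB", "geneC"]))
```

The reverse primer for each target would then be placed so that the amplicon has the assigned size; that placement depends on the local sequence and is not attempted here.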
3. Experimental Protocol

3.1. Strains and media

The protocol described here is optimized for E. coli MG1655 derivatives EcNR1, EcNR2, and EcHW24 (Wang et al., 2009a). EcNR1 contains a chromosomally integrated λ-prophage construct (based on DY330 (Yu et al., 2000)) fused to the bla gene for ampicillin resistance. The λ-Red construct (containing exo, beta, and gam) is integrated at the bioA/bioB gene locus and is temperature inducible by brief heat shock at 42 °C. EcNR2 is an EcNR1 derivative with ΔmutS::cat for chloramphenicol resistance. EcHW24 is an EcNR1 derivative with mutS inactivated by stop codon mutations at amino acids 189 and 191. All strains must be grown at 30–32 °C. For rich media, LB-Lennox (10 g/L tryptone, 5 g/L yeast extract, 5 g/L NaCl) is used with the appropriate antibiotics, chloramphenicol (cat), kanamycin (kan), or carbenicillin (carb) at concentrations of 20, 30, or 50 µg/mL, respectively. Standard M9 minimal media supplemented with D-biotin (0.25 µg/mL) can also be used.
3.2. Supplies/reagents

– Rotator drum in 30–32 °C incubator
– Shaking water bath at 42 °C
– Ice bucket with ice/water mixture
– Distilled sterile water (chilled)
– Synthetic oligonucleotides (in 50 µL dH2O at 0.05–50 µM, chilled)
– Microcentrifuge tubes (chilled)
– 1-mL and 200-µL pipettes and pipette tips (chilled)
– Tabletop centrifuge (at 4 °C)
– Electroporation system and electroporation cuvettes or plates
– Glass culture tubes with prewarmed LB-Lennox
3.3. MAGE cycling

On the preceding day, streak out the appropriate strain (e.g., EcNR2) on an agar plate and allow colonies to grow overnight at 32 °C.

If initiating a new MAGE experiment:
Step 1. Pick a colony into a glass tube with 3 mL of LB-Lennox media and place in a rotator drum spun at 300 rpm in a 32 °C incubator.

If continuing from a previously paused MAGE cycle:
Step 1. Take 100 µL of the overnight MAGE cell culture from the 32 °C incubator or the 4 °C storage and dilute into a glass tube with 3 mL of LB-Lennox media and place in a rotator drum spun at 300 rpm in a 32 °C incubator. This step ensures that the cells, which are in stationary phase, can recover back into exponential phase growth. The remaining cell culture can be stored or discarded.

Step 2. Once cells have reached mid-exponential growth phase as determined by OD600 nm of 0.6–0.7, place the culture tube in a 42 °C shaking water bath for 15 min. This step ensures that the λ-Red system is properly induced. Lengthening the temperature induction is undesirable as the Gam protein is highly toxic to the cells when expressed for >20 min.
Step 3. After the 15-min 42 °C induction, immediately place cells in an ice-water bucket and cool by gentle swirling for 30–60 s. Induced cells can stay on ice for up to 3 h prior to the next step.
Step 4. Making electrocompetent cells (this step should be done at 4 °C if possible):
a. Place 1 mL of culture in a prechilled 1.5 mL microcentrifuge tube. Spin the tube in a tabletop centrifuge at 13,000 × g for 30 s. The remaining 2-mL culture can be discarded or frozen in 15% glycerol at −80 °C for future assays.
b. Remove the LB supernatant from the tube and resuspend the pellet in 1 mL of prechilled sterilized distilled H2O by pipetting up and down; do not vortex the cells.
c. Spin the tube in a tabletop centrifuge at 13,000 × g for 30 s.
d. Remove the H2O supernatant from the tube and resuspend the pellet in another 1 mL of prechilled dH2O by pipetting up and down; do not vortex the cells.
e. Spin the tube in a tabletop centrifuge at 13,000 × g for 30 s.
f. Remove the H2O supernatant from the tube and add 50 µL of oligos to the pellet. The maximum amount of oligos that can be added to the cell pellet without arcing the electroporation reaction is 140 µg (50 µM). Typically, 2 µM in 50 µL is used. The lower limit that still produces detectable replacement efficiency is 0.05 µM. For highly complex oligo pools, a 20 µM total oligo concentration is advised to ensure that higher numbers of oligos reach each cell.
g. Place the cell–oligo mixture in a prechilled 1 mm gap electroporation cuvette. Remove the cuvette from ice and dry its sides with a paper towel prior to electroporation.
Step 5. Transform oligos into the cells by electroporation using a standard electroporation pulse generator (e.g., Bio-Rad MicroPulser, BTX ECM-830). For a 1 mm gap cuvette, use settings: 1.8 kV, 200 Ω, 25 µF. For a 2 mm gap cuvette, use settings: 2.5 kV, 200 Ω, 25 µF. The time constant for the electroporation should be >4.0 ms.
Step 6. After electroporation, immediately add 1 mL of LB-Lennox to the cuvette and transfer to a glass tube containing 2 mL of LB-Lennox, resulting in the standard 3 mL growth volume.
Step 7. Allow cells to recover and grow back into mid-exponential growth phase. The bulk of the MAGE cycle time is dominated by this posttransformation recovery phase, which is where the oligos are being incorporated into the chromosome. An adequate number of cell divisions (>4) is required for segregation of the mutant allele, which may take 2–3 h depending on the growth rate. Furthermore, only 1–5% of the cells in the population survive electroporation. Therefore, outgrowth after transformation is required to repopulate the culture to the appropriate density. The end of this recovery phase marks the end of one MAGE cycle.
a. If continuing MAGE cycles, go to Step 2 and wait for OD600 nm to reach 0.6–0.7.
b. If pausing MAGE cycles, continue to grow the culture into stationary phase. Keep at 32 °C for <1 day of storage and at 4 °C for ≥1 day of storage. Paused cultures can be restarted by 1:30 dilution into fresh LB-Lennox (see Step 1).
c. If cultures are being plated for colony isolation, recover for at least 3 h to allow all recombinant genomes to segregate prior to plating.

In general, 3–4 MAGE cycles can be carried through per day. Multiple independent cultures can be run simultaneously. Up to 48 cultures can potentially be cycled at a time using 2.2 mL 96-well plates, 8-channel multichannel pipettes, and 96-well electroporation plates and pulse generators (BTX ECM 830, Model 45-0421 or Lonza Nucleofector 96-well Shuttle System).
3.4. Genotyping by multiplex allele specific colony PCR verification

A gradient MASC-PCR should first be run to determine the optimal Tm for the rest of the PCRs. The multiplex primer mix should contain primers (up to 20) at an individual primer concentration of 0.2 µM. Two separate PCRs should be run, one containing f(wt)/r and the other containing f(mut)/r. An optimized multiplex PCR kit is recommended (Qiagen Cat #206143). The PCR is highly sensitive to template concentration. In general, using 1 µL of a 1:100 dH2O dilution of a saturated culture or a single colony is recommended. The Tm optimized by gradient PCR is best used for the specific dilution tested, and the gradient may need to be repeated for other dilutions or template preparations. Generally, a gradient Tm ranging from 61 to 69 °C is used, although finer ranges are also acceptable. An example of a gel containing a gradient MASC-PCR is shown in Fig. 18.4. To choose the optimal Tm, we want to ensure that all bands can be adequately amplified, and that there is binary specificity of the f(wt)/r and f(mut)/r primer sets.

The MASC-PCR cycles are as follows (using a Taq polymerase):
Step 1: 95 °C for 15 min
Step 2: 94 °C for 30 s
Step 3: 61–69 °C (gradient) or Tm (optimal) for 30 s
Step 4: 72 °C for 80 s
Step 5: go to Step 2 for 26 times
Step 6: 72 °C for 5 min
Step 7: 4 °C hold

Generally, 20 µL PCRs are suggested. Xylene cyanol loading dye is added to each reaction (a dye that migrates at high molecular weight, so as not to interfere with bands <1000 bp) and 10 µL is loaded into each lane of an ethidium bromide-stained 1.5% agarose gel. The gel is run by electrophoresis at 180 V for 60–70 min and subsequently analyzed on a Gel Documentation System. To increase the throughput of PCR screening, utilization of 96-well PCR blocks, 200-lane gel electrophoresis setups (e.g., BioRad Subcell Model 192), and multichannel pipetting and gel loading is highly recommended.

[Figure 18.4: two gel images. Top: gradient MASC-PCR flanked by 100 bp ladders, with paired (−)/(+) lanes at gradient Tm values of 61.0, 61.2, 61.6, 62.3, 63.3, 64.5, 65.8, 67.0, 67.8, 68.4, 68.9, and 69.0 °C and the optimal Tm marked. Bottom: high-throughput MASC-PCR screening of isolates 1–12, each with paired (−)/(+) lanes.]

Figure 18.4 Example of a gradient MASC-PCR (top gel). Symbol (−) denotes amplification of the template with the f(wt)/r primer set and (+) with the f(mut)/r primer set. The optimal melting temperature Tm is chosen based on high specificity (i.e., either the f(wt)/r or the f(mut)/r primer set amplifies) and strong signal (i.e., visible bands). Here, we determined that 65.8 °C < optimal Tm < 67.0 °C (denoted by arrows). A large number of colonies can be screened directly using MASC-PCR to isolate variants generated combinatorially (bottom gel). Here, all clones except #2 and #4 have unique combinations of 10 targeted mutations.
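The MASC-PCR cycling conditions can be written down as data, which makes it easy to estimate block time when planning a screening day. This is a minimal sketch with hypothetical names; it assumes that "go to Step 2 for 26 times" means Steps 2–4 run 27 times in total, excludes ramp times and the final 4 °C hold, and uses 66.0 °C purely as an example annealing temperature within the optimal window reported for Fig. 18.4.

```python
def masc_pcr_program(annealing_temp_c: float, repeats: int = 26):
    """Return the program as a list of (label, temperature in deg C, seconds) tuples."""
    cycled = [
        ("denature", 94.0, 30),
        ("anneal", annealing_temp_c, 30),
        ("extend", 72.0, 80),
    ]
    program = [("initial denaturation", 95.0, 15 * 60)]
    # One first pass plus `repeats` returns to the denaturation step
    program += cycled * (repeats + 1)
    program.append(("final extension", 72.0, 5 * 60))
    return program

def run_time_minutes(program) -> float:
    """Total programmed hold time in minutes (ramping not included)."""
    return sum(seconds for _, _, seconds in program) / 60.0

program = masc_pcr_program(annealing_temp_c=66.0)
print(f"{len(program)} programmed holds, about {run_time_minutes(program):.0f} min on the block")
```

Real run times will be longer because of temperature ramping, which varies between thermocyclers.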
4. Concluding Remarks

Recombineering-based genome engineering provides a powerful approach for constructing and modifying chromosomes synthetically. As the cost of oligonucleotide synthesis continues to drop and automation capacities continue to expand, efficient "on-the-fly" manipulation of a living organism's genome will continue to improve. With the MAGE platform, existing genomic templates are used as scaffolds to produce newly engineered variants. An important aspect of template-based genome engineering is the benefit from the natural selection process as new genomes evolve by directed steps from existing functional genomes. Genome engineering approaches coupled with de novo synthesis methods (Chan et al., 2005; Gibson et al., 2010; Menzella et al., 2005; Tian et al., 2004) will continue to offer an expanding capability to engineer living organisms at the resolution of single nucleotides, but scaled across the entire genome and beyond.
ACKNOWLEDGMENTS

The authors wish to thank John Aach for helpful discussions and careful reading of the manuscript. This work was funded by the Wyss Institute for Biologically Inspired Engineering, the National Science Foundation, the U.S. Department of Energy, and the Defense Advanced Research Projects Agency.
REFERENCES

Blattner, F. R., Plunkett, G., Bloch, C. A., Perna, N. T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J. D., Rode, C. K., Mayhew, G. F., et al. (1997). The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462.
Chan, L. Y., Kosuri, S., and Endy, D. (2005). Refactoring bacteriophage T7. Mol. Syst. Biol. 1(2005), 0018.
Costantino, N., and Court, D. L. (2003). Enhanced levels of lambda Red-mediated recombinants in mismatch repair mutants. Proc. Natl. Acad. Sci. USA 100, 15748–15753.
Court, D. L., Sawitzke, J. A., and Thomason, L. C. (2002). Genetic engineering using homologous recombination. Annu. Rev. Genet. 36, 361–388.
Datsenko, K. A., and Wanner, B. L. (2000). One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. USA 97, 6640–6645.
Datta, S., Costantino, N., and Court, D. L. (2006). A set of recombineering plasmids for gram-negative bacteria. Gene 379, 109–115.
Datta, S., Costantino, N., Zhou, X., and Court, D. L. (2008). Identification and analysis of recombineering functions from Gram-negative and Gram-positive bacteria and their phages. Proc. Natl. Acad. Sci. USA 105, 1626–1631.
Ellis, H. M., Yu, D., DiTizio, T., and Court, D. L. (2001). High efficiency mutagenesis, repair, and engineering of chromosomal DNA using single-stranded oligonucleotides. Proc. Natl. Acad. Sci. USA 98, 6742–6746.
Erler, A., Wegmann, S., Elie-Caille, C., Bradshaw, C. R., Maresca, M., Seidel, R., Habermann, B., Muller, D. J., and Stewart, A. F. (2009). Conformational adaptability of Redbeta during DNA annealing and implications for its structural relationship with Rad52. J. Mol. Biol. 391, 586–598.
Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., Benders, G. A., Montague, M. G., Ma, L., Moodie, M. M., et al. (2010). Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56.
Lipshutz, R. J., Fodor, S. P., Gingeras, T. R., and Lockhart, D. J. (1999). High density synthetic oligonucleotide arrays. Nat. Genet. 21, 20–24.
Maresca, M., Erler, A., Fu, J., Friedrich, A., Zhang, Y., and Stewart, A. F. (2010). Single-stranded heteroduplex intermediates in lambda Red homologous recombination. BMC Mol. Biol. 11, 54.
Marinelli, L. J., Piuri, M., Swigonova, Z., Balachandran, A., Oldfield, L. M., van Kessel, J. C., and Hatfull, G. F. (2008). BRED: A simple and powerful tool for constructing mutant and recombinant bacteriophage genomes. PLoS ONE 3, e3957.
Markham, N. R., and Zuker, M. (2005). DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res. 33, W577–W581.
Menzella, H. G., Reid, R., Carney, J. R., Chandran, S. S., Reisinger, S. J., Patel, K. G., Hopwood, D. A., and Santi, D. V. (2005). Combinatorial polyketide biosynthesis by de novo design and rearrangement of modular polyketide synthase genes. Nat. Biotechnol. 23, 1171–1176.
Mosberg, J. A., Lajoie, M. J., and Church, G. M. (2010). Lambda Red recombination in Escherichia coli occurs through a fully single-stranded intermediate. Genetics. Available at: http://www.ncbi.nlm.nih.gov/pubmed/20813883.
Muyrers, J. P., Zhang, Y., Benes, V., Testa, G., Rientjes, J. M., and Stewart, A. F. (2004). ET recombination: DNA engineering using homologous recombination in E. coli. Methods Mol. Biol. 256, 107–121.
Sawitzke, J. A., Thomason, L. C., Costantino, N., Bubunenko, M., Datta, S., and Court, D. L. (2007). Recombineering: In vivo genetic engineering in E. coli, S. enterica, and beyond. Meth. Enzymol. 421, 171–199.
Shanks, R. M., Kadouri, D. E., MacEachran, D. P., and O'Toole, G. A. (2009). New yeast recombineering tools for bacteria. Plasmid 62, 88–97.
Sharan, S. K., Thomason, L. C., Kuznetsov, S. G., and Court, D. L. (2009). Recombineering: A homologous recombination-based method of genetic engineering. Nat. Protoc. 4, 206–223.
Shendure, J., and Ji, H. (2008). Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135–1145.
Swaminathan, S., Ellis, H. M., Waters, L. S., Yu, D., Lee, E. C., Court, D. L., and Sharan, S. K. (2001). Rapid engineering of bacterial artificial chromosomes using oligonucleotides. Genesis 29, 14–21.
Swingle, B., Markel, E., Costantino, N., Bubunenko, M. G., Cartinhour, S., and Court, D. L. (2010). Oligonucleotide recombination in Gram-negative bacteria. Mol. Microbiol. 75, 138–148.
Thomason, L. C., Oppenheim, A. B., and Court, D. L. (2009). Modifying bacteriophage lambda with recombineering. Methods Mol. Biol. 501, 239–251.
Tian, J., Gong, H., Sheng, N., Zhou, X., Gulari, E., Gao, X., and Church, G. (2004). Accurate multiplex gene synthesis from programmable DNA microchips. Nature 432, 1050–1054.
van Kessel, J. C., Marinelli, L. J., and Hatfull, G. F. (2008). Recombineering mycobacteria and their phages. Nat. Rev. Microbiol. 6, 851–857.
Wang, H. H., Isaacs, F. J., Carr, P. A., Sun, Z. Z., Xu, G., Forest, C. R., and Church, G. M. (2009a). Programming cells by multiplex genome engineering and accelerated evolution. Nature 460, 894–898.
Wang, S., Zhao, Y., Leiby, M., and Zhu, J. (2009b). A new positive/negative selection scheme for precise BAC recombineering. Mol. Biotechnol. 42, 110–116.
Wang, H. H., Xu, G., Vonner, A. J., and Church, G. M. (2011c). Modified bases enable high-efficiency oligonucleotide-mediated allelic replacement via mismatch repair evasion. Nucleic Acids Res. (in press).
Warming, S., Costantino, N., Court, D. L., Jenkins, N. A., and Copeland, N. G. (2005). Simple and highly efficient BAC recombineering using galK selection. Nucleic Acids Res. 33, e36.
Wu, X. S., Xin, L., Yin, W. X., Shang, X. Y., Lu, L., Watt, R. M., Cheah, K. S., Huang, J. D., Liu, D. P., and Liang, C. C. (2005). Increased efficiency of oligonucleotide-mediated gene repair through slowing replication fork progression. Proc. Natl. Acad. Sci. USA 102, 2508–2513.
Yu, D., Ellis, H. M., Lee, E. C., Jenkins, N. A., Copeland, N. G., and Court, D. L. (2000). An efficient recombination system for chromosome engineering in Escherichia coli. Proc. Natl. Acad. Sci. USA 97, 5978–5983.
Yu, D., Sawitzke, J. A., Ellis, H., and Court, D. L. (2003). Recombineering with overlapping single-stranded DNA oligonucleotides: Testing a recombination intermediate. Proc. Natl. Acad. Sci. USA 100, 7207–7212.
Zhang, K. "Oligonucleotide Tm Calculator." Web Resource. Sept 2010. http://arep.med.harvard.edu/kzhang/cgi-bin/myOligoTm.cgi.
Zhang, Y., Buchholz, F., Muyrers, J. P., and Stewart, A. F. (1998). A new logic for DNA engineering using recombination in Escherichia coli. Nat. Genet. 20, 123–128.
CHAPTER NINETEEN
Construction and Manipulation of Giant DNA by a Genome Vector
Mitsuhiro Itaya and Kenji Tsuge

Contents
1. Introduction
2. Basics for B. subtilis as a Novel Host
   2.1. BGM vectors
   2.2. Recipes for Bsu168 competent cell preparation
   2.3. Standard transformation of Bsu168
   2.4. Isolation of Bsu168 genome DNA in liquid
   2.5. Isolation of Bsu168 genome in agarose block
   2.6. Enzyme reaction on DNA in gel blocks
3. Large DNA Reconstruction via Small DNA Fragments Assembly in the BGM Vector
   3.1. Domino method
   3.2. Applications of the domino method to chloroplast genome
   3.3. Other domino related methods
   3.4. Retrieval of the large DNA out of BGM
   3.5. Trouble shooting
4. Assembly of Multiple DNA Fragments with Designed Order Connecting by a Few Bases
   4.1. Fragment design
   4.2. Cloning of PCR product in E. coli plasmid
   4.3. Careful preparation of fragments for OGAB method
   4.4. Molar concentration adjustment
   4.5. OGAB assembly
   4.6. Plasmid DNA extraction from Bsu168
   4.7. OGAB specific trouble shooting
5. Future Perspectives
Acknowledgments
References
Institute for Advanced Biosciences of Keio University, Nipponkoku, Tsuruoka-shi, Yamagata, Japan Methods in Enzymology, Volume 498 ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00019-X
© 2011 Elsevier Inc. All rights reserved.
Abstract
Now that the complete sequences of many genomes have been determined, current studies are increasingly focused on unveiling the global networks of gene products (RNA, proteins, and metabolites) that support cellular life. Understanding whole gene networks will require handling not just a few recombinant genes but many genes at a time, or even an entire genome. Ideally, genomes would be handled as freely as genes; however, two barriers separate gene-scale from genome-scale manipulation: the intrinsic fragility of giant DNA in the test tube, and the size limit of conventional cloning vector systems that rely on the prevailing cloning host, Escherichia coli. The eubacterium Bacillus subtilis has been offered as an alternative host for particularly large DNA or genomes, relying on its inherent ability to take up external DNA and integrate it into its own genome via homologous recombination. The Bacillus GenoMe (BGM) vector, derived from the 4,200-kbp genome of B. subtilis 168, has been demonstrated to accommodate fairly large DNAs, as highlighted by the successful stable cloning of the whole 3,500-kbp genome of the nonpathogenic, unicellular photosynthetic bacterium Synechocystis and, in principle, any DNA of known sequence. This chapter highlights the clear differences in cloning concept and practical manipulation from conventional systems, focusing on methodological aspects as plainly as possible. We also show that B. subtilis provides a further opportunity: the assembly of a large number of DNA fragments with remarkably high efficiency. The new workhorse described here represents a technical breakthrough leading to the concept of designing desired genomes, even from scratch. The system not only offers unprecedented opportunities for addressing important contemporary issues in biotechnology but may also stimulate new ways of thinking across diverse fields of biology.
1. Introduction
DNA cloning is the first step in investigating genes and gene function in the cell. Cloning and manipulation of small DNA has been highly successful, owing to substantial efforts built on the Escherichia coli host-based cloning system and to recent advances in DNA synthesis in the test tube (Bang and Church, 2008; Gibson et al., 2008, 2010a,b). DNA, a polymer composed of four deoxyribonucleotides, is fragile in liquid. Because of this physicochemical nature, naked DNA handled in liquid is exposed to physical shearing and breaks into smaller pieces. This shearing is less prominent for DNA below 50 kbp, referred to as small DNA in this chapter. Small DNA is suitable for manipulation in the test tube with restriction, ligation, and modification enzymes and serves as a PCR template. In contrast, large DNA, defined in this chapter as DNA above 50 kbp, is difficult to handle undamaged in the test tube. Thus, the idea of connecting small DNAs inside a host cell to reconstruct the large DNA of interest was examined. E. coli lacks effective manipulation tools for large DNA, with the one exception of the BAC/E. coli cloning system. The BAC (bacterial artificial chromosome) plasmid, derived from the F plasmid, increased the clonable DNA size in E. coli (Shizuya et al., 1992). However, its practical upper limit for routine handling is around 300 kbp (Gibson et al., 2009). Other host cells are needed for larger DNA. To date, two hosts have proved suitable for cloning and amplifying large DNA above 1,000 kbp and up to 3,500 kbp: Saccharomyces cerevisiae, baker's yeast, and Bacillus subtilis Marburg 168, a Gram-positive bacterium (Fig. 19.1). As the yeast system is detailed in the chapter by Gibson in this volume, the principles and methods for the use of B. subtilis are the focus below. It should be noted that large DNA is prepared not only for remodeling the genomes of present cells but also as a versatile tool for delivering large mutations or structural perturbations to beneficial prokaryotes, animals, plants, and other eukaryotes.
[Figure 19.1 panel labels: domino clone in pBR322 (4.3 kbp; Section 2.1); competent Bsu168 (Sections 2.2 and 2.3); BGM vector; homologous recombination (Section 3.1); BGM clone; retrieval (Fig. 19.3).]
Figure 19.1 BGM vectors and integration of a single domino. Bsu168 has a single-copy genome per cell. A pBR322 plasmid, drawn as open circular and closed arrows with an antibiotic marker shown as a small open oval on top, is integrated into Bsu168 to convert it into a BGM vector. The pBR322 in the BGM genome serves to mediate integration of a domino clone by two homologous recombinations (X). Closed triangles indicate another marker for BGM selection. See Fig. 19.3 for details of retrieval.
2. Basics for B. subtilis as a Novel Host
In the mid-1980s, we began investigating the strain B. subtilis Marburg 168, designated Bsu168 hereafter, which develops natural competence, allowing naked DNA molecules to be delivered easily into the bacterium (Spizizen, 1958; Dubnau, 1999). Fortuitously, Bsu168 possesses no native plasmid. Although certain plasmids were found to replicate in Bsu168, our research gradually focused on integrating DNA directly into the genome. This idea led to a new concept: the Bsu168 genome itself can serve as a platform for depositing and carrying DNA (Itaya, 1993, 1995a,b).
2.1. BGM vectors
The Bsu168 genome was developed to serve as a unique cloning vehicle. To serve as a genome vector, the Bsu168 genome must possess a cloning site for guest DNA, because guest DNA generally has no homology with the host Bsu168 genome. Integration of the guest DNA therefore requires preinstalled sequences that mediate the homologous recombination reactions. The E. coli plasmid vectors pBR322 and BAC, neither of which can replicate as a plasmid in Bsu168, were chosen for this purpose. A strain that carries either of these, as illustrated in Fig. 19.1, is called a BGM vector, standing for Bacillus GenoMe vector (Itaya et al., 2000). As BGM vectors are all derived from Bsu168, genetic methods for Bsu168 can be applied immediately to BGM. Accommodated large DNA replicates stably as part of the Bsu168 genome irrespective of its size and origin.
2.2. Recipes for Bsu168 competent cell preparation
Luria–Bertani (LB) medium, used for E. coli cultivation, is also the standard culture medium for Bsu168. In practice, RM125, a Bsu168 derivative that lacks the restriction-modification system, is used as Bsu168 throughout our work (Uozumi, 1977). Bsu168 competent cells were prepared as follows. A stationary-phase culture of Bsu168 in LB (50 µL) was inoculated into 950 µL of TFI medium. TFI medium was prepared by mixing 100 mL of 10× Spizizen solution (Anagnostopoulos and Spizizen, 1961) [140 g of K2HPO4, 60 g of KH2PO4, 20 g of (NH4)2SO4, and 10 g of trisodium citrate·2H2O per liter], 10 mL of 50% (w/v) glucose, 10 mL of 2% (w/v) MgSO4·7H2O, 10 mL of 2% (w/v) casamino acids, 10 mL of 5 mg/mL tryptophan, 10 mL of 5 mg/mL arginine, 10 mL of 5 mg/mL leucine, 10 mL of 5 mg/mL threonine, and 830 mL of water, followed by filtration. After shaking with vigorous aeration for 4–5 h at 37 °C to give an OD600 reading of about 0.6, 400 µL of the culture was transferred into 3.6 mL of TFII medium. TFII medium was prepared by mixing 100 mL of 10× Spizizen solution [140 g of K2HPO4, 60 g of KH2PO4, 20 g of (NH4)2SO4, and 10 g of trisodium citrate·2H2O per liter], 10 mL of 50% (w/v) glucose, 10 mL of 2% (w/v) MgSO4·7H2O, 5 mL of 2% (w/v) casamino acids, 1 mL of 5 mg/mL tryptophan, 1 mL of 5 mg/mL arginine, 1 mL of 5 mg/mL leucine, 1 mL of 5 mg/mL threonine, and 871 mL of water, followed by filtration. After 90 min of incubation at 37 °C, the cells were collected by centrifugation (8000 × g, 5 min) at 4 °C. The cells were gently suspended in 250 µL of TFD medium containing 15% (v/v) glycerol and either used immediately for transformation or stored at −70 °C. TFD medium was prepared by mixing 6.25 mL of [2 g of (NH4)2SO4 and 1 g of trisodium citrate·2H2O per 50 mL], 0.625 mL of [14 g of K2HPO4 and 6 g of KH2PO4 per 50 mL], and 1.25 mL of 50% (w/v) glucose, and adjusting the volume to 100 mL.
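The TFI and TFII recipes are each written for a 1,000-mL batch. As a minimal sketch of how the component volumes might be checked and scaled to a smaller batch, the following Python snippet encodes the two lists. The names TFI, TFII, total_volume, and scale_recipe are illustrative and not part of the original protocol. The 100 mL entry for 10× Spizizen solution in TFII and the 5 mg/mL concentration of the TFII threonine stock are assumptions, inferred from the parallel TFI recipe and from the volumes summing to 1,000 mL.

```python
# Component volumes in mL per 1,000-mL batch, transcribed from the recipes.
TFI = {
    "10x Spizizen solution": 100,
    "50% (w/v) glucose": 10,
    "2% (w/v) MgSO4.7H2O": 10,
    "2% (w/v) casamino acids": 10,
    "5 mg/mL tryptophan": 10,
    "5 mg/mL arginine": 10,
    "5 mg/mL leucine": 10,
    "5 mg/mL threonine": 10,
    "water": 830,
}

TFII = {
    "10x Spizizen solution": 100,  # assumption: volume inferred, see lead-in
    "50% (w/v) glucose": 10,
    "2% (w/v) MgSO4.7H2O": 10,
    "2% (w/v) casamino acids": 5,
    "5 mg/mL tryptophan": 1,
    "5 mg/mL arginine": 1,
    "5 mg/mL leucine": 1,
    "5 mg/mL threonine": 1,        # assumption: stock concentration inferred
    "water": 871,
}


def total_volume(recipe):
    """Return the summed volume (mL) of all components."""
    return sum(recipe.values())


def scale_recipe(recipe, target_ml):
    """Scale every component so that the batch totals target_ml."""
    factor = target_ml / total_volume(recipe)
    return {name: round(vol * factor, 3) for name, vol in recipe.items()}


if __name__ == "__main__":
    for label, recipe in (("TFI", TFI), ("TFII", TFII)):
        print(label, "total:", total_volume(recipe), "mL")
    for name, vol in scale_recipe(TFII, 100).items():
        print(f"{vol:8.3f} mL  {name}")
```

Scaling by a single factor keeps all final concentrations unchanged, which is the only property this sketch relies on.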
2.3. Standard transformation of Bsu168
In a 1.5-mL microtube, 50 µL of TFD medium, 1.25 µL of 1 M MgCl2, 1.25 µL of 2% (w/v) MgSO4, 5 µL of DNA solution, and 6.25 µL of competent cells were added in this order and incubated at 37 °C for 30 min. Incubation was continued for 1 h at 37 °C after the addition of 200 µL of LB containing chloramphenicol at 200 ng/mL or erythromycin at 10 ng/mL to induce the respective antibiotic resistance gene. Cells were then spread on plates containing the appropriate antibiotics. Domino DNA, usually 200 ng per standard transformation, yielded dozens of transformants or more.
2.4. Isolation of Bsu168 genome DNA in liquid
Genomic DNA for PCR amplification and Southern hybridization is prepared as follows. A BGM clone is inoculated into 3 mL of LB medium in a 13-mL plastic test tube and incubated with agitation for 15–17 h. Cells are harvested by centrifugation (3000 × g for 5 min), suspended in 0.7 mL of sucrose–lysozyme solution (10 mM Tris–HCl (pH 8.0), 20% sucrose, 23 mM EDTA, and 0.9 mg/mL lysozyme), and incubated at 37 °C for 30 min. Seventy microliters of 10 mg/mL pronase E solution (preincubated at 37 °C for 30 min immediately before use to activate the enzyme) is added. After incubation at 37 °C for 30 min, 0.77 mL of SDS solution (TE buffer with 0.9% SDS) is added and mixed gently until the solution becomes transparent. One milliliter of TE-saturated phenol is added and mixed gently, the mixture is centrifuged at 3000 × g for 5 min, and the supernatant is transferred to a new test tube. Gentle mixing of the solution with 3 mL of ethanol produces white strings of precipitated DNA. These are picked up with the tip of a micropipette, transferred to a 1.5-mL microtube, rinsed with 0.9 mL of 70% ethanol, and dried at 42 °C for 20 min. Finally, 600 µL of TE is added to the DNA, and the tube is gently rotated until the DNA dissolves completely, which may take a few days. The DNA solution is used for Southern analyses and as a template for PCR.
2.5. Isolation of Bsu168 genome in agarose block
Intact genomic DNA is prepared in agarose gel blocks as previously described (Itaya and Tanaka, 1991). Five milliliters of a BGM clone culture, prepared as above, is harvested by centrifugation (3000 × g for 10 min). The pellet should be suspended by vigorous vortexing. Five microliters of RNase A solution (10 mg/mL in TE), 200 µL of lysozyme solution (10 mg/mL in TE), 200 µL of STET [50 mM Tris–HCl (pH 8.0), 50 mM EDTA, 8% sucrose, and 5% (v/v) Triton X-100], and 0.9 mL of 1.5% low-gelling-temperature agarose (Sigma, Type VII, preincubated at 42 °C) are added to the BGM suspension in this order, mixed well, and poured into a well of a 12-well plate. After the agarose has solidified at 4 °C for 30 min, the plate is incubated at 37 °C for 2 h for the lysozyme reaction. The agarose plug is divided into six gel blocks (4 × 5 × 7.5 mm) with a sterilized spatula and transferred to a 13-mL plastic test tube. Two milliliters of proteinase K buffer (150 mM EDTA and 100 mM Tris–HCl (pH 9.5)) and 100 µL of 10 mg/mL proteinase K are added to the tube, which is incubated at 50 °C for 17 h. The solution is replaced with 5 mL of TE, and the tube is rotated at room temperature for 30 min to dialyze the gel blocks; this wash is repeated at least five times. After the TE is discarded, 0.9 mL of TE supplemented with 9 µL of PMSF solution (0.1 M phenylmethylsulfonyl fluoride in ethanol) is added, and the tube is incubated with gentle rotation at room temperature for 1 h. The PMSF solution is replaced with 5 mL of TE, and the tube is rotated at room temperature for 30 min to dialyze the gel blocks; this wash is repeated at least three times. Gel blocks are stored in 5 mL of TE at 4 °C until use.
2.6. Enzyme reaction on DNA in gel blocks
For restriction endonuclease digestion and subsequent electrophoretic separation, a gel block is trimmed with a sterilized razor blade to the size of the sample well of the separation gel. For equilibration, two of the gel blocks are soaked in 45 µL of restriction enzyme buffer for 15 min. The buffer is replaced with 50 µL of new buffer supplemented with 20 units of the relevant restriction enzyme and incubated at the reaction temperature for 17 h. To ensure complete digestion, another 10 units of enzyme is added, and incubation is continued for an additional 5 h. The gel blocks are dialyzed for 15 min against TE buffer; this wash is repeated at least three times. Finally, one of the two gel blocks is placed in the well of the separation gel, immobilized with 1.5% low-gelling-temperature agarose solution, and subjected to pulsed-field gel electrophoresis. We use a contour-clamped homogeneous electric field (CHEF) apparatus (Chu et al., 1986; Itaya and Tanaka, 1991). Alternatively, the gel block can be melted at 65 °C and loaded into the well like a liquid sample; this method gives high resolution for fragments smaller than 200 kbp.
3. Large DNA Reconstruction via Small DNA Fragments Assembly in the BGM Vector
3.1. Domino method
BGM vectors carry the preinstalled sequence of pBR322 (Itaya, 1993, 1995a,b; Itaya and Kaneko, 2010; Itaya et al., 2003, 2008) or BAC (Itaya et al., 2000; Itaya and Kaneko, 2010; Kaneko et al., 2003, 2005, 2009) in the genome (Fig. 19.1). Section 3.2 details the methods worked out for the pBR322-based system. The prerequisite is a set of small DNA fragments that completely cover the guest genome. These small fragments, cloned in a special pBR322 plasmid in E. coli, are called domino clones. The special E. coli pBR322 plasmid should contain an antibiotic marker selectable in Bsu168.

[Figure 19.2 panel labels: guest genome; domino design (Sections 3.1 and 3.2); preparation of dominos I, II, III, and IV (Sections 3.1 and 4.2); first domino integration (Fig. 19.1 and Sections 2.2, 2.3, and 3.2); second domino extension (this figure and Sections 3.1 and 3.2); confirmation of guest genome structure (Sections 2.4, 2.5, and 2.6); inserts I, I + II, I + II + III, and I + II + III + IV.]
Figure 19.2 Domino procedures. The first domino, possessing two antibiotic markers (open oval and closed triangle), integrates as shown in Fig. 19.1. The guest genome is covered by dominos I–IV in the first region. Integration of the second domino (II) is forced to use the internal overlap region and the pBR322 sequence shown by the closed thick arrow in the BGM genome. Selection by the alternative marker (closed oval and triangle) elongates the internal DNA to I + II and eliminates the previous marker. As elongation continues, pBR322 parts always remain flanking the insert.

The first domino integration is mediated by the pBR322 sequences only. As shown in Fig. 19.2, progressive integration of neighboring dominos is guided by two homologous recombination events, one in the region overlapping the previous domino and the other in the half of the pBR322 sequence that always remains at the elongating end. Selection for the marker associated with the new domino eliminates the previous marker, which can then be reused as the selection marker in the next step. This cycle of domino integration can in principle be repeated an unlimited number of times, as long as the dominos carry alternating selection markers at the end to be elongated.
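The alternation of selection markers in the domino cycle can be sketched as a small bookkeeping routine. This is only an illustration of the logic described above, under the simplifying assumption that exactly two interchangeable markers are used; the function name domino_selection_plan and the placeholder marker names marker_A and marker_B are hypothetical and do not come from the original protocol.

```python
def domino_selection_plan(n_dominos, markers=("marker_A", "marker_B")):
    """List, for each integration step, which marker is selected for
    and which marker from the previous step is lost.

    Assumes two alternating markers: the domino integrated at each step
    carries the marker that the previous domino did not carry.
    """
    if n_dominos < 1:
        raise ValueError("at least one domino is required")
    plan = []
    for i in range(n_dominos):
        selected = markers[i % 2]
        # The first integration replaces no earlier domino marker.
        lost = markers[(i - 1) % 2] if i > 0 else None
        plan.append({"step": i + 1, "select_for": selected, "marker_lost": lost})
    return plan


if __name__ == "__main__":
    for row in domino_selection_plan(4):
        print(row)
```

Because the marker lost at one step is exactly the marker selected at the next, two markers suffice for any number of cycles in this idealized picture.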
3.2. Applications of the domino method to chloroplast genome
The chloroplast genome (cpGenome) of rice, a crop plant, is the best example demonstrated to date (Itaya et al., 2008); it is a circular molecule of 134.5 kbp. A total of 31 domino clones covering the entire sequence of the rice cpGenome were designed and obtained. Each pBR322-based domino carries one of the two antibiotic markers. The domino design, inserts of 6 kbp on average sharing 1-kbp overlaps with the inserts of both adjacent dominos, was chosen empirically. Inserts were amplified by PCR, in a manner similar to that described in Section 4.2, using partially purified rice chloroplast DNA solution as the template. The effectiveness of progressively connecting all the dominos in the BGM was fully demonstrated by complete cloning of the rice cpGenome. Note that the last domino should have no overlap with the first one. Structural variants of the cpGenome in the BGM vector can be obtained by choosing which dominos come first and last. The pBR322-based dominos supply relatively small inserts, up to 10 kbp. The use of large DNA inserts, particularly those above 100 kbp supplied by BAC, is expected to accelerate the elongation rate dramatically. We constructed BGM vectors specifically for use with BAC dominos. However, currently available BAC libraries are made in common BAC vectors that carry no marker suitable for selection in Bsu168. The BAC-specific BGM has been shown to accept a single BAC clone whose insert comes from mouse (Itaya et al., 2000; Kaneko et al., 2003, 2009) or plant mitochondria (Itaya and Kaneko, 2010; Kaneko et al., 2005). Our unpublished ongoing results indicate that the BAC-based domino method works just like the pBR322-based one, although efficiency is inversely proportional to insert size and depends on the length of the overlap.
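To make the domino geometry concrete, the following sketch tiles a linearized genome with uniform inserts and a fixed overlap. It is an idealized illustration, not the published design: the function name tile_dominos is hypothetical, all inserts are assumed to have the same length, and the circular cpGenome is treated as linear because the last domino should not overlap the first.

```python
def tile_dominos(genome_len, insert_len, overlap):
    """Return (start, end) coordinates of inserts covering a linear genome.

    Consecutive inserts share `overlap` bases; the final insert is
    truncated at the genome end. Coordinates are 0-based, end-exclusive.
    """
    if not 0 <= overlap < insert_len:
        raise ValueError("overlap must be non-negative and shorter than the insert")
    step = insert_len - overlap
    dominos = []
    start = 0
    while True:
        end = min(start + insert_len, genome_len)
        dominos.append((start, end))
        if end == genome_len:
            break
        start += step
    return dominos


if __name__ == "__main__":
    # Rice cpGenome size and the average design values quoted in the text.
    tiles = tile_dominos(genome_len=134_500, insert_len=6_000, overlap=1_000)
    print(len(tiles), "uniform dominos;", "last:", tiles[-1])
```

With uniform 6-kbp inserts and 1-kbp overlaps this idealized tiling needs fewer clones than the 31 reported for the real design, in which insert sizes vary around the average.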
3.3. Other domino related methods
3.3.1. Inchworm elongation methods
A similar method, the "inchworm elongation method," was also devised, in which the gap left between two separate fragments is filled in directly by natural guest DNA (Itaya et al., 2003, 2005). As the gap ranges from about 40 to 50 kbp, the elongation rate is several times faster than that of pBR322-based dominos. The cycle of gap production and fill-in resembles the walk of an inchworm, leaving the elongated part of the guest genome in the BGM cloning site. Applying this cycle step by step along the BGM vector allowed accommodation of up to 3,500 kbp of DNA from Synechocystis PCC6803 (Itaya et al., 2005). However, because the inchworm elongation method requires a guest genome solution of high quality and high concentration, its application has been limited to cases where such a genomic DNA solution is available.
3.4. Retrieval of the large DNA out of BGM
3.4.1. Direct isolation in test tube
DNA molecules assembled in the BGM have to be isolated for general applications, in particular biomaterial production. Of the three methods described in Fig. 19.3, digestion by an endonuclease followed by isolation and purification of the cloned segment appears to be the simplest and most straightforward. Because recognition sites for the rare-cutting endonuclease I-PpoI (23 bases, ATGACTCTCTTAA/GGTAGCCAAA) have been preinstalled at both ends of the pBR322 sequence of all the BGM vectors (Fig. 19.3), the linearized DNA produced by I-PpoI digestion is readily isolated from an agarose gel resolved by pulsed-field gel electrophoresis and can then be concentrated in liquid form. We have demonstrated this manipulation for various insert DNAs, including, at present, a 355-kbp mouse genomic DNA (Itaya and Tanaka, 1997a; Kaneko et al., 2005, 2009). Because the preparation in agarose gel blocks is simple, the method has been of great value.

[Figure 19.3 panel labels: BGM clone; retrieval methods of Sections 3.4.1, 3.4.2, and 3.4.3.]
Figure 19.3 Retrieval of the insert from BGM. The three methods are illustrated as detailed in the corresponding text sections. Final DNA structures are shown in dotted rectangles. The genome dissection on the right has been applied not to the insert of BGM clones but to the Bsu168 genome itself.

3.4.2. Retrieval by copying the insert of the BGM
In contrast to direct DNA isolation by purely physical means, a more elaborate genetic process has also been developed. As illustrated in the middle of Fig. 19.3, this method, called Bacillus Recombinational Transfer (BReT) (Tsuge and Itaya, 2001), copies the cloned insert out of the BGM and transfers it to a newly delivered recipient, the pBR322-based BReT plasmid. The copying process starts with homologous recombination at the two pBR322 halves in the BGM vector part (shown by X). Complete copying circularizes the linearized BReT plasmid, which then starts to replicate in the BGM. The specially constructed pBR plasmid is composed of three sequences: pBR322 for the homologous regions, antibiotic markers for selection, and a replication origin for Bsu168 (Tsuge and Itaya, 2001). A rare-cutting enzyme such as NotI, whose recognition site is positioned in the middle of the pBR322 sequence, linearizes the BReT plasmid, which is then used directly for transformation of BGM clones. The transformation recipe and the plasmid extraction protocol are the same as described in Sections 2.3 and 4.6, respectively. Owing to its technical simplicity, the BReT system has been used to retrieve the inserts of certain BGM clones, for example, the 48.5-kbp lambda genome (Tsuge and Itaya, 2001) and organelle genomes from mitochondria and chloroplasts of up to 135 kbp (Itaya et al., 2008). Even parts of the Bsu168 genome have been copied out (Tomita et al., 2004). We think that the maximum insert size depends on the ability of the replicon. Larger DNA remains to be examined.
3.4.3. Direct isolation in vivo
It would be more straightforward to separate the genome of a BGM clone into the insert and the Bsu168 part and to isolate the former, as shown in Fig. 19.3. It is possible to dissect the circular Bsu168 genome (4,200 kbp) into two circular parts of 300 and 3,900 kbp by somewhat complicated genetic manipulations (Itaya and Tanaka, 1997). Whether the dissection protocol can be applied to newly inserted genomes remains a subject of our ongoing and future work.
3.5. Trouble shooting
3.5.1. Bsu168 cells and DNA
Although competent cells can be stored with 15% (v/v) glycerol in a −70 °C freezer, freshly prepared cells are recommended. Domino clones prepared from E. coli can in most cases be used directly for BGM transformation. BAC clones carrying large DNA, however, are sensitive to nucleases that contaminate the preparation during biochemical purification from E. coli and need careful purification (Kaneko et al., 2005). For preservation of BGM clones, a simple and low-cost recipe has been proposed that takes advantage of their ability to form endospores. Plates on which BGM colonies have been streaked are left at room temperature for weeks or even months. This period ensures that the agar plate dries up completely. Spotting 25 µL of sterile water on the traces of colonies on the dried agar plate recovers BGM spores, which form colonies the next day on a fresh wet plate. Colonies recovered through spores are identical to the initial BGM clones (Kaneko et al., 2005).
3.5.2. Restriction modification
Bsu168 possesses an inherent restriction and modification system. The BsuR–BsuM system recognizes the same sequence as XhoI (Ohtani et al., 2008). BGM strains derived from a restriction-deficient Bsu168 mutant completely avoid this problem (Itaya, 2009; Itaya and Tanaka, 1997; Itaya et al., 2005, 2008).
3.5.3. Gaps
If some dominos are unavailable, gaps are formed in the final guest genome. In all past cases, this problem stemmed from instability or toxicity of the particular insert in E. coli. Either shifting the insert region slightly back or forth or dividing the gap region into smaller pieces solved the problem. Aside from the E. coli-associated problem, we have encountered several genes that conferred toxic or aberrant phenotypes on Bsu168 when integrated in the BGM. Knockout of the culprit gene (Itaya et al., 2005) or promoter redesign (Nishizaki et al., 2007) appeared to solve the problem, allowing the rest of the DNA to be reconstructed.
3.5.4. Repetitive sequences
Two mouse genome regions of more than 100 kbp, full of short repeats, were handled in the BGM with no obvious problem (Itaya et al., 2000; Kaneko et al., 2003, 2009). An extremely long inverted repeat sequence of 21 kbp from the rice cpGenome was constructed and stably maintained (Itaya et al., 2008). DNA with a high guanine plus cytosine content has been accommodated without problems to date. Taken together, the currently available examples, which do not yet include DNA of low G+C content, suggest that DNA from any source is likely to be reconstructed in the BGM vector.
3.5.5. Insert size limit in BGM
The cycle of domino integration can in principle be repeated until a gap in the dominos is reached. When a 3,500-kbp guest genome was attempted, an apparent size limit for a single pBR322 integration site was observed, which is tightly related to structural constraints on Bsu168 genome replication. The Bsu168 genome is believed to maintain equal sizes for its left and right halves around the axis between the initiation (oriC) and termination (terC) sites of DNA replication. Consequently, insertion of DNA by the domino or related methods causes an imbalance. We found that an insert in the right half exceeding approximately 1,000 kbp starts to have detrimental effects on Bsu168. However, additional insertion of another large DNA of approximately 1,000 kbp in the left half appeared to restore the normal cellular status (Itaya et al., 2005). This still empirical approach made it possible to insert a total of 3,500 kbp in the BGM by dividing it equally between the two halves of the Bsu168 genome. The lengths of the two halves can be regulated by, for example, inducing inversion mutations in Bsu168 (Kuroki et al., 2008), which may be worth investigating as a route to a second generation of BGM vectors.
3.5.6. Stability of the cloned guest genome
A PCR check of internal DNA sequences (colony PCR appears sufficient) gives primary information about the presence of the guest DNA, relying on the single-copy-per-cell nature of the BGM. If the guest DNA grows far beyond 100 kbp, the genome structure needs to be verified by pulsed-field gel electrophoresis (Itaya et al., 2005; Gibson et al., 2010a,b), and resequencing the genome of the BGM carrying the guest is now recommended.
4. Assembly of Multiple DNA Fragments with Designed Order Connecting by a Few Bases
Domino inserts require a certain overlap length for homologous recombination to be effective. A length of about 5–10% of the insert size has been used empirically. In contrast, we developed another efficient method to produce large DNA starting from small pieces of DNA that overlap by only a few nucleotides. An example in which the order and orientation of the DNA fragments are uniquely determined by combinations of only three bases is shown in Fig. 19.4. These end sequences can be designed and produced through the E. coli molecular cloning system. Remarkably, all the designed fragments are connected in a single ligation reaction in the test tube, and the final DNA is produced at high frequency owing to the unique transformation features of Bsu168. This novel DNA fragment assembly method, named Ordered Gene Assembly in Bsu168 (OGAB), produces DNA in plasmid form via Bsu168 transformation and represents an outstanding breakthrough applicable to the construction of DNA cassettes carrying many relevant genes (Nishizaki et al., 2007; Tsuge et al., 2003, 2007).

[Figure 19.4 panel labels: gene fragments with 3-base protrusions (examples shown include GTT, TGA, ATG, CAA, CTA, TAC, GAT, TCT, ACT, and AGA); protrusions designed to appear once in one plasmid unit; ligation to form tandem repeat unit (Section 4.5); B. subtilis–E. coli shuttle plasmid vector; transformation of Bsu168 (Section 2.3); assembled plasmid in Bsu168.]
Figure 19.4 An example of OGAB assembly. All the DNA fragments to be assembled by the OGAB method possess a specific protrusion at both ends. The design of the 3-base sequence of each 3′ protrusion determines a unique end structure and thereby the designed fragment order and orientation. These DNA fragments are ligated into a tandem repeat form. A precisely equal molar ratio of all the DNA fragments is critical for obtaining a long ligation product.
4.1. Fragment design
Protrusions made by certain type II restriction endonucleases whose cutting site includes N (an arbitrary base) are well suited to this purpose. The three bases of the 3′ protrusion should contain at least one C or G, because ligation between protrusions consisting of A or T only is inefficient. Endonucleases such as AlwNI (5′-CAGNNN/CTG-3′), BglI (5′-GCCNNNN/NGGC-3′), DraIII (5′-CACNNN/GTG-3′), SfiI (5′-GGCCNNNN/NGGCC-3′), and PflMI (5′-CCANNNN/NTGG-3′) generate a 3-base protrusion of arbitrary sequence (N) at the 3′ end and are used preferentially. SfiI is particularly useful because its 8-base recognition sequence rarely occurs in a genome. BsaXI (5′-/(N)9AC(N)5CTCC(N)10/-3′ and 5′-/(N)7GGAG(N)5GT(N)12/-3′), a type IIB restriction endonuclease, is of interest because it cuts at two sites at once and excises the recognition sequence as a 30-bp fragment. The site thus disappears from the final construct, leaving no recognition sequence for this enzyme. One or two restriction enzymes are selected according to the assembly design; restriction enzymes that cut inside the fragments to be assembled should usually be excluded. A primer set with the selected restriction enzyme site at the 5′ end is used to amplify each target fragment by PCR.
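The design rules above lend themselves to a simple automated check. The sketch below, offered as an illustration under stated assumptions rather than as the authors' procedure, tests a list of 3-base protrusions for the two constraints mentioned in the text (at least one C or G, and each protrusion used only once per plasmid unit) and scans a fragment for internal recognition sites of the listed enzymes. The function names and the dictionary ENZYME_SITES are hypothetical; the recognition sequences are those given in the text with the cut-site slash removed.

```python
import re

# Recognition sequences from the text, written without the cut-site slash.
# All five are palindromic, so scanning one strand is sufficient.
ENZYME_SITES = {
    "AlwNI": "CAGNNNCTG",
    "BglI": "GCCNNNNNGGC",
    "DraIII": "CACNNNGTG",
    "SfiI": "GGCCNNNNNGGCC",
    "PflMI": "CCANNNNNTGG",
}


def internal_sites(fragment, enzyme):
    """Return 0-based start positions of the enzyme's sites in fragment."""
    core = ENZYME_SITES[enzyme].replace("N", "[ACGT]")
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile("(?=(" + core + "))")
    return [m.start() for m in pattern.finditer(fragment.upper())]


def check_protrusions(protrusions):
    """Return a list of human-readable problems; empty if none found."""
    problems = []
    seen = set()
    for raw in protrusions:
        p = raw.upper()
        if len(p) != 3 or set(p) - set("ACGT"):
            problems.append(f"{p}: not a 3-base DNA sequence")
        elif not set(p) & set("GC"):
            problems.append(f"{p}: contains only A/T")
        if p in seen:
            problems.append(f"{p}: used more than once")
        seen.add(p)
    return problems


if __name__ == "__main__":
    print(check_protrusions(["GTT", "TGA", "ATG"]))
    print(check_protrusions(["ATA", "GTT", "GTT"]))
    print(internal_sites("AAGGCCAAAAAGGCCTT", "SfiI"))
```

An enzyme whose site list is empty for every fragment is a candidate for excising the fragments; the check does not consider ligation efficiency beyond the C or G rule.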
4.2. Cloning of PCR product in E. coli plasmid
The PCR product is ligated into a conventional E. coli plasmid vector by TA cloning. After transformation of E. coli, colonies formed on the selection plate are subjected to colony PCR to screen for the insert. We generally use the M13F (5′-GTTTTCCCAGTCACGACGTTGT-3′) and M13R (5′-CAGGAAACAGCTATGACCATGATTAC-3′) primers for this purpose. PCR fragments, purified with a general cartridge-type PCR product purification kit, are used directly as sequencing templates to confirm that no mutation has been introduced in the insert by PCR amplification or in the primer regions, which were originally chemically synthesized. A colony carrying a plasmid with the correct sequence is cultivated in LB medium supplemented with the appropriate antibiotic. The plasmid is prepared with a commercially available anion-exchange column-type plasmid preparation kit such as QIAprep mini (for up to 5 mL) or midi (for greater than 50 mL), depending on the cultivation scale. After elution from the column, the plasmid buffer should be changed to TE of lower pH (10 mM Tris–HCl, 1 mM EDTA, pH 7.5). This is because the higher-pH buffer (e.g., pH 8.5) included in the kit causes serious star activity, in which a restriction enzyme, especially DraIII, digests DNA at undesired sequences at higher pH.
4.3. Careful preparation of fragments for OGAB method
The amount and purity of the fragments are the most important factors affecting the results. Fragments are excised from the E. coli plasmid by digestion with one or two relevant restriction enzymes. Digests are separated by electrophoresis in a low-gelling-temperature agarose gel (Agarose Type VII: Low Gelling Temperature, SIGMA). After staining with ethidium bromide, the DNA-containing agarose gel block is excised under black-light illumination. Use 365 nm; shorter wavelengths such as 256 nm cause serious damage to DNA. The excised gel block is brought up to 650 µL with 1× TAE buffer (40 mM Tris–acetate (pH 8.3) and 1 mM EDTA) and melted at 65 °C. Do not use a buffer less concentrated than 1× TAE, because double-stranded DNA denatures at low ionic strength at this temperature. The DNA is extracted twice with an equal volume of TE-saturated phenol, followed by repeated extraction with an equal volume of 1-butanol until the aqueous phase falls below 450 µL. Normally, several extractions with 1-butanol eliminate the phenol, which might otherwise inhibit the ligation reaction. Use TE-saturated phenol. Do not use phenol containing chloroform, because chloroform solidifies the agarose, which traps the DNA and never redissolves at any temperature. The DNA is precipitated by addition of 50 µL of solution III (3 M potassium acetate, adjusted to pH 4.8 with acetic acid) and 900 µL of ethanol. After rinsing with 70% ethanol, the resulting precipitate is dissolved in 20 µL of TE buffer. An aliquot of this sample, less than 1 µL, is used to check quality and quantity by gel electrophoresis. The DNA fragment should be purified repeatedly until no other bands are seen.
4.4. Molar concentration adjustment
The molar ratio is critical for the formation of linear multimers, the substrate for Bsu168 competent cells (Fig. 19.5). All fragments should be adjusted to an equal molar concentration. For fewer than four fragments, adjustment can be done by direct comparison of fluorescence intensity of bands resolved on an electrophoresis gel stained with ethidium bromide. However, adjusting the molar concentration of more than five fragments in this way becomes complicated, and a fluorescence microplate reader is used instead.

Figure 19.5 Different plasmid formation on transformation of E. coli and B. subtilis. [Panel labels: circular form, E. coli; linear multimeric form, B. subtilis.] E. coli transformation absolutely requires a circular form that has to be prepared prior to use. In contrast, linear DNA in a multiply repeated form transforms Bsu168 by circularizing inside the competent cell. Quite fortunately, the tandemly repeated cluster unit formed in the test tube, as shown in Fig. 19.4, is a good substrate for plasmid production in Bsu168 (Canosi et al., 1981).

The fluorescence microplate reader is used as follows. Commercially available double-stranded DNA of known concentration is serially diluted twofold in TE buffer. An aliquot of 25 μL is mixed with 125 μL of TE containing 8,333-fold diluted SYBR-Green I (Molecular Probes, Inc.). These solutions are transferred into a 96-well microplate, and the fluorescence intensity is measured with a fluorescence microplate reader (excitation 485 nm, emission 535 nm; SPECTRAmax GEMINI XS, Molecular Devices, Inc.) to make a standard curve. Similarly, each fragment stock solution is diluted with TE by an appropriate factor to make a working solution. An aliquot is taken, brought up to 25 μL with TE, mixed with 125 μL of TE containing 8,333-fold diluted SYBR-Green I, and measured with the fluorescence microplate reader. To adjust the DNA concentration ratio very precisely among fragments from different sources, the aliquot volume of each working solution is recalculated from the previous measurement and remeasured until all working solutions fall within 20% of the mean fluorescence intensity, which makes it easy to check that all solutions have the same weight concentration (w/v). The molar concentration is calculated from the weight concentration and the length of the DNA fragment (bp).
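The standard-curve readout and the weight-to-molar conversion described in Section 4.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example readings, and the use of 650 g/mol as the average mass of one base pair (a common approximation, not a value given in the text) are all hypothetical.

```python
# Sketch: linear standard curve for a dsDNA fluorescence assay, followed by
# conversion of weight concentration to molar concentration.
# All numbers used with these helpers are illustrative, not measured data.

AVG_BP_MASS = 650.0  # ASSUMPTION: average mass of one base pair, in g/mol


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weight_conc(fluorescence, slope, intercept):
    """Read a weight concentration (ng/uL) off the standard curve."""
    return (fluorescence - intercept) / slope


def molar_conc_fmol_per_ul(ng_per_ul, length_bp):
    """Convert ng/uL of a double-stranded fragment to fmol/uL."""
    # ng -> g is 1e-9 and mol -> fmol is 1e15, giving a net factor of 1e6.
    return ng_per_ul * 1e6 / (length_bp * AVG_BP_MASS)


def within_tolerance(values, tolerance=0.20):
    """True if every value lies within +/- tolerance of the mean."""
    mean = sum(values) / len(values)
    return all(abs(v - mean) <= tolerance * mean for v in values)


if __name__ == "__main__":
    # Hypothetical standards: a blank plus a twofold dilution series.
    standards = [0.0, 1.0, 2.0, 4.0, 8.0]          # ng/uL
    signals = [5.0, 105.0, 205.0, 405.0, 805.0]    # arbitrary units
    slope, intercept = fit_line(standards, signals)
    conc = weight_conc(405.0, slope, intercept)
    print(conc, molar_conc_fmol_per_ul(conc, 1000))
```

Because the molar concentration scales inversely with fragment length, two fragments at the same weight concentration differ in molarity by the ratio of their lengths, which is why the length in bp enters the final step.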
4.5. OGAB assembly
Ligation is performed as follows. Take 1 μL of each DNA fragment [all fragments at identical molar concentration (1 fmol)] and add 2× ligation buffer (132 mM Tris–HCl (pH 7.6), 13.2 mM MgCl2, 20 mM dithiothreitol, 0.2 mM ATP, 300 mM NaCl, and 20% polyethylene glycol (PEG) 6000) in a volume equal to the total volume of DNA plus 1 μL, and mix well. Add 1 μL of T4 DNA ligase (200 Weiss units, TOYOBO) and incubate at 37 °C for 30 min to complete the reaction. The ligation mixture is then mixed with 100 μL of Bsu168 competent cells in TFII and incubated at 37 °C; gentle agitation for 30 min is enough for the DNA to be taken up by the cells. To allow expression of the antibiotic resistance gene used for selection, 300 μL of LB medium is added to the culture, which is incubated at 37 °C with gentle shaking for 1 h. The entire culture is spread on a plate containing the selective antibiotic, and the plates are incubated at 37 °C. Colonies normally form by the next day.
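The volume rule for the ligation mix (2× buffer equal to the total DNA volume plus 1 μL, then 1 μL of ligase) can be checked with a short calculation. The sketch below is illustrative only; the function name and the returned keys are hypothetical, and generalizing the "plus 1 μL" to the ligase volume is an assumption that matches the text only when the ligase volume is 1 μL.

```python
# Sketch: volumes for an OGAB ligation mix with n fragments of 1 uL each.
# ASSUMPTION: the "+ 1 uL" of 2x buffer compensates for the ligase volume,
# so that the buffer ends up at 1x in the complete reaction.

def ogab_ligation_mix(n_fragments, fragment_ul=1.0, ligase_ul=1.0):
    dna_ul = n_fragments * fragment_ul
    buffer_2x_ul = dna_ul + ligase_ul            # total DNA volume plus 1 uL
    total_ul = dna_ul + buffer_2x_ul + ligase_ul
    fold = 2.0 * buffer_2x_ul / total_ul         # final buffer strength (x)
    return {
        "dna_ul": dna_ul,
        "buffer_2x_ul": buffer_2x_ul,
        "ligase_ul": ligase_ul,
        "total_ul": total_ul,
        "final_buffer_fold": fold,
        "final_peg_percent": 20.0 * buffer_2x_ul / total_ul,
        "final_nacl_mM": 300.0 * buffer_2x_ul / total_ul,
    }


if __name__ == "__main__":
    for key, value in ogab_ligation_mix(5).items():
        print(key, value)
```

With this rule the 2× buffer always makes up exactly half of the final volume, so the reaction runs at 1× buffer (10% PEG 6000, 150 mM NaCl) regardless of the number of fragments.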
4.6. Plasmid DNA extraction from Bsu168
The alkaline–SDS method previously described by Bron (1990) is as follows. Colonies on the plate are picked with a heat-sterilized toothpick and inoculated into 2 mL of LB medium containing antibiotic. After 17 h of growth at 37 °C, the bacteria are harvested by centrifugation at 15,000 × g for 30 s. The cell pellet is suspended in 100 μL of solution I (50 mM glucose, 25 mM Tris–Cl (pH 8.0), 10 mM EDTA (pH 8.0)) containing 10 mg/mL of lysozyme and incubated at 37 °C for 10 min. To this suspension, 200 μL of solution II (0.2 N NaOH, 1% (w/v) sodium dodecyl sulfate) is added, and the mixture is agitated gently until it turns transparent. Addition of 150 μL of solution III followed by gentle agitation produces a white precipitate. After centrifugation at 15,000 × g for 5 min, the supernatant is transferred to a new tube, extracted with 450 μL of phenol:chloroform:isoamyl alcohol (25:24:1), and centrifuged at 15,000 × g for 5 min. Three hundred and twenty microliters of the supernatant is moved to a new tube, and 900 μL of 100% ethanol is added. Vigorous mixing followed by centrifugation at 15,000 × g for 10 min gives a DNA pellet at the bottom of the tube. This pellet is rinsed with 900 μL of 70% ethanol. After complete removal of the liquid by micropipetting, the pellet is dissolved in 25 μL of TE (10 mM Tris–HCl, 1 mM EDTA, pH 7.5). Usually, the TE contains 10 μg/mL of RNase A to digest remaining RNA. Eight microliters of this sample is used for analysis with appropriate restriction endonucleases. For Bsu168 plasmids whose copy number can be regulated by an inducer (Tsuge et al., 2003), the copy number can be amplified by adding IPTG (isopropyl-β-D-thiogalactopyranoside) to the culture at a final concentration of 1 mM when the culture reaches late-log to stationary phase, followed by cultivation for another 3 h.
4.7. OGAB-specific troubleshooting
4.7.1. Fragment isolation
Occasionally, the DNA fragment can hardly be separated from the vector fragment by electrophoresis because of their similar sizes. In such cases, the vector should be digested into much smaller pieces by the combined use of additional restriction enzymes that cut only within the vector sequence. In practice, the order of the restriction reactions can be critical in a multiple digestion. SfiI requires a longer incubation time (3 days) for complete digestion in certain cases, especially when two SfiI sites lie close together. In contrast, 15 min is adequate for DraIII, and longer incubation causes serious star activity. Double digestion with SfiI and DraIII should therefore be done by adding the enzymes in this order.
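Choosing enzymes that cut the vector but leave the fragment intact can be screened in silico before going to the bench. The following is a minimal sketch, assuming linear sequences given as plain strings; the enzyme table is a small illustrative subset, and the function names are hypothetical. All listed recognition sites are palindromic, so scanning one strand is sufficient.

```python
import re

# Recognition sites (N = any base). Illustrative subset only.
SITES = {
    "SfiI": "GGCCNNNNNGGCC",
    "DraIII": "CACNNNGTG",
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def site_regex(site):
    """Compile a site into a regex; the lookahead finds overlapping hits."""
    return re.compile("(?=(" + site.replace("N", "[ACGT]") + "))")


def count_sites(seq, site):
    """Count occurrences of a recognition site in a linear sequence."""
    return sum(1 for _ in site_regex(site).finditer(seq.upper()))


def vector_only_cutters(vector_seq, insert_seq, sites=SITES):
    """Enzymes that cut the vector at least once and never cut the insert."""
    return sorted(
        name
        for name, site in sites.items()
        if count_sites(vector_seq, site) > 0
        and count_sites(insert_seq, site) == 0
    )


if __name__ == "__main__":
    # Hypothetical toy sequences, far shorter than real vectors or inserts.
    vector = "AAGAATTCAAGGATCCAA"
    insert = "TTGGATCCTTCACAAAGTGTT"
    print(vector_only_cutters(vector, insert))
```

The sketch ignores circular wrap-around and methylation sensitivity, so any candidate it reports still needs to be confirmed against the full plasmid map.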
5. Future Perspectives
This chapter has focused on cloning technology that uses the BGM vector. The principle for producing a large genome is to assemble smaller, partially overlapping DNA fragments that are available through regular E. coli molecular cloning systems. The domino method is largely suited to the reconstruction of existing genomes, as long as a gap-free set of dominos is prepared. There seems to be little limitation on sequence context. The size may be limited by intrinsic structural constraints of the Bsu168 genome to about 1,000 kbp at one integration site (Itaya et al., 2005). However, by distributing large guest genomes over more than two sites, an assembly of 3,500 kbp was finally achieved (Itaya N&V, 2010). By contrast, the OGAB method seems better suited to assembling many DNA fragments from different genomic loci or species. We are now aiming to harness relevant genes of the same metabolic pathway in a polycistronic operon form, as briefly illustrated in Fig. 19.6. The great advantages are that (i) the number of promoters can be reduced to one in the ideal case, (ii) expression of each gene can be controlled by placing it at an appropriate distance from the promoter, and (iii) the operon can be simply delivered or moved among different hosts as a single functional DNA cassette (Nishizaki et al., 2007). Although the domino method can accomplish this assembly through designed overlapping regions, the OGAB method is likely to yield many constructs rapidly. The functional cassettes made by the OGAB method can be further connected by repeating the OGAB method on the cassettes themselves. One intriguing goal motivated by this research direction might be the construction of DNA carrying a minimal set of genes for life (Itaya, 1995a,b; Glass et al., 2006; Gerdes et al., 2006). We do not know whether there is an upper limit to the size that can be produced by OGAB; the domino method can be helpful in that case. An improved protocol may include dominos or OGAB components that start from chemically synthesized DNA. Production of large DNA that does not exist in nature is desired for wider applications. Synthetic biology is a new and rapidly emerging discipline that aims at the redesign and actual construction of new biological systems. Genome synthesis appears to be the major challenge, because the genome is the ultimate molecule, manipulation of which governs most cellular processes. Methods to synthesize or clone the genomes of relatively simple bacteria have only just emerged.

Figure 19.6 Flow chart of gene assembly by the OGAB method. [Flow chart steps: design of construct; fragment design (4-1); fragment preparation (4-2, 4-3); molar concentration adjustment of fragments (4-4); OGAB assembly (4-5); confirmation of structure (4-6). Fragments A to E are flanked by SfiI and DraIII sites.] Given a blueprint of the guest DNA sequence, all the DNA components to be assembled and connected should be designed accordingly. The steps of the OGAB method are labeled with the corresponding section numbers in the text.
ACKNOWLEDGMENTS
I thank Dr. S. Kaneko for his contribution to the BAC part and for discussion and comments on this chapter.
REFERENCES
Anagnostopoulos, C., and Spizizen, J. (1961). Requirements for transformation in Bacillus subtilis. J. Bacteriol. 81, 741–746.
Bang, D., and Church, G. M. (2008). Gene synthesis by circular assembly amplification. Nat. Methods 5, 37–39.
Bron, S. (1990). Plasmids. In “Molecular Biological Methods for Bacillus,” (C. R. Harwood and S. M. Cutting, eds.), pp. 75–174. John Wiley and Sons, Chichester, UK.
Canosi, U., Iglesias, A., and Trautner, T. A. (1981). Plasmid transformation in Bacillus subtilis: Effects of insertion of Bacillus subtilis DNA into plasmid pC194. Mol. Gen. Genet. 181, 434–440.
Chu, G., Vollrath, D., and Davis, R. W. (1986). Separation of large DNA molecules by contour-clamped homogeneous electric fields. Science 234, 1582–1585.
Dubnau, D. (1999). DNA uptake in bacteria. Annu. Rev. Microbiol. 53, 214–244.
Gerdes, S., Edwards, R., Kubal, M., Fonstein, M., Stevens, R., and Osterman, A. (2006). Essential genes on metabolic maps. Curr. Opin. Biotechnol. 17, 448–456.
Gibson, D. G., Benders, G. A., Andrews-Pfannkoch, C., Denisova, E. A., Baden-Tillson, H., Zaveri, J., Stockwell, T. B., Brownley, A., Thomas, D. W., Algire, M. A., Merryman, C., Young, L., et al. (2008). Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 319, 1215–1220.
Gibson, D. G., Young, L., Chuang, R. Y., Venter, J. C., Hutchison, C. A., 3rd, and Smith, H. O. (2009). Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345.
Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., Benders, G. A., Montague, M. G., Ma, L., Moodie, M. M., Merryman, C., Vashee, S., et al. (2010a). Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56.
Gibson, D. G., Smith, H. O., Hutchison, C. A., 3rd, Venter, J. C., and Merryman, C. (2010b). Chemical synthesis of the mouse mitochondrial genome. Nat. Methods 7, 901–903.
Glass, J. I., Assad-Garcia, N., Alperovich, N., Yooseph, S., Lewis, M. R., Maruf, M., Hutchison, C. A., 3rd, Smith, H. O., and Venter, J. C. (2006). Essential genes of a minimal bacterium. Proc. Natl. Acad. Sci. USA 103, 425–430.
Itaya, M. (1993). Integration of repeated sequences (pBR322) in the Bacillus subtilis 168 chromosome without affecting the genome structure. Mol. Gen. Genet. 241, 287–297.
Itaya, M. (1995a). Toward a bacterial genome technology: Integration of the Escherichia coli prophage lambda genome into the Bacillus subtilis 168 chromosome. Mol. Gen. Genet. 248, 9–16.
Itaya, M. (1995b). An estimation of minimal genome size required for life. FEBS Lett. 362, 257–260.
Itaya, M. (2009). Recombinant Genomes. In “Systems Biology and Synthetic Biology,” (P. Fu, M. Latterich, and S. Panke, eds.), pp. 155–194. John Wiley & Sons, Inc., Hoboken, NJ.
Itaya, M., and Tanaka, T. (1991). Complete physical map of the Bacillus subtilis 168 chromosome constructed by a gene-directed mutagenesis method. J. Mol. Biol. 220, 631–648.
Itaya, M., and Tanaka, T. (1997). Experimental surgery to create subgenomes of Bacillus subtilis 168. Proc. Natl. Acad. Sci. USA 94, 5378–5382.
Itaya, M., and Kaneko, S. (2010). Integration of stable extra-cellular DNA released from Escherichia coli into the Bacillus subtilis genome vector by culture mix method. Nucleic Acids Res. 38, 2551–2557.
Itaya, M., Shiroishi, T., Nagata, T., Fujita, K., and Tsuge, K. (2000). Efficient cloning and engineering of giant DNAs in a novel Bacillus subtilis genome vector. J. Biochem. 128, 869–875.
Itaya, M., Fujita, K., Koizumi, M., Ikeuchi, M., and Tsuge, K. (2003). Stable positional cloning of long continuous DNA in the Bacillus subtilis genome vector. J. Biochem. 134, 513–519.
Itaya, M., Tsuge, K., Koizumi, M., and Fujita, K. (2005). Combining two genomes in one cell: Stable cloning of the Synechocystis PCC6803 genome in the Bacillus subtilis 168 genome. Proc. Natl. Acad. Sci. USA 102, 15971–15976.
Itaya, M., Fujita, K., Kuroki, A., and Tsuge, K. (2008). Bottom-up genome assembly using the Bacillus subtilis genome vector. Nat. Methods 5, 41–43.
Kaneko, S., Tsuge, K., Takeuchi, T., and Itaya, M. (2003). Conversion of submegasized DNA to desired structures using a novel Bacillus subtilis genome vector. Nucleic Acids Res. 31, e112.
Kaneko, S., Akioka, M., Tsuge, K., and Itaya, M. (2005). DNA shuttling between plasmid vectors and a genome vector: Systematic conversion and preservation of DNA libraries using the Bacillus subtilis genome (BGM) vector. J. Mol. Biol. 349, 1036–1044.
Kaneko, S., Takeuchi, T., and Itaya, M. (2009). Genetic connection of two contiguous bacterial artificial chromosomes using homologous recombination in Bacillus subtilis genome vector. J. Biotech. 139, 211–213.
Kuroki, A., Toda, T., Matsui, K., Uotsu-Tomita, R., Tomita, M., and Itaya, M. (2008). Reshuffling of the Bacillus subtilis genome by multifold inversion. J. Biochem. 143, 97–105.
Nishizaki, T., Tsuge, K., Itaya, M., Doi, N., and Yanagawa, H. (2007). Metabolic engineering of carotenoid biosynthesis in Escherichia coli by ordered gene assembly in Bacillus subtilis (OGAB). Appl. Environ. Microbiol. 73, 1355–1361.
Ohtani, N., Sato, M., Tomita, M., and Itaya, M. (2008). Restriction and modification on conjugational transfer between Bacillus subtilis 168. Biosci. Biotechnol. Biochem. 72, 2472–2475.
Shizuya, H., Birren, B., Kim, U. J., Mancino, V., Slepak, T., Tachiiri, Y., and Simon, M. (1992). Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl. Acad. Sci. USA 89, 8794–8797.
Spizizen, J. (1958). Transformation of biochemically deficient strains of Bacillus subtilis by deoxyribonucleate. Proc. Natl. Acad. Sci. USA 44, 1072–1078.
Tomita, S., Tsuge, K., Kikuchi, Y., and Itaya, M. (2004). Targeted isolation of a designated region of the Bacillus subtilis 168 genome by recombinational transfer. Appl. Environ. Microbiol. 70, 2508–2513.
Tsuge, K., and Itaya, M. (2001). Recombinational transfer of 100-kb genomic DNA to plasmid in Bacillus subtilis 168. J. Bacteriol. 183, 5453–5456.
Tsuge, K., Matsui, K., and Itaya, M. (2003). One step assembly of multiple DNA fragments with a designed order and orientation in Bacillus subtilis plasmid. Nucleic Acids Res. 31, e133.
Tsuge, K., Matsui, K., and Itaya, M. (2007). Production of the non-ribosomal peptide plipastatin in Bacillus subtilis regulated by three relevant gene blocks assembled in a single movable DNA segment. J. Biotech. 129, 592–603.
Uozumi, T., Hoshino, T., Miwa, K., Horinouchi, S., Beppu, T., and Arima, K. (1977). Restriction and modification in Bacillus species: Genetic transformation of bacteria with DNA from different species, part 1. Mol. Gen. Genet. 152, 65–69.
CHAPTER TWENTY
Mapping E. coli RNA Polymerase and Associated Transcription Factors and Identifying Promoters Genome-Wide
Sarah E. Davis,*,† Rachel A. Mooney,* Elenita I. Kanin,*,† Jeff Grass,* Robert Landick,*,‡ and Aseem Z. Ansari*,†

Contents
1. Introduction
1.1. Formaldehyde effects on growth and cellular response
1.2. Formaldehyde effects on indirect cellular interactions
2. Protocol: ChIP–chip
2.1. Harvesting cells
2.2. Isolation of cross-linked DNA
2.3. Immunoprecipitation
2.4. Prepping and handling sepharose bead 50:50 slurry
2.5. Solutions and Reagents
2.6. Analysis of ChIP by qPCR
2.7. ChIP–chip DNA prep for microarray
2.8. Genome-wide location analysis (ChIP–chip)
2.9. Data analysis protocol
2.10. Defining the background signal for a widely distributed protein complex
3. Chemical Genomics
4. Future Directions
Acknowledgments
References
Abstract
The ability to examine gene regulation in living cells has been greatly enabled by the development of chromatin immunoprecipitation (ChIP) methodology. ChIP captures a snapshot of protein–DNA interactions in vivo and has been used to study interactions in bacteria, yeast, and mammalian cell culture. ChIP conditions vary depending upon the organism and the nature of the DNA-binding proteins under study. Here, we describe a customized ChIP protocol to examine the genome-wide distribution of a mobile DNA-binding enzyme, Escherichia coli RNA Polymerase (RNAP), as well as the factors that dynamically associate with RNAP during different stages of transcription. We describe new data analysis methods for determining the association of a broadly distributed DNA-binding complex. Further, we describe our approach of combining small molecules and antibiotics that perturb specific cellular events with ChIP and genomic platforms to dissect mechanisms of gene regulation in vivo. The chemical genomic methods can be leveraged to map natural and cryptic promoters and transcription units, annotate genomes, and reveal coupling between different processes in regulation of genes. This approach provides the framework for engineering gene networks and controlling biological output in a desired manner.

* Department of Biochemistry, University of Wisconsin, Madison, Wisconsin, USA
† Genome Center, University of Wisconsin, Madison, Wisconsin, USA
‡ Department of Bacteriology, University of Wisconsin, Madison, Wisconsin, USA

Methods in Enzymology, Volume 498. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00020-6. © 2011 Elsevier Inc. All rights reserved.
1. Introduction
Chromatin immunoprecipitation (ChIP) is a powerful technique for studying protein–DNA interactions over time, across the whole genome, and under various growth conditions. With ChIP, formaldehyde is used to cross-link proteins to proteins and proteins to DNA in vivo (Solomon and Varshavsky, 1985), followed by disruption of genomic DNA to shorter fragments and immunoprecipitation of protein–DNA complexes with an antibody specific to the protein of interest. By reversing the cross-link and examining specific loci or the entire set of immunoprecipitated genomic DNA, one can determine the location of a protein of interest (Iyer et al., 2001; Ren et al., 2000). ChIP has been used previously in a variety of cultured cells and organisms, each with an optimized protocol (reviewed in Lin and Grossman, 1998; Aparicio et al., 2005). Similarly, to perform ChIP in Escherichia coli, optimization was required to maximize specificity and sensitivity. Moreover, to define regions that were bound by RNA polymerase (RNAP), a broadly distributed mobile protein complex, it was necessary to develop new computational models to distinguish background noise from specific ChIP signals.
1.1. Formaldehyde effects on growth and cellular response
The first step in optimization focused on determining how rapidly formaldehyde, a cell-permeable, four-atom cross-linker, acts. For a cross-linking reagent to effectively capture protein–DNA complexes in their functional state, the molecule must gain entry into live cells and rapidly cross-link proteins and DNA to take a “snapshot” of interactions across the genome. If a cross-linker is unable to enter the cell effectively, or if the time required to effectively cross-link proteins and DNA is substantial, proteins could redistribute along the DNA during the course of the treatment or in response to the cross-linker. In this case, the resulting “snapshot” would not be a true representation of what normally occurs in the cell and might include formaldehyde-responsive alterations in gene expression. As one test of the ability of formaldehyde to rapidly enter E. coli, we first determined that the concentration of formaldehyde we were using was sufficient to immediately halt cell growth (Fig. 20.1). The immediate cessation of growth indicates that formaldehyde entered the cells rapidly.

Figure 20.1 Growth curve comparing E. coli cultures grown in MOPS minimal media plus glucose to mid-log phase (30 Klett units) and treated (open circles) or not treated (solid squares) with addition of 1% formaldehyde (HCHO). Ten millimolar sodium phosphate mix was added to both cultures. [Axes: culture density (Klett units) versus time (h); the time of HCHO addition is marked.]

As a further test of formaldehyde action, we examined the redistribution of RNAP on the frm operon, whose expression was previously shown to be stimulated by formaldehyde (about 100-fold at 250 μM; Herring and Blattner, 2004). To test whether formaldehyde cross-linking is slow enough that E. coli is able to respond and induce this operon, we compared the amount of RNAP present at the frm operon when induced with 250 μM formaldehyde for 30 min versus the amount present at the higher concentration used for cross-linking (1%, or approximately 330 mM formaldehyde) for 10 min (Fig. 20.2A). If transcription were allowed to continue after the addition of the high concentration of formaldehyde, one would expect to see a redistribution of RNAP and induction of transcription of the frm operon. Instead, RNAP was not observed at the frm operon when cells were treated with high levels (1%) of formaldehyde, whereas pretreatment with low levels (250 μM) led to robust occupancy of RNAP across this operon (Fig. 20.2B).
Figure 20.2 (A) Comparison of ChIP conditions to conditions that induce frm expression in E. coli. Typical ChIP conditions include exposing E. coli to 1% formaldehyde for 10 min to cross-link RNAP. To induce frm activation in E. coli, cells were treated with 250 μM formaldehyde for 30 min before addition of formaldehyde (HCHO) to 1% to cross-link RNAP. (B) Comparison of RNAP distribution across the frm operon under normal ChIP conditions and under frm activation conditions, using ChIP. Under normal ChIP conditions, RNAP occupancy across the frm operon is low. Under frm induction conditions, however, RNAP occupancy across the frm operon is high. [Panels: map of the frm operon (frmR, frmA, frmB); treatment scheme (buffer or 250 μM HCHO for 30 min, then 1% HCHO for 10 min, then quench and PCR); bar chart of DNA concentration (ng/mL) at the frm promoter, frmR, and frmA for buffer versus 250 μM HCHO.]
These results validate formaldehyde as an excellent probe to cross-link protein–DNA interactions in E. coli because, at 1%, it immediately inhibits growth and does not allow redistribution of RNAP after treatment. Having validated the use of 1% formaldehyde as an appropriate cross-linking agent, we optimized the treatment conditions for maximal signal-to-noise in ChIP assays. The length of cross-linking is likely to affect the extent of protein–protein and protein–DNA cross-linking. Excessive exposure to formaldehyde (Fig. 20.3) may yield fortuitous interactions due to networks of cross-links between cellular components.

Figure 20.3 Testing duration of formaldehyde treatment in E. coli at 30 s, 2 min 30 s, 5 min, 10 min, and 20 min. ChIP was performed against RNAP (β′-subunit), and the signal at the rrn promoter (highly active in normal growth conditions) was compared with that at the lac promoter (extremely low activity in these growth conditions). The average values from three independent experiments are plotted; the error bars represent the standard deviations. The numbers above the pairs of data are the fold-difference between the rrn and lac PCR signals at each time point. (A) Using E. coli MG1655, the highest signal-to-noise ratio for ChIP against RNAP (the β′-subunit) was seen at 5 min of cross-linking with formaldehyde. (B) Using E. coli RL1664 (HA-NusG), the optimal formaldehyde treatment for ChIP against the HA epitope attached to NusG was seen at 10 min of cross-linking, with both the highest concentration of NusG present at the rrn promoter and the highest ratio of rrn to lac signal. [Axes: relative DNA concentration (ng/mL) versus time (min); bars for rrn promoter and lac promoter.]

To find the optimal formaldehyde treatment regimen to detect specific interactions in E. coli, we examined various durations of formaldehyde treatment. By comparing RNAP (β′-subunit) occupancy at the promoter of the actively transcribed rrn operon with that at the promoter of the lac operon, which is not transcribed when cells are grown in minimal media supplemented with glucose, we were able to establish that the maximal signal as well as the maximal signal-to-noise ratio for proteins that directly interact with DNA was obtained at approximately 5 min (Fig. 20.3A).
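The comparison of rrn and lac signals across a time course amounts to computing a fold-difference per time point and picking the largest. The sketch below illustrates this under stated assumptions: the function names are hypothetical and the replicate values are invented for illustration, not taken from the experiments described here.

```python
from statistics import mean, stdev


def summarize(replicates):
    """Mean and sample standard deviation of replicate measurements."""
    return mean(replicates), stdev(replicates)


def fold_difference(target, background):
    """Ratio of mean signal at the active locus to the inactive locus."""
    return mean(target) / mean(background)


def best_time_point(time_course):
    """Time point with the highest fold-difference.

    time_course maps minutes of cross-linking to a pair of replicate lists:
    (signal at the active promoter, signal at the inactive promoter).
    """
    return max(
        time_course,
        key=lambda minutes: fold_difference(*time_course[minutes]),
    )


if __name__ == "__main__":
    # HYPOTHETICAL values, three replicates per locus and time point.
    example = {
        0.5: ([100.0, 110.0, 90.0], [10.0, 10.0, 10.0]),
        5.0: ([900.0, 1000.0, 1100.0], [10.0, 12.0, 8.0]),
        20.0: ([1000.0, 1000.0, 1000.0], [40.0, 40.0, 40.0]),
    }
    for minutes, (active, inactive) in example.items():
        print(minutes, summarize(active), fold_difference(active, inactive))
    print("best:", best_time_point(example))
```

Ranking by fold-difference alone can favor a time point with very low absolute recovery, so in practice the absolute signal at the active locus should be inspected alongside the ratio, as the text does.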
1.2. Formaldehyde effects on indirect cellular interactions
ChIP can also be used to examine transcription regulatory proteins that do not directly interact with DNA but that are localized to specific sites across the genome through interactions with RNAP. Successful ChIP analysis of these proteins requires two cross-links, one between RNAP and DNA and the other between RNAP and the associated regulator. An example of such an interaction is the dynamic association of the elongation factor NusG with the transcribing RNAP. To characterize the optimal cross-linking regimen for NusG, we performed a time course of formaldehyde treatment. The best signal-to-noise ratio that maintained a high recovery of DNA was obtained at approximately 10 min (Fig. 20.3B; using an epitope-tagged NusG strain, RL1664, containing HA-tagged NusG; Herring et al., 2003; Mooney et al., 2009). The longer formaldehyde treatment needed for NusG than for the β′-subunit is consistent with the lower efficiency of achieving two cross-links between interacting molecules.

ChIP provides the ability to study proteins under various growth conditions and their interaction with specific genes by analyzing the DNA by quantitative real-time PCR (qPCR). ChIP also enables one to map the interaction of a protein or protein complex across an entire genome by analyzing the DNA with a microarray (Grainger et al., 2004, 2005; Herring et al., 2005; Iyer et al., 2001; Ren et al., 2000; Wade and Struhl, 2004). By using ChIP–chip, it is possible to understand the involvement of transcription components in vivo and correlate the position of the transcriptional machinery along the genome.
2. Protocol: ChIP–chip
For all experiments described, MG1655 was used with the exception of strain RL1664, in which NusG is HA-tagged at the N-terminal domain as described previously (Mooney et al., 2009).

Figure 20.4 (A) In ChIP, proteins are cross-linked to DNA through the use of formaldehyde or another cross-linking reagent, the DNA is fragmented by sonication, and immunoprecipitation is performed against the protein of interest. After reverse cross-linking is performed, specific protein–DNA interactions can be examined by PCR. (B) Using the ChIP DNA, a linker is ligated onto all DNA fragments, which allows for uniform amplification of the DNA by PCR. The ChIP DNA is labeled with either Cy3 or Cy5 (control DNA is labeled with the remaining Cy dye) and hybridized to a high-density DNA array. [Panel labels: (A) cross-link and fragment; immunoprecipitate; reverse cross-link. (B) linker + ligase; amplify DNA by PCR; label DNA by PCR + Cy3 or Cy5; hybridize DNA to array.]

2.1. Harvesting cells
Day 1 and 2:
1. Inoculate 5 mL MOPS + 0.2% glucose with a single colony and grow overnight at 37 °C (to an OD600 > 1.5).
2. Inoculate a larger culture to OD 0.02 and grow to mid-log phase (OD 0.3–0.4) at 37 °C at 250 rpm on an orbital shaking incubator.
3. To a 50 mL culture (already at 37 °C), add 0.5 mL 1 M sodium phosphate mix (see below for mix) and 1.3 mL 37% formaldehyde (1% final). Cross-link by rotating/gently shaking for 30 min at 37 °C or room temperature. Note: each IP uses a 50 mL culture.
4. Add 2 mL cold 2.5 M glycine (100 mM final) and immediately transfer to an ice/water slurry to cool rapidly; continue to rotate/gently shake for 30 min at 4 °C to stop cross-linking.
5. Centrifuge the culture at 3500 × g for 10 min at 4 °C. Resuspend cells in 100 mL PBS and spin again; repeat the wash step once. Resuspend the pellet in 1 mL PBS and transfer the entire volume to an Eppendorf tube. Pellet cells and wash one more time with 1 mL PBS. Remove all supernatant, flash-freeze, and store at −80 °C.
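The nominal final concentrations quoted in steps 3 and 4 of Section 2.1 are rounded; the arithmetic behind them can be sketched as follows. This is an illustrative calculation only, with a hypothetical function name, and it assumes that volumes are simply additive.

```python
# Sketch: final concentrations after adding phosphate mix, formaldehyde,
# and glycine to a culture, using the volumes and stock strengths listed
# in steps 3 and 4. ASSUMPTION: volumes are additive.

def crosslink_concentrations(culture_ml=50.0, phosphate_ml=0.5,
                             formaldehyde_ml=1.3, glycine_ml=2.0):
    crosslink_volume = culture_ml + phosphate_ml + formaldehyde_ml
    quench_volume = crosslink_volume + glycine_ml
    return {
        # 1 M (1000 mM) sodium phosphate stock
        "phosphate_mM": 1000.0 * phosphate_ml / crosslink_volume,
        # 37% formaldehyde stock
        "formaldehyde_percent": 37.0 * formaldehyde_ml / crosslink_volume,
        # 2.5 M (2500 mM) glycine stock
        "glycine_mM": 2500.0 * glycine_ml / quench_volume,
    }


if __name__ == "__main__":
    for name, value in crosslink_concentrations().items():
        print(f"{name}: {value:.2f}")
```

For a 50 mL culture this gives roughly 0.93% formaldehyde, 9.7 mM phosphate, and 93 mM glycine, consistent with the rounded values of 1%, 10 mM, and 100 mM used in the protocol. Scaling the culture volume requires scaling all three additions by the same factor to keep these values unchanged.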
2.2. Isolation of cross-linked DNA
Day 3:
1. Thaw frozen cells on ice. Resuspend the cell pellet in 250 μL 1× IP buffer + 1 mM PMSF. If resuspending a cell pellet from a 100 mL culture, use 500 μL 1× IP buffer + 1 mM PMSF.
2. Sonicate cells in a flat-bottom tube. For a Branson sonicator, sonicate at 10% output for 20 s, cool on ice for 2–5 min, and repeat four to five times (five to six times altogether). Note that the sonicator tip should not touch the tube and should be just below the surface of the fluid without causing excess foam.
3. To the 250 μL volume add 5 μL micrococcal nuclease, 5 μL CaCl2 mix (see solutions below for instructions on prep), and 0.5 μL 1 mg/mL RNase A. Incubate for 1 h at 4 °C, rotating.
Note: Since the development of this protocol, it was reported that smaller DNA fragments of the yeast genome could be generated by using a cup horn with a Misonix sonicator (Auerbach et al., 2009). We have found that use of a cup horn on a Misonix sonicator 4000 to sonicate cross-linked E. coli cells similarly yields smaller fragments and removes the need for subsequent digestion with micrococcal nuclease.
4. To stop the micrococcal nuclease, add 5 μL 0.5 M EDTA (10 mM final).
5. Centrifuge for 10 min at maximum speed (13,000–14,000 rpm) at 4 °C. Transfer the supernatant to a new Eppendorf tube and use this lysate for the subsequent IP.
6. To check the efficacy of sonication and nuclease treatment, take a small sample and reverse the cross-link for 5–6 h at 65 °C, then run on a 1.5% agarose gel. The majority of the smear should be between 300 and 500 bp. Adjust the number of sonications if necessary.
2.3. Immunoprecipitation
1. Remove 1/10 volume for Input/no antibody control into new tube and add 20 µL Sepharose beads (see instructions below on how to prepare and handle beads), and rotate at 4 °C until day 4, step 5b.
2. Preclear remaining lysate by adding 20–30 µL Sepharose beads, rotate 3 h at 4 °C. Spin down beads at 3000 rpm. Transfer supernatant to a new Eppendorf tube.
3. Add 2 µL antibody to 50 mL culture (now about 250 µL lysate). Rotate overnight at 4 °C.
Day 4:
4. Add 30 µL Sepharose beads to the tube with antibody and rotate for 1 h at 4 °C. It is good to bring the volume up to 750 µL with 1× IP buffer to ensure beads stay in solution; they have a tendency to stick to the side of tubes.
5. Spin down beads at 3000 rpm for 1–2 min. Remove lysate; antibody/protein/DNA is now bound to beads.
5b. Spin down Input/no antibody control tube, remove, and save supernatant in new tube for Input control. Continue to step 6 with no antibody beads. Go to step 7 with Input control tube.
6. Wash beads with 1 mL 1× LiCl wash solution. Spin down beads at 3000 rpm for 1 min, remove liquid but leave a little liquid (less than 100 µL) to ensure no beads are removed. Continue washing steps with 600 mM NaCl wash buffer twice, 300 mM NaCl wash buffer twice, and 1× TE twice. After final TE wash, spin down again at 3000 rpm for 3 min, remove last bit of liquid with a pipette making sure not to remove any beads.
7. Add 100 µL ChIP elution buffer to beads and incubate for 30 min at 65 °C to elute DNA/protein from beads.
8. Spin down beads at 3000 rpm for 2–3 min. Transfer supernatant to new Eppendorf tube, incubate 6 h to overnight at 65 °C to reverse cross-link.
Day 5:
9. Clean up DNA with Qiagen's QIAquick PCR Purification kit. Elute DNA with 58 µL Qiagen elution buffer provided with the kit. This yields a final volume of about 50 µL.
2.4. Prepping and handling sepharose bead 50:50 slurry For monoclonal antibodies, we find a 50:50 mix of protein A and protein G works best. For polyclonal antibodies, use just protein A beads. Use snipped tip/wide-bore tips whenever pipetting beads to avoid disrupting them. Wash beads two to three times with IP buffer to remove ethanol from beads. Resuspend beads in equal volume of IP buffer to get a 50:50 slurry.
2.5. Solutions and Reagents

1 M Sodium phosphate mix. For 100 mL:
  0.845 M Na2HPO4        84.5 mL 1 M Na2HPO4
  0.155 M NaH2PO4        15.5 mL 1 M NaH2PO4

2× IP buffer. For 50 mL:
  200 mM Tris pH 8       10 mL 1 M Tris
  600 mM NaCl            6 mL 5 M NaCl
  4% Triton X-100        20 mL 10% Triton X-100

Micrococcal nuclease (USB cat. no. 70196Y): Dissolve in 10 mM Tris–HCl, pH 8, 50% glycerol to 10 units/µL. For 15,000 units, add 1.5 mL of Tris/glycerol.

Ribonuclease I "A" bovine pancreas (USB cat. no. 27032301): Resuspend to 10 mg/mL and heat for 10 min at 90 °C to remove potential DNase. Prepare 1 mg/mL stock by dilution in 10 mM Tris–HCl pH 7.5 and store at −20 °C.

PMSF: Dissolve to 100 mM in isopropanol. Store at −20 °C in 100 µL aliquots. Add fresh each time to IP buffer.

CaCl2 mix:
  500 mM Tris pH 8
  50 mM CaCl2

1× LiCl wash buffer. For 50 mL of 2× LiCl:
  250 mM LiCl            5 mL 5 M LiCl
  100 mM Tris–HCl, pH 8  10 mL 1 M Tris
  2% Triton X-100        20 mL 10% Triton X-100

600 mM NaCl wash buffer. For 50 mL:
  100 mM Tris–HCl, pH 8  5 mL 1 M Tris
  600 mM NaCl            6 mL 5 M NaCl
  2% Triton X-100        10 mL 10% Triton X-100

300 mM NaCl wash buffer. For 50 mL:
  100 mM Tris–HCl, pH 8  5 mL 1 M Tris
  300 mM NaCl            3 mL 5 M NaCl
  2% Triton X-100        10 mL 10% Triton X-100

1× TE. For 50 mL:
  10 mM Tris–HCl, pH 8   0.5 mL 1 M Tris
  1 mM EDTA              0.1 mL 500 mM EDTA

ChIP elution buffer. For 50 mL:
  50 mM Tris–HCl, pH 8   2.5 mL 1 M Tris
  10 mM EDTA             1 mL 500 mM EDTA
  1% SDS                 5 mL 10% SDS
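The stock volumes in these recipes follow from C1 × V1 = C2 × V2. As an illustration, the sketch below recomputes the 2× IP buffer recipe; the helper name is hypothetical, and making up the remaining volume with water is an assumption, since the recipe does not state the diluent.

```python
def stock_volume_ml(final_conc, stock_conc, final_vol_ml):
    """Volume of stock needed so that final_vol_ml contains final_conc.

    final_conc and stock_conc must share units (mM with mM, % with %).
    """
    return final_conc * final_vol_ml / stock_conc


FINAL_VOLUME_ML = 50.0

# component: (final concentration, stock concentration), as listed for 2x IP buffer
ip_buffer_2x = {
    "Tris pH 8 (mM)": (200.0, 1000.0),
    "NaCl (mM)": (600.0, 5000.0),
    "Triton X-100 (%)": (4.0, 10.0),
}

volumes = {
    name: stock_volume_ml(final, stock, FINAL_VOLUME_ML)
    for name, (final, stock) in ip_buffer_2x.items()
}
# Assumption: the balance of the 50 mL is water
water_ml = FINAL_VOLUME_ML - sum(volumes.values())

for name, vol in volumes.items():
    print(f"{name}: {vol:g} mL of stock")
print(f"water (assumed diluent): {water_ml:g} mL")
```

Swapping in another recipe's final and stock concentrations gives a quick check before scaling a buffer up or down.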
2.6. Analysis of ChIP by qPCR
DNA from ChIP can be quantitated by real-time PCR. Primers are designed for the desired gene targets with an amplification range of 100–200 bp. All real-time PCRs described here were performed on an Applied Biosystems 7500 Fast Cycler, using JumpStart SYBR Green from Sigma. Primer sequences for each location are listed here:
rrnP (forward 5′-ttgcatgcagatgatgaggt; reverse 5′-tatgccgcgtgtcgtataaa)
lacP (forward 5′-agctggcacgacaggttt; reverse 5′-tccgctcacaattccaca)
frmP (forward 5′-ttgcatgcagatgatgaggt; reverse 5′-accgttccagagcatcaatc)
frmR (forward 5′-ctaatgggctgatggcagaa; reverse 5′-gtcaacggattggctgactt)
frmA (forward 5′-gcaaaccatgaacacgtctg; reverse 5′-acagaatcacctggctggac)
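The chapter does not specify how the qPCR readings are converted into enrichment values. One commonly used convention is the percent-of-input calculation, sketched below as an assumption rather than as the authors' method. It assumes about 100% amplification efficiency (a twofold change per cycle) and that the Input sample represents 1/10 of the lysate, as in the immunoprecipitation steps; the Ct values in the usage lines are hypothetical.

```python
def percent_input(ct_ip, ct_input, input_fraction=0.1):
    """Percent of input recovered in the IP, assuming a twofold change per cycle.

    input_fraction is the share of lysate set aside as Input (1/10 here).
    """
    return 100.0 * input_fraction * 2.0 ** (ct_input - ct_ip)


def fold_over_no_antibody(ct_ip, ct_no_antibody):
    """Enrichment of the IP over the no-antibody control (same assumption)."""
    return 2.0 ** (ct_no_antibody - ct_ip)


# Hypothetical Ct values for one primer pair
print(percent_input(ct_ip=26.0, ct_input=25.0))
print(fold_over_no_antibody(ct_ip=26.0, ct_no_antibody=30.0))
```

If the Input and IP samples are diluted differently before qPCR, the input fraction needs to be adjusted accordingly.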
2.7. ChIP–chip DNA prep for microarray
ChIP DNA was prepared according to NimbleGen's protocol. Briefly, ChIPed DNA is blunt ended using DNA polymerase, and annealed linkers (see below for specific instructions on preparation of annealed linkers) are then ligated overnight to the DNA using T4 DNA ligase. The ligated DNA is amplified, labeled, and examined using in-house microarrays or sent to NimbleGen for processing and examination on high-density arrays synthesized by NimbleGen Systems.

2.7.1. Blunting the DNA
1. In 200 µL PCR tubes, add 25 µL ChIP DNA, 25 µL no antibody control DNA, or 5 µL Input control DNA and bring the total volume to 100 µL with dH2O.
2. Add 12.7 µL of blunting mix.

Blunting mix
  Components                                                              1×
  10× T4 DNA polymerase buffer (NEB #007-203; same as NEB #2 (blue))      11 µL
  10 mg/mL BSA (NEB #007-BSA)                                             0.5 µL
  10 mM dNTP                                                              1 µL
  T4 DNA polymerase, 3 U/µL (NEB #203L)                                   0.2 µL
  Total                                                                   12.7 µL
3. Mix by pipetting and incubate at 12 °C for 20 min in a PCR machine.
4. Transfer blunted DNA to new 1.5 mL tubes and place on ice.
5. Add 12 µL of NaOAc/glycogen mix and vortex.

Sodium acetate (NaOAc)/glycogen mix
  Components                                   1×
  3 M NaOAc, pH 5.2 (Sigma S-7899)             11 µL
  20 mg/mL glycogen (Roche #10901393001)       1.0 µL
  Total                                        12 µL
6. Add 120 µL of phenol/chloroform/isoamyl alcohol (25:24:1, Sigma P-3803).
7. Vortex and spin 5 min at maximum speed at 4 °C.
8. Transfer 110 µL to a new 1.5 mL Eppendorf tube and add 230 µL cold ethanol (100%) and then vortex sample. Store at −80 °C for 15–30 min. Spin at 14,000 rpm for 15 min at 4 °C.
9. Remove supernatant and wash the pellet with 500 µL cold 70% ethanol.
10. Spin for 5 min at 4 °C.
11. Aspirate the supernatant, spin briefly, and remove any remaining liquid with pipette. Allow the pellet to dry briefly (5–10 min).
12. Resuspend pellet in 25 µL dH2O and place on ice.

2.7.2. Ligating the DNA
1. On ice, add 25 µL of cold ligase mix to each tube:

Ligase mix
  Components                                   1×
  10× ligase buffer                            5 µL
  15 µM annealed linkers (*see below)          6.7 µL
  T4 DNA ligase (NEB #202L)                    0.5 µL
  dH2O                                         13 µL
  Total                                        25.2 µL
2. Mix by pipetting and incubate overnight at 16 °C.
3. The next day, add 6 µL of 3 M NaOAc to each tube and 130 µL of 100% ethanol. Vortex sample.
4. Freeze at −80 °C for 15–30 min, spin at 14,000 rpm for 15 min.
5. Wash with 500 µL 70% ethanol, then spin and air dry for 5–10 min.
6. Resuspend the pellet in 25 µL of dH2O and transfer to 200 µL PCR tubes and place on ice.

2.7.3. First ligation-mediated PCR (LM-PCR)
1. On ice, add 25 µL of the following PCR mix to DNA.

PCR mix
  Components                                   1×
  10× ThermoPol reaction buffer (NEB)          5 µL
  2.5 mM dNTP                                  5 µL
  40 µM Primer 1                               1.25 µL
  dH2O                                         11.75 µL
  Taq polymerase, 5 U/µL (Qiagen)              1 µL
  PFU Turbo, 0.025 U/µL* (Stratagene)          1 µL
  Total                                        25 µL
*100× PFU = 2.5 U/µL; dilute to 1× and use 1 µL of 1× per reaction.
2. Transfer contents to PCR tubes on ice, place in PCR machine, and run the following program:
Program: LM-PCR
  step 1. 55 °C for 2 min;
  step 2. 72 °C for 5 min;
  step 3. 95 °C for 2 min;
  step 4. 95 °C for 1 min;
  step 5. 60 °C for 1 min;
  step 6. 72 °C for 2 min;
  step 7. go to step 4 for 22 times;
  step 8. 72 °C for 5 min;
  step 9. 4 °C forever;
3. Continue to second LM-PCR.

2.7.4. Second LM-PCR
1. On ice, take 15 µL of first LM-PCR and add 10 µL dH2O to final volume of 25 µL.
2. On ice, add 15 µL of PCR label mix to tube.

PCR label mix
  Components                                   1×
  10× ThermoPol reaction buffer (NEB)          5 µL
  2.5 mM dNTP                                  5 µL
  40 µM Primer 1                               1.25 µL
  dH2O                                         11.75 µL
  Taq polymerase, 5 U/µL (Qiagen)              1 µL
  PFU Turbo, 0.025 U/µL* (Stratagene)          1 µL
  Total                                        25 µL
*100× PFU = 2.5 U/µL; dilute to 1× and use 1 µL of 1× per reaction.
3. Transfer to PCR tubes on ice, place in PCR machine, and run the same program as before:
Program: LM-PCR
  step 1. 55 °C for 2 min;
  step 2. 72 °C for 5 min;
  step 3. 95 °C for 2 min;
  step 4. 95 °C for 1 min;
  step 5. 60 °C for 1 min;
  step 6. 72 °C for 2 min;
  step 7. go to step 4 for 22 times;
  step 8. 72 °C for 5 min;
  step 9. 4 °C forever;
4. Purify DNA with QIAquick PCR purification kit. Elute in 50 µL elution buffer (if using NimbleGen to run the array, then elute in nf-H2O). Take the OD of all final samples.
5. If using NimbleGen services, then the preferred concentration of DNA is 300–500 ng/µL.

2.7.5. *Linker preparation
Primer 1: 5′-GCGGTGACCCGGGAGATCTGAATTC-3′; HPLC purified
Primer 2: 5′-GAATTCAGATC-3′; HPLC purified
1. Dissolve both linkers in dH2O to a concentration of 40 µM.
2. Mix 250 µL 1 M Tris pH 7.9, 375 µL of each 40 µM primer; aliquot in 100 µL volumes.
3. Heat samples for 5 min at 95 °C.
4. Transfer samples to 70 °C heat block and place heat block on bench top and allow it to gradually cool to room temperature (25 °C).
5. Store samples at −20 °C.
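The linker design and the "15 µM annealed linkers" entry in the ligase mix can both be checked from the sequences and volumes given above. The sketch below is only an illustrative check with hypothetical helper names: it tests whether Primer 2 is the reverse complement of the 3′ end of Primer 1 (so that annealing leaves one blunt end and a single-stranded 5′ tail on Primer 1), and computes the concentration of each oligonucleotide in the annealing mix.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


PRIMER_1 = "GCGGTGACCCGGGAGATCTGAATTC"
PRIMER_2 = "GAATTCAGATC"

# Primer 2 should pair with the 3' end of Primer 1
pairs_with_3prime_end = PRIMER_1.endswith(reverse_complement(PRIMER_2))
# Unpaired 5' portion of Primer 1 after annealing
single_stranded_tail = PRIMER_1[: len(PRIMER_1) - len(PRIMER_2)]

# Annealing mix from the linker preparation steps (volumes in µL, stock in µM)
tris_ul, primer_ul, primer_stock_uM = 250.0, 375.0, 40.0
total_ul = tris_ul + 2 * primer_ul
linker_uM = primer_stock_uM * primer_ul / total_ul

print(pairs_with_3prime_end, single_stranded_tail, linker_uM)
```

With the sequences and volumes listed, each oligonucleotide ends up at 15 µM, which agrees with the linker concentration named in the ligase mix.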
2.8. Genome-wide location analysis (ChIP–chip)
To perform a ChIP experiment, one requires an antibody that only binds to the protein of interest. However, when an antibody for the specific protein is not available, an epitope-tagged protein can be used for ChIP and ChIP–chip (or genome-wide ChIP). To test the fidelity and effectiveness of ChIP–chip using an antibody directed to an epitope-tagged protein, we compared the genome-wide binding profiles for NusG obtained by using two different antibodies: a polyclonal antibody specific to NusG (in a strain bearing the untagged endogenous protein) and a monoclonal antibody to the hemagglutinin (HA) peptide epitope in a strain where chromosomal NusG is tagged at the N-terminus with an HA epitope. Analysis of the data using ChIPOTle (Buck et al., 2005) identified an average of 601 sites from three separate ChIP–chip experiments for the HA-tagged NusG, while 710 sites were detected for ChIP–chip done with a polyclonal antibody against NusG. Although there are differences in the number of peaks detected, the scatter plot shows a strong correlation, 0.835, between the two approaches (Fig. 20.5). The difference in the number of peaks detected might reflect differences in antibody accessibility or differences in how well the individual antibodies work for immunoprecipitation. Ideally, ChIP could be performed with a specific antibody as well as with an antibody specific to an epitope tag, but either approach can be used to map protein distributions. An advantage of the epitope-tagging approach is that the same antibody could be used against multiple tagged proteins in parallel, thus removing the potential complication of antibodies that are less effective in this procedure.

Figure 20.5 [Scatter plot of NusG HA-tagged versus NusG antibody; x-axis: HA_NusG, y-axis: NusG_Ab.] Comparison of ChIP–chip data from a polyclonal antibody specific for NusG (in strain MG1655) to that from using a monoclonal antibody to an HA tag engineered at the N-terminus of NusG (RL1664; MG1655::HA-NusG). The two data sets are strongly correlated, indicating overlap in detection across the genome.

By using a tiled microarray, ChIP–chip data can reveal not only the genomic location, but also report on the extent to which a protein interacts with DNA across the entire genome in vivo. With a well-designed array (reviewed in Buck and Lieb, 2004) and our optimized ChIP–chip protocol, highly detailed and specific binding sites across the entire genome were discovered. When combining various ChIP–chip traces of RNAP along with various transcription factors, one is able to visualize the distribution across the genome as shown in Fig. 20.6B (Herring et al., 2005; Mooney et al., 2009; Reppas et al., 2006; Roberts et al., 2003). In addition to mapping protein-binding sites, ChIP–chip is able to differentiate relatively how much protein is binding to specific regions of the genome.

The arrays used for ChIP–chip were designed to cover the entire genome, and the design of each sequence probe was optimized to achieve isothermal values to maintain uniform hybridization properties across the array. These design criteria produced probes of 45–51 nucleotides, tiled across the genome with an average separation of 24.5 bp. This design provided a twofold (duplicate) coverage of the genome using 374,408 probes in a 1 cm² area on a glass slide.
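The array design figures can be checked against one another with simple arithmetic. The sketch below assumes that "twofold coverage" means every tiled position is represented by two probes, and it uses an approximate E. coli K-12 MG1655 genome length of 4,639,675 bp; both are assumptions made for illustration and are not stated in the text.

```python
N_PROBES = 374_408
FOLD_COVERAGE = 2                 # assumption: each position is tiled twice
MEAN_SPACING_BP = 24.5
ASSUMED_GENOME_BP = 4_639_675     # assumption: approximate MG1655 genome length

# Number of distinct tiled positions and the span they cover
unique_positions = N_PROBES // FOLD_COVERAGE
tiled_span_bp = unique_positions * MEAN_SPACING_BP
fraction_of_genome = tiled_span_bp / ASSUMED_GENOME_BP

print(unique_positions, tiled_span_bp, round(fraction_of_genome, 3))
```

Under these assumptions the probe count and spacing account for nearly the whole genome, which is consistent with the description of an array designed to cover the entire genome.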
2.9. Data analysis protocol
1. Obtain Cy3 and Cy5 paired data files from scanned microarrays (files from arrays scanned either using a Molecular Devices Axon4000B scanner or from a fee-for-service facility such as that provided by Roche NimbleGen (Madison, WI)). Paired data files are typically labeled 532.pair (Cy3) and 635.pair (Cy5) and contain one intensity value for each probe on the microarray.
2. Convert the Cy3 and Cy5 intensities for the probe set to log2(IP/Input) values that are corrected for dye interaction by Lowess normalization (Yang et al., 2002), using the normalizeWithinArrays function (Smyth and Speed, 2003) within the limma package (Smyth, 2005) for the statistical program R (R Development Core Team, 2005). The Lowess normalization corrects for dye interactions by generating a local regression model of log2(IP/Input) versus log2(IP × Input) signals, with the global median of the data shifted to zero.
Figure 20.6 [Panel A: histogram; x-axis: log2 (RNAP IP/input) bins, y-axis: number of probes; annotated with the background mean, −0.363 ± 0.137, and the mean of the top ten 3-probe clusters, 4.63. Panel B: genome-wide ChIP–chip traces with an expanded region.] (A) The RNAP (β′ subunit IP/Input) histogram (blue) is overlaid with the histogram of the background regions (black). The highest signal regions (mean of top 10 3-probe clusters) were selected for an apparent occupancy of 1. (B) The log2 ratios from ChIP–chip profiles (IP/Input) of RNAP and regulators across the E. coli genome are shown for sigma70 (orange) and RNAP β′ subunit (blue). Regions across the genome identified as background interactions are shown as black bars; genes encoding rRNA and tRNA are shown in blue and green, respectively. An expanded region around 0.95 Mb is shown for RNAP, sigma70, NusA, NusG, and Rho with the locations of known promoters (vertical lines with black horizontal arrows) or predicted promoters (vertical lines with gray horizontal arrows).
3. If the analysis involves averaging multiple data sets (either biological or technical replicates), then perform quantile normalization using the normalize.quantiles function in the R package affy (Gautier et al., 2004) and then average the values for each probe. 4. To associate each probe with a genome position, assign the probe value to the midpoint coordinate of the probe.
5. To determine background based on the absence of transcription for each dataset, identify genome regions in which the RNAP ChIP signals meet three criteria: (i) the region is greater than 1 kb; (ii) the log2(IP/Input) signals for all 300 bp windows within the region are indistinguishable (Student's t-test; p < 0.05) from that of a known nontranscribed gene (e.g., for E. coli K-12, use bglB); and (iii) no portion of the region overlaps a transcribed gene based on expression profiling of cells grown in identical conditions (e.g., for E. coli K-12 grown in minimal medium, no estimated transcript abundance exceeds 1/cell (Allen et al., 2003)).
6. For RNAP or RNAP-associated transcription factors (e.g., NusG and NusA), calculate the background signal distribution (mean, s.d.) for each dataset by averaging the probe values within the identified genome regions. Verify that the background signal distribution is normal (Cramer–von Mises normality test; p < 0.02).
7. Subtract the background value from each probe value in the dataset.
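The numerical core of the steps above can be sketched in a few functions. The protocol itself uses R (limma and affy); the Python/NumPy version below is only an illustration of the arithmetic, and all function names are hypothetical. Note one deliberate simplification: the Lowess fit of step 2 is not implemented here, and a plain global median centering stands in for it. The background regions of step 5 are assumed to have been identified already and are passed in as a boolean mask.

```python
import numpy as np


def log_ratio_and_intensity(ip, inp):
    """M = log2(IP/Input), median-centred (stand-in for Lowess); A = 0.5*log2(IP*Input)."""
    ip = np.asarray(ip, dtype=float)
    inp = np.asarray(inp, dtype=float)
    m = np.log2(ip / inp)
    a = 0.5 * np.log2(ip * inp)
    return m - np.median(m), a


def quantile_normalize(values):
    """Quantile-normalize the columns (replicates) of a probes x replicates array."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values, axis=0), axis=0)  # rank of each probe per column
    mean_sorted = np.sort(values, axis=0).mean(axis=1)      # mean of each quantile
    return mean_sorted[ranks]


def probe_midpoints(starts, ends):
    """Genome coordinate assigned to each probe (step 4)."""
    return (np.asarray(starts, dtype=float) + np.asarray(ends, dtype=float)) / 2.0


def subtract_background(values, background_mask):
    """Subtract the mean signal of the background probes (steps 6 and 7)."""
    values = np.asarray(values, dtype=float)
    background = values[np.asarray(background_mask, dtype=bool)]
    bg_mean = background.mean()
    bg_sd = background.std(ddof=1)
    return values - bg_mean, bg_mean, bg_sd


# Usage with small made-up numbers: two replicates of four probes
m1, _ = log_ratio_and_intensity([400, 800, 1600, 200], [200, 200, 200, 200])
m2, _ = log_ratio_and_intensity([300, 900, 1500, 250], [200, 200, 200, 200])
averaged = quantile_normalize(np.column_stack([m1, m2])).mean(axis=1)
corrected, bg_mean, bg_sd = subtract_background(averaged, [True, False, False, True])
print(corrected, bg_mean, bg_sd)
```

Ties are broken by position in this simple quantile normalization, which is adequate for a sketch but differs from the averaging of tied ranks used by some implementations.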
2.10. Defining the background signal for a widely distributed protein complex The assignment of background signal in RNAP and regulator ChIP–chip experiments is a complex issue. Most available methods were adapted from expression array analysis and offer imperfect solutions. Nimblegen, for instance, subtracts a background value derived from the Tukey’s bi-weight robust mean estimator (Hoaglin et al., 2000). This is an appropriate background estimate when relatively few DNA locations are occupied by a protein. However, this is not the case for RNAP and many regulators of transcription that interact with RNAP. Hence, the background levels estimated by Tukey’s bi-weight mean are inappropriate. Therefore, we considered using the mode of the ChIP-signal distribution (0.23 in Fig. 20.6A) as the true background, under the assumption that it represents the mean of a normal background distribution whose right portion is merged with signal above background. However, even the mode proved to be above the background for RNAP as estimated from bglB-like regions (Fig. 20.6). We settled on the average signal from bglB-like regions as the best estimator of background signal in our ChIP–chip experiments, but note that even this background may include signal from nonspecifically bound RNAP and that the extent of nonspecific RNAP association with DNA could be affected by other proteins that interact with DNA, including chromatin-like proteins or actively transcribing RNAP. It is likely that this method of defining background regions of nonspecific RNAP or regulator interactions with DNA underestimates the true extent of these regions. Low levels of specific RNAP or regulator
association with DNA may not be detected above background by this method but might be detected if the efficiencies of cross-linking or IP were higher. For this reason, the regions used for definition of background levels of nonspecific association should not be generalized to other experiments, but need to be reassessed for each experiment.
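The difference between the two background estimators discussed in this section (the mode of the whole ChIP-signal distribution versus the mean of designated background regions) can be illustrated on synthetic data. Everything in the sketch below is an assumption made for illustration: the mixture of three normal distributions, their parameters, and the number of histogram bins are invented and do not come from the experiments described here.

```python
import numpy as np


def histogram_mode(values, bins=60):
    """Centre of the most populated histogram bin."""
    counts, edges = np.histogram(np.asarray(values, dtype=float), bins=bins)
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])


rng = np.random.default_rng(0)

# Synthetic log2(IP/Input) values (hypothetical parameters)
background_probes = rng.normal(-0.36, 0.14, 2000)  # probes in nontranscribed regions
low_occupancy = rng.normal(0.2, 0.3, 8000)         # widespread weak association
high_occupancy = rng.normal(2.0, 1.0, 1000)        # strongly occupied regions
all_probes = np.concatenate([background_probes, low_occupancy, high_occupancy])

mode_estimate = histogram_mode(all_probes)
background_mean = background_probes.mean()

print(f"mode of all probes: {mode_estimate:.2f}")
print(f"mean of background regions: {background_mean:.2f}")
```

When most of the genome carries some signal, as in this synthetic case, the mode of the full distribution sits above the mean of the designated background regions, which mirrors the reasoning given for preferring the bglB-like regions.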
3. Chemical Genomics
Expanding the application of ChIP–chip to include the use of cell-permeable small molecule inhibitors, chemical genomics can help annotate RNAP-binding sites genome-wide, uncouple mechanistic events of cellular processes, and reveal new regulatory networks that were not evident from biochemical or genetic analyses (Herring et al., 2005; Kanin et al., 2007; Mooney et al., 2009; Peters et al., 2009; Raffaelle et al., 2005). For instance, by treating E. coli with rifampicin, a small molecule inhibitor known to inhibit RNA chain extension beyond 2–3 nt (Campbell et al., 2001; McClure and Cech, 1978; Raffaelle et al., 2005; Sippel and Hartmann, 1968), one can trap RNAP at promoters (Fig. 20.7A). We were able to see distinct peaks of β and sigma70 at all known promoters and when these peaks were averaged together, they centered at the promoter start site with great resolution as compared to β and sigma70 in the absence of rifampicin treatment (Fig. 20.7B). As rifampicin binds the β subunit of RNAP and halts RNAP in the initiation complex, we are also able to map E. coli's holoenzyme across the genome. Figure 20.7C demonstrates the great level of overlap between β and β′ signals when treated with rifampicin. By comparing rifampicin-treated β/β′ (core polymerase) to rifampicin-treated sigma70, ChIP–chip provides the opportunity to identify promoters across the genome where core polymerase is present associated with a different sigma factor (Fig. 20.7C, sigma54 complexes at glnH marked in blue). Interestingly, comparison of core occupancy between rifampicin-treated (150 µg/mL rifampicin for 15 min) and untreated cells revealed RNAP association at numerous distinct binding sites that are within transcribed coding regions and often not occupied by holoenzyme in rapidly growing cells (E.K., unpublished and R. H. Ebright, personal communication).
The presence of numerous potential "promoter-competent" regions within transcribed regions suggests intriguing new possible roles for such elements under conditions of reduced expression or it may simply reveal the lack of selective pressure in erasing potential promoter-like sequences in regions of the genome that are not typically available for RNAP assembly. In another example of using small molecules in combination with ChIP, bicyclomycin (BCM) was used (20 µg/mL) to inhibit the terminator protein Rho and observe the consequences on transcription termination across
Figure 20.7 [Panel A: schematic of transcription initiation (bind, closed complex, isomerize, open complex, initiate/abort, elongation complex) and rifampicin action. Panel B: average log2 ratio versus position in base pairs. Panels C and D: scatter plots of rifampicin-treated samples.] (A) RNAP binds to promoter regions of DNA to form a closed complex. This closed complex then isomerizes to an open complex in which the DNA strands are melted. RNAP can either abort initiation, releasing short RNA transcripts, or can isomerize to a stable complex that can processively elongate the RNA. The small molecule rifampicin (red) can bind the β subunit of RNAP in all but the elongation complex and is able to trap RNAP by blocking extension of the RNA from 2–3 nt. (B) Averaged occupancy profiles of sigma70 and β subunits in the presence or absence of rifampicin. A subset of genes (nearly a third of the 4000 E. coli genes) of robust to moderate expression were grouped and the ChIP–chip signal for sigma70 (orange) and β (blue) displayed with respect to the center of the sigma70 peak which coincides with the transcription start site. The positive offset for the β subunit may be a function of more optimal cross-linking with downstream DNA. In the presence of rifampicin, a Gaussian distribution for both subunits (dashed lines) is observed, indicative of an immobile complex. In the absence of rifampicin, sigma70 is released in a stochastic manner as RNAP engages in processive transcript elongation. Part of the sigma70 signal within the transcribed region may arise due to dynamic reassociation with the elongating polymerase. (C) Comparison of ChIP–chip data of RNAP subunits β and β′ from cells treated with rifampicin shows significant correlation. The data indicate that ChIP–chip methods accurately define the location of two subunits of the core RNAP across the genome. (D) Comparison of ChIP–chip data of core RNAP (averaged β and β′ signal) to sigma70 when treated with rifampicin shows several genes where core RNAP is present without sigma70. Circled in blue are probes for the glnH operon that is regulated by sigma54-bearing holoenzyme.
the genome by looking at the changes in RNAP distribution (Peters et al., 2009). Sites that exhibited a shift in the distribution of RNAP in BCM-treated cells were classified as regions where RNAP is regulated by Rho. Although individual Rho-dependent terminators had been previously identified, this application of chemical genomics coupled with ChIP allowed
determination of the sites of RNAP regulation by Rho on a genome-wide scale. In this work, approximately 200 Rho-terminated loci were identified that were either located at the ends of genes or within genes. The loci identified at the ends of genes included not only mRNA genes but also noncoding RNAs such as small RNAs and transfer RNAs (tRNAs), revealing a previously unappreciated role of Rho in the termination of stable RNA synthesis. Additionally, several sites of altered RNAP distribution were identified that were located within genes, including a previously uncharacterized set of short antisense transcripts, identified as such because the shift in RNAP distribution was opposite to the direction of the known gene at that location. Numerous other small molecules and antibiotics that disrupt defined processes in E. coli are known and could be used together with ChIP. Systematic application of such molecules, for example, antibiotics that block translation or specific enzymes such as topoisomerases, would reveal the coupling of various processes in the functioning of complex biological machines that act on the genome. In time, such understanding will provide the framework to develop synthetic tools to control gene networks in a desired manner.
4. Future Directions
ChIP–chip has provided the ability to examine protein–DNA interactions across the genome. New methodologies are focused on using massively parallel sequencing approaches to evaluate protein–DNA complexes (ChIP-seq). ChIP-seq, while expensive, provides significant improvement in base pair resolution. The basic concern of distinguishing meaningful associations from the "genome-sampling/scanning" by RNAP that is observed by ChIP–chip remains a challenge. While methods described here provide the best current estimate of background determination, further insight into the role of RNAP collisions with DNA will be an interesting area of investigation. The role of DNA structure and compaction within live cells also provides interesting layers of organizational and structural control. The genome-wide distribution patterns of RNAP and its factors obtained through ChIP–chip or ChIP-seq methods, when combined with high-resolution live cell imaging, will reveal the role of chromosomal structures and substructures in coordinate regulation of operons. Such analyses will also shed light on other genomic processes such as replication, recombination, and repair. In the future, the ability to define the patterns of genomic interactions of RNAP and other cellular machines will lead to the ability to synthetically alter these natural complexes and use rational design to construct synthetic
genomes and control genes and networks in a programmable manner. The ability to do so in a precise manner will be of enormous value to several fields, especially synthetic biology.
ACKNOWLEDGMENTS
We gratefully acknowledge the efforts of Marni Raffaelle, Jen Rowland, Jason Peters, and Christopher Herring in developing these protocols. We thank Dick Burgess, Bernhard Palsson, Sunduz Keles, and Tricia Kiley for advice and collaborative efforts in developing the methodology. The work would not have been possible without funds from the Vilas associates award, the I&EDR grant, and USDA Hatch grants to A. Z. A. and NIH funds to R. L. E. K. was supported by the NHGRI (GSTP) training grant.
REFERENCES Allen, T. E., Herrgard, M. J., Liu, M., Qiu, Y., Glasner, J. D., Blattner, F. R., and Palsson, B. O. (2003). Genome-scale analysis of the uses of the Escherichia coli genome: Model-driven analysis of heterogeneous data sets. J. Bacteriol. 185, 6392–6399. Aparicio, O., Geisberg, J. V., Sekinger, E., Yang, A., Moqtaderi, Z., and Struhl, K. (2005). Chromatin immunoprecipitation for determining the association of proteins with specific genomic sequences in vivo. In “Current Protocols in Molecular Biology,” (F. M. Ausubel, R. Brent, R. E. Kingston, D. D. Moore, J. G. Seidman, J. A. Smith, and K. Struhl, eds.), pp. 21.3.1–21.3.33. John Wiley & Sons, Inc., Hoboken, NJ. Auerbach, R. K., Euskirchen, G., Rozowsky, J., Lamarre-Vincent, N., Moqtaderi, Z., Lefrancois, P., Struhl, K., Gerstein, M., and Snyder, M. (2009). Mapping accessible chromatin regions using Sono-Seq. Proc. Natl. Acad. Sci. USA 106, 14926–14931. Buck, M. J., and Lieb, J. D. (2004). ChIP-chip: Considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83, 349–360. Buck, M. J., Nobel, A. B., and Lieb, J. D. (2005). ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biol. 6(11), R79. Campbell, E. A., Korzheva, N., Mustaev, A., Murakami, K., Nair, S., Goldfarb, A., and Darst, S. A. (2001). Structural mechanism for rifampicin inhibition of bacterial RNA polymerase. Cell 104, 901–912. Gautier, L., Cope, L., Bolstad, B. M., and Irizarry, R. A. (2004). Bioinformatics Oxford, England. Vol. 20, 307–315. Grainger, D. C., Overton, T. W., Reppas, N., Wade, J. T., Tamai, E., Hobman, J. L., Constantinidou, C., Struhl, K., Church, G., and Busby, S. J. W. (2004). Genomic studies with Escherichia coli MelR protein: Application of chromatin immunoprecipitation and microarrays. J. Bacteriol. 186, 6938–6943. Grainger, D. C., Hurd, D., Harrison, M., Holdstock, J., and Busby, S. J. W. (2005). 
Studies of the distribution of Escherichia coli cAMP-receptor protein and RNA polymerase along the E. coli chromosome. Proc. Natl. Acad. Sci. USA 102(49), 17693–17698. Herring, C. D., and Blattner, F. R. (2004). Global transcriptional effects of a suppressor tRNA and the inactivation of the regulator frmR. J. Bacteriol. 186, 6714–6720. Herring, C. D., Glasner, J. D., and Blattner, F. R. (2003). Gene replacement without selection: Regulated suppression of amber mutations in Escherichia coli. Gene 311, 153–163.
Herring, C. D., Raffaelle, M., Allen, T. E., Kanin, E. I., Landick, R., Ansari, A. Z., and Palsson, B. O. (2005). Immobilization of Escherichia coli RNA polymerase and location of binding sites by use of chromatin immunoprecipitation and microarrays. J. Bacteriol. 187, 6166–6174. Hoaglin, D., Mosteller, F., and Tukey, J. (2000). Understanding Robust and Exploratory Data Analysis. Wiley, New York. Iyer, V. R., Horak, C. E., Scafe, C. S., Botstein, D., Snyder, M., and Brown, P. O. (2001). Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409, 533–538. Kanin, E. I., Kipp, R. T., Kung, C., Slattery, M., Viale, A., Hahn, S., Shokat, K. M., and Ansari, A. Z. (2007). Chemical inhibition of the TFIIH-associated kinase Cdk7/Kin28 does not impair global mRNA synthesis. Proc. Natl. Acad. Sci. USA 104, 5812–5817. Lin, D. C. H., and Grossman, A. D. (1998). Identification and characterization of a bacterial chromosome partitioning site. Cell 92, 675–685. McClure, W. R., and Cech, C. L. (1978). On the mechanism of rifampicin inhibition of RNA synthesis. J. Biol. Chem. 253, 8949–8956. Mooney, R. A., Davis, S. E., Peters, J. M., Rowland, J. L., Ansari, A. Z., and Landick, R. (2009). Regulator trafficking on bacterial transcription units in vivo. Mol. Cell 33, 97–108. Peters, J. M., Mooney, R. A., Kuan, P. F., Rowland, J. L., Keles, S., and Landick, R. (2009). Rho directs widespread termination of intragenic and stable RNA transcription. Proc. Natl. Acad. Sci. USA 106(36), 15406–15411. R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria 3-900051-07-0. http://www. R-project.org. Raffaelle, M., Kanin, E. I., Vogt, J., Burgess, R. R., and Ansari, A. Z. (2005). Holoenzyme switching and stochastic release of sigma factors from RNA polymerase in vivo. Mol. Cell 20, 357–366. Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. 
G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E. I., Volkert, T. L., Wilson, C. J., et al. (2000). Genome-wide location and function of DNA binding proteins. Science 290, 2306–2309. Reppas, N. B., Wade, J. T., Church, G. M., and Struhl, K. (2006). The transition between transcriptional initiation and elongation in E. coli is highly variable and often rate limiting. Mol. Cell 24, 747–757. Roberts, D. N., Stewart, A. J., Huff, J. T., and Cairns, B. R. (2003). The RNA polymerase III transcriptome revealed by genome-wide localization and activity-occupancy relationships. Proc. Natl. Acad. Sci. USA 100, 14695–14700. Sippel, A., and Hartmann, G. (1968). Mode of action of rifampicin on the RNA polymerase reaction. Biochem. Biophys. Acta. 157, 218–219. Smyth, G. (2005). In “Bioinformatics and Computational Biology Solutions using R and Bioconductor,” (R. Gentleman, V. Carey, S. S. Dudoit, R. Irizarry, and W. Huber, eds.), pp. 397–420. Springer, New York. Smyth, G., and Speed, T. (2003). Normalization of cDNA microarray data. Methods 31, 265–273 (San Diego, Calif,). Solomon, M. J., and Varshavsky, A. (1985). Formaldehyde-mediated DNA-protein crosslinking: A probe for in vivo chromatin structures. Proc. Natl. Acad. Sci. USA 82(19), 6470–6474. Wade, J. T., and Struhl, K. (2004). Association of RNA polymerase with transcribed regions in Escherichia coli. Proc. Natl. Acad. Sci. USA 101, 17777–17782. Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed, T. P. (2002). Nucleic Acids Res. 30, e15.
Author Index
A Adam, L., 154, 173 Adams, S. E., 283 Adar, R., 207 Affholter, J. A., 401 Afonso, B., 312 Agapakis, C. M., 312 Agarwal, K. L., 249, 278, 350 Agol, V. I., 47 Agrawal, S., 294, 295 Aguilar, P. V., 260 Aho, A. V., 208, 209 Ajo-Franklin, C. M., 257, 299, 312 Akashi, H., 50 Akioka, M., 433, 434, 436, 437 Alain, T., 47 Aleff, R. A., 285 Algire, M. A., 250, 253, 258, 265, 278, 300, 303, 329, 351, 353, 424, 428, 438 Alibés, A., 3–6, 9 Aliprandi, P., 27 Alkema, W., 13 Allen, T. E., 454, 466, 467, 564 Allert, M., 44, 45, 47, 293, 298, 299 Alon, U., 138, 139 Alper, H., 78, 299 Alperovich, N., 300, 329, 444 Alsuwaiyel, M. H., 209, 221 Altman, A., 256 Altman, R. B., 73, 76 Altschul, S. F., 179 Alves-Rodrigues, I., 69 Al Zaid Siddiquee, K., 69 Amberg, J. R., 270 Anagnostopoulos, C., 430 Anderson, C., 192 Anderson, J. C., 97, 103, 138, 154, 254, 299, 312, 351, 363, 364–366, 370, 374 Andre, P., 294 Andrews-Pfannkoch, C., 253, 258, 265, 278, 300, 329, 351, 353, 428 Andrianantoandro, E., 138 Angarica, V. E., 4 Ansari, A. Z., 449, 454, 464, 467, 564 Aparicio, O., 254, 450 Appel, A., 58 Applebee, M. K., 77 Arauzo-Bravo, M. J., 69
Arima, K., 430 Aristidou, A. A., 71, 73 Arkin, A. P., 69, 138, 299, 312, 364–366, 374 Arlotto, M. P., 59 Arndt, K., 312 Arnheim, N., 249, 342 Arnold, F. H., 138, 401 Arnold, S., 253 Arnould, S., 4–6 Asadollahi, M. A., 78, 79 Ashworth, J., 5 Aslanidis, C., 270, 328, 350 Assad-Garcia, N., 300, 329, 444 Atkinson, T. C., 249 Attias, R., 59 Auckenthaler, A., 253 Auerbach, R. K., 456 Au, L. C., 208, 282, 283 Avignone-Rossa, C. A., 79, 80 Axelrod, K. C., 300, 329 Aymerich, S., 69 B Baba, T., 69 Babson, E., 84 Baden-Tillson, H., 253, 258, 265, 278, 300, 329, 351, 353, 428 Bader, J. S., 254, 292, 351 Bader, M. L., 58 Badhwar, J., 28 Baeten, L., 16 Bagh, S., 139 Bahassi, E. M., 316 Bailes, E., 50 Baker, D., 4, 5 Balachandran, A., 410 Ball, D. A., 187 Baneyx, F., 57 Bang, D., 214, 428 Banville, D. L., 283 Bao, J. S., 283 Barabasi, A. L., 84 Barak, Y., 256 Barany, F., 249, 297 Baric, R. S., 350–351 Bar-Joseph, Z., 84 Barnes, H. J., 59 Barone, A. D., 281
Barrett, C. L., 84 Barrett, J. C., 329 Barrow, A., 293 Barr, P. J., 57 Bartley, B. A., 314 Basler, C. F., 260 Basu, S., 138 Batten, C., 363, 370 Bauer, A. P., 251, 264–265 Bautsch, W., 328 Bayer, E. A., 256 Baym, M., 79–82 Beard, D. A., 84 Beaucage, S. L., 278, 280 Beck, E., 70 Becker, C., 278, 280 Becker, J., 73, 76 Becker, S. A., 70, 80 Beecher, J. E., 281 Beg, Q. K., 84 Belghazi, M., 59 Belosludtsev, Y., 280 Belshaw, P. J., 281, 288, 294, 296 Bembom, O., 16 Benders, G. A., 250, 253, 258, 265, 278, 300, 303, 329, 351, 353, 424, 428, 438 Benes, V., 410 Benkovic, S. J., 295 Benos, P. V., 4 Beppu, T., 430 Berger, B., 79–82 Berglund, H., 48 Bergmann, F. T., 139, 174, 187 Berg, P., 350 Berkman, O., 78 Berlin Yu, A., 328 Berman, H. M., 9 Bernard, P., 316 Berry, K. E., 47 Bershtein, S., 256 Berthold, P., 297 Betley, J. R., 280 Bevilacqua, P., 28 Bhatia, S., 97 Bhat, T. N., 9 Bianchi, B. R., 57 Bielecka, A., 80 Bieler, K., 264 Bilitchenko, L., 153 Bing, X., 154 Binkowski, B. F., 288, 294, 296 Birikh, K. R., 328 Birren, B., 429 Bjerknes, M., 27, 316 Blackburn, E. H., 329 Blake, W. J., 138, 139 Blanco, F. J., 4, 5 Blank, L. M., 84
Blattner, F. R., 79, 281, 416, 451, 454, 466 Blauvelt, M. F., 44, 175 Bloch, C. A., 416 Blocker, H., 250 Bloom, F. R., 324 Blose, J., 28 Bode, M., 254, 292 Boeke, J. D., 292, 330, 336 Boesch, B. W., 76, 77 Bognar, A., 69 Bois, J., 28, 31, 32 Boldt, J., 98 Bolivar, F., 250, 350 Bolstad, B. M., 465 Bonde, B. K., 80 Boni, I. V., 27, 46, 47 Borg, J., 5, 6 Bork, P., 85 Bornscheuer, U. T., 404 Borschevskaya, L. N., 254, 289, 290, 297 Bosch, D., 284 Botstein, D., 329, 450, 454 Bourguignon, P. Y., 70 Bourne, P. E., 9 Bowman, J., 73, 76 Boyer, H. W., 250, 350 Boyle, P. M., 312 Brachmann, R. K., 330 Braddock, M., 283 Bradshaw, C. R., 411 Brake, A. J., 57 Braman, J. C., 294 Bramlett, B. W., 174 Brasch, M. A., 328 Bray, R. C., 283 Breitling, F., 295 Brennan, T. M., 208, 280, 284–286, 291, 351 Breton, A., 73, 76 Brickner, R. G., 284 Brignac, S., 280 Brindle, K. M., 80 Broadbelt, L. J., 80, 84 Broderick, M. L., 312 Bron, S., 442 Brophy, J., 363 Brousseau, R., 282 Brown, A. D., 312 Brown, C. J., 50 Brown, P. O., 450, 454 Browne, R. A., 80 Browning, K. S., 47 Brownley, A., 253, 258, 265, 278, 300, 329, 351, 353, 428 Brownstein, M. J., 318 Bubeck, P., 328 Bubenheim, B., 97 Bubunenko, M. G., 410
Bucheimer, R. E., 46 Buchholz, F., 328, 410 Buchi, H., 249, 278, 282, 350 Buck, M. J., 463, 464 Bui, O. T., 77 Bujara, M., 22 Bujard, H., 138 Buller, R. M., 338 Bulmer, M., 50 Bulter, T., 138 Bulyk, M. L., 4, 5, 9, 13 Bumeister, R., 280 Bundy, J. G., 80 Burbelo, P. D., 299 Burchard, J., 281 Burgard, A. P., 77, 79, 80, 82 Burgess, R. R., 467 Burgin, A., 293 Burke, J. F., 283 Burland, V., 416 Burton, N., 80 Bury, P. A., 281 Busby, S. J. W., 454 Bushell, M. E., 79, 80 Butler, M. J., 79 Butner, T. L., 312 Butt, T. R., 57 Byrd, D. R., 328 Byrne-Steele, M. L., 288 C Cabaniols, J. P., 4–6 Caflisch, A., 256 Cai, G., 338 Cairns, B. R., 464 Cai, S., 281, 293 Cai, Y., 44, 174, 175, 183, 187, 258, 291 Cai, Z. L., 350 Campbell, E. A., 467 Camsund, D., 312 Canosi, U., 441 Canton, B., 141, 257, 312 Cantor, C. R., 138, 139 Cao, W., 297 Carbon, J., 329 Carlson, R., 196 Carney, H. C., 270 Carney, J. R., 424 Carpten, J. D., 318 Carrera, J., 138 Carrera, W., 300 Carr, P. A., 208, 214, 295, 296, 410–412, 419 Cartinhour, S., 410 Caruthers, M. H., 208, 249, 278, 280, 350 Caspi, J., 256 Castronovo, S., 59 Cech, C. L., 467
Cello, J., 250, 260, 278, 299 Cerrina, F., 281 Chaffron, S., 85 Chakraborty, R., 69 Chambon, P., 282 Chames, P., 4, 5 Chandra, N., 68, 73, 76, 77 Chandran, D., 139, 174 Chandran, S. S., 424 Chang, A. C., 350 Chang, H. N., 76 Chang, M. C., 59 Chan, L. Y., 424 Chater, K. F., 79 Chavali, A. K., 76, 77 Cheah, K. S., 411 Chen, C., 281 Chen, G., 47 Cheng, J. Y., 280 Chen, G.-Q., 285 Chen, G. T., 47 Chen, H. B., 27, 283, 316 Chen, H. H., 280 Chen, J. M., 253, 282, 351 Chen, K. H., 77 Chen, M. T., 138 Chen, S., 33 Chen, T. J., 84, 86 Chen, X. N., 329 Cheng, Z.-M., 209, 214, 253, 282, 285–290, 294, 297, 351 Cheo, D. L., 328 Cheong, W. C., 254 Chesbrough, H. W., 190 Chevalier, A. A., 312, 405 Chevray, P. M., 330 Chien, A., 353 Ching, K. H., 299 Chiu, M. L., 57 Choe, S., 59 Cho, G., 295 Cho, K., 138 Choi, B. K., 76 Choi, H. S., 78, 80 Choi, I., 285 Choi, S. J., 76 Choi, S. S., 58 Chomsky, N., 208 Choulet, J., 58 Christen, P., 69 Christiansen, M., 28 Chrysostomou, C., 289, 293, 301, 302 Chuang, R. Y., 250, 300, 303, 329, 338, 351, 353, 355, 357, 424, 428, 429, 438 Chu, G., 432 Chu, J., 69 Chu, L. L., 281 Chung, B. K., 82
Church, G. M., 69, 78–82, 208, 209, 212, 214, 215, 281, 294, 295, 409–412, 418–419, 424, 428, 454, 464 Ciccarelli, R. B., 282 Clackson, T., 321 Clancy, K., 174 Clark, A., 78, 79 Clarke, L., 329 Clark, J. M., 403 Clayson, E. M., 80 Cline, J., 294 Cohen, S. N., 350 Coit, D. G., 57 Collado-Vides, J., 4, 416 Collins, J. J., 23, 138, 139, 299 Constante, M., 312, 375 Constantinidou, C., 454 Contou-Carrere, M.-N., 141 Contreras-Moreira, B., 4 Cookson, S., 23 Cooling, M. T., 258 Cooper, Iver, P., 198 Cooper, K. L., 44 Cope, L., 465 Copeland, N. G., 410, 420 Coppi, M. V., 69 Cortez, D., 328 Cortopassi, G., 342 Costa, F. F., 139 Costantino, N., 410–411, 418 Court, D. L., 410–411, 418, 420 Couturier, M., 316 Covert, M. W., 84, 86 Cowe, E., 50 Cox, J. C., 44, 45, 47, 289, 293, 295, 298, 299 Cox, N. J., 260 Cox, R. S., 138 Craft, D. L., 59 Crameri, A., 208, 284–286, 291, 351, 404 Crasta, O., 44 Crea, R., 250, 350 Creasy, C. L., 332 Creevey, C., 85 Cross, T. A., 48 Cruz-Vera, L. R., 47 Cumbers, J., 257, 299, 312 Cunningham, P. R., 46 Curtis, E., 84 Cyanoski, D., 197 Czar, M. J., 174, 183, 254, 291, 299, 312, 351 D Daboussi, F., 4–6 Dacey, S., 283 Dadgar, M., 97 Dalgarno, L., 27 Danchin, A., 257
Danino, T., 23 Dansette, P. M., 59 Daoutidis, P., 141 Darfeuille, F., 26 Darst, S. A., 467 Datsenko, K. A., 328, 410 Datta, S., 410 Daubert, D., 51, 264 Daugherty, P. S., 405 Davidsen, T., 70 Davidson, E. A., 312 Davis, C., 281 Davis, J. H., 257, 299, 312 Davis, R. W., 280, 329, 432 Davis, S. E., 449, 454, 464, 467 Davison, J., 73, 76 DeBlasio, A., 24 de Boer, H. A., 47 de Gier, J. W., 58 Dehal, P. S., 69 deHaseth, P., 278, 280 de Jong, H., 138 de Jong, P. J., 270, 328, 350 DeLalla, E. C., 44 Delgado, J., 80 Dellinger, D. J., 280 DeLoache, W., 363, 370 De Lorenzo, V., 257 DeMarini, D. J., 332 De Masi, F., 4, 5, 9 de Menezes, M. A., 84 de Mora, K., 257, 299, 312 Denison, M. R., 350–351 Denisova, E. A., 253, 258, 265, 278, 300, 329, 351, 353, 428 Densmore, D., 97, 99, 153, 363, 370 Dent, C., 197, 198 Deremble, C., 4 de Smit, M. H., 26, 47 Deutschbauer, A., 69 de Vos, W. M., 80 Dickson, J. O., 312 Dietmaier, W., 270 Diez, J., 69 DiGate, R. J., 283 Dileepan, T., 405 Dirks, R., 28, 31, 32 DiTizio, T., 411 Dittmar, K. A., 52 Doan, T., 69 Doede, T., 281 Doerks, T., 85 Doi, N., 437, 439, 444 Dong, H., 50 Dong, Q., 249, 286, 289, 294, 297 Donis, R., 299 Doudna, J. A., 47 Doyle, F. J. 3rd, 82
Doyle, J., 82 Draheim, R., 58 Drew, D., 58 Dreyfus, M., 46 Drubin, D. A., 138, 312 Duan, H., 209, 215, 285, 287, 289, 290, 297 Duarte, C. M., 4, 5 Duarte, N. C., 70, 80 Dubel, S., 295 Dubnau, D., 430 Duboule, D., 59 Ducat, D. C., 312 Duchateau, P., 4–6 Duclert, A., 4, 5 Dudoit, S., 464 Dueber, J. E., 299, 312, 364–366, 374 Du, L., 312 Durot, M., 70 E Eachus, R. A., 59 Ebersole, T., 329 Eddy, J. A., 86 Edgar, D. B., 353 Edgell, M. H., 209, 341 Edge, M. D., 249 Edwards, J. S., 76, 77 Edwards, M., 283 Edwards, R. M., 283, 444 Efcavitch, J. W., 278, 280 Ehrenberg, M., 52 Eisenberg, Y., 84 Eldarov, M., 329 Elf, J., 52 Elie-Caille, C., 411 Elledge, S. J., 328, 350 Ellington, A. D., 277, 280, 293, 312, 405 Ellis, H. M., 410–411, 420 Ellis, T., 23 Ellison, M., 312 El Massaoudi, M., 80 Elowitz, M. B., 138, 139 Elshourbagy, N., 57 Emilsson, V., 262 Emma, W., 154 Endres, R. G., 4 Endy, D., 140, 257, 299, 312, 323, 424 Engler, C., 103 Engstrom, P., 13 Epinat, J. C., 4, 5 Erdogan, E., 297 Eren, M., 282 Erler, A., 411 Erlich, H. A., 249 Ernst, J., 84 Eshoo, M., 59 Eskin, J. A., 312
Euskirchen, G., 456 Evans, C., 44 Evans, D. H., 338 Evans, G. A., 280 Eyre-Walker, A., 50 F Fabry, S., 270 Faloona, F., 249 Fan, H.-Q., 209, 214, 285–290, 294, 297 Faria, J. P., 80 Fast, W., 295 Fath, S., 251, 265 Fattaey, A., 330 Favis, R., 297 Fedor, J., 312 Feist, A. M., 70, 80, 84 Feng, Z., 9 Ferrar, T. S., 312, 375 Fidanza, J. A., 281 Fingar, S. A., 284 Fink, G. R., 329, 336 Finley, S. D., 80 Firca, J. R., 280 Fischer, E., 69, 84 Fischer, M., 253 Fisher, E. F., 278, 280 Flickinger, S. F., 281 Focha, M., 80 Fodor, S. P., 281, 410 Foley, P., 264 Folkerts, O., 44 Fong, S. S., 69, 77, 79, 83 Fonstein, M., 444 Forest, C. R., 410–412, 419 Forney, L. J., 50 Forster, A. C., 208, 209, 214, 312 Forster, J., 79 Francke, C., 80 Franklin, C. M., 312 Frank, R., 250 Freeland, S. J., 291 Freeman, G. J., 338 Freigassner, M., 58 Friedman, A. M., 297 Friedrich, A., 411 Fritsch, E. F., 269 Fryer, K. E., 77 Fuan, H., 288 Fuhrer, T., 69 Fuhrmann, M., 253, 297 Fu, J., 411 Fujita, K., 233, 234, 237, 430, 433–435, 437 Funahashi, A., 138 Fung, E., 138 Furumichi, M., 70 Fussenegger, M., 138 Fyles, J., 284
G
Gabant, P., 316 Galagan, J. E., 79–82 Galas, D. J., 342 Galinsky, K., 70 Gallant, J., 51 Galluppi, G., 278, 280 Gammon, D. B., 338 Ganapathy, A., 70 Gane, P. J., 193 Gang, G. A., 58 Gao, F., 48, 253, 282, 351 Gao, P., 299 Gao, R., 312 Gao, X., 208, 209, 212, 214, 215, 281, 287–289, 292, 294, 295, 328, 424 García-Sastre, A., 260 Garcia-Vallve, S., 298 Gardner, T. S., 138 Garner, H. R., 280 Garside, E., 312 Gatchel, J. R., 47 Gauges, R., 138 Gautier, L., 465 Gazo, B. M., 47 Geddie, M. L., 270 Gee, E. P., 312 Geisberg, J. V., 450 Gelfand, D. H., 249 Gellert, M., 350 Georgiou, G., 47 Gerdes, S., 444 Gerrits, M., 51, 264 Gerstein, M., 456 Gerth, M. L., 295 Geussenhainer, S., 250 Ge, X., 289, 293, 301, 302 Gianchandani, E. P., 86 Gibson, D. G., 138, 193, 250, 253, 258, 265, 278, 300, 303, 329, 338, 349, 351, 353, 355, 357, 358, 424, 428, 429, 438 Gillam, S., 209, 341 Gilliland, G., 9 Gingeras, T. R., 410 Gish, W., 179 Giver, L., 401 Glasner, J. D., 416, 454, 466 Glass, J. I., 138, 250, 300, 303, 424, 428, 438, 444 Glieberman, A. L., 257, 299, 312 Glieder, A., 58 Godinho, M., 80 Goeddel, D. V., 350 Goetz, P., 70 Gojobori, T., 50 Goldfarb, A., 467 Goldman, R., 278, 280 Goler, J. A., 174, 299, 312, 364–366, 374
Gong, H., 208, 209, 212, 214, 215, 281, 294, 295, 424 Gonzalez de Valdivia, E. I., 47, 48 Gonzalez-Lergier, J., 80 Gordeeva, T. L., 254, 289, 290, 297 Goto, H., 299 Goto, S., 70 Gouaux, J. E., 285 Govindarajan, S., 45, 48, 51, 52, 264, 291, 298, 299 Grafahrend-Belau, E., 80 Graf, M., 51, 251, 255–256, 264–265 Grainger, D. C., 454 Grant, O., 280 Grass, J., 449 Graves, J., 329 Green, A. R., 249 Green, R. D., 281 Greenwald, J., 59 Grice, R., 293 Griswold, K. E., 47 Grizot, S., 4, 5 Grocock, R. J., 50 Gronau, I., 207 Gross, E. A., 270 Grossman, A. D., 450 Grote, A., 298 Gruber, A., 28, 31, 32 Grünberg, R., 22, 256, 312, 375 Grundstrom, T., 282 Grunwald, T., 264 Guarneros, G., 47 Guerois, R., 5 Guido, N. J., 79–82 Guiles, R. D., 283 Guillier, S., 4, 5 Gulari, E., 208, 209, 212, 214, 215, 281, 292, 294, 295, 424 Gunasinghe, M., 291 Gunn, L., 335 Gunyuzlu, P., 282 Guo, J. T., 4 Guo, M. J., 286, 288, 289 Gupta, N. K., 249, 278, 282, 350 Gurney, A., 48, 51, 52, 298, 299 Güssow, D., 321 Gustafsson, C., 43, 45, 48, 51, 52, 54, 264, 291, 298, 299 Guzman, E., 298 H Habermann, B., 411 Hahn, P., 251, 265 Ha, K. D., 208, 284–286, 291, 351 Hall, B., 280, 293 Hall, E. O., 338 Hallinan, J., 258
Hamilton, M. D., 338 Hammarstrom, M., 48 Hammer, K., 79 Hanahan, D., 324 Han, B. L., 299 Hanekamp, T., 73, 76 Hannett, N., 254, 450 Harden, W. L., 312 Hard, T., 48 Harlow, E., 330 Harmston, R., 80 Harrison, M., 454 Harris, T. K., 287–289, 294, 328 Hartley, J. L., 328 Hartmann, G., 467 Hartnett, B., 174, 183, 187 Hatfull, G. F., 410 Hatzimanikatis, V., 80, 84 Havranek, J. J., 4, 5 Hay, B. N., 270 Hayden, M. A., 249 Haynes, K. A., 312 Hazen, T. C., 69 Heard, L. H., 312 Heathcliffe, G. R., 249 Heberlein, U. A., 57 Hegemann, P., 297 Heidorn, T., 312 Heinemann, M., 84, 208, 216, 252 Heinzle, E., 69 Hellen, C. U., 47 Hellgren, N., 48 Hellinga, H. W., 44, 45, 47, 289, 293, 295, 298, 299 Helling, R. B., 350 Hempel, D., 298 Hemsley, A., 342 Henriques, P., 187 Henry, C. S., 80, 84 Hentze, M. W., 47 Herrgard, M. J., 69, 83, 84, 466 Herring, C. D., 77, 79, 84, 451, 454, 467, 564 Hershberg, R., 297 Hertzberg, R., 173 Heyman, A., 256 Heyneker, H. L., 208, 250, 284–286, 291, 350, 351 Hickerson, R. P., 56 Hicks, J. B., 329 Higgins, D. G., 50 Higgins, D. R., 335 Hill, A. D., 139, 174 Hiller, K., 298 Hillesland, K. L., 85 Hinnen, A., 329 Hinshaw, J. E., 405 Hirai, K., 69 Hirakawa, M., 70
Hirasawa, T., 69 Hirose, T., 250, 350 Hiyoshi, A., 404 Hoaglin, D., 466 Hobman, J. L., 454 Hobom, G., 299 Hoekema, A., 47 Hoeller, O., 312 Hoffmann, E., 299 Hogbom, M., 58 Hogrefe, H. H., 294 Hoi, K., 289, 293, 301, 302 Holdstock, J., 454 Holland-Staley, C. A., 46 Holmgren, E., 47 Holtz, W., 22 Hong, A., 281, 293 Hong, S. H., 77, 80 Honig, B., 4 Hood, L., 280 Hoops, S., 138 Hoover, D., 288, 292, 294 Hopcroft, J. E., 208, 209 Hopwood, D. A., 424 Ho, P. Y., 69 Hoque, A., 69 Horak, C. E., 450, 454 Horinouchi, S., 430 Horn, G. T., 249, 253 Horton, R. M., 285, 350 Horvath, S. J., 280 Horwitz, A., 312 Hoshino, T., 430 Ho, S. N., 285, 350 Ho, T., 327 Houle, J., 139, 140, 174 Hsiau, T. H., 363, 370 Hsiung, H. M., 282 Huang, H. H., 312 Huang, J. D., 282, 297, 411 Huang, M. C., 254 Hua, Q., 79 Hua, Y., 48 Huff, J. T., 464 Hu, G., 403 Hughes, M., 299 Hughes, R. A., 277, 289, 293, 301, 302 Hughes, R. C., 288 Hughes, T. R., 281 Hunicke-Smith, S. P., 280, 289, 293, 301, 302 Hunkapiller, M. W., 280 Hunkapiller, T., 280 Hunt, H. D., 285 Hunt, T., 56 Hurd, D., 454 Hutchison, C. A. III., 209, 212, 221, 278, 284, 288, 299, 300, 329, 338, 341, 351, 353, 355, 357, 358, 428, 429, 438, 444
Hüttenhofer, A., 25 Hu, W., 73, 76 Huynen, M., 85 I Iadarola, M. J., 299 Ibarra, R. U., 76 Ibrahim, A. F., 291 Iglesias, A., 441 Ikemura, T., 50, 262 Ikeuchi, M., 433, 434 Ingerman, E., 405 Inouye, M., 47 Irizarry, R. A., 465 Isaacs, F. J., 410–412, 419 Isaksson, L. A., 47, 48 Ishii, N., 69 Itakura, K., 250, 350 Itaya, M., 233, 234, 237, 427, 430, 432–439, 443, 444 Ito, H., 299 Iverson, B. L., 47, 289, 293, 301, 302 Iyer, V. R., 450, 454 J Jackson, D. A., 350 Jacobson, J. M., 208, 214, 295, 296 Jaeggi, D., 85 Jahn, D., 298 Jahnke, P., 209, 341 Jaklevic, J. M., 280 Jamal Rahi, S., 4 Jamet, E., 69 Jamshidi, N., 70, 73, 76, 80 Jaramillo, A., 138 Jayaraman, K., 284 Jay, E., 316 Jenkins, N. A., 410, 420 Jennings, E. G., 254, 450 Jensen, L. J., 85 Jensen, P. R., 79 Jeong, H., 82 Jessee, J., 324 Jessen, E. L., 312 Jiang, B., 73, 76 Jiang, K., 283 Ji, H., 410 Jin, X., 4 Jin, Y. S., 78 Johnson, I. D., 283 Johnson, M., 200 Johnson, P., 292 Jones, A. R., 281 Jones, L., 28 Jones, R. J., 297 Joseph, S., 26, 47 Joshi, R., 4
Joyce, A. R., 68, 70, 77, 79, 80, 84 Julien, P., 85 Jungert, K., 265 Jung, G. Y., 79 Jung, Y. K., 77–80 Junker, B. H., 80 Juvonen, R. O., 59 K Kadouri, D. E., 410 Kaiser, W., 250 Kakazu, Y., 69 Kanai, A., 69 Kanaya, S., 50 Kandzia, R., 103 Kanehisa, M., 70 Kaneko, S., 433, 434, 436, 437 Kang, S., 289, 293, 301, 302 Kanin, E. I., 254, 449, 450, 454, 467, 564 Kannan, M. S., 405 Kao, C. F., 208, 282, 283 Kao, W. C., 280 Kao, Y. S., 280 Kaplan, S., 207 Kardar, M., 4 Karig, D. K., 138 Kærn, M., 138, 139 Karp, P. D., 73, 76, 80, 84 Karri, S., 28 Karr, J. R., 84, 86 Kastelein, R. A., 47 Katzen, F., 327 Katz, J. M., 260 Kauffman, K. J., 76, 77 Kauffman, S., 73, 76 Kawaoka, Y., 299 Kaysen, J., 281, 288, 294, 296 Kayton, I., 199 Kaznessis, Y. N., 137–142, 145–147, 149 Kealey, J. T., 250 Keasling, J. D., 59, 69, 299, 312, 364–366, 374 Keefe, A. D., 295 Kehlenbeck, S., 265 Keith, A., 287–289, 294, 328 Keles, S., 467, 468 Keller, M., 284 Kelly, J. R., 257, 299, 312 Kelner, J. A., 79–82 Kennedy, J., 250 Kerridge, I., 260 Keutzer, K., 99 Khalil, A. S., 23, 299 Khanam, N., 69 Khannapho, C., 79, 80 Khorana, H. G., 249, 278, 282, 350 Khor, S., 292 Khrapko, K., 294
Kiefer, P., 69 Kiel, C., 256 Kierzek, A. M., 79, 80 Kikuchi, Y., 436 Kim, A., 294 Kim, B. H., 77 Kim, C., 281 Kim, H. U., 68–70, 73, 77, 79, 80, 83 Kim, J. H., 79, 81, 329 Kim, P. J., 82 Kim, T. Y., 22, 67–70, 73, 76–80, 82–85 Kim, U. J., 429 Kim, V. N., 59 Kim, Y., 58 Kingsman, A. J., 59, 283 Kingsman, S. M., 59, 283 Kinoshita, A., 86 Kinouchi, M., 50 Kirk, B., 297 Kirschner, A., 404 Kitano, H., 82, 138 Kittleson, J. T., 363, 370 Kizer, L., 69 Kleid, D. G., 350 Kleppe, K., 249, 278, 350 Klepsch, M. M., 58 Klewinghaus, I., 295 Klimavicz, C. M., 299 Kline, B. C., 285, 329 Knight, E. M., 77, 79, 84 Knight, R., 50 Knight, T. F., 140, 208, 299, 311, 312, 314, 323 Kobayashi, S., 281 Kobe, B., 48 Kodumal, S. J., 250, 278, 285, 294, 351 Koffas, M. A., 79 Koizumi, M., 433–435, 437 Kolupaeva, V. G., 47 Komar, A. A., 47 Komarova, A. V., 46, 47 Koncz, C., 284 Koncz-Kalman, Z., 284 Kondo, H., 404 Korenberg, J. R., 329 Korepanova, A., 48, 57 Korzheva, N., 467 Koschutzki, D., 80 Kosovac, D., 264 Koster, H., 250 Kosuri, S., 424 Kotsopoulou, E., 59 Kouprina, N., 329 Kozak, M., 24, 47 Kraszewski, A., 350 Kromer, J. O., 69 Kruger, B., 85 Krummenacker, M., 80, 84 Kuan, P. F., 467, 468
Kuan, Y. K., 254 Kubal, M., 444 Kubert, M., 294, 295 Kubicek, J., 51, 264 Kubitz, M. M., 270 Kudla, G., 44, 45, 47, 51, 298 Kudlicki, W., 327 Kudo, Y., 50 Kuepfer, L., 76, 84 Kuhn, M., 85 Kumar, A., 249, 278, 282, 350 Kumar, R., 316 Kummel, A., 84 Kummer, U., 138 Kunes, S., 329 Kunkel, T. A., 342 Kupiec, M., 47 Kurland, C. G., 50, 51, 262 Kuroki, A., 433, 434, 437, 438 Kuznetsov, S. G., 410 Kwok, R., 174, 194 Kwon, Y. K., 69 L Labno, A., 257, 312 Lacroix, E., 4, 5 LaCroute, F., 336 Lajoie, M. J., 411 Lake, M. R., 57 Lamarre-Vincent, N., 456 Landgraf, D., 312 Landick, R., 449, 454, 464, 467, 468, 564 Langmann, T., 253 Lao, K., 297 Lapedes, A. S., 4 Lape, J., 289, 293, 295 Larionov, V., 329 Larsen, A. P., 256 Larsson, O., 47 Lartigue, C., 250, 300, 303, 329, 424, 428, 438 Lashkari, D. A., 280 Laursen, B., 24 Lavery, L. A., 312 Lavery, R., 4 Lawson, J., 258 Lebedenko, E. N., 328 Le Coq, D., 69 Lee, B. S., 84 Lee, C. C., 138, 280 Lee, D. Y., 47, 68, 77, 82 Lee, E. C., 410, 420 Lee, J. M., 77, 86 Lee, K. H., 46, 68, 77–80, 82 Lee, S. G., 47, 138 Lee, S. J., 77 Lee, S. Y., 67–71, 73, 76–80, 82–85 Lee, W., 292
Lee, Y. J., 208, 214, 295, 296 Leem, S. H., 329 Lefkowitz, S. M., 281 Lefrancois, P., 456 Leguia, M., 299, 312, 363–366, 374 Leibham, D., 328 Leibler, S., 138 Leiby, M., 410 Leigh, J. A., 85 Lemieux, S., 73, 76 Lenaerts, T., 16 Lenhard, B., 13 LeProust, E. M., 280, 281 Lercher, M. J., 80 Lesia, B., 154 Levskaya, A., 312, 405 Levy, M., 312 Lewis, M. R., 444 Liang, C. C., 411 Liang, X., 327 Liao, J. C., 80, 138 Lieb, J. D., 463, 464 Lie, T. J., 85 Lieviant, J. A., 314 Li, H., 138 Li, K., 327 Li, M. H., 69, 254, 281, 292 Li, M. Z., 328, 350 Lim, L. S., 254 Lim, W., 312 Lindblad, P., 312 Lin, D. C. H., 450 Lin, D. M., 464 Linshiz, G., 207 Linteau, A., 73, 76 Lipshutz, R. J., 410 Lisser, S., 44 Liss, M., 51, 247, 251, 264–265 Listwan, P., 48 Li, T., 4 Li, W. H., 51, 55, 298 Li, X., 209, 214, 285–290, 294, 297 Li, Y., 209, 214, 285–290, 294, 297, 351 Little, M., 295 Liu, A., 153 Liu, D. P., 411 Liu, J. G., 253, 282, 351 Liu, M., 466 Liu, Q., 328 Liu, R., 295 Liu, Z., 4 Livi, G. P., 332 Lizarazo, M., 311 Llora, X., 69 Lloyd, D., 312 Lockhart, D. J., 410 Lomakin, I. B., 47 Lorenz, R., 28, 31, 32
Lorimer, D., 293 Lo, S. H., 208, 282, 283 Lou, X. M., 287 Lovley, D. R., 69 Lowe, A. M., 281 Lu, A. T., 281 Lubkowski, J., 288, 292, 294 Ludwig, C., 251, 264–265 Lu, L., 411 Lun, D. S., 79–82 Luo, J., 58 Lu, Q., 332 Lu, T., 23 Lu, W., 69 Lutz, R., 138 Lutz, S., 295 Luu, P., 464 Lux, M. W., 174, 187 Lyons, B. M., 44 M MacEachran, D. P., 410 Madduri, K. M., 59 Madupu, R., 70 Maertens, B., 51, 251, 264–265 Magos-Castro, M. A., 47 Ma, H., 329 Ma, K., 254, 281, 293 Ma, L., 250, 300, 303, 329, 424, 428, 438 Mahadevan, R., 79 Maheswaran, S. K., 405 Mahmood, N. A., 47 Malcolm, B. A., 297 Malloy, K. J., 312 Mamedov, T. G., 254 Mancino, V., 429 Mandecki, W., 249 Mandelbrot, B. B., 208 Mane, S. P., 44 Maniatis, T., 269 Manni, M., 28 Mann, M. J., 338 Mann, R. S., 4 Mao, J., 332 Mao, M., 281 Maranas, C. D., 77, 79–82 Marcaida, M. J., 4, 5 Marchisio, M. A., 138, 174, 258 Marcil, R., 335 Marcotte, E. M., 312, 405 Maresca, M., 411 Margalit, H., 44 Mariana L., 154 Marillonnet, S., 103 Marinelli, L. J., 410 Marino, M., 405 Markel, E., 410
Markham, A. F., 249 Markham, N. R., 28, 31, 413, 417 Marquez-Lago, T., 138 Marquez, R., 50 Marsic, D., 288 Martienssen, R. A., 139 Martineau, Y., 47 Martin, H. G., 69 Martins dos Santos, V. A., 77, 80 Marton, M. J., 281 Maruf, M., 444 Marykwas, D. L., 329 Masiarz, F. R., 57 Massou, S., 69 Masumoto, H., 329 Matayoshi, E. D., 57 Mathews, D., 28 Mathonnet, G., 47 Mathur, J., 284 Matsui, K., 438, 439, 443 Matsumura, I., 270 Matteucci, M., 278, 280 Matthes, H. W., 282 Maury, J., 78, 79 Mayhew, G. F., 416 Maynard, J. A., 138, 142, 145–147 McBride, L., 278, 280 McClelland, M., 403 McClure, W. R., 467 McCuen, H. B., 280 McCulloch, A., 76 McDonald, H. A., 57 McGall, G. H., 281 McGrath, W. J., 48 Meacock, P. A., 249 Meadows, A., 69 Mehreja, R., 138 Mehta, D. V., 283 Meissner, S., 264 Melamud, E., 69 Mendes, P., 138 Menzella, H. G., 250, 278, 285, 294, 351, 424 Merkle, R. C., 208 Merryman, C., 250, 253, 258, 265, 278, 300, 303, 329, 351, 353, 358, 428, 438 Merryweather, J. P., 57 Metropolis, N., 37 Meyerhans, A., 69 Meyer, M. R., 281 Micheletti, J. M., 280, 293 Michniewicz, J., 282 Miercke, L. J., 59 Miklos, A. E., 277, 289, 293, 301, 302 Miller, S., 28 Minshull, J., 43, 45, 48, 51, 52, 54, 56, 264, 291, 298, 299 Mirny, L. A., 4 Mirsky, E. A., 20, 21, 29, 35, 37, 44, 47, 374
Misirli, G., 258 Mitrophanous, K. A., 59 Mitsuishi, Y., 404 Miwa, K., 430 Mixon, M., 293 Miyazaki, K., 399, 403–404 Modrich, P., 295 Molenaar, D., 80 Mo, M. L., 70, 80 Mondragón-Palomino, O., 23 Monie, D. D., 257, 299, 312 Monnat, R. J. Jr., 5 Montague, M. G., 250, 300, 303, 329, 424, 428, 438 Montgomery, R., 70 Montoya, G., 4, 5 Moodie, M. M., 250, 300, 303, 329, 424, 428, 438 Mooney, R. A., 449, 454, 464, 467, 468 Moon, S. Y., 77, 80 Moore, B., 280 Moore, J. D., 48 Moore, R., 59 Moqtaderi, Z., 450, 456 Moreira, R. F., 318 Moreland, R. B., 57 Mori, H., 69 Morohashi, M., 138 Morozov, A., 4 Morris, S. K., 341–342 Morrow, C., 28 Mosberg, J. A., 411 Mosteller, F., 466 Moxley, J. F., 78 Mrozkiewicz, M. K., 48 Mullenbach, G. T., 57 Muller, D. J., 411 Muller, J., 85 Müller, K., 312 Müller, O., 98 Mulligan, J. T., 284 Mullinax, R. L., 270 Mullis, K. B., 249 Munch, R., 298 Munoz, I. G., 4–6 Murakami, K., 467 Muranjan, S., 281, 293 Murphy, P., 47 Murray, A. W., 44, 45, 47, 51, 298 Murray, J. A., 329 Mustaev, A., 467 Muyrers, J. P., 328, 410 N Naba, M., 69 Na, D., 22, 47 Nadra, A. D., 3–5, 9
Nagata, T., 233, 234, 237, 430 Nair, S., 467 Nakahigashi, K., 69 Nakayama, I., 198 Nakayama, Y., 86 Nam, D. H., 208 Namsaraev, E., 280 Nanchen, A., 69 Narang, S. A., 282 Naslund, A. K., 262 Nathans, D., 330 Naylor, K., 405 Neelands, T. R., 57 Negishi, M., 59 Nelson, C., 281 Nelson, M., 403 Ness, J. E., 48, 51, 52, 174, 291, 298, 299 Neumann, G., 299 Newburger, D. E., 13 Newton, C. R., 249 Ngai, J., 464 Ng, J. D., 288 Nguyen, A. W., 405 Nguyen, H. B., 48 Nickoloff, J. A., 335 Nielsen, J. E., 5, 69, 71, 73, 78, 79 Nielsen, L. K., 76 Nielsen, R., 50 Nikolaev, E. V., 77, 79, 80 Nilsson, D., 52 Nilsson, L., 50 Nishiguchi, C., 281 Nishizaki, T., 437, 439, 444 Noller, H. F., 25, 56 Noren, C. J., 318 Norgren, R. M., 280 Noro, N., 404 Nortemann, B., 298 Noskov, V. N., 250, 300, 303, 329, 424, 428, 438 Notka, F., 247, 256, 264 Nuara, A. A., 338 Nunley, P. W., 292 Nunnari, J., 405 Nys, R., 5, 6 O Oakes, F. T., 282 Oberhardt, M. A., 76, 77, 80 O’Brien, K., 280 O’Connell, D., 253 O’Connell, T., 253 Oertel, W., 297 Ogden, B. J., 312 Ogle, K., 280, 293 Ohtani, N., 437 Ohtsuka, E., 249, 278, 282, 350
Okamoto, Y., 329 Okreglak, V., 405 Oldfield, L. M., 410 Oliveira, N., 187 Oliver, S. G., 80 Olson, M. V., 329, 332 Oltvai, Z. N., 84 Oppenheim, A. B., 410 Orloff, A., 69 Orr-Weaver, T. L., 329 Orth, J. D., 68, 76, 77 Osterman, A., 444 O’Toole, G. A., 410 Overton, T. W., 454 P Padgett, K. A., 269 Pahle, J., 138 Paillard, G., 4 Pal, C., 80 Palese, P., 260 Palsson, B. O., 68–70, 73, 76, 77, 79, 80, 83, 84, 454, 466, 467, 564 Panke, S., 22, 84, 208, 216, 252 Pan, T., 52 Papin, J. A., 76, 77, 80, 83, 86 Papoutsakis, E. T., 71, 73 Papp, B., 80 Paques, F., 4–6 Park, J. H., 68, 73, 77–80 Park, J. M., 67, 69, 70, 73, 77, 83, 85 Park, J. S., 208, 214, 295, 296 Park, L. J., 59 Park, S. J., 77–80, 82 Park, T. J., 58 Parmar, P. P., 300 Parmeggiani, F., 256 Parsyan, A., 47, 57 Passmore, S. E., 329 Patel, K. G., 250, 278, 285, 294, 351, 424 Patel, M., 281 Patel, T. R., 77 Patil, K. R., 78, 79 Patin, A., 4, 5 Patrick, W. M., 295 Paty, P. B., 297 Paul, A. V., 250, 260, 278, 299 Pease, L. R., 285, 350 Peccoud, J., 44, 173, 175, 254, 258, 291, 351 Peck, B. J., 280 Peck, K., 280 Peden, J. F., 50 Pedersen, M., 174 Pedersen, P. A., 58 Peisajovich, S. G., 312 Pellarin, R., 256 Peng, L., 69, 327
Peng, R.-H., 209, 214, 253, 282, 285–290, 294, 297, 351 Peng, V., 464 Pennisi, E., 138 Pereda-Lopez, A., 57 Perez, A. G., 4 Perez, C., 4–6 Perez, D. R., 299 Perkins, E., 329 Perna, N. T., 416 Peroutka, R. J., 57 Persson, J. O., 58 Pestova, T. V., 47 Peters, J. M., 454, 464, 467, 468 Peterson, T., 327 Petit, A. S., 4, 5 Petroulakis, E., 47 Petrov, D., 297 Peyraud, R., 69 Pfannkoch, C., 212, 221, 278, 284, 288, 299, 351 Pfleger, B. F., 69 Phan, Q., 300 Pharkya, P., 77, 79–82 Philipps, A., 174 Phillips, I., 312 Phillips, S., 209, 341 Pichler, H., 58 Piech, T., 57 Pienaar, E., 254 Pieper, R., 300 Pilipenko, E. V., 47 Pincas, H., 297 Pinel, N., 85 Pingle, M. R., 297 Pirrung, M. C., 281 Pisarev, A. V., 47 Pisareva, V. P., 47 Pitera, D. J., 69 Piuri, M., 410 Plotkin, J. B., 44, 45, 47, 51, 298 Plückthun, A., 256 Plunkett, G., 416 Plutalov, O. V., 328 Pollard, J., 280, 293 Portais, J. C., 69 Porter, G., 329 Portnoy, V., 84 Potter, J., 327 Pownder, T. A., 329, 330 Prakash, P., 76, 77 Preiss, T., 47 Price, N. D., 77, 79, 80, 83 Prieto, J., 4–6 Prudovsky, E., 284 Puchalka, J., 77, 80 Puigbo, P., 298 Pullen, J. K., 285 Purnick, P. E. M., 23, 98, 174
Q Qian, H., 84 Qiu, Y., 466 Quake, S. R., 280 R Raab, D., 256, 264 Rabinowitz, J. D., 69 Raffaelle, M., 454, 467, 564 Ragan, T. J., 287–289, 294, 328 Rajbhandary, U. L., 249, 278, 350 Ramachandran, B., 285 Ramakrishnan, V., 25 Ramakrishna, R., 76 Ramalingam, K. I., 138, 142, 145–147 Raman, K., 68, 73, 76, 77 Ranganathan, S., 79–81 Ravid, S., 207 Raymond, A., 293 Raymond, C. K., 329, 330, 332 Rayner, S., 280 Read, J. L., 281 Reddy, S., 289, 293, 301, 302 Redondo, P., 4–6 Reece, R. J., 80 Reed, J. L., 77, 79–81, 84 Regenhardt, D., 80 Reid, R., 250, 278, 285, 294, 351, 424 Reisinger, S. J., 250, 285, 424 Remm, M., 46, 47 Ren, B., 254, 450, 454 Reppas, N. B., 454, 464 Resnick, M. A., 329 Rettberg, R., 311, 312 Reumers, J., 16 Revelles, O., 69 Rhau, B., 312 Richardson, C. C., 350 Richardson, S. M., 292 Richmond, K. E., 281, 288, 294, 296 Ridgway, D., 312 Riek, R., 59 Rientjes, J. M., 410 Riggs, A. D., 139, 250, 350 Riley, M., 416 Risso, C., 69 Robert, F., 254, 450 Robert, M., 69 Roberts, D. N., 464 Robinson, K., 312 Rocha, E. P., 50 Rocha, I., 79 Rockwell, G., 79–82 Rode, C. K., 416 Rodesch, M. J., 281 Ro, D. K., 59, 138 Rodrigo, G., 138
Roemer, T., 73, 76 Rogers, H., 208 Rohs, R., 4 Rolland, S., 4, 5 Romeu, A., 298 Roosild, T. P., 59 Rosemond, S., 312 Rosenbluth, A., 37 Rosenfeld, N., 139 Roth, A., 85 Rothstein, R. J., 329 Roth, U., 51, 264 Rouillard, J. M., 281, 292, 293 Rouilly, V., 258 Rousseau, F., 5, 6, 16 Rowland, J. L., 454, 464, 467, 468 Rozowsky, J., 456 Rubin, A. J., 257, 299, 312 Rudd, K. E., 46 Ruppin, E., 47, 78, 84 Russell, D. W., 294, 316, 400, 402 Russo, V. E. A., 139 Rydzanicz, R., 292 Ryu, D. D., 208 S Saaem, I., 254, 281, 293 Sabina, J., 28 Sahle, S., 138 Saida, F., 57 Saiki, R. K., 249 Salerno, J. C., 297 Salis, H. M., 19–21, 23, 29, 35, 37, 44, 47, 141, 149, 312, 374, 405 Sambrook, J., 269, 294, 316, 400, 402 Samuel, G. N., 260 Samuelson, J. C., 58 Sandelin, A., 13 Sanders, R., 194 Sandhu, G. S., 285 SantaLucia, J. Jr., 28 Santama, N., 283 Santi, D. V., 250, 278, 285, 294, 351, 424 Sathe, G. M., 332 Sato, M., 437 Satya, P., 280, 293 Sauer, U., 68, 69, 76, 82, 84 Sauro, H. M., 139, 187, 314 Sawitzke, J. A., 410–411 Sayed, M., 289, 293, 295 Scafe, C. S., 450, 454 Scanlon, D. B., 249 Schachter, V., 70 Scha¨fer, F., 51, 251, 264, 265 Schafmeister, C. E., 59 Schalk, M., 78, 79 Scharf, S., 249
Schatz, O., 253 Schatz, P. J., 329 Scheer, M., 298 Schell, J., 284 Schelter, J. M., 281 Scherer, S., 329 Schicker, A., 69 Schilling, C. H., 77, 79, 80 Schipper, D., 80 Schlegel, S., 58 Schmidt, K., 69 Schmidt, S., 85 Schmitt, R., 270 Schneider, K., 69 Schneider, T. D., 46 Schoch, G. A., 59 Schoedl, T., 255–256, 264 Schreiber, F., 80 Schreiber, J., 254, 450 Schuch, W., 249 Schuetz, R., 76 Schwer, H., 253 Schymkowitz, J., 5, 6, 16 Scott, C., 282 Scouras, A., 312 Seehaus, T., 295 Segre, D., 69, 78 Seidel, R., 411 Sekinger, E., 450 Selgelid, M. J., 260 Septak, M., 280 Sequeira, S. I., 79 Serrano, L., 3–6, 9, 16, 22, 256, 312, 375 Sexson, S. L., 329, 330 Sgaramella, V., 249, 278, 282, 350 Shahbazian, D., 47 Shah, J., 284 Shallcross, M. A., 249 Shang, X. Y., 411 Shanks, R. M., 410 Shannon, K. W., 281 Shao, Z., 258, 401 Shapiro, E., 207 Sharan, R., 84 Sharan, S. K., 410 Sharp, P. M., 50, 51, 55, 298 Shatsky, I. N., 47 Sheardown, S. A., 332 Shelton, R., 44 Shendure, J., 410 Sheng, N., 208, 209, 212, 214, 215, 281, 293–295, 424 Shen, J., 57 Sherine, C., 154 Shetty, R. P., 140, 299, 311, 312, 323 Shields, D. C., 50 Shimizu, K., 69 Shindyalov, I. N., 9
Shine, J., 27 Shirley, J., 327 Shiroishi, T., 233, 234, 237, 430 Shizuya, H., 429 Shlomi, T., 78, 84 Shoseyov, O., 256 Shultzaberger, R. K., 46 Sierzchala, A. B., 280 Siezen, R. J., 80 Siggers, T. W., 4 Siggia, E., 4 Sillaots, S., 73, 76 Silver, P. A., 138, 312 Simeonidis, E., 76 Simon, I., 254, 450 Simon, M., 429 Simonovic, M., 85 Simpson, Z. B., 312, 405 Sims, E. H., 329, 332 Simus, N., 138 Sindelar, L. E., 280 Sineoky, S. P., 254, 289, 290, 297 Singhal, M., 138 Singh-Gasson, S., 281 Sippel, A., 467 Sizun, C., 27 Skalka, A., 249 Sleight, S. C., 314 Slepak, T., 429 Slotboom, D. J., 58 Smallbone, K., 76 Smid, E. J., 80 Smith, A. T., 283 Smith, H. O., 212, 221, 278, 284, 288, 299, 300, 329, 338, 350, 351, 353, 355, 357, 358, 428, 429, 438, 444 Smith, J. R., 295, 318 Smith, M., 209, 341 Smith, S. M., 297 Smit, S., 50 Smolke, C. D., 99, 194 Smyth, D. R., 48 Smyth, G., 464 Sneh, B., 284 Snel, B., 85 Snyder, M., 456 Snyder, T. M., 280 Sockett, R. E., 50 Soga, T., 69 Sohn, S. B., 73 Solas, D., 281 Solomon, L. R., 57 Solomon, M. J., 450 Solórzano, A., 260 Someno, K., 198 Sonenberg, N., 47 Song, H., 76 Sorensen, H., 24
Sorensen, M. A., 52 Sorge, J. A., 269–270 Sotiropoulos, V., 139, 141 Soussi, T., 297 Speed, T. P., 464 Spirin, K., 280 Spizizen, J., 430 Spriestersbach, A., 51, 251, 264–265 Srivannavit, O., 281, 293 Srivas, R., 70, 80 Stahl, D. A., 85 Stark, M., 85 Staub, A., 282 Steffensen, L., 58 Stelling, J., 82, 138, 174, 258 Stemmer, W. P., 208, 284–286, 291, 341–342, 351, 401, 404 Stenström, C. M., 47 Stephanopoulos, G. N., 71, 73, 78, 79 Stevens, R., 444 Stewart, A. F., 328, 410–411 Stewart, A. J., 464 Stewart, L., 293 Stinchcomb, D. T., 329 Stockwell, T. B., 253, 258, 265, 278, 300, 329, 351, 353, 428 Stoddard, B. L., 5 Stolyar, S., 85 Stormo, G. D., 4 Stotland, E., 249 Stricher, F., 4–6, 9, 16, 256 Stricker, J., 23 Strizhov, N., 284 Stroud, R. M., 59 Struhl, K., 329, 450, 454, 456, 464 Stryer, L., 281 Studer, S. M., 26, 47 Stumpp, M. T., 256 Subramanian, A., 254 Suen, J. K., 138 Sueyoshi, T., 59 Sugimoto, H., 404 Summers, D. K., 335 Sun, X., 82 Sun, Z. Z., 410–412, 419 Supina, E. V., 47 Sussman, D., 5 Sussman, M. R., 281 Suthers, P. F., 79–81 Sutton, G., 70 Suzuki, H., 50 Svitkin, Y. V., 47 Swain, P. S., 139 Swaminathan, S., 410 Swayne, D. E., 260 Sweede, M. A., 44 Swenson, R. P., 282 Swigonova, Z., 410
Swingle, B., 410 Symons, R. H., 350 Synder, M., 450, 454 Szallasi, Z., 82 Szostak, J. W., 295, 329 Szybalski, W., 249 T Tabone, J. C., 284 Tabor, J. J., 23, 312, 405 Tachiiri, Y., 429 Takenouchi, M., 403 Takeuchi, T., 433, 434, 436, 437 Takors, R., 80 Takyar, S., 56 Tamai, E., 454 Tanabe, M., 70 Tanaka, T., 432, 436, 437 Tang, Y. J., 69 Tanimura, N., 138 Tan, L., 270 Tannler, S., 69 Tarry, M., 58 Tate, E., 404 Tats, A., 46, 47 Taubenberger, J. K., 260 Tawfik, D. S., 256 Tchufistova, L. S., 46, 47 Temple, G. F., 328 Temsamani, J., 294, 295 Tenson, T., 24, 46, 47, 52 TerMaat, J. R., 254 Testa, G., 410 Tettweiler, G., 47 Teusink, B., 80 Thiberge, S., 138 Thiele, I., 68, 70, 76, 77, 80 Thilly, W. G., 294 Thomas, D. W., 253, 258, 265, 278, 300, 329, 351, 353, 428 Thomason, L. C., 410 Thorneley, R. N., 283 Thumbikat, P., 405 Tian, J., 208, 209, 212, 214, 215, 254, 281, 293–295, 424 Tigges, M., 138 Timmis, K. N., 80 Titus, S. A., 328 Toda, T., 438 Tollervey, D., 44, 45, 47, 51, 298 Tomita, M., 86, 437, 438 Tomita, S., 436 Tomshine, J. R., 138, 139, 142, 145–147, 149, 174 Toney, M. D., 342 Top, E. M., 50 Toyoda, T., 189
Tran, C. Q., 250 Trautner, T. A., 441 Trela, J. M., 353 Trieu, W., 59 Truan, G., 292 Trzaska, D., 270 Tsoka, S., 73, 76 Tsuda, S., 404 Tsuge, K., 233, 234, 237, 427, 430, 433–437, 439, 443, 444 Tsvetanova, B., 327 Tucker, P., 289, 293, 301, 302 Tukey, J., 466 Tuller, T., 47 Tumpey, T. M., 260 Turner, D., 28 Tur, V., 16 Tuttle, L., 149 U Uberla, K., 264 Ullman, J. D., 208, 209 Unoson, C., 26 Uotsu-Tomita, R., 438 Uozumi, T., 430 Urdea, M. S., 57 V Vadasz, S., 291 Valenzuela, P., 57 van Den Berg, S., 48 Van den Brulle, J., 253 Van der Sloot, A. M., 256, 312, 375 Van de Sande, J. H., 249, 278, 350 Van Dien, S. J., 69, 85 van Duin, J., 26, 47 van Kessel, J. C., 410 van Wijk, K. J., 58 Varadamsetty, G., 256 Varma, A., 76, 77 Varshavsky, A., 450 Vasconcelos, A. T., 4 Vashee, S., 250, 300, 303, 329, 428, 438 Vasser, M., 47 Vazquez, A., 84 Vecenie, C., 28 Vega, M., 59 Venkataramaian, N., 281 Venter, J. C., 212, 221, 278, 284, 288, 299, 300, 329, 338, 351, 353, 355, 357, 358, 428, 429, 438, 444 Vidal, M., 330 Villadsen, J., 69 Villalobos, A., 24, 43, 45, 48, 51, 52, 54, 174, 291, 298, 299 Vimberg, V., 46, 47
Virnau, P., 4 Vitkup, D., 69, 78 Vogt, J., 467 Voigt, C. A., 44, 47, 138, 174, 312, 374, 405 Volkert, T. L., 254, 450 Vollrath, D., 432 von Groll, U., 51, 264 von Mering, C., 85 Vonner, A. J., 418 Vorholt, J. A., 69 Vo, T. D., 70, 80 W Wade, J. T., 454, 464 Waghray, S., 47 Wagner, R., 51, 247, 251, 255–256, 264–265 Wagner, S., 58 Wahl, A., 80 Walcheck, B., 405 Walchli, J., 293 Waldmann, T., 253 Waldman, Y. Y., 47 Wallace, E., 293 Walter, K. A., 57 Wang, H. H., 409–412, 418–419 Wang, H. K., 287 Wang, S., 410 Wang, W., 297 Wang, X., 23 Wanner, B. L., 328, 410 Ward, T., 280 Warming, S., 410 Wasserman, W. W., 13 Watanabe, S., 299 Watanabe, T., 299 Waterman, M. R., 59 Waters, L. S., 410 Watt, R. M., 411 Way, J. C., 138, 312 Weber, H., 249, 278, 282, 350 Weeding, E. M., 139, 140, 174 Wegmann, S., 411 Weinzierl, R. O., 270 Weiss, B., 350 Weissig, H., 9 Weiss, R., 23, 98, 138, 174 Weiss, S. R., 350–351 Welch, M., 24, 43, 45, 48, 51, 52, 54, 250, 278, 285, 294, 298, 299, 351 Weng, J. M., 283 Werck-Reichhart, D., 59 Westbrook, J., 9 West, S. M., 4 Wheelan, S., 292 Whitehorn, E. A., 404 White, O., 70
Whitney, S. E., 254 Wiechert, W., 69, 80 Wiersma, A., 80 Wiesler, S., 270 Wilcox, K. W., 350 Wild, J., 264 Willer, D. O., 338 Wilson, C. J., 254, 450 Wilson, C. R., 59 Wilson, D. B., 256 Wilson, D. S., 295 Wilson, M. L., 173, 175, 183, 258 Wimmer, E., 250, 260, 278, 299 Wingreen, N. S., 4 Winkler, M., 328 Wintermute, E. H., 312 Wintzerith, M., 282 Wipat, A., 258 Withers, H. L., 335 Wittmann, C., 69 Wolfe, K. H., 50 Wolf, H., 264 Wolf, J. B., 291 Woltering, J. M., 59 Wong, W. W., 138 Woo, H. M., 78, 80 Wright, F., 50 Wu, G. C., 291, 299, 312, 364–366, 374 Wu, X. S., 411 Wyrick, J. J., 254, 450 Wyrzykiewicz, T. K., 280 X Xia, B., 97 Xia, T., 28 Xia, Y., 281, 293 Xiang, Q., 281, 293 Xiao, N., 84, 86 Xin, C., 73, 76 Xin, L., 411 Xiong, A.-S., 209, 214, 253, 282, 285–290, 294, 297, 351 Xu, G., 410–412, 418–419 Xu, L., 138, 327 Xu, Y., 4 Xu, Z., 82 Y Yamada, C. M., 280 Yamada, Y., 50 Yanagawa, H., 437, 439, 444 Yang, A., 450 Yang, F. Y., 84, 208, 282, 283 Yang, J.-P., 327 Yang, Q., 70 Yang, W. J., 208, 282, 283
Yang, Y. H., 464 Yang, Z., 50 Yansura, D. G., 350 Yaoi, K., 404 Yao, Q.-H., 209, 214, 253, 282, 285–290, 294, 297, 351 Yao, S., 69 Yao, X. D., 338 Yarrington, R. M., 292 Ye, H., 254, 292 Yehezkel, T., 207 Yeh, I., 73, 76 Ying, J. Y., 254, 292 Yin, W. X., 411 Yooseph, S., 444 Yo, P., 287–289, 294, 328 You, L., 138 You, Q., 281, 293 Young, E., 299 Young, J. W., 139 Young, L., 249, 253, 258, 265, 278, 286, 289, 294, 297, 300, 329, 338, 351, 353, 355, 357, 428, 429 Yount, B., 350–351 Yu, D., 410–411, 420 Yu, P., 281, 293 Yu, S., 82 Yu, T., 208, 215, 258, 295, 296 Yue, Y., 281 Yugi, K., 86
Z Zafar, N., 70 Zamboni, N., 68, 69 Zamora-Romo, E., 47 Zaveri, J., 253, 258, 265, 278, 300, 329, 351, 353, 428 Zeitlinger, J., 254, 450 Zeng, H., 260 Zenke, W. M., 282 Zerbe, O., 256 Zhang, H., 281 Zhang, K., 138 Zhang, S. L., 208, 215, 286, 288, 289, 295, 296 Zhang, X., 281, 293 Zhang, Y., 328, 410–411 Zhang, Z., 287 Zhao, H., 79, 80, 258, 401 Zhao, J., 69 Zhao, X., 292 Zhao, Y., 410 Zhou, X., 208, 209, 212, 214, 215, 281, 292–295, 410, 424 Zhuang, J., 253, 282, 351 Zhu, B., 338 Zhu, J., 410 Zilberstein, A., 284 Ziman, M., 281 Znosko, B., 28 Zuker, M., 28, 31, 413, 417
Subject Index
A Amino acid element properties, 61 AssemblyManager tool architecture, 370–371 assembly trees generation, 366 BglBrick-based 2ab assembly, 370 defined, 365 dynamic programming-based algorithm, 369–370 robot commands, 369 Autonomously replicating sequence (ARS), 329 B Bacillus GenoMe (BGM) domino clones, 437 large DNA retrieval by copying, 436 direct isolation, 435–436 in vivo direct isolation, 436 strains, 437 Bacillus subtilis BGM vectors, 430 Bsu168 agarose block, 432 cell preparation, 430–431 isolation liquid and agarose block, 431–432 standard transformation, 431 enzyme reaction in gel blocks, 432 Bacterial translation and RBSs genetic part, 25 molecular interactions, 25–27 rate-limiting step, 24–25 BASE. See Bioinformatics and systems engineering BGM. See Bacillus GenoMe BioBrick Assembly Kit, 318 BioBrick standard biological parts, 3A assembly chemically competent cells efficiency, 323 protocols, 324–325 column cleanup and agarose gel purification, 316 construction forward and reverse amplification primers, 318 PCR amplification, 316–317 prefix and suffix sequences, 316
primer sequences, 316–317 restriction enzymes, 318, 319 definition, 312 destination plasmid, 314 vectors, 314, 323 EcoRI and XbaI restriction enzyme, 313 enzymes and reagents, 318 iterative and pairwise hierarchical assembly, 312–313 ligation products, 315–316 reaction, 319 PCR, linearized destination vector duration and automation, 322–323 failure modes, 322 positive and negative selection, 321–322 possible ligation products, 315 verification tests, 322 restriction enzymes, 318–319 transformation reaction, 320 verification clones, described, 321 growth disadvantage, 320 BioCAD tools, 371 Bioinformatics and systems engineering (BASE), 200–201 Biomek Software, 372–373 Biophysical model and optimize method, RBSs accuracy and limitations, 37–38 ribosome assembly, free energy model, 28–33 thermodynamics, 27–28 translation initiation, statistical thermodynamic model Gibbs free energy, 34 individual assembly reactions, 35 protein expression level, 35–36 ribosome and mRNA transcripts, 33 C Capillary electrophoresis fragment analysis described, 224, 226 elongation reaction, 209, 225 FAM and HEX fluorescence, 244 GFP DNA construction, 227–228 gene construction, 226 reconstruction, 229–230
lambda exonuclease activity, 209, 225 natural and synthetic fragments, 231 P53 library reconstruction, 239–241 variant regions, 242 variants, 236–238 RT-PCR amplification and melt curve, 243 Tachylectinll molecule, 232–235 Chemically competent cells efficiency, 323 protocols, 324–325 ChIP. See Chromatin immunoprecipitation ChIP–chip DNA preparation, microarray blunting, 459–460 ligating, 460–461 *linker, 462 LM-PCR, 461–462 Chromatin immunoprecipitation (ChIP) chemical genomics ChIP–chip data comparison, 467, 468 Rho-dependent terminators, 468–469 RNAP binding, DNA promoter regions, 467, 468 sigma 70 and b subunits averaged occupancy profiles, 467, 468 CHIP–CHIP protocol analysis by qPCR, 459 background signal, protein complex, 466–467 cross-linked DNA isolation, 456 data analysis, 464–466 DNA preparation (see ChIP–chip DNA preparation, microarray) genome-wide location analysis, 463–464 harvesting cells, 454–456 immunoprecipitation, 457 preparing and handling sepharose bead 50:50 slurry, 457 solutions and reagents, 458 formaldehyde effects cross-linked proteins, DNA, 454, 455 cross-linker, 450 E. coli frm expression, 451, 452 growth curve comparison, 451 indirect cellular interactions, 454 testing duration, E. coli, 452, 453 Clotho Software, 371–372 Clotho v2.0, synthetic biological systems App types, 106 article organization, 107 biological objects, 100 current status, 99 data model authoring objects, 104–105 composition objects, 101–104 experimental data objects, 105–106 literature objects, 105
physical instantiation objects, 104 description, 99 developers App writing, 109–111 Windows version, 107–109 resources, 107 space constraints, 100 testing and debugging, 101 users adding notes and “factoids”, 125–128 Apps managing, 112–114 biosafety check, 133–135 common errors, 132–133 database connection, 131–132 DNA sequences, 124–125 new feature creation, 116–118 new institution and lab, 114–116 new part creation, 119–122 new plasmid creation, 122–124 new vector creation, 122 PCR products, 131 remarks, 112 right-click menu, 128–131 Codon optimization Monte Carlo approach, 298 protein heterologous expression, 297–298 RNA structure, 298–299 synthetic genes, 299 Constraints-based flux analysis constraints, 72 genome-scale reconstruction, 71 in silico algorithms, 70–71 metobolic reactions, 72–73 objective function, 72 optimal solution, 71–72 stoichiometric coefficients, 73 D Data model, Clotho Software authoring objects, 104–105 composition objects, 101–104 experimental data objects, 105–106 literature objects, 105 physical instantiation objects, 104 Data structure, eugene syntax classes, 166 global and custom classes, 166, 167 Polish and Postfix notations, 167 property, part and device relationships, 166, 168 rule relationship, 169 Divide and conquer (D&C) algorithm DNA combinatorial library synthesis computation, clone requirements, 223 goal and description, 222 minimal cut, 222–223 quasi-equilibrium process, 223–224, 244
DNA molecule synthesis dynamic programming and branch-and-bound approaches, 221 error correction, 221–222 goal and description, 219 input and output, 219 pseudo-code, 220 recursive cost function, 220–221 DNA-binding specificity prediction, FoldX Arg243, 11 binding capabilities, interaction energy, 12 crystal structures, 10–11 de novo design, 5 designing alanine mutation, 14 in silico, 13 original wild-type structure, 14 sequence variability, 14 wild-type structure, 14 energy terms and base mutation base pairs detection, 6–8 degrees of freedom, 8–9 distance constraints, base pairs, 8 force field, 6 parameterization, 6 stacking energy, 8 in silico methods, 9 intraclashes energy, 12 known caveats clashes, 15 flexibility and base independence, 15 resolution, 15 water molecules, 15 partition function, 9 PDB structures, 9–10 procedures, 9–10 protein–DNA interfaces, 4 synthetic biology definition, 4–5 Waals intramolecular clashes, 9 DNA purifications, 218–219 Domino method antibiotic markers, 433 chloroplast genome application, 434 domino procedures, 433 inchworm elongation, 434–435 pBR322 sequence, 434 E Energy terms and base mutation, DNA adenine atom, 6–7 base pairs detection design tool, 7 distance constraint method, 7–8 hydrogen bonds, 7 nucleotides, 6–7 degrees of freedom, 8–9 distance constraints, base pairs, 8
FoldX, 5 force field, 6 parameterization, 6 stacking energy, 8 Eugene language description, 154 elements asserting and noting rules, 161–162 comments, 156 conditional statements, 163–164 devices, 159–160 header files, 164 image bindings, 165 parts, 158–159 permute function, 162–163 primitives, 156–157 properties, 157–158 relationship, 155–156 rules, 160–161 ‘if statement’, 170 implementation ANTLR, 165–166 data structure, 166–169 header file creation, 165 main file, 165 installation and use command line, 155 JDK, 154 syntax highlighting, 155 “print” function, 169–170 valid architecture, 154 “XOR” function, 171 Evolutionary robustness, RBSs, 23 F FBA. See Flux balance analysis Fluorescent protein expression levels considerations, 39–40 description, 38 protocol, 38–39 Flux balance analysis (FBA) cellular growth rate, 76 constraints-based flux analyses, 75 gene knockout approaches, 78 identify gene targets, 78 in silico metabolic model, 75–76 metabolic network, 76 objective functions, 76–77 signal transduction mechanisms, 86 G Gene design and protein expression biological components, 44 design software, 45 encoding protein, 45 fundamental genetic elements, 61 protein-specific factors
cis-regulatory regions, 59–61 proline, 57 properties structure, 56 secreted and membrane proteins, 57 structure, 56 toxicity, 57–58 transmembrane proteins, 58–59 sequence governing translation, 44 sequence parameters codon bias, 49–56 initiation of translation, 45–49 mRNA and translational elongation, 56 translation control, 44–45 Gene Designer backtranslation module, 45 steps, 53 codon table, 54 genetic algorithm, 55 homologous DNA sequence, 59 phage gIII signal sequence, 57–58 sliding window technique, 48 Gene Design Software Assembly PCR Oligo Maker, 292 description, 45 drag-and-drop genetic elements, 292 GeneComposer, 293 GeneFab software, 301 Gene fusion, 405 GeneOptimizer® construction, 266 project design, 264–265 sequence design, 265 Gene synthesis applications artificial genes, operons, and genomes, 252–253 availability and safety, 250 codon optimization (see Codon optimization) cost, capacity and speed, 252 expression efficiency, 251 origin and reliability, 250–251 protein performance, 251–252 synthetic biology (see Synthetic biology) 77-bp gene encoding, 249 de novo, 249–250 design software GeneDesigner, 291 overlapping oligonucleotides, 291–292 PCR assembly reactions, 292 Stemmer method, 292–293 synthesis design module, 292 TECAN Evo liquid handling robotic workstations, 293 TmPrime, 292 GeneOptimizer®, 264–266
GMOs, 249 industrial based biosafety/biosecurity, 260–262 high-throughput sequencing, 258–259 optimization rational, 262 optimizer software, 262–264 process features, 259–260 invention, PCR, 249 lab-scale vs. industry scale, 272 ligation-mediated assembly advantages, PCR-mediated methods, 283 commercial gene synthesis, 284 DNA sequences, 284 oligonucleotide synthesis techniques, 282 stepwise synthesis approach, 282–283 thermostable ligases, 282 LIMS assembly, 269–270 expansion, process, 267 oligonucleotide, production, 268 order entry, 267–268 order, processing, 268 process control, 266–267 steering, process, 266 subfragment, production, 269 workflow, 270–271 manipulation, living organism, 248–249 Mycoplasma mycoides JCVI-syn1.0, 250 oligonucleotide synthesis microchip-based, 280–282 solid phase phosphoramidite synthesis, 278–280 PCR-mediated assembly beta-lactamase gene fragment, 284–285 DNA duplex, 291 IPS method, 289–290 liquid handling robot, 289 oligonucleotide primers, 290 one-pot method, 291 outermost primers, 286 overlap extension process, 285, 286 overlapping sense and antisense primers, 287 PCA reaction, 284–285 primer pairs, 288 single enzymatic reaction, 284 Stemmer method, 285 successive extension, 288 TBIO assembly, 287 PFA GeneFab software, 301 Mermade 192 DNA synthesizer, 302–303 oligonucleotide, 302 PCRs, 302 primary fragments, 300–301 single chain antibodies, 301 TECAN robot, 300, 302 state-of-the-art assembly steps, 254
balanced ratio, oligo, 254 phosphoramidite four-step process, 253, 254 synthesis fidelity/error correction, 293–297 and synthetic biology information, 255 modularity, 255–256 standardization, 256–258 technological developments, 258 Genetically modified organisms (GMOs), 249 Genetic circuits, 23 GenoCAD Application for Account page, 176, 177 BLAST, 179 default navigational tab, 176 design sequences, 183–184 strategy, 175 evolutions, 186–187 installation connection settings, 186 instance’s URL, 186 mysql database, 186 php.ini file, 185 restarting, 186 server’s webroot directory, 185 source code, 185 virtual host, zend, 185–186 library/category name, 177–178 My Cart new library, 180 temporary repository, 179 My Libraries Management Console, 181–183 New Library, 181 My Parts tab, 183 parts catalog grammars, 177 library/category name, 177–178 part’s Part ID, 179 public libraries, 176 promoter, 179 quick search, 178–179 registration process, 176 shopping cart, 179 synthetic biological systems, 174 tinkercell/gene designer, 174 Genome-wide location analysis data comparison, 463 polyclonal and monoclonal antibody, 463 RNAP histogram and log2 ratios, 464, 465 GFP. See Green fluorescent protein GMOs. See Genetically modified organisms Green fluorescent protein (GFP) capillary electrophoresis fragment analysis construction, 226–228 reconstruction, 229–230 expression, 142
H Homologous recombination, yeast-based description, 329 DNA assembly using fragments with identical ends, 330–332 without identical ends, 332–334 DNA fragments and size, 328 plasmid conversion cassette, 336–337 positive clone screening, 334–335 yeast–E.coli transfer, 329–330, 335 I In vitro recombineering DNA assembly advantages, 339 cloning efficiency, 337–338 linearized vector, 340 materials, 339 reaction, 339 stitching oligonucleotides, 338 thermostable polymerase, 338 topology effect, 341 vaccinia polymerase, 338 site directed mutagenesis genetic tools, 341 kits, 341 materials, 342–344 PCR, 341–342 primer specifications, 342 protocol, 344–345 pUC19-derivative plasmid templates, 343, 345 single DNA molecule, 342 J Java development kit (JDK), 154 JDK. See Java development kit K Kinetic trap, 21, 37 L Large DNA construction assembly multiple fragments design, 439–440 molar concentration adjustment, 441–442 OGAB assembly, 438–439, 442 PCR product cloning, E.coli plasmid, 440 plasmid extraction, Bsu168, 442–443 preparation, OGAB method, 440–441 trouble shooting, 443 bacterial artificial chromosome, 429 BGM vectors and integration, single domino, 429
B. subtilis (see Bacillus subtilis) cloning, 428 vs. small fragments assembly, BGM Domino method, 433–435 retrieval, 435–436 trouble shooting, 437–438 Ligation independent cloning (LIC), 350 Ligation mediated-PCR (LM-PCR) first, 461 second, 461–462 M Megaprimer PCR of whole plasmid (MEGAWHOP) cloning advantages, 401 applications creation, mutant libraries, 403 domain-targeted mutagenesis, 405 gene fusion, 405 site-directed mutagenesis, multiple sites, 404 whole gene random mutagenesis, 404 classical method, 400 description, 400–401 DNA fragment preparation, 401 fragment and vector, 400 protocol agarose gel electrophoresis, 401–402 PCR-based procedure, 401–402 technical considerations DpnI digestion, 403 KOD-plus-Neo DNA polymerase, 403 product yield, 402 Metabolic flux analysis (MFA) 13 C-based flux, 68–69 constraints-based flux structure genome-scale reconstruction, 71 in silico algorithms, 70–71 metabolic reactions, 72–73 objective function, 72 optimal solution, 71–72 stoichiometric coefficients, 73 gene targets identifying flux balance analysis, 75–77 foreign genes insertion, 82 gene knockout, 77–79 in silico algorithms, 73, 75 in silico model, 83–85 metabolite essentiality, 82–83 status investigation, 73 up/down regulation, 79–82 metabolic network, 73, 74 mutualisms, 85 systems engineering, 68 MFA. See Metabolic flux analysis Monte Carlo approach, 298
O OGAB. See Ordered gene assembly in Bsu168 OligoDesigner tool design and fabrication, 369 parameters, 369 sequence adjustments, 374 Oligonucleotide synthesis microchip-based devices, 281 gene synthesis, 281–282 light-mediated reactions, 281 microarray synthesis methods, 280–281 solid phase phosphoramidite synthesis DMT protecting group, 280 enzymatic assembly, 278–280 microfluidic reaction device, 280 novel detritylation process, 280 phosphoramidites, 278–279 steps, 279 One-step ISO assembly dsDNA dNTPs, 357 Phusion® pol, 357–358 ssDNA error-free molecules, 358–359 molecules, 359 pUC19 cloning vector, 358–359 Open translational research (OTR) model, 190–192 Optimal metabolic network identification (OMNI), 83 Ordered gene assembly in Bsu168 (OGAB) fragment preparation, 440–441 gene assembly, 444 polycistronic operon form, 443–444 Orphaned Parts library, 183 OTR model. See Open translational research model Overlapping DNA fragments enzymes, 351 genetic elements, 351 in vitro recombination dsDNA PCR, 352–353 phenol–chloroform–isoamyl alcohol (PCI) extraction, 353 vector primer design, 352 LIC and SLIC, 350 one-step ISO assembly dsDNA, 357–358 ssDNA, 358–360 PCR primers, 350 SV40 hybrid DNA molecule, 350 thermocycled assembly one-step, 355–356 two-step, 353–355 type II restriction enzymes, 350 Overlapping start codons (OLS), 21
P PCR. See Polymerase chain reaction PFA. See Protein fabrication automation Polymerase chain assembly (PCA) reaction, 284–285 Polymerase chain reaction (PCR). See also Megaprimer PCR of whole plasmid (MEGAWHOP) cloning amplicons, 351 amplification, 217, 316, 342, 352 capillary electrophoresis fragment analysis construction, P53 library, 236–238 GFP construction and reconstruction, 227–230 natural and synthetic fragments, 231 P53 library reconstruction, 239–241 Tachylectinll construction, 233–234 cycling assembly, 351 destination vector, 322 diagnostic primers, 334, 341 DNA targets, 332–333 downstream applications, 355 end homology, 330 fragments, amplified, 336 full-length destination vector, 323 genomic DNA, 350 linearized destination vector duration and automation, 322–323 failure modes, 322 positive and negative selection, 321–322 possible ligation products, 315 verification tests, 322 mediated assembly antisense primers and overlapping sense, 287 beta-lactamase gene fragment, 284–285 DNA duplex, 291 IPS method, 289–290 liquid handling robot, 289 oligonucleotide primers, 290 one-pot method, 291 outermost primers, 286 overlap extension process, 285, 286 PCA reaction, 284–285 primer pairs, 288 single enzymatic reaction, 284 Stemmer method, 285 successive extension, 288 TBIO assembly, 287 primers phosphorylation, 216 recursive DNA construction, 210, 212 pUC19, 359 RT curves, 243 Protein expression and sequence parameters codon bias adaptation, 56 analysis host sequences, 55
backtranslation, 54–55 data-driven tables, 55 DNA2.0’s Web service, 52–53 gene code, 55–56 gene designer performance, 52 genetic algorithm, 55 host codon approximation, 50–52 library explorer, 53 managing unwanted sequences, 54 N-terminal fusions impact, 50 optimal E. coli, 52 synonymous codons, 49–50 table, 53 usages, 50 mRNA and translational elongation, 56 translation initiation avoid mRNA structure, 48 gene optimization, 47 mRNA structures, 47 NGG codons, 47–48 N-terminal tags, 48–49 RBS sequences changes, 46–47 Shine–Dalgarno (SD) sequence, 45 Protein fabrication automation (PFA) GeneFab software, 301 Mermade 192 DNA synthesizer, 302–303 oligonucleotide, 302 PCRs, 302 primary fragments, 300–301 single chain antibodies, 301 TECAN robot, 300, 302 Protein-specific factors cis-regulatory regions, 59–61 proline, 57 properties structure, 56 secreted and membrane proteins, 57 structure, 56 toxicity, 57–58 transmembrane proteins, 58–59 Q Quantitative polymerase chain reaction (qPCR), 459 R Rational genome design BIO bricks, DEM, 194 CAD bricks, CEM model cross-species, 195–196 design process, 196–197 genetic background, 194 genome design, 195 program modules, 196 Semantic Web, 196 de novo synthesis, 193 Real-time polymerase chain reaction (RT-PCR), 243, 249
Recursive construction and error correction, DNA molecules algorithms PCR primers, 210, 212 protocol and cost accounts, 211–212 robot control program, 211 biochemistry divide and conquer procedure, 209–211 ssDNA vs. dsDNA, 211 capillary electrophoresis fragment analysis, 224–244 error rate statistics, 212, 213 libraries applications, 216 economization, 216 minimal cut analysis, error-prone clones, 213–214 naïve approaches, 214 predictions, error-free components, 214 protocols challenges, 208–209 DNA purifications, 218–219 lambda exonuclease digestion, 217–218 overlap extension elongation, ssDNA fragments, 217 PCR, 216–217 short oligonucleotides, 209 special mismatch-binding proteins, 214 two-step assembly process, 209 construction strategy, 214, 215 use, short oligos, 212–213 Retrieval, large DNA by copying, 436 direct isolation in vivo, 436 in test tube, 435–436 Ribosome assembly, free energy model considerations, 32–33 final state calculation, 31–32 Gibbs free energy, 28–29 initial state calculation, 31 16S rRNA-binding site, 29 statistical thermodynamic model, 33–36 synthetic ribosome binding sites, 30 translation initiation rates, 36 Ribosome binding site (RBS) calculator applications evolutionary robustness, 23 genetic circuits, 23 manipulation, 22 optimize metabolic pathways, 22–23 predicting translation rates, 24 bacterial translation genetic part, 25 molecular interactions, 25–27 rate-limiting step, 24–25 biophysical model and optimize method accuracy and limitations, 37–38
free energy model, 28–33 statistical model, 33–36 synthetic RBSs, 36–37 thermodynamics, 27–28 considerations, 21–22 control translation and protein expression, 20 evolutionary robustness, 23 fluorescent protein expression levels considerations, 39–40 protocol, 38–39 genetic circuits, 23 inputs, outputs and usage kinetic trap (K), 21 model assumption, 20 mRNA, 20 overlapping start codons (OLS), 21 short protein CDS, 21 sequence, 20 thermodynamic model, 20 translation initiation rates, 24 RNA interactions, 27–28 Robotics 2ab assembly, 373 Biomek Software, 372 functions, 372–373 protocols described, 372 RT-PCR. See Real-time polymerase chain reaction S Sequence and ligation-independent cloning (SLIC), 324, 328, 350 Standard biological parts automated assembly 2ab reaction destination plate, 377 disadvantages, 377–378 protocol, 378, 382 sample robot file, 378–382 AssemblyManager tool architecture, 370–371 BglBrick-based 2ab assembly, 370 dynamic programming-based algorithm, 369–370 robot commands, 369 BglBrick standard and 2ab assembly assembly vector pairs, 366 BglII and BamHI, 365–366 DNA fabrication strategy, 365, 367 elements necessary, 369 Escherichia coli, 367 in vitro, 367–368 BioCAD tools, 371 Clotho Software, 371–372 competent cells methylation strains, 392 protocols, 392–393 design and construction, 373–374
high-throughput mini-preps automated assembly, 374 BioBrick-based assembly, 375 magnetic bead-based chemistries, 374–375 protocols, 376–377 high-throughput plating LB/agar strips, 386 liquid cultures identification, 388 protocol, 390 robot program, 386–387 screen transformants, 386 materials, 373 methylated plasmid DNA, 374 OligoDesigner tool design and fabrication, 369 parameters, 369 sequence adjustments, 374 robotics, 372–373 transformants screening clones identification, 391 protocols, 392 transformation ligations, 383 minimal setup plates, 383 protocol, 385–386 robot program, 383–384 troubleshooting different size colonies, 394–395 lawns/lawny areas, 395–396 no colonies, 393–394 pick cleanly, colonies, 395 streaky colonies, 395 strips, 396 Stemmer method, 285 SynBioSS. See Synthetic biology software suite SynBioSS Designer BioBrick, 140–141 biological “parts”, 140 consecutive screenshots, 143, 144 gene network models, 139–140 SynBioSS Desktop simulator installation steps, 145 Open Licenses, 141 screenshot, 144–145 SynBioSS Wiki biochemical reaction data, 141 components, 141 Synthetic biology applications, 299 assumptions, system, 138–139 bio-based green innovations, 202 biological components, 299 biosynthetic pathways and novel operons, 300 contributors incentives, 201 definition, 328 description, 137–138
DNA sequence, 138 synthesis, 299 Freemium model, platform experiments cost, 192 premium services, 192–193 genetically modified organisms CAD bricks, 199 DNA fragment, 198–199 GenoCon infrastructure BASE, 200–201 RIKEN SciNetS system, 201 Semantic Web database, 200 in vitro recombineering DNA assembly, 337–341 site directed mutagenesis, 341–345 novel techniques, 300 oligonucleotides, 299 optimization and OTR model description, 190 DNA sequence design, 191 patents research Japanese Patent Law, 198 statutory exemptions, CEM, 197 rational genome design BIO bricks, 194 CAD bricks, 194–197 de novo synthesis, 193 safety guidelines, phases design, 199–200 DNA synthesis, 200 transgenic, 200 software packages, 138 yeast-based homologous recombination description, 329 DNA assembly using fragments, 330–334 DNA fragments and size, 328 yeast–E.coli transfer, 326–330, 335 yeast plasmid conversion cassette, 336 Synthetic biology software suite (SynBioSS) advantages, 148–150 components Designer, 139–141 Desktop Simulator, 141 Wiki, 141 disadvantages, 147–148 logic AND gate simulations consecutive screenshots, SynBioSS Designer, 143, 144 GFP expression, 142, 143 leakiness reaction, 147 model vs. experimental results, TLT, 146 operator sites, 142 promoters, 145 screenshot, 144, 145 steps, 142–143
T
Thermocycled assembly one-step DNA assembly method, 355 antibody-bound Taq pol, 355 DNA molecules, 355–356 overlapping DNA segments, 356 two-step adjacent DNA fragments, 353–354 in vitro recombination method, 353 noncomplementary sequences, 353 PCR, 355 Taq pol, 353 Taq repair buffer (TRB), 354 Thermodynamically balanced, inside-out (TBIO) assembly, 287 Trouble shooting Bsu168 cells and DNA, 437 gaps, 437 insert size limit in BGM, 438 OGAB, fragment isolation, 443 repetitive sequences, 437 restriction modification, 437 stability, cloned guest genome, 438 U Users, Clotho v2.0 adding notes and “factoids” biological primitives, 125 MediaWiki syntax, 127 objects, 125 PMID references, 127 pointer information, 126 procedure, 125 WikiText objects, 128 Apps managing capabilities, 112–113 collection viewer, 113–114
biosafety check components, 134–135 NIH guidelines, 133 protocol, 133–134 RG2 plasmid, 134 common errors, 132–133 database connection “Power User” version, 131 procedure, 131–132 DNA sequences, 124–125 new feature creation collection view app, 117 defined, 116 institution, lab and person editors, 116–117 procedure, 117–118 new institution and lab, 114–116 new part creation basic and composite biological parts, 119 objects, 119–121 spreadit parts, 121–122 new plasmid creation, 122–124 new vector creation, 122 PCR products, 131 remarks, 112 right-click menu choose viewer, 130–131 copy to clipboard, 129 delete, 128 export XML and search tags, 130 paste from clipboard, 129 revert, undo and redo, 128 save to database, 128 update, 128 W Waals intramolecular clashes, 9 X XOR function, 171