About this PDF file: This new digital representation of the original work has been recomposed from XML files created from the original paper book, not from the original typesetting files. Page breaks are true to the original; line lengths, word breaks, heading styles, and other typesetting-specific formatting, however, cannot be retained, and some typographic errors may have been accidentally inserted. Please use the print version of this publication as the authoritative version for attribution.
COLLOQUIUM ON COMPUTATIONAL BIOMOLECULAR SCIENCE
NATIONAL ACADEMY OF SCIENCES WASHINGTON, D.C.
1998
NATIONAL ACADEMY OF SCIENCES
Colloquium Series
In 1991, the National Academy of Sciences inaugurated a series of scientific colloquia, five or six of which are scheduled each year under the guidance of the NAS Council’s Committee on Scientific Programs. Each colloquium addresses a scientific topic of broad and topical interest, cutting across two or more of the traditional disciplines. Typically two days long, colloquia are international in scope and bring together leading scientists in the field. Papers from colloquia are published in
COMPLETED NAS COLLOQUIA
(1991 TO PRESENT)
Industrial Ecology
May 20–21, 1991; Washington, D.C.
Organizer: C. Kumar N. Patel
Proceedings: February 4, 1992

Images of Science: Science of Images
January 13–14, 1992; Washington, D.C.
Organizer: Albert Crewe
Proceedings: November 3, 1993

Physical Cosmology
March 27–29, 1992; Irvine, California
Organizer: David Schramm
Proceedings: June 3, 1993

Molecular Recognition
September 10–11, 1992; Washington, D.C.
Organizer: Ronald Breslow
Proceedings: February 16, 1993

Human-Machine Communication by Voice
February 8–9, 1993; Irvine, California
Organizer: Lawrence Rabiner
Proceedings: October 24, 1995

Changing Human Ecology and Behavior: Effects on Infectious Diseases
September 27–28, 1993; Washington, D.C.
Organizer: Bernard Roizman
Proceedings: March 29, 1994

The Tempo and Mode of Evolution
January 27–29, 1994; Irvine, California
Organizers: Francisco Ayala, Walter Fitch
Proceedings: July 19, 1994

Chemical Ecology: The Chemistry of Biotic Interaction
March 25–26, 1994; Washington, D.C.
Organizers: Thomas Eisner, Jerrold Meinwald
Proceedings: January 3, 1995

Physics: The Opening to Complexity
June 25–27, 1994; Irvine, California
Organizer: Philip Anderson
Proceedings: July 18, 1995

Self Defense by Plants: Induction and Signaling Pathways
September 15–17, 1994; Irvine, California
Organizers: André Jagendorf, Clarence Ryan
Proceedings: May 9, 1995

Earthquake Prediction
February 10–11, 1995; Irvine, California
Organizer: Leon Knopoff
Proceedings: April 30, 1996

Quasars and Active Galaxies: High Resolution Radio Imaging
March 24–25, 1995; Irvine, California
Organizers: Marshall Cohen, Kenneth Kellerman
Proceedings: December 5, 1995

Vision: From Photon to Perception
May 21–22, 1995; Irvine, California
Organizers: John Dowling, Lubert Stryer, and Torsten Wiesel
Proceedings: January 23, 1996

Science, Technology, and the Economy
October 20–22, 1995; Irvine, California
Organizers: James Heckman, Ariel Pakes, and Kenneth Sokoloff
Proceedings: November 12, 1996

Developmental Biology of Transcription Control
October 25–28, 1995; Irvine, California
Organizers: Roy Britten, Eric Davidson, and Gary Felsenfeld
Proceedings: September 3, 1996

Carbon Dioxide and Climate Change
November 13–15, 1995; Irvine, California
Organizer: Charles Keeling
Proceedings: August 5, 1997

Memory: Recording Experience in Cells and Circuits
February 17–20, 1996; Irvine, California
Organizer: Patricia Goldman-Rakic
Proceedings: November 26, 1996
Elliptic Curves and Modular Forms
March 15–17, 1996; Washington, D.C.
Organizers: Barry Mazur, Karl Rubin
Proceedings: October 14, 1997

Symmetries Throughout the Sciences
May 10–12, 1996; Irvine, California
Organizer: Ernest Henley
Proceedings: December 15, 1996

Genetic Engineering of Viruses and Viral Vectors
June 9–11, 1996; Irvine, California
Organizers: Peter Palese, Bernard Roizman
Proceedings: October 15, 1996

Genetics and the Origin of Species
January 30–February 1, 1997; Irvine, California
Organizers: Francisco Ayala, Walter Fitch
Proceedings: July 22, 1997

The Age of the Universe: Dark Matter and Structure Formation
March 21–23, 1997; Irvine, California
Organizers: David Schramm, P. J. E. Peebles
Proceedings: January 6, 1998

Neuroimaging and Human Brain Function
May 29–31, 1997; Irvine, California
Organizers: Michael Posner, Marcus Raichle
Proceedings: February 3, 1998

Protecting Our Food Supply: The Value of Plant Genome Initiatives
June 2–4, 1997; Irvine, California
Organizers: Michael Freeling, Ronald Phillips, John Axtell
Proceedings: March 5, 1998

Computational Biomolecular Science
September 11–14, 1997; Irvine, California
Organizers: Peter G. Wolynes, Russell Doolittle, J. A. McCammon
Proceedings: May 26, 1998

A Library Approach to Chemistry
October 19–21, 1997; Irvine, California
Organizers: Peter Schultz, Jonathan Ellman
PROGRAM
Computational Biomolecular Science
Thursday, September 11, 1997
Registration and Welcome Reception

Friday, September 12, 1997
Session I, 8:45 AM–12:30 PM
Introduction, Peter Wolynes.
Measuring genome evolution. Peer Bork (EMBL, Heidelberg).
Determining biological function from sequence: Building highly specific sequence motifs for genome analysis. Douglas Brutlag (Stanford).
Experimental studies of protein folding dynamics. William Eaton (NIH).
Coupling the folding of homologous proteins. Ron Elber (Hebrew University).

Session II, 2:00 PM–5:30 PM
Photoactive yellow protein: Prototype for the PAS domains of sensors and clocks. Elizabeth Getzoff (Scripps Research Institute).
Inhomogeneities in genomic sequence composition. Philip Green (Univ. Washington).
New refinement methods for NOE-distance based NMR structure. Angela Gronenborn (NIH).
Estimation of evolutionary distances between DNA sequences. Wen-Hsiung Li (Univ. Texas, Houston).
Comments by Roy Britten

After-dinner Lecture. From slide rule to super computer. Hans Frauenfelder (Los Alamos).

Saturday, September 13, 1997
Session III, 9:00 AM–12:30 PM
Comparing sequence comparison with structure comparison. Michael Levitt (Stanford).
Structural classification of proteins and its evolutionary implications. Alexey Murzin (MRC, Cambridge).
Exploring the protein folding funnel landscape-connection to fast folding experiments. Jose Onuchic (UCSD).
Bridged bimetallic enzymes: A challenge for computational chemistry. Gregory Petsko (Brandeis).

Session IV, 2:00 PM–5:30 PM
Sequence determinants of protein folding and stability. Robert Sauer (MIT).
The evolution of efficient light harvesting in photosynthesis-one goal, many solutions. Klaus Schulten (Illinois).
Electrostatic steering and ionic tethering in simulations of protein-ligand interactions. Rebecca Wade (EMBL, Heidelberg).
Computer simulation of enzymatic reactions and other biological processes; finding out what was optimized by evolution. Arieh Warshel (USC).

After-dinner Lecture. Applications of computers in structural biology. Harold Scheraga (Cornell).
Chair, Russell Doolittle
Chair, Andrew McCammon
Chair, Andrew McCammon
Chair, Peter Wolynes
LIST OF ATTENDEES
Computational Biomolecular Science
Robert K. Adair, Yale University
Paul A. Bash, Argonne National Laboratory
R. L. Bernstein, San Francisco State University
Paul Beroza, CombiChem Inc.
Peer Bork, European Molecular Biology Laboratory
David A. Brant, University of California
Roy J. Britten, California Institute of Technology
Thomas C. Bruice, University of California, Santa Barbara
Douglas Brutlag, Stanford University Medical School
Aloke Chatterjee, Lawrence Berkeley National Laboratory
Jiangang Chen, University of California, Los Angeles
Margaret S. Cheung, University of California, San Diego
Julian D. Cole, Rensselaer Polytechnic Institute
Kumari Devulapalle, University of Southern California, School of Dentistry
Russell F. Doolittle, University of California, San Diego
William Eaton, National Institutes of Health
Ron Elber, Hebrew University
Adrien Elcock, University of California, San Diego
Hans Frauenfelder, Los Alamos National Laboratory
Anthony Gamst, University of California, San Diego
Robert Gerber, University of California, Irvine
Elizabeth D. Getzoff, Scripps Research Institute
Raveh Gill-More, Compugen Ltd.
Adam Godzik, The Scripps Research Institute
Jill E. Gready, Australian National University
Philip Green, University of Washington
Angela M. Gronenborn, National Institutes of Health
William Grundy, University of California, San Diego
Volkhard Helms, University of California, San Diego
Dennis Kibler, University of California, Irvine
Robert Konecny, The Scripps Research Institute
Kristin Korethe, Smith Kline Beecham
Leslie A. Kuhn, Michigan State University
Donald Kyle, Scios Inc.
Peter W. Langhoff, San Diego Supercomputer Center
Michael Levitt, Stanford University, School of Medicine
Jian Li, The Scripps Research Institute
Wen-Hsiung Li, University of Texas
E. N. Lightfoot, University of Wisconsin
Jennifer H. Y. Liu, University of California
Hartmut Luecke, University of California, Irvine
Jia Luo, University of California, Santa Barbara
Zaida Luthey-Schulten, University of Illinois
Jeffry D. Madura, University of South Alabama
J. Andrew McCammon, University of California, San Diego
Gregory Mooser, University of Southern California, School of Dentistry
Victor Munoz, National Institutes of Health
Alexey G. Murzin, Centre for Protein Engineering
Craig Nevill-Manning, Stanford University
Louis Noodleman, The Scripps Research Institute
Hugh Nymeyer, University of California, San Diego
Jose N. Onuchic, University of California, San Diego
Jean-Luc Pellequer, The Scripps Research Institute
Gregory A. Petsko, Brandeis University
Mike Potter, University of California, San Diego
Vijay S. Reddy, The Scripps Research Institute
Carolina M. Reyes, University of California, San Francisco
Roy Riblet, Medical Biology Institute
Andrey Rzhetsky, Columbia University
Suzanne B. Sandmeyer, University of California, Irvine
Robert Sauer, Massachusetts Institute of Technology
Harold Scheraga, Cornell University
Rebecca K. Schmidt, Australian National University
Klaus Schulten, University of Illinois
Soheil Shams, BioDiscovery
Sylvia Spengler, Lawrence Berkeley National Laboratory
Tim Springer, Center for Blood Research
T. P. Straatsma, Pacific Northwest National Laboratory
Ivan Sutherland, Sun Microsystems Laboratories
Mounir Tarek, National Institute of Standards and Technology
Douglas Tobias, University of California, Irvine
Chandra S. Verma, University of York
Rebecca Wade, European Molecular Biology Laboratory
Frederic Y. M. Wan, University of California, Irvine
Arieh Warshel, University of Southern California
Stephen H. White, University of California, Irvine
Peter Wolynes, National Institutes of Health
Willy Wriggers, University of Illinois at Urbana-Champaign
William V. Wright, University of North Carolina
Thomas Wu, Stanford University
Qiang Zhenq, Scios Inc.
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
Table of Contents
Papers from a National Academy of Sciences Colloquium on Computational Biomolecular Science

Computational biomolecular science
Peter G. Wolynes
p. 5848

Measuring genome evolution
Martijn A. Huynen and Peer Bork
pp. 5849–5856

SMART, a simple modular architecture research tool: Identification of signaling domains
Jörg Schultz, Frank Milpetz, Peer Bork, and Chris P. Ponting
pp. 5857–5864

Highly specific protein sequence motifs for genome analysis
Craig G. Nevill-Manning, Thomas D. Wu, and Douglas L. Brutlag
pp. 5865–5871

A statistical mechanical model for β-hairpin kinetics
Victor Munoz, Eric R. Henry, James Hofrichter, and William A. Eaton
pp. 5872–5879

Coupling the folding of homologous proteins
Chen Keasar, Dror Tobi, Ron Elber, and Jeff Skolnick
pp. 5880–5883

Photoactive yellow protein: A structural prototype for the three-dimensional fold of the PAS domain superfamily
Jean-Luc Pellequer, Karen A. Wager-Smith, Steve A. Kay, and Elizabeth D. Getzoff
pp. 5884–5890

New methods of structure refinement for macromolecular structure determination by NMR
G. Marius Clore and Angela M. Gronenborn
pp. 5891–5898

Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution
Xun Gu and Wen-Hsiung Li
pp. 5899–5905

Precise sequence complementarity between yeast chromosome ends and two classes of just-subtelomeric sequences
Roy J. Britten
pp. 5906–5912

A unified statistical framework for sequence comparison and structure comparison
Michael Levitt and Mark Gerstein
pp. 5913–5920

Folding funnels and frustration in off-lattice minimalist protein landscapes
Hugh Nymeyer, Angel E. García, and José Nelson Onuchic
pp. 5921–5928

Optimizing the stability of single-chain proteins by linker length and composition mutagenesis
Clifford R. Robinson and Robert T. Sauer
pp. 5929–5934

Architecture and mechanism of the light-harvesting apparatus of purple bacteria
Xiche Hu, Ana Damjanović, Thorsten Ritz, and Klaus Schulten
pp. 5935–5941

Electrostatic steering and ionic tethering in enzyme-ligand binding: Insights from simulations
Rebecca C. Wade, Razif R. Gabdoulline, Susanna K. Lüdemann, and Valère Lounnas
pp. 5942–5949

Computer simulations of enzyme catalysis: Finding out what has been optimized by evolution
Arieh Warshel and Jan Florián
pp. 5950–5955
Proc. Natl. Acad. Sci. USA Vol. 95, p. 5848, May 1998
Colloquium Paper
This paper is the introduction to the following papers, which were presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Computational biomolecular science
PETER G. WOLYNES
School of Chemical Sciences, University of Illinois, Urbana-Champaign, Urbana, IL 61801

In this century, the study of the molecules of life has transformed the practice of biology as a whole. Molecular thinking now influences the research agenda for scientists studying both the behavior of individual cells and organisms, and the relationships between organisms as in natural history. Even ecology and anthropology are being influenced by this molecular revolution. It is impressive that this transformation has, to a large extent, been made possible by simply identifying (with very clever strategies!) active biological molecules and cataloging their information content through their sequences. One result of all this activity is that raw data about life at the molecular level have become abundant, but understanding its biological meaning remains, in many if not most respects, perplexing.

Fortunately, just at this stage, new approaches to understanding the connection between biomolecular sequence and physiological behavior are coming forward. Computation, theory, and novel experimental approaches that utilize the combinatorial power of the genetic code allow us to begin to understand biomolecular function from both the bottom-up atomistic point-of-view of the physical sciences and the top-down view usually associated with the evolutionary perspective. The goal of this colloquium was to bring together some of the workers from different scientific disciplines who are approaching these problems by using quantitative methods. Because computation plays such a large part in exploiting the information content of sequence data, the conference was entitled “Computational Biomolecular Science,” although some of the essential input of new experiments to this emerging discipline was covered too.
From the bottom-up perspective, the first event to consider on the road from sequence to the biological behavior of an organism is the folding of a linear polymer into a three-dimensional structure. Once a molecule is properly folded, a variety of motions still go on in the folded state. It is through these motions that the biological molecule can function. These dynamical aspects represent complex problems in chemistry and physics. But it is the aptness with which these functions are carried out that at last determines whether the organism containing that molecule can survive in the struggle with other organisms. Quantitatively understanding molecular behavior sufficiently well for understanding this final biological goal requires much work from both the theoreticians and the experimentalists. The top-down interpretation of molecular data appears to proceed quite differently. Avoiding the complexity of molecular theory, the evolutionary perspective takes inheritance, perhaps the most self-evident aspect of “living” things, as its central concept. Comparing sequences between different organisms then provides clues to their molecular function. In this study, dominant use is made of features of molecules that do not change an organism’s fitness, thus allowing markers of inheritance to be reliably assigned. In a sense then the nonfunctional parts of a molecule’s structure and dynamics are the most useful to the phylogenetically inclined scientist. Convergent evolution is hard to establish by such studies but is critically important to those who wonder whether, from the atomistic perspective, there are indeed general themes to the scheme of life. Despite its sometimes “life as a blackbox” character, the top-down viewpoint has achieved a myriad of successes in the practical applications of biomolecular science. A gap exists between the two different vantage points of looking at biomolecular information, but there are a surprising number of common concepts. 
In understanding the folding, motions, and function of biological molecules, for example, a powerful new viewpoint that describes the entire energy landscape of a biomolecule in a statistical fashion is proving essential. Understanding and differentiating between those parts of the energetics and dynamics that are biologically significant and those that can be thought of as random noise is the hallmark of this approach. Similarly, in the comparative top-down approach to understanding sequence data, a tremendous amount of statistical thinking must be done to understand whether a perceptible similarity between two sequences really means the molecules have comparable function or structure or whether the similarity is just an accident. Just as in energy landscape theory, extracting signal from noise is the crucial point to understanding molecular evolution. Such frankly statistical viewpoints must also be brought together when planning modern molecular biology experiments that now begin to allow the study of a huge number of variants of a biomolecule in the laboratory simultaneously.

It became apparent in the meeting that, apart from the general common interest in biomolecules and the common but general theoretical concepts based on statistics, there were many specific problems where the top-down and bottom-up viewpoints can profitably be merged. For example, surveys of genomes reveal widespread structural themes that may be clues to folding thermodynamics and kinetic folding routes. For the atomists, several studies show how the structures of specific sequences can be predicted if knowledge of the sequences of many widely different but evolutionarily related molecules is available. On the other hand, for the evolutionist, an a priori knowledge of structural and energetic patterns in molecules leads to refined algorithms for comparing sequences to obtain reliable phylogenies.
Also, convergent evolution can be recognized if both comparative and physical studies are available for proteins in the same family. This breaks evolutionary explanation out of the mold of sophisticated Kipling “just-so” stories into the quantitative mode, most prized by natural scientists. The papers in this colloquium give a partial snapshot of computational biomolecular science today. The organizers of the meeting, J.A.McCammon, R.F.Doolittle, and I, hope these papers give the readers of the Proceedings an idea of what is going on in a branch of science that is destined to grow much larger in the coming years.
© 1998 by The National Academy of Sciences 0027–8424/98/955848–1$2.00/0 PNAS is available online at http://www.pnas.org.
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5849–5856, May 1998
Colloquium Paper
This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Measuring genome evolution
(ortholog/synteny/computer analysis/horizontal gene transfer)
MARTIJN A. HUYNEN* AND PEER BORK
European Molecular Biology Laboratory, Meyerhofstrasse 1, 69012 Heidelberg, Germany, and Max-Delbrück-Centrum for Molecular Medicine, 13122 Berlin-Buch, Germany

ABSTRACT The determination of complete genome sequences provides us with an opportunity to describe and analyze evolution at the comprehensive level of genomes. Here we compare nine genomes with respect to their protein coding genes at two levels: (i) we compare genomes as “bags of genes” and measure the fraction of orthologs shared between genomes and (ii) we quantify correlations between genes with respect to their relative positions in genomes. Distances between the genomes are related to their divergence times, measured as the number of amino acid substitutions per site in a set of 34 orthologous genes that are shared among all the genomes compared. We establish a hierarchy of rates at which genomes have changed during evolution. Protein sequence identity is the most conserved, followed by the complement of genes within the genome. Next is the degree of conservation of the order of genes, whereas gene regulation appears to evolve at the highest rate. Finally, we show that some genomes are more highly organized than others: they show a higher degree of the clustering of genes that have orthologs in other genomes.

Molecular evolution usually is studied at the level of single genes. With the determination of genome sequences we have an opportunity to study it at a higher, comprehensive level, that of complete genomes. This leads to the pertinent question: how can genomic information be used to obtain useful information concerning genome evolution? The goal of this paper is to create baseline expectations for measures of genome distances that are based on gene content. By describing some general patterns one also can identify the exceptions.
Measuring evolution at the level of complete genomes is pertinent as it is, after all, the principal level for natural selection. Furthermore, it is intermediate to levels at which evolution has long been studied: namely, the molecular level in genes and genotypes, and the organismal level in the fossil record. The genome in principle contains all of the information necessary to bridge the gap between genotype and phenotype. For example, by knowing the functions of the genes in a genome of a species we can postulate a model for its complete metabolism. However, we have to be careful not to overstate our expectations. The situation might turn out to be analogous to that of proteins, for which, in principle, all information necessary to determine three-dimensional structures in the form of amino acid sequences is known, yet we remain unable to predict their tertiary structures.

Genomes can be analyzed and compared for various features: e.g., nucleotide content, compositional biases of leading and lagging strands in replication (e.g., in Escherichia coli) (1), dinucleotide frequencies (2), the occurrence of repeats (e.g., in virulence genes of Haemophilus influenzae; ref. 3), RNA structures, coding densities, protein coding genes, operons, the size distribution of gene families (4), etc. They also can be compared at a variety of levels: a first-order level where we regard the genome as a “bag of genes” without taking account of interactions between the various components, and a second-order level that considers whether properties of genomes are cross-correlated (e.g., the absence of certain polynucleotides together with the presence of restriction enzymes that specifically cut these polynucleotides; ref. 5). In this paper we focus on first- and second-order patterns in protein coding regions in genomes.
Specifically we measure: (i) the fraction of orthologous sequences between genomes, (ii) the conservation of gene order between genomes, and (iii) the spatial clustering of genes in one genome that have an ortholog in another genome. We correlate these measures with the divergence time between the genomes compared. It is not our goal to define new distance measures to construct phylogenetic trees. Rather it is to analyze the conservation and differentiation of patterns between genomes, to show how we can extract useful information from these, and to analyze at what relative time scales they change. The analyses are done on the first nine sequenced Archaea and Bacteria that were publicly available: H.influenzae (6), Mycoplasma genitalium (7), Synechocystis sp. PCC 6803 (8), Methanococcus jannaschii (9), Mycoplasma pneumoniae (10), E.coli (1), Methanobacterium thermoautotrophicum (11), Helicobacter pylori (12), and Bacillus subtilis (13). Although the total number of publicly available genome sequences is growing rapidly, the trends that we observe should remain largely unchanged with the comparison of new species, given the diverse range of evolutionary distances of the species compared in this paper.
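As an illustration only, measures (i) and (ii) can be sketched in a few lines of Python. Nothing below is taken from the paper's own implementation: the function names, the mutual-best-hit rule used to pair genes, the normalization by the smaller gene count, and the treatment of genomes as linear, unstranded gene lists are all assumptions made for the sake of a small runnable example.

```python
from typing import Dict, List, Tuple

# Hypothetical input: identity[(a, b)] is the percent amino acid identity
# between gene a of genome A and gene b of genome B (significant hits only).
Identity = Dict[Tuple[str, str], float]


def reciprocal_best_hits(identity: Identity) -> Dict[str, str]:
    """Pair genes that are each other's highest-identity hit (assumed criterion)."""
    best_for_a: Dict[str, Tuple[str, float]] = {}
    best_for_b: Dict[str, Tuple[str, float]] = {}
    for (a, b), pid in identity.items():
        if a not in best_for_a or pid > best_for_a[a][1]:
            best_for_a[a] = (b, pid)
        if b not in best_for_b or pid > best_for_b[b][1]:
            best_for_b[b] = (a, pid)
    return {a: b for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a}


def shared_ortholog_fraction(orthologs: Dict[str, str], n_a: int, n_b: int) -> float:
    """Measure (i): ortholog pairs divided by the smaller gene count (assumed normalization)."""
    return len(orthologs) / min(n_a, n_b)


def gene_order_conservation(order_a: List[str], order_b: List[str],
                            orthologs: Dict[str, str]) -> float:
    """Measure (ii): among neighboring gene pairs in A that both have orthologs,
    the fraction whose orthologs are also neighbors in B.
    Genomes are treated as linear lists and strand is ignored (simplifications)."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    candidates = conserved = 0
    for left, right in zip(order_a, order_a[1:]):
        if left in orthologs and right in orthologs:
            candidates += 1
            if abs(pos_b[orthologs[left]] - pos_b[orthologs[right]]) == 1:
                conserved += 1
    return conserved / candidates if candidates else 0.0


# Toy genomes: a4 has no mutual best hit, and b2/b3 have swapped places.
genome_a = ["a1", "a2", "a3", "a4"]
genome_b = ["b1", "b3", "b2", "b4"]
identity = {("a1", "b1"): 62.0, ("a2", "b2"): 55.0, ("a2", "b3"): 40.0,
            ("a3", "b3"): 48.0, ("a4", "b3"): 30.0}

orthologs = reciprocal_best_hits(identity)
print(shared_ortholog_fraction(orthologs, len(genome_a), len(genome_b)))
print(gene_order_conservation(genome_a, genome_b, orthologs))
```

In this toy case three of the four genes are paired, and only one of the two candidate neighbor pairs keeps its adjacency. A real analysis would also need a significance threshold on the hits and a decision on circular chromosomes, neither of which is modeled here.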
Methodological Issues in Comparisons of Genomes Identification of Orthologous Genes. Defining orthology. In comparing the genes of different genomes it is important that we avoid comparisons of “apples and pears”: i.e., that we are able to identify which genes correspond to each other in the various genomes. Fitch (14) introduced the term “orthologs” for genes whose independent evolution reflects a speciation event rather than a gene duplication event. “Where the homology is the result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para=in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example, alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho=exact)” (14). Note that orthology and paralogy are
*To whom reprint requests should be addressed at: European Molecular Biology Laboratory, Meyerhofstrasse 1, 69012 Heidelberg, Germany. e-mail:
[email protected]. © 1998 by The National Academy of Sciences 0027–8424/98/955849–8$2.00/0 PNAS is available online at http://www.pnas.org.
defined only with respect to the phylogeny of the genes and not with respect to function. Identifying orthology by using relative levels of sequence identity. Ideally one would expect that the orthologous genes of two genomes are those that have the highest pairwise identity, having bifurcated relatively recently compared with genes that duplicated before the speciation. The most straightforward approach to identifying orthologous genes is to compare all genes in genomes with each other, and then to select pairs of genes with significant pairwise similarities. A pair of sequences with the highest level of identity then is considered orthologous. Auxiliary information for detection of orthology. Auxiliary information that is useful to assess orthology is “synteny”: the presence in both genomes of neighboring sequences that are also orthologs of each other. As shown below, there is little conservation of the order of genes in genomes in evolution at a time when divergence of their orthologous genes reaches a level of 50% amino acid identity (see Fig. 3). Hence the potential for using synteny for identifying orthologs is limited mainly to genomes that have speciated only relatively recently. A second type of auxiliary information that can be used is the comparison of genes with those of a third genome. If two genes from different genomes have the highest level of identity both to each other and to a single gene from a third genome, then this is a strong indication that they are orthologs (see ref. 15 for a large-scale implementation of this idea). However for a large fraction of genes identifying orthologs by relative sequence identity is hampered by a variety of evolutionary processes. We describe these in the following sections. Sequence divergence. 
At large evolutionary distances, e.g., between Archaea and Bacteria, sequence similarities may be eroded to such an extent that the distance between orthologous sequences is similar to that between sequences that are merely part of the same gene family. More dramatically, homologous sequences can diverge “beyond recognition,” such that the similarity between two orthologs is no higher than the similarity between sequences that are not part of the same gene family, and automatic procedures for the recognition of homology fail. A recent survey of genes in Drosophila shows that one-third of the cDNAs code for very fast evolving genes, for which the frequency of amino acid substituting mutations is only 2-fold lower than that of silent mutations, leading to a situation where homologous proteins are barely recognizable after 8,000 years of evolution (16).
Nonorthologous gene displacement. A second event problematic to ortholog identification is nonorthologous gene displacement. This occurs when two nonorthologous genes that are unrelated or only remotely related perform the same function in two organisms (17). Such displacement is relatively frequent: a comparison of M. genitalium with H. influenzae revealed 12 clear-cut cases (17). As a consequence, orthologs may not be detectable (or are classified as paralogs) in another organism even when the corresponding function is retained.
Gene duplication, gene loss, and horizontal gene transfer. A third process that restricts the identification of orthologous genes is that of gene loss in combination with gene duplication. If two genomes lose different paralogs of an ancestral gene that was duplicated before the speciation event, the remaining genes have the highest sequence identity even though they are not orthologs (18). One may test for such an event by checking whether the protein similarity falls into an expected range.
This is done implicitly by including (presumably orthologous) sequences from other species in the phylogeny and checking whether the gene tree is in accordance with the species tree (18, 19). Inconsistencies between the species tree and the gene tree can indicate nonorthologous relationships between genes. However, they also can be caused by horizontal gene transfer, in which case the genes still could be orthologs. In general, the identification of orthologous sequences, of horizontal gene transfer, and of ancient gene duplications cannot be treated separately. Besides the construction of phylogenetic trees, an additional strategy for finding horizontal gene transfer is the comparison of nucleotide frequencies within a genome. Recently transferred genes often display nucleotide frequencies that deviate significantly from the rest of the genome (20, 21). A conservative estimate of the number of genes that recently have been transferred to E. coli, based on nucleotide and dinucleotide frequencies in genomes, is 10–15% of the E. coli genome (Phil Green, personal communication; ref. 21). A third strategy for finding horizontal gene transfer is synteny. Because gene order is rarely conserved in evolution, the presence in two distant evolutionary branches of the same order of genes, combined with the absence of this gene order in other more closely related branches, can point to horizontal gene transfer. This strategy has been used successfully to find the example of horizontal gene transfer described in Fig. 1.
Orthology in multidomain proteins. In multidomain proteins two levels of orthology can be distinguished: one at the level of single domains, a second at the level of the whole protein. This may lead to situations where nonorthologous proteins possess orthologous domains. Modularity of genes, in the sense that modules can have different positions but the same function in various proteins, is not well documented in Bacteria and Archaea.
A first step toward modularity, the presence of “gene fusion” or “gene splitting,” however, does occur regularly. Comparative analysis of the genomes of H. influenzae and E. coli showed 10 (24) clear-cut cases of genes that were separate in E. coli (H. influenzae) but that were part of a single gene in H. influenzae (E. coli) (unpublished data). A much more complicated scenario, in which many of the factors described above (multidomain proteins, synteny, and horizontal gene transfer) are involved, is shown in Fig. 1. In general, a combination of the various evolutionary processes described above leads to a situation where, although orthology was defined originally as a one-to-one relationship between proteins, it must be considered a many-to-many relationship.
From homologs to orthologs. The advent of powerful, easy-to-use tools, such as PSI-BLAST (22), to find homologous sequences is likely to shift the emphasis in sequence analysis from predicting homology to predicting orthology. It is clear that, at present, there is no single, simple, and perfect solution to the question of orthology. Orthology is methodologically defined; that is, depending on what is asked of the genomes that are compared, different methods to find orthologous genes are used. We use a minimal definition when we are interested only in the number of orthologs shared between genomes at various phylogenetic distances. Orthologs then are defined in the following manner: (i) they have the highest level of pairwise identity when compared with the identities of either gene to all other genes in the other’s genome; (ii) the pairwise identity is significant (E, the expected fraction of false positives, is smaller than 0.01); and (iii) the similarity extends to at least 60% of one of the genes. The region of similarity is not required to cover the majority of both genes, so as to include the possibility of gene fusion and gene splitting.
In more detailed comparisons between a small number of genomes, auxiliary information was used to determine orthology, such as the order of genes and the comparison to genes from a third genome (see legend to Fig. 1). Given all of these complications in the finding of orthologs and the oversimplified view of evolution that the term suggests, one could conclude that it is better not to use it at all, or only in those cases where one does not have conflicting information from various sources about the phylogeny of the genes. One also can argue that it is exactly these cases where there are conflicts in the information about orthology from different sources that evolution shows some of its most interesting aspects. Orthology is an important refinement over homology in describing the phylogenetic relations between genes, as long
as one always keeps in mind the caveats described above and as long as the methods for determining orthology are well defined.
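The minimal definition of orthology given above, criteria (i) to (iii), can be sketched in a few lines of Python. This is an illustration only, not the procedure used for the results in this paper: the `Hit` record, its field names, and the function `minimal_orthologs` are hypothetical, and the sketch assumes that the best hit is chosen among hits that already pass the significance and coverage thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment between a gene of genome A and a gene of genome B.

    Hypothetical record; field names are assumptions made for this sketch.
    """
    gene_a: str
    gene_b: str
    identity: float  # fraction of identical residues in the aligned region
    evalue: float    # E, the expected fraction of false positives
    cover_a: float   # fraction of gene_a covered by the alignment
    cover_b: float   # fraction of gene_b covered by the alignment


def minimal_orthologs(hits, max_evalue=0.01, min_cover=0.6):
    """Return gene pairs that satisfy criteria (i)-(iii) of the minimal definition."""
    # (ii) significance and (iii) coverage of at least one of the two genes.
    # Assumption: filtering is applied before the best hit is selected.
    usable = [h for h in hits
              if h.evalue < max_evalue and max(h.cover_a, h.cover_b) >= min_cover]

    best_for_a, best_for_b = {}, {}
    for h in usable:
        if h.gene_a not in best_for_a or h.identity > best_for_a[h.gene_a].identity:
            best_for_a[h.gene_a] = h
        if h.gene_b not in best_for_b or h.identity > best_for_b[h.gene_b].identity:
            best_for_b[h.gene_b] = h

    # (i) the pair must be the highest-identity hit seen from both genomes.
    return sorted((h.gene_a, h.gene_b) for h in best_for_a.values()
                  if best_for_b[h.gene_b] is h)
```

Because only one of the two coverage values has to reach 60%, a fused gene in one genome can still be paired with one of its split counterparts in the other, which is the behavior the definition asks for.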
FIG. 1. An example of complexities in assigning orthology to multidomain proteins. The M. thermoautotrophicum genes MTH444 (a sensory transduction histidine kinase) and MTH445 (a sensory transduction regulatory protein) are orthologs of the Synechocystis sequences slr0473 (phytochrome; ref. 41) and slr0474, respectively (the gene nomenclature is from the GenBank files of complete genomes; the first letters of gene names generally represent the initials of the genomes). The arguments for orthology are: (i) The genes have a 34.8% and a 40.2% identity to each other, which is significantly higher than either of them has to other sequences in the other’s genome. (ii) They are neighboring genes in both genomes. (iii) Both MTH444 and slr0473 have the highest level of identity to a single sequence from a third species, Archaeoglobus fulgidus (42), AF1483; the same is true for MTH445 and slr0474 with respect to AF1472. Interestingly, the level of identity of the Synechocystis sequences slr0473 and slr0474 is significantly higher to the M. thermoautotrophicum and A. fulgidus sequences than it is to any of the sequences in the Bacteria, including sequences in Synechocystis itself. The reverse is even more dramatic: MTH445 and AF1472, and MTH444 and AF1483, have a higher level of identity, not only to their Synechocystis orthologs but also to 27 and 28 other sequences in Synechocystis, respectively, than they have to sequences in their own genomes. These 27 (28) sequences are paralogs of slr0473 (slr0474). The similarity between MTH444 and AF1483 is slightly lower than that between AF1483 and slr0473, whereas the similarity between AF1472 and MTH445 is significantly higher than that of either of them to slr0474.
Neighbor-joining clusterings of the histidine kinase orthologs together with their most similar sequences from the three genomes (A) illustrate the most likely evolutionary scenario: a horizontal transfer of the genes from the branch that has led to Synechocystis to the branch leading to M. thermoautotrophicum and A. fulgidus. Given the relative similarities of the proteins, this event occurred after a major amplification of the histidine kinase family in Synechocystis and not long before the split of the branches that led to M. thermoautotrophicum and A. fulgidus. The fact that none of the proteins has a detectable homolog in M. jannaschii, which branched off in the Archaea not long before the branching of A. fulgidus and M. thermoautotrophicum, supports this hypothesis. The only inconsistency is the fact that in the clustering of the kinases, AF1483 and slr0473 are slightly more similar to each other than either is to MTH444. (B) Domain architecture of slr0473, AF1483, and MTH444. The genes slr0473 and AF1483 encode multidomain proteins, carrying GAF (43) domains and PAS (44, 45) motifs at their N terminus. The PAC motif (44, 45) could be detected only in AF1483. The GAF domain and the PAS and PAC motifs are absent in MTH444 and have been replaced by three transmembrane regions (see ref. 11). All three genes possess a histidine kinase domain (HisKc) at their C terminus. 3′ to the slr0473 and MTH444 genes are the regulatory response genes slr0474 and MTH445. The distances between the reading frames are short: 15 nucleotides in Synechocystis, whereas the reading frames overlap in M. thermoautotrophicum. In A. fulgidus the spatial association between these genes is absent. The absence of the GAF and PAS domains in MTH444 might have caused different selective constraints in MTH444 than in slr0473 and AF1483, and thus increased its rate of evolution, thereby reducing its similarity to its A. fulgidus and Synechocystis orthologs at a relatively high rate.
The GAF, PAC, and PAS domains were predicted by using the SMART system (ref. 46; http://www.bork.embl-heidelberg.de/Modules/sinput.shtml).
Timing Genome Divergence. To compare the rates at which various properties of genomes change, a central reference for the divergence between genomes is required. Measurement of the divergence times between the three “domains” (Archaea, Bacteria, and Eukarya) on the basis of protein dissimilarities recently has gained considerable attention and has been the subject of some controversy (see ref. 23 and references therein). The estimates of the date of the last common ancestor vary from 2 billion (24) to 3–4 billion years ago (23). The major assumptions in estimating divergence times from distances between protein sequences are: (i) the proteins are of vertical descent, i.e., they have not been horizontally transferred into the genome following the speciation of the species compared; and (ii) the proteins act as a molecular clock, having rates of amino acid substitution that do not vary over time or between the lineages. Here we use proteins to scale divergence between and within the Archaea and the Bacteria. It is not our intention to estimate absolute divergence times; rather, it is to compare the different relative rates at which genomes evolve. Thus we translate the protein dissimilarities between the species into amino acid substitutions per position per gene, using an equation derived by Grishin (25), which corrects for variations in substitution rates for both amino acids and sites: q = ln(1 + 2d)/(2d), where q is the fraction of identical amino acids between the proteins and d is the number of amino acid substitutions per site. Grishin’s equation recently was used by Doolittle et al. (23) and gives reasonable estimates for the divergence between Bacteria and Archaea.
Stringent criteria were used to select a set of genes that had orthologs in all of the nine genomes compared: (i) Each gene had the highest level of identity to at least five of the other genes (relative to other genes in those five genomes, see our minimal definition of orthology above); and (ii) there were no conflicting hits, from each genome only one protein was selected. The resulting set of 34 proteins is surprisingly small. It contains 17 ribosomal proteins, five tRNA synthetases, two signal recognition particles, two proteins with unknown function, and eight metabolic enzymes. Interestingly, the set consists almost exclusively of proteins that interact with RNA or synthesize RNA. In estimating divergence times of the genomes of Archaea and Bacteria it could be useful to check whether the protein similarities follow the phylogenetic tree (23) given the previously recognized ancient horizontal transfer of metabolic enzymes from Bacteria to Archaea (26), and more recent occasions of horizontal gene transfer (Fig. 1). However, because Archaeal genomes are chimeric, they were treated as
such by obtaining a central reference for the distance between genomes by averaging over the proteins’ distances, irrespective of their phylogenetic trees. As Grishin’s equation tends to overestimate the number of amino acid substitutions per position for low levels of identities between genes (27), the median of the estimates of the number of amino acid substitutions was used in preference to the mean. The results are used in the following sections.
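Grishin’s equation gives q as a function of d, whereas the quantity needed here is d for an observed q, which has no closed form. The sketch below inverts q = ln(1 + 2d)/(2d) numerically by bisection and then takes the median over a set of proteins, as described above. It is an illustration under stated assumptions rather than the code used for the results: the function names are hypothetical, and bisection is simply one convenient way to solve the equation.

```python
import math
from statistics import median


def grishin_identity(d):
    """Fraction of identical residues q expected for d substitutions per site."""
    if d == 0:
        return 1.0
    return math.log1p(2.0 * d) / (2.0 * d)


def grishin_distance(q, tol=1e-10):
    """Solve q = ln(1 + 2d)/(2d) for d by bisection; q must lie in (0, 1]."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if q == 1.0:
        return 0.0
    lo, hi = 0.0, 1.0
    # q decreases monotonically with d, so enlarge hi until the root is bracketed.
    while grishin_identity(hi) > q:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grishin_identity(mid) > q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def genome_distance(identities):
    """Median substitutions per site over the orthologs shared by two genomes."""
    return median(grishin_distance(q) for q in identities)
```

One substitution per site corresponds to q = ln(3)/2, about 55% identity, which is consistent with identity levels staying above 50% within the first time unit. At low identities d grows steeply with small changes in q, so a few poorly conserved proteins can dominate a mean; the median is less sensitive to them.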
Comparing Genomes as “Bags of Genes”
Shared Orthologous Genes. The decrease of the number of shared orthologs in time. A straightforward comparison between genomes simply considers genes, and not the correlation between genes: i.e., a genome is regarded as a “bag of genes.” Taking this a step further, we measure how the number of shared orthologs between two genomes decreases with their divergence time (Fig. 2). The results show that the fraction of shared orthologous sequences decreases rapidly in evolution, faster than the level of pairwise identity between the shared orthologs. Although the fraction of shared orthologs between Archaea and Bacteria is lower than that among the Bacteria, the most dramatic reduction in the fraction of shared orthologs takes place on shorter time scales within the Bacteria and Archaea, when protein identity levels between genomes are still above 50%. Non-tree-like aspects of the evolution of gene content. Even over large evolutionary distances, such as those between Archaea and Bacteria, different pairs of genomes share different orthologs. For example, M. genitalium shares different orthologs with M. jannaschii than with M. thermoautotrophicum (see legend to Fig. 2). This demonstrates a non-tree-like aspect of the evolution of the gene content of genomes: phylogenetically closely related species do not necessarily share the orthologous genes that either of them shares with a phylogenetically distant species.
Differential Genome Analysis. Pairwise genome comparison. Instead of focusing on genomes’ similarities one can focus on their dissimilarities, i.e., “differential genome analysis” (28). Such analysis can be particularly revealing if the genomes are closely related but have different phenotypes, in which case one can identify the genetic basis for their differences.
For example, of the genes in the pathogen H. influenzae that do not have a homolog in the relatively benign E. coli, a large fraction, 60%, are (potentially) involved in H. influenzae’s pathogenesis (28). These genes encode proteins that are located on the surface of the cell, are involved in the production of toxins, are virulence factors, or are homologous to proteins present only in pathogenic species. By contrast, of the proteins in H. influenzae that do have an ortholog in E. coli, only an estimated 12% can be considered host interaction factors. Multiple genome comparison. Differential genome analysis can be extended to multiple genomes. One then can analyze the correlation between shared gene content and shared phenotypic features of the species compared. This is demonstrated in a comparison of the two pathogens H. influenzae and H. pylori with E. coli. H. influenzae and H. pylori share 17 orthologs that do not have a homolog in E. coli. Of these, a large fraction (12) are related to pathogenicity (unpublished data). Differential genome analysis also can be used to select genes responsible for other differences in phenotypes, e.g., metabolism. The main requirement is that the genomes are sufficiently close in evolution that the identification of orthologs is reliable and that the differences in genome content reflect mainly the phenotypic feature that one is interested in.
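Both “bag of genes” comparisons reduce to simple set operations once ortholog and homolog pairs are available. The following sketch computes the two asymmetric fractions of shared orthologs and their average (the quantities plotted in Fig. 2) and the set of genes in one genome without a homolog in another (the starting point of differential genome analysis). The function names and the input representation, lists of gene-identifier pairs, are assumptions made for illustration.

```python
def shared_ortholog_fractions(n_genes_a, n_genes_b, ortholog_pairs):
    """Return (fraction of A with an ortholog in B, the same for B, their average)."""
    genes_a = {a for a, _ in ortholog_pairs}
    genes_b = {b for _, b in ortholog_pairs}
    frac_a = len(genes_a) / n_genes_a
    frac_b = len(genes_b) / n_genes_b
    return frac_a, frac_b, (frac_a + frac_b) / 2.0


def genes_without_homolog(genes, homolog_pairs):
    """Genes of one genome that have no homolog in the genome compared with.

    homolog_pairs holds (gene in this genome, gene in the other genome) tuples.
    """
    with_homolog = {a for a, _ in homolog_pairs}
    return sorted(g for g in genes if g not in with_homolog)
```

The first two return values differ whenever the genomes differ in size, which is why a small genome appears more similar to a large one than the reverse.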
Measuring Correlations Between Genes
Conservation of the Spatial Association of Genes. Quantification of the differentiation of gene order. Synteny, the conservation of the order of genes, has already been studied extensively. Although some conservation of the order of genes in genomes has been reported (29, 30), the emphasis has been on the drastic rearrangement of gene order in evolution (31–33). The evolution of the spatial organization of the genome is being studied for three reasons: (i) To calibrate the rate at which it evolves. (ii) To study the genome organization of the last common ancestor (34). Shared gene order between the Archaea and the Bacteria is assumed to date back to their last common ancestor, with the exception of horizontal gene transfer (Fig. 1). (iii) To estimate the time scale at which gene regulation changes during evolution. The spatial association of genes is related to their regulation, e.g., in the case of operons.
FIG. 2. The relationship between genome similarity, measured as the fraction of shared orthologs, and time, measured as the number of amino acid substitutions per protein per position in a set of 34 orthologs. + shows the fraction of sequences in a genome A that has an ortholog in another genome B, and vice versa. This measure is asymmetric: a relatively small genome like H. influenzae is more similar to a large one like E. coli than E. coli is similar to H. influenzae. A second symbol shows the average of the two asymmetric similarities. Here we use a minimal definition of orthology: sequences that between two genomes have the highest, significant (E<0.01) level of pairwise identity, covering at least 60% of one of the proteins, are regarded as orthologs. Sequences were compared with the Smith-Waterman algorithm (47), using a parallel Bioccelerator computer. The relationship between sequence identity and the number of amino acid substitutions per position as calculated with Grishin’s equation (25) is given for comparison. If one assumes that the divergence time between the Archaea and Bacteria is 3.5 billion years (23), the unit of one amino acid substitution corresponds to about 875 million years. In this estimate of divergence time the Mycoplasmas and H. pylori are not included, because they have a relatively high rate of evolution. The highest six divergence times correspond to the comparisons of the Mycoplasmas and H. pylori with the Archaea. As is clear from the figure, the fraction of shared orthologs between genomes decreases more rapidly in evolution than does the protein identity. Note that the base level of shared orthologs at which the figure saturates consists only partly of a set of sequences that are shared by all the genomes compared. For example, there are 15 orthologous pairs shared between M. genitalium and M. thermoautotrophicum of which none of the genes has a homolog at the E<0.01 level in M. jannaschii. Of this set,
the ones with the highest level of protein identity are DnaK and DnaJ (MG305 and MG019), heat shock proteins with 51% and 50% identity, respectively, to their M. thermoautotrophicum orthologs; deoxyribose-phosphate aldolase (MG050) with 40% identity; a pyrophosphatase (MG351) with 40.5% identity; and a transcriptional regulator (MG448) with 45% identity. Genes that are shared by M. genitalium and M. jannaschii but that are absent in M. thermoautotrophicum include glycolytic proteins such as pyruvate kinase (MG216) with 29.1% identity and glucose-6-phosphate isomerase (MG111) with 27% protein identity.
The conservation of gene order was related to genome divergence time (Fig. 3). The results show a drastic rearrangement of genomes within the first time unit, during which protein identity levels remain above 50%, after which a saturation level is reached. Notice that the order of orthologous genes is less preserved than their presence (compare with Fig. 2). At the divergence time at which the saturation level is reached, the genes that are still paired are in general subunits of proteins, ribosomal proteins, or proteins involved in ABC transport. A detailed examination (T. Dandekar, M.A.H., and P.B., unpublished data) of all conserved pairs of proteins in three Gram-negative bacteria (E. coli, H. influenzae, and H. pylori) and in three Archaea (M. thermoautotrophicum, M. jannaschii, and A. fulgidus) has shown that, for nearly all cases, there is experimental evidence for direct physical interaction between these proteins (see also ref. 31). As mentioned previously, this observation has implications for the study of horizontal gene transfer. Synteny between phylogenetically distant species of genes for proteins that do not show physical interaction indicates recent horizontal gene transfer events.
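The gene-order measure of Fig. 3 counts the shared orthologs that have at least one neighboring gene that is the same ortholog in both genomes and divides by the total number of shared orthologs. A minimal sketch follows. It assumes that “neighbor” means immediately adjacent in the full gene order, that ortholog pairs are one-to-one, and that every paired gene occurs in the corresponding gene order; the function name and arguments are hypothetical.

```python
def gene_order_conservation(order_a, order_b, ortholog_pairs, circular=True):
    """Fraction of shared orthologs that keep at least one orthologous neighbor.

    order_a, order_b: gene identifiers in chromosomal order.
    ortholog_pairs: one-to-one (gene in A, gene in B) pairs.
    """
    a_to_b = dict(ortholog_pairs)
    pos_b = {gene: i for i, gene in enumerate(order_b)}

    def neighbors(order, i):
        n = len(order)
        if circular:  # treat the chromosome as circular
            return [order[(i - 1) % n], order[(i + 1) % n]]
        return [order[j] for j in (i - 1, i + 1) if 0 <= j < n]

    conserved = 0
    for i, gene_a in enumerate(order_a):
        if gene_a not in a_to_b:
            continue  # not a shared ortholog
        neighbors_in_b = set(neighbors(order_b, pos_b[a_to_b[gene_a]]))
        # Conserved if a neighbor in A has its ortholog next to gene_a's ortholog in B.
        if any(a_to_b.get(nb) in neighbors_in_b for nb in neighbors(order_a, i)):
            conserved += 1
    return conserved / len(a_to_b) if a_to_b else 0.0
```

Relaxing adjacency to a window of a few genes would be a different, more permissive measure and would give higher values.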
FIG. 3. Conservation of the order of genes within the genome. Shown is the number of genes that are orthologs in both genomes and that have at least one neighboring gene that is the same ortholog in both genomes, divided by the total number of shared orthologs between the genomes. The x axis shows the divergence of the genomes measured in amino acid substitutions per position. The figure clearly indicates the rapid differentiation of gene order in evolution. Gene order between genomes is less conserved than the fraction of shared orthologs (compare with Fig. 2).
Gene order and operons. Given the widely accepted concept of the operon, it is perhaps surprising that there is so little conservation of gene order. Why the conserved gene order concerns only proteins that show physical interaction might be explained by Fisher’s model of gene clustering (35). Fisher argued that the linkage between genes of proteins that function well together will tend to increase, to prevent the separation of a co-adapted pair of alleles by recombination. It is clear that operons do not consist only of genes for proteins that show physical interaction (reviewed in ref. 36). However, what is conserved of operons over large time scales does indeed seem to concur with Fisher’s hypothesis. A theory that explains the rearrangement of operons has to include an explanation for the existence of operons. The overall rearrangement of operons does not support any theory that is based on functional relationships of the proteins coded by the genes in the operon, unless one specifically can show that functional relationships of the genes change over the time scales on which we observe the rearrangement of operons. The recently proposed theory of “selfish operons” holds that operons exist because they increase the probability that genes that function together are transferred together in horizontal gene transfer (36).
This model was based on the observation that operon structure is conserved between E. coli and Salmonella typhimurium. The model applies only to “nonessential” genes, genes that are relatively dispensable, which can be lost and then reintroduced into the genome through horizontal operon transfer. It does not apply, for example, to the ribosomal genes, which are strongly clustered, are essential, and for which we have no evidence for horizontal gene transfer. It does, however, apply to pathogenicity islands and pathogenicity islets, clusters of genes that play a role in pathogenicity and do indeed show evidence for horizontal gene transfer (37).
Regulatory Elements. With the determination of orthologous genes and conservation of gene order one can begin to determine whether intergenic regions are conserved. The degree of conservation of intergenic regions is remarkably low, and these regions diverge much faster than the gene order (Y. Diaz-Lazcoz, M.A.H., and P.B., unpublished results). The pattern in Fig. 4 can be regarded as an exception, demonstrating that at least in some cases gene regulation is preserved. At the 5′ end of the ribosomal genes rpl11 and rpll in E. coli lies an RNA secondary structure potentially involved in the regulation of expression of the rpl11 operon (38). The structure is conserved
FIG. 4. Conservation of an RNA secondary structure at the 5′ end of the rpl11 operon in Bacterial genomes. The order of the ribosomal protein genes rpl11 and rpll is conserved in all of the Bacteria analyzed. The gene nusG codes for a transcription antitermination factor, Amif is an oligopeptide transport ATP-binding protein, and deoD codes for a purine-nucleoside phosphorylase. The number between the first and second gene indicates the length of the intergenic region. Surprisingly, the secondary structure is absent from H. pylori, even though it shares the presence of nusG 5′ of rpl11 with E. coli, whereas H. influenzae lacks nusG at this position. Notice furthermore that the element has been deleted in H. pylori rather than lost because of point mutations, as there is no space left between nusG and rpl11 in H. pylori. The element is also present in M. pneumoniae but is absent from the Archaea. The element is part of the 5′ leader of the L11 mRNA sequence and is likely to function in the autoregulation of the rpl11 operon (ref. 38 and Y. Diaz-Lazcoz, M.A.H., and P.B., unpublished data).
in all Bacterial genomes analyzed in this paper, with the notable exception of H. pylori.
Co-Occurrence of Genes. Some genomes are more organized than others. If neighboring genes tend to function together in one genome, as they do in the case of operons, then they should both occur in another genome, even if they are not neighbors or part of the same operon there. We show (Fig. 5A) that this is indeed the case. If gene A has a neighboring gene B, then if the ortholog of B (B′) occurs in another genome, the probability that the ortholog of A (A′) occurs in the other genome is increased (compare Fig. 2). In other words, orthologs shared between two genomes tend to be clustered in at least one of the genomes. Part of the results of Fig. 5A is caused by genes that occur as neighbors in both of the genomes compared. The analysis therefore was repeated to include only genes that are separated in one genome (X) but neighbors in another genome (Y). The fraction of genes that are neighbors in Y was compared with the expected fraction, given a model of random shuffling of genes (see Fig. 5B for methods). The results show that genes from a genome Y that have an ortholog in genome X tend to cluster in Y. The trend is present in all genomes except M. genitalium and is particularly pronounced in the genomes of E. coli and B. subtilis. This surprising result suggests that most genomes are organized, yet some genomes are more organized than others. We assume that the genes that occur in one genome and are neighbors in another genome are in some way or another related in function. One explanation for the high degree of clustering in E. coli and B. subtilis is that a large fraction of their genomes consists of recent horizontal gene transfers, which could increase the prevalence of polycistronic operons in their genomes.
Co-occurrence of genes and the conservation of pathways.
Instead of analyzing spatial association of orthologs, one can analyze whether orthologs show “genome association”: i.e., they either occur together in a genome or are both absent from a genome. Such an analysis could, in principle, be used to reconstruct which genes are functionally related. The fact that orthologs that both occur in two genomes have a relatively high probability of spatial association in one of the genomes (Fig. 5A), even if they are separated in the other genome (Fig. 5B), in itself points to the usefulness of this idea. By analogy to approaches that use the covariation of the nucleotide content of positions in RNA (39) to predict which positions interact with each other, one can use the covariation in the occurrence of proteins to create a model of which proteins depend on each other for their function. Such information could be used to reconstruct metabolic pathways or signaling pathways. The important assumption is that the structure of the pathway was constant throughout evolution. Nonorthologous gene displacement, where a gene assumes the functions of another in a pathway, suggests that pathways are more conserved than the presence of orthologous genes. Our observation of the co-occurrence of the genes dnaJ and dnaK in a small set of orthologs that are shared by M. genitalium and M. thermoautotrophicum, but not by M. jannaschii (see legend to Fig. 2), shows that the correlation of functionally related genes is present in phylogenetically distant species. The existence of associated genes and the conservation of this association are important parameters in determining the degree of epistasis in genome evolution, and they determine the shape of the “adaptive landscape” (40) in which genome evolution operates. For an analysis of covariation in the occurrence of genes to be statistically meaningful, more genomes than the nine that were analyzed here are required.
Furthermore one needs to correct for the “baseline” probability that a gene from one genome has an ortholog in another genome, which depends on phylogenetic distance between the genomes (Fig. 2).
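The idea of “genome association” can be made concrete by recording, for each group of orthologs, in which genomes it is present, and then comparing these presence/absence profiles between groups. The sketch below uses the simplest possible score, the fraction of genomes in which two groups are either both present or both absent. The score and the function name are assumptions chosen for illustration; no correction for the baseline probability discussed above is attempted.

```python
def association_score(profile_x, profile_y):
    """Fraction of genomes in which two ortholog groups are both present or both absent.

    Each profile lists presence (True/1) or absence (False/0) per genome,
    with the genomes in the same order in both profiles.
    """
    if len(profile_x) != len(profile_y) or not profile_x:
        raise ValueError("profiles must be non-empty and of equal length")
    agree = sum(1 for x, y in zip(profile_x, profile_y) if bool(x) == bool(y))
    return agree / len(profile_x)


# Profiles over (M. genitalium, M. thermoautotrophicum, M. jannaschii),
# taken from the cases listed in the legend to Fig. 2.
dnaK = [1, 1, 0]
dnaJ = [1, 1, 0]
pyruvate_kinase = [1, 0, 1]
same_pattern = association_score(dnaK, dnaJ)
different_pattern = association_score(dnaK, pyruvate_kinase)
```

With only three genomes the two scores say little by themselves; the example only shows the bookkeeping that a statistically meaningful analysis over many genomes would need.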
Comparing Rates of Genome Evolution
We have studied several indicators of genome evolution and followed their conservation over time (Fig. 6). The resulting calibration curves not only quantify the divergence of these indicators but also have practical value, as they show what information can be extracted from new microbial genomes
FIG. 5. (A) The probability that a gene in genome A has an ortholog in another genome B if a neighboring gene in A has an ortholog in genome B. The probabilities clearly increase compared with the average probability of having an ortholog in another genome (compare Fig. 2). (B) The relative degree of clustering of genes in one genome (A) that have an ortholog in another genome (B). The analysis includes only genes that are clustered (“neighbors”) in genome A, but not in B (and vice versa). Shown is the number of genes in A that have an ortholog in B and have at least one neighboring gene that also has an ortholog in B, divided by the expected number. The expected number of genes that are neighbors in a genome, given a random distribution, is calculated as follows: Given X genes that are randomly distributed over a genome with Y loci, the probability that a gene from X has no neighboring genes from X (it lies isolated) is the probability that it has neither a left-neighbor nor a right-neighbor from X: $P_0 = \frac{Y-X}{Y-1} \cdot \frac{Y-X-1}{Y-2}$. The expected fraction of genes from X with at least one neighbor from X is $P_{1,2} = 1 - P_0$. The fraction of genes in genome A with at least one neighbor that also has an ortholog in genome B is thus divided by $P_{1,2}$ to obtain the relative clustering of the genes in genome A. The relative clustering is averaged over the comparisons of one genome with the eight other genomes. The names of the species have been abbreviated to the first letters of their genus and species name. All genomes except M. genitalium show more clustering of genes than expected. Given its small size, M. genitalium has relatively little room to cluster the genes that have an ortholog in another genome above the expected level of clustering: i.e., most of the genes that have an ortholog in another genome are expected to be neighbors in M. genitalium. The correlation with genome size is not perfect, however. For example, Synechocystis, which has a relatively large genome, shows relatively little genome organization.
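The expectation described in the legend of Fig. 5 can be written out as a short sketch. The function names and the example numbers are hypothetical; only the two formulas for the isolation probability and its complement are taken from the legend.

```python
def prob_isolated(x: int, y: int) -> float:
    """P0: probability that a gene from a set of X genes, placed at random
    among Y loci, has neither a left nor a right neighbor from the same set."""
    if y < 3 or not (1 <= x <= y):
        raise ValueError("need Y >= 3 and 1 <= X <= Y")
    return ((y - x) / (y - 1)) * ((y - x - 1) / (y - 2))


def relative_clustering(observed_fraction: float, x: int, y: int) -> float:
    """Observed fraction of genes with at least one neighbor from the set,
    divided by the expected fraction P1,2 = 1 - P0."""
    if x < 2:
        raise ValueError("at least two genes are needed to form neighbors")
    expected_fraction = 1.0 - prob_isolated(x, y)
    return observed_fraction / expected_fraction


# Hypothetical example: 50 of 100 loci have an ortholog in the other genome,
# and 90% of those 50 genes have at least one such neighbor.
p0 = prob_isolated(50, 100)
ratio = relative_clustering(0.90, 50, 100)
print(f"P0 = {p0:.3f}, relative clustering = {ratio:.2f}")
```

A ratio above 1 indicates more clustering than a random distribution would give. When X approaches Y, as in a small genome where most genes have an ortholog elsewhere, the expected fraction approaches 1 and the ratio has little room to exceed it.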
given their phylogenetic position. The calibration curves will require refinement when more data become available, but they already provide levels of expectation, deviations from which are of potential interest (e.g., synteny of genes in distant species that cannot be found in other species is an indicator of horizontal gene transfer; Fig. 1). In particular, more relatively closely related genomes that have protein identity levels higher than 50% will be essential to provide more precise estimates of the rates at which genome organization and gene regulation evolve. The calibration curves also should influence the analysis strategy; e.g., if a closely related genome is available, orthologs are relatively easy to discriminate from other members of multigene families. By analogy to profile search techniques, it is helpful to include species that are neither too closely related nor too divergent in the first round of the analysis, where the closeness of the relationship depends on the features one wants to identify. For example, to study the evolution of gene regulation one needs to compare more closely related species than to study the evolution of gene order. To study the evolution of gene content, one needs to compare even less related species, whereas the study of the evolution of metabolism requires the comparison of the most distantly related species.
FIG. 6. Relative rates of genome evolution. The curves were fitted from the fraction of shared orthologs (Fig. 2) and the conservation of the order of genes (Fig. 3); the curve that shows the relationship between protein identity and the number of amino acid substitutions per position according to Grishin’s equation (Fig. 2) was added for comparison. Intergenic regions are even less conserved than the order of genes. Nonorthologous gene displacement indicates that metabolism is more conserved than the fraction of shared orthologous genes.
Current analysis of genomes is driven by the prediction of functional features at the molecular and cellular level; it is based on the presence and absence of certain genes in the context of phenotypic expectations. Expectations about horizontal gene transfers and the loss, the acquisition, or displacement of entire pathways (the entire metabolism in the case of the Archaea) and the study of the correlations of gene occurrence will enable us to identify functional cascades in greater detail. Identification of weak regulatory signals in the genomes requires a sensitive comparative analysis. The puzzling evolution of nonconserved but ever-present operons is only one indication that many genetic and evolutionary mechanisms are yet to be detected and quantified.
We are very grateful to Chris Ponting, Berend Snel, Yolande Diaz-Lazcoz, Thomas Dandekar, and Joerg Schultz for providing data and useful discussions. The work was supported by the Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie (Germany) and Deutsche Forschungsgemeinschaft.
1. Blattner, F.E., III, Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K. & Mayhew, G.F. (1997) Science 277, 1455–1462.
2. Karlin, S., Mrazek, J. & Campbell, A. (1997) J. Bacteriol. 179, 3899–3913.
3. Hood, D.W., Deadman, M.E., Jennings, M.P., Bisercic, M., Fleischmann, R.D., Venter, J.C. & Moxon, E.R. (1996) Proc. Natl. Acad. Sci. USA 93, 11121–11125.
4. Huynen, M.A. & van Nimwegen, E. (1998) Mol. Biol. Evol., in press.
5. Gelfand, M.S. & Koonin, E.V. (1997) Nucleic Acids Res. 25, 2430–2439.
6. Fleischmann, R., Adams, M., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A. & Merrick, J.M. (1995) Science 269, 496–512.
7. Fraser, C.M., White, O., Casjens, S., Huang, W.M., Sutton, G.G., Clayton, R., Lathigra, R., Ketchum, K.A., Dodson, R. & Hickey, E.K. (1995) Science 270, 397–403.
8. Kaneko, T., Sato, S., Kotani, H., Tanaka, A., Asamizu, E., Nakamura, Y., Miyajima, N., Hirosawa, M., Sugiura, M. & Sasamoto, S. (1996) DNA Res. 3, 109–136.
9. Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton, G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A. & Gocayne, J.D. (1996) Science 273, 1058–1072.
10. Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B. & Herrmann, R. (1996) Nucleic Acids Res. 24, 4420–4449.
11. Smith, D.R., Doucette-Stamm, L.A., Deloughery, C., Lee, H., Dubois, J., Aldredge, T., Bashirzadeh, R., Blakely, D., Cook, R. & Gilbert, K. (1997) J. Bacteriol. 179, 7135–7155.
12. Tomb, J.-F., White, O., Kerlavage, A.R., Clayton, R.A., Sutton, G.G., Fleischmann, R.D., Ketchum, K.A., Klenk, H.P., Gill, S. & Dougherty, B.A. (1997) Nature (London) 388, 539–547.
13. Kunst, F., Ogasawara, N., Moszer, I., Albertini, A.M., Alloni, G., Azevedo, V., Bertero, M.G., Bessieres, P., Bolotin, A. & Borchert, S. (1997) Nature (London) 390, 249–256.
14. Fitch, W.M. (1970) Syst. Zool. 19, 99–110.
15. Tatusov, R.L., Koonin, E.V. & Lipman, D.J. (1997) Science 278, 631–637.
16. Schmid, K. & Tautz, D. (1997) Proc. Natl. Acad. Sci. USA 94, 9746–9750.
17. Koonin, E.V., Mushegian, A.R. & Bork, P. (1996) Trends Genet. 12, 334–336.
18. Page, R.D.M. (1994) Syst. Biol. 43, 58–77.
19. Yuan, Y.P., Eulenstein, O., Vingron, M. & Bork, P. (1998) Bioinformatics, in press.
20. Medigue, C., Rouxel, Y., Vigier, P., Henaut, A. & Danchin, A. (1991) J. Mol. Biol. 222, 851–856.
21. Lawrence, J.G. & Ochman, H. (1997) J. Mol. Evol. 44, 383–397.
22. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucleic Acids Res. 25, 3389–3402.
23. Feng, D.F., Cho, G. & Doolittle, R.F. (1997) Proc. Natl. Acad. Sci. USA 94, 13028–13033.
24. Doolittle, R.F., Feng, D.F., Tsang, S., Cho, G. & Little, E. (1996) Science 271, 470–477.
25. Grishin, N.V. (1995) J. Mol. Evol. 41, 675–679.
26. Koonin, E.V., Mushegian, A.R., Galperin, M.Y. & Walker, D.R. (1997) Mol. Microbiol. 25, 619–637.
27. Feng, D.-F. & Doolittle, R.F. (1997) J. Mol. Evol. 44, 361–370.
28. Huynen, M., Diaz-Lazcoz, Y. & Bork, P. (1997) Trends Genet. 13, 389–390.
29. Tatusov, R.L., Mushegian, A.R., Bork, P., Brown, N.P., Hayes, W.S., Borodovsky, M., Rudd, K. & Koonin, E.V. (1996) Curr. Biol. 6, 279–291.
30. Tamames, J., Casari, G., Ouzounis, C. & Valencia, A. (1997) J. Mol. Evol. 44, 66–73.
31. Mushegian, A.R. & Koonin, E.V. (1996) Trends Genet. 12, 289–290.
32. Watanabe, H., Mori, H., Itoh, T. & Gojobori, T. (1997) J. Mol. Evol. 44, 57–64.
33. Kolsto, A.B. (1997) Mol. Microbiol. 24, 241–248.
34. Siefert, J.L., Martijn, K.A., Abdi, F., Widger, W.R. & Fox, G.E. (1997) J. Mol. Evol. 45, 467–472.
35. Fisher, R.A. (1930) The Genetical Theory of Natural Selection (Oxford Univ. Press, Oxford).
36. Lawrence, J.G. & Roth, J.R. (1996) Genetics 143, 1843–1860.
37. Barinaga, M. (1996) Science 272, 1261–1263.
38. Branlant, C., Krol, A., Machatt, A. & Ebel, J.P. (1981) Nucleic Acids Res. 9, 293–307.
39. Gutell, R.R., Power, A., Hertz, G., Putz, E. & Stormo, G. (1993) Nucleic Acids Res. 20, 5785–5795.
40. Wright, S. (1932) in Proceedings of the Sixth International Congress on Genetics, ed. Jones, D.F. (Brooklyn Botanical Garden, New York), Vol. 1, pp. 356–366.
41. Yeh, K.C., Wu, S.H., Murphy, J.T. & Lagarias, J.C. (1997) Science 277, 1505–1508.
42. Klenk, H.P., Clayton, R.A., Tomb, J.F., White, O., Nelson, K.E., Ketchum, K.A., Dodson, R.J., Gwinn, M., Hickey, E.K. & Peterson, J.D. (1997) Nature (London) 390, 364–370.
43. Aravind, L. & Ponting, C.P. (1997) Trends Biochem. Sci. 22, 458–45.
44. Zhulin, I.B., Taylor, B.L. & Dixon, R. (1997) Trends Biochem. Sci. 22, 331–333.
45. Ponting, C.P. & Aravind, L. (1997) Curr. Biol. 7, R674–R677.
46. Schultz, J., Milpetz, F., Bork, P. & Ponting, C.P. (1998) Proc. Natl. Acad. Sci. USA 95, 5857–5864.
47. Smith, T. & Waterman, M.S. (1981) J. Mol. Biol. 147, 195–197.
SMART, A SIMPLE MODULAR ARCHITECTURE RESEARCH TOOL: IDENTIFICATION OF SIGNALING DOMAINS
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5857–5864, May 1998
Colloquium Paper
This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
SMART, a simple modular architecture research tool: Identification of signaling domains
(computer analysis/diacylglycerol kinases/DEATH domain/disease genes/automatic sequence annotation)
JÖRG SCHULTZ*†, FRANK MILPETZ*†, PEER BORK*†‡, AND CHRIS P. PONTING§
*European Molecular Biology Laboratory, Meyerhofstr. 1, 69012 Heidelberg, Germany; †Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Str. 10, 13122 Berlin, Germany; and §University of Oxford, The Old Observatory, South Parks Road, Oxford OX1 3RR, United Kingdom
ABSTRACT Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 more than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.
The functions of only a small fraction of known proteins have been determined by experiment. As a result, the use of computational sequence analysis tools is essential for the annotation of novel genes or genomes, and the prediction of protein structure and function. Currently, the most informative of these techniques are database search tools such as BLAST (1) and FASTA (2) that identify similar sequences with associated statistical significance estimates. Current limitations of the use of these programs concern less the aspects of search sensitivity and more the functional annotation of identified homologues. Annotation terms such as “hypothetical protein” or “suppressor of spt3 mutations” are helpful neither to the user’s prediction of structure and function, nor to computational procedures attempting to automatically predict function from sequence.
An additional aspect concerns the annotation of complete genomes. Existing eubacterial and archaeal genomes have been analyzed with little regard to the existence of domains, because multidomain proteins in these organisms are relatively few in number. The domain as a functional and structural unit in eukaryotic proteins, however, is pre-eminent. For example, the majority of human extracellular proteins are multidomain in character (for reviews see refs. 3 and 4) and many complex eukaryotic signaling networks involve proteins containing multiple domains with catalytic, adaptor, effector, and/or stimulator functions (5). Several dozen such “signaling domains” are known (for a review see ref. 6). The importance of modular proteins in disease is emphasized by the recent observation that the majority of positionally cloned human disease genes encode multidomain proteins, many of which are, in fact, signaling proteins (7). On the other hand, the view of the domain as a fundamental unit of structure and function is not universally accepted: not a single noncatalytic signaling domain is annotated in the widely distributed Saccharomyces cerevisiae genome directory that catalogs the genes of this complete genome (8). Thus, there is a need to coordinate knowledge stored in the literature with that stored in sequence databases to facilitate the research of those in the scientific community who require the annotation of genes and genomes.
It is our goal to provide an extensively annotated collection of cytoplasmic signaling domain alignments that enables rapid and sensitive detection of additional domain homologues as a Web-based tool. Because it is difficult to distinguish those domains that perform cytoplasmic signaling roles from those that primarily function in transport, protein sorting, or cell cycle regulation, and for reasons of brevity, we shall discuss those domains that fall under two categories. (i) Cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase, or phospholipase enzymatic activities or those that stimulate GTPase activation or guanine nucleotide exchange; these activities are known to mediate transduction of an extracellular signal toward the nucleus resulting in the initiation of a cellular response. (ii) Cytoplasmic domains that occur in at least two proteins with different domain organizations, of which one also contains a domain that is categorized under (i) (for a complete list of such domains see Table 1).
Domain collections that cover a wide spectrum of cellular functions do exist in the forms of motif, alignment block, or profile databases such as PROSITE (9), BLOCKS (10), PRINTS
‡To whom reprint requests should be addressed.
© 1998 by The National Academy of Sciences 0027–8424/98/955857–8$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviations: SMART, simple modular architecture research tool; DAG, diacylglycerol; PH, pleckstrin homology; PTB, phosphotyrosine binding; SH, Src homology; rcm, rostral cerebellar malformation gene product; HMM, Hidden Markov model.
(11), or Pfam (12) and provide a guide for the annotation of new proteins. However, there is a necessary trade-off in these collections between exhaustive coverage of domains and optimal sensitivity, specificity, and annotation quality. We have chosen to initiate the collection of gapped alignments of signaling domains because these are imperfectly covered in large collections and often include homologues with extremely divergent sequences. This collection is designed to be updated easily and is provided with a World Wide Web interface enabling automatic sequence annotation with evolutionary, functional, and structural information. The resulting SMART procedure, a simple modular architecture research tool, offers a high level of sensitivity and specificity coupled with ease of use.
Number of domains annotated

                                            Yeast    SwissProt: S. cerevisiae   SwissProt: Homo sapiens
                                            genome
Domain    Full name or function             SMART    SMART    SPa      Pfamb    SMART    SPa      Pfamb
RAS-like small GTPases
          RAB                               9        9        9        –        19       19       –
          RAN                               2        2        2        –        1        1        –
          RAS                               3        3        3        –        11       11       –
          RHO                               6        6        6        –        13       13       –
          SAR                               1        1        1        –        0        0        –
          Others                            11       10       7        24(all)  5        3        48(all)
RanBD     Ran-binding domain                3        3        1        –        5        5        –
RasGAP    GAP for Ras-like GTPases          4        3        3        –        3        3        –
RasGEF    GEF for Ras-like GTPases          5        5        4        –        0        0        –
RasGEFN   In some RasGEFs                   4        4        0        –        0        –        0
RGS       Regulator of G-protein signaling  3        1        1        –        11       6        –
RhoGAP    GAP for Rho-like GTPases          9        6        3        –        8        4        –
RhoGEF    GEF for Rho-like GTPases          4        4        3        –        7        6        –
SAM       Sterile alpha motif               6        3        0        –        11       1        –
SH2       Src homology 2                    1        1        1        1        51       51       51
SH3       Src homology 3                    28       25       25       25       65       63       57
SPRY      In splA and ryanodine receptors   3        3        0        –        7        0        –
TBC       In Tre-2, BUB2p, and Cdc16p       10       7        0        –        1        0        –
TPR       Tetratricopeptide repeat          72       69       39       16       40       0        7
UBA       Ubiquitin-associated domain       10       8        0        –        12       0        –
UBCc      Ubiquitin-conjugating enzyme      13       13       13       13       12       12       12
UBX       Ubiquitin-related domain          8        4        0        –        1        0        –
VHS       In VPS-27, Hrs and STAM           4        3        0        –        0        0        –
VPS9      In VPS-9-like proteins            2        1        1        –        1        0        –
WH1       WASp homology domain 1            1        1        0        –        2        0        –
WW        Conserved WW motif                9        8        7        7        9        9        9
ZU5       In ZO-1 and UNC-5                 0        0        0        –        4        0        –
ZZ        Dystrophin-like zinc finger       2        2        0        –        4        1        –
Totals    86                                622      544      383      290      1,137    886      704
Numbers of domains detected by SMART in the yeast genome, and in the yeast and human fractions of the SwissProt database, are compared with the numbers of domains derived from HMMer analysis and Pfam HMMs scanned against these database fractions, and the numbers of annotations in SwissProt. Many of these domains are reviewed elsewhere (5, 6), and additional references may be found via the SMART Web site (http://www.bork.embl-heidelberg.de/Modules/sinput.shtml).
a Annotations in SwissProt.
b Annotations using the hmmfs program of the HMMer package with Pfam-derived HMMs (“–” indicates where no Pfam HMM was available).
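As a rough consistency check, the Totals row of the table can be related to the percentages quoted in the abstract. The sketch below is one possible reading, not a calculation stated by the authors: it assumes that the three totals in each SwissProt group are, in order, the SMART, SwissProt, and Pfam counts, and that the yeast and human fractions are pooled. The dictionary layout and the function name `missing_fraction` are introduced only for illustration.

```python
# Totals row of the table, under the assumed column order SMART, SwissProt, Pfam.
TOTALS = {
    "S. cerevisiae": {"SMART": 544, "SwissProt": 383, "Pfam": 290},
    "Homo sapiens": {"SMART": 1137, "SwissProt": 886, "Pfam": 704},
}


def missing_fraction(reference: str) -> float:
    """Fraction of SMART-detected domains not accounted for by `reference`,
    pooled over the yeast and human SwissProt fractions."""
    smart = sum(counts["SMART"] for counts in TOTALS.values())
    other = sum(counts[reference] for counts in TOTALS.values())
    return 1.0 - other / smart


not_in_swissprot = missing_fraction("SwissProt")
not_in_pfam = missing_fraction("Pfam")
print(f"not annotated in SwissProt: {not_in_swissprot:.1%}")
print(f"not annotated by Pfam:      {not_in_pfam:.1%}")
```

Under these assumptions the pooled fractions round to 25% and 41%, which agrees with the figures given in the abstract.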
METHODS
Construction of Multiple Sequence Alignments and Choice of the Search Program. Of the 86 domain families, multiple alignments of 83 had been published previously (for references, see the annotation that accompanies the SMART Web site). These alignments were refined according to constraints described elsewhere (13) that included minimization of insertions/deletions in conserved alignment blocks, optimization of amino acid property conservation within these blocks, and closing of unnecessary gaps within insertion/deletion regions. Gapped alignments were constructed in preference to ungapped ones to allow the prediction of domain limits and as a result of their greater information content. Care was taken to build alignments that encompassed all secondary structures of domains whose tertiary structures are known. For remaining domains, investigations of sequence similarities beyond previously published domain limits were undertaken; this resulted in N-terminal extension of the previously described PX domain alignment by a single predicted β-strand, and identification of a conserved N-terminal motif in guanine nucleotide exchange factors for Ras-like GTPases. Prediction of domain limits also was aided by close proximities of domains to others with well-known limits, and to bona fide N- and C-terminal residues.
Alignments were updated to include additional predicted homologues. Because no single database searching algorithm currently is able to detect all putative homologues that are detectable by the combination of all searching methods (13), three iterative methods (HMMer, MoST, and WiseTools; refs. 14–16) were used to detect candidate homologues (HMMer and MoST thresholds: 25 bits and E < 0.01). Before their addition to multiple alignments, candidate homologue sequences were subjected to analyses using BLAST (1), Ssearch (2), and/or MACAW (17) to estimate the statistical significance of sequence similarities (PSIBLAST, BLAST, and Ssearch thresholds: E < 0.01). Those sequences that were considered homologues based on statistical significance estimates, and to a lesser extent on experimentally determined biological context, were used to construct alignments, profiles, and Hidden Markov models (HMMs).
As described above, care was taken to establish alignments representing entire structural domains. However, the termini were found to be the least conserved regions of alignments, and several profiles represent incomplete portions of domains. In two cases, phospholipase D and protein tyrosine phosphatase homologues, only short conserved “motifs” (conservation patterns representing an incomplete domain structure) are detectable across the domain family (18–20). For these examples, profiles/HMMs were calculated only from these short motifs to maximize the amino acid similarity signal-to-noise ratio (13).
Assignment and Calibration of Thresholds for Automatic Runs. Score thresholds are required to provide automatic assignment of true positives and true negatives. No current method, including those that provide E- or p-value representations of score significance, can be relied on to provide suitable values for these thresholds in all cases. As a result, manual intervention was necessary to estimate threshold values on the basis of published homology arguments and, for example, on the results of individual BLAST or Ssearch queries. SWise (16) was chosen as an established algorithm able to provide similarity scores for query sequences when compared with the alignment database; however, the SMART database method can be applied to any algorithm that provides similarity scores. For each alignment an SWise (16) threshold (Tp) was established that represents the lowest score allowable for sequences to be considered as “true positives” or homologues. As such, this single-step procedure detects many true positives but fails to detect a few previously proposed homologues (“false negatives”) that score at levels just below that of the top “true negative.” A proportion of false negatives could not be assigned as homologues without further statistical evidence. However, the consideration that domains such as ARM, C2, CBS, IQ, LIM, PDZ, SH2, SH3, and WW (Table 1) frequently are found as repeats enabled several false negatives to be detected by using estimations of an additional threshold value, Tr (Tr < Tp). Tr represents a repeats threshold for a protein where at least one of the repeats scores above Tp (Fig. 1). Two or more repeats scoring above the average of Tp and Tr, (Tp + Tr)/2, also were considered false negatives of the single-threshold step and were assigned as homologues. Some domains that appear to be found only as tandem repeats (for example, EF-hands, tetratricopeptide repeats, and armadillo repeats) are reported only if two or more copies are found that score above a low threshold Tr. To predict the subfamily of a particular domain (for example, whether a tyrosine or a serine/threonine kinase, or whether a tyrosine-specificity or a dual-specificity phosphatase), further thresholds Ts (Ts > Tp) also were estimated; no subfamily predictions are made for those domain homologues that score above Tp but below Ts. Subset alignments of a given domain family were constructed not only to improve the specificity of functional predictions, but also for divergent families for which a single descriptor (profile/HMM) was found to be unable to detect the entire set of known homologues [e.g., C2 and pleckstrin homology (PH) domains; refs. 21 and 22]. Construction of multiple profiles, each representing different regions of the domain phylogenetic tree, resulted in “overlapping” profiles that, when used in combination, found the maximal number of homologues. Combinations of Ts and Tp are used to maintain both sensitivity and specificity. Overlapping hits from nonhomologous profiles, which can occur because of inserted domains (23), are all reported.
Seeding and Updating Procedure. To reduce redundancy and subfamily bias within sequence families, seed alignments were calculated by using an iterative semiautomatic procedure. In a first step, all database sequences considered homologous, given the threshold procedures described above, are subjected to a CLUSTALW phylogenetic tree construction (24). Only a single sequence from every branch of the tree that is shorter than a defined threshold (the default distance is 0.2, which corresponds approximately to 80% identity, ref. 24) is retained in the alignment. From this seed alignment, a profile is derived, leading to reiteration of the database search procedure until convergence. For example, four iterations were required to build a Src homology 2 (SH2) seed alignment containing 95 sequences, of a total of 548 SH2 domains identified in the translated EMBL sequence database.
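Two steps of the procedure above lend themselves to a brief sketch: the threshold rules (Tp, Tr, Ts) and the selection of nonredundant seed sequences. The code is illustrative only. All names are hypothetical, scores equal to a threshold are assumed to pass, the rule for domains that occur only as tandem repeats is not covered, and the seed selection uses a simple greedy filter on pairwise distances as a stand-in for pruning short branches of a CLUSTALW tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the score thresholds of one domain profile."""
    tp: float                   # lowest score accepted as a true positive (Tp)
    tr: Optional[float] = None  # repeats threshold (Tr < Tp), if the domain repeats
    ts: Optional[float] = None  # subfamily threshold (Ts > Tp), if subfamilies exist


def accept_hits(scores: Sequence[float], th: Thresholds) -> List[bool]:
    """Decide which hits of one profile within one protein are accepted."""
    above_tp = [s >= th.tp for s in scores]
    if th.tr is None:
        return above_tp
    if any(above_tp):
        # One repeat above Tp rescues further repeats that score above Tr.
        return [s >= th.tr for s in scores]
    # Otherwise require two or more repeats above the midpoint (Tp + Tr) / 2.
    midpoint = (th.tp + th.tr) / 2
    above_mid = [s >= midpoint for s in scores]
    return above_mid if sum(above_mid) >= 2 else above_tp


def subfamily_called(score: float, th: Thresholds) -> bool:
    """A subfamily is predicted only for hits that reach Ts."""
    return th.ts is not None and score >= th.ts


def pick_seeds(names: Sequence[str],
               dist: Dict[str, Dict[str, float]],
               cutoff: float = 0.2) -> List[str]:
    """Greedy redundancy filter (assumed stand-in for tree-based pruning):
    keep a sequence only if it is at least `cutoff` away from every seed."""
    seeds: List[str] = []
    for name in names:
        if all(dist[name][seed] >= cutoff for seed in seeds):
            seeds.append(name)
    return seeds


# Hypothetical scores and distances, for illustration only.
sh3_like = Thresholds(tp=20.0, tr=10.0, ts=30.0)
rescued = accept_hits([25.0, 12.0, 5.0], sh3_like)
paired = accept_hits([16.0, 17.0, 5.0], sh3_like)
distances = {
    "a": {"a": 0.0, "b": 0.05, "c": 0.5},
    "b": {"a": 0.05, "b": 0.0, "c": 0.5},
    "c": {"a": 0.5, "b": 0.5, "c": 0.0},
}
seeds = pick_seeds(["a", "b", "c"], distances)
```

In this toy input the second repeat is accepted only because the first exceeds Tp, and sequence "b" is dropped from the seed set because it lies within the default distance of 0.2 from "a".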
FIG. 1. Calibration of thresholds. Selection of thresholds from the distributions of SH3 domain scores. (Upper) A histogram of SWise scores for the best match (optimal alignment; in green) of proteins with an SH3 domain profile. (Lower) Similar histograms for the second- and third-best matches (suboptimal alignments; in light blue and dark blue, respectively). Optimal alignment scores less than threshold Tp are mostly derived from sequences considered unlikely to contain SH3 domain homologues. Threshold Tp was selected as the score of the lowest-scoring true positive. Domains that are repeated twice or more in the same protein and that each score above a lower threshold (Tr) are considered to be true positives.
With new sequences entering databases daily, seed alignments and derived profiles need to be updated accordingly. SMART incorporates a facility whereby daily database updates are screened for the presence of signaling domains. Those that represent a new branch of the domain family phylogenetic tree (i.e., with a distance of greater than 0.2) are recorded for inclusion in future SMART domain set updates. The alignments are accessible via the SMART Web server.
Implementation into a Web Server. SMART has been provided with a user interface (http://www.bork.embl-heidelberg.de/Modules/sinput.shtml) that allows rapid and automatic annotation of the signaling domain composition of any query protein sequence. A graphical display is provided showing domain positions within the query sequence. The SMART set of signaling domains is annotated extensively via
hyperlinks to Medline and the Molecular Modeling Database via Entrez (25), thus providing easy access to information relating sequence, homology, structure, and function. As the set of signaling sequences is necessarily incomplete and as there may be other domains represented in the query sequence, direct access also is provided to Pfam (12), a domain database that includes a variety of different domain types, yet provides a lower representation of signaling domains and with lower sensitivity (see Discussion). Intrinsic features of the query such as coiled coil regions (26), low complexity regions (27), and transmembrane regions (28) also are displayed. Annotated or unannotated regions of the query sequence can be subjected individually to gapped BLAST searches (1); the reduced search space allows higher sensitivity in these searches.
Benchmarking Protocol. To assess the sensitivity and selectivity of SMART, results were compared with annotations held by SwissProt, because this represents the best-annotated protein sequence database extant (and includes all those annotations covered by the PROSITE database), as well as with the Pfam domain collection, because this represents the most comprehensive set of gapped alignments available (12). Our intention here was not to provide justifications for the inclusion or exclusion of particular sequences in domain alignments, but to compare literature information as represented by the SMART database with the same information as represented by the SwissProt and Pfam databases. All S. cerevisiae and human sequences were extracted from SwissProt and annotated by using the SMART protocol. Because these organisms are well-studied and their proteins relatively well-annotated, they represent a stringent test for annotation procedures. The SMART domain annotations were compared manually with
FIG. 2. Schematic representations, produced using SMART, of the domain architectures of proteins discussed in the text. See Table 1 for the identified domains; gray lines (no SMART match) might contain other known domains not included in SMART. Putative homologues were identified during SWise (16) searches and/or PSI-BLAST (1) searches (E<0.01). (a) Domain recognition: A novel PTB domain was identified in tensin, resulting in completion of its modular architecture assignment. A PSI-BLAST search with a previously predicted PTB domain in C. elegans F56D2.1 (53) yields the tensin PTB after four passes. Prediction of molecular function via domain hit: Identification of a domain homologous to band 4.1 protein in focal adhesion kinase (FAK) isoforms. FAKs are predicted to bind cytoplasmic portions of integrins in a similar manner to that of talin, another band 4.1 domain-containing protein. A PSI-BLAST search with a band 4.1-like domain (41_HUMAN, residues 206–401) revealed band 4.1-like domains in human, bovine, and Xenopus FAK isoforms by pass 3. (b) Detection of new domains because of search space reduction: Putative DEP domains in ROM1 and ROM2 were identified by using SWise (16) and HMMer (14), but could not be detected by using PSI-BLAST. Analysis of the regions surrounding identified domains revealed the presence of a novel domain in the C-terminal regions of ROM1 and ROM2 that occurs also in several Ste20-like protein kinases and in mouse citron (CNH, citron homology). A gapped BLAST search of the region of citron C-terminal to its PH domain (CTRO_MOUSE, residues 1134–1457) reveals significant similarity with yeast ROM2 (E=1×10⁻⁵). (c) Functional predictions for an entire domain family: A region of p62 known to bind ubiquitin (40), and its homologous sequence in the Drosophila protein ref(2)P, scored as the highest putative true negatives in a SWise search. We predict ubiquitin-binding functions for UBA domains. PSI-BLAST searches were unable to corroborate this prediction. (d) Prediction of cellular functions: Although not indicated in the primary sources (43, 44), a DEATH domain was found in rcm and other UNC5 homologues, in agreement with a previous claim (41). At the molecular level, this domain in UNC5 is predicted to form a heterotypic dimer with a homologous domain in UNC44, implying a cellular role in axon guidance. A gapped BLAST search with the known DEATH domain of death-associated protein kinase (DAPK_HUMAN, residues 1304–1396) predicts a DEATH domain in rat UNC5H1 (E=9×10⁻³). (e) Signaling domains in “disease genes”: Pyrin or marenostrin, a protein that is mutated in patients with Mediterranean fever and is similar to butyrophilin, contains a SPRY domain. PSI-BLAST with the SPRY domain of human DDX1 (EMBL:X70649, residues 124–240) yields a butyrophilin homologue by pass 5 and pyrin/marenostrin (residues 663–759) by pass 7. (f) Homologues of domains involved in eukaryotic signaling may not be eukaryotic-specific: DAG kinases have been found previously in mammals, invertebrates, plants, and slime mold. However, it is apparent that DAG kinase homologues of unknown function are present in yeasts and in eubacteria (see Fig. 3). A gapped BLAST search with Bacillus subtilis bmrU (BMRU_BACSU) yields significant similarities with Arabidopsis thaliana DAG kinase (KDG1_ARATH; E=4×10⁻⁴) and a Schizosaccharomyces pombe ORF (SPAC4A8.07c; E=1×10⁻⁷). (g) Identification of potential misclassifications: A PH domain and the lack of an obvious transmembrane sequence indicate a cytoplasmic and signaling role for a protein (INT1_CANAL) previously thought to be a yeast integrin. A PSI-BLAST search with the N-terminal PH domain of pleckstrin yielded INT1_CANAL in pass 3.
those derived from HMMer (14) analysis, and those contained in SwissProt (Table 1); the hmmfs program and a 25-bit threshold were used for the HMMer analysis. As SwissProt release 34 does not contain all yeast sequences, the complete set of S. cerevisiae ORFs also was subjected to SMART analysis (Table 1).
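The score cutoff used in the HMMer comparison can be illustrated with a minimal sketch. The `Hit` record, the protein names, and the scores below are invented for illustration and are not hmmfs output; whether the 25-bit cutoff is inclusive is also an assumption.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One reported domain match (hypothetical record layout)."""
    protein: str
    domain: str
    start: int
    end: int
    bits: float

BIT_THRESHOLD = 25.0  # the 25-bit threshold stated in the text

def accept(hits, threshold=BIT_THRESHOLD):
    """Keep hits scoring at or above the threshold, best score first.

    Treating the cutoff as inclusive (>=) is an assumption.
    """
    kept = [h for h in hits if h.bits >= threshold]
    return sorted(kept, key=lambda h: h.bits, reverse=True)

# Invented example hits; names and scores are illustrative only.
hits = [
    Hit("YEXAMPLE1", "PH", 10, 110, 41.2),
    Hit("YEXAMPLE1", "DEP", 200, 280, 18.7),
    Hit("YEXAMPLE2", "SH3", 5, 60, 25.0),
]
kept = accept(hits)
for h in kept:
    print(h.protein, h.domain, h.bits)
```

Only the PH and SH3 hits pass here; the DEP hit at 18.7 bits falls below the cutoff and would need another method to be confirmed.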
RESULTS

Comparison with SwissProt and Pfam. Of all protein sequence databases, SwissProt is the most extensively annotated, making use of literature- and sequence-derived (9) data as source material. As a result, the SwissProt database is a valuable resource for investigators searching for hints of the structure and function of their sequences of interest. Consequently, it is appropriate to compare SMART-derived annotations with those contained in SwissProt. SMART detected 548 and 1,137 domains in the yeast and human subsets of SwissProt, respectively (Table 1). Of these, 165 and 251 domains (30% and 22%, respectively) are not annotated in SwissProt. Many of these belong to the 29 domain families that are contained in SMART and yet are not annotated in SwissProt. By contrast, all SwissProt annotations relating to our domain set were detected by SMART, with the exception of a small set of domain fragments. Only 23 of the SMART domain families are represented by PROSITE motifs or patterns. Moreover, because PROSITE motifs commonly represent active site regions, they do not detect the several homologues of kinases, phosphatases, or ubiquitin-conjugating enzymes that have dispensed with their active site residues. The current set of Pfam HMMs, when compared with the yeast and human SwissProt subsets, detected 290 and 704 domains, respectively. Forty-six of the 86 SMART domain types are not represented currently in Pfam. Moreover, the Pfam set does not yet allow subfamily annotation for domain families such as small GTPases, protein kinases, or protein phosphatases. Pfam and HMMer were able to identify several incomplete domain sequences that SMART could not. SMART was not designed to detect domain fragments because it was considered valuable to detect complete domains, thereby allowing assignment of putative domain boundaries. Consequently, the HMMer (hmmfs) option of SMART has been provided to allow detection of incomplete domain sequences.
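The percentages quoted above follow directly from the domain counts, and the underlying comparison is a set difference between two annotation sets. The sketch below checks the arithmetic and shows the comparison on a toy example; the (protein, domain) pairs are hypothetical and do not come from Table 1.

```python
def novel_fraction(detected: int, unannotated: int) -> float:
    """Fraction of detected domains that the reference database does not annotate."""
    return unannotated / detected

# Counts taken from the text: SMART domains detected, and those absent from SwissProt.
yeast = novel_fraction(548, 165)
human = novel_fraction(1137, 251)
print(f"yeast {yeast:.0%}, human {human:.0%}")

# Toy annotation sets keyed by (protein, domain); the entries are invented.
smart = {("P1", "SH2"), ("P1", "PTB"), ("P2", "PH")}
swissprot = {("P1", "SH2")}

only_smart = smart - swissprot      # found by SMART, not annotated in the reference
missed = swissprot - smart          # annotated in the reference, not found by SMART
print(sorted(only_smart), sorted(missed))
```

A real comparison would also have to decide when two annotations with different residue boundaries count as the same domain, which this sketch does not attempt.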
Identification of Signaling Domains in Yeast. Annotation of the complete yeast genome (6218 ORFs) revealed that 420 yeast proteins (6.7%) contain at least one of the domains included in SMART. This is larger than a previous estimate that 2% of yeast proteins are involved in signaling (8), which approximates to the percentage of S. cerevisiae proteins known to be kinase homologues. SMART identifies a total of 622 domains (Table 1); two or more domains occur in 96 of the 420 signaling proteins. The results of the SMART annotation of yeast proteins are summarized in a Web page (http://www.bork.embl-heidelberg.de/Modules/syeast.html), which was generated by using SMART’s graphical output features. These results imply an improvement by SMART on other tools and current best-annotated databases in the particular field of signaling. An additional feature of SMART is its ability to facilitate predictions of the structures and/or functions of proteins when a hit is recorded. The following examples illustrate several such instances that arise from a domain hit.

Domain Annotation and Deduction of Functional Features. During construction of the SMART database, tensin and focal adhesion kinase (pp125FAK), which both are localized to focal contacts, were found to contain previously unrecognized domains. Fig. 2a shows the modular architecture of tensin, an actin filament capping protein that is known to contain large coiled coil regions, an SH2 domain (29), and an N-terminal domain homologous to protein tyrosine phosphatases (PTPs) (20). SMART predicts a phosphotyrosine binding domain (PTB; also called phosphotyrosine interaction [PI] domain) (Table 1) in tensin’s most C-terminal region, which has not previously been ascribed a domain homology. Each of tensin’s three globular domains (PTP, SH2, and PTB/PI) has been implicated in phosphotyrosine-mediated signaling.
This is consistent with previous findings that tensin is a substrate of the tyrosine kinase pp125FAK (30), which is also highly tyrosine-phosphorylated when activated (reviewed in ref. 31). Application of SMART procedures to pp125FAK homologues predicts band 4.1-homologous domains in their N-terminal regions that bind the cytoplasmic regions of integrins (32) (Fig. 2a). Although one has to be cautious when inferring functional information simply from domain identification, on this occasion the band 4.1 domains are likely to perform similar
FIG. 3. Multiple alignments of selected RasGEFN domains. A conserved region was found in the N-terminal regions of several proteins with RasGEF (Cdc25-like) domains (37). Surprisingly, this N-terminal domain may be present in the sequence either close to, or far from, the RasGEF domain. A PSI-BLAST search using a region (residues 898–946) of C. albicans Cdc25 (CC25_CANAL) and E<0.01 identified each of the sequences in Fig. 3 within nine passes before convergence. Predicted (54) secondary structure and 90% consensus sequences are shown beneath the alignments; SwissProt/PIR/EMBL accession codes and residue limits are given after the alignments. Residues are colored according to the consensus sequence [green: hydrophobic (h), ACFGHIKLMRTVWY; blue: polar (p), CDEHKNQRST; red: small (s), ACDGNPSTV; red: tiny (u), AGS; cyan: turn-like (t), ACDEGHKNQRST; green: aliphatic (l), ILV; and magenta: alcohol (o), ST]. The SwissProt sequence KMHC_DICDI has been altered to account for probable frameshifts.
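The 90% consensus line in the caption can be approximated for a single alignment column as follows. The residue classes are copied from the caption; the tie-breaking rule (prefer a single conserved residue, then the smallest class that covers the column) and the handling of gaps are assumptions made for this sketch, not the procedure used to draw the figure.

```python
# Residue classes as listed in the Fig. 3 caption.
CLASSES = {
    "h": set("ACFGHIKLMRTVWY"),  # hydrophobic
    "p": set("CDEHKNQRST"),      # polar
    "s": set("ACDGNPSTV"),       # small
    "u": set("AGS"),             # tiny
    "t": set("ACDEGHKNQRST"),    # turn-like
    "l": set("ILV"),             # aliphatic
    "o": set("ST"),              # alcohol
}

def consensus_symbol(column: str, threshold: float = 0.9) -> str:
    """Return a consensus symbol for one alignment column.

    Gaps and other non-letters are ignored (an assumption). A residue present
    in at least `threshold` of the sequences is reported in upper case;
    otherwise the smallest class reaching the threshold is reported, and "."
    if no class does.
    """
    residues = [r for r in column.upper() if r.isalpha()]
    if not residues:
        return "."
    n = len(residues)
    for aa in sorted(set(residues)):
        if residues.count(aa) / n >= threshold:
            return aa
    by_size = sorted(CLASSES.items(), key=lambda kv: (len(kv[1]), kv[0]))
    for symbol, members in by_size:
        if sum(r in members for r in residues) / n >= threshold:
            return symbol
    return "."

# Invented columns, ten sequences each.
for col in ("ILVLIVLLIV", "SSTTSTSTST", "AAAAAAAAAA", "ADKWPLFGCY"):
    print(col, consensus_symbol(col))
```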
molecular functions because talin, another band 4.1 domain-containing protein, is known also to bind integrin cytoplasmic domains (33).

Reducing the Search Space Enables Identification of Novel Domains. S. cerevisiae ROM1 and ROM2 are sequence-similar proteins that each contain a PH domain and a RhoGEF domain that stimulates exchange of Rho1GDP with Rho1GTP (34). Construction of the SMART databases led to the identification of a putative DEP domain (35) in both ROM1 and ROM2 (Fig. 2b). Comparison of the ROM1 and ROM2 sequences showed a further region of similarity C-terminal to their PH domains. This region [“citron-homology” (CNH) domain] was identified as being homologous to the mouse RhoGTP/RacGTP-binding protein, citron (36), and to the C-terminal regions of several Ste20-like protein kinases (Fig. 2b). A novel domain family (VHS) of unknown function(s) also has been detected in Vps27, Hrs, STAM, and other proteins. A conserved domain in Cdc25p-like proteins mediates their activities as guanine nucleotide exchange factors for Ras or Ral (37). Each of these molecules contains N-terminal extensions. We find additional amino acid similarities in these regions, and these represent a novel domain family (Fig. 3). Surprisingly, this domain (which we call RasGEFN) can be contiguous to, or far from, the catalytic domain. A construct of p140 Ras-GRF that lacks this region is constitutively active (38), so it is likely that the RasGEFN domain performs a suppressor function.

Deducing Functional Features of a Domain Family Via a Protein Hit. In rare cases, we have identified additional members of a domain family in regions of proteins that already have been shown to perform particular functions. Such findings often suggest comparable functions for all other members of the domain family. The ubiquitin-associated (UBA) domain (Table 1) has been shown to be contained in several enzymes implicated in ubiquitination (39).
We have identified a UBA domain in a region of p62, a phosphotyrosine-independent ligand of the p56lck SH2 domain (40) that is known to bind ubiquitin (Fig. 2c). Ubiquitin-binding functions are predicted for other UBA domains.

Prediction of Cellular Function. Particular domains have been implicated in certain cellular events. For example, DEATH domains (Table 1) are present in proteins associated with apoptosis and/or axonal guidance (41, 42). Recent reports (43, 44) identify the rostral cerebellar malformation gene product (rcm) and its homologues as putative netrin receptors. These reports do not indicate the presence of a DEATH domain in rcm or its homologues, even though the domain’s presence may be readily demonstrated by sequence analysis (Fig. 2d) or from its identification in the rcm Caenorhabditis elegans orthologue, UNC-5 (41). As the DEATH domain of UNC-5 is not annotated in databases, this is one of many instances where the potential of domain identification to predict cellular function has been unfulfilled. DEATH domains often form homotypic or heterotypic dimers (42). Because the DEATH domain-containing proteins UNC-44 (45) and the putative netrin receptor UNC-5 are known to be involved in axonal guidance, we predict that transduction of the netrin-initiated signal involves heterodimerization of UNC-5 and UNC-44 DEATH domains.

Identification of Signaling Domains in Genes That Are Involved in Diseases. A recent study of 70 positionally cloned human genes mutated in diseases found that a significantly high proportion of these “disease genes” possess roles in cell signaling (7). In accordance with this, the SMART alignment database contains several novel signaling domains in these genes (including the DEATH domain in rcm-like netrin receptors, see above). Fig. 2e shows the modular architecture of pyrin (46) (also called marenostrin; ref. 47). Mutations in the pyrin gene result in Mediterranean fever syndromes, which are inherited inflammatory disorders.
In addition to its ret-like zinc finger, pyrin/marenostrin and other butyrophilin-like homologues contain a SPRY domain, a domain of unknown function found triplicated in ryanodine receptors and singly in other proteins (48) (Table 1). Midline 1, a pyrin homologue that also contains a SPRY domain, is mutated in patients with Opitz G/BBB syndrome (49).

Identification of Domains in Different Phyla. The range of species in which a particular domain type is found can correlate with the evolution of specific signaling pathways; many of the known cascades are expected only in animals or eukaryotes (3). Thus, identification of DAG kinase homologues in yeast and eubacteria (Fig. 2f) is clearly a surprise. Although further experimentation is required to infer functional features, the presence of conserved, presumably catalytic, residues in the alignment (data not shown) and the occurrence of DAG kinase activities in prokaryotes (50) suggest that the yeast and bacterial DAG kinase homologues possess similar molecular, but perhaps not cellular, roles to those of their animal and plant homologues.

Significance of Domain Detection and Functional Prediction. Annotation of molecular function in sequence databases and even in the literature is difficult to interpret given that the term function may describe phenomena occurring at distinct levels, such as those of amino acids, domains, proteins, molecular complexes, cells, or organisms. Nevertheless, the examples shown above demonstrate that annotation of a certain domain can provide useful hints toward experimental characterization of function at different levels. Domain identification also might provide a counter-argument to a previously proposed molecular function. For example, identification of a PH domain and the absence of a detectable transmembrane region in a supposed integrin from C. albicans (Fig. 2g) argue strongly against its proposed role in cell adhesion (51).
Integrins are transmembrane proteins that link the extracellular matrix with the cytoskeleton and normally contain, except for the β4 subunit, short cytoplasmic sequences. The finding of a PH domain and high sequence similarity to S. cerevisiae BUD4 argues for its signaling role in bud site selection.
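The search-space reduction used in several of the examples above (searching only the stretches of a protein that lie outside already identified domains) can be sketched as follows. The function name, the minimum-length parameter, and the coordinates are hypothetical; the text does not state how SMART selects regions for resubmission.

```python
def unannotated_regions(length, domains, min_len=30):
    """Return 1-based inclusive (start, end) stretches not covered by any domain.

    `domains` is a list of 1-based inclusive (start, end) pairs and may overlap.
    Stretches shorter than `min_len` residues are dropped; the value 30 is an
    arbitrary illustrative cutoff.
    """
    gaps = []
    pos = 1  # first residue not yet covered
    for start, end in sorted(domains):
        if start - pos >= min_len:
            gaps.append((pos, start - 1))
        pos = max(pos, end + 1)
    if length - pos + 1 >= min_len:
        gaps.append((pos, length))
    return gaps

def extract(sequence, region):
    """Slice a 1-based inclusive region out of a sequence string."""
    start, end = region
    return sequence[start - 1:end]

# Invented example: a 1000-residue protein with two annotated domains.
regions = unannotated_regions(1000, [(100, 200), (400, 600)])
print(regions)
```

Each returned region could then be submitted on its own to a similarity search, as was done for the region of citron C-terminal to its PH domain.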
DISCUSSION

Many proteins are multidomain in character and possess multiple functions that often are performed by one or more component domains. A Web-based tool (SMART) has been designed that makes use of mainly public domain information to allow easy and rapid annotation of signaling multidomain proteins. The tool contains several unique aspects, including automatic seed alignment generation, automatic detection of repeated motifs or domains, and a protocol for combining domain predictions from homologous subfamilies. The ability of SMART to annotate single sequences or large datasets is exemplified by the cases described in Results, including annotation of the complete set of yeast ORFs. Currently, large-scale or genome analysis is commonly performed by annotating ORFs with a single “best hit” from similarity searches. Ambiguities as to whether hits represent orthologs (i.e., homologues in different organisms that arose from speciation rather than intragenome duplication and are likely to have a corresponding function; ref. 52) or paralogs (other members of multigene families) are not resolved, and omission of domain annotation also leads to misprediction of function. As most signaling proteins are multidomain in character, only annotation at the domain level avoids ambiguities in assigning homologies and functions to sequences, which may propagate further on additional findings of homology. Furthermore, deduction of the modular architecture is essential for the understanding of the complexities of multidomain eukaryotic signaling molecules; current annotation, however, does not adequately provide this information (Table 1). As examples of this, the existence of noncatalytic signaling
domains cannot be deduced from the current yeast genome directory (8) and no human RasGEF domains currently are annotated in SwissProt. Graphical representation of the complement of modular proteins in a completed genome (e.g., the 622 signaling domains in 420 yeast proteins: http://www.bork.embl-heidelberg.de/Modules/syeast.html) might provide the basis for relating experimentally derived information concerning domains and multidomain proteins to cellular events such as signaling. Although other collections, such as PROSITE, Pfam, BLOCKS, and PRINTS, contain many more distinct domains or motifs, the focus of SMART on signaling allows significantly enhanced detection sensitivity and the inclusion of many families that are not represented in other collections, and offers a high level of specificity (i.e., a low rate of false positives, which is essential for large-scale analysis). The SMART database shall be continually updated; alignment updates shall be semiautomated to avoid misalignments. Thus, forthcoming SMART database versions shall be hand-checked to provide datasets of high quality. In the future, experimental findings that advance the understanding of domain structure and function also shall be provided via updates. As SMART is designed to obtain biologically relevant results without dependency on a single database search technique, there is potential to modify underlying methods to improve performance.

Note Added in Proof. Recent improvements to the SMART system include implementation of SWise-derived E-values and addition of more than 80 extracellular domains. A ProfileScan Server (http://ulrec3.unil.ch/software/PFSCAN_form.html) has appeared recently that includes facilities that are similar or complementary to those of SMART.

We thank colleagues at the European Molecular Biology Laboratory and Ewan Birney for many helpful discussions. We also thank Bernhard Sulzer for computational assistance. C.P.P.
is a Wellcome Trust Career Development Fellow and a member of the Oxford Centre for Molecular Sciences, and was supported in part by a European Molecular Biology Organization Short-Term Fellowship. J.S. and P.B. were supported by the European Union, Bundesministerium für Bildung und Forschung (Germany), and the Deutsche Forschungsgemeinschaft.

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucleic Acids Res. 25, 3389–3402.
2. Pearson, W.R. (1991) Genomics 11, 635–650.
3. Doolittle, R.F. (1995) Annu. Rev. Biochem. 64, 287–314.
4. Bork, P., Downing, A.K., Kieffer, B. & Campbell, I.D. (1996) Q. Rev. Biophys. 29, 119–167.
5. Bork, P., Schultz, J. & Ponting, C.P. (1997) Trends Biochem. Sci. 22, 296–298.
6. Ponting, C.P., Schultz, J. & Bork, P. (1997) Trends Biochem. Sci. 22, Poster Suppl. C04.
7. Mushegian, A.R., Bassett, D.E., Jr., Boguski, M., Bork, P. & Koonin, E.V. (1997) Proc. Natl. Acad. Sci. USA 94, 5831–5836.
8. Mewes, H.W., Albermann, K., Bahr, M., Frishman, D., Gleissner, A., Hani, J., Heumann, K., Kleine, K., Maierl, A., Oliver, S.G., et al. (1997) Nature (London) 387, Suppl., 7–65.
9. Bairoch, A., Bucher, P. & Hofmann, K. (1997) Nucleic Acids Res. 25, 217–221.
10. Henikoff, J.G., Pietrokovski, S. & Henikoff, S. (1997) Nucleic Acids Res. 25, 222–225.
11. Attwood, T.K., Beck, M.E., Bleasby, A.J., Degtyarenko, K., Michie, A.D. & Parry-Smith, D.J. (1997) Nucleic Acids Res. 25, 212–217.
12. Sonnhammer, E.L., Eddy, S.R. & Durbin, R. (1997) Proteins 28, 405–420.
13. Bork, P. & Gibson, T.J. (1996) Methods Enzymol. 266, 162–184.
14. Eddy, S.R., Mitchison, G. & Durbin, R. (1995) J. Comput. Biol. 2, 9–23.
15. Tatusov, R.L., Altschul, S.F. & Koonin, E.V. (1994) Proc. Natl. Acad. Sci. USA 91, 12091–12095.
16. Birney, E., Thompson, J. & Gibson, T. (1996) Nucleic Acids Res. 24, 2730–2739.
17. Schuler, G.D., Altschul, S.F. & Lipman, D.J. (1991) Proteins 9, 180–190.
18. Ponting, C.P. & Kerr, I.D. (1996) Protein Sci. 5, 914–922.
19. Koonin, E.V. (1996) Trends Biochem. Sci. 21, 242–243.
20. Haynie, D.T. & Ponting, C.P. (1996) Protein Sci. 5, 2643–2646.
21. Ponting, C.P. & Parker, P.J. (1996) Protein Sci. 5, 162–166.
22. Gibson, T.J., Hyvonen, M., Musacchio, A., Saraste, M. & Birney, E. (1994) Trends Biochem. Sci. 19, 349–353.
23. Russell, R.B. (1994) Protein Eng. 7, 1407–1410.
24. Thompson, J.D., Higgins, D.G. & Gibson, T.J. (1994) Nucleic Acids Res. 22, 4673–4680.
25. Hogue, C.W.V., Ohkawa, H. & Bryant, S.H. (1996) Trends Biochem. Sci. 21, 226–229.
26. Lupas, A., Van Dyke, M. & Stock, J. (1991) Science 252, 1162–1164.
27. Wootton, J.C. & Federhen, S. (1996) Methods Enzymol. 266, 554–573.
28. Fasman, G.D. & Gilbert, W.A. (1990) Trends Biochem. Sci. 15, 89–92.
29. Davis, S., Lu, M.L., Lo, S.H., Lin, S., Butler, J.A., Druker, B.J., Roberts, T.M., An, Q. & Chen, L.B. (1991) Science 252, 712–715.
30. Richardson, A. & Parsons, J.T. (1996) Nature (London) 380, 538–540.
31. Ilic, D., Damsky, C.H. & Yamamoto, T. (1997) J. Cell Sci. 110, 401–407.
32. Schaller, M.D., Otey, C.A., Hildebrand, J.D. & Parsons, J.T. (1995) J. Cell Biol. 130, 1181–1187.
33. Knezevic, I., Leisner, T.M. & Lam, S.C.T. (1996) J. Biol. Chem. 271, 16416–16421.
34. Ozaki, K., Tanaka, K., Imamura, H., Hihara, T., Kameyama, T., Nonaka, H., Hirano, H., Matsuura, Y. & Takai, Y. (1996) EMBO J. 15, 2196–2207.
35. Ponting, C.P. & Bork, P. (1996) Trends Biochem. Sci. 21, 245–246.
36. Madaule, P., Furuyashiki, T., Reid, T., Ishizaki, T., Watanabe, G., Morii, N. & Narumiya, S. (1995) FEBS Lett. 377, 243–248.
37. Boguski, M.S. & McCormick, F. (1993) Nature (London) 366, 643–654.
38. Buchsbaum, R., Telliez, J.-B., Goonesekera, S. & Feig, L.A. (1996) Mol. Cell. Biol. 16, 4888–4896.
39. Hofmann, K. & Bucher, P. (1996) Trends Biochem. Sci. 21, 172–173.
40. Vadlamudi, R.K., Joung, I., Strominger, J.L. & Shin, J. (1996) J. Biol. Chem. 271, 20235–20237.
41. Hofmann, K. & Tschopp, J. (1995) FEBS Lett. 371, 321–323.
42. Feinstein, E., Kimchi, A., Wallach, D., Boldin, M. & Varfolomeev, E. (1995) Trends Biochem. Sci. 20, 342–344.
43. Leonardo, E.D., Hinck, L., Masu, M., Keino-Masu, K., Ackerman, S.L. & Tessier-Lavigne, M. (1997) Nature (London) 386, 833–838.
44. Ackerman, S.L., Kozak, L.P., Przyborski, S.A., Rund, L.A., Boyer, B.B. & Knowles, B.B. (1997) Nature (London) 386, 838–842.
45. Otsuka, A.J., Franco, R., Yang, B., Shim, K.H., Tang, L.Z., Zhang, Y.Y., Boontrakulpoontawee, P., Jeyaprakash, A., Hedgecock, E., Wheaton, V.I., et al. (1995) J. Cell Biol. 129, 1081–1092.
46. The International FMF Consortium (1997) Cell 90, 797–807.
47. The French FMF Consortium (1997) Nat. Genet. 17, 25–31.
48. Ponting, C.P., Schultz, J. & Bork, P. (1997) Trends Biochem. Sci. 22, 193–194.
49. Quaderi, N.A., Schweiger, S., Gaudenz, K., Franco, B., Rugarli, E.I., Berger, W., Feldman, G.J., Volta, M., Andolfi, G., Gilgenkrantz, S., et al. (1997) Nat. Genet. 17, 285–291.
50. Loomis, C.R., Walsh, J.P. & Bell, R.M. (1985) J. Biol. Chem. 260, 4091–4097.
51. Gale, C., Finkel, D., Tao, N., Meinke, M., McClellan, M., Olson, J., Kendrick, K. & Hostetter, M. (1996) Proc. Natl. Acad. Sci. USA 93, 357–361.
52. Fitch, W.M. (1970) Syst. Zool. 19, 99–113.
53. Bork, P. & Margolis, B. (1995) Cell 80, 693–694.
54. Rost, B., Sander, C. & Schneider, R. (1994) Comput. Appl. Biosci. 10, 53–60.
HIGHLY SPECIFIC PROTEIN SEQUENCE MOTIFS FOR GENOME ANALYSIS
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5865–5871, May 1998 Colloquium Paper This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Highly specific protein sequence motifs for genome analysis
CRAIG G. NEVILL-MANNING, THOMAS D. WU, AND DOUGLAS L. BRUTLAG*
Department of Biochemistry, Stanford University, Stanford, CA 94305–5307

ABSTRACT We present a method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called EMOTIF (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, EMOTIF generates a set of motifs with a wide range of specificities and sensitivities. EMOTIF also can generate motifs that describe possible subfamilies of a protein superfamily. A disjunction of such motifs often can represent the entire superfamily with high specificity and sensitivity. We have used EMOTIF to generate sets of motifs from all 7,000 protein alignments in the BLOCKS and PRINTS databases. The resulting database, called IDENTIFY (http://motif.stanford.edu/identify), contains more than 50,000 motifs. For each alignment, the database contains several motifs whose probability of matching a false positive ranges from 10⁻¹⁰ to 10⁻⁵. Highly specific motifs are well suited for searching entire proteomes, while generating very few false predictions. IDENTIFY assigns biological functions to 25–30% of all proteins encoded by the Saccharomyces cerevisiae genome and by several bacterial genomes. In particular, IDENTIFY assigned functions to 172 proteins of unknown function in the yeast genome.

Assigning function to genes in newly sequenced genomes requires highly specific search and comparison methods (1–4). The process involves first identifying all ORFs or coding regions in the genome and translating them into putative protein sequences. These protein sequences then are compared with (i) databases of individual protein sequences, (ii) databases of protein consensus sequences, or (iii) families of aligned proteins (4–9).
Finally, the remaining unassigned proteins may be compared with known protein folds or structures by using sequence-structure alignment or threading methods (10–16). In large-scale searches for biological function, a high level of specificity is critical to minimize the number of false predictions made among the thousands of genes in a genome. Many popular sequence similarity methods calculate expectation values that can be used together with a threshold to guarantee a specific level of false predictions. However, such highly specific similarity search methods often sacrifice sensitivity and fail to find all of the members of a particular protein family in a genome. On the other hand, protein sequence motifs usually are generated manually in an attempt to maximize the sensitivity while sacrificing specificity, thus giving rise to relatively high frequencies of false predictions (17, 18). In this paper, we present a highly systematic and objective method, called EMOTIF (19), for determining sequence motifs from aligned sets of protein sequences. Unlike most methods that attempt to find a single “best” motif optimized at one level of sensitivity and specificity, EMOTIF generates many possible motifs over a wide range of sensitivity and specificity. Thus, EMOTIF can generate extremely specific motifs that will produce fewer than one expected false prediction per 10¹⁰ tests, as well as more sensitive motifs that cover all members of a family. EMOTIF also can be used to find several highly specific motifs that characterize different subsets of a protein family. By combining these highly specific motifs together in a disjunction, we can potentially describe a protein family with both high specificity and sensitivity. We have applied EMOTIF to two large data sets of aligned protein families, the BLOCKS and the PRINTS databases (7, 9, 20).
Together, these data sets contain nearly 7,000 alignments representing protein active sites, substrate binding sites, superfamily signatures, and so on. By applying EMOTIF to all of these alignments, we have generated a database called IDENTIFY, which contains more than 50,000 sequence motifs with specificities varying from one expected false positive prediction in 10⁵ tests to as low as one expected false positive prediction in 10¹⁰ tests. IDENTIFY can be used to scan newly sequenced ORFs from genomic sequences for function. Each IDENTIFY motif has an associated specificity, indicating the likelihood that a match is a true or false prediction. By using the IDENTIFY database of motifs, we have scanned all ORFs in several bacterial genomes and in the yeast genome for function. IDENTIFY was able to determine the function of 25–30% of all of the proteins in these genomes, usually resulting in 3–4 motifs per protein identified. In particular, IDENTIFY was able to assign a function to 172 of the 833 ORFs whose function was labeled as unknown.
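The practical weight of these specificities becomes clearer with a back-of-envelope calculation. The sketch below assumes that one "test" is one motif compared against one ORF and that every motif has the same false-positive probability; both are simplifying assumptions, and the round figure of 6,000 ORFs is illustrative rather than taken from this paper.

```python
def expected_false_positives(p_false: float, n_tests: int) -> float:
    """Expected number of false matches for n_tests independent tests."""
    return p_false * n_tests

N_MOTIFS = 50_000   # "more than 50,000 sequence motifs" in IDENTIFY
N_ORFS = 6_000      # illustrative round number for a yeast-sized genome
n_tests = N_MOTIFS * N_ORFS  # assumption: one test per (motif, ORF) pair

strict = expected_false_positives(1e-10, n_tests)
loose = expected_false_positives(1e-5, n_tests)
print(f"tests: {n_tests:.1e}")
print(f"expected false positives at 1e-10: {strict:.3g}")
print(f"expected false positives at 1e-5:  {loose:.3g}")
```

Under these assumptions the most specific motifs yield well under one expected false prediction for a whole-genome scan, whereas the least specific ones yield thousands, which is why a genome-wide scan would favor the stricter motifs.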
METHODS
Motif Substitution Groups. A sequence motif is a particular kind of representation called a regular expression (21). It represents a generalization about the range of variability that occurs in corresponding positions across a family of protein sequences. A sequence motif represents variability by specifying a group of amino acids permitted in that position. In our notation, this group of amino acids is enclosed by brackets, e.g., [ILMV]. When only a single amino acid is allowed in a position, that amino acid is represented by a single character without brackets. On the other hand, when a position has no meaningful conservation, all 20 amino acids are permitted; in that case, we use the wildcard character '.'. For a sequence to match a motif, each of the amino acids in the sequence must be permitted by the corresponding group in the motif. In some cases, we may relax this requirement to allow one or more mismatches. To characterize the types of variability observed in nature, we conducted a study of amino acid groups, by using empirical studies of two databases of protein families. The BLOCKS
*To whom reprint requests should be addressed at: Department of Biochemistry, Beckman Center B400, Stanford University, Stanford, CA 94305–5307. e-mail:
[email protected]. © 1998 by The National Academy of Sciences 0027–8424/98/955865–7$2.00/0 PNAS is available online at http://www.pnas.org.
database (22) contains short, ungapped regions that are highly conserved, according to sequence characteristics. The HSSP database (23) contains global alignments of sequences based on structural alignments. We examined all possible subsets S of amino acids to find those groups that are well conserved. We had two criteria for conservation: (i) compactness, meaning that amino acids within the group should substitute for one another with relatively high frequency, and (ii) isolation, meaning that amino acids outside the group should substitute for those in the group with relatively low frequency. These criteria follow those often used in cluster analysis (24). To measure compactness and isolation, we first used the BLOCKS and HSSP databases to provide a set of conditional counts c(a|S), which equals the total number of occurrences of amino acid a in all aligned positions that contain the group S. Conceptually, we found all aligned positions that contain S, and then tabulated all amino acids from those positions. Then, we computed conditional frequencies
where the quantity f(a|S) is defined only for amino acids a not in group S. For each group, we computed the expected conditional frequencies and the standard error of the proportion for amino acids outside the group:
where c(a) is the marginal count of amino acid a’ over all aligned positions. We then computed a separation score for each group, as follows:
where Z(a|S) is a conditional relative deviate, or Z-score. The first term represents our measure of compactness, and the second term represents our measure of isolation. Based on these separation scores, we found all amino acid groups that had a separation score greater than three standard errors, which is equivalent to a significance level of 0.01. Further details of our analysis are presented in ref. 25. Our criteria were met by 30 substitution groups in the BLOCKS database and 51 substitution groups in the HSSP database. The HSSP database yielded more groups because of its larger size, and because our criterion is based on statistical significance. Twenty substitution groups were conserved empirically in both databases, and the validation by both databases provides good evidence that these groups are indeed conserved in nature. If we arrange these groups hierarchically, we obtain the set of amino acid groups shown in Fig. 1. We used these substitution groups to define the space of motifs available to describe protein families. Motif Enumeration and Ranking. A conserved region may be described by many possible motifs, with different levels of coverage and specificity. To better understand the choices involved, consider the sequence alignment in Fig. 2a. We can cover all sequences in the training set if we select the smallest group of amino acids that accounts for all of the amino acids in each position. For example, every sequence has methionine in the first position, so the first position of the motif should specify M. In the second position, both phenylalanine and tyrosine occur. The smallest group of amino acids from Fig. 1 that accounts for the entire position is [FYW], which allows tryptophan to occur in addition to phenylalanine and tyrosine. Using this group is tantamount to inferring that this position requires an aromatic amino acid. 
In the third position, no allowable group can account for the diverse amino acids that are observed, so to achieve complete coverage we must place a wild-card character in this position.
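The construction of a complete-coverage motif described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list `GROUPS` is a small hypothetical subset of substitution groups (the full set is the one shown in Fig. 1 and is not reproduced here), the toy alignment is invented, and the function names are ours rather than part of EMOTIF. Because the bracket and wildcard notation is also valid regular-expression syntax, the resulting motif can be matched directly with Python's `re` module.

```python
import re

# Hypothetical subset of substitution groups, for illustration only.
GROUPS = ["FYW", "ILMV", "KR", "DE", "ST"]

def column_symbol(column):
    """Return the motif symbol for one column of an ungapped alignment."""
    residues = set(column)
    if len(residues) == 1:
        return next(iter(residues))            # a single conserved residue
    # Smallest allowed group that contains every residue seen in the column.
    candidates = [g for g in GROUPS if residues <= set(g)]
    if candidates:
        return "[" + min(candidates, key=len) + "]"
    return "."                                 # no group fits: use the wildcard

def full_coverage_motif(alignment):
    """Build the motif that matches every sequence in the alignment."""
    return "".join(column_symbol(col) for col in zip(*alignment))

# Invented toy alignment (not the tubulin block of Fig. 2a).
alignment = ["MFAKL", "MYCRL", "MFGKL"]
motif = full_coverage_motif(alignment)
print(motif)                                   # M[FYW].[KR]L
print(bool(re.fullmatch(motif, "MWSKL")))      # True: W is allowed by [FYW]
print(bool(re.fullmatch(motif, "MAAKL")))      # False: A is not in [FYW]
```

Note that the second position accepts tryptophan even though it never occurs in the toy alignment, which mirrors the inference of an aromatic position described in the text.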
FIG. 1. Substitution groups. Groups of amino acids found to occur together in columns of aligned sequences in both the BLOCKS and HSSP databases. Only groups of amino acids that occur together at a significant frequency and are separated from all other amino acids at a level of significance of less than 0.01 are included. The substitution groups are arranged hierarchically to show relationships between their physical properties.
The resulting motif, shown in Fig. 2b, has complete coverage, because it describes the entire training set, but it can be affected by problems with the data. Consider again the alignment in Fig. 2a. In the eighth position from the right, every sequence but one contains a leucine. The first sequence, however, contains a proline at this position. This may be the result of a sequencing error, a rare mutation, or a sequence that has been erroneously assigned to the family. In any case, if the first sequence were removed from consideration in the formation of the motif, this position in the motif would change from '.' to L. Doing this reduces the coverage of the motif by one sequence, but makes it more specific. Even in the absence of problems in the data, motifs with high coverage generally may have low specificity, thereby resulting in false positives. In constructing a motif, we are faced then with a fundamental tradeoff between coverage and specificity. The EMOTIF algorithm explores this tradeoff for a particular alignment by exhaustively generating all possible motifs using the allowable substitution groups and quantifying the coverage and specificity for each motif.
Another feature of our example bears discussion. The sequences can be partitioned into two subclasses based on the amino acid in the fourth position. The first group has arginine in this position, whereas the second group has lysine. All sequences in the first group have tyrosine in the final position, whereas none in the second group do.
Indeed, partitioning the sequences in this way allows the conserved region to be described by two highly specific motifs, rather than a single, more general one. Fig. 2c shows the motif for the first group. Thirteen positions are more specific than in the motif for the entire set of sequences, resulting in a 10^10-fold increase in specificity. Thus, by finding motifs that cover only part of the training set, EMOTIF is potentially able to discover subfamilies within a superfamily and characterize them with a specific motif.
We define specificity as the probability that a random sequence would match the motif. To calculate this, we assume that the distribution of amino acids in each position of a random sequence is independent and identically distributed. We use the observed distribution of amino acids in the SWISSPROT database as an estimate for this distribution. The
specificity of a motif then is simply the product of the probabilities in each position. A wild-card character matches with probability 1.0, and a specific amino acid matches with the probability taken from the database. A group of amino acids matches with the sum of the probabilities of the individual amino acids. So the probability of the motif in Fig. 2b is
$$p(\mathrm{M}) \cdot 1 \cdot [p(\mathrm{F}) + p(\mathrm{W}) + p(\mathrm{Y})] \cdot [p(\mathrm{K}) + p(\mathrm{R})] \cdot 1 \cdot 1 \cdot p(\mathrm{F}) \cdots 1.$$
We have found empirically that this estimate accurately predicts false positive rates for matches of motifs against large protein databases, so the assumption of independence of positions is reasonable in practice.
FIG. 2. Aligned block of 34 tubulin proteins and two motifs representing these sequences. (a) An aligned block of 34 tubulin proteins and the sequence variation observed among them. (b) One possible sequence motif for the alignment in a that can be formed by using the amino acid substitution groups from Fig. 1. (c) A much more specific sequence motif that can be used to represent the upper 19 tubulin sequences, which form a group more closely related to each other than to the lower 15 sequences.
The EMOTIF algorithm exhaustively generates all possible motifs for a particular alignment using the allowable substitution groups, and quantifies the coverage and specificity for each motif. The graph in Fig. 3 illustrates the tradeoff between these quantities. Each point in the graph corresponds to a single motif for the alignment of 159 segments of tubulin sequences similar to those shown in Fig. 2a. The vertical axis is the specificity of the motif, which ranges from 1 to 10^-44. The horizontal axis is the coverage of the motif, measured as the number of training sequences that the motif matches. In this case, the training set contains 159 sequences, and motifs covering fewer than 30% of the total (47 sequences) were not generated. The EMOTIF algorithm uses a lower limit on coverage to help prune the search space and to allow all motifs to be generated efficiently. Typically, the lower limit on coverage is 30%, but this value may be specified by the user. Because coverage of the training set is an integer, the graph consists of a series of vertical lines, one for each number of sequences covered.
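The specificity estimate and the enumeration of motifs with their coverage can be illustrated by brute force on a toy alignment. The sketch below is not the EMOTIF implementation: it simply tries every combination of allowed symbols per column, which is feasible only for very small inputs. Its assumptions are labeled in the code: a uniform background of 1/20 per amino acid stands in for the SWISSPROT frequencies used in the paper, `GROUPS` is a hypothetical subset of the groups in Fig. 1, and the alignment and all function names are invented for illustration.

```python
from itertools import product
from math import prod

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; the paper uses SWISSPROT frequencies instead.
BACKGROUND = {aa: 1 / 20 for aa in AMINO_ACIDS}
# Hypothetical subset of substitution groups, for illustration only.
GROUPS = ["FYW", "ILMV", "KR"]

def column_choices(column):
    """Symbols to try in one column: observed residues, overlapping groups, wildcard."""
    observed = sorted(set(column))
    groups = [g for g in GROUPS if any(aa in g for aa in observed)]
    return observed + groups + [AMINO_ACIDS]   # the full alphabet acts as '.'

def symbol_probability(symbol):
    """Chance that a random residue is permitted by this symbol."""
    if symbol == AMINO_ACIDS:
        return 1.0                             # wildcard
    return sum(BACKGROUND[aa] for aa in symbol)

def render(symbol):
    """Write one symbol in the bracket notation used for motifs."""
    if symbol == AMINO_ACIDS:
        return "."
    return symbol if len(symbol) == 1 else "[" + symbol + "]"

def enumerate_motifs(alignment, min_fraction=0.3):
    """Return (motif, coverage, specificity) for every motif above the coverage limit."""
    columns = list(zip(*alignment))
    results = []
    for symbols in product(*(column_choices(c) for c in columns)):
        coverage = sum(
            all(aa in allowed for aa, allowed in zip(seq, symbols))
            for seq in alignment
        )
        if coverage >= min_fraction * len(alignment):
            specificity = prod(symbol_probability(s) for s in symbols)
            results.append(("".join(render(s) for s in symbols), coverage, specificity))
    return results

# Invented toy alignment with two subgroups (R..Y and K..L).
alignment = ["MFRY", "MYRY", "MFKL", "MYKL"]
motifs = enumerate_motifs(alignment)
full = [m for m in motifs if m[1] == len(alignment)]
best_full = min(full, key=lambda m: m[2])      # most specific complete-coverage motif
print(best_full[0], best_full[1])
```

In this toy case the most specific motif with complete coverage needs a wildcard in the last column, whereas a motif covering only the first two sequences can specify every position, which is the coverage and specificity tradeoff plotted in Fig. 3.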
Note that even if two motifs lie in the same vertical line, meaning that they cover the same number of sequences, they do not necessarily cover the same particular subset of sequences. An ideal motif would lie in the lower right of the graph, with complete coverage and maximum specificity. However, the tradeoff between coverage and specificity makes the ideal motif unattainable. Motifs at the extremes are generally undesirable. Motifs in the lower left of the graph are very specific but account for only 30% of the training set. Motifs in the upper right are very sensitive, but result in a high number of expected false positives. Because EMOTIF displays the tradeoff between coverage and specificity explicitly, we may choose optimal motifs that achieve a desired level of specificity. One strategy for searching a large database is to require that the expected number of false positives be less than one. The expected number of false positives is approximately equal to the specificity of the motif multiplied by the number of possible match positions in the database. For example, a search of the GenPept protein sequence database, which contains 10^8 amino acids, achieves fewer than one expected false positive when the motif has a specificity of less than 10^-8. This threshold selects those motifs below a particular horizontal line in the graph. For searches of smaller databases, the line would be higher, and therefore, we could use more sensitive motifs. For searches of
FIG. 3. Enumeration of tubulin motifs by EMOTIF. EMOTIF generates all possible sequence motifs that can cover at least 30% of 159 tubulin sequences in a training set. Each motif is plotted as a dot in the figure where the horizontal axis gives the coverage of the motif (number of sequences covered in the training set), and the vertical axis plots the specificity of the motif as the probability of matching a random protein segment. The motifs occur in vertical lines because coverage is an integer quantity. The lower curve is the Pareto-optimal curve, which represents the most specific motif at each level of sensitivity.
larger databases, the line would be lower, and we would require more specific motifs. Given this restriction, the optimal motif for a particular level of specificity would be the one beneath the line having the highest sensitivity, as approximated by coverage of the training set. The space of optimal motifs also is reduced by the principle of dominance. For any particular level of coverage, a motif that is more specific dominates one that is less specific. On the graph, for any vertical line, a motif that has fewer expected false positives dominates those with more expected false positives. A similar argument can be made for motifs with a particular level of specificity: a motif with higher coverage dominates those with lower coverage. The dominating motifs lie along a Pareto-optimal curve, shown in Fig. 3 as a line along the lower right frontier of motifs. No motif on that line can be made more specific without reducing its coverage, nor be made to cover more sequences without reducing its specificity. Therefore, motifs on or near this line should be used for searching tasks. In practice, we select the motif on the Pareto-optimal line with maximum coverage at the desired level of specificity.
Disjunctive Motifs. By allowing only part of the training set to be covered, we obtain motifs that may fail to describe an entire family or superfamily, thereby resulting in lower sensitivity. To solve this problem, we use disjunctive motifs to achieve high specificity and sensitivity. After we apply EMOTIF to a given training set and select an optimal motif at a given level of specificity, we can invoke EMOTIF on the sequences that were not covered. This generates a second motif, which in combination with the first motif, covers more of the training set than the first motif alone. This process may be continued until some coverage criterion is met, such as coverage of 90% of the training set.
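The dominance rule and the selection of a motif for a given database size can be sketched as follows. The `Motif` record, the function names, and all candidate values are hypothetical and chosen only to illustrate the logic; specificity here follows the paper's definition (the probability of matching a random segment, so smaller is more specific), and the expected number of false positives is approximated as specificity times the number of match positions.

```python
from typing import NamedTuple

class Motif(NamedTuple):
    pattern: str
    coverage: int        # number of training sequences matched
    specificity: float   # probability of matching a random segment

def pareto_front(motifs):
    """Keep the motifs that no other motif dominates."""
    front = []
    for m in motifs:
        dominated = any(
            o.coverage >= m.coverage and o.specificity <= m.specificity
            and (o.coverage > m.coverage or o.specificity < m.specificity)
            for o in motifs
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m.coverage)

def select_motif(motifs, db_positions, max_expected_fp=1.0):
    """Highest-coverage Pareto-optimal motif with expected false positives below the limit."""
    allowed = [m for m in pareto_front(motifs)
               if m.specificity * db_positions < max_expected_fp]
    return max(allowed, key=lambda m: m.coverage) if allowed else None

# Hypothetical candidates; labels stand in for real motif patterns.
candidates = [
    Motif("motif_A", 50, 1e-12),
    Motif("motif_B", 80, 1e-10),
    Motif("motif_C", 80, 1e-7),    # dominated by motif_B (same coverage, less specific)
    Motif("motif_D", 120, 1e-9),
    Motif("motif_E", 159, 1e-6),
    Motif("motif_F", 100, 1e-8),   # dominated by motif_D
]
front = pareto_front(candidates)
large_db_choice = select_motif(candidates, db_positions=1e8)
small_db_choice = select_motif(candidates, db_positions=1e4)
print([m.pattern for m in front], large_db_choice.pattern, small_db_choice.pattern)
```

With 10^8 match positions the complete-coverage candidate is excluded because it would yield about 100 expected false positives, whereas with 10^4 positions it becomes acceptable, in line with the remark that smaller databases permit more sensitive motifs.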
To evaluate the increase in coverage possible with this approach, we obtained disjunctive motifs for each of the 7,000 multiple sequence alignments in the BLOCKS and PRINTS databases. The disjunctive motif strategy requires one parameter: a desired minimum level of specificity. We applied our strategy for five levels of specificity, from 10^-6 to 10^-10, by factors of 10. For each level of specificity, we measured the number of motifs required to achieve 90% coverage for each sequence alignment. The results of our experiments are shown in Fig. 4. At a specificity level of 10^-10, 65% of the sequence alignments had 90% coverage by a single motif, whereas at a specificity level of 10^-6, 80% of the blocks had 90% coverage by a single motif. At a specificity level of 10^-10, 80% of the sequence alignments had 90% coverage by a disjunction of two motifs, whereas at a specificity level of 10^-6, nearly 95% of blocks had 90% coverage by a disjunction of two motifs. It appears that for reasonable levels of specificity, one or two motifs are sufficient to cover most sequence alignments reasonably well in these databases. A disjunction of motifs may identify subfamilies in the training set. Each subfamily can be described specifically by its own motif. For instance, the graph in Fig. 3 shows motifs that are clustered into distinct groups. The clustering suggests the presence of several subfamilies in the training set. In fact, the training set, which consists of tubulins, can be divided biologically into subfamilies, and the various clusters in the figure correspond to motifs that cover α-tubulins only, β-tubulins only, both α- and β-tubulins, and α-, β- and γ-tubulins. We have developed methods for identifying subfamilies optimally using criteria from statistics and minimum description length principles. These methods are discussed in further detail in ref. 19.
The IDENTIFY Motif Database. We used the results of the above experiments to produce a motif database for evaluating individual sequences and searching sequence databases. At each level of specificity, we obtained approximately 10,000 motifs. The collective database of motifs is called the IDENTIFY database. The motifs are grouped according to the level of specificity for which they are optimal. For large databases requiring high specificity, motifs at the 10^-10 level are most appropriate. For smaller databases requiring less specificity, motifs at the 10^-6 level may be appropriate.
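The iterative procedure described under Disjunctive Motifs, by which the motifs at each specificity level were assembled, can be sketched as a greedy covering loop. This is a simplification and not the authors' code: EMOTIF is re-run on the uncovered sequences at each step, whereas the sketch below picks from a fixed list of candidate motifs that are assumed to meet the desired specificity already. The sequences, the candidate motifs, and the function name are invented for illustration.

```python
import re

def greedy_disjunction(sequences, candidates, target_percent=90):
    """Pick motifs one at a time until the target share of sequences is covered."""
    if not sequences or not candidates:
        return [], 0.0
    uncovered = set(range(len(sequences)))
    chosen = []
    # Integer arithmetic avoids rounding surprises in the coverage test.
    while (len(sequences) - len(uncovered)) * 100 < target_percent * len(sequences):
        def gain(motif):
            return sum(1 for i in uncovered if re.fullmatch(motif, sequences[i]))
        best = max(candidates, key=gain)
        newly = {i for i in uncovered if re.fullmatch(best, sequences[i])}
        if not newly:
            break                      # no candidate covers anything further
        chosen.append(best)
        uncovered -= newly
    return chosen, 1 - len(uncovered) / len(sequences)

# Invented training set with two subfamilies.
sequences = ["MFRAY", "MYRCY", "MFRGY", "MFKAL", "MYKCL", "MWKGL"]
# Invented candidates, assumed to satisfy the chosen specificity level.
candidates = ["M[FYW]R.Y", "M[FYW]K.L", "M[FYW]R[AC]Y"]
chosen, fraction = greedy_disjunction(sequences, candidates)
print(chosen, fraction)
```

Here neither motif alone reaches 90% coverage, but the disjunction of the two subfamily motifs covers the whole toy training set.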
RESULTS
Unidentified ORFs from Yeast. We have applied the IDENTIFY database to predict functions in unidentified ORFs in Saccharomyces cerevisiae. At the time of the experiment (May 1997), there were 6,220 known ORFs in the yeast genome
FIG. 4. The number of motifs required to cover at least 90% of the protein family in the IDENTIFY database. EMOTIF was used to generate one or more motifs that cover at least 90% of all the sequences in each of 7,000 alignments in the BLOCKS or PRINTS databases at five different levels of specificity. Plotted are the number of motifs that are required to cover at least 90% of the sequences in the alignment.
database (http://genome-www.stanford.edu/Saccharomyces), of which 833 had no confirmed function (26). We applied the IDENTIFY database to each translated ORF, and assigned a predicted function based on matches to motifs. Table 1 shows how many ORFs are identified by motifs at each level of specificity. For example, using the motifs at a specificity of 10^-10, we assigned putative functions to 61 ORFs. Forty-one of these had no annotation whatsoever, indicating that other methods (e.g., BLAST and PROSITE) had failed to identify any significant homology to a known protein. Based on the calculated specificity of the motifs, along with the number of motifs and size of the ORF database, the expected number of false positives is 0.02, so it is highly likely that all of the assignments are correct. Relaxing the procedure a little by using motifs with a specificity of at least 10^-9 produces 86 assignments, including 59 not previously annotated. Again, the expected number of false positives is less than one. At the other end of the spectrum, the 10^-7 set produced 172 predicted functions, but the expected number of false positives is 17. To test these 172 predictions, we compared our results with those in the Sacch3D database (http://genome-www.stanford.edu/Sacch3D) (S. Chervitz, J. M. Cherry, and D. Botstein, personal communication). This database compares each of the translated ORFs in S. cerevisiae against proteins of known structure by using sensitive alignment and threading approaches. Of the 833 unidentified ORFs, 83 had functions assigned by Sacch3D alone, 124 had functions assigned by IDENTIFY alone, and 48 had functions assigned by both programs. Of the 48 functions assigned by both programs, all assignments were identical. Overall, 255 of the unidentified ORFs had a putative function assigned by one or both of the programs.
We analyzed our results at the level of motifs. The BLOCKS and PRINTS databases often contain several sequence alignments for a given family of proteins. Each alignment corresponds to a different conserved segment of the protein. On average, these databases contain three sequence alignments per protein family. Therefore, a match of a sequence to several distinct motifs from the same family provides independent confirmations of the predicted function. In the 48 ORFs with functions assigned by both IDENTIFY and Sacch3D, the IDENTIFY database matched 137 distinct motifs. Of these 137 motif matches, 129 of the predicted functions were the same as those of Sacch3D. We believe that independent predictions of function provide an indication of the reliability of motif matches by IDENTIFY.
Whole Genome Analysis. We applied IDENTIFY to search for functions in all ORFs in several genomes including S. cerevisiae, Haemophilus influenzae, and Methanococcus jannaschii. To assess the performance of IDENTIFY, we tested our assignments against the annotations for each genome as follows. For those ORFs with annotations, we extracted keywords from the description, ignoring common words such as protein, enzyme, and domain. We also extracted significant keywords from the associated entry for the motif from the BLOCKS or PRINTS sequence alignment databases. We considered an assignment correct if the significant keywords from the genomic annotation matched significant keywords from the alignment annotation. If there was no match, then the prediction was incorrect, or the annotations were either insufficient or described the same function differently. To decide among these alternatives, we examined each of the remaining predictions manually (4,647 in total over three genomes).
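The automatic keyword comparison described above might look roughly like the following sketch. The exact tokenization and the list of ignored words used in the study are not given in the text, so everything beyond the three words "protein", "enzyme", and "domain" is an assumption, as are the function names and the example annotation strings.

```python
import re

# "protein", "enzyme", "domain" come from the text; the other entries are assumptions.
COMMON_WORDS = {"protein", "enzyme", "domain", "the", "and", "family", "putative"}

def significant_keywords(description):
    """Lower-case alphanumeric words, minus common words and very short tokens."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in COMMON_WORDS and len(w) > 2}

def annotation_agrees(orf_annotation, alignment_annotation):
    """True when the two descriptions share at least one significant keyword."""
    shared = significant_keywords(orf_annotation) & significant_keywords(alignment_annotation)
    return bool(shared)

# Invented annotation strings for illustration.
print(annotation_agrees("Tubulin beta chain", "Tubulin family signature"))    # True
print(annotation_agrees("Hypothetical protein", "Protein kinase domain"))     # False
```

A prediction that fails this test is not necessarily wrong, which is why the remaining cases were examined manually.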
Table 1. Assignment of function to 833 yeast ORFs of unknown function
Specificity   # ORFs assigned   # ORFs assigned with no annotations   # Motifs assigned   Expected # of false motif assignments
10^-10        61                41                                    179                 0.02
10^-9         86                59                                    238                 0.2
10^-8         103               69                                    301                 1.7
10^-7         172               121                                   488                 17
Table 2 summarizes the predictions for the seven genomes by using motifs from IDENTIFY at different levels of specificity. For each genome and level of specificity, the third column shows the number of correct predictions, as determined by automatic keyword matches. The fourth column contains the number of predictions that could not be verified by automatic keyword matching, but were found to be correct by manual inspection. The fifth column contains the number of predictions that were not confirmed by the annotations. Many of these cases corresponded to ORFs without annotations, whereas other cases showed conflicts between the annotated function and the function predicted by IDENTIFY. The conflicting predictions may be incorrect or may perhaps be plausibly related to the annotated functions. The sixth column shows the number of incorrect predictions expected by chance, based on the number of motifs, their specificity, and the size of the genomes. In the bacterial genomes and in the yeast genome with the most specific motifs, fewer than one incorrect prediction was expected. The seventh column shows the number of ORFs for which a function was predicted correctly by IDENTIFY. This is different from the number of correct predictions, because each ORF may match several motifs in the database, each resulting in a predicted function. The eighth column shows the total number of ORFs in the entire genome, and the final column shows the percentage of ORFs for which a function was predicted by IDENTIFY. Depending on the level of specificity used, the IDENTIFY program predicts functions that match the genomic annotation for 22–26% of ORFs in the yeast genome, 28–30% of the ORFs in H. influenzae, and 9–11% of the ORFs in M. jannaschii. The relatively few predictions for M. jannaschii may be because of its evolutionary divergence from those species that have been sequenced more extensively.
In addition, the IDENTIFY program predicts several functions that are not confirmed by the genome annotations. Based on a 10^-9 level of specificity, we predict novel functions in 31 ORFs in yeast, 33 ORFs in H. influenzae, and 21 ORFs in M. jannaschii. On average, three motifs are assigned to each ORF that is identified. These motifs often represent distinct BLOCKS or PRINTS alignments from a single protein family, thus supporting each other in the assignment of a particular function to a protein. Because these motifs often confirm or support each other, the probability of a false positive prediction is likely to be much less than that of a single motif match.
DISCUSSION
Principled Motif Generation. Motifs, including those in the PROSITE database (17, 18), generally have been generated manually. In this paper, we introduce a method for generating motifs automatically. Automated methods are becoming increasingly important as sequence databases grow. An automated method requires knowledge about sequence conservation. For EMOTIF, this knowledge is encoded as an allowed set of amino acid substitution groups. Although we have presented an empirical analysis that supports a certain set of groups (Fig. 1), the algorithm may be easily adapted to use other sets of amino acid substitution groups. For instance, substitution groups based on chemical principles (27, 28) may be appropriate in certain cases. Other researchers have generated motifs from a predefined set of substitution groups (29, 30), but these sets of allowable groups often have been too limited. Previous sets of substitution groups generally have been mutually exclusive, meaning that each amino acid may belong to only a single group. In contrast, we use overlapping groups, which allow each amino acid to belong to more than one group. This is biologically
appropriate, because each amino acid has several properties and can serve different functions, depending on the biochemical context. In some contexts, the size of an amino acid may be critical; in others, its charge may be the conserved property.
Table 2. Genomes scanned by using IDENTIFY
Genome          Specificity   Total motifs assigned & verified   Motifs verified manually   Assignments unverified   Expected false assignments   ORFs identified   Total ORFs   % of total ORFs identified
S. cerevisiae   10^-10        4,442                              909                        9                        0                            1,345             6,220        22%
S. cerevisiae   10^-9         4,679                              1,027                      31                       5                            1,466             6,220        24%
S. cerevisiae   10^-8         4,994                              1,114                      124                      42                           1,621             6,220        26%
H. influenzae   10^-10        1,804                              644                        11                       0                            479               1,697        28%
H. influenzae   10^-9         1,899                              703                        33                       0                            503               1,697        30%
M. jannaschii   10^-10        349                                115                        3                        0                            157               1,680        9%
M. jannaschii   10^-9         403                                135                        21                       0                            192               1,680        11%
M. genitalium   10^-10        297                                75                         4                        0                            96                467          21%
M. genitalium   10^-9         331                                87                         7                        0                            108               467          23%
Syn. sp.        10^-10        1,369                              389                        21                       2                            447               3,169        14%
Syn. sp.        10^-9         1,569                              461                        34                       20                           513               3,169        16%
M. pneumoniae   10^-10        304                                75                         6                        0                            101               677          15%
M. pneumoniae   10^-9         350                                89                         8                        0                            117               677          17%
H. pylori       10^-10        476                                100                        16                       0                            200               1,566        13%
H. pylori       10^-9         576                                121                        18                       0                            233               1,566        15%
The ORFs encoded in the genomes of S. cerevisiae, H. influenzae, and M. jannaschii were scanned by using the IDENTIFY database. The motif assignments then were verified as described in the text. The number and percentage of ORFs identified by these motif assignments also were calculated. On average, approximately three motifs were assigned to each ORF that was identified.
By using only an allowed set of substitution groups, we avoid the problem of overfitting, which occurs commonly when motifs are generated manually. Overfitting occurs when a motif is designed to cover all variability in a training set, even when such variability may be caused by errors or may not be biologically meaningful. Errors in training sets may arise for a variety of reasons: (i) the sequence data may contain errors, including insertions, deletions, or substitutions; (ii) one or more sequences may be misaligned; (iii) the sequences may be contaminated, meaning that some sequences in the alignment may not truly belong to a particular family; or (iv) the family may contain subfamilies or subclasses, each of which may generalize well individually, but not together. Biologically meaningless variation occurs when the observed variation is caused by mutations that do not affect the structure or function of the protein. For instance, if a position in a protein family were to contain one example each of alanine, cysteine, and valine, the observed variation likely would be biologically meaningless because we know of no chemical or physical reasons that these three amino acids should be conserved together. Therefore, a motif that contains the group [ACV] would be an example of overfitting the data. A biologically meaningful generalization of the observed variation would depend on the available substitution groups. In our set of substitution groups, these three amino acids would be generalized by the wild-card character. Nevertheless, groups that are difficult to interpret biologically, such as [ACV], occur frequently in PROSITE. In that database, motifs are constructed by using 867 distinct amino acid substitution groups. A few groups are used frequently, such as [ILMV], which occurs 826 times in PROSITE. In fact, the 20 most frequently used groups account for 60% of the groups used by motifs in PROSITE. On the other hand, the vast majority of distinct groups (more than 70%) occur in only a single motif, and an additional 13% of groups occur in only two motifs. These groups are probably examples of overfitting. Overfitting is of concern in machine learning, because at some point, further fitting of the training set worsens performance on future test sets. For example, the group [ACV] may cover the training set entirely, but it does not allow for any other amino acid at that position, which may worsen predictive power if, in fact, there is no true conservation at that position.
Enumeration Strategy. EMOTIF uses an enumeration strategy that generates all possible motifs for a given protein family. It is somewhat surprising that, in most cases, EMOTIF is able to enumerate all motifs within a few seconds. Most enumeration strategies in computer science are impractical because the space of solutions is typically so large that a complete enumeration cannot be performed in tractable time. In fact, in an early version of a motif generating program called SeqClass (31), we used a heuristic search strategy to find the single best motif. However, heuristic search strategies are not guaranteed to find the globally optimal solution. On the other hand, an enumerative strategy, if tractable, will guarantee an optimal solution. The tractability of EMOTIF relies on the fact that sequences in a protein family are related, so a single motif may be the most specific one for many different subsets of the training set. Therefore, the space of possible motifs often is limited in practice by the amount of variability possible in the protein family. For additional efficiency, EMOTIF sets a lower limit on coverage of the training set; motifs that cover less than 30% of the training set are not enumerated. The value of 30% still enables EMOTIF to recognize up to three equal-sized subfamilies. Enumeration affords three major advantages over heuristic search.
First, as mentioned above, it guarantees finding the optimal motif for a particular criterion. Second, an enumeration approach finds optimal motifs for multiple criteria simultaneously. For example, EMOTIF provides optimal motifs for a wide range of specificities, each of which may be useful for a particular task. For example, scanning an entire database may require highly specific motifs, whereas characterizing a single protein sequence may require motifs with much lower specificity. A single run of EMOTIF on a single protein family will find the optimal motif at each level of specificity in advance. We have exploited this advantage in constructing the IDENTIFY database, which provides optimal motifs at different levels of specificity for different tasks. The third advantage of an enumeration strategy is that it produces a two-dimensional graph, such as in Fig. 3, which characterizes variability in a protein family. The graph provides clues about possible subfamilies, as exemplified by the α-, β-, and γ-tubulins. In addition, the shape of the Pareto-optimal line also gives insight into the structure of the set of sequences. Bulges in the line toward the lower right indicate clusters of
sequences, whereas a hyperbolic line along the top and left of the graph results from sequences that form no discernible clusters. Finally, the graph helps users view the tradeoff between coverage and specificity for various motifs and allows them to select motifs interactively.

Assigning Function to Novel Proteins. The motifs in the IDENTIFY database are particularly valuable for assigning function to newly sequenced proteins, either individually or in large-scale searches. Motifs are particularly well suited to large-scale searching tasks: they can be used to search a database very quickly, and many fast algorithms exist for performing regular expression searches. In addition, because motifs in the IDENTIFY database are characterized by their specificity, a search using motifs can be tailored to provide maximum sensitivity for a given desired level of specificity and to minimize false positives. Each motif also is linked to the BLOCKS or PRINTS databases, which describe the family of proteins from which it was derived. Because these protein families typically have several members, a match to a motif may provide an association with several other members of the family. In addition, when a match to a motif is obtained, that motif may be used to search sequence databases, such as SWISSPROT and GenPept, for other proteins that share this motif. This function, which is implemented in IDENTIFY, provides all sequences that may share a closely related form of the motif and thereby represent a particular subfamily containing the motif. More importantly, most families in the PRINTS and BLOCKS databases are represented by several motifs, each corresponding to a different conserved region of the family. On average, each family has 3–4 conserved regions. The presence of multiple conserved regions increases the sensitivity of a search using motifs.
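Searching with several motifs per family can be sketched with ordinary regular expressions. The Python example below is illustrative only and is not the IDENTIFY implementation: the two motifs, the region names, and the three sequences are made up, and the motif syntax (square brackets for a substitution group, a period for the wild card) is assumed to map directly onto regular expression syntax.

```python
import re

# Hypothetical motifs for two conserved regions of one made-up family.
# Square brackets are substitution groups; "." is the wild-card character.
FAMILY_MOTIFS = {
    "region_1": "G[ILMV]..[DE]K",
    "region_2": "[FWY]C.[KR]H",
}


def matching_regions(sequence, motifs):
    """Return the names of the motifs that occur anywhere in `sequence`."""
    return [name for name, pattern in motifs.items()
            if re.search(pattern, sequence)]


# Made-up query sequences.
sequences = {
    "seq_a": "MSGLTADKPPAWCTRHQ",  # contains both conserved regions
    "seq_b": "MSGATADKPPAWCTRHQ",  # A is outside [ILMV]; only region_2 matches
    "seq_c": "MPPQQSTTNNAAGG",     # matches neither region
}

hits = {name: matching_regions(seq, FAMILY_MOTIFS)
        for name, seq in sequences.items()}

for name, regions in hits.items():
    print(name, len(regions), regions)
```

In this toy case, a sequence matched by both regions would be a stronger candidate for family membership than one matched by a single region.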
Furthermore, multiple conserved regions provide additional certainty about a functional assignment, above the statistical estimate of significance, when several independent motifs match a given unknown sequence. Motifs, such as those in IDENTIFY, are useful for assigning functions to proteins even in the absence of any homology apart from the limited motif regions. Unlike similarity search methods that weight every position in a sequence alignment to some extent, motifs evaluate only those positions that show conservation in the training set. Hence, motifs can discover function and assign a protein to a family even if that protein is so distantly related that it shows no sequence similarity outside the motifs. This explains why IDENTIFY can assign function to 172 proteins from the yeast genome that have no significant homology to any known protein. The frequency with which IDENTIFY assigns function to these nonhomologous proteins (172/833=21%) is somewhat less than the frequency with which IDENTIFY assigns function to the bulk of the yeast proteins (1,621/6,220=26%). The ability of motifs to assign function by using only homology at particular positions makes them particularly useful for evaluating newly sequenced genomes such as M. jannaschii, most of whose proteins are not homologous to those of other organisms. Currently, IDENTIFY assigns function to about 25–30% of novel protein sequences. This limit reflects, among other things, the fraction of newly sequenced proteins that share at least one motif with a current protein family present in the BLOCKS or PRINTS databases. As more genomes are sequenced and more protein families are defined in these databases, IDENTIFY should be able to assign function to a larger fraction of proteins. Despite this current limitation, IDENTIFY is a valuable tool for assignment of function to newly sequenced proteins, especially in those cases where there are no significant sequence similarities by alignment, profile, or hidden Markov methods. Availability.
Access to the EMOTIF and IDENTIFY programs is available over the Internet at http://motif.stanford.edu/emotif and http://motif.stanford.edu/identify. Nonprofit institutions wishing to install the programs locally may send requests to D.L.B. ([email protected]). Commercial and for-profit institutions can license the programs from Pangea Systems Inc. or from Stanford’s Office of Technology Licensing.
This work was supported by a grant from SmithKline Beecham Pharmaceuticals and by Grant LM 05716 from the National Library of Medicine. T.D.W. is a Howard Hughes Medical Institute Physician Postdoctoral Fellow.
1. Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A., Ouzounis, C. & Sander, C. (1994) ISMB 2, 348–353.
2. Casari, G., Ouzounis, C., Valencia, A. & Sander, C. (1996) in GeneQuiz II: Automatic Function Assignment for Genome Sequence Analysis, Pacific Symposium on Biocomputing, 1996 (World Scientific, Kohala Coast, HI), pp. 707–709.
3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucleic Acids Res. 25, 3389–3402.
4. Sonnhammer, E.L., Eddy, S.R. & Durbin, R. (1997) Proteins 28, 405–420.
5. Attwood, T.K., Beck, M.E., Bleasby, A.J. & Parry-Smith, D.J. (1994) Nucleic Acids Res. 22, 3590–3596.
6. Krogh, A., Brown, M., Mian, I.S., Sjolander, K. & Haussler, D. (1994) J. Mol. Biol. 235, 1501–1531.
7. Henikoff, J.G. & Henikoff, S. (1996) Methods Enzymol. 266, 88–105.
8. Gribskov, M. & Veretnik, S. (1996) Methods Enzymol. 266, 198–211.
9. Attwood, T.K., Beck, M.E., Bleasby, A.J., Degtyarenko, K., Michie, A.D. & Parry-Smith, D.J. (1997) Nucleic Acids Res. 25, 212–217.
10. Holm, L. & Sander, C. (1994) Nucleic Acids Res. 22, 3600–3609.
11. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536–540.
12. Holm, L. & Sander, C. (1995) Trends Biochem. Sci. 20, 478–480.
13. Brenner, S.E., Chothia, C., Hubbard, T.J.P. & Murzin, A.G. (1996) Methods Enzymol. 266, 635–642.
14. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. & Thornton, J.M. (1997) Structure 5, 1093–1108.
15. Holm, L. & Sander, C. (1997) Nucleic Acids Res. 25, 231–234.
16. Hubbard, T.J.P., Murzin, A.G., Brenner, S.E. & Chothia, C. (1997) Nucleic Acids Res. 25, 236–239.
17. Bairoch, A. & Apweiler, R. (1997) Nucleic Acids Res. 25, 31–36.
18. Bairoch, A., Bucher, P. & Hofmann, K. (1997) Nucleic Acids Res. 25, 217–221.
19. Nevill-Manning, C., Sethi, K., Wu, T.D. & Brutlag, D.L. (1997) ISMB-97 4, 202–209.
20. Henikoff, S., Henikoff, J.G., Alford, W.J. & Pietrokovski, S. (1995) Gene 163, GC17–GC26.
21. Hopcroft, J.E. & Ullman, J.D. (1979) Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, Reading, MA).
22. Henikoff, J.G., Pietrokovski, S. & Henikoff, S. (1997) Nucleic Acids Res. 25, 222–225.
23. Schneider, R., de Daruvar, A. & Sander, C. (1997) Nucleic Acids Res. 25, 226–230.
24. Jain, A.K. & Dubes, R.C. (1988) Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, NJ).
25. Wu, T.D. & Brutlag, D.L. (1996) ISMB-96 3, 230–240.
26. Cherry, J.M., Ball, C., Weng, S., Juvik, G., Schmidt, R., Adler, C., Dunn, B., Dwight, S., Riles, L., Mortimer, R.K. & Botstein, D. (1997) Nature (London) 387, 67–73.
27. Kidera, A., Yonishi, Y., Masahito, O., Ooi, T. & Scheraga, H.A. (1985) J. Protein Chem. 4, 23–55.
28. Nakai, M., Kidera, A. & Kanehisa, M. (1988) Protein Eng. 2, 93–100.
29. Smith, R.F. & Smith, T.F. (1990) Proc. Natl. Acad. Sci. USA 87, 118–122.
30. Saqi, M.A. & Sternberg, M.J. (1994) Protein Eng. 7, 165–171.
31. Wu, T.D. & Brutlag, D.L. (1995) ISMB 3, 402–410.
A STATISTICAL MECHANICAL MODEL FOR Β-HAIRPIN KINETICS
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5872–5879, May 1998 Colloquium Paper
This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
A statistical mechanical model for β-hairpin kinetics
VICTOR MUNOZ,* ERIC R. HENRY, JAMES HOFRICHTER, AND WILLIAM A. EATON*
Laboratory of Chemical Physics, Building 5, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892–0520

ABSTRACT Understanding the mechanism of protein secondary structure formation is an essential part of the protein-folding puzzle. Here, we describe a simple statistical mechanical model for the formation of a β-hairpin, the minimal structural element of the antiparallel β-pleated sheet. The model accurately describes the thermodynamic and kinetic behavior of a 16-residue, β-hairpin-forming peptide, successfully explaining its two-state behavior and apparent negative activation energy for folding. The model classifies structures according to their backbone conformation, defined by 15 pairs of dihedral angles, and is further simplified by considering only the 120 structures with contiguous stretches of native pairs of backbone dihedral angles. This single sequence approximation is tested by comparison with a more complete model that includes the 2^15 possible conformations and 15×2^15 possible kinetic transitions. Finally, we use the model to predict the equilibrium unfolding curves and kinetics for several variants of the β-hairpin peptide.

As is evident from the presentations at this Colloquium, the continuous discovery of thousands of new gene sequences is producing a revolution in all aspects of protein physics, chemistry, and biology. Foremost among these is the protein-folding problem. C. B. Anfinsen, in his Nobel Prize-winning experiments at the National Institutes of Health (1), showed that a denatured protein can refold spontaneously to form a biologically functional (native) structure. From this result, Anfinsen concluded that the information for determining the three-dimensional structure is somehow encoded in the amino acid sequence.
This work has led to the realization that it should in principle be possible to calculate the three-dimensional structure of a protein from its amino acid sequence. Calculating the structure from the sequence has become known as the first part of the protein-folding problem and currently engages a large number of theoretical and computational scientists. The second part of the protein-folding problem is to understand how a protein folds. That is, what are the kinetics and mechanism (or mechanisms) of protein folding? This question is in many ways more challenging because for in vitro folding the ultimate answer is a description of the distribution of three-dimensional structures as a function of time, as the polypeptide progresses from a nearly random set of structures to the unique, compact native protein. An additional motivation for kinetic studies is their relation to the evolution of protein sequences. Evolution preserves protein sequences that correspond to structures with functions that are important to the organism. Theoretical studies by Wolynes and coworkers (2) have suggested how rapid folding to the native structure is yet another evolutionary pressure. The experimental investigation of the kinetics and mechanism of protein folding has been aided by several recent theoretical and technological advances. The theoretical advances include analytical approaches (2–4), simulations of simplified representations of proteins (2, 5–8), and all-atom molecular dynamics calculations (9–11). This work has painted a comprehensive picture of possible general mechanisms and has provided a framework for experimentalists to think more clearly about the problem. It also has helped define questions, design new experiments, and interpret experimental results.
Important technological advances include the availability of a great variety of materials from protein engineering and peptide synthesis, the development of more rapid kinetic methods (12, 13), and increased computer power. The combination of these advances now permits the development of an “aufbau” approach to protein folding. This approach starts with the investigation of isolated secondary structural elements: α-helices, β-structures, and loops. The relative simplicity of these elements should permit their mechanism of formation to be described in much greater detail than is possible for proteins. Such studies include the development of statistical mechanical models that quantitatively reproduce equilibrium populations and kinetic progress curves. Once the kinetics and mechanism of the elements are understood, it should be possible to investigate structures of increasing size and complexity. We have begun to study secondary structural elements by using nanosecond-resolved kinetic methods and statistical mechanical modeling (14). The thermodynamic and kinetic behavior of the α-helix has been studied for more than 40 years (15–20). Only recently, however, have kinetic measurements been made on helices of size and composition comparable with those found in proteins (21–23). Also, early theoretical studies (16, 17) were limited by the lack of computer power, preventing the detailed modeling of experimental kinetic data on helix formation that is now possible (13). The experimental and theoretical study of the kinetics of loops and β-structures is a new subject. Jones et al. (24) and Hagen et al. (25, 26) used a nanosecond photochemical triggering method to study loop formation in cytochrome c by determining the diffusion-limited rate for an intramolecular ligand-binding reaction. We also recently reported a thermodynamic and kinetic study of a β-hairpin formed by the 16 C-terminal residues of streptococcal protein G B1 (Fig. 1) (27).
This peptide had been shown to adopt the β-hairpin conformation by Blanco et al. (29) using NMR spectroscopy. Our β-hairpin experiments consisted of measuring the thermal unfolding curve for the 16-residue peptide between 273 K and 363 K and measuring the relaxation kinetics following 15-degree nanosecond laser temperature jumps to final temperatures ranging from 288 K to 328 K (27). The three principal experimental results from this study were:
*To whom reprint requests may be addressed: e-mail:
[email protected] or
[email protected]. 0027–8424/98/955872–8$0.00/0 PNAS is available online at http://www.pnas.org.
(i) the β-hairpin peptide exhibits two-state behavior in both its equilibrium and kinetics; (ii) the apparent activation energy for the folding rate calculated from the two-state analysis is negative; and (iii) the rate of β-hairpin formation is much (>10-fold) slower than that of the α-helices that have been studied up to now in short peptides.
FIG. 1. Chemical, structural, and schematic representations of the β-hairpin. The sequence corresponds to the C-terminal fragment containing residues 41–56 of protein G B1 (28). Dashed lines indicate hydrogen bonds or hydrophobic interactions.

To explain these results, we used a simple statistical mechanical model that was described only briefly (27). Here, we present a detailed description of the model, test one of its major approximations, and use the model to predict kinetic and equilibrium properties expected for other β-hairpin-forming peptides. We shall see that analysis of β-hairpin thermodynamics and kinetics addresses many of the same issues that arise in considering the folding of a small protein.

Description of the Model. Our objective has been to develop a model for protein secondary structure kinetics that can be used to analyze experimental data and to predict the results of new experiments. In this work, the model is applied to a β-hairpin, but it also can be applied to helices and is readily adapted for more complex structures. We adopt a description that uses pairs of φ, ψ dihedral angles to define the conformation of each molecule; the complete native structure is formed when all of the residues have native values for these angles. Formation of the native structure is opposed by the loss of conformational entropy and favored by the formation of stabilizing interactions, i.e., hydrogen bonds and hydrophobic interactions (Fig. 1). The model postulates that two groups interact only when all of the dihedral angles of the sequence connecting them are native. This restriction considerably simplifies the model by identifying three-dimensional structures with sequences of peptide bond conformations. A second simplifying step is to consider only two conformations for the backbone dihedral angles, native and nonnative (in a spirit similar to the “correct” and “incorrect” parameter of the Zwanzig model; ref. 30).
The nonnative conformation of a dihedral angle pair is not a unique conformation but is the set of all conformations that are incompatible with the native structure. An additional feature of the model is that pairs of ψ, φ dihedral angles are assumed to rotate between native and nonnative values simultaneously.† We chose the dihedral angles ψ of residue i and φ of residue i+1 (Fig. 2) so that the peptide bond, rather than the residue, is the conformational unit. Formation of a backbone-backbone hydrogen bond is therefore associated with the transformation of one pair of ψi, φi+1 angles in each β strand from nonnative to native values.
†When pairs of dihedral angles are used instead of single dihedral angles, the specification of a pair of angles produces a problem in phasing between the loss of entropy and the compensating decrease in interaction free energy. Either choice of φ, ψ pairs represents a compromise. This can be illustrated by considering the formation of a six-residue β-hairpin with a side-chain interaction between residues two and five. To form the backbone-backbone hydrogen bond requires native values for four dihedral angles, φ3, ψ3, φ4, and ψ4. If we were only concerned with hydrogen bond formation, as in helix-coil theory for homopolypeptides, then the natural choice for the dihedral angle pairs would be the φ and ψ associated with the same residue, in this case the two pairs φ3, ψ3 and φ4, ψ4. With this choice, however, formation of the side-chain interaction between residues two and five requires that eight dihedral angles assume native values, when only six, i.e., ψ2, φ3, ψ3, φ4, ψ4, and φ5, actually are required. So, in choosing ψi, φi+1 instead of φi, ψi pairs, we overestimate the loss in entropy associated with formation of the first hydrogen bond, in favor of accurately representing the compensation between entropy loss and formation of side-chain interactions in subsequent steps.

FIG. 2. Choice of dihedral angle pairs for motion in elementary kinetic steps.

In our thermodynamic description of the β-hairpin, we consider only three factors. These are the stabilizing effect of the hydrogen bonds between the backbone carbonyl and amide of the N- and C-terminal β strands, the stabilizing effect of the three hydrophobic interactions among the four side chains of the hydrophobic cluster (Fig. 1), and the destabilizing effect of the loss of conformational entropy when fixing pairs of dihedral angles in the native hairpin conformation. Nonnative interactions, such as wrong hydrogen bonds or hydrophobic interactions, are ignored. We also ignore electrostatic interactions among the charged side chains and chain termini (their importance could be assessed by experiments on the ionic strength dependence of the equilibrium and kinetics, which have not yet been performed). Each thermodynamic factor is considered to be homogeneous, i.e., independent of side chain and position in the native structure. We assume that the free energies of formation for each of the three hydrophobic interactions, ∆Gsc, are identical. Each of the backbone-backbone hydrogen bonds, including the one in the turn region, is assumed to have the same free energy, ∆Ghb. The conformational entropy loss for the strand and turn regions also is assumed to be the same (∆Sconf), which is equivalent to assuming that the residues in the turn have a propensity for this conformation equal to the propensity of the strand residues to be in a strand conformation. To further reduce the number of parameters, we assume that the hydrogen bond is purely enthalpic, i.e., ∆Ghb=∆Hhb, and that the hydrophobic interactions are temperature-independent over the temperature range studied.

For the elementary kinetic steps (motion of individual dihedral angle pairs), we choose a transition state that can be described in terms of the equilibrium thermodynamic parameters. It is natural to assume that there is an entropy barrier to forming a native dihedral angle pair, so we equate the entropy of activation to the equilibrium entropy loss. For some steps, native dihedral angle pair formation is not associated with stabilizing interactions, whereas in others it is associated with the formation of hydrogen bonds or both hydrogen bonds and hydrophobic interactions. We assume that all native interactions are broken in the transition state. Also, we include the possibility that these steps have an activation barrier, Eo, in addition to the barriers imposed by the equilibrium free energy changes. We must next decide how to treat the temperature dependence of the prefactor for these kinetic steps because it will have a significant effect on the height of the potential energy barrier required to fit the kinetic data. In investigating the viscosity dependence of the conformational relaxation rate of myoglobin, Ansari et al. (31, 32) found that the data could be well represented by a preexponential factor proportional to 1/(σ+η), where η is the solvent viscosity and σ is the contribution to the effective friction from interacting protein atoms (4 cP in myoglobin). A much greater fraction of the β-hairpin peptide atoms interact with solvent, so we expect σ to be smaller. Simulations of β-hairpin formation by Klimov and Thirumalai (33) suggest a 1/η dependence (σ=0), so, in the absence of direct experimental data, we use a prefactor proportional to 1/η.
The net result is that the model is completely defined by only five parameters: three equilibrium parameters, ∆Hhb, ∆Gsc, and ∆Sconf, and two kinetic parameters, ko(To), the preexponential factor at the reference temperature, and an activation energy, Eo. A final, major simplifying feature in the model is the single sequence approximation first used by Schellman (34) in describing the helix-coil equilibrium and recently by us in describing helix-coil kinetics of a 21-residue peptide (23). In the single sequence approximation, only species with a contiguous run of native peptide bonds are considered. All other structures are ignored. For the β-hairpin peptide, which has 16 residues (15 peptide bonds), there are 2^15 (=32,768) possible molecular conformations. The single sequence approximation reduces this number to 121. In helix-coil theory, the justification for the single sequence approximation is the expectation that for short polypeptides there is a low probability of nucleating more than one stretch of helix in any individual molecule. For the β-hairpin, we give the justification a posteriori by comparing with a more complete model in which the approximation is not made.

Partition Functions. The nonnative conformation of the peptide bond (coil, c) is taken as the reference state and assigned a weight of 1. The weight of a peptide bond in the native conformation (hairpin, h) is exp(∆Sconf/R), and the weight for a single stretch of j contiguous native peptide bonds, starting with peptide bond i [i.e., the ψ of residue i and φ of residue i+1 (Fig. 2)], is:
$$w_{j,i}=\exp\left[-\left(\Delta G_{j,i}-jT\Delta S_{conf}\right)/RT\right];\qquad \Delta G_{j,i}=p\,\Delta H_{hb}+q\,\Delta G_{sc} \tag{1}$$
where p is the number of backbone-backbone hydrogen bonds and q is the number of side-chain hydrophobic interactions in the native stretch. In this model, there are 2^15 conformations for the 16-residue hairpin arising from all of the possible combinations of hs and cs. The weight of each of these conformations is simply the product of the weights of each of the native stretches that it contains, and the partition function is the sum of the 2^15 weights. The model can be greatly simplified by considering only those species that contain a single stretch of native peptide bonds (the “standard” single sequence approximation). This simplification results in a model with only 121 species with the partition function:
$$Q = 1 + \sum_{j=1}^{n}\sum_{i=1}^{n-j+1} w_{j,i} \tag{2}$$
where n+1 is the total number of residues (16 in this β-hairpin). The equilibrium probability of the all-coil conformation is P0,0=1/Q, and the equilibrium probability for all other conformations is Pj,i=wj,i/Q. To test the accuracy of this standard single sequence approximation, we compared the equilibrium curves of the model with and without this approximation in Fig. 3. The approximation significantly overestimates the fraction of folded hairpin. This problem arises because the standard single sequence approximation does not properly account for the entropy of the system, as has been discussed by Qian and Schellman (35) for the helix-coil transition. The population of each of the 32,647 ignored species [such as cchcchcccchcccc with a weight of exp(3∆Sconf/R)] is quite small, but because their number is large, their contribution to the entropy of the system is significant. In particular, most of the ignored species do not contain significant hairpin structure, and ignoring them underestimates the stability of the unfolded hairpin. The number of species ignored by the standard single sequence approximation grows geometrically with peptide length, precluding its application to molecules of different length. The underestimation of the entropy can be minimized by defining a “coil” state that includes not only the all-c species (ccccccccccccccc) but also all the possible combinations of h and c peptide bonds that do not have just one single native stretch. For all those conformations, we ignore native interactions (even for a species such as ccchhhhhhhhhchc, which has the backbone conformation of the β-turn as well as residues
FIG. 3. Comparison of thermal unfolding curve for the β-hairpin predicted by standard single sequence (121 species) and complete (32,768 species) models. The fractional population of molecules containing the intact hydrophobic cluster is plotted vs. temperature. The points are derived from a two-state analysis of the fluorescence equilibrium curves. The dashed curve is the fit to the data using the standard single sequence (121-state model) partition function (Eqs. 1 and 2). The continuous curve is predicted by the 2^15-state partition function using the parameters from the fit with the standard single sequence model (∆Sconf=–3.09 cal mol–1 K–1, ∆Hhb=–0.86 kcal mol–1, ∆Gsc=–2.19 kcal mol–1).
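Y45 and F52 in position to make a hydrophobic interaction). The weight of the coil state now becomes:
$$w_{0,0} = \left[1+\exp\left(\Delta S_{conf}/R\right)\right]^{n} - \sum_{j=1}^{n}\sum_{i=1}^{n-j+1}\exp\left(j\Delta S_{conf}/R\right) \tag{3}$$
where the second term eliminates the contribution to the coil state by conformations with a single stretch of native peptide bonds (see Eq. 1). The partition function in this “modified” single sequence approximation is:
$$Q = w_{0,0} + \sum_{j=1}^{n}\sum_{i=1}^{n-j+1} w_{j,i} \tag{4}$$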
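The species counts quoted above, and the idea behind the coil-state weight, can be checked by brute force. The Python sketch below is illustrative only. It assumes that a conformation without native interactions and with m native peptide bonds has weight exp(m∆Sconf/R), and that the coil-state weight is the sum of these weights over every conformation that does not contain exactly one native stretch, which in closed form is (1+s)^n minus the sum over j of (n−j+1)s^j, with s=exp(∆Sconf/R). The function names are hypothetical; the value of ∆Sconf is the fitted one quoted in the Fig. 3 legend.

```python
from itertools import product
from math import exp, isclose

R = 1.987        # gas constant, cal mol^-1 K^-1
N_BONDS = 15     # peptide bonds in the 16-residue hairpin
DS_CONF = -3.09  # cal mol^-1 K^-1, fitted value quoted in the Fig. 3 legend


def native_stretches(conf):
    """Number of separate runs of 'h' in a string such as 'cchhhcc'."""
    return len([run for run in conf.split("c") if run])


def classify(n):
    """Count conformations with zero, one, and more than one native stretch."""
    counts = {"all_coil": 0, "single": 0, "multiple": 0}
    for bits in product("ch", repeat=n):
        k = native_stretches("".join(bits))
        key = "all_coil" if k == 0 else "single" if k == 1 else "multiple"
        counts[key] += 1
    return counts


def coil_weight_closed_form(n, s):
    """(1 + s)^n minus the interaction-free weight of each single-stretch species."""
    return (1 + s) ** n - sum((n - j + 1) * s ** j for j in range(1, n + 1))


def coil_weight_brute_force(n, s):
    """Sum s^(number of h) over all-coil and multi-stretch conformations."""
    total = 0.0
    for bits in product("ch", repeat=n):
        conf = "".join(bits)
        if native_stretches(conf) != 1:
            total += s ** conf.count("h")
    return total


counts = classify(N_BONDS)
s = exp(DS_CONF / R)  # weight of one native peptide bond without interactions
w_closed = coil_weight_closed_form(N_BONDS, s)
w_brute = coil_weight_brute_force(N_BONDS, s)

print(counts)
print(w_closed, w_brute)
```

The enumeration gives 1 all-coil, 120 single-stretch, and 32,647 remaining conformations, which together make up the 2^15 total. The closed-form and brute-force coil weights agree to within floating-point error.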
Rate Equations. To transform the equilibrium description of the model with the modified single sequence approximation into a kinetic model, we begin by assuming that conformations are connected if they can be interconverted by single h→c or c→h transitions. Species that contain a single stretch of native peptide bonds are connected to those other species that contain one more or one less native peptide bond at either end of the stretch. The rate constant for adding native peptide bond i to a native stretch, ki+, is given by:
[5] where ko is the preexponential factor at the reference temperature To, ηo is the solvent viscosity at To, and Eo is the activation energy for rotation of the peptide bond. The rate constants for removing native peptide bonds i or i+j–1 from a native stretch of length j that starts at residue i (Fig. 2) are given by:
[6]
It is less straightforward to treat the contribution to the overall kinetics of the system of those additional species that now have been included in the coil state. For example, a coil conformation such as cchhhchhccccccc can convert to a single sequence conformation cchhhhhhccccccc by a single c→h transition. We assume that the rate for this process is equal to k6+ (Eq. 5) times the probability of finding this particular conformation within the coil state [i.e., exp(5∆Sconf/R)/w0,0 for the above example]. We then can define an overall rate that is the summation of the rates for all possible transitions between the coil state and each conformation with a single native stretch. The overall rate for going from the coil state to a conformation with a stretch of j native peptide bonds starting at residue i is given by:
[7]
[8]
where
Using these rates (Eqs. 5–8), the population of the 121 molecular species of the model as a function of time is described by the following set of master equations:
[9]
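A master equation of this kind is a linear system dP/dt = KP, where the off-diagonal elements of K are rate constants and each column sums to zero. The following is a minimal sketch on a toy three-state chain, not the 121-species model: the state enthalpies and entropies, the Metropolis-type rate rule, and the explicit Euler integrator are all assumptions made for illustration (the rate expressions of Eqs. 5–8 are not reproduced here). It shows that rates obeying detailed balance relax the populations to the Boltzmann distribution at the final temperature.

```python
import math

R = 1.987e-3  # gas constant, kcal mol^-1 K^-1

# Hypothetical (enthalpy, entropy) per state, in kcal/mol and kcal/(mol K).
STATES = [(0.0, 0.0), (-1.0, -0.006), (-10.0, -0.033)]


def free_energies(temperature):
    return [h - temperature * s for h, s in STATES]


def equilibrium(temperature):
    """Boltzmann populations of the toy states."""
    weights = [math.exp(-g / (R * temperature)) for g in free_energies(temperature)]
    total = sum(weights)
    return [w / total for w in weights]


def rate_matrix(temperature, k0=1.0):
    """K[j][i] is the rate constant for i -> j between neighboring states.

    The Metropolis-type rule used here is an assumption; it is chosen only
    because it satisfies detailed balance by construction.
    """
    g = free_energies(temperature)
    n = len(g)
    k = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        j = i + 1
        up = k0 * min(1.0, math.exp(-(g[j] - g[i]) / (R * temperature)))    # i -> j
        down = k0 * min(1.0, math.exp(-(g[i] - g[j]) / (R * temperature)))  # j -> i
        k[j][i] += up
        k[i][i] -= up
        k[i][j] += down
        k[j][j] -= down
    return k


def relax(p, k, dt, steps):
    """Explicit Euler integration of dP/dt = K P (adequate for this tiny system)."""
    n = len(p)
    for _ in range(steps):
        dp = [sum(k[a][b] * p[b] for b in range(n)) for a in range(n)]
        p = [p[a] + dt * dp[a] for a in range(n)]
    return p


# Start from equilibrium at 283 K, then relax with rates evaluated at 298 K.
p_start = equilibrium(283.0)
p_target = equilibrium(298.0)
p_final = relax(p_start, rate_matrix(298.0), dt=0.01, steps=10000)
```

For a large, stiff system an implicit integrator would be needed; explicit Euler is used here only to keep the sketch self-contained.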
Despite its complexity, this treatment of the kinetics maintains detailed balance. Moreover, it implicitly includes all the kinetic connections involving single h→c or c→h transitions for each of the 120 single sequence species without increasing the size of the rate matrix. The physical description in this approximation is, however, somewhat artificial. For example, our definition of the coil species requires that a c→h transition that does not occur at the end of a native stretch (such as ccchhhhhhhhcccc→ccchhhhhhhhchcc) transforms the molecule back to the coil state instead of closer to the fully formed hairpin. However, an additional single transition (ccchhhhhhhhchcc→ccchhhhhhhhhhcc) returns the molecule to a more complete hairpin conformation.
Test of Modified Single Sequence Approximation. We tested the modified single sequence approximation by comparing it with a “complete” model that considers all 2^n (= 2^15 = 32,768) possible conformations explicitly. To perform the test, we fit the experimental data with the modified single sequence approximation model to obtain parameters that were then used in simulations using the complete model. The fit and simulations of the equilibrium data are shown in Fig. 4a. The equilibrium description is rather similar for both models, in contrast to the standard single sequence approximation (Fig. 3). This result confirms our interpretation that underestimation of the entropy of the unfolded ensemble is the main deficiency in the standard single sequence approximation. In the modified single sequence approximation, however, there is a small overestimation of the fraction of unfolded hairpin. The major contribution to this difference is the small subset of species that have significant β-hairpin structure (including stabilizing interactions) but are counted as species of the coil state (which have no stabilizing interactions) in the modified single sequence approximation.
We also tested the kinetic description with simulations carried out with the complete model, in which there are n2^n (= 491,520) possible kinetic transitions. Only 450 of those are explicitly included in the modified single sequence approximation model. The fitting to the kinetic experiments with this model was performed by floating its five parameters to produce the best least-squares fit to the observed progress curves for the nine experimental temperature jumps between 288 and 328 K (Fig. 4c). This was carried out using the equilibrium populations at the initial temperatures (before the T-jump) and integrating the rate equations (Eq. 9) by using rate constants evaluated at the final temperatures (after the T-jump). These parameters were then used in kinetic simulations with the complete model. The rate matrix for this model was constructed using an automatic pattern-matching algorithm (E. R. Henry, unpublished data).‡
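The state and transition counts quoted above can be checked by direct enumeration. The sketch below is one counting scheme that reproduces them, assuming n = 15 peptide bonds, directed transitions (forward and reverse counted separately), and an explicit coil connection only to the one-bond stretches; the function names are illustrative and not taken from the original work.

```python
def single_sequence_species(n):
    """Coil state plus every contiguous native stretch, labeled (start, length)."""
    stretches = [(start, length)
                 for length in range(1, n + 1)
                 for start in range(n - length + 1)]
    return ["coil"] + stretches


def single_sequence_transitions(n):
    """Directed transitions that add or remove one native bond at a stretch end."""
    count = n  # coil -> each of the n one-bond stretches
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            if start > 0:              # grow at the left end
                count += 1
            if start + length < n:     # grow at the right end
                count += 1
            # shrink: either end for longer stretches, one route to coil otherwise
            count += 2 if length > 1 else 1
    return count


def complete_model_counts(n):
    """All h/c strings of length n, and all directed single h<->c flips."""
    conformations = 2 ** n
    transitions = n * conformations  # each conformation has n single-flip neighbors
    return conformations, transitions


n_bonds = 15
species = len(single_sequence_species(n_bonds))      # 121
explicit = single_sequence_transitions(n_bonds)      # 450
conformations, flips = complete_model_counts(n_bonds)  # 32768, 491520
```

Adding the 32,768 diagonal elements to the 491,520 off-diagonal ones gives roughly 5×10^5 nonzero entries in a matrix with about 10^9 elements, which is why sparse methods are natural for the complete model.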
FIG. 4. Comparison of thermal unfolding curves and kinetics for modified single sequence and complete models. (a) Fractional population of the hydrophobic cluster as a function of temperature. Derived from a two-state analysis of fluorescence equilibrium curves (large dots). Fit to the data with the modified single sequence model (Eqs. 3 and 4), producing the parameters ∆Sconf=–2.74 cal mol–1 K–1, ∆Hhb=–0.96 kcal mol–1, ∆Gsc=–1.94 kcal mol–1 (dashed line). Calculated with the complete model using these parameters (continuous line). Fraction of native hydrogen bonds calculated using the model with modified single sequence approximation (dotted line). (b) Simulations of progress curves for the complete model (continuous line) and the model using the modified single sequence approximation (dotted line). The fractional population of the hydrophobic cluster vs. time is plotted following a temperature jump from 283 to 298 K. The dashed lines are single exponential fits to the simulated progress curves at times >10 ns, the resolution of the T-jump instrument. The fits of the modified single sequence model to the kinetic data were performed using the LSODA routine (36), which incorporates algorithms for solving both stiff and nonstiff systems of equations. The resulting parameters were k0=8.0×10^8 s–1 and E0=0 (equilibrium parameters same as in a). The equilibrium and kinetic parameters are slightly different from those reported by Munoz et al. (27) for two reasons. One is that in the previous work the viscosity dependence was not included in the preexponential factor, and the second is that in the present work the kinetic and equilibrium data were fit simultaneously, whereas in the previous analysis (27), the equilibrium data were fit independently. (c) Arrhenius plot of relaxation times following 15 degree temperature jumps.
The points are the experimental relaxation rates, whereas the dashed curve through the points is obtained from the fit to the data using the modified single sequence model. The continuous curve is obtained from single exponential fits to the kinetic progress curves generated by the complete 2^15-state model using the kinetic parameters from the modified single sequence model.
The results for the two models are very similar, with fluorescence progress curves that can be represented as a biexponential process in each case. There is initially a small-amplitude phase, corresponding to very rapid reequilibration among conformations in the global free energy minimum of the folded state, followed by a slower large-amplitude phase, corresponding to crossing of the free energy barrier separating the folded and unfolded states (see Fig. 5 and below). Overall, the agreement between the two models must be considered very good and justifies the use of the modified single sequence approximation. There are, however, significant differences, and the relaxation rates for the major phase (the only one detected experimentally) are about a factor of three faster in the complete model (Figs. 4 b and c). This effect is produced because the modified single sequence approximation ignores the stabilizing interactions in the rates connecting conformations included in the coil state with the conformations in the folded state (with a stretch of seven or more native peptide bonds). For example, the transition cccchhhhhhchccc→cccchhhhhhhhccc is less probable in the simpler model because ignoring the two hydrogen bonds of the starting conformation lowers its population by a factor of 25 [=exp(–2∆Hhb/RT)].
Predictions for Other β-Hairpins. An important consequence of having a statistical mechanical model for β-hairpin formation is that it can be used to make specific predictions that can be tested experimentally.
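The factor of 25 quoted above for the population of a coil conformation carrying two hydrogen bonds can be reproduced with a short calculation. The sketch below uses ∆Hhb = –0.96 kcal mol–1 from the Fig. 4 caption; the temperature of 298 K is an assumption (the final temperature of the jump in Fig. 4b), as the text does not state which temperature was used.

```python
import math

R_KCAL = 1.987e-3  # gas constant, kcal mol^-1 K^-1


def boltzmann_factor(delta_h_kcal, n_bonds, temperature_k):
    """exp(-n * dH / RT): weight gained from n hydrogen bonds of enthalpy dH."""
    return math.exp(-n_bonds * delta_h_kcal / (R_KCAL * temperature_k))


# Two hydrogen bonds at an assumed temperature of 298 K.
factor = boltzmann_factor(-0.96, 2, 298.0)
print(round(factor, 1))
```

At this temperature the result is between 25 and 26, consistent with the rounded value in the text.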
A useful way of examining the results of the model is to consider the free energy as a function of the fraction of native peptide bonds, its natural reaction coordinate (Figs. 5a and 6a). The model postulates that formation of a β-hairpin in the absence of side-chain
‡The system of equations is stiff and was integrated using an iterative multi-step backward differentiation formula method (37), as implemented in the CVODE package (36, 38). This algorithm requires the solution of a set of nonlinear algebraic equations by Newton iteration at each time step. Each Newton iteration in turn requires solving an N×N linear system A∆P=residual, where the matrix A is derived from the rate matrix K. For N=32,768, this problem is rather too large to solve using standard methods (39). However, the matrix A is sparse, containing only 500,000 nonzero elements of a possible 10^9. Therefore, an iterative generalized minimal residual method (40) appropriate for large sparse linear systems, as implemented in the CVODE package (36, 38), was used. The performance of the algorithm was improved dramatically in this application by Jacobi (diagonal) preconditioning or very simple block-diagonal preconditioning (40).
interactions is continuously uphill in free energy because backbone hydrogen bonds do not compensate for the loss in conformational entropy of forming native peptide bonds in both β-strands. Side-chain interactions are, therefore, necessary for the stability of the hairpin and determine the position and height of the free energy barrier for hairpin formation. This hairpin is stabilized by a cluster of four hydrophobic side chains (W43, Y45, F52, and V54), making three hydrophobic interactions (Fig. 1). Based on our model, the folding free energy barrier for this hairpin is crossed when the seven central peptide bonds become native and the first hydrophobic interaction (between residues Y45 and F52) is formed.
FIG. 5. Prediction of equilibrium and kinetic properties of β-hairpins with additional interactions. Red, original hairpin; blue, hairpin with interaction in the β-turn (residues D47–K50); and green, hairpin with interaction between end residues (residues R41–E56). (a) Free energy profiles (not including kinetic barriers). (b) Population of the hydrophobic cluster (continuous lines) and fraction of hydrogen bonds (dotted lines) for the three hairpins. (c) Arrhenius plot of the kinetics of the three hairpins. Relaxation rates (continuous lines), folding rates (dotted lines), and unfolding rates (dashed-dotted lines). The folding and unfolding rates have been calculated from a two-state fit to the relaxation rates and equilibrium constants generated with the modified single sequence model.
Many of the predictions of the model are immediately apparent upon examination of the free energy profile. The existence of two global minima separated by a significant free energy barrier (Fig. 5a) explains the two-state behavior and exponential kinetics. The species at the barrier maximum has two backbone-backbone hydrogen bonds and therefore a lower energy than the coil state, explaining the apparent negative activation energy in the two-state analysis (assuming a simple Arrhenius expression for the rate constants with a temperature-independent prefactor). The global minimum on the folded side of the free energy barrier consists of several molecular conformations, with the species at the lowest free energy having the intact hydrophobic cluster but not the maximum number of backbone-backbone hydrogen bonds and native peptide bonds. This result could explain why the population of the hydrophobic cluster obtained by fitting fluorescence data is higher than the fraction of native dihedral angles estimated by NMR (29).
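The Fig. 5 caption states that folding and unfolding rates were obtained from a two-state fit to relaxation rates and equilibrium constants. For a two-state system, the observed relaxation rate is the sum of the folding and unfolding rates and the equilibrium constant is their ratio, so the two rates follow directly. The sketch below illustrates that standard decomposition; it assumes K is defined as folded over unfolded population, and the function name and input numbers are hypothetical rather than values from the paper.

```python
def two_state_rates(k_relax, k_eq):
    """Split a relaxation rate into folding and unfolding rates.

    Assumes k_relax = k_fold + k_unfold and k_eq = k_fold / k_unfold
    (k_eq defined as [folded]/[unfolded]).
    """
    k_fold = k_relax * k_eq / (1.0 + k_eq)
    k_unfold = k_relax / (1.0 + k_eq)
    return k_fold, k_unfold


# Hypothetical inputs: relaxation rate in s^-1 and a dimensionless equilibrium constant.
k_fold, k_unfold = two_state_rates(k_relax=2.0e5, k_eq=3.0)
```

With these inputs three quarters of the relaxation rate is assigned to folding, as expected when the folded state is three times more populated than the unfolded state.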
An interesting prediction of this model of β-hairpin formation is that local and long-range interactions have very different effects on the free energy surface of the hairpin and, therefore, on its equilibrium and kinetic properties. To illustrate this point, we have performed simulations with two variants. In one of these variants, we include a side-chain interaction, which could result from a favorable electrostatic interaction between D47 and K50, which stabilizes the β-turn by 1 kcal/mol. In the other variant, a similar interaction is introduced between the first and last residues in the hairpin by mutating “in machina” glycine 41 to arginine. This computational experiment is similar in spirit to the protein engineering approach to folding kinetics (41–43). When positioned in the β-turn, the interaction is local (between residues i and i+3), and it significantly affects both the thermodynamics and kinetics of hairpin formation by lowering the free energy of all states that contain native interactions (Fig. 5a). Both the population of species with the hydrophobic cluster and the fraction of hydrogen bonds increase at all temperatures, and the Tm increases by ~20 K (Fig. 5b). The folding rate is accelerated by about a factor of four, whereas the unfolding rate is slightly decelerated (Fig. 5c). Because the peak of the free energy barrier stays at the same position along the reaction coordinate, the change in rates results simply from the change in the barrier height, as is commonly assumed in interpreting the effects of single residue perturbations in protein folding. When the interaction is introduced between the end residues, it is long range (i, i+15) and its effects on the folding properties are rather insignificant. The Tm changes by only ~2 K, and the change in rates is very small, with the largest change in the unfolding rate.
Thus, the simulations suggest that, in hairpins, the interactions closest to the β-turn exert the largest effect on the folding rate; interactions between the ends of the strands may stabilize the hairpin structure but have very little effect on the folding rate. Another important point raised by our model of β-hairpin formation is that the shape of the free energy barrier and the position of its maximum along the reaction coordinate are determined by a delicate balance between the loss in conformational entropy and stabilization from side-chain interactions. To address this point, we have simulated two variants of the original β-hairpin: a hairpin with the hydrophobic cluster placed one residue closer to the center of the molecule (W44, Y46, F51, V53) and another one with the hydrophobic cluster one residue closer to the ends (W42, Y44, F53, V55). The effects of these changes on the equilibrium and kinetic properties are shown in Figs. 6 b and c, respectively. Moving the hydrophobic cluster one residue in either direction does not modify the interaction energies of the hairpin; however, the
model predicts a dramatic effect on its free energy profile (Fig. 6a). If the cluster is moved closer to the β-turn, both the minimum in the folded ensemble and the top of the free energy barrier are shifted toward less structure (closer to the unfolded ensemble). This is accompanied by an increase in stability, an acceleration of the folding rate, and almost no change in the unfolding rate. If the cluster is moved one residue in the opposite direction (toward the ends), the stability is decreased and the folding rate decreases, and there is only a small change in the unfolding rate. Displacements of the top of the free energy barrier have been reported in folding experiments on small proteins (44). The model indicates that, for a β-hairpin, the top of the free energy barrier is simply determined by the position of the stabilizing side chains in the sequence.
FIG. 6. Prediction of equilibrium and kinetic properties of β-hairpins with repositioned hydrophobic cluster. Red, original hairpin; blue, hairpin with hydrophobic cluster moved one residue closer to the β-turn; and green, hairpin with hydrophobic cluster moved one residue closer to the ends. (a) Free energy profiles. (b) Population of the hydrophobic cluster (continuous lines) and fraction of hydrogen bonds (dotted lines) for the three hairpins. (c) Arrhenius plot of the kinetics of the three hairpins. Relaxation rates (solid lines), folding rates (dotted lines), and unfolding rates (dashed-dotted lines). The folding and unfolding rates have been calculated as in Fig. 5.
Caveats. One criticism of the model that we have presented is that only native interactions are considered. This excludes the possibility, for example, of forming a turn at additional positions in the sequence, which would result in nonnative hydrogen bonds and nonnative hydrophobic interactions. There is no evidence in the NMR data of significant population of other hairpin conformations (29). Nonnative interactions also can affect the kinetics in the same two ways as in proteins (2). They can produce local minima in the energy landscape, which can result in the population of intermediate structures at equilibrium, or they can produce transient trapping of misfolded structures, which are not present at equilibrium. We have not yet found any evidence for equilibrium intermediates in the folding of the β-hairpin. For transient trapping to be observable as a separate kinetic phase, the residence time in the trapped state must be longer than the relaxation time for the overall hairpin-coil transition. Transient trapping does not appear to be occurring in this β-hairpin because the progress curves at all temperatures can be well fit with a single exponential function (27).
Another criticism of the model is that there are no native backbone or side-chain interactions between two residues unless the peptide bonds of all intervening residues have the native conformation. This postulate excludes the possibility, for example, of initiation by forming the hydrophobic cluster followed by zipping up of the hydrogen bonds. The transition state in this mechanism would be a ~10-residue loop. One is tempted by this mechanism because of the close correspondence of the β-hairpin relaxation time and the time of ~1 µs estimated by Hagen et al. (25, 26) to form a 10-residue loop. A 10-residue loop is predicted by Thirumalai (45) to be the most probable loop size in proteins, longer loops being less probable because of the larger entropy loss and shorter loops because of chain stiffness. One could possibly distinguish between the two mechanisms by measuring the kinetics for a β-hairpin in which the hydrophobic cluster is moved closer to the β-turn. Our model predicts that the rate of formation should speed up because the transition state now occurs earlier along the reaction coordinate (Fig. 6a), whereas the loop model would predict a slower rate of formation. This consideration was in fact one of the motivations for the predictions discussed above. The most convincing test of the model will of course come from measurements on other β-hairpin peptides. Another approach to both testing and refining the model is to examine the results of simulations. All-atom molecular dynamics simulations of temperature jump kinetic experiments (at experimental temperatures) may be feasible in the near future (10, 11). In the meantime, it should be useful to examine the results of Langevin simulations of simplified representations of the peptide (33). Because large numbers of sufficiently long trajectories are possible with this method, kinetic progress curves actually can be simulated. 
Examination of these trajectories might reveal dominant mechanisms and structural species that must be included in a kinetic model. The results will, however, depend critically on the choice of potential functions.
Is the model unnecessarily complex? Could we use a much simpler model, even a two-state model? The main problem with a two-state model is that it has little predictive value. In a two-state model, one would postulate a transition state structure or, as in the case of proteins, try to determine the transition state by structural perturbation experiments. The examples in Fig. 6 show that small structural perturbations lead to changes in the transition state and would therefore also result in incorrect predictions for the change in rates. Nevertheless, the model could be simplified. If we consider only residue-residue interactions, then in the single sequence approximation, the model reduces to an eight-state model. In
such a model, quartets of dihedral angles change simultaneously in single kinetic steps. This model can explain the experimental data, as well as predict the properties of other β-hairpins. It is, however, not straightforward to extend a model based on interactions to more complex structures because of the difficulty in defining rules that specify the rates of all elementary kinetic steps in terms of just a few parameters.
We thank Attila Szabo and Peter Wolynes for helpful comments on the manuscript.
1. Anfinsen, C.B. (1973) Science 181, 223–230.
2. Bryngelson, J.D., Onuchic, J.N., Socci, N.D. & Wolynes, P.G. (1995) Proteins Struct. Funct. Genet. 21, 167–195.
3. Bryngelson, J.D. & Wolynes, P.G. (1987) Proc. Natl. Acad. Sci. USA 84, 7524–7528.
4. Orland, H., Garel, T. & Thirumalai, D.T. (1996) in Recent Developments in Theoretical Studies of Proteins, ed. Elber, R. (World Scientific, Singapore), pp. 197–268.
5. Dill, K.A., Bromberg, S., Yue, K., Fiebig, K.M., Yee, D.P., Thomas, P.D. & Chan, H.S. (1995) Protein Sci. 4, 561–602.
6. Karplus, M. & Sali, A. (1995) Curr. Opin. Struct. Biol. 5, 58–73.
7. Shakhnovich, E.I. (1997) Curr. Opin. Struct. Biol. 7, 29–40.
8. Klimov, D.K. & Thirumalai, D. (1996) Proteins Struct. Funct. Genet. 26, 411–441.
9. Guo, Z., Brooks, C.B. & Boczko, E.M. (1997) Proc. Natl. Acad. Sci. USA 94, 10161–10166.
10. Li, A. & Daggett, V. (1996) J. Mol. Biol. 257, 412–429.
11. Lazaridis, T. & Karplus, M. (1997) Science 278, 1928–1931.
12. Eaton, W.A., Thompson, P.A., Chan, C.-K., Hagen, S.J. & Hofrichter, J. (1996) Structure (London) 4, 1133–1139.
13. Gray, H.B. & Valentine, J.S. (1998) Acc. Chem. Res., in press.
14. Eaton, W.A., Hofrichter, J., Munoz, V. & Thompson, P.A. (1998) Acc. Chem. Res., in press.
15. Zimm, B., Doty, P. & Iso, K. (1959) Proc. Natl. Acad. Sci. USA 45, 1601–1607.
16. Schwarz, G. (1965) J. Mol. Biol. 11, 64–77.
17. Poland, D. & Scheraga, H.A. (1970) in Theory of Helix-Coil Transitions in Biopolymers (Academic, New York).
18. Gruenewald, B., Nicola, C.U., Lustig, A. & Schwarz, G. (1979) Biophys. Chem. 9, 137–147.
19. Chakrabartty, A. & Baldwin, R.L. (1995) Adv. Protein Chem. 46, 141–176.
20. Munoz, V. & Serrano, L. (1995) Curr. Opin. Biotech. 6, 382–386.
21. Williams, S., Causgrove, T.P., Gilmanshin, R., Fang, K.S., Callender, R.H., Woodruff, W.H. & Dyer, R.B. (1996) Biochemistry 35, 691–697.
22. Gilmanshin, R., Williams, S., Callender, R.H., Woodruff, W.H. & Dyer, R.B. (1997) Biochemistry 36, 15006–15012.
23. Thompson, P.A., Eaton, W.A. & Hofrichter, J. (1997) Biochemistry 36, 9200–9210.
24. Jones, C.M., Henry, E.R., Hu, Y., Chan, C.-K., Luck, S.D., Bhuyan, A.K., Roder, H., Hofrichter, J. & Eaton, W.A. (1993) Proc. Natl. Acad. Sci. USA 90, 11860–11864.
25. Hagen, S.J., Hofrichter, J., Szabo, A. & Eaton, W.A. (1996) Proc. Natl. Acad. Sci. USA 93, 11615–11617.
26. Hagen, S.J., Hofrichter, J. & Eaton, W.A. (1997) J. Phys. Chem. 100, 12008–12021.
27. Munoz, V., Thompson, P.A., Hofrichter, J. & Eaton, W.A. (1997) Nature (London) 390, 196–199.
28. Gronenborn, A.M., Filpula, D.R., Essig, N.Z., Achari, A., Whitlow, M., Wingfield, P.T. & Clore, G.M. (1991) Science 253, 657–661.
29. Blanco, F.J., Rivas, G. & Serrano, L. (1994) Nat. Struct. Biol. 1, 584–590.
30. Zwanzig, R. (1995) Proc. Natl. Acad. Sci. USA 92, 9801–9804.
31. Ansari, A., Jones, C.M., Henry, E.R., Hofrichter, J. & Eaton, W.A. (1992) Science 256, 1796–1798.
32. Ansari, A., Jones, C.M., Henry, E.R., Hofrichter, J. & Eaton, W.A. (1994) Biochemistry 33, 5128–5145.
33. Klimov, D.K. & Thirumalai, D. (1997) Phys. Rev. Lett. 79, 317–320.
34. Schellman, J.A. (1958) J. Phys. Chem. 62, 1485–1494.
35. Qian, H. & Schellman, J.A. (1992) J. Phys. Chem. 96, 3987–3994.
36. Hindmarsh, A.C. & Petzold, L.R. (1995) Comput. Phys. 9, 148–155.
37. Hairer, E. & Wanner, G. (1996) Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems (Springer, Berlin), 2nd Ed.
38. Cohen, S.C. & Hindmarsh, A.C. (1994) CVODE User Guide (Lawrence Livermore Natl. Lab, Livermore, CA).
39. Lawson, C.L. & Hanson, R.J. (1974) Solving Least Squares Problems (Prentice-Hall, Englewood Cliffs, NJ).
40. Barrett, R., Berry, M., Chan, T.F., Demmel, J., Donato, J.M., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C. & Van der Vorst, H. (1993) Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods (Soc. Indust. Appl. Math., Philadelphia).
41. Fersht, A.R., Matouschek, A. & Serrano, L. (1992) J. Mol. Biol. 224, 771–782.
42. Itzhaki, L.S., Otzen, D.E. & Fersht, A.R. (1995) J. Mol. Biol. 254, 260–288.
43. Onuchic, J., Socci, N.D., Luthey-Schulten, Z. & Wolynes, P.G. (1996) Fold. Des. 1, 441–450.
44. Silow, M. & Oliveberg, M. (1997) Biochemistry 36, 7633–7637.
45. Camacho, C.J. & Thirumalai, D. (1995) Proc. Natl. Acad. Sci. USA 92, 1277–1281.
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5880–5883, May 1998 Colloquium Paper This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Coupling the folding of homologous proteins
CHEN KEASAR*†, DROR TOBI*, RON ELBER*‡, AND JEFF SKOLNICK§
*Department of Physical Chemistry, Department of Biological Chemistry, Fritz Haber Research Center for Molecular Dynamics and Wolfson Center for Applied Structural Biology, Hebrew University, Givat Ram, Jerusalem 91904, Israel; †Department of Structural Biology, Stanford School of Medicine, Stanford University, Stanford, CA 94305; and §Department of Molecular Biology, Scripps Research Institute, La Jolla, CA 92037
ABSTRACT The empirical observation that homologous proteins fold to similar structures is used to enhance the capabilities of an ab initio algorithm to predict protein conformations. A penalty function that forces homologous proteins to look alike is added to the potential and is employed in the coupled energy optimization of several homologous proteins. Significant improvement in the quality of the computed structures (as compared with the computational folding of a single protein) is demonstrated and discussed.
It is convenient to classify methods of predicting protein conformations into one of two main categories: (a) methods that optimize energy functions and (b) methods that search through databases of protein structures. In the present manuscript, we call a “energy minimization methods” and b “homology.” The division is not sharp. For example, many of the energy functions that are used in the prediction of three-dimensional conformations of proteins include information extracted from databases on protein structures. The conformation of the global energy minimum, even if we succeed in finding it, may differ from the native fold for two possible reasons: (1) the empirical energy is inaccurate, and (2) the native fold does not correspond to a global free energy minimum. To address point 1, an adjustment of the energy function may follow, whereas to address point 2 the folding pathways (and not only the most stable state) are required.
We propose below a combination of the homology and the energy approaches. The combination improves the prediction of structures of homologous proteins even if their conformations do not correspond to a global minimum of the individual molecules. In the homology approach, an empirical observation on databases of protein structures is employed: Proteins with comparable sequences adopt a similar structure in the native configuration. This information is used to build models of unknown structures. The required degree of similarity between sequences is uncertain, but a bet with a significant safety margin is 40% sequence identity. A model of a protein with an unknown structure can be built by using an experimental structure of a protein with a comparable sequence. The homology protocol is the most accurate approach we have today to model protein structures on the computer. Its disadvantage is the necessity of having a similar sequence with a known structure. In this manuscript, we describe a connection between the two approaches that improves the performance of energy optimization techniques while maintaining their generality. In the next section, we describe the algorithm; an example for a “real” protein follows. Finally, we explain why the suggested coupling optimizes better than straightforward annealing. We suggest two reasons for the improvement. The first is smoothing of the potential energy surface, which makes it more accessible to global optimization. The second is averaging in sequence space (over the homologous proteins), which enhances weak signals.
The Algorithm. We consider N homologous proteins with sequence identity sufficient to suggest structural similarity. The structure of the family is unknown, making the “energy minimization” approach the right choice to predict the structure (it is the only choice).
The N sequences are aligned, using established sequence alignment techniques (1). Here, we assume that the alignment is adequate. The example discussed below did not include deletions or insertion of amino acids into the sequence. However, an extension that includes deletions and insertions is straightforward. The energy of a homologous set of proteins. An energy function, Etotal, is defined, which includes the sum of all the individual energies of the homologous proteins and a coupling term that penalizes structural diversity.
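The verbal definition above (a sum of the individual energies plus a pairwise coupling term) suggests the following form for the total energy. This is a reconstruction under an assumption, namely that each pair of proteins is counted once; it should not be read as the published equation, whose exact normalization is not recoverable here. The symbols are those defined in the text that follows.

```latex
E_{\mathrm{total}}
  = \sum_{i=1}^{N} \varepsilon_i(X_i)
  + \sum_{i<j} \Delta_{ij}(X_i, X_j)
```

With this counting, moving protein i alone changes the total energy by the change in its own energy plus the change in the sum of its penalties with all other proteins, which is consistent with accepting single-protein moves on the basis of the energy of that protein plus its own diversity measure.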
[1] Xi is the vector of coordinates for the i-th homologous protein, and εi(Xi) is its unique energy function. ∆ij(Xi, Xj) is a function that measures and penalizes structural diversity between proteins i and j. The larger the difference between the two structures, the higher the value of the penalty function. In Eq. 1, we sum over the diversities of all pairs. Optimization of Etotal provides a prediction for the structure of the family of homologous proteins.
Exploring conformations. We used the Lattice Monte Carlo Program (LMCP) of Skolnick and Kolinski (2). The Monte Carlo procedure uses different moves on the lattice to modify a starting conformation of the protein chain. Each of the proteins is modified independently. A displacement δXi is chosen according to the LMCP protocol, so that ⟨δXi δXj⟩ is zero for i≠j (⟨ ⟩ denotes an ensemble average). New protein energies, εi(Xi+δXi) (i=1,…,N), are computed and supplemented by the i-th measure of structural diversity, ∆i(Xi+δXi)=Σj ∆ij. The displacement in Xi (δXi) is now accepted or rejected according to the usual Monte Carlo criterion with an energy εi(Xi)+∆i(Xi). The Monte Carlo test is repeated for all the homologous structures {i=1,…,N}.
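The coupled acceptance rule described above can be sketched on a toy problem. Everything specific in the example is an assumption made for illustration: continuous coordinates replace the lattice chain, quadratic wells stand in for the protein energies, a scaled squared distance stands in for the diversity penalty, and the function names are hypothetical. Only the structure of the loop follows the text: each structure is perturbed independently, and the move is accepted or rejected with the Metropolis criterion applied to its own energy plus its summed penalty against the other structures.

```python
import math
import random


def internal_energy(x, minimum):
    """Toy stand-in for the energy of one protein: a quadratic well."""
    return sum((xi - minimum) ** 2 for xi in x)


def pair_penalty(x_a, x_b, strength):
    """Toy stand-in for the diversity penalty between two structures."""
    return strength * sum((a - b) ** 2 for a, b in zip(x_a, x_b))


def coupling(i, structures, strength):
    """Summed penalty of structure i against all other structures."""
    return sum(pair_penalty(structures[i], structures[j], strength)
               for j in range(len(structures)) if j != i)


def total_energy(structures, minima, strength):
    """Individual energies plus each pair penalty counted once."""
    n = len(structures)
    energy = sum(internal_energy(x, m) for x, m in zip(structures, minima))
    energy += sum(pair_penalty(structures[i], structures[j], strength)
                  for i in range(n) for j in range(i + 1, n))
    return energy


def coupled_monte_carlo(structures, minima, strength, temperature, sweeps, step, rng):
    for _ in range(sweeps):
        for i in range(len(structures)):
            k = rng.randrange(len(structures[i]))
            old_value = structures[i][k]
            old = internal_energy(structures[i], minima[i]) + coupling(i, structures, strength)
            # The trial move is generated without reference to the penalty.
            structures[i][k] = old_value + rng.gauss(0.0, step)
            new = internal_energy(structures[i], minima[i]) + coupling(i, structures, strength)
            # Metropolis test on the single-structure energy plus its penalty.
            if new > old and rng.random() >= math.exp(-(new - old) / temperature):
                structures[i][k] = old_value  # reject the move
    return structures


rng = random.Random(0)
minima = [0.0, 0.2, -0.2]                       # slightly different "sequences"
structures = [[5.0] * 4, [-5.0] * 4, [4.0] * 4]  # dissimilar starting points
initial = total_energy(structures, minima, strength=0.5)
coupled_monte_carlo(structures, minima, 0.5, temperature=0.1, sweeps=4000, step=0.5, rng=rng)
final = total_energy(structures, minima, strength=0.5)
```

In this sketch every structure always sees the current coordinates of the others; a parallel implementation could instead work from a copy that is refreshed only every few steps.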
‡To whom reprint requests should be addressed. E-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955880–4$2.00/0 PNAS is available online at http://www.pnas.org.
Abbreviations: LMCP, Lattice Monte Carlo program; RMSD, root mean square difference.
The generation of δXi (but not the decision on its acceptance) depends only on Xi and does not take into account the penalty function on structural diversity. This protocol may lead to a large number of step rejections. However, this choice is the simplest to use in a parallel computing environment, and it further leaves some room for future publications. The parallel environment is important because the computations are typically run on a cluster of workstations or on a parallel computer with multiple CPUs. Each of the homologous proteins is assigned to one processor. The processor calculates the displacement δXi and the energy εi(Xi)+∆i(Xi) and decides whether to accept or reject the move. The correlation with other structures is introduced through the function ∆i(Xi), which depends on all the other coordinates. The conformations of the set are therefore sampled from the canonical ensemble with an “energy” Etotal. To compute the penalty function ∆i(Xi), each processor must hold the coordinates of all the proteins. The coordinates can be updated at every Monte Carlo step. However, to reduce communication overhead, we usually update the coordinates only after a few Monte Carlo steps. A related algorithm can be easily formulated for molecular dynamics by solving Newton’s equations of motion:
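$$M_i \frac{d^2 X_i}{dt^2} = -\nabla_{X_i} E_{total}$$

where $M_i$ is the mass matrix for protein i. Nevertheless, the lattice approach offers a number of unique advantages for the protein folding problem, which are discussed elsewhere (3). We now return to the functional form of the penalty on structural variations. We experimented with two measures:

(a) The root mean square difference (RMSD) of the shared Cα coordinates after optimal overlap (4):

$$\mathrm{RMSD}(X_i, X_j) = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left| x_k^{(i)} - x_k^{(j)} \right|^2}$$

where n is the number of shared Cα atoms and $x_k^{(i)}$ is the position of the k-th shared Cα atom of protein i after superposition.

(b) L, the fraction of dissimilar contacts in the contact maps of the two structures: L = (number of dissimilar contacts)/(total number of contacts in the two maps). (Two residues are considered to be in contact if their Cα distance is ≤6.5 Å.)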
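As an illustration of measure (b), the following sketch computes L from two lists of Cα coordinates. The function names, the treatment of all residue pairs i<j (including chain neighbors), and the convention of returning 0 when neither map has contacts are assumptions for this example; the RMSD of measure (a) is not shown because it additionally requires an optimal superposition (4).

```python
import math


def contact_map(ca_coords, cutoff=6.5, min_separation=1):
    """Return the set of residue pairs (i, j), i < j, whose C-alpha distance is <= cutoff."""
    contacts = set()
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


def contact_dissimilarity(coords_a, coords_b, cutoff=6.5):
    """L = (contacts present in only one map) / (total contacts in the two maps)."""
    map_a = contact_map(coords_a, cutoff)
    map_b = contact_map(coords_b, cutoff)
    total = len(map_a) + len(map_b)
    if total == 0:
        # Two maps without contacts have no dissimilar contacts.
        return 0.0
    return len(map_a ^ map_b) / total


# Hypothetical four-residue chains: one extended, one folded into a square.
extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
compact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
print(contact_dissimilarity(extended, compact))
```

The zero returned for two contact-free maps makes explicit why a restraint on L alone does not discourage fully unfolded chains.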
The RMSD is a common and useful measure of global similarity. However, it performs poorly in detecting similar folds of structural segments. For example, if the secondary structure elements are predicted correctly but their packing is incorrect, the RMSD is typically high. In contrast, L, which is not as widely used as the RMSD, detects local similarities and varies more uniformly as the structure quality decreases. Both functions are useful in comparing the final structure to the native fold; however, the task of forcing the different chains to look alike is best done with the RMSD. Applying L as a restraint is problematic because maps with no contacts at all have (of course) no dissimilar contacts. As a result, restraining the structures to similar L values pushes the system to unfolded, swollen states. We therefore used the RMSD. The specific functional form of ∆ij(Xi, Xj) is listed below:
Simulation Protocol and Results. We provide a numerical example for a family of pancreatic hormones (5). In Table 1, we list the seven sequences that were used in the runs with coupled optimizations (6). We performed 100 Monte Carlo simulated annealing runs of the protein 1ppt and 142 coupled runs of the seven sequences, which were optimized in parallel. Only one experimental structure (that of 1ppt) is available, and we compared the results of the computations with it. In Fig. 1, we show the energies of the 100 Monte Carlo runs of 1ppt as a function of the RMSD from the native structure. Also shown are the energies of the 142 coupled runs. The runs that employed seven coupled proteins clearly cluster near lower RMSD values and therefore provide better predictions. The lowest energy structures of the coupled and the uncoupled runs (our best guesses of the native conformation) are shown in Fig. 2. Again, the coupled runs provide a better answer. The improvement does not require an increase in computational effort: each of the uncoupled 1ppt runs was seven times longer than the run of the seven sequences. Another example for a protein family (homeodomain) can be found in ref. 6. Yet another study employed coupling in a two-dimensional lattice (7) and showed an even more pronounced improvement.
DISCUSSION

Here we discuss the question of “why.” Why does the proposed algorithm improve structure prediction? We have seen one example, and other examples are available in the literature (6, 7). From a global optimization perspective, it may be surprising that optimizing a system that is N times larger (N homologous sequences) is easier than optimizing one sequence at a time. In the limit in which the optimizations are completely independent, they should take approximately N times longer. Obviously, the coupling plays an important role in increasing the computational efficiency and accuracy. To understand the effect, it is useful to consider first a simpler system in which the “homologous” proteins are all copies of the same molecule. Hence, only structural diversity remains. Etotal is now $E_{total} = \sum_{i=1}^{N} [\varepsilon(X_i) + \Delta_i]$. A single energy function [ε(Xi)] is used for the different conformations of the proteins {i=1,…,N}. In Fig. 3, we compare annealing results with coupled and uncoupled energy functions for the protein 1fsd. The distribution clearly shows that better energies are obtained when the coupling (of identical proteins) is employed. Hence, a better optimization protocol is obtained even without sequence diversity. However, it is important to emphasize that the quality of the structures (as opposed to the energies) is not necessarily better, because it also depends on the quality of the energy function.
Table 1. The seven coupled sequences that were used in the present work
PDB, protein data bank.
FIG. 1. Comparison between coupled and uncoupled Monte Carlo runs. X, uncoupled runs; black circles, uncoupled runs. Each point is the final configuration of a simulated annealing run. Note that the coupled runs end more frequently at lower energies and lower RMSD values.

The Monte Carlo procedure produces conformations that are sampled from the canonical ensemble. The weight of a coupled state X1,…,XN at a temperature T is given by

$$\exp\left[-\frac{1}{k_B T} \sum_{i=1}^{N} \left( \varepsilon(X_i) + \Delta_i \right)\right]$$

where $k_B$ is the Boltzmann constant. Note that ∆i depends on the coordinates of the rest of the copies and that, without the coupling, we recover the classical Boltzmann factor for a set of N noninteracting copies,

$$\exp\left[-\frac{1}{k_B T} \sum_{i=1}^{N} \varepsilon(X_i)\right]$$

Consider now the discrete formula for the quantum path integral of a system with a potential energy ε(X), where m is the mass matrix and ħ is the Planck constant divided by 2π (8). For convenience, we define λ, and we also set N+1 ≡ 1. The new “coupled” energy Ecoupled resembles Etotal.
The quantum expression couples only pairs of nearest neighbor structures because the coupling corresponds to a physical entity, the kinetic energy. Nevertheless, if λ is sufficiently large (the protein is close to being “classical”), the different structures will remain similar at each sampling point, which is exactly what we wanted to achieve in the folding of homologous proteins. The similarity between Etotal and Ecoupled is therefore self-evident and hints at the origin of the enhanced optimization, as discussed below. Expressions related to the above quantum expression were investigated in the global optimization field (9). The key idea in a number of pioneering approaches was to define a new energy function that is a local spatial average with different choices of densities: $E_{average}(X) = \int E(X_0)\, \rho(X, X_0)\, dX_0$, where ρ(X, X0) is the density. The result, Eaverage(X), is a smoother potential that is easier to minimize (Fig. 4), as was shown by numerous examples (9–13). Examples of smoothing densities include (but are not limited to) Gaussians (9–11), square boxes (12), and a discrete sample of points (13). For example, the above quantum expression is an average of ε(X) over a discrete number of sampling points {X1,…,XN}. The discrete averaging has advantages and disadvantages. An advantage is its simplicity: we do not have to perform complex integrals to obtain the average, and the smoothing is done by a direct sum over the sampling points. However, because the number of points is small, the averaging is less effective than in other methods that use analytical densities and integrals. Nevertheless, discrete smoothing is very suggestive for lattice calculations of the type that we present here.
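For orientation, the discretized path-integral weight referred to above is commonly written in the primitive form below. This is a sketch based on the standard textbook expression (8), not a transcription of the equations of this paper; in particular, the overall factor 1/N and the normalization chosen for λ are assumptions.

```latex
\rho(X_1,\dots,X_N) \;\propto\;
\exp\left\{ -\frac{1}{N k_B T} \sum_{i=1}^{N}
\left[ \varepsilon(X_i) + \lambda\,(X_i - X_{i+1})^{T}\, m\,(X_i - X_{i+1}) \right] \right\},
\qquad
\lambda = \frac{N^{2} (k_B T)^{2}}{2\hbar^{2}},
\qquad
X_{N+1} \equiv X_1 .
```

Written this way, the bracketed sum plays the role of a coupled energy in which only neighboring copies i and i+1 interact, and λ grows without bound as ħ → 0, which is the sense in which a large λ corresponds to a nearly classical system.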
FIG. 2. Comparing the native structure (a) with the lowest energy structures of the coupled (b) and the uncoupled (c) runs.

To conclude, one of the reasons that the proposed protocol improves structure prediction is spatial potential averaging. This improvement is in the spirit of a number of recent global optimization procedures (9).
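The effect of discrete spatial averaging can be illustrated on a toy one-dimensional potential. Everything in this sketch is an assumption made for illustration: the potential `rugged`, the eight sampling offsets, and the choice of letting the offsets span exactly one ripple period, which is a best case for removing the ripple.

```python
import math


def rugged(x):
    """Toy one-dimensional potential: a broad well with a unit-period ripple."""
    return 0.1 * x * x + math.cos(2.0 * math.pi * x)


def discrete_average(potential, x, offsets):
    """Average the potential over a discrete set of points around x."""
    return sum(potential(x + d) for d in offsets) / len(offsets)


def count_local_minima(values):
    """Count strict interior local minima of a sampled curve."""
    return sum(1 for k in range(1, len(values) - 1)
               if values[k] < values[k - 1] and values[k] < values[k + 1])


n_points = 8
# Offsets are centered on zero and span one ripple period.
offsets = [(k - (n_points - 1) / 2.0) / n_points for k in range(n_points)]
grid = [(k - 300) * 0.01 for k in range(601)]  # x from -3 to 3

raw = [rugged(x) for x in grid]
smooth = [discrete_average(rugged, x, offsets) for x in grid]
print(count_local_minima(raw), count_local_minima(smooth))
```

In this contrived case the averaged curve keeps only the broad well, so a minimizer started anywhere on the grid reaches the same basin. With fewer or poorly placed sampling points, part of the ripple would survive.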
FIG. 3. Comparing the optimized energies from multiple simulated annealing runs for coupled and uncoupled simulations. Dark bars, coupled runs; light bars, uncoupled runs.

Another important feature of the present protocol is averaging in sequence space. We return to the N different sequences. Each of the proteins has a unique energy surface. From experimental observations, we know that all the homologous proteins share similar structures in their native folds. At approximately the same coordinate, Xnative, we expect all the proteins to be in an energy minimum. Therefore, all of the energies are correlated, and their sum is a quantity that increases linearly with N.
FIG. 4. A schematic drawing of potential smoothing using discrete, local averaging: $V(X) = \frac{1}{N} \sum_{i} V(X_i)\, \delta_{X, X_i}$.
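The sequence-averaging argument of this part of the discussion can be illustrated numerically. The sketch below assumes, in the spirit of a random energy model, that the energies of the N sequences at a non-native conformation are independent Gaussian variables, while at the shared native conformation every sequence contributes the same well depth. The function names and all numerical values are hypothetical.

```python
import random
import statistics


def summed_energy_native(n_sequences, well_depth=-1.0):
    """Correlated case: every sequence contributes the same well depth."""
    return n_sequences * well_depth


def summed_energy_spread(n_sequences, trials, rng, sigma=1.0):
    """Uncorrelated case: standard deviation of a sum of independent energies."""
    sums = [sum(rng.gauss(0.0, sigma) for _ in range(n_sequences))
            for _ in range(trials)]
    return statistics.pstdev(sums)


rng = random.Random(0)
for n in (1, 4, 16):
    native = summed_energy_native(n)
    spread = summed_energy_spread(n, 2000, rng)
    # The native sum scales with n, the spread roughly with sqrt(n).
    print(n, native, round(spread, 2))
```

Under these assumptions the depth of the shared minimum grows faster with the number of sequences than the typical fluctuation of the summed energy elsewhere, which is the signal enhancement attributed to averaging over homologous sequences.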
On the other hand, for unfolded states the energies of the different homologous structures are not necessarily correlated. Consider, for example, a correlated mutation at the hydrophobic core of the protein. To maintain the compactness of the hydrophobic core of the native state, valine and tryptophan may replace a pair of phenylalanines. At unfolded conformations, there is no reason to assume that the contacts and the energies of these residues are still correlated. It is more likely that the energies are not correlated. We therefore estimate that the sum of the energies at an unfolded conformation grows only as $\sqrt{N}$. This estimate is in the spirit of the Random Energy Model as applied to proteins (14). The new energy surface Etotal is therefore distorted in a favorable way compared with the original εi. The shared minimum (whose existence is supported by databases of protein structures) is deeper compared with other portions of the energy surface. The enhancement of the well depth of the shared minimum may make it the global energy minimum of the new average energy even if originally it was not. This enhancement suggests that the new protocol may be effective for kinetically stable proteins. It is the combination of spatial and sequence averaging that provides the significant improvement in structure prediction by ab initio algorithms discussed above.

This research was supported by a Binational Science Foundation grant (to R.E. and J.S.) and by National Institutes of Health Grant GM37408 (to J.S.).

1. Sander, C. & Schneider, R. (1991) Proteins 9, 56–68.
2. Kolinski, A. & Skolnick, J. (1994) Proteins 18, 338–352.
3. Kolinski, A. & Skolnick, J. (1996) Lattice Models of Protein Folding, Dynamics and Thermodynamics (Chapman & Hall, London).
4. Kabsch, W. (1976) Acta Crystallogr. A 32, 922–923.
5. Glover, I. & Blundell, T. (1983) Biopolymers 22, 293–304.
6. Keasar, C., Elber, R. & Skolnick, J. (1997) Fold. Des. 2, 247–259.
7. Keasar, C. & Elber, R. (1995) J. Phys. Chem. 99, 11550–11556.
8. Feynman, R.P. (1982) Statistical Mechanics: A Set of Lectures (Benjamin/Cummings, Reading, MA).
9. Straub, J.E. (1996) Optimization Techniques with Applications to Proteins, in Recent Developments in Theoretical Studies of Proteins, ed. Elber, R. (World Scientific, Singapore), pp. 137–197.
10. Piela, L., Kostrowicki, J. & Scheraga, H.A. (1989) J. Phys. Chem. 96, 4024–4035.
11. Shalloway, D. (1992) Global Optimizat. 2, 281.
12. Andricioaei, I. & Straub, J.E. (1996) Comput. Phys. 10, 449–454.
13. Roitberg, A. & Elber, R. (1991) J. Chem. Phys. 95, 9277–9287.
14. Bryngelson, J.D. & Wolynes, P.G. (1989) J. Phys. Chem. 93, 6902–6915.
Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5884–5890, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Photoactive yellow protein: A structural prototype for the three-dimensional fold of the PAS domain superfamily
JEAN-LUC PELLEQUER*, KAREN A. WAGER-SMITH†, STEVE A. KAY†, AND ELIZABETH D. GETZOFF*‡
*Department of Molecular Biology and The Skaggs Institute for Chemical Biology, †National Science Foundation Center for Biological Timing, Department of Cell Biology, The Scripps Research Institute, 10550 N. Torrey Pines Rd., La Jolla, CA 92037.

ABSTRACT PAS domains are found in diverse proteins throughout all three kingdoms of life, where they apparently function in sensing and signal transduction. Although a wealth of useful sequence and functional information has become recently available, these data have not been integrated into a three-dimensional (3D) framework. The very early evolutionary development and diverse functions of PAS domains have made sequence analysis and modeling of this protein superfamily challenging. Limited sequence similarities between the ~50-residue PAS repeats and one region of the bacterial blue-light photosensor photoactive yellow protein (PYP), for which ground-state and light-activated crystallographic structures have been determined to high resolution, originally were identified in sequence searches using consensus sequence probes from PAS-containing proteins. Here, we found that by changing a few residues particular to PYP function, the modified PYP sequence probe also could select PAS protein sequences. By mapping a typical ~150-residue PAS domain sequence onto the entire crystallographic structure of PYP, we show that the PAS sequence similarities and differences are consistent with a shared 3D fold (the PAS/PYP module) with obvious potential for a ligand-binding cavity. Thus, PYP appears to prototypically exhibit all the major structural and functional features characteristic of the PAS domain superfamily: the shared PAS/PYP modular domain fold of ~125–150 residues, a sensor function often linked to ligand or cofactor (chromophore) binding, and signal transduction capability governed by heterodimeric assembly (to the downstream partner of PYP).
This 3D PAS/PYP module provides a structural model to guide experimental testing of hypotheses regarding ligand binding, dimerization, and signal transduction.

A large and growing set of multidomain protein sensors and transcription factors involved in signal transduction includes PAS domain sequences (http://www.whoi.edu/biology/hahnm.html). The PAS acronym was coined originally (1) to describe the ~270-residue region encompassing two direct sequence repeats (PAS-A and PAS-B) of ~50 residues each that had been identified in the Drosophila Period clock protein (PER) (2), the vertebrate Aryl hydrocarbon receptor nuclear translocator (ARNT) (3), and the Drosophila Single-minded (SIM) (2). These three proteins are involved in regulation of circadian rhythms, activation of the xenobiotic response, and cell fate determination, respectively. More recently, PAS domains have been found in many other proteins, including histidine kinases (4), light receptor and regulator proteins (5), clock proteins (6, 7), sensor proteins (oxygen/redox sensors), ion channels (5), and a Ser/Thr kinase with a putative redox-sensing or flavin-binding domain, in which PAS regions are named “LOV” (light, oxygen, or voltage) (8). These PAS-containing proteins occur in a wide range of living organisms, including eubacteria, archaea, cyanobacteria, fungi, plants, insects, and mammals (5). PAS-containing proteins have been categorized (5) into three functional subgroups: (i) transcription activators [DNA-binding proteins with both basic helix-loop-helix (bHLH) and PAS sequence motifs], (ii) sensor modules of two-component regulatory systems (oxygen sensor, nitrogen fixation, sensor kinase, etc.), and (iii) ion channels (in eucarya).

One function of the PAS domain is to mediate protein-protein interactions (9, 7). Dimerization has been demonstrated for many transcriptional activators such as the aryl hydrocarbon receptor (AHR), ARNT, SIM, hypoxia-inducible factor 1 (HIF-1), Member Of the PAS superfamily (MOPs), and the trachealess (TRH) protein. Dimerization is known to be mediated both by the bHLH region of these transcription activators and by their PAS repeats (9–14). Some PAS-containing proteins lack a bHLH region (PER) but can still either homodimerize (9) or heterodimerize with other bHLH-PAS-containing proteins (9–11) through their PAS domains in vitro. A second function for PAS domains is ligand and/or cofactor binding, as is the case for AHR (15) and for the heme-binding bacterial O2-sensing protein FixL (16).

Sequence similarities between the PAS repeats and photoactive yellow protein (PYP), a self-contained bacterial blue-light photoreceptor (17–19) implicated in negative phototaxis (20), were identified by Lagarias (21), who probed a sequence database with a 43-residue consensus sequence constructed from the PAS-A and PAS-B domains of phytochromes, which are the red and far-red photoreceptors in plants. Recently, many more PAS domains have been shown to exhibit sequence similarities with PYP (22, 7, 23, 5, 4), and sequence alignments have been extended into the “S2” region that is located immediately C-terminal to the original PAS repeats (5) and into a second, more C-terminal region of the PAS-B sequences, termed “PAC” (4). PYP also exhibits the functions characteristic of PAS domains: sensing (of light), binding of ligand/cofactor (chromophore), and signal transduction through protein-protein interaction (with the downstream partner of PYP).
‡To whom reprint requests should be addressed. E-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955884–7$2.00/0 PNAS is available online at http://www.pnas.org.
Abbreviations: PYP, photoactive yellow protein; PER, period; ARNT, aryl hydrocarbon receptor nuclear translocator; SIM, single-minded; bHLH, basic helix-loop-helix; AHR, aryl hydrocarbon receptor; HIF-1, hypoxia-inducible factor 1; MOPs, member of the PAS superfamily; TRH, trachealess; 3D, three-dimensional.
Here, we present the hypothesis that the entire PYP protein fold is the structural prototype for the modular, three dimensional (3D), PAS A and PAS B domain folds in PAS-containing proteins. We define the PAS/PYP module with a length of ~20–150 residues that matches the length of the PYP sequence and encompasses the entire PYP protein fold. PYP therefore appears to be both structurally and functionally a prototypical PAS domain, with a modular single-domain fold that links sensing and ligand binding to signal transduction via dimerization with another protein. A 3D molecular model of the PAS B domain of the human protein ARNT, based on the PYP structure, provides a detailed structural framework for integrating and differentiating sequence and functional data and suggests specific regions to test experimentally for involvement in ligand binding, dimerization, and signal transduction.
METHODS

Sequence Alignment. Representative members of the ARNT family were aligned to each other using the program PILEUP (24) from the Genetics Computer Group (GCG) suite (25). Several PAS-containing sequences, identified from previous publications (4, 5, 26) and obtained from the National Center for Biotechnology Information Entrez Web service (http://www.ncbi.nlm.nih.gov/Entrez/), were then successively added to the alignment with the PILEUP program. Finally, the sequence of PYP was added manually to complete the sequence alignment (see Fig. 2). Automatic alignment of the PYP sequence with the PAS domain sequences is hindered by the presence of a few apparent mismatched residues that inflate the alignment score (mainly Gly→hydrophobic). Such residues are discussed in detail later.

Molecular Modeling. Using the alignment in Fig. 2, we built a 3D model of the PAS B domain of the human ARNT protein, based on the coordinates of dark-state PYP (2phy.pdb) (27). Nonidentical side chains were replaced using the program XFIT (28). Side-chain conformations were taken from a rotamer dictionary based on the library of Tuffery et al. (29). In each case, the most common rotamer was used unless it displayed strong steric clashes with any backbone atoms. Alternative rotamers were used for residues 33, 34, 62, 63, and 112 (PYP numbering is used throughout this paper), and a significant deviation from the standard rotamers was necessary to fit residue 29. The resulting ARNT model was energy minimized with the XPLOR program (30), using the conjugate gradient method (31) and the CHARMM22 all-atom parameter set (32). The dielectric constant was set to one. The shifted electrostatic and the switched van der Waals functions were selected using a cut-on value of 6.5 Å and a cut-off value of Å. Nonbonded interactions were cut off beyond 9 Å. Hydrogen atoms were added with the HBUILD program (33), and their positions were energy-minimized until the norm of the gradient was <0.1. Then, while the backbone was kept fixed, all side-chain positions in the model were energy-minimized with the electrostatic energy term turned off, until the norm of the gradient was <0.5. This was followed by a short minimization of all atomic positions (norm of the gradient <2.0) to remove any remaining clashes between side-chain and main-chain atoms. Two-residue insertions at positions 87 and 98 and a single-residue deletion at position 69 were introduced with the program TURBOFRODO (34). Energy minimization was performed again as described above. The root-mean-square deviation between the backbone atoms (N, Cα, and C) of the model and of PYP is 0.76 Å. A quality check performed with the program PROCHECK (35) showed that 87.1% of φ–ψ angles were in the most conserved regions, compared with 90.6% in the crystallographic structure of dark-state PYP.
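To make the role of the cut-on and cut-off distances concrete, the sketch below implements a switching function of the form commonly used in CHARMM-style force fields, which scales an interaction smoothly from 1 at the cut-on distance to 0 at the cut-off distance. It is an assumption that this is the form applied in the minimization described above, and the cut-off distance `r_off` used in the example is a placeholder chosen only for illustration.

```python
def switch(r, r_on, r_off):
    """Smooth switching factor: 1 for r <= r_on, 0 for r >= r_off."""
    if r <= r_on:
        return 1.0
    if r >= r_off:
        return 0.0
    on2, off2, r2 = r_on * r_on, r_off * r_off, r * r
    # Polynomial in r^2 that matches value and slope at both ends.
    return (off2 - r2) ** 2 * (off2 + 2.0 * r2 - 3.0 * on2) / (off2 - on2) ** 3


r_on = 6.5   # cut-on distance in angstroms, as quoted in the text
r_off = 7.5  # hypothetical cut-off distance, for illustration only
for r in (6.0, 6.5, 7.0, 7.5, 8.0):
    print(r, round(switch(r, r_on, r_off), 4))
```

Multiplying a van der Waals term by this factor removes the discontinuity in energy and force that a hard truncation at the cut-off would introduce, which matters for gradient-based minimization.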
RESULTS

The PAS/PYP Module. PYP provides a structural and functional prototype for the 3D fold of the PAS domain superfamily (Fig. 1), which we name the PAS/PYP module. PYP is a self-contained, bacterial blue-light photoreceptor, with an unusual fold characterized by a central six-stranded β-sheet with N- and C-terminal β-strands in the center (27). The overall PYP fold breaks down into four segments: (i) the N-terminal helical lariat (residues 1–28), including helices α1 and α2, (ii) the first three-stranded half of the central β-sheet (residues 29–69), including the β1, β2 hairpin, two short intervening α-helices (α3 and α4), β3, and an overlapping turn of π-helix, (iii) the helical connector (residues 70–86), composed predominantly of the long α5-helix that diagonally crosses the β-sheet to connect the two edge β-strands, and (iv) the last three-stranded half of the central β-sheet, including β4, a connecting loop, and the β5, β6 hairpin. PYP has a hydrophobic core on each side of the central β-sheet (27). The N-terminal helical lariat caps one side of the β-sheet to form the smaller hydrophobic core. The remaining helices and loops, together with the central β-sheet, surround the 4-hydroxycinnamoyl chromophore to form the larger hydrophobic core. Helix α3 and flanking residues contribute the hydrogen-bonding network for the phenolic hydroxyl at the tip of the chromophore.

The PAS core (Fig. 1), the second segment of the PAS/PYP module, provides the photosensing active site of PYP and roughly corresponds to the traditional repeating PAS sequence motif of ~50 amino acids (1–3). This key portion of the PYP structure, ending at the Cys69 attachment site for the chromophore, forms the majority of the immediate environment of the chromophore, provides all residues that hydrogen bond the chromophore, and supplies the Arg52 gateway (27). The Arg gateway likely participates in PYP heterodimerization with a downstream signal transduction protein, by moving and providing solvent access to the chromophore during the long-lived, bleached, signaling intermediate of the PYP light cycle (36).

FIG. 1. A proposed PAS/PYP 3D fold illustrated on the PYP structure. The N-terminal cap, colored in purple, contains residues from 1 to 28. The PAS core, colored in gold, is the domain where higher sequence homology is found among various members of the PAS-containing molecules. It spans from residue 29 to 69. The helical connector, colored in green, includes a short loop followed by the helix α5 and spans residues from 70 to 87. The β-scaffold, colored in blue, contains the last three strands of PYP, spanning residues from 88 to 125.

The β-scaffold, the fourth segment of the PAS/PYP module, provides a long platform with a characteristic β-sheet twist that supports the PAS core and completes the central six-stranded PYP β-sheet. This β-scaffold (Fig. 1) approximately matches the PAC sequence motif described by Ponting and Aravind (4). Within this β-scaffold, the end of the fourth β-strand plus the ω-loop (37) connecting β4 and β5 wrap around the PAS core to complete the chromophore environment. In the PAS “S-box” nomenclature created by Zhulin and coworkers (5), the S1 box corresponds to the last three-fourths of the PAS core, and the S2 box covers most of the β-scaffold. The central β-sheet of the PAS/PYP module is protected from solvent by the two remaining segments: the N-terminal cap protects one side, whereas the helical connector combines with parts of the PAS core to protect the other. The PAS-related LOV (light, oxygen, or voltage) sequence motif encompasses the PAS core, the helical connector, and the β-scaffold. This sequence region was identified by Briggs and coworkers (8) in the plant protein NPH1 (which participates in the signal-transduction pathway for phototropism) and in a family of proteins regulated by environmental factors that could change their redox status.

Sequence conservation between PYP and PAS domains. PAS domain sequences occur in all three kingdoms of life and act in a multitude of regulatory, sensing, and signal transduction pathways. Thus, sequence similarities are limited and may be further obscured by the functional variation of buried active-site residues that would otherwise exhibit the relative conservation expected of inward-facing residues of the hydrophobic core. To evaluate the limited sequence similarity between PYP and PAS domains, we compiled a full-length PYP sequence alignment with a set of PAS domain sequences (Fig. 2).
We used automated protein sequence alignment to align the full-length sequences of closely related PAS proteins (Fig. 2), including the original PAS trio of PER, ARNT, and SIM, as well as the more recently discovered mammalian CLOCK proteins and their homologues. PYP sequences from Ectothiorhodospira halophila (38), corresponding to the crystallographic structure, and three other bacteria were added manually, starting from the sequence registration within the PAS core identified by Lagarias et al. (21) using a phytochrome PAS repeat consensus sequence as the search probe. ARNT was chosen as the major PAS protein family to include and model-build because of the many known sequences from
FIG. 2. Similarities revealed by multiple sequence alignment of several members of the PAS-containing proteins and members of the PYP family. The alignment was performed using the program PILEUP in the GCG suite (25), starting from the ARNT molecules and then adding each PAS-containing molecule in the list. PYP molecules were manually aligned on the top (see Methods). White letter amino acids are conserved in both PYP and PAS-containing proteins. Red letter amino acids highlight significant differences between PYP and PAS-containing proteins. The secondary structure of PYP is displayed on the top using the color coding from Fig. 1. Helices are represented by “noodles”, strands by arrows, and loops by lines. Accession numbers from the SwissProt database, as extracted using the Entrez web service (http://www.ncbi.nlm.nih.gov/Entrez/protein.html), are P16113 (pyp_ecto), X98888 (pyp_rhodosp), X98889 (pyp_rhodoba), M19029 (sim_fly), U33427 (trh_fly), U22431 (hifa_human), U51627 (mop3_human), X03636 (per_fly), U10325 (arnt_mouse), M69238 (arnt_human), D45239 (arnt_rabbit), and AF020426 (arnt_fly). Other sequences were obtained through the Entrez service by searching full names of proteins (pyp_chroma, clock_mouse, and arnt_trout).
diverse species, their relatively unambiguous alignment with PYP sequences, and the importance of ARNT as a common regulatory partner in many PAS protein heterodimers (12, 13, 39, 40). In trial alignments, we discovered that the PAS-B sequences align better with PYP than do their more N-terminal PAS-A counterparts. As shown in Fig. 2, we found that the sequence alignment of PYP with PER and with these bHLH-PAS-containing transcriptional activators can be extended both N- and C-terminally to encompass the entire PYP single-domain fold. The diversity of PAS domain sequences and the low sequence similarity among the more distant members made sequence alignment challenging but also provoked an interesting discovery. Automated alignment with standard programs failed to properly align PYP sequences with the PAS proteins. Similarly, when searching the nonredundant protein sequence database with the BLAST server (http://www.ncbi.nlm.nih.gov/BLAST), the PYP sequence alone failed to select any PAS-domain proteins (although PAS protein sequences can successfully select PYP; ref. 21). However, we discovered that by simply changing a few functionally key PYP residues (G29 into F, G47 into V, and R52 into Y), the modified sequence was able to pick a member of the ARNT family in a BLAST search. These residues were chosen because their specific roles in PYP would not likely be conserved in any other PAS domain proteins. As shown in Fig. 2, PYP-conserved G29 and G47 are replaced by large hydrophobic residues in the PAS domain sequences. In sequence evolution, the conservation of glycine residues often results from their unrestricted backbone dihedral angles, freed by the absence of a Cβ atom. However, in the PYP structures, neither G29 nor G47 occupies a region of φ, ψ dihedral space forbidden to other larger amino acids. 
Instead, substitution of these glycines in PYP appears to be spatially prohibited by the proximity of adjacent side chains that make key interactions with the chromophore of PYP, which is unlikely to be present in PAS domain proteins. In PYP, substitution at G29 would likely interfere with buried E46, which forms an important salt bridge with the phenolic oxygen of the deprotonated chromophore of dark-state PYP (27). Likewise, substitution at PYP G47 would likely interfere with buried Y42, which hydrogen bonds with the same phenolic oxygen of the PYP chromophore. During the PYP photocycle, R52 actively participates in signal transduction by undergoing conformational changes that allow the photoisomerized chromophore access to solvent (36). Although these three PYP function-specific residues would not require conservation in PAS proteins, standard substitution matrices used for automated sequence alignment cannot reasonably accommodate the resulting sequence substitutions and thus fail to perform appropriate sequence alignments. Such alignment peculiarities may often stymie sequence alignment programs and preclude sequence alignment where no structural information is available. At the sequence level, the alignment presented in Fig. 2 identifies several interesting features. First, only two short insertions into PYP are required: between residues 87 and 88 (PYP numbering is used throughout) and between residues 98 and 99. Second, a single short gap in SIM, TRH, and HIF-1α occurs near the N terminus. Third, PYP C69, which carries the chromophore, is deleted. Fourth, the following residues mostly are conserved among all sequences: V4, D34, G37, I39, N43, G51, P54, V57, I58, G59, K60, N61, F63, P68, D71, F79, F92, Y94, V120, and F121. Fifth, the following residues represent the major differences between PYP and the PAS domains: G29, Y42, E46, G47, R52, D65, A67, T/A70, E/D93, and D/A97. 
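The three-residue modification of the PYP query described above (G29 into F, G47 into V, and R52 into Y) is easy to script before a database search. The following is only a minimal sketch under stated assumptions: the helper name `apply_substitutions`, the `offset` parameter, and the toy fragment are hypothetical illustrations, not the procedure or the sequence actually used here.

```python
import re


def apply_substitutions(sequence, substitutions, offset=1):
    """Return a copy of `sequence` with point substitutions such as 'G29F' applied.

    `offset` is the residue number assigned to the first character of `sequence`.
    Each substitution is checked against the residue actually present, so a
    numbering mistake raises an error instead of silently changing the wrong site.
    """
    residues = list(sequence)
    for sub in substitutions:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", sub)
        if match is None:
            raise ValueError(f"cannot parse substitution {sub!r}")
        old, number, new = match.group(1), int(match.group(2)), match.group(3)
        index = number - offset
        if not 0 <= index < len(residues):
            raise ValueError(f"position {number} lies outside the sequence")
        if residues[index] != old:
            raise ValueError(
                f"expected {old} at position {number}, found {residues[index]}"
            )
        residues[index] = new
    return "".join(residues)


# Hypothetical 10-residue toy fragment numbered from residue 25.
# This is NOT the PYP sequence; it only places a Gly at position 29.
toy_fragment = "AAAAGAAAAA"
mutated = apply_substitutions(toy_fragment, ["G29F"], offset=25)
print(mutated)
```

With a real full-length sequence numbered from 1, the call would take the list `["G29F", "G47V", "R52Y"]` and the default offset; the resulting string could then be submitted to a search program of the reader's choice.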
These sequence conservations and differences are clarified at the 3D level, based upon the PYP structure. Structural Consequences of Sequence Conservation and Differences. As PAS domains function to mediate macromolecular interactions along signalling pathways, the development and analysis of 3D structural models for specific PAS/PYP modules is immediately useful for designing experiments to probe function. We explored the potential structural and functional consequences of significant sequence similarities and differences by examining the structural roles of the conserved residues in PYP (Fig. 3) and by mapping residues exhibiting sequence conservation and major sequence differences onto the molecular model of the ARNT PAS-B domain (Fig. 4). In general, both the greatest sequence conservation (Figs. 2 and 3) and the most striking sequence differences (Fig. 2) occur within the PAS core (Fig. 4), which represents the “active site” and chromophore ligand-binding region of PYP. The 20 natural amino acids do not all have the same probability of being structurally conserved. Pro residues, because of their backbone conformational restriction, and Gly residues, especially those with left-handed backbone conformations, often are evolutionarily conserved within a protein family. Both Gly and Pro residues frequently contribute to kinks and turns in the polypeptide chain. Pro 54 in PYP is
FIG. 3. Detailed residue interactions for conserved residues that form specific side-chain to main-chain hydrogen bonds in PYP and appear to be retained in PAS-containing molecules. First, Asp 34 OD1 hydrogen bonds to three backbone nitrogens from residues D36, G37, and N38 (2.87 Å, 3.08 Å, and 2.89 Å, respectively). Most residues at position 34 have an OD1 atom (Asp, Asn) or a similar atom (OG1 in Ser, Thr). Second, Asn 43 OD1 hydrogen bonds to three backbone nitrogen atoms: A30, A45, and E46 (2.96 Å, 3.55 Å, and 3.06 Å). All residues at position 43 in Fig. 2 have an OD1 or OE1 atom. Third, Asn 61 OD1 and ND2 hydrogen bond to three backbone nitrogens and one backbone oxygen: F62, F63, K64, and D36 (3.39 Å, 3.09 Å, 3.00 Å, and 2.96 Å). Almost all residues at position 61 in Fig. 2 have an OD1 or OG1 atom. Drawing made with the program MOLSCRIPT (46).
conserved in all PAS domains shown in Fig. 2. PYP has five left-handed Gly residues: G7 and G59 participate in type II β-turns, G37 mediates a β-bulge, G51 ends α3, and G86 ends the helical connector α5. In the PAS domain proteins of Fig. 2, P54 is conserved completely; G51 and G59 of PYP are conserved predominantly or substituted with residues (Asp, Asn, and Glu) that are fairly tolerant of left-handed backbone conformations; G7 and G37 of PYP also are mostly conserved but show more unusual substitutions; and the remaining PYP residues with left-handed backbone conformations (G86 and Q99) occur near regions tolerant of insertions, making analysis of their conformations difficult.
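Residues with left-handed backbone conformations, as discussed above, can be flagged from a list of backbone φ angles. The sketch below assumes the common convention that a left-handed conformation corresponds to a positive φ angle; the function name and the angle values are hypothetical illustrations and are not taken from the PYP coordinates.

```python
def is_left_handed(phi_degrees):
    """Return True when the backbone phi angle (in degrees) is positive.

    Assumption: left-handed conformations are identified by phi > 0,
    with angles expressed in the range -180 to 180 degrees.
    """
    return phi_degrees > 0.0


# Hypothetical (residue label -> phi) values, for illustration only.
example_phi = {"G7": 65.0, "V57": -120.0, "G59": 80.0}

left_handed = [name for name, phi in example_phi.items() if is_left_handed(phi)]
print(left_handed)
```

In practice the φ angles would come from a structure-analysis program, and a stricter definition using both φ and ψ could be substituted without changing the surrounding logic.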
FIG. 4. The PAS domain of the human ARNT protein modeled from the PYP crystal structure (2phy.pdb). The Cα trace is represented by a tube. The N-terminal cap is colored in magenta, the PAS core in gold, the helical connector in green, and the β-scaffold in blue. Side chains conserved between PYP and PAS-containing proteins are displayed in white. Amino acids that vary significantly between PYP and PAS-containing molecules are drawn in red. PYP’s chromophore is displayed in yellow in the same orientation as in the PYP molecule. Most conserved residues are located in the hydrophobic core of the PAS core (in gold). Most of the significantly variant amino acids (in red) occur in the vicinity of the chromophore pocket, as expected for molecules that carry out different biological functions. In the ARNT model, H67 and F29 occupy the chromophore pocket. Figure displayed using the Application Visualization System (AVS) (AVS, Waltham, MA).
Other residues likely to be conserved are those that make critical hydrogen bonds required either for proper folding or for stabilizing a particular fold. In particular, residues that form key side-chain hydrogen bonds to main-chain atoms of other residues are expected to be conserved. Residues D34, N43, K60, and N61 all serve this function in PYP, and their counterparts in the PAS domains of Fig. 2 should be able to maintain this hydrogen-bonding function. Asp 34 in PYP can make up to three hydrogen bonds to peptide nitrogen atoms from residues 36–38 (Fig. 3, Top). This requires an appropriately positioned hydrogen-bond acceptor atom like OD1 of Asp or OD1 from Asn (ARNT, TRH, Fig. 3, Top) or OG1 from Ser/Thr (PER, CLOCK, HIF-1α). Similarly, the Asn at position 43 also can make up to three hydrogen bonds to nitrogen backbone atoms of residues 30, 45, and 46. This particular interaction also can be mediated by the atom OD1 of the Asp in ARNT, PER, CLOCK, MOP3, SIM, and HIF-1α or an OE1 from a Glu in TRH (Fig. 3, Middle). 
In these two examples, hydrogen bonds stabilize a turn structure. The last example involves hydrogen bonds between a conserved Asn in PYP and ARNT at position 61, where both the hydrogen-bond donor and acceptor atoms can make contacts with other backbone atoms (Fig. 3, Bottom). Although this Asn is not fully conserved in every PAS domain (Fig. 2), the hydrogen bonds to OD1 that mediate π helix formation are conserved by most residues at position 61 (Asn/Ser/Thr residues, Fig. 3, Bottom). The segments of the PYP protein outside of the PAS core exhibit lesser sequence conservation. Here, those residues that are conserved across all or most of the PAS domain protein families (Fig. 2) generally have identifiable roles in defining the secondary and tertiary structure of PYP. Both the helical connector and the β-scaffold show lesser sequence conservation (Fig. 2), consistent with their apparent role in maintaining appropriate secondary structural elements for the PAS/PYP module. The largest length variations in the sequence alignment are located at PYP positions 87–88 in the helical connector, which was observed previously to differ among PAS domains (5). The only other insertion in PYP maps to the turn joining the first two strands (β4 and β5) of the β-scaffold (Figs. 2 and 4). This insertion, found in every PAS domain sequence in Fig. 2, may highlight a significant structural difference distinguishing PYP from other PAS-containing proteins. The N-terminal cap, which is located on the opposing side of the module from the PYP active site, appears the least conserved among PAS domain proteins and in some cases may be substituted with other structures capable of protecting the central β-sheet from solvent. Hydrophobic residues are usually buried in the core of a protein and therefore should show some conservation of their hydrophobic character. 
However, unlike the specificity of particular side-chain to main-chain hydrogen bonds, the nonspecific nature of hydrophobic packing allows more liberty in sequence variation. In particular, complementary substitutions of hydrophobic residues that maintain a well packed hydrophobic core are frequently observed (41). Hydrophobic core residues conserved between PYP and PAS domains include V4, I39, V57, I58, F63, F79, F92, Y94, V120, and F121, which are respectively conserved as or replaced by V4, V/F39, L/V57, L/V58, Y/M/V/L63, F/Y/H79, F/Y92, L/M/F/A94, I/F120, and I/V121 in PAS domains (Figs. 2 and 4). PAS Domain Differences in the PYP Chromophore Environment. Although PYP and PAS-containing proteins share sensor and signal transduction functions, PYP incorporates both functions within a single, small, globular domain of 125 residues, whereas the much larger multidomain PAS-containing proteins like the phytochromes segregate these functions to different domains. The covalently bound 4-hydroxycinnamoyl chromophore of PYP that is necessary for light-mediated negative phototaxis is not found in any of the well characterized PAS domain proteins. Hence, it is expected that residues of the chromophore environment will differ between PYP and PAS domains. Indeed, most of the PYP residues that are not shared with the other PAS domain proteins interact with the chromophore. In the PAS domain proteins in Fig. 2, chromophore-bound PYP Cys69 is deleted, the inward-facing hydrophilic side chains (Y42, E46, and T50) that form the hydrogen-bonding network stabilizing the phenolic hydroxyl at the buried tip of the 4-hydroxycinnamoyl chromophore (27, 42) are replaced with hydrophobic residues, and the R52 side chain that forms the gateway of the chromophore to solvent (27, 36, 42) is converted to Tyr. In the ARNT model (Fig. 
4), the cavity created by the absence of the PYP chromophore is partly filled by the highly conserved H67, which can make a buried salt bridge with the conserved E/D70 in the PAS domains (Fig. 4). However, a predominantly hydrophobic cavity about one-half the size of a heme remains, caused by the reduction in side-chain size as residues Y42, F62, and F96 in PYP become V42, I62, and S96 in ARNT. This cavity might provide insights into the ligand-binding properties of the PAS domains including possible specific
hydrophilic interactions with conserved residues H67 and E70. Interestingly, PAS-conserved ARNT F29, replacing PYP G29, also can be directed into the central cavity because PYP E46 has been replaced with a smaller hydrophobic residue (Cys in ARNT). This explains why G29 was identified as PYP-specific during a BLAST search, as described above. A large hydrophobic residue at this location can help to fill the void left by the absence of PYP’s chromophore. The last major sequence differences between PYP and PAS-containing proteins in the chromophore environment are located at positions 93 and 97, where negative charges are replaced by positive charges. These residues are part of a surface array of positive charges that includes R93, R95, K97, W99, and W101 in ARNT and in most PAS domains. These residues are very close to the insertion located at position 98 in PYP (Fig. 4), which was modeled to be a turn (consistent with predictions by the program DSSP; ref. 43). In PYP, this turn is positioned at the entrance of the chromophore pocket and may be important for a putative interaction with a partner to accomplish the role of PYP in signal transduction. Protein-Protein Dimerization Interface. During PYP’s light cycle, the chromophore undergoes a trans-to-cis isomerization and the protein rearranges slightly to accommodate the new chromophore configuration (36). The largest movements, taking place in the chromophore and the side chain of R52, presumably send the signal to the unknown downstream partner of PYP, leading eventually to negative phototaxis. Therefore, the surface of PYP surrounding these moving residues (Fig. 4) likely provides the interaction face for heterodimerization with the signaling partner of PYP. In other PAS-containing proteins, the PAS domains mediate homo- or heterodimer formation, in most cases with other PAS domains. We suggest that the interaction face of PYP is the prototype for the dimerization interface of the PAS domain protein superfamily. 
This putative dimerization interface includes residues from three regions: (i) a central region (51–68) that includes residues located within 10 Å of Y52, (ii) the loop 95–103 that includes the insertion in the β-scaffold, and (iii) two residues from α3 in the PAS core, H44 and R45 in ARNT. Residues that are exposed to solvent are displayed in Fig. 5A. The molecular surface of these regions is shown in Fig. 5B. Exposed side chains are H44, R45, Y52, Q53, Q55, E56, K60, F65, R95, K97, N98, Q98A, E98B, W99, W101, and R103. Almost all of the residue types found in this interface can form specific contacts to other amino acids and are characteristic of other protein-protein interfaces. The loop composed of residues 95–103, which differs between PYP and the PAS-containing proteins in Fig. 2, might be involved in the recognition specificity for PAS dimerization because of the low sequence homology among PAS domains in this area (Fig. 2).
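The central region of the putative interface is defined above by a distance criterion (residues within 10 Å of Y52). A selection of this kind can be sketched as follows, assuming one representative point per residue (for example, a Cα position); the function name and the toy coordinates are hypothetical and do not come from the ARNT model.

```python
import math


def residues_within(coords, center, cutoff):
    """Return the residues whose representative point lies within `cutoff`
    (same length unit as the coordinates) of the residue named `center`.

    `coords` maps a residue label to an (x, y, z) tuple. The center residue
    itself is excluded from the result.
    """
    origin = coords[center]
    return [
        name
        for name, xyz in coords.items()
        if name != center and math.dist(origin, xyz) <= cutoff
    ]


# Hypothetical coordinates in angstroms, for illustration only.
toy_coords = {
    "Y52": (0.0, 0.0, 0.0),
    "Q53": (3.8, 0.0, 0.0),
    "K60": (6.0, 6.0, 0.0),
    "V4": (12.0, 9.0, 0.0),
}

near_y52 = residues_within(toy_coords, "Y52", 10.0)
print(near_y52)
```

An all-atom variant would take the minimum distance over every atom pair of the two residues; which definition was used for Fig. 5 is not stated, so the per-residue point here is an assumption.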
DISCUSSION
Although sequence comparison alone is insufficient to demonstrate the proposed structural similarities between PYP and the PAS domains, low sequence homology was expected because of the evolutionary diversity among PAS-containing proteins, which populate all three kingdoms of life. Instead, the similarity and potential homology identified between PYP and the PAS domain superfamily is corroborated by finding that a modified PYP sequence, replacing only three residues specific to PYP function, allows automated selection of several PAS domain sequences from a nonredundant protein sequence database, albeit with low scores, as expected. The potential homology between PYP and PAS domains is further supported by our ability to generate, from the PYP crystallographic structure, a well behaved molecular model for the PAS domain from ARNT, a member of the PAS domain superfamily. Indeed, only two insertions and a single deletion were needed for building a 3D model of the PAS-B domain of human ARNT, which exhibits a root-mean-square deviation of 0.76 Å from PYP’s 3D structure, and comparable quality and stability. Moreover, from the resulting molecular model of the ARNT PAS domain, it is possible to identify and understand visually the similarities and differences between PYP and the PAS domains, supporting the proposition for a PAS/PYP prototypical fold. Given that the PAS/PYP module hypothesis is valid, the 3D model of the ARNT PAS-B domain provides useful insights concerning the two major known functions of PAS domains: protein-protein interaction and ligand binding.
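The root-mean-square deviation quoted above compares equivalent atom positions in the model and the template, RMSD = sqrt((1/N) Σ |a_i − b_i|²). The sketch below computes that quantity for two coordinate lists that are assumed to be already superimposed and listed in matching order; the function name and the toy coordinates are hypothetical, and which atoms entered the 0.76 Å value is not restated here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)
    points. No superposition is performed; the inputs are assumed to be
    aligned in space and paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates: every point of the second set is shifted by 1 unit.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

value = rmsd(template, model)
print(value)
```

Residues inserted or deleted relative to the template have no partner and would have to be left out of both lists before the calculation.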
FIG. 5. Predicted PAS functional interactions. (A) Amino acid side chains that might participate in a protein-protein interaction are highlighted. The central segment, formed by residues 51–68, is located within 10 Å of the residue Y52 (in yellow). The second area, from residue 95 to residue 103 (in cyan), is made of a loop in which an insertion occurs at position 98 in each PAS-containing molecule. The third area comprises two residues (orange) adjacent to the central segment (yellow), H44 and R45, whose side chains point toward the solvent. (B) Same as A, but the molecular surface for these residues is displayed.
Protein-protein interactions between the ubiquitous partner ARNT and PAS-containing proteins AHR, SIM, MOPs, TRH, and HIF-1α (9–11, 13, 14, 44, 45) have been identified in vitro and in vivo. These interactions are mediated by both the bHLH region and the PAS domains. Because the bHLH region is a self-dimerizing structural motif (3), the PAS domains evidently
supply the recognition specificity needed for these interactions, rather than driving the interactions per se. This hypothesis is supported by recent work from Zelzer and coworkers (14) revealing that the swapping of PAS domains between SIM and TRH confers the functional specificity of the PAS domain, rather than that of the parent protein. Based on the experimentally determined structures of PYP light-cycle intermediates (36), we propose that the region of the PAS/PYP fold that is involved in protein-protein interaction is centered around residue 52, as shown in Fig. 5. This hypothesis can be tested by site-directed mutagenesis of residues highlighted in this interface (Fig. 5). Several PAS-containing proteins bind ligands and/or cofactors. To date, however, mapping the ligand binding to the PAS domain itself has been demonstrated only for the FixL (16, 21, 4) and AHR (15) proteins. The PAS domain of FixL binds heme, whereas the PAS-B domain of AHR binds dioxin and other polycyclic aromatic hydrocarbons. Interestingly, for both molecules, the minimum size of the PAS domain that is able to bind the ligand is ~130 residues, the size of the entire PAS/PYP module. Because of a reduction in size of several side chains in ARNT compared to PYP, the 3D model of the ARNT PAS domain displays an internal cavity large enough to accommodate a medium-sized ligand (one-half a heme). Thus, the region previously occupied by PYP’s chromophore is a logical choice for a ligand pocket. In summary, PYP appears to exhibit all of the major structural and functional features characteristic of the PAS domain superfamily: a modular domain of ~125–150 residues, a sensor function linked to ligand or cofactor binding, and signal transduction capability governed by heterodimeric assembly. Thus, we propose the testable hypothesis that the entire PYP protein fold is the structural prototype for the modular, 3D, PAS-A and PAS-B domain folds in PAS-containing proteins. 
This PAS/PYP module provides a structural model to guide experimental testing of hypotheses regarding ligand binding, dimerization, and signal transduction in PAS proteins.
We thank Christopher Bruns and Christopher D. Putnam for preliminary sequence searches, alignments, and analyses; C. Bruns for help with the illustrations; and C. Bruns, C. D. Putnam, Ulrich K. Genick, and John Tainer for useful criticism and discussion. Research on PYP is funded by the National Institutes of Health Grant GM37684 (to E.D.G.).
1. Nambu, J.R., Lewis, J.O., Jr., Wharton, K.A. & Crews, S.T. (1991) Cell 67, 1157–1167.
2. Crews, S.T., Thomas, J.B. & Goodman, C.S. (1988) Cell 52, 143–151.
3. Hoffman, E.C., Reyes, H., Chu, F.-F., Sander, F., Conley, L.R., Brooks, B.A. & Hankinson, O. (1991) Science 252, 954–958.
4. Ponting, C.P. & Aravind, L. (1997) Curr. Biol. 7, 674–677.
5. Zhulin, I.B., Taylor, B.L. & Dixon, R. (1997) Trends Biochem. Sci. 22, 331–333.
6. King, D.P., Zhao, Y., Sangoram, A.M., Wilsbacher, L.D., Tanaka, M., Antoch, M.P., Steeves, T.D.L., Vitaterna, M.H., Kornhauser, J.M., Lowrey, P.L., et al. (1997) Cell 89, 641–653.
7. Kay, S.A. (1997) Science 276, 753–754.
8. Huala, E., Oeller, P.W., Liscum, E., Han, I.-S., Larsen, E. & Briggs, W.R. (1997) Science 278, 2120–2123.
9. Huang, Z.J., Edery, I. & Rosbash, M. (1993) Nature (London) 364, 259–262.
10. Lindebro, M.C., Poellinger, L. & Whitelaw, M.L. (1995) EMBO J. 14, 3528–3539.
11. McGuire, J., Coumailleau, P., Whitelaw, M.L., Gustafsson, J.-A. & Poellinger, L. (1996) J. Biol. Chem. 270, 31353–31357.
12. Jiang, B.-H., Rue, E., Wang, G.L., Roe, R. & Semenza, G.L. (1996) J. Biol. Chem. 271, 17771–17778.
13. Hogenesch, J.B., Chan, W.K., Jackiw, V.H., Brown, R.C., Gu, Y.-Z., Pray-Grant, M., Perdew, G.H. & Bradfield, C.A. (1997) J. Biol. Chem. 272, 8581–8593.
14. Zelzer, E., Wappner, P. & Shilo, B.Z. (1997) Genes Dev. 11, 2079–2089.
15. Fukunaga, B.N., Probst, M.R., Reisz-Porszasz, S. & Hankinson, O. (1995) J. Biol. Chem. 270, 29270–29278.
16. Monson, E.K., Weinstein, M., Ditta, G.S. & Helinski, D.R. (1992) Proc. Natl. Acad. Sci. USA 89, 4280–4284.
17. Meyer, T.E. (1985) Biochim. Biophys. Acta 806, 175–183.
18. Meyer, T.E., Yakali, E., Cusanovich, M.A. & Tollin, G. (1987) Biochemistry 26, 418–423.
19. Meyer, T.E., Tollin, G., Hazzard, J.H. & Cusanovich, M.A. (1989) Biophys. J. 56, 559–564.
20. Sprenger, W.W., Hoff, W.D., Armitage, J.P. & Hellingwerf, K.J. (1993) J. Bacteriol. 175, 3096–3104.
21. Lagarias, D.M., Wu, S.-H. & Lagarias, J.C. (1995) Plant Mol. Biol. 29, 1127–1142.
22. Linden, H. & Macino, G. (1997) EMBO J. 16, 98–109.
23. Crosthwaite, S., Dunlap, J.C. & Loros, J.J. (1997) Science 276, 753–754.
24. Feng, D.F. & Doolittle, R.F. (1987) J. Mol. Evol. 25, 351–360.
25. Devereux, J., Haeberli, P. & Smithies, O. (1984) Nucleic Acids Res. 12, 387–395.
26. Hahn, M.E., Karchner, S.I., Shapiro, M.A. & Perera, S.A. (1997) Proc. Natl. Acad. Sci. USA 94, 13743–13748.
27. Borgstahl, G.E.O., Williams, D.R. & Getzoff, E.D. (1995) Biochemistry 34, 6278–6287.
28. McRee, D.E. (1992) J. Mol. Graphics 10, 44–47.
29. Tuffery, P., Etchebest, C., Hazout, S. & Lavery, R. (1991) J. Biomol. Struct. Dyn. 8, 1267–1289.
30. Brünger, A.T. (1992) X-PLOR, A system for X-ray crystallography and NMR (Yale University, New Haven, CT), Version 3.1.
31. Powell, M.J.D. (1977) Math. Program. 12, 241–254.
32. Brooks, B., Bruccoleri, R., Olafson, B., States, D., Swaminathan, S. & Karplus, M. (1983) J. Comp. Chem. 4, 187–217.
33. Brünger, A.T. & Karplus, M. (1988) Proteins 4, 148–156.
34. Roussel, A. & Cambillau, C. (1989) TURBO-FRODO in Silicon Graphics Geometry Partners Directory (Silicon Graphics, Mountain View, CA), Version 5.2.
35. Laskowski, R.A., MacArthur, M.W., Moss, D.S. & Thornton, J.M. (1993) J. Appl. Crystallogr. 26, 283–291.
36. Genick, U.K., Borgstahl, G.E.O., Ng, K., Ren, Z., Pradervand, C., Burke, P.M., Srajer, V., Teng, T.-Y., Schildkamp, W., McRee, D.E., et al. (1997) Science 275, 1471–1475.
37. Leszczynski, J.F. & Rose, G.D. (1986) Science 234, 849–855.
38. Baca, M., Borgstahl, G.E.O., Boissinot, M., Burke, P.M., Williams, D.R., Slater, K.A. & Getzoff, E.D. (1994) Biochemistry 33, 14369–14377.
39. Wang, G.L., Jiang, B.-H., Rue, E.A. & Semenza, G.L. (1995) Proc. Natl. Acad. Sci. USA 92, 5510–5514.
40. Kallio, P.J., Pongratz, I., Gradin, K., McGuire, J. & Poellinger, L. (1997) Proc. Natl. Acad. Sci. USA 94, 5667–5672.
41. Getzoff, E.D., Tainer, J.A., Stempien, M.M., Bell, G.I. & Hallewell, R.A. (1989) Proteins Struct. Funct. Genet. 5, 322–336.
42. Genick, U.K., Devanathan, S., Meyer, T.E., Canestrelli, I.L., Williams, E., Cusanovich, M.A., Tollin, G. & Getzoff, E.D. (1997) Biochemistry 36, 8–14.
43. Kabsch, W. & Sander, C. (1983) Biopolymers 22, 2577–2637.
44. Ohshiro, T. & Saigo, K. (1997) Development (Cambridge, U.K.) 124, 3975–3986.
45. Sonnenfeld, M., Ward, M., Nystrom, G., Mosher, J., Stahl, S. & Crews, S. (1997) Development (Cambridge, U.K.) 124, 4571–4582.
46. Kraulis, P.J. (1991) J. Appl. Crystallogr. 24, 946–950.
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5891–5898, May 1998 Colloquium Paper This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
New methods of structure refinement for macromolecular structure determination by NMR
(coupling constants/chemical shifts/conformational database/diffusion anisotropy/dipolar couplings)
G. MARIUS CLORE* AND ANGELA M. GRONENBORN*
Laboratory of Chemical Physics, Building 5, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892–0520
ABSTRACT Recent advances in multidimensional NMR methodology have permitted solution structures of proteins in excess of 250 residues to be solved. In this paper, we discuss several methods of structure refinement that promise to increase the accuracy of macromolecular structures determined by NMR. These methods include the use of a conformational database potential and direct refinement against three-bond coupling constants, secondary 13C shifts, 1H shifts, T1/T2 ratios, and residual dipolar couplings. The latter two measurements provide long range restraints that are not accessible by other solution NMR parameters.
The two major techniques for determining the three-dimensional structures of macromolecules at atomic resolution are x-ray crystallography in the solid state (single crystals) and NMR spectroscopy in solution. Unlike crystallography, NMR measurements are not hampered by the ability or inability of a protein to crystallize. The size of macromolecular structures that can be solved by NMR has been increased dramatically over the last few years (1). The development of a wide range of two-dimensional (2D) NMR experiments in the early 1980s culminated in the determination of the structures of a number of small proteins (2, 3). Under exceptional circumstances, 2D NMR techniques can be applied successfully to determine structures of proteins up to 100 residues (4, 5). Beyond 100 residues, however, 2D NMR methods fail, principally because of spectral complexity that cannot be resolved in two dimensions. 
In the late 1980s and early 1990s, a series of major advances took place in which the spectral resolution was increased by extending the dimensionality to three and four dimensions (1). In addition, by combining such multidimensional experiments with heteronuclear NMR, problems associated with large linewidths can be circumvented by making use of heteronuclear couplings that are large relative to the linewidths. The first successful application of these methods to a protein greater than 12 kDa was achieved in 1991 with the determination of the solution structure of interleukin 1β, a protein of 18 kDa and 153 residues (6). Concomitant with spectroscopic advances, significant improvements have taken place in the accuracy with which macromolecular structures can be determined. Thus, it is now potentially feasible to determine the structures of proteins in the 15- to 35-kDa range at a resolution comparable to 2.5-Å resolution crystal structures (7). The upper limit of applicability is probably 60–70 kDa, and the largest single-chain proteins solved to date are 30 kDa, comprising 260 residues (8, 9). In this paper, we discuss a number of new refinement strategies aimed at both facilitating NMR structure determination and increasing the accuracy of the resulting structures. These include direct refinement against three-bond coupling constants (10) and 13C and 1H shifts (11–13), as well as the use of conformational database potentials (14, 15). More recently, methods have been developed to obtain structural restraints that characterize long range order a priori (16–18). These methods include making use of the dependence of heteronuclear relaxation on the rotational diffusion anisotropy of nonspherical molecules and of residual dipolar contributions to one-bond heteronuclear couplings arising from small degrees of alignment of molecules in a magnetic field.
General Principles of NMR Structure Determination. 
Irrespective of the algorithm used, any structure determination by NMR seeks to find the global minimum region of a target function Etot given by Etot = Ecov + Evdw + ENMR [1], where Ecov, Evdw, and ENMR are terms representing the covalent geometry (bonds, angles, planarity, and chirality), the nonbonded contacts, and the experimental NMR restraints, respectively (19). Algorithms currently used include simulated annealing in both Cartesian (20, 21) and torsion angle space (22), metric matrix distance geometry (23), and minimization with a variable target function in torsion angle space (24). The main source of geometric information contained in the experimental NMR restraints is provided by the nuclear Overhauser effect (NOE). The NOE (at short mixing times) is proportional to the inverse sixth power of the distance between the protons, so its intensity falls off very rapidly with increasing distance between proton pairs. Consequently, NOEs usually are observed only for proton pairs separated by ≤5–6 Å. Despite the short range nature of the observed interactions, the short approximate interproton distance restraints derived from NOE measurements can be highly conformationally restrictive, particularly when they involve residues that are far apart in the sequence but close together in space (1, 19). Systematic bias arising from the different algorithms used to calculate the structures may be introduced via the first two terms, Ecov and Evdw, in Eq. 1. The values of bond lengths, bond angles, planes, and chirality are known to very high accuracy, so it is clear that the deviations from idealized geometry, as represented by the term Ecov, should be kept very small. The second term, Evdw, representing the nonbonded contacts, is associated with considerably more uncertainty than the covalent geometry (25, 26). Given the numerous ways to represent Evdw (for example, a simple van der Waals repulsion term or a complete empirical energy function including a van der
*To whom reprint requests should be addressed. e-mail:
[email protected] and
[email protected]. 0027–8424/98/955891–8$0.00/0 PNAS is available online at http://www.pnas.org. Abbreviations: 2D, two-dimensional; NOE, nuclear Overhauser effect.
NEW METHODS OF STRUCTURE REFINEMENT FOR MACROMOLECULAR STRUCTURE DETERMINATION BY NMR
Waals Lennard-Jones 6–12 potential, an electrostatic potential, and a hydrogen bonding potential), it is evident that variability is introduced via Evdw. It is therefore essential to ensure that the calculated structures display good nonbonded contacts. The uncertainties associated with the covalent geometry and van der Waals terms can introduce errors of 0.3 Å in the coordinates (26). The major determinant of accuracy, however, resides in the number and quality of the experimental NMR restraints that enter into the third term, ENMR, in Eq. 1. Although a high resolution, carefully refined x-ray structure of a given protein may not be identical to the “true” solution structure, it is likely to be reasonably close in many instances, as evidenced, for example, by the excellent agreement (≤ 1 Hz rms deviation) between the experimentally determined values of the 3JHNα coupling constants in solution and their corresponding calculated values from crystal structures (10, 27, 28). Moreover, it is generally the case that three-bond coupling constants, 13C secondary shifts, and 1H shifts calculated from high resolution crystal structures agree better with the experimentally measured values than those calculated from the corresponding NMR structures (refined in the absence of coupling constant and chemical shift restraints) (10–13, 25). It is therefore instructive to examine the dependence of the backbone rms difference between NMR and x-ray structures on the precision of the NMR structures (25). This dependence is shown in Fig. 1 for 14 proteins, for which both NMR and x-ray structures are available and which are representative of some of the different programs used in NMR protein structure determination (25). A linear relationship is evident. In addition, in cases in which both low and high precision NMR structures are available for the same protein, the high precision structure is significantly closer to the x-ray structure than the low precision one. 
The data can be fit to a straight line with a correlation coefficient of 0.9 and a limiting rms difference between NMR and x-ray structures of 0.45 Å. Moreover, all of the monomeric NMR structures with a precision of better than 0.5 Å are 0.85 Å or less away from the corresponding crystal structures. Given the fact that the coordinate errors in 1.5- to 2-Å resolution x-ray structures are 0.2–0.3 Å (7, 29), these data provide empirical evidence that an accuracy of 0.4–0.8 Å in the backbone coordinates is attainable under appropriate circumstances by using current NMR methodology (25).
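As a quick arithmetic check on the numbers quoted here, the fitted line can be evaluated at a precision of 0.5 Å, using the slope of 0.70 given in the caption to Fig. 1 together with the 0.45 Å intercept. The symbols \Delta (backbone rms difference to the x-ray structure) and p (backbone precision) are introduced only for this illustration.

```latex
\Delta \;\approx\; 0.70\,p + 0.45\ \text{\AA},
\qquad
p = 0.5\ \text{\AA}
\;\Longrightarrow\;
\Delta \;\approx\; 0.70 \times 0.5 + 0.45 \;=\; 0.80\ \text{\AA}
```

This value is in line with the statement that monomeric structures with a precision better than 0.5 Å lie within 0.85 Å of the corresponding crystal structures.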
FIG. 1. Correlation between backbone precision of NMR structures and their agreement with x-ray structures. Where the backbone rms difference between the average NMR coordinates (NMR) and the corresponding x-ray structures is available, the values are represented as circles. When only the average backbone rms difference between an ensemble of NMR structures and the corresponding x-ray structure is quoted in the literature, squares are used. The straight line represents a linear fit to the data with a slope of 0.70, an intercept of 0.45 Å, and a correlation coefficient of 0.9. The structures are as follows: p53(mon), p53(dim), and p53(tet) are the monomer, dimer, and tetramer, respectively, of the p53 oligomerization domain (51); IL-8, interleukin-8 monomer (52); Hir(new), highly refined structure of hirudin (53); IL-1β, interleukin-1β (6, 7); BPTI, bovine pancreatic trypsin inhibitor (54); eglin c (55); PC, French bean plastocyanin (56); tendamistat (57); Hir(old), hirudin (58); Cyp-CsA, cyclophilin–cyclosporin A complex (59); Mb, carbonmonoxy myoglobin (helices plus heme; ref. 60); CPI, potato carboxypeptidase inhibitor (61); PCP-B, procarboxypeptidase B (62); and BSPI, barley serine proteinase inhibitor 2 (63). The values given exclude conformationally disordered regions as described in the papers cited. Note that the NMR structures of IL-8 and Hir(old) were obtained before the corresponding x-ray structures and that the NMR structure of tendamistat was obtained independently of and at the same time as the x-ray structure. Reproduced from ref. 25.
The accuracy of NMR structures will be affected by errors in the interproton distance restraints. These errors can arise from two sources: (i) misassignments and (ii) errors in distance estimates. Errors due to misassignments may be quite common in low resolution NMR structures. Fortunately, in many cases, these errors are of relatively minor consequence and do not result in the generation of an incorrect fold. Systematic errors in distance estimates may be introduced in attempts to obtain precise distance restraints.
For example, iterative relaxation matrix analysis of the NOE intensities (30) and direct refinement against the NOE intensities (31, 32), while accounting for spin diffusion, can result in systematic errors from several sources, such as the presence of internal motions (not only on the picosecond time scale but also on the nanosecond to millisecond time scales); insufficient time for complete relaxation back to equilibrium to occur between successive scans; and differential efficiency of magnetization transfer between protons and their attached heteronucleus in multidimensional heteronuclear NOE experiments (26). For these reasons, it is probably prudent at the present time, at least in cases dealing with proteins, to convert the NOE intensities into loose approximate interproton distance restraints (e.g., 1.8–2.7 Å, 1.8–3.3 Å, 1.8–5.0 Å, and, if appropriate, 1.8–6.0 Å for strong, medium, weak, and very weak NOEs, respectively) with the lower bounds given by the sum of the van der Waals radii of two protons. These distance ranges are sufficiently generous to take into account untoward effects in the conversion of NOE intensities into distances (2, 3, 19, 26). Using this approach, systematic errors in the interproton distance restraints generally will be introduced only at the boundary of two distance ranges. In the case of experimental structures calculated with an incomplete set of NOE restraints (i.e., comprising <90% of the structurally useful NOEs), there is no doubt that errors, arising both from misassignments and from the incorrect classification of NOEs into the various loose approximate distance ranges, will occur, resulting in less accurate structures.
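The conversion of NOE intensity classes into loose distance ranges can be sketched as follows. This is only an illustration: the bounds are the example ranges quoted in the text, while the function names and the dictionary layout are hypothetical and do not correspond to any particular refinement program.

```python
# Loose interproton distance ranges (in Å) for each NOE intensity class,
# taken from the example ranges quoted in the text.
NOE_BOUNDS = {
    "strong": (1.8, 2.7),
    "medium": (1.8, 3.3),
    "weak": (1.8, 5.0),
    "very weak": (1.8, 6.0),
}


def noe_restraint(noe_class):
    """Return the (lower, upper) distance bounds in Å for an NOE class."""
    return NOE_BOUNDS[noe_class]


def violation(distance, noe_class):
    """Return how far (in Å) a distance lies outside its allowed range.

    The result is 0.0 when the restraint is satisfied.
    """
    lower, upper = noe_restraint(noe_class)
    if distance < lower:
        return lower - distance
    if distance > upper:
        return distance - upper
    return 0.0


if __name__ == "__main__":
    # A 3.0 Å contact satisfies a "medium" restraint but violates a "strong" one.
    print(violation(3.0, "medium"), violation(3.0, "strong"))
```

Because the ranges share a common lower bound and differ only in the upper bound, a misclassified NOE only matters when the true distance lies close to the boundary between two classes.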
This loss in accuracy is due to the fact that, until a significant degree of redundancy is present in the NOE restraints, such errors often can be accommodated readily without unduly compromising the agreement with either the experimental NMR restraints or the restraints for covalent geometry and nonbonded contacts. However, once 90% of the structurally useful NOEs have been assigned and incorporated into the restraints set, corresponding typically to an average of 15 restraints per residue with >60% of the NOEs involving unique proton pairs, two sensitive and complementary techniques can be employed easily to identify and correct such errors. The first method involves an analysis of the distribution of restraint violations in the ensemble of calculated structures. If a given restraint is systematically violated in more than, for example, 20% of the calculated structures, even by as little as 0.1 Å, it is highly likely that it should either be reclassified into the next looser category (i.e., strong to medium, medium to weak) or that errors in NOE assignments are present (26).
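The violation analysis just described can be sketched in a few lines. The 0.1 Å and 20% thresholds are the example values from the text; the function name and the layout of the input (one list of per-restraint violations for each structure) are assumptions made for illustration.

```python
def flag_suspect_restraints(violations, min_violation=0.1, max_fraction=0.2):
    """Return indices of restraints that are systematically violated.

    violations[s][r] is the violation (in Å) of restraint r in structure s.
    A restraint is flagged when it is violated by at least min_violation in
    more than max_fraction of the structures in the ensemble.
    """
    n_structures = len(violations)
    n_restraints = len(violations[0])
    suspects = []
    for r in range(n_restraints):
        count = sum(
            1 for s in range(n_structures) if violations[s][r] >= min_violation
        )
        if count / n_structures > max_fraction:
            suspects.append(r)
    return suspects


if __name__ == "__main__":
    # Hypothetical ensemble of five structures and three restraints.
    ensemble = [
        [0.00, 0.15, 0.0],
        [0.00, 0.12, 0.3],
        [0.00, 0.00, 0.0],
        [0.05, 0.00, 0.0],
        [0.00, 0.00, 0.0],
    ]
    print(flag_suspect_restraints(ensemble))
```

A flagged restraint is a candidate either for reclassification into the next looser distance range or for a check of its assignment; the sketch does not decide between the two.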
The second approach uses complete cross-validation to assess the completeness of the experimental restraints and the degree to which each distance restraint can be predicted by the remaining ones (33). Typically, this approach involves calculating a series of simulated annealing structures in which the restraints are partitioned randomly into a test set comprising 10% of the data and a reference set. Only the reference set is incorporated into the target function, and each calculation is carried out with a different test and reference set pair, thereby permitting one to fully explore the constraining power of the NOE restraints. The average agreement with all of the test sets as well as the atomic rms shift after complete cross-validation then provides an indicator of accuracy. Finally, a further check on the correctness of the structures is provided by verifying that all short interproton distances (e.g., <3.5 Å) predicted by the structures are in fact observed in the NOE spectra (25). Indeed, this procedure forms the basis of the iterative refinement process; the structures at each successive stage of refinement are used to predict all short interproton distance contacts, which then are searched for in the NOE spectra. In general, the vast majority of interproton distances <3.5 Å, and certainly all of those <2.5 Å, should be observed. Exceptions can occur occasionally if the linewidths of the corresponding resonances are broadened severely because of some sort of intermediate chemical exchange process on the chemical shift scale (caused, for example, by multiple conformations or microheterogeneity) resulting in severe attenuation of the NOE cross peaks. Additional Experimental NMR Restraints that Define Short Range Order. Although the interproton distance restraints derived from NOEs provide the mainstay of NMR structure determination, direct refinement against other experimental NMR restraints is both feasible and desirable. 
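The random partitioning used for complete cross-validation of the NOE restraints, as described earlier in this passage, might be organized as in the sketch below. It assumes that the test sets are disjoint and together cover every restraint once, which is one reasonable reading of "complete" but is not spelled out in the text; the function name and the fixed random seed are likewise hypothetical.

```python
import random


def cross_validation_partitions(n_restraints, test_fraction=0.10, seed=0):
    """Yield (reference, test) index lists for complete cross-validation.

    The restraint indices are shuffled once and dealt into disjoint test
    sets of roughly test_fraction of the data each (an assumption of this
    sketch); the reference set is everything not in the current test set.
    """
    rng = random.Random(seed)
    indices = list(range(n_restraints))
    rng.shuffle(indices)
    n_sets = round(1 / test_fraction)
    for k in range(n_sets):
        test = sorted(indices[k::n_sets])
        test_members = set(test)
        reference = [i for i in range(n_restraints) if i not in test_members]
        yield reference, test


if __name__ == "__main__":
    for reference, test in cross_validation_partitions(50):
        # Only the reference set would enter the target function; the test
        # set is used afterwards to measure how well it is predicted.
        print(len(reference), len(test))
```

Each pair would drive one simulated annealing calculation, and the agreement with the omitted test restraints is then averaged over all pairs.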
In this section, we consider experimental restraints that provide short range structural information, specifically three-bond coupling constants (10), secondary 13C chemical shifts (11), and 1H chemical shifts (12, 13).
Three-Bond Coupling Constants. Three-bond coupling constants are related to torsion angles by the Karplus (34) equation $^{3}J(\lambda) = A\cos^{2}\lambda + B\cos\lambda + C$, where λ is the torsion angle corresponding to the three-bond coupling, and A, B, and C are constants obtained by nonlinear optimization to yield the best fit between experimental 3J values and values calculated from a series of very high resolution x-ray structures. The coupling constants can be converted directly into loose torsion angle restraints (19). Alternatively, direct refinement against coupling constants can be achieved by adding the potential $E_{J} = k_{J}(J_{\mathrm{obs}} - J_{\mathrm{calc}})^{2}$ (where kJ is a force constant and Jobs and Jcalc are the observed and calculated values of the coupling constants) (10). From the standpoint of refinement, the most useful coupling constant, insofar as it can be measured accurately and easily by quantitative J correlation spectroscopy and its Karplus relationship has been parametrized reliably, is the 3JHNα coupling, which is related directly to the backbone torsion angle φ (35). The Karplus curve for 3JHNα, however, is symmetric about φ = –120°, such that one cannot distinguish φ = –120° + α from φ = –120° – α from the 3JHNα coupling alone (36). Where appropriate, this degeneracy can be resolved by quantitative J-correlation measurement of the 3JCOCO coupling, which has its steepest dependence close to φ = –120° (36). It is also worth noting that the three-bond amide deuterium isotope shift experienced by 13Cα resonances, 3∆Cα(ND), is related to the backbone ψ angle by a Karplus-type relationship of the form 3∆Cα(ND) = 30.1 + 22.2 cos(ψ – 90°) ppb (37) and hence can be incorporated into structure refinement in exactly the same manner as three-bond coupling constants.
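The Karplus relationship and the coupling-constant potential described above can be illustrated with a minimal sketch. Two things here are assumptions rather than statements from this paper: the numerical coefficients (A = 6.51, B = –1.76, C = 1.60 Hz, a commonly quoted parametrization for the HN–Hα coupling) and the phase relation λ = φ – 60° between the Karplus angle and the backbone torsion angle. The function names are hypothetical.

```python
import math

# Assumed Karplus coefficients in Hz for the HN-Halpha coupling; these are
# illustrative values and are not taken from this paper.
A, B, C = 6.51, -1.76, 1.60


def j_hn_alpha(phi_deg):
    """Calculated three-bond HN-Halpha coupling (Hz) for backbone angle phi.

    Assumes the Karplus angle is lambda = phi - 60 degrees.
    """
    lam = math.radians(phi_deg - 60.0)
    return A * math.cos(lam) ** 2 + B * math.cos(lam) + C


def e_j(j_obs, phi_deg, k_j=1.0):
    """Harmonic coupling-constant potential k_J * (J_obs - J_calc)^2."""
    return k_j * (j_obs - j_hn_alpha(phi_deg)) ** 2


if __name__ == "__main__":
    # The curve is symmetric about phi = -120 degrees, so these two angles
    # give the same coupling and cannot be told apart from it alone.
    print(j_hn_alpha(-90.0), j_hn_alpha(-150.0))
    print(e_j(9.0, -120.0))
```

The equal couplings at φ = –120° ± 30° show the degeneracy discussed in the text, which is why a second coupling with a steep dependence near –120° is useful.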
Secondary 13C Chemical Shifts. There is a clear empirical correlation between the protein backbone conformation, defined in terms of the φ and ψ torsion angles, and the 13Cα and 13Cβ secondary chemical shifts (that is, the difference between observed shifts and random coil shifts) (38, 39). In addition, ab initio quantum mechanical calculations have indicated that the φ, ψ angles dominate shielding for Cα and Cβ atoms (40). Because the secondary 13Cα and 13Cβ shifts provide information on ψ as well as φ and because they are readily measured, it is clearly useful to incorporate them directly into the refinement algorithm. The strategy that we used makes use of an empirical surface describing the expected Cα and Cβ secondary chemical shifts as a function of the backbone torsion angles φ and ψ, derived from the structurally ordered regions of a set of four proteins whose 13C chemical shifts were known and for which high resolution crystal structures are available (38). The expectation surface is given by $C\alpha_{\mathrm{expected}}(\phi,\psi) = \sum_{k}\{C\alpha_{k}\exp[-((\phi-\phi_{k})^{2}+(\psi-\psi_{k})^{2})/S]\} \,/\, \sum_{k}\exp[-((\phi-\phi_{k})^{2}+(\psi-\psi_{k})^{2})/S]$, and similarly for Cβexpected (where the sums run over the database residues k, and S is a Gaussian scale factor given by $r^{2}/\ln 2$, where r is the radius of the Gaussian; in this case r = 17.7° and S = 450). The average rms difference between the observed chemical shift values and the empirical surface is 1.1 ppm. Direct refinement against the 13Cα and 13Cβ shifts is carried out by adding the potential $E_{\mathrm{Cshift}} = k_{\mathrm{Cshift}}\sum_{i}[\Delta C\alpha_{i}(\phi,\psi)^{2} + \Delta C\beta_{i}(\phi,\psi)^{2}]$, where $\Delta Cn_{i}(\phi,\psi) = Cn_{\mathrm{expected}}(\phi_{i},\psi_{i}) - Cn_{\mathrm{observed},i}$ (n = α or β) and kCshift is a force constant (with a value chosen to yield an rms difference between observed and calculated shifts of 1 ppm) (11). To use simulated annealing to improve the agreement of the observed and expected carbon chemical shifts, the partial derivatives of the energy along φ and ψ (i.e., the forces along φ and ψ) also must be calculated. These are given by $\partial E_{\mathrm{Cshift}}/\partial\phi = 2k_{\mathrm{Cshift}}[\Delta C\alpha\,\partial C\alpha_{\mathrm{expected}}/\partial\phi + \Delta C\beta\,\partial C\beta_{\mathrm{expected}}/\partial\phi]$, and similarly for ψ. Because there is no explicit function fitted to the expectation values, the partial derivatives of Cαexpected and Cβexpected with respect to φ and ψ are approximated by the local slopes of the expectation value grid about the grid point (φ, ψ) at which the energy is evaluated. Although the information contained in the secondary 13Cα and 13Cβ chemical shifts is to some extent redundant with that offered by 3JHNα coupling constants, the two experimental measures are complementary (11). Thus, the values of the 3JHNα coupling constants depend only on φ, whereas the 13Cα and 13Cβ chemical shifts depend on both φ and ψ. Moreover, 3JHNα coupling constants may not be measurable for all residues because of small values of the couplings, line broadening, or chemical shift overlap of the backbone nitrogen atoms. In contrast, 13Cα and 13Cβ shifts are obtained readily for almost all residues.
1H Chemical Shifts. Proton chemical shifts are influenced by short range ring current effects from aromatic groups, magnetic anisotropy of C=O and C-N bonds, and electric field effects arising from charged groups. Recent developments in empirical models for 1H chemical shift calculations have shown that it is now possible to predict 1H chemical shifts for nonexchangeable protons to within 0.23–0.25 ppm for proteins for which high resolution crystal structures are available (41, 42). The calculated 1H chemical shift σcalc can be decomposed into four terms: the “random coil” (σrandom), “ring current” (σring), “magnetic anisotropy” (σani), and “electric field” (σE) shifts (41). σring depends on the distance and orientation of the aromatic ring to the proton of interest. σani represents the sum of the anisotropies arising from the C=O and C-N bonds of the backbone and the side chain functional groups of Asp, Glu, Asn, and Gln and depends on distance (r⁻³) and orientation of the proton from these functional groups. Finally, σE depends on the distance (r⁻²) between the charged heavy atom
and the proton, the angle between the charged heavy atom–proton and C–proton vectors, and the charge on the heavy atom. Direct refinement against 1H chemical shifts is carried out by adding a 1H chemical shift term, $E_{\mathrm{prot}} = \sum_{i} k_{\mathrm{prot}}(\sigma_{\mathrm{calc},i} - \sigma_{\mathrm{obs},i})^{2}$, where kprot is the force constant and σobs,i and σcalc,i are the observed and calculated 1H chemical shifts, respectively, of proton i (12). For nonstereospecifically assigned methylene and methyl groups, a modification of Eprot is required to make maximal use of the shift information (13). Specifically, this involves making use of a set of potentials that involve the sums and differences of the chemical shifts to automatically handle chemical shifts involving prochiral centers without the need for making a priori stereospecific assignments (13).
Results of Refinement Against Three-Bond Coupling Constants and 13C and 1H Shifts. Provided there are no severe errors in the interproton distance restraints, refinement against 3JHNα coupling constants, 13C shifts, and 1H shifts reduces the rms difference between calculated and observed values to approximately the level of the expected errors (0.5–1 Hz, 1 ppm, and 0.2–0.3 ppm, respectively) without significantly impairing the agreement with the other restraints in the target function (i.e., experimental interproton distance and torsion angle restraints, covalent geometry, and nonbonded contacts) (10–13). In addition, provided the quality of the initial structures is high, refinement results in only small overall atomic rms shifts with no increase in precision at the expense of accuracy. We have found 13C shifts particularly useful in regions that are ordered but possess no regular secondary structure. Examples that come to mind are the N-terminal tail of the transcription factor GAGA (43) and the transcriptional coactivator HMG-I/Y (44) bound to the minor groove of DNA. In such cases, the secondary 13C shifts permit one to exclude easily certain backbone conformations.
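To make the use of secondary 13C shifts more concrete, the sketch below builds an expectation value at a given backbone conformation as a Gaussian-weighted average over database points and estimates its local slopes by finite differences. Several things are assumptions of the sketch rather than statements of the method used in ref. 11: the normalized Gaussian weighting, the definition of the scale factor as r²/ln 2 (chosen because it gives a value close to 450 for r = 17.7°), the wrapping of angular differences, and the tiny database, which is entirely hypothetical.

```python
import math


def gaussian_scale(radius_deg):
    """Scale factor S for which exp(-r^2 / S) = 0.5 (assumed definition)."""
    return radius_deg ** 2 / math.log(2.0)


def wrap(delta_deg):
    """Wrap an angular difference into the range [-180, 180) degrees."""
    return (delta_deg + 180.0) % 360.0 - 180.0


def expected_shift(phi, psi, database, radius_deg=17.7):
    """Gaussian-weighted mean secondary shift (ppm) at (phi, psi).

    database is a list of (phi_k, psi_k, shift_k) tuples.
    """
    s = gaussian_scale(radius_deg)
    numerator = 0.0
    denominator = 0.0
    for phi_k, psi_k, shift_k in database:
        weight = math.exp(-(wrap(phi - phi_k) ** 2 + wrap(psi - psi_k) ** 2) / s)
        numerator += weight * shift_k
        denominator += weight
    return numerator / denominator


def local_slopes(phi, psi, database, step=10.0):
    """Central-difference slopes of the expectation surface (ppm per degree)."""
    d_phi = (
        expected_shift(phi + step, psi, database)
        - expected_shift(phi - step, psi, database)
    ) / (2.0 * step)
    d_psi = (
        expected_shift(phi, psi + step, database)
        - expected_shift(phi, psi - step, database)
    ) / (2.0 * step)
    return d_phi, d_psi


if __name__ == "__main__":
    # Hypothetical two-point database: a helical and an extended residue.
    database = [(-60.0, -45.0, 3.0), (-120.0, 130.0, -1.5)]
    print(expected_shift(-60.0, -45.0, database))
    print(local_slopes(-90.0, 40.0, database))
```

The slopes play the role of the forces along φ and ψ: a conformation whose expected shift disagrees with the observed one is pushed along the direction in which the surface changes toward the observed value.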
Whereas coupling constants and 13C shifts are related directly to specific torsion angles, 1H shifts are influenced by close spatial proximity of various functional groups and are particularly useful in the presence of aromatic groups. Indeed, 1H shift refinement was critical in establishing the correct dimer interface in the structure of the C-terminal DNA binding domain of HIV-1 integrase (45). Another example is provided by Fig. 2, which illustrates the effect of 1H shift refinement, arising from the presence of a tryptophan residue, on the active site of reduced human thioredoxin (12).
Additional NMR Restraints that Define Long Range Order. Until recently, NMR structure determination has relied exclusively on restraints whose information is entirely local and restricted to atoms close in space, specifically NOE-derived short (<5 Å) interproton distance restraints, which may be supplemented by coupling constants, 13C secondary shifts, and 1H shifts as described above. The success of these methods is mainly due to the fact that short interproton distances between units far apart in a linear array are conformationally highly restrictive. However, there are numerous cases in which restraints that define long range order can supply invaluable structural information (16, 17). In particular, they permit the relative positioning of structural elements that do not have many short interproton distance contacts between them. Examples of such systems include modular and multidomain proteins and linear nucleic acids. Two approaches recently have been introduced that directly provide restraints that characterize long range order a priori. The first relies on the dependence of heteronuclear (15N or 13C) longitudinal (T1) and transverse (T2) relaxation times, specifically T1/T2 ratios, on rotational diffusion anisotropy (16), and the second relies on residual dipolar couplings in oriented macromolecules (17, 18).
The two methods provide restraints that are related in a simple geometric manner to the orientation of one-bond internuclear vectors (e.g., N-H and C-H) relative to an external tensor. In the case of the T1/T2 ratios, the tensor is the diffusion tensor (16). In the case of residual dipolar couplings, the tensor may be the magnetic susceptibility tensor for molecules aligned in a magnetic field (17), the molecular alignment tensor for molecules aligned by anisotropic media such as liquid crystals (46), the electric field tensor for molecules aligned by an electric field, or the optical absorption tensor for molecules aligned by polarized light.
FIG. 2. View of the active site and neighboring regions of reduced human thioredoxin showing a superposition of 40 simulated annealing structures before (blue) and after (red) 1H chemical shift refinement. Reproduced from ref. 12.
Refinement Against T1/T2 Ratios. Heteronuclear relaxation has been used for a long time to provide information on internal dynamics. The 15N transverse relaxation time T2 is a function of frequency-dependent and -independent spectral density terms, whereas the 15N longitudinal relaxation time T1 is only a function of the frequency-dependent terms. For axially symmetric rotational diffusion (i.e., Dzz ≠ Dxx = Dyy, where Dzz, Dxx, and Dyy are the diagonal elements of the diffusion tensor) characterized by diffusion tensor constants parallel (D∥ = Dzz) and perpendicular (D⊥ = [Dxx + Dyy]/2) to the unique axis of the diffusion tensor, the spectral density J(ω), in the limit of very fast, axially symmetric internal motions, is given by $J(\omega) = S^{2}\sum_{k=1}^{3} A_{k}\,\tau_{k}/(1+\omega^{2}\tau_{k}^{2})$, where ω is the angular resonance frequency, and S is the generalized order parameter for rapid internal motion; τ1, τ2, and τ3 are time constants given by (6D⊥)⁻¹, (D∥ + 5D⊥)⁻¹, and (4D∥ + 2D⊥)⁻¹; and the terms A1, A2, and A3 are given by (1.5cos²θ – 0.5)², 3sin²θcos²θ, and 0.75sin⁴θ, where θ is the angle between the time-averaged N-H bond vector orientation in the molecular frame and the unique axis of the diffusion tensor (47). In the absence of large amplitude internal motions and conformational exchange line broadening, the 15N T1/T2 ratio for a protein with an axially symmetric diffusion tensor depends only on three variables: the angle θ (arising from the Ak terms) and the diffusion tensor constants D∥ and D⊥. As described below, D∥ and D⊥ are extracted readily from the ensemble of 15N T1 and T2 relaxation times. Thus, the individual T1/T2 ratios provide a direct measure of the angle θ between the N-H bond vector and the unique axis of the diffusion tensor.
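The orientation dependence of the spectral density for axially symmetric diffusion can be sketched numerically. The sketch assumes the spectral density has the form J(ω) = S² Σ A_k τ_k/(1 + ω²τ_k²), built from the three time constants and the three angular coefficients defined in the text; the function names, the argument names d_par and d_perp (for the parallel and perpendicular diffusion constants), and the numerical values used in the example are hypothetical.

```python
import math


def a_coefficients(theta_deg):
    """Angular coefficients A1, A2, A3 for angle theta to the unique axis."""
    theta = math.radians(theta_deg)
    cos2 = math.cos(theta) ** 2
    sin2 = math.sin(theta) ** 2
    return ((1.5 * cos2 - 0.5) ** 2, 3.0 * sin2 * cos2, 0.75 * sin2 ** 2)


def time_constants(d_par, d_perp):
    """Time constants tau1, tau2, tau3 (s) from the diffusion constants (1/s)."""
    return (
        1.0 / (6.0 * d_perp),
        1.0 / (d_par + 5.0 * d_perp),
        1.0 / (4.0 * d_par + 2.0 * d_perp),
    )


def spectral_density(omega, theta_deg, d_par, d_perp, s_squared=1.0):
    """Assumed form: J(omega) = S^2 * sum_k A_k tau_k / (1 + (omega tau_k)^2)."""
    return s_squared * sum(
        a * tau / (1.0 + (omega * tau) ** 2)
        for a, tau in zip(a_coefficients(theta_deg), time_constants(d_par, d_perp))
    )


if __name__ == "__main__":
    # Hypothetical diffusion constants with an anisotropy of 2.
    d_par, d_perp = 2.0e7, 1.0e7
    for theta in (0.0, 45.0, 90.0):
        print(theta, spectral_density(0.0, theta, d_par, d_perp))
```

The three coefficients always sum to one, so for isotropic tumbling the angular dependence vanishes; with D∥ > D⊥ the zero-frequency spectral density is largest for N-H vectors parallel to the unique axis, which is the source of the orientation dependence of the T1/T2 ratio.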
This orientation is not known a priori, so we allowed it to float by making use of an external, initially arbitrarily positioned axis, defined by a single C-C bond, positioned 50 Å away from the structure (16). The geometric content of the T1/T2 ratios is incorporated into simulated annealing refinement by adding the potential term $E_{\mathrm{anis}} = k_{\mathrm{anis}}[(T_{1}/T_{2})_{\mathrm{calc}} - (T_{1}/T_{2})_{\mathrm{obs}}]^{2}$, where kanis is a force constant and (T1/T2)obs and (T1/T2)calc are the observed and calculated values of T1/T2, respectively. At each step of the simulated annealing protocol, Eanis is evaluated by calculating the angle between the N-H vectors and the unique axis of the diffusion tensor, defined by the floating C-C bond vector. The desired target value between observed and calculated T1/T2
ratios, based on the experimental uncertainty in the measured T1/T2 values, is achieved by empirically adjusting the value of kanis. To apply T1/T2 refinement, the values of D∥ and D⊥ must be determined directly from the ensemble of measured T1/T2 ratios without reference to a known structure. For a uniform distribution of N-H bond vectors in space, the probability of finding an N-H vector that makes an angle θ with the unique axis of the diffusion tensor is proportional to sinθ (16). Hence, θ values near 90° are statistically most probable. These are the amides that yield the lowest T1/T2 ratios. The probability of finding an N-H bond vector with θ ≈ 0° is low, and, consequently, the T1/T2 ratio for θ = 0° is not extracted as easily from the range of experimentally observed T1/T2 ratios. Experimentally, (T1/T2)min and an initial estimate of (T1/T2)max are obtained by taking the average of the lowest and highest T1/T2 ratios, respectively, such that the SDs in their estimates are equal to the measurement error. Initial estimates for D∥ and D⊥ then are obtained by simultaneously best-fitting the complete equations describing (T1/T2)min, (T1/T2)max, and the ratio of these two terms. Because the initial estimate of (T1/T2)max is likely to underestimate the true value of (T1/T2)max, for the reasons discussed above, the estimated value of (T1/T2)max is increased in a stepwise manner (in increments of 5% up to a 35% increase), yielding new values of D∥ and D⊥. For each set of values, an ensemble of simulated annealing structures is calculated, and the dependence of the rms difference between observed and calculated T1/T2 values, ∆(T1/T2), on the estimated value of (T1/T2)max is examined. The minimum of this function yields the best estimates for D∥ and D⊥. The minimum is relatively shallow, and the structure is not significantly affected by using D∥ and D⊥ values that change (T1/T2)max by up to ±15% but keep (T1/T2)min constant.
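Two points in the procedure above lend themselves to a short numerical sketch: the sinθ weighting, which makes orientations near the unique axis rare, and the stepwise scan of the (T1/T2)max estimate. The function names are hypothetical, the angle is folded into the range 0° to 90° because the two directions along the axis are equivalent, and the starting value of 30 in the example is purely illustrative.

```python
import math


def fraction_within(theta_lo_deg, theta_hi_deg):
    """Fraction of uniformly distributed vectors with theta in the given range.

    Theta is the angle to the unique axis, folded into 0..90 degrees, so the
    normalized density is sin(theta) and its integral is a cosine difference.
    """
    return math.cos(math.radians(theta_lo_deg)) - math.cos(math.radians(theta_hi_deg))


def tmax_scan(initial_max, step=0.05, limit=0.35):
    """Trial (T1/T2)max values: the initial estimate raised in 5% steps to +35%."""
    n_steps = round(limit / step)
    return [initial_max * (1.0 + step * i) for i in range(n_steps + 1)]


if __name__ == "__main__":
    print(fraction_within(0.0, 10.0), fraction_within(80.0, 90.0))
    print(tmax_scan(30.0))
```

Vectors within 10° of the perpendicular plane are roughly an order of magnitude more common than vectors within 10° of the axis, which is why the smallest ratio is well sampled while the largest has to be scanned. Each trial value in the scan would be refit to give diffusion constants and followed by its own ensemble calculation, a step that lies outside this sketch.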
The general, fully asymmetric case, in which Dxx ≠ Dyy, is treated in an analogous manner (16). The 15N T1/T2 ratio then depends not only on the angle θ between the z axis of the diffusion tensor and the N-H vector orientation but also on the angle φ that describes the position of the projection of the N-H vector on the x-y plane, relative to the x axis. The rhombicity factor η is defined as 3/2(Dyy – Dxx)/[Dzz – 0.5(Dyy + Dxx)]. In practice, for most proteins with large diffusion anisotropy [2Dzz/(Dxx + Dyy) ≥ 1.5], η is found to be smaller than 0.4. Even at the high end of this range (η = 0.4), the dependence of the T1/T2 ratio on φ is relatively weak (introducing changes in the predicted T1/T2 ratio that are of a magnitude comparable to the uncertainty in the measurements). Although the effect of rhombicity of the diffusion tensor on the T1/T2 ratio is relatively small, including its effect in the structure refinement procedure does not pose any fundamental problem. In this case, the floating diatomic molecule, used above to describe the orientation of the diffusion tensor in the structure calculations for the axially symmetric case, is replaced by an artificial tetraatomic molecule comprising atoms X, Y, Z, and O, with three mutually perpendicular bonds, X-O, Y-O, and Z-O, corresponding to the x, y, and z axes of the diffusion tensor, respectively. Calculation of Eanis is completely analogous to the axially symmetric case but uses the full, five-term expression for the spectral density. A set of structure calculations, carried out for a small number of η values (typically 0, 0.2, and 0.4), then indicates whether inclusion of rhombicity leads to better agreement with the experimental T1/T2 data. As pointed out above, however, the T1/T2 ratio is only a weak function of η, and the exact value of η often is defined poorly by the NMR data. For the heteronuclear 15N T1/T2 method to be applicable, the molecule must tumble anisotropically (i.e., it must be nonspherical).
The minimum ratio of the diffusion anisotropy (D∥/D⊥) for which heteronuclear T1/T2 refinement will be useful depends entirely on the accuracy and uncertainties in the measured T1/T2 ratios. In practice, the difference between the maximum and minimum observed T1/T2 ratio must exceed the uncertainty in the measured T1/T2 values by an order of magnitude. This typically means that D∥/D⊥ should be greater than 1.5 (16). Direct refinement against 15N T1/T2 ratios has been applied to the N-terminal domain of enzyme I (EIN), a 30-kDa protein of 259 residues (16). EIN is elongated in shape with a diffusion anisotropy of 2. As a result, the observed T1/T2 ratios range from 14 when the N-H vector is perpendicular to the diffusion axis to 30 when the N-H vector is parallel to the diffusion axis. EIN consists of two domains, and of the 2,818 NOEs used to determine its structure, only 38 involve interdomain contacts (8). Refinement against the T1/T2 ratios resulted in a small change in the relative orientations of the two domains without perturbing the structures of the individual domains.
Refinement Against Residual Dipolar Couplings. The expression for the residual dipolar coupling δ(θ,φ) between two directly bonded nuclei can be simplified to the form $\delta(\theta,\phi) = D_{a}(3\cos^{2}\theta - 1) + \tfrac{3}{2}D_{r}\sin^{2}\theta\cos 2\phi$, where Da and Dr are the axial and rhombic components of the traceless diagonal tensor D given by 1/3[Dzz – (Dxx + Dyy)/2] and 1/3(Dxx – Dyy), respectively, with Dzz > Dyy ≥ Dxx; θ is the angle between the interatomic vector and the z axis of the tensor; and φ is the angle that describes the position of the projection of the interatomic vector on the xy plane, relative to the x axis (48). Note that the terms Da and Dr subsume various constants including the gyromagnetic ratios of the two nuclei, the distance between the two nuclei, the generalized order parameter S for internal motion of the internuclear vector, the magnetic field strength, and the medium permeability.
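The angular dependence of the residual dipolar coupling can be sketched as below, with φ denoting the azimuthal angle of the projection of the interatomic vector on the xy plane. The function and argument names are hypothetical, and the value of 10 Hz used for Da in the example is illustrative only.

```python
import math


def tensor_components(d_xx, d_yy, d_zz):
    """Axial and rhombic components (Da, Dr) of a traceless diagonal tensor."""
    d_a = (d_zz - 0.5 * (d_xx + d_yy)) / 3.0
    d_r = (d_xx - d_yy) / 3.0
    return d_a, d_r


def residual_dipolar_coupling(theta_deg, phi_deg, d_a, d_r):
    """Residual dipolar coupling in the units of d_a and d_r.

    Implements Da(3cos^2(theta) - 1) + 3/2 Dr sin^2(theta) cos(2 phi), where
    theta is the angle to the tensor z axis and phi the azimuth in the xy plane.
    """
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    axial = d_a * (3.0 * math.cos(theta) ** 2 - 1.0)
    rhombic = 1.5 * d_r * math.sin(theta) ** 2 * math.cos(2.0 * phi)
    return axial + rhombic


if __name__ == "__main__":
    magic_angle = math.degrees(math.acos(1.0 / math.sqrt(3.0)))
    # With no rhombic component the coupling vanishes at the magic angle.
    print(residual_dipolar_coupling(0.0, 0.0, 10.0, 0.0))
    print(residual_dipolar_coupling(magic_angle, 0.0, 10.0, 0.0))
```

Because a single coupling fixes only a combination of θ and φ, each measurement restricts the bond vector to a set of allowed orientations relative to the tensor rather than to a unique direction.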
[It is worth pointing out that, because Da and Dr scale with S and not S², the assumption of a uniform S value introduces a negligible error of at most a few percent in the dipolar coupling provided S² ≥ 0.6, particularly when one considers that S² values in structured regions of a protein typically fall in the 0.85±0.05 range (17).]
The applicability of the residual dipolar coupling method depends on the degree of alignment of the molecule in the magnetic field (17). The magnetic susceptibility of most diamagnetic proteins is dominated by aromatic residues but also contains contributions from the susceptibility anisotropies of the peptide bonds. The magnetic susceptibility anisotropy tensors of these individual contributors are generally not collinear, so the net value of the magnetic susceptibility anisotropy in diamagnetic proteins is usually small. Much larger magnetic susceptibility anisotropies are obtained if many aromatic groups are stacked on each other in such a way that their magnetic susceptibility contributions are additive, as in the case of nucleic acids. Hence, alignment induced by the magnetic field is suited ideally to nucleic acids and protein-nucleic acid complexes (17). In practice, the residual dipolar couplings must exceed the uncertainty in their measured values by an order of magnitude, which typically means that the magnetic susceptibility anisotropy should be –10×10⁻³⁴ m³ per molecule, which is 10 times greater than that for benzene. This translates into values of Da of 0.5 Hz for N-H vectors and 0.9 Hz for C-H vectors, obtained by measuring the difference in one-bond coupling constants at, for example, 360 and 750 MHz. Obtaining these values with sufficient accuracy requires that the one-bond couplings be measured by constant-time J-modulated correlation spectroscopy (49).
More recently, it has been shown that high degrees of alignment in a magnetic field, corresponding to values of Da of 10 Hz for N-H vectors and 18 Hz for C-H vectors, can be achieved readily by the addition of dilute liquid crystalline media, while retaining the sensitivity and resolution of spectra recorded in isotropic media (46). As a result, it becomes feasible to measure several different types of residual dipolar couplings by
simply examining the splittings in 2D or 3D coupled correlation spectra. In particular, the much smaller residual couplings for other types of internuclear vectors, such as C-N (10 times smaller than N-H) and Cα-C' and C-C (6 times smaller than N-H), are experimentally accessible.
The geometric content of the residual dipolar couplings is incorporated into the simulated annealing protocol by including the term Edipolar=kdipolar(δcalc–δobs)², where kdipolar is a force constant and δcalc and δobs are the calculated and observed values of the residual dipolar couplings, respectively (17). Just as for Eanis in the case of T1/T2 refinement, Edipolar is evaluated by calculating the θ and φ angles between the appropriate bond vectors (e.g., N-H, Cα-H, or Cα-C) and an external arbitrary axis system, defined by an artificial tetraatomic molecule comprising atoms X, Y, Z, and O, with three mutually perpendicular bonds, X-O, Y-O, and Z-O, representing the x, y, and z axes of the tensor, respectively (17, 18).
To apply residual dipolar coupling refinement, the values of Da and the rhombicity R (defined as Dr/Da) must be determined directly from the experimental data (18). The minimum value of the residual dipolar coupling, δmin, occurs at θ=φ=90°, such that Da is given by –δmin/(1+1.5R). Experimentally, a reliable value of δmin is obtained by taking the average of the smallest residual dipolar couplings such that the SD of the estimated δmin value is equal to the measurement error. The maximum value of the residual dipolar coupling, δmax, which occurs at θ=0°, is given by 2Da. As in the case of the T1/T2 ratios discussed above (16), a reliable estimate of δmax is more difficult to obtain from the experimental data because the probability of finding a bond vector with θ≈0° is low. Consequently, if measurements are available for only a single type of internuclear vector, the value of δmax, and hence the value of Da, generally will be underestimated by 15–20%.
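The two relations in this passage, the penalty term Edipolar and the estimate of Da from δmin for a trial rhombicity R, can be sketched as follows. The summation over all measured couplings is an assumption (the text gives the term for a single coupling), and the function names and numbers are hypothetical.

```python
def dipolar_energy(delta_calc, delta_obs, k_dipolar=1.0):
    """E_dipolar = k_dipolar * sum over couplings of (delta_calc - delta_obs)^2.

    Summing over the list of couplings is an assumption of this sketch.
    """
    return k_dipolar * sum((c - o) ** 2 for c, o in zip(delta_calc, delta_obs))


def da_from_delta_min(delta_min, r):
    """Da = -delta_min / (1 + 1.5 R), for a given trial rhombicity R."""
    return -delta_min / (1.0 + 1.5 * r)


if __name__ == "__main__":
    # Hypothetical couplings in Hz.
    print(dipolar_energy([1.0, 2.0], [1.5, 1.0], k_dipolar=1.0))
    print(da_from_delta_min(-13.0, 0.2))
```

Because Da depends on the trial value of R, a sketch of this kind would be called once per R estimate when scanning a series of rhombicities.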
Nevertheless, the observed value of δmax still can be used to obtain an upper limit for the value of R, given by [–2δmin(obs)/δmax(obs)–1]/1.5 (18). Because δmin can be determined accurately experimentally (for R < 0.6) but Da cannot be obtained independently of R (unless a good estimate of δmax is available), the strategy we use when residual dipolar couplings have been measured for only a single type of internuclear vector involves calculating a series of structure ensembles for different estimates of R. (Note that the rhombicity reaches a maximum value of 2/3 when Dzz=–Dxx and Dyy=0; at this point the z and x axes are interchangeable, so that the probability of finding an N-H vector perpendicular to the z axis is the same as that of finding one parallel to the z axis.) The dependence of the rms difference between target and calculated dipolar couplings on the estimated value of R (Rest) shows a minimum when Rest is approximately equal to the target value of R (Rtarget) (18). The same type of dependence is observed for the total energy of the target function, reflecting not only the agreement between target and calculated dipolar couplings but also small changes in the agreement between target and calculated values of the other terms in the target function (18).
Because the distribution of the different vector types relative to the tensor is not identical, it becomes possible, once measurements are available for two or more types of internuclear vectors, to obtain reliable values of Da and R from the observed minimum (δmin), maximum (δmax), and most probable (δp) values of the normalized residual dipolar couplings. The residual dipolar couplings for different internuclear vectors are normalized readily because Da,CD=Da,AB(γCγD/γAγB)(rAB³/rCD³), where AB and CD are two types of internuclear vector (e.g., N-H and Cα-H); γA, γB, γC, and γD are the gyromagnetic ratios of atoms A, B, C, and D, respectively; and rAB and rCD are the internuclear A-B and C-D distances.
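The upper limit on R and the normalization between vector types described above are both simple arithmetic, sketched below. The function names are hypothetical. In the usage example, the gyromagnetic ratios are standard values, whereas the bond lengths (1.02 Å for N-H and 1.09 Å for Cα-H) are assumed, illustrative values that are not specified in the text.

```python
def rhombicity_upper_limit(delta_min_obs, delta_max_obs):
    """Upper limit for R: [-2*delta_min(obs)/delta_max(obs) - 1] / 1.5."""
    return (-2.0 * delta_min_obs / delta_max_obs - 1.0) / 1.5


def scale_da(da_ab, gamma_a, gamma_b, gamma_c, gamma_d, r_ab, r_cd):
    """Da,CD = Da,AB * (gamma_C*gamma_D / gamma_A*gamma_B) * (r_AB^3 / r_CD^3)."""
    gamma_factor = (gamma_c * gamma_d) / (gamma_a * gamma_b)
    distance_factor = r_ab ** 3 / r_cd ** 3
    return da_ab * gamma_factor * distance_factor


if __name__ == "__main__":
    # Gyromagnetic ratios in rad s^-1 T^-1.
    GAMMA_H, GAMMA_N, GAMMA_C = 26.7522e7, -2.7126e7, 6.7283e7
    # Assumed bond lengths in Angstrom (illustrative only).
    R_NH, R_CH = 1.02, 1.09
    da_nh = 10.0  # hypothetical N-H value in Hz
    # The result carries a sign change because gamma of 15N is negative.
    print(scale_da(da_nh, GAMMA_N, GAMMA_H, GAMMA_C, GAMMA_H, R_NH, R_CH))
    # An underestimated delta_max gives a limit above the true R.
    print(rhombicity_upper_limit(-13.0, 17.0))
```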
A histogram of the normalized residual dipolar couplings displays a powder spectrum with the property that δmin+δmax+δp=0. The values of Da and R then can be obtained readily by least squares minimization of the following three equations: δmin(obs)=–Da(1+1.5R), δmax(obs)=2Da, and δp(obs)=–Da(1–1.5R). Indeed, model calculations with four different proteins of differing sizes and secondary structure content indicate that, if the N-H, Cα-H, and Cα-C residual dipolar couplings are measured for only 50% of the residues, Da and R can be determined in this manner to within better than 5% and ±0.1, respectively, which is quite sufficient because variations in the estimated values of Da and R of ±10% and ±0.15 have a negligible effect on the calculated structures (18). If residual dipolar couplings are measured only for the N-H and Cα-H vectors, Da and R still can be determined to within an accuracy of better than 10% and ±0.15.
An example of the structural impact of residual dipolar coupling refinement is illustrated in Fig. 3 for the case of a complex of the transcription factor GATA-1 with a 16-bp oligonucleotide (17). In this instance, the addition of only 90 dipolar coupling restraints to the 1,500 NOE and 300 torsion angle restraints resulted in a substantial improvement in the quality of the protein backbone, as judged by an approximately twofold reduction in the number of residues lying outside the most favored region of the Ramachandran plot (17). With the exception of a single region, the ensembles of structures calculated with and without dipolar couplings overlap (Fig. 3). There is, however, a substantial displacement (accompanied by a maximal 4-Å rms shift in the backbone coordinates of residue 22) in the short loop (residues 21–24) that connects strands β3 and β4.
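The least squares determination of Da and R from δmin, δmax, and δp mentioned above has a closed-form solution if the three equations are written in terms of Da and Dr = Da·R, in which they are linear. The sketch below uses that reparametrization, which is an assumption of this example rather than a description of the procedure used in ref. 18; the function name and input values are hypothetical.

```python
def fit_da_and_r(delta_min, delta_max, delta_p):
    """Least squares solution of
        delta_min = -Da - 1.5*Dr
        delta_max =  2*Da
        delta_p   = -Da + 1.5*Dr
    for (Da, Dr), returned as (Da, R) with R = Dr/Da.

    The normal equations are diagonal: 6*Da = 2*dmax - dmin - dp and
    4.5*Dr = 1.5*(dp - dmin).
    """
    da = (2.0 * delta_max - delta_min - delta_p) / 6.0
    dr = (delta_p - delta_min) / 3.0
    return da, dr / da


if __name__ == "__main__":
    # Hypothetical normalized couplings in Hz; they need not sum exactly to zero.
    da, r = fit_da_and_r(-13.2, 19.5, -6.8)
    print("Da =", da, "R =", r)
```

When the three observed values are exactly consistent (δmin+δmax+δp=0), the fit reproduces Da=δmax/2; otherwise the residual is spread over the three equations.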
Because this loop has low mobility, as judged from 15N relaxation data, this is a good example illustrating one of the principal shortcomings of NMR structure determination based on NOE measurements, namely an ill-defined region due to a lack of long range NOE restraints. The only NOEs observed for residues 22 and 23 are either intraresidue or sequential, and there are no long range NOEs involving residues 21 through 24. Hence, the precision of the backbone coordinates for this loop is lower than that for the α-helix and β-strands. Even though there are loose torsion angle restraints for the φ and ψ angles of these residues, accumulation of errors in the experimental restraints (for example, an NOE interproton distance restraint that is slightly too short, even by as little as 0.1 Å) becomes an important
FIG. 3. View showing best-fit superpositions of the restrained regularized mean coordinates obtained with and without dipolar coupling restraints. The protein is shown as a ribbon diagram drawn through the Cα positions. The loop between strands β3 and β4 (residues 21–24) is shown in magenta for the structure obtained with dipolar coupling restraints and in grey for the structure obtained without dipolar coupling restraints. Adapted from ref. 17.
factor in determining the orientation of this loop with respect to the rest of the protein.
Refinement with a Conformational Database Potential. In the context of simulated annealing refinement, it is found generally that conventional nonbonded interaction terms (either attractive-repulsive or purely repulsive) have very poor discriminatory power between high and low probability local conformations (14). This can be circumvented by the use of a conformational database potential, derived from high resolution, highly refined protein and nucleic acid crystal structures, that biases the sampling during simulated annealing refinement to conformations that are energetically possible by limiting the choices of dihedral angles to those that are known to be physically realizable (14, 15). The database potential, which is partitioned into various one, two, three, and four dimensional distributions (Table 1), is created as follows (14). For each distribution, the fractional probability Pi for a residue to appear within a particular bin (with each dimension digitized in increments of 8–10°) is converted into a potential of mean force EDB(i)=–kDB(lnPi), where kDB is a scale factor. Because the conformational database energy is not a continuous function but rather is known in discrete blocks, the partial derivatives are approximated in a manner analogous to that used for the 13C chemical shift potential term (11).
To this end, the energy for every rotatable bond (or set of rotatable bonds) being refined against the conformational database potential is defined by looking up the value in the grid bin that encompasses the current dihedral angle(s). The partial derivatives of the energy with respect to the rotatable bond angles then are approximated by the local slope of the energy function; that is, the partial derivative of EDB(i) with respect to a rotatable bond angle is approximated by $-k_{DB}[E_{DB}(i-1) - E_{DB}(i+1)]/2$, where EDB(i) is the database energy of bin i along that rotatable bond, and EDB(i–1) and EDB(i+1) are the database energies of the bins that precede and follow the bin that contains the actual energy value.
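The construction just described, from binned probabilities to a potential of mean force and a finite-difference slope, can be sketched for a single dihedral angle as follows. This is a one-dimensional illustration under stated assumptions: the probability floor for empty bins, the periodic wrap-around of bin indices, the 10° default bin width, and all function names are choices made for this example and are not taken from ref. 14.

```python
import math


def database_energies(counts, k_db=1.0, floor=1e-6):
    """E_DB(i) = -k_DB * ln(P_i), with P_i the fractional population of bin i.

    Empty bins are given a small floor probability (an assumption of this
    sketch) so that the logarithm stays finite.
    """
    total = float(sum(counts))
    return [-k_db * math.log(max(count / total, floor)) for count in counts]


def bin_index(angle_deg, bin_width=10.0):
    """Index of the grid bin that contains a dihedral angle given in degrees."""
    return int((angle_deg % 360.0) // bin_width)


def energy_slope(energies, i, k_db=1.0):
    """Local slope -k_DB * [E_DB(i-1) - E_DB(i+1)] / 2, as written in the text.

    Neighbouring bins wrap around because a dihedral angle is periodic.
    """
    n = len(energies)
    before = energies[(i - 1) % n]
    after = energies[(i + 1) % n]
    return -k_db * (before - after) / 2.0


if __name__ == "__main__":
    # Hypothetical populations for three 120-degree bins of one dihedral angle.
    counts = [10, 30, 60]
    energies = database_energies(counts)
    i = bin_index(170.0, bin_width=120.0)
    print("bin", i, "energy", energies[i], "slope", energy_slope(energies, i))
```

Higher-dimensional distributions would use one index per dihedral angle and one slope per dimension; the lookup logic is otherwise the same.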
Table 1. Summary of database potentials
A. Proteins
  One-dimensional
    χ4: Arg, Lys
  Two-dimensional
    φ/ψ: Gly, Pro, X-Pro, H-bonding*, Val/Ile, rest
    χ1/χ2: Leu, Ile, Gln/Glu, Arg/Lys/Met, Asn, Asp, Cys(ox), His, Trp, Phe/Tyr
    χ2/χ3: Met, Gln, Glu, Lys, Arg
  Three-dimensional
    Val, Ile, Phe/Tyr/Trp, Leu, X-Pro, Gln/Glu/Arg/Lys/Met, Cys(red)/His/Asp/Asn, Ser, Thr, Cys(ox), Pro
    χ1/χ2/χ3: Gln, Glu, Arg, Lys, Met
  Four-dimensional
B. Nucleic acids
  Two-dimensional
  Three-dimensional
*Residues with a hydrogen bond donor or acceptor in the γ or δ position (Ser, reduced cysteine, Asp, Asn, Ser, and Thr).
†The scale factor used for the interresidue potentials must be set to a value 10-fold lower than that for the intraresidue potentials; otherwise, undesirable bias in the structures may be introduced. Typically, the final value of the scale factor for the intraresidue conformational database potentials is set to 1.0.
It should be noted that there is one significant difference between the protein and nucleic acid conformational database potentials (15). In the case of the protein conformational database potential, the energy values for the various minima in the multidimensional potential energy surfaces provide a true reflection of the probability of occurrence of particular conformations because protein structures in solution and the crystal state are essentially the same. In the case of nucleic acids, however, and in particular DNA, the frequency of occurrence of different forms in the crystal state does not necessarily reflect their probability of occurrence in solution. For example, in solution under physiological conditions, short DNA oligonucleotides are invariably B-form. In the crystal, however, A, B, or Z-forms can occur depending on the crystallization conditions. As a result, the A and Z forms of DNA are overrepresented in the database, and the energy values for the different minima in the multidimensional potential energy surfaces comprising the nucleic acid conformational database potential do not necessarily reflect their probability of occurrence in solution. This does not, however, affect the positions of the various minima so that, as far as structure refinement is concerned, the nucleic acid conformational database potential still serves its primary function, namely biasing sampling to conformations that are realizable physically.
The effect of incorporating the conformational database potential into refinement is to improve the stereochemistry of the structures in terms of the quality of the Ramachandran plot, the rotamer distributions, and the number of bad contacts (14, 15). If there are no significant errors in the experimental restraints, conformational database refinement will not impact the agreement between the calculated and target experimental, covalent, and van der Waals restraints. The presence of errors in the experimental restraints, however, will be reflected by a large deterioration in the agreement between calculated and target restraints upon conformational database refinement (14). Hence, incorporation of the conformational database provides a good indicator of the quality of both the model and the experimental restraints (14).
Some may regard the introduction of a conformational database energy term as a major step toward empiricism in NMR structure refinement, adding a term with apparently no direct physical counterpart, whose effect will be to make the dihedral angle distributions in NMR refined structures look more like those in crystal structures. However, the combined quality and quantity of high (≤2 Å) resolution protein structures in the crystallographic databases (50) argues strongly against such a viewpoint and makes it very difficult to ignore the available experimental observations relating to dihedral angles in proteins. First, it is invariably the case that high resolution x-ray structures show significantly better agreement with solution observables, such as coupling constants, 13C chemical shifts, and proton chemical shifts, than the corresponding NMR structures, including the very best ones (obtained in the absence of direct coupling constant and chemical shift restraints) (10–13, 27, 28, 41, 42). Hence, in most cases, a high (≤2 Å) resolution crystal structure of a soluble globular protein will provide a better description of the structure in solution than the corresponding NMR structure. Second, the probability distributions for the various dihedral angles observed in the crystallographic database are a direct result of the underlying physical chemistry of the system and as such provide a perfectly reasonable, albeit empirically derived, measure of the relative energetics of different combinations of dihedral angles (14).
Third, the discriminating and converging power of the conformational database potential with regard to dihedral angles is significantly better than that of the currently available empirical nonbonded potentials. This point is hardly surprising because the conformational database potential acts
directly on rotatable bonds whereas the nonbonding potentials do not.
A question that is invariably asked about the conformational database potential is whether one will be able to pick up unusual sidechain or backbone conformations. Inspection of high resolution protein x-ray structures indicates that one safely can assume that 90–95% of all residues have a sidechain conformation resembling that of a common rotamer (50). Under these conditions, residues that truly exhibit a skewed rotamer conformation will be spotted by specific discrepancies between the model and the experimental restraints, and in most circumstances such violations will be accounted for by special structural features of the model. Moreover, one should be especially careful in believing a nonrotamer sidechain conformation in NMR structures in the absence of extensive NOE and coupling constant data relating to that particular residue. Exactly the same arguments can be applied to φ,ψ angles located in unfavorable regions of the Ramachandran plot, which likewise should be treated with extreme caution unless there is extensive experimental evidence to the contrary (50).
We thank Ad Bax, Dan Garrett, John Kuszewski, and Nico Tjandra for many stimulating discussions.
1. Clore, G.M. & Gronenborn, A.M. (1991) Science 252, 1390–1399.
2. Wüthrich, K. (1986) NMR of Proteins and Nucleic Acids (Wiley, New York).
3. Clore, G.M. & Gronenborn, A.M. (1987) Protein Eng. 1, 275–288.
4. Dyson, H.J., Gippert, G.P., Case, D.A., Holmgren, A. & Wright, P.E. (1990) Biochemistry 29, 4129–4136.
5. Forman-Kay, J.D., Clore, G.M., Wingfield, P.T. & Gronenborn, A.M. (1991) Biochemistry 30, 2685–2698.
6. Clore, G.M., Wingfield, P.T. & Gronenborn, A.M. (1991) Biochemistry 30, 2315–2323.
7. Clore, G.M. & Gronenborn, A.M. (1991) J. Mol. Biol. 221, 47–53.
8. Garrett, D.S., Seok, Y.J., Liao, D.-I., Peterkofsky, A., Gronenborn, A.M. & Clore, G.M. (1997) Biochemistry 36, 2517–2530.
9. Martin, J.R., Mulder, F.A.A., Karimi-Nejad, Y., van der Zwan, J., Mariani, M., Schipper, D. & Boelens, R. (1997) Structure 5, 521–532.
10. Garrett, D.S., Kuszewski, J., Hancock, T.J., Lodi, P.J., Vuister, G.W., Gronenborn, A.M. & Clore, G.M. (1994) J. Magn. Reson. 104, 99–103.
11. Kuszewski, J., Qin, J., Gronenborn, A.M. & Clore, G.M. (1995) J. Magn. Reson. 106, 92–96.
12. Kuszewski, J., Gronenborn, A.M. & Clore, G.M. (1995) J. Magn. Reson. 107, 293–297.
13. Kuszewski, J., Gronenborn, A.M. & Clore, G.M. (1996) J. Magn. Reson. 112, 79–81.
14. Kuszewski, J., Gronenborn, A.M. & Clore, G.M. (1996) Protein Sci. 5, 1067–1080.
15. Kuszewski, J., Gronenborn, A.M. & Clore, G.M. (1997) J. Magn. Reson. 125, 171–177.
16. Tjandra, N., Garrett, D.S., Gronenborn, A.M., Bax, A. & Clore, G.M. (1997) Nat. Struct. Biol. 4, 443–449.
17. Tjandra, N., Omichinski, J.G., Gronenborn, A.M., Clore, G.M. & Bax, A. (1997) Nat. Struct. Biol. 4, 732–738.
18. Clore, G.M., Gronenborn, A.M. & Tjandra, N. (1998) J. Magn. Reson. 131, 159–162.
19. Clore, G.M. & Gronenborn, A.M. (1989) CRC Crit. Rev. Biochem. Mol. Biol. 24, 479–564.
20. Clore, G.M., Brünger, A.T., Karplus, M. & Gronenborn, A.M. (1986) J. Mol. Biol. 191, 523–551.
21. Nilges, M., Clore, G.M. & Gronenborn, A.M. (1988) FEBS Lett. 229, 317–324.
22. Stein, E.G., Rice, L.M. & Brünger, A.T. (1997) J. Magn. Reson. 124, 154–164.
23. Havel, T.F. & Wüthrich, K. (1985) J. Mol. Biol. 182, 381–394.
24. Braun, W. (1987) Q. Rev. Biophys. 19, 115–157.
25. Gronenborn, A.M. & Clore, G.M. (1995) CRC Crit. Rev. Biochem. Mol. Biol. 30, 351–385.
26. Clore, G.M., Robien, M.A. & Gronenborn, A.M. (1993) J. Mol. Biol. 231, 81–102.
27. Bartik, K., Dobson, C.M. & Redfield, C. (1993) Eur. J. Biochem. 215, 255–266.
28. Wang, A.C. & Bax, A. (1996) J. Am. Chem. Soc. 118, 2483–2494.
29. Luzzati, V. (1952) Acta Crystallogr. 5, 802–810.
30. Borgias, B.A., Gochin, M., Kerwood, D.J. & James, T.L. (1990) Progr. NMR Spectrosc. 22, 83–100.
31. Yip, P. & Case, D.A. (1991) J. Magn. Reson. 83, 643–648.
32. Nilges, M., Habazettl, P., Brünger, A.T. & Holak, T.A. (1991) J. Mol. Biol. 219, 499–510.
33. Brünger, A.T., Clore, G.M., Gronenborn, A.M., Saffrich, R. & Nilges, M. (1993) Science 261, 328–331.
34. Karplus, M. (1963) J. Am. Chem. Soc. 85, 2870.
35. Bax, A., Vuister, G.W., Grzesiek, S., Delaglio, F., Wang, A.C., Tschudin, R. & Zhu, G. (1994) Methods Enzymol. 239, 79–106.
36. Hu, J.-S. & Bax, A. (1996) J. Am. Chem. Soc. 118, 8170–8171.
37. Ottiger, M. & Bax, A. (1997) J. Am. Chem. Soc. 119, 8070–8075.
38. Spera, S. & Bax, A. (1991) J. Am. Chem. Soc. 113, 5491–5492.
39. Wishart, D.S. & Sykes, B.D. (1994) J. Biomol. NMR 4, 171–180.
40. Oldfield, E. (1995) J. Biomol. NMR 5, 217–225.
41. Osapay, K.A. & Case, D.A. (1991) J. Am. Chem. Soc. 113, 9436–9444.
42. Williamson, M.P. & Asakura, T. (1993) J. Magn. Reson. 101, 63–71.
43. Omichinski, J.G., Pedone, P.V., Felsenfeld, G., Gronenborn, A.M. & Clore, G.M. (1997) Nat. Struct. Biol. 4, 122–132.
44. Huth, J.R., Bewley, C.A., Nissen, M.S., Evans, J.N.S., Reeves, R., Gronenborn, A.M. & Clore, G.M. (1997) Nat. Struct. Biol. 4, 657–665.
45. Lodi, P.J., Ernst, J.A., Kuszewski, J., Hickman, A.B., Engelman, A., Craigie, R., Clore, G.M. & Gronenborn, A.M. (1995) Biochemistry 34, 9826–9833.
46. Tjandra, N. & Bax, A. (1997) Science 278, 1111–1114.
47. Woessner, D.E. (1962) J. Chem. Phys. 36, 647–654.
48. Bothner-By, A.A. (1995) in Encyclopedia of Nuclear Magnetic Resonance, eds. Grant, D.M. & Harris, R.K. (Wiley, Chichester, U.K.), pp. 2932–2938.
49. Tjandra, N., Grzesiek, S. & Bax, A. (1996) J. Am. Chem. Soc. 118, 6264–6272.
50. Kleywegt, G.J. & Jones, T.A. (1997) Methods Enzymol. 227, 208–230.
51. Clore, G.M., Ernst, J.A., Clubb, R.T., Omichinski, J.G., Kennedy, W.M.P., Sakaguchi, K., Appella, E. & Gronenborn, A.M. (1995) Nat. Struct. Biol. 2, 321–332.
52. Clore, G.M. & Gronenborn, A.M. (1991) J. Mol. Biol. 217, 611–620.
53. Szyperski, T., Güntert, P., Stone, S.R. & Wüthrich, K. (1992) J. Mol. Biol. 228, 1192–1205.
54. Berndt, K.D., Güntert, P., Orbons, L.P.M. & Wüthrich, K. (1992) J. Mol. Biol. 227, 757–775.
55. Hyberts, S.G., Goldberg, M.S., Havel, T.F. & Wagner, G. (1992) Protein Sci. 1, 736–751.
56. Moore, J.M., Lepre, C., Gippert, G.P., Chazin, W.J., Case, D.A. & Wright, P.E. (1991) J. Mol. Biol. 221, 533–555.
57. Billeter, M., Kline, A.D., Braun, W., Huber, R. & Wüthrich, K. (1989) J. Mol. Biol. 206, 677–687.
58. Folkers, P.J.M., Clore, G.M., Driscoll, P.C., Dodt, J., Köhler, S. & Gronenborn, A.M. (1989) Biochemistry 28, 2601–2617.
59. Spitzfaden, C., Braun, W., Wider, G., Widmer, H. & Wüthrich, K. (1994) J. Biomol. NMR 4, 463–482.
60. Osapay, K., Theriault, Y., Wright, P.E. & Case, D.A. (1994) J. Mol. Biol. 244, 183–197.
61. Clore, G.M., Gronenborn, A.M., Nilges, M. & Ryan, C.A. (1987) Biochemistry 26, 8012–8023.
62. Billeter, M., Vendrell, J., Wider, G., Aviles, F.X., Coll, M., Guasch, A., Huber, R. & Wüthrich, K. (1992) J. Biomol. NMR 2, 1–10.
63. Clore, G.M., Gronenborn, A.M., James, M.N.G., Kjaer, M., McPhalen, C.A. & Poulsen, F.M. (1987) Protein Eng. 1, 313–318.
ESTIMATION OF EVOLUTIONARY DISTANCES UNDER STATIONARY AND NONSTATIONARY MODELS OF NUCLEOTIDE SUBSTITUTION
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5899–5905, May 1998 Colloquium Paper This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution
(substitution models/rate heterogeneity/stationarity/nonstationarity/estimation bias)
XUN GU* AND WEN-HSIUNG LI†‡
*Institute of Molecular Evolutionary Genetics, 328 Mueller Laboratory, Pennsylvania State University, University Park, PA 16802; and †Human Genetics Center, SPH, University of Texas, P.O. Box 20334, Houston, TX 77225
ABSTRACT Estimation of evolutionary distances has always been a major issue in the study of molecular evolution because evolutionary distances are required for estimating the rate of evolution in a gene, the divergence dates between genes or organisms, and the relationships among genes or organisms. Other closely related issues are the estimation of the pattern of nucleotide substitution, the estimation of the degree of rate variation among sites in a DNA sequence, and statistical testing of the molecular clock hypothesis. Mathematical treatments of these problems are considerably simplified by the assumption of a stationary process in which the nucleotide compositions of the sequences under study have remained approximately constant over time, and there now exist fairly extensive studies of stationary models of nucleotide substitution, although some problems remain to be solved. Nonstationary models are much more complex, but significant progress has been recently made by the development of the paralinear and LogDet distances. This paper reviews recent studies on the above issues and reports results on correcting the estimation bias of evolutionary distances, the estimation of the pattern of nucleotide substitution, and the estimation of rate variation among the sites in a sequence.
Evolutionary distances (usually designated by d) such as the number of nucleotide substitutions between two DNA sequences (K) are basic quantities in the study of molecular evolution because they are required for computing the rate of evolution in a DNA or protein sequence, for inferring the evolutionary relationships among genes or organisms, and for estimating the divergence dates between taxa or genes (1–9). For these purposes, however, it is essential to obtain reliable estimates of evolutionary distances. Indeed, if the evolutionary distances are not accurately estimated, all distance matrix methods of tree reconstruction may be misleading (5–6, 8). Because accurate estimation of evolutionary distances requires a realistic model of nucleotide substitution, much effort has been made to develop general models of nucleotide substitution (4, 8). If the process of nucleotide substitution is stationary, i.e., if the nucleotide compositions of the sequences under study have been approximately constant over time, then fairly general models of nucleotide substitution can be developed. For the stationary, time-reversible model (the SR model), Lanave et al. (10), Gu and Li (11), and others (12–14) have developed methods for estimating K. This model includes many other models as special cases (see next page). Moreover, Gu and Li (11) have recently extended the SR model to include rate variation among sites, i.e., the SRV model, in which SRV stands for stationary, time-reversible, and rate-variable. When nucleotide frequencies change with time so that stationarity does not hold, phylogenetic reconstruction using distances estimated under a stationary model can be misleading because it tends to group together sequences of similar nucleotide compositions irrespective of their true evolutionary relationships (15–18). Nonstationarity greatly complicates the mathematics. 
Fortunately, significant progress has been made with the development of the paralinear (19) and LogDet distances (17, 20). However, both methods assume a uniform rate among sites, and so methods for dealing with rate heterogeneity remain to be developed. An issue related to the estimation of evolutionary distances is the estimation of the pattern of nucleotide substitution. This pattern can be reliably estimated under stationarity (21–23) but is difficult to estimate under nonstationarity. Another problem closely related to distance estimation is how to estimate the degree of rate variation among sites (24–29). Many methods have been proposed for this purpose under a specific distribution (e.g., a gamma distribution). However, how to estimate rate heterogeneity without assuming a specific distribution has been unclear (30). These issues will be considered in this paper. A further issue is that estimation bias usually occurs when the sequence length is short so that stochastic effects are strong. Although the bias tends to become trivial as the sequence length increases, it is desirable to correct the bias because in practice many sequences studied are actually very short (31–32). The purpose of this article is to review recent studies on the above issues and to present our results.
‡To whom reprint requests should be addressed, e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955899–7$2.00/0. PNAS is available online at http://www.pnas.org.
Abbreviations: SR, stationary time-reversible; SRV, SR rate-variable; NR, time-irreversible; TR, time-reversible.

Stationary Models

The SR Model. Assume that nucleotide substitution follows a stationary Markov process (10–14). Denote A, G, T, and C as 1, 2, 3, and 4, respectively. Let $R$ be the rate matrix whose $ij$-th element $r_{ij}$ is the rate of change from nucleotide $i$ to nucleotide $j$ ($i \neq j$; $i, j = 1, 2, 3, 4$); the diagonal elements are given by $r_{ii} = -\sum_{j \neq i} r_{ij}$. Then the matrix of transition probabilities $P$ for $t$ time units is given by $P = e^{Rt}$, where the $ij$-th element $P_{ij}$ is the probability of transition from nucleotide $i$ to nucleotide $j$ after $t$ evolutionary time units. The substitution process is reversible in time if and only if $\pi_i r_{ij} = \pi_j r_{ji}$, where $\pi_i$ is the equilibrium frequency of nucleotide $i$. The preceding relation implies that the off-diagonal elements of $R$ can be expressed as
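One standard way to write this consequence of reversibility is sketched below. The symmetric factors $s_{ij}$ are notation assumed here for illustration and are not necessarily the authors' symbols.

```latex
% Assumed notation: s_{ij} is a symmetric "exchangeability" factor.
r_{ij} = \pi_j \, s_{ij}, \qquad s_{ij} = s_{ji} \quad (i \neq j),
% so that reversibility holds term by term:
\pi_i r_{ij} = \pi_i \pi_j s_{ij} = \pi_j \pi_i s_{ji} = \pi_j r_{ji}.
```

In this form there are six free values $s_{ij}$ ($i < j$) and three free frequencies (the four $\pi_i$ sum to 1), which gives nine parameters in total.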
Therefore, the SR model is a nine-parameter model and includes many models as special cases, e.g., the models of Jukes and Cantor (33), Kimura (34), Tajima and Nei (35), Hasegawa et al. (21), and Tamura and Nei (22). The SR model has been studied by many authors (10–14, 23, 36).

Consider two sequences (designated 1 and 2) that evolved from their common ancestor O, $t$ time units ago (Fig. 1). Under stationarity, time-reversibility means that the substitution process from the common ancestor O to sequences 1 and 2 is equivalent to the substitution process from 1 through O to 2 (or from 2 through O to 1), whose transition probability matrix for $2t$ time units is given by

$$P = e^{2tR}. \qquad [1]$$

Let $\lambda_k$ ($k = 1, 2, 3, 4$) be the $k$-th eigenvalue of the rate matrix $R$; one of them is zero, say $\lambda_4 = 0$. Let $z_k$ be the $k$-th eigenvalue of $P$. Eq. 1 implies $z_k = e^{2t\lambda_k}$. Gu and Li (11) showed that the evolutionary distance defined by the average number of substitutions per site (i.e., $d = -2t\sum_i \pi_i r_{ii}$) is given by

$$d = -\sum_{k=1}^{3} c_k \ln z_k, \qquad [2]$$

where the constants $c_k$ are determined by the eigenmatrix of $P$. Eq. 2 is generally valid because all eigenvalues $z_k$ are real under the SR model (11, 37). For example, under the Jukes-Cantor model (33), $z_1 = z_2 = z_3 = 1 - 4p/3$ and $c_1 = c_2 = c_3 = 1/4$, so that Eq. 2 reduces to $d = -(3/4)\ln(1 - 4p/3)$, where $p$ is the proportion of nucleotide differences between the two sequences.

The SR distance can be estimated from the data matrix $J$, whose $ij$-th element ($J_{ij}$) is the frequency of sites at which the nucleotides in the two sequences are $i$ and $j$, respectively. By time-reversibility, we have $J_{ij} = \pi_i P_{ij}$. Therefore, the $ij$-th element of $P$ (for $2t$ time units) can be estimated by $P_{ij} = J_{ij}/\pi_i$ ($i, j = 1, \ldots, 4$), where $\pi_i$ and $J_{ij}$ are easily obtained from the sequence data. Let matrix $P$ consist of $P_{ij}$. Its eigenvalues $z_k$ ($k = 1, \ldots, 3$) can be computed by a standard algorithm, and the constants $c_k$ ($k = 1, 2, 3$) are computed from $u_{ik}$ and $v_{kj}$, the elements of the corresponding eigenmatrix $U$ and its inverse matrix $V$, respectively. For details, see Saccone et al. (38), Gu and Li (11), and Li and Gu (39). The sampling variance of $d$ and the variance-covariance matrix for more than two DNA sequences can be found in Gu and Li (11).
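The estimation steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' program: the function name `sr_distance` is hypothetical; the observed matrix J is averaged with its transpose so that the time-reversible estimate has real eigenvalues; and the constants are taken as c_k = Σ_i π_i u_ik v_ki, which is what Eq. 2 requires when d is the total number of substitutions per site.

```python
import numpy as np

def sr_distance(J):
    """Estimate the SR distance d = -sum_k c_k ln z_k from a 4x4 matrix J.

    J[i, j] is the count (or frequency) of sites with nucleotide i in
    sequence 1 and nucleotide j in sequence 2.
    """
    J = np.asarray(J, dtype=float)
    J = J / J.sum()          # convert counts to frequencies
    J = 0.5 * (J + J.T)      # assumption: enforce reversibility by symmetrizing
    pi = J.sum(axis=1)       # nucleotide frequencies pi_i
    root = np.sqrt(pi)
    # S = D^(1/2) P D^(-1/2) with P_ij = J_ij / pi_i and D = diag(pi).
    # S is symmetric and has the same eigenvalues z_k as P.
    S = J / np.outer(root, root)
    z, W = np.linalg.eigh(S)
    if np.any(z <= 0.0):
        raise ValueError("non-positive eigenvalue: distance is not estimable")
    # With U = D^(-1/2) W and V = W^T D^(1/2), the assumed constants
    # c_k = sum_i pi_i u_ik v_ki simplify to sum_i pi_i W_ik^2.
    c = (pi[:, None] * W**2).sum(axis=0)
    # The eigenvalue z = 1 contributes ln(1) = 0, so all four terms can be summed.
    return float(-(c * np.log(z)).sum())
```

Working with the symmetric matrix S rather than P itself keeps the eigenvectors orthonormal, which avoids numerical trouble when eigenvalues coincide, as they do under the Jukes-Cantor model.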
FIG. 1. Two DNA sequences diverged t time units ago.

Eq. 2 can be used to define many additive distances by choosing appropriate constants $c_k$ (Table 1), e.g., the number of nucleotide substitutions per site ($K$), the number of transitional substitutions per site ($A$), the number of transversional substitutions per site ($B$), and the number of substitutions from nucleotides $i$ to $j$ ($D_{ij}$). These distance measures are useful for phylogenetic analysis and molecular clock testing.

The SRV Model. Rate variation among sites can be incorporated into the SR model by assuming $r_{ij} = a_{ij}u$, where $a_{ij}$ is a constant and $u$ varies according to a gamma distribution

$$f(u) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, u^{\alpha - 1} e^{-\beta u}, \qquad [3]$$

with mean $\bar{u} = \alpha/\beta$; $\alpha$ is the shape parameter and determines the degree of rate variation. Under this model, the (mean) transition probability matrix $P$ for $2t$ time units is given by

$$P = \left[I - \frac{2t}{\alpha} R\right]^{-\alpha}, \qquad [4]$$

where $I$ is the identity matrix and the mean rate matrix is $R = \bar{u}A$, where matrix $A$ consists of $a_{ij}$ (11). From Eq. 4, one can show that the $k$-th eigenvalue of $P$ is given by

$$z_k = \left[1 - \frac{2\lambda_k t}{\alpha}\right]^{-\alpha}, \qquad [5]$$

where $\lambda_k$ is the $k$-th eigenvalue of $R$. It follows that the evolutionary distance under the SRV model is given by

$$d = \alpha \sum_{k=1}^{3} c_k \left(z_k^{-1/\alpha} - 1\right). \qquad [6]$$

The constants $c_k$ are determined in the same manner as above (Table 1). Note that Eq. 4 reduces to Eq. 1 and Eq. 6 to Eq. 2 as $\alpha \to \infty$, i.e., when the substitution rate is uniform among sites. Furthermore, Eq. 6 can be generalized to any distribution $f(u)$ for the rate variation among sites. Let $G(s) = \int_0^{\infty} e^{su} f(u)\,du$ be the moment-generating function of $f(u)$. Gu and Li (11) showed that $z_k = G(2\lambda_k t)$, $k = 1, 2, 3, 4$. Thus, the general additive distance is given by
$$d = -\sum_{k=1}^{3} c_k\, G^{-1}(z_k), \qquad [7]$$

where $G^{-1}$ is the inverse function of the moment-generating function $G$. For example, consider the invariant+gamma model (26, 40–41): (i) for a given site, the probability of being invariable (i.e., $u = 0$) is $\theta$, whereas the probability of being variable is $1 - \theta$; and (ii) among the sites that are variable, the substitution rate follows a gamma distribution. By applying Eq. 7, one can show that the evolutionary distance under the invariant+gamma distribution is given by

[8]

For other distributions, see Waddell et al. (30).

Table 1. The constants ck in the general SR or SRV distance
K is the number of substitutions per site; A is the number of transitional substitutions per site; B is the number of transversional substitutions per site; and Dij is the number of substitutions from nucleotides i to j per site. The subscripts ji Ts and ji Tv mean that the differences between nucleotides i and j are transitional and transversional, respectively.

Bias-Corrected SR and SRV Distances. Our computer simulation has shown that when the sequence length is short, the SR and SRV methods tend to overestimate the evolutionary distance. The bias can be corrected as follows. Let $d$ be an estimate of the SR or SRV distance. We use the first three terms of the Taylor expansion to obtain an approximate expression of $E[d]$. For the SR model,

$$E[d] \approx -\sum_{k=1}^{3} c_k \ln z_k + \frac{1}{2}\sum_{k=1}^{3} c_k \frac{\mathrm{Var}(z_k)}{z_k^{2}}, \qquad [9]$$

where the first term on the right is the true distance (Eq. 2). Therefore, the bias-corrected SR distance is given by
$$d_c = d - \delta, \qquad [10]$$

where $\delta$ is defined as

$$\delta = \frac{1}{2}\sum_{k=1}^{3} c_k \frac{\mathrm{Var}(z_k)}{z_k^{2}}, \qquad [11]$$

and $\mathrm{Var}(z_k)$ can be obtained by the method of Gu and Li (11). The bias-corrected distance under the SRV model also can be written as Eq. 10, except that $\delta$ is replaced by

$$\delta = \frac{1}{2}\left(1 + \frac{1}{\alpha}\right)\sum_{k=1}^{3} c_k \frac{\mathrm{Var}(z_k)}{z_k^{2 + 1/\alpha}}. \qquad [12]$$

Computer Simulation. Extensive computer simulations on the performance of the SR and SRV methods have been conducted in this study and in Rodriguez et al. (14), Zharkikh (31), and Gu and Li (11). The results can be summarized as follows.
(i) When the sequence length (L) is long and the rate of substitution is uniform among sites, the SR method performs well, whereas simpler methods [e.g., Kimura's two-parameter method (34)] give biased estimates if some assumptions of the method are violated (11, 14, 31). Because the actual substitution pattern of DNA evolution may be complex, the SR method is preferred when the sequences are long, say, longer than 1,000 bp.
(ii) The SR method may give large biases when the sequence length is short (say, L ≤ 200), but the biases can be substantially reduced by the bias-corrected SR distance (Table 2). As L becomes longer than 2,000 bp, the estimation bias decreases to virtually zero. The same comment applies to the SRV method (Table 3).
(iii) The SR method performs well even when DNA sequence evolution is not time-reversible (see models NR1 and NR2 in Table 2). Therefore, the assumption of time-reversibility, which simplifies the estimation problem considerably, may not have serious effects on distance estimation.
(iv) When the substitution rate varies among sites, the evolutionary distance can be seriously underestimated by the SR method; note that this bias is systematic and cannot be eliminated by increasing sequence length. As shown in Table 3, the SRV method performs well and the estimation bias vanishes when L is long.
(v) The methods developed by Gu and Li (11) for estimating sampling variance under the SR and SRV models appear to be reliable except when L < 200 and d > 1.0.
(vi) The mean squared error, defined by MSE = bias² + Var(d), is useful for comparing the relative performance of two methods because for a simple method, the sampling variance tends to be smaller but the bias tends to be larger (11). For example, using this criterion, Gu and Li (11) found that SR is superior to JC when L > 500 bp and that SRV is always superior to SR when the substitution rate varies among sites.
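Point (iv) can be illustrated with a small calculation in the simplest special case, the Jukes-Cantor model with gamma-distributed rates, where $z_1 = z_2 = z_3 = 1 - 4p/3$ and $c_k = 1/4$. The sketch below is only an illustration with hypothetical function names; it uses expected values of p, so it shows the systematic part of the bias and no sampling error.

```python
import math

def jc_gamma_p(d, alpha):
    """Expected proportion of differing sites under Jukes-Cantor with gamma rates."""
    return 0.75 * (1.0 - (1.0 + 4.0 * d / (3.0 * alpha)) ** (-alpha))

def jc_distance(p):
    """Uniform-rate distance: -sum c_k ln z_k with z_k = 1 - 4p/3, c_k = 1/4."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_gamma_distance(p, alpha):
    """Gamma-corrected distance: alpha * sum c_k (z_k^(-1/alpha) - 1)."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)
```

For a true distance of 1.0 and α = 0.5, `jc_distance(jc_gamma_p(1.0, 0.5))` is below 0.5, whereas `jc_gamma_distance` recovers the true value. Because the calculation uses expected frequencies, no increase in sequence length would remove the underestimate.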
Table 2. The mean of distances (d) over simulation replicates estimated by the bias-corrected SR method and the SR method

Model    L = 200          L = 500          L = 2000
(1) d = 0.5
JC       0.503 (0.506)    0.506 (0.516)    0.501 (0.502)
K2P      0.507 (0.517)    0.502 (0.506)    0.501 (0.502)
TN       0.508 (0.516)    0.504 (0.507)    0.501 (0.502)
TmN      0.505 (0.516)    0.505 (0.509)    0.501 (0.502)
SR       0.509 (0.517)    0.503 (0.506)    0.501 (0.502)
NR1      0.509 (0.517)    0.503 (0.507)    0.501 (0.502)
NR2      0.510 (0.517)    0.505 (0.509)    0.501 (0.502)
(2) d = 1.0
JC       1.036 (1.082)    1.013 (1.029)    1.005 (1.008)
K2P      1.072 (1.093)    1.008 (1.038)    1.003 (1.009)
TN       1.046 (1.089)    1.015 (1.037)    1.006 (1.010)
TmN      1.061 (1.085)    1.016 (1.050)    1.005 (1.012)
SR       1.049 (1.085)    1.006 (1.038)    1.005 (1.009)
NR1      1.057 (1.090)    1.009 (1.044)    1.005 (1.011)
NR2      1.071 (1.094)    1.015 (1.055)    1.006 (1.012)
The value presented in each case is the mean of d estimated by the bias-corrected SR method, and the value in parentheses is that estimated by the (uncorrected) SR method. Simulation models: JC, the Jukes-Cantor model (33); K2P, Kimura's two-parameter model (34), with a transition/transversion ratio of 4. For TN (Tajima and Nei, ref. 35), TmN (Tamura and Nei, ref. 22), SR, and the two time-irreversible models (NR1 and NR2), see Gu and Li (11) for a detailed description.

Table 3. The mean of distance (d) estimated by the SRV method and the bias-corrected SRV method

α      L      true d    SR+V model       NR2+V model
0.5    200    0.3       0.317 (0.325)    0.320 (0.334)
              0.5       0.520 (0.552)    0.555 (0.574)
              1.0       1.068 (1.179)    1.193 (1.303)
       500    0.3       0.303 (0.307)    0.305 (0.310)
              0.5       0.508 (0.517)    0.510 (0.520)
              1.0       1.027 (1.061)    1.037 (1.077)
1.0    200    0.3       0.312 (0.318)    0.313 (0.319)
              0.5       0.513 (0.528)    0.531 (0.544)
              1.0       1.038 (1.126)    1.063 (1.149)
       500    0.3       0.306 (0.309)    0.306 (0.309)
              0.5       0.508 (0.514)    0.502 (0.507)
              1.0       1.013 (1.037)    1.022 (1.053)
2.0    200    0.3       0.307 (0.311)    0.308 (0.312)
              0.5       0.514 (0.526)    0.514 (0.524)
              1.0       1.046 (1.132)    1.060 (1.146)
       500    0.3       0.300 (0.302)    0.300 (0.301)
              0.5       0.503 (0.508)    0.502 (0.507)
              1.0       1.012 (1.034)    1.015 (1.043)

The value presented in each case is the mean of d estimated by the bias-corrected SRV method, and the value in parentheses is that estimated by the (uncorrected) SRV method. See the note of Table 2 for details.

Estimating the Pattern of Nucleotide Substitution. The pattern of nucleotide substitution can be measured by the off-diagonal elements of the rate matrix $R$. For simplicity, these elements are usually rescaled, and here we define the pattern of nucleotide substitution as $R^* = 2tR$. Consider two DNA sequences (Fig. 1) under the SR model. Denote the diagonal matrix of the eigenvalues of $P = e^{2tR}$ by $\mathrm{diag}(z_1, z_2, z_3, z_4)$. By matrix theory, we have $P = U\,\mathrm{diag}(z_1, z_2, z_3, z_4)\,U^{-1}$, where $U$ is the eigenmatrix of $P$. Then, the substitution pattern $R^* = 2tR = \ln P$ can be expressed as
$$R^* = U\,\mathrm{diag}(\ln z_1, \ln z_2, \ln z_3, \ln z_4)\,U^{-1}. \qquad [13]$$

Therefore, using the same procedure, we can estimate the evolutionary distance and the pattern of nucleotide substitution simultaneously. In the same manner, under the SRV model one can show that the pattern of nucleotide substitution can be estimated by

$$R^* = \alpha\, U\,\mathrm{diag}\!\left(1 - z_1^{-1/\alpha},\; 1 - z_2^{-1/\alpha},\; 1 - z_3^{-1/\alpha},\; 1 - z_4^{-1/\alpha}\right) U^{-1} \qquad [14]$$

(see also ref. 42).

It is known that estimation of the pattern of nucleotide substitution can be significantly improved by using $n > 2$ sequences, but the estimation procedure becomes complex because it needs to consider the phylogenetic tree of the sequences, which may be unknown. The following simple method does not require knowledge of the tree topology. For a given pair of sequences $i$ and $j$, which diverged $t_{ij}$ time units ago, the transition probability matrix under the SR model is $P^{(ij)} = e^{2t_{ij}R}$. By multiplying $P^{(ij)}$ over all pairs of sequences, we have

$$\prod_{i<j} P^{(ij)} = e^{2\tau R}, \qquad [15]$$

where $\tau = \sum_{i<j} t_{ij}$. Similarly, under the SRV model, one can show that

[16]

Therefore, when the transition probability matrix for each pair of sequences has been estimated, which is denoted by $P_{ij}$, we first compute $P(2\tau) = \prod_{i<j} P_{ij}$. Then, under the SR or SRV model, the substitution pattern $R^* = 2\tau R$ for $n$ sequences can be estimated by an approach similar to that for the case of two sequences. The sampling variances for the estimated substitution pattern can be obtained by the analytical method developed by Gu and Li (11) or by a simple resampling technique (e.g., bootstrapping).

When many sequences are considered for estimating the substitution pattern, the time scale $\tau$ in Eq. 16 can be very large, resulting in some elements in $R^*$ larger than one. Because we are more concerned with the relative rates among the types of nucleotide substitutions, it is better to provide a normalized substitution pattern. A simple normalization procedure is to compute where $M = n(n-1)/2$ and the weight $w_{ij} = 1/d_{ij}$.
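The SR part of this procedure (Eq. 13 for one pair, and the product over pairs in Eq. 15) can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes each estimated matrix is diagonalizable with positive real eigenvalues, and it does not cover the SRV analogue or the normalization step.

```python
from functools import reduce
import numpy as np

def matrix_function(M, f):
    """Apply a scalar function f to a diagonalizable matrix M = U diag(z) U^-1."""
    z, U = np.linalg.eig(np.asarray(M, dtype=float))
    # Work in complex arithmetic and keep the real part of the result.
    return (U @ np.diag(f(z.astype(complex))) @ np.linalg.inv(U)).real

def substitution_pattern(P):
    """Eq. 13: R* = ln P = U diag(ln z_k) U^-1 for one pair of sequences."""
    return matrix_function(P, np.log)

def pooled_substitution_pattern(P_list):
    """Eq. 15: multiply the pairwise matrices, then take the matrix logarithm."""
    return substitution_pattern(reduce(np.matmul, P_list))
```

Because every pairwise matrix is a function of the same R under the SR model, the product does not depend on the order of multiplication, which is why no tree topology is needed.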
A General Measure of Rate Variation Among Sites

Gu et al. (26) suggested a normalized measure ($\rho$) for evaluating the relative strength of the rate variation among sites:

$$\rho = \frac{\mathrm{Var}(u)}{\bar{u}^{2} + \mathrm{Var}(u)} = \frac{C_v^{2}}{1 + C_v^{2}}, \qquad [17]$$

where $\mathrm{Var}(u)$ and $\bar{u}$ are the variance and mean of the evolutionary rate ($u$) for any distribution $f(u)$, and $C_v = \sqrt{\mathrm{Var}(u)}/\bar{u}$ is the coefficient of variation. As $\rho$ varies from 0 to 1, the rate heterogeneity increases from a uniform rate over sites ($\rho = 0$ or $C_v = 0$) to the maximum heterogeneity ($\rho = 1$ or $C_v = \infty$). Therefore, $\rho$ directly reflects rate heterogeneity, and unlike the shape parameter $\alpha$ of the gamma distribution, it does not depend on a specific distribution.

In the following we describe a simple method for estimating $\rho$ without assuming a specific model for the rate variation among sites. We assume that (i) at each site nucleotide substitution follows a Poisson process, and (ii) the evolutionary rate $u$ varies among sites according to the distribution $f(u)$. Let $X$ be the number of substitutions at a nucleotide site with rate $u$. Then, the first two conditional moments of $X$ are given by $E[X \mid u] = uT$ and $E[X^2 \mid u] = uT + (uT)^2$, respectively, where $T$ is the total evolutionary time. It follows that the first two (unconditional) moments of $X$ over all sites are $E[X] = E[E(X \mid u)] = T\,E[u]$ and $E[X^2] = E[E(X^2 \mid u)] = T\,E[u] + T^2 E[u^2]$, respectively, where $E[u]$ and $E[u^2]$ are the first two moments of $f(u)$. Let $m = E[X]$ and $V = E[X^2] - m^2$, and let $\bar{u} = E[u]$ and $\mathrm{Var}(u) = E[u^2] - \bar{u}^2$. One can show that $m = \bar{u}T$ and $V = \bar{u}T + \mathrm{Var}(u)T^2$, and so $C_v = \sqrt{V - m}/m$. Therefore, the parameter $\rho$ is given by

$$\rho = \frac{V - m}{V - m + m^{2}}. \qquad [18]$$

To estimate $\rho$ from sequence data, we need to know the number of substitutions at each site. Conventionally, this number is inferred by the parsimony method (43) when the phylogenetic tree is known. However, the parsimony method tends to underestimate the true number of substitutions (29, 44). Gu and Zhang (29) solved this problem by using a combination of ancestral sequence inference and maximum likelihood estimation. Let $X_i$ be the number of substitutions at the $i$th site estimated by Gu and Zhang's method (29). Then $m$ and $V$ can be estimated by the sample mean and sample variance of the $X_i$ over the $L$ sites ($L$ is the sequence length), so that $\rho$ can be easily obtained from Eq. 18 without knowing the distribution $f(u)$.

The biological meaning of $\rho$ can be easily understood by using the following simple model. Let $v$ be the mutation rate at a site. For invariant sites, the substitution rate is 0, and for the other sites, the rate is $hv$, where 0

$$u = (1 - \rho)hv. \qquad [19]$$

This formula predicts a negative correlation between substitution rate and the rate variation among sites, which has been observed by J. Zhang and X. Gu (unpublished results).
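The moment estimator of ρ described above is short enough to sketch directly. The function name is hypothetical, and two choices are assumptions of this sketch rather than statements from the text: the variance is divided by L rather than L − 1, and the estimate is truncated at zero when the sample variance falls below the sample mean.

```python
import numpy as np

def rho_from_counts(x):
    """Estimate rho = (V - m) / (V - m + m^2) from per-site substitution counts."""
    x = np.asarray(x, dtype=float)
    m = x.mean()                 # sample mean of the counts
    if m == 0.0:
        raise ValueError("no substitutions observed")
    V = x.var()                  # assumption: divide by L, not L - 1
    excess = max(V - m, 0.0)     # assumption: truncate at zero when V < m
    return excess / (excess + m * m)
```

For counts simulated from the invariant-site model, with a fraction θ of sites at rate 0 and the rest at a common rate, the estimate should be close to θ, in line with Eq. 19.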
Nonstationary Models

LogDet and Paralinear Distances. The paralinear (19) and LogDet (17, 20) distances have been proposed to deal with nonstationarity. They are based on the most general model of nucleotide substitution. Historically, these methods can be traced back to Barry and Hartigan (13) and Cavender and Felsenstein (45). Consider the evolution of two sequences (Fig. 1). Denote the diagonal matrix of nucleotide frequencies at node k (k = 0, 1, 2) by where the subscript j refers to nucleotide j. Let J be the data matrix as defined previously. Then, the paralinear distance (between sequences 1 and 2) is defined as

[20]
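For orientation, both distances can be computed from the data matrix J in a few lines. The sketch below rests on assumed forms that follow common usage and are scaled so that each equals the Jukes-Cantor distance when all frequencies are 1/4; they should be checked against Lake (19), Lockhart et al. (20), and Gu and Li (32). The paralinear distance is taken as $-\tfrac{1}{4}\ln[\det J / \sqrt{\det F_1 \det F_2}]$, where $F_1$ and $F_2$ are the diagonal matrices of nucleotide frequencies in the two sequences, and the LogDet distance as $-\tfrac{1}{4}\ln\det J - \ln 4$. Function names are hypothetical.

```python
import numpy as np

def _log_det_and_freqs(J):
    """Return ln det J and the nucleotide frequencies of both sequences."""
    J = np.asarray(J, dtype=float)
    J = J / J.sum()                       # convert counts to frequencies
    sign, logdet = np.linalg.slogdet(J)
    if sign <= 0:
        raise ValueError("det(J) is not positive: distance is undefined")
    return logdet, J.sum(axis=1), J.sum(axis=0)

def paralinear_distance(J):
    """Assumed form: -(1/4) ln[det J / sqrt(det F1 * det F2)]."""
    logdet, f1, f2 = _log_det_and_freqs(J)
    return float(-0.25 * (logdet - 0.5 * (np.log(f1).sum() + np.log(f2).sum())))

def logdet_distance(J):
    """Assumed form: -(1/4) ln det J - ln 4."""
    logdet, _, _ = _log_det_and_freqs(J)
    return float(-0.25 * logdet - np.log(4.0))
```

Under these assumed forms the two distances differ only by a term that depends on the nucleotide frequencies of the two sequences.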
In Eq. 21, the constant –ln 4 is added because it does not change any property of the original LogDet distance but makes the biological interpretation easier (32). The paralinear and LogDet distances have the following properties:

(i) Both distances are based on the most general model of nucleotide substitution, i.e., the 12-parameter model (17, 19–20, 31). Moreover, they are valid even if the rate matrix R varies among lineages. Therefore, in the case where the assumption of a uniform substitution rate among sites holds, the paralinear and LogDet distances are very useful for phylogenetic reconstruction when nucleotide frequencies are nonstationary (19–20, 32).

(ii) For the neighbor-joining method and related methods, the two distance measures give the same tree topology (32). However, there are some differences between the two distances. First, the paralinear distance between two sequences is the sum of "paralinear" lengths of the branches involved. Thus, the branch lengths under a given tree can be well estimated from the paralinear distance matrix by the least-squares method. In contrast, this property does not hold for the LogDet distance. Second, the LogDet distance is particularly useful for testing the molecular clock hypothesis under nonstationarity, whereas the paralinear distance is not suitable for this purpose (see Eqs. 27 and 28).

(iii) The biological interpretation of the two distances can be described as follows. Let $\mu^{(k)}$ be the arithmetic mean rate in lineage k (k = 1, 2), and $\mu = (\mu^{(1)} + \mu^{(2)})/2$. Gu and Li (32) showed that the expected paralinear distance (Eq. 20) is given by

[22]

and the expected LogDet distance (Eq. 21) is given by

[23]

Note that, when the nucleotide frequency is stationary, Eq. 22 reduces to $d = 2\mu t$, which is the expected number of substitutions between the two sequences and is equivalent to the SR distance with $c_k = 1/4$ (Eq. 2). Eq. 23 reduces to $d = 2\mu t$ if

(iv) The approximate sampling variance of the paralinear distance is given by

[24]

and that of the LogDet distance is given by

[25]

where L is the sequence length and $M_{ij}$ is the $ij$-th element of $M = J^{-1}$ (13, 20, 32). For more than two sequences, the method for computing the variance-covariance matrix of the two distances has been developed by Gu and Li (32).

Bias-Corrected Paralinear and LogDet Distances. Because the data matrix J and the nucleotide frequencies can be directly estimated from the sequence data, the estimation of paralinear and LogDet distances is simple (19–20). However, our simulation study has revealed that the true (paralinear or LogDet) distance can be overestimated when the sequences are short (32), a situation similar to that of the SR/SRV distance. Gu and Li (32) obtained the following bias-corrected paralinear or LogDet distance:
$$d_c = d - 2\,\mathrm{Var}(d), \qquad [26]$$

where $d$ and $\mathrm{Var}(d)$ are the estimates of the "standard" paralinear or LogDet distance and its sampling variance, respectively (see Eqs. 20, 21, 24, and 25). The performance of the bias-corrected distances has been examined by extensive computer simulation (32). We considered two DNA sequences (Fig. 1) that evolve under a very general model: in one lineage nucleotide substitution follows a time-reversible model (TR), and in the other lineage it follows a time-irreversible model (NR). The rate matrices of TR and NR are designed to be very different, and the equilibrium GC% is 70% in TR but only 17% in NR (see ref. 32 for details). Moreover, the initial GC% at node O (Fig. 1) is set to 15%, 50%, or 70% in three cases. Our simulation results indicate that, when the sequence length is short, the bias-corrected paralinear or LogDet distance performs considerably better than the uncorrected method (Table 4).

Testing the Molecular Clock Hypothesis Under Nonstationarity. The relative rate test (2) can be described as follows. Consider three species as shown in Fig. 2, where species 3 is an outgroup. To test whether the evolutionary rate in lineage O1 is the same as that in lineage O2 (i.e., the molecular clock hypothesis), one tests whether or not the difference $D = d_{13} - d_{23}$ is significantly different from zero. Wu and Li (2), Gu and Li (46), Muse and Weir (47), Tajima (48), and others have developed tests for the case of stationarity. When the nucleotide frequencies are nonstationary, $D \neq 0$ can arise from differences in nucleotide frequencies between the two sequences. Gu and Li (32) showed that this problem can be avoided by using the LogDet distance; that is,
$$D = d_{13} - d_{23} = (\mu^{(1)} - \mu^{(2)})\,t, \qquad [27]$$

where $t$ is the divergence time between species 1 and 2 (Fig. 2). To test whether $D$ is significantly different from zero, one can estimate the sampling variance of $D$, $V(D) = V(d_{13}) + V(d_{23}) - 2\,\mathrm{Cov}(d_{13}, d_{23})$, by the method of Gu and Li (32). When the sequence is long, the statistic $Z = D/\sqrt{V(D)}$ follows approximately the standard normal distribution (2). Actually, this new relative rate test can be easily generalized to the two-cluster test of Li and Bousquet (49) and Takezaki et al. (50), who considered the case of stationarity (Gu and Li, unpublished data).

Table 4. Statistical performances of the bias-corrected paralinear distance

Initial GC%    L        True d    Bias-corrected dc (bias)    Uncorrected d (bias)
2µt = 0.5
50%            200      0.486     0.488 (0.4%)                0.497 (2.3%)
               500      0.486     0.489 (0.6%)                0.492 (1.2%)
               2,000    0.486     0.487 (0.2%)                0.488 (0.4%)
70%            200      0.555     0.556 (0.2%)                0.572 (3.1%)
               500      0.555     0.557 (0.4%)                0.563 (1.4%)
               2,000    0.555     0.555 (0.0%)                0.557 (0.4%)
15%            200      0.607     0.599 (1.3%)                0.637 (4.9%)
               500      0.607     0.602 (0.8%)                0.613 (1.0%)
               2,000    0.607     0.609 (0.3%)                0.611 (0.7%)
2µt = 0.8
50%            200      0.770     0.766 (0.5%)                0.791 (2.7%)
               500      0.770     0.768 (0.3%)                0.777 (0.9%)
               2,000    0.770     0.770 (0.0%)                0.772 (0.3%)
70%            200      0.858     0.842 (1.9%)                0.890 (3.7%)
               500      0.858     0.854 (0.5%)                0.868 (1.2%)
               2,000    0.858     0.859 (0.1%)                0.862 (0.5%)
15%            200      0.926     0.880 (5.0%)                0.986 (6.5%)
               500      0.926     0.918 (0.9%)                0.946 (1.2%)
               2,000    0.926     0.925 (0.1%)                0.930 (0.5%)

L is the sequence length; "True d" is the true value of the paralinear distance; the last two columns give the means of d estimated by the bias-corrected and uncorrected paralinear distances, respectively. The percentage values in parentheses are the relative biases, i.e., the absolute difference between the mean estimate and the true d, divided by the true d (× 100%).
FIG. 2. The phylogeny used for molecular clock testing.

On the other hand, if $d_{ij}$ is measured by the paralinear distance, one can show that $D' = d_{13} - d_{23}$ is given by

[28]

Obviously, $D'$ is affected by differences in nucleotide frequencies and thus is not suitable for testing the molecular clock hypothesis.
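The final step of the relative rate test, once the LogDet distances and their sampling variances and covariance are in hand, can be sketched as follows. The function name and argument names are hypothetical, and the variance and covariance estimators of Gu and Li (32) are not reproduced here; they are treated as inputs. The two-sided P value uses the standard normal approximation mentioned in the text.

```python
import math

def relative_rate_test(d13, d23, var13, var23, cov):
    """Return (D, Z, two-sided P value) for the difference D = d13 - d23."""
    D = d13 - d23
    var_D = var13 + var23 - 2.0 * cov     # V(D) = V(d13) + V(d23) - 2 Cov(d13, d23)
    if var_D <= 0.0:
        raise ValueError("estimated variance of D must be positive")
    z = D / math.sqrt(var_D)
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return D, z, p_value
```

Because both distances share the outgroup lineage, the covariance term is usually positive, so leaving it out would inflate V(D) and make the test conservative.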
Discussion In the above, we discussed the estimation of evolutionary distances and related issues under three models of nucleotide substitution: the SR model (10–14, 36), the SRV model (11), and the nonstationary model (13, 17, 19–20, 32, 45). The conclusions can be summarized as follows. (i) Under stationarity, the evolutionary distances and the pattern of nucleotide substitution can be estimated under the SR or SRV model, (ii) When the nucleotide frequencies are nonstationary, the paralinear or LogDet distances should be used. However, although both distances lead to the same tree topology, the branch lengths of a tree can be appropriately estimated only from the paralinear distances, whereas the molecular clock hypothesis should be tested by the LogDet distance. (iii) The proposed bias-corrected methods for the SR/SRV and paralinear/ LogDet distances are useful when the sequences are shorter than 500 bp. (iv) A general measure for the rate variation among sites is proposed, which does not depend on any specific distribution of rates. In principle, the SR/SRV and paralinear/LogDet distances can be easily extended to more complex models in which the dimension of the rate matrix R is >4 (51–55). Two interesting cases are the amino acid-based model (a general 20×20 model) and the codon-based model (a general 61×61 model). However, our preliminary simulation showed that, even for the amino-acid based model, these distances are subject to large sampling variances unless the sequence is very long, say, larger than 2,000 amino acids; the sampling variance would be much larger for the codon-based model. Indeed, because there are too many unknown parameters, the distances cannot be estimated accurately. Thus, one should be cautious when applying these methods to analyze amino acid sequence data. We suggested to use ρ (related to the coefficient of variation Cv) as a general measure of rate heterogeneity. However, Waddell et al. 
(30) questioned its usefulness because they found, for a given sequence data set, the estimated Cv value differs under different assumptions of rate distribution. This dilemma has now been removed because we have developed a method for estimating ρ (or Cv) that does not require any specific model of rate distribution. Apparently, the discrepancy found by Waddell et al. (30) is caused by sampling errors or the unsuitability of the model. When the nucleotide frequencies are not stationary, the parlinear and LogDet methods provide concise and elegant distance measures for phylogenetic inference and molecular clock testing. However, how to incorporate the effect of heterogeneity into these two distances is a problem that remains to be solved. This study was supported by National Institutes of Health Grants GM 30998 (to W.H.L.) and GM 20293 (to Masatoshi Nei, Pennsylvania State University). 1. Li, W.H., Wu, C.I. & Luo, C.C. (1985) in Molecular Evolutionary Genetics, ed. MacIntyre, R.J. (Plenum, New York), pp. 1–94. 2. Wu, C.I. & Li, W.H. (1985) Proc. Nati Acad. Sci. USA 82, 1741–1745. 3. Saitou, N. & Nei, M. (1987) Mol. Biol. Evol. 4, 406–425. 4. Nei, M. (1987) Molecular Evolutionary Genetics (Columbia Univ. Press, New York). 5. Nei, M. (1996) Annu. Rev. Genet. 30, 371–403. 6. Felsenstein, J. (1988) Annu. Rev. Genet. 22, 521–565. 7. Doolittle, R.E., Feng, D.F., Tsang, S., Cho, G. & Little, E. (1996) Science 271, 470–477. 8. Li, W.H. (1997) Molecular Evolution (Sinauer, Sunderland, MA). 9. Gu, X. (1997) Mol Biol. Evol. 14, 861–866. 10. Lanave, C., Preparata. G., Saccone, C. & Serio, G. (1984) J. Mol. Evol. 20, 86–93. 11. Gu, X. & Li, W.H. (1996) Proc. Natl. Acad. Sci. USA 93, 4671–4676. 12. Tavare, S. (1986) Lect. Math. Life Sci. 17, 57–86. 13. Barry, D. & Hartigan, J.A. (1987) Biometrics 43, 261–276. 14. Rodriguez, F., Oliver, J.F., Marin, A. & Medina, J.R. (1990) J. Theor. Biol. 142, 485–501. 15. Hasegawa, M. & Hashimoto, T. (1993) Nature 361, 23. 16. 
Sogin, M.L., Hinkle, G. & Leipe, D.D. (1993) Nature 362, 795. 17. Steel, M.A. (1994) Appl. Math. Lett. 7, 19–24.
About this PDF file: This new digital representation of the original work has been recomposed from XML files created from the original paper book, not from the original typesetting files. Page breaks are true to the original; line lengths, word breaks, heading styles, and other typesetting-specific formatting, however, cannot be retained, and some typographic errors may have been accidentally inserted. Please use the print version of this publication as the authoritative version for attribution.
ESTIMATION OF EVOLUTIONARY DISTANCES UNDER STATIONARY AND NONSTATIONARY MODELS OF NUCLEOTIDE SUBSTITUTION
18. Galtier, N. & Gouy, M. (1996) Proc. Natl. Acad. Sci. USA 92, 11317–11321. 19. Lake, J.A. (1994) Proc. Natl. Acad. Sci, USA 91, 1455–1459. 20. Lockhart, P.J., Steel, M.A. Hendy, M.D. & Penny, D. (1994) Mol. Biol. Evol. 11, 605–612. 21. Hasegawa, M., Kishino, H. & Yano, T. (1985) J. Mol. Evol. 22, 160–174. 22. Tamura, K. & Nei, M. (1993) Mol. Biol. Evol. 10, 512–526. 23. Yang, Z. (1994) J. Mol. Evol. 39, 105–111. 24. Uzzel, T. & Corbin, K.W. (1971) Science 172, 1089–1096. 25. Yang, Z. (1993) Mol Biol. Evol. 10, 1396–1401. 26. Gu, X., Fu, X.Y. & Li, W.H. (1995) Mol. Biol. Evol. 12, 546–557. 27. Sullivan, J.K., Holsinger, K.E. & Simon, C. (1995) Mol. Biol. Evol. 12, 988–1001. 28. Kelly, C. & Rice, J. (1996) Math. Biosci. 133, 85–109. 29. Gu, X. & Zhang, J. (1997) Mol. Biol. Evol. 14, 1106–1113. 30. Waddell, P.J., Penny, D. & Moore, T. (1997) Mol. Phylogenet. Evol. 8, 33–50. 31. Zharkikh, A. (1994) J. Mol. Evol. 39, 315–329. 32. Gu, X. & Li, W.H. (1996) Mol. Biol. Evol. 13, 1375–1383. 33. Jukes, T.H. & Cantor, C.R. (1969) in Mammalian Protein Metabolism, ed. Munro, H.N. (Academic. New York), pp. 21–123. 34. Kimura, M. (1980) J. Mol. Evol. 16, 111–120. 35. Tajima, F. & Nei, M. (1984) Mol. Biol. Evol. 1, 269–285. 36. Steel, M., Szekely, L. & Hendy, M. (1994) J. Comp. Biol. 1, 153–163. 37. Keilson, J. (1979) Markov Chain Models: Rarity and Exponentially (Springer, New York). 38. Saccone, C., Lanave C., Pesole, G. & Preparata, G. (1990) Methods Enzymol. 183, 570–583. 39. Li, W.H. & Gu, X. (1996) Methods Enzymol. 266, 449–459. 40. Miyamoto, M.M. & Fitch, W.M. (1996) Syst. Biol. 45, 568–575. 41. Tourasse, N. & Gouy, M. (1997) Mol. Biol. Evol. 14, 287–298. 42. Yang, Z. & Kumar, S. (1996) Mol. Biol. Evol. 13, 650–659. 43. Fitch, W.M. (1971) Syst. Zool. 20, 406–416. 44. Wakeley, J. (1993) J. Mol. Evol. 37, 613–623. 45. Cavender, J.A. & Felsenstein, J. (1987) J. Classification 4, 57–71. 46. Gu, X. & Li, W.H. (1992) Mol. Phvlogenet. Evol. 234, 185–192. 47. Muse, S.V. 
& Weir, B.S. (1992) Genetics 132, 269–276. 48. Tajima, F. (1993) Genetics 135, 599–607. 49. Li, P. & Bousquet, J. (1992) Mol. Biol. Evol. 9, 1185–1189. 50. Takezaki, N., Rzhetsky, A. & Nei, M. (1995) Mol. Biol. Evol. 12, 823–833. 51. Dayhoff, M.O. (1978) Atlas of Protein Sequence and Structure (Natl. Biomed. Res. Found., Silver Spring, MD), Vol. 5. 52. Schoniger, M. & von Haeseler, A. (1994) Mol. Phylogenet. Evol. 3, 240–247. 53. Golding, N. & Yang, Z. (1994) Mol. Biol. Evol. 11, 725–736. 54. Muse, S.V. & Gaut, B.S. (1994) Mol. Biol. Evol. 11, 715–724. 55. Rzhetsky, A. (1995) Genetics 141, 771–783.
PRECISE SEQUENCE COMPLEMENTARITY BETWEEN YEAST CHROMOSOME ENDS AND TWO CLASSES OF JUST-SUBTELOMERIC SEQUENCES
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5906–5912, May 1998 Colloquium Paper
This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Precise sequence complementarity between yeast chromosome ends and two classes of just-subtelomeric sequences
(Saccharomyces cerevisiae/inverted recombination/sequence exchange/telomeres/Y and X2 repeats)
ROY J. BRITTEN*
Division of Biology, California Institute of Technology, 101 Dahlia Avenue, Corona del Mar, CA 92625
ABSTRACT The terminal regions (last 20 kb) of Saccharomyces cerevisiae chromosomes universally contain blocks of precise sequence similarity to other chromosome terminal regions. The left and right terminal regions are distinct in the sense that the sequence similarities between them are reverse complements. Direct sequence similarity occurs between the left terminal regions and also between the right terminal regions, but not between any left ends and right ends. With minor exceptions the relationships range from 80% to 100% match within blocks. The regions of similarity are composites of familiar and unfamiliar repeated sequences as well as what could be considered “single-copy” (or better “two-copy”) sequences. All terminal regions were compared with all other chromosomes, forward and reverse complement, and 768 comparisons are diagrammed. It appears there has been an extensive history of sequence exchange or copying between terminal regions. The subtelomeric sequences fall into two classes. Seventeen of the chromosome ends terminate with the Y repeat, while 15 end with the 800-nt “X2” repeats just adjacent to the telomerase simple repeats. The just-subterminal repeats are very similar to each other except that the chromosome 1 right end is more divergent.
Once the complete Saccharomyces cerevisiae DNA sequence became available (1, 2) it appeared worthwhile to see if an insight into the origin of repeated sequences could be obtained, since all repeated sequences of this yeast strain are available for examination in the complete sequence. The initial stage has been to examine the terminal regions, and that is what is reported here. Naturally the results overlap the many previous studies, but they differ from what has been published in the completeness of the examination of the terminal relationships. There have been extensive examinations of yeast telomeres and of pairing and recombination processes revealing extensive regions of subterminal sequence relationships, but they will not be reviewed here and reference is made to previous reviews (3–8). There is of course a question as to what can be learned by merely examining sequence similarities, so this work is an experiment, but there are new results of significance.
Here the telomeric and subtelomeric sequences are referred to together as the terminal regions, which include the last 20 kb of each chromosome. By custom, numbering starts at the left end. Many left terminal region sequences are the reverse complement of some right terminal region sequences, and no cases occur of significant lengths of precise direct sequence similarity between any parts of left and right terminal regions among all of the chromosomes. However, this observation does not signify that there is a consistent orientation of the arbitrary historical identification of the ends of the yeast chromosomes. The situation is clarified if, for test purposes, the numbering of a chromosome is reversed. After this test change all left end sequences would still be the reverse complement of the right end sequences. Only a few specific relationships would change, while the reverse complementary pattern as a whole would remain unchanged. The reverse complementary relationships between chromosome ends have been clear to some workers (e.g., ref. 9), but I am not aware of a discussion of their significance.
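The renumbering test described above can be illustrated with a toy sketch. The sequences, the chromosome names (`chr_a`, `chr_b`, `chr_c`), and the helper functions are invented for illustration and are not yeast data. The sketch shows that reading one chromosome from its other end turns a left/right reverse complementary match into a direct match between two left ends, while the left/right relationships that remain are still reverse complementary.

```python
# Toy sequences invented for illustration; none of this is yeast data.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def left_end(chrom: str, n: int) -> str:
    """First n bases, i.e. the left terminal region."""
    return chrom[:n]


def right_end(chrom: str, n: int) -> str:
    """Last n bases, i.e. the right terminal region."""
    return chrom[-n:]


block = "ACGGTTCAGT"  # hypothetical shared terminal block
n = len(block)

chr_a = block + "GGGGGGCCCCAAAA"                      # block at the left end of A
chr_b = "TTTTCCAAGGAGAG" + reverse_complement(block)  # its reverse complement ends B
chr_c = "CACACATTTGGG" + reverse_complement(block)    # direct copy of B's right end

# Before renumbering: A's left end is the reverse complement of both right ends,
# and there is no direct left/right match.
assert left_end(chr_a, n) == reverse_complement(right_end(chr_b, n))
assert left_end(chr_a, n) == reverse_complement(right_end(chr_c, n))
assert left_end(chr_a, n) != right_end(chr_b, n)

# Renumber B from its other end, i.e. read the opposite strand.
chr_b_flipped = reverse_complement(chr_b)

# The block is now a direct match between two left ends ...
assert left_end(chr_b_flipped, n) == left_end(chr_a, n)
# ... and B's new left end is still the reverse complement of C's right end.
assert left_end(chr_b_flipped, n) == reverse_complement(right_end(chr_c, n))
```

Only the specific pairings change when a chromosome is renumbered; the rule that left/right matches are reverse complementary and same-end matches are direct is unaffected.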
*To whom reprint requests should be addressed, e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955906–7$2.00/0. PNAS is available online at http://www.pnas.org.
Abbreviation: SGD, Saccharomyces Genome Database.
RESULTS
Initial Tests of Terminal Reverse Complementarity. To test the general occurrence of reverse complementary relationship between chromosome terminal regions, the reverse complement of a 5-kb terminal segment of each of the left ends of the 16 chromosomes was compared with all of the chromosomes, using FASTA (10). Reverse complementary regions were observed at the right end of several chromosomes for each of the 16 searches. Always one or more examples extended all the way to the right end of the chromosome. A similar study was carried out for all chromosomes with probes that were the reverse complement of the terminal 2-kb right ends. Also in every case several chromosomes were found with left terminal regions the reverse complement of the right terminal regions, and in every case the sequence similarity of at least one chromosome extended all the way to the left end.
The program that was used, FASTA, selected the best-fitting regions, as it is designed to do, and the regions of similarity were often more extensive than exhibited in this initial search. To avoid missing significant similarities, the best-scoring chromosomes found by the left end probes were divided into 1-kb-long fragments to form libraries that were searched with long probes that consisted of the reverse complement of the left ends of each of the 16 chromosomes. In this way all of the significant regions of reverse complementary sequence similarity were determined, often broken by internal nonmatching regions. The results are diagrammed in Fig. 1.
Fig. 1 shows the right ends of the chromosomes with the matching region of the left end of the matching chromosome identified and the percent similarity (reverse complement) for 1-kb regions listed. In 16 of 16 cases there are high quality reverse complement sequence similarities of left terminal regions at the right ends of different chromosomes. Often several chromosomes share in this sequence similarity, since there are extensive direct terminal region similarities among the same ends of the yeast chromosomes. While these terminal regions include repeated sequences, they are not entirely composed of them. They include regions consisting of just the few copies resulting from the terminal relationships reported here. The precision of match often reaches 100% in the best-matching regions, as shown in Fig. 1. In many cases the central parts of the overlapping regions have the highest precision, and in almost every case one or another of the end parts of the overlap has a lower precision. This suggests that multiple events of sequence exchange are responsible. In every case the telomeric end of the sequence similarity ceases only at the end of the known sequence.
FIG. 1. Selected subset of right end sequences showing extensive similarity (reverse complement) to each of the 16 left ends. Long segments of the left ends of all 16 chromosomes were converted to reverse complement and compared with libraries of 1-kb segments of particular chromosomes, chosen on the basis of previous comparisons. The left column is the number of the chromosome from which the left end probe was extracted. The second column is the number of the chromosome with which it was compared. The top of the figure shows the position in the right end sequence in kilobases. The first example is for chromosome (chr) 1 left end compared with chr 14 and exhibits a reverse complementary region of just over 3 kb terminating at the right end of chr 14. The double line shows the length of the matches found by FASTA and the % match is shown below. The next example, left end of chr 2 on right end of chr 8, is a case of Y repeat similarity and exhibits minor internal deletions or very poorly matching regions. The next example, chr 3 left end on chr 11, does not involve the Y repeat and is mostly made up of single-copy (2 copy) sequences. The next example, chr 4 left end (reverse complement) on chr 10, matches for 18.5 kb and requires two lines for display with the leftward part on the upper line, a very extensive reverse complementary region. This figure displays a subset of reverse complementary regions that does not necessarily include the longest and best-matching regions.
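The probe-and-library preparation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's software: the function names (`terminal_probe`, `fragment_library`, `to_fasta`), the label format, and the toy chromosome are all hypothetical, and the similarity search itself is left to FASTA, which the sketch does not invoke or reimplement.

```python
from typing import Iterable, Iterator, Tuple

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (A/C/G/T, either case)."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_probe(chrom: str, length: int, end: str) -> str:
    """Return the terminal `length` bases from the 'left' or 'right' end."""
    if length <= 0:
        raise ValueError("length must be positive")
    if end == "left":
        return chrom[:length]
    if end == "right":
        return chrom[-length:]
    raise ValueError("end must be 'left' or 'right'")


def fragment_library(name: str, chrom: str, size: int = 1000) -> Iterator[Tuple[str, str]]:
    """Yield (label, fragment) pairs for consecutive non-overlapping fragments.

    Labels use 1-based inclusive coordinates; the last fragment may be shorter.
    """
    for start in range(0, len(chrom), size):
        stop = min(start + size, len(chrom))
        yield f"{name}:{start + 1}-{stop}", chrom[start:stop]


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format (label, sequence) pairs as FASTA-format text."""
    lines = []
    for label, seq in records:
        lines.append(f">{label}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical toy chromosome of 2,500 nt (not real sequence).
toy = "ACGTTGCA" * 312 + "ACGT"
library = list(fragment_library("chrToy", toy))
probe = reverse_complement(terminal_probe(toy, 500, "left"))
query_fasta = to_fasta([("chrToy_left_500_rc", probe)])
library_fasta = to_fasta(library)
```

The two FASTA-format strings would then be written to files and handed to the search program. Cutting the target into 1-kb fragments means each fragment is scored separately, so a long region of similarity is reported as a series of per-fragment matches rather than as the single best-fitting region.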
FIG. 2. Terminal region similarity to the Y repeat. The probe was 8 kb of the right end of chromosome 15. By using FASTA it was compared with a library of the whole yeast genome divided into 1-kb segments. (Upper) All of the direct sequence matches of the 16 chromosomes. All of the significant matches are at the right terminal regions that are displayed. At the top is a scale exhibiting kilobases from the ends. (Lower) Reverse complementary matches, all of which are in the left terminal regions. The symbols indicate the quality of match over the regions identified by FASTA. XXX represents greater than 94% match, while 999 represents from 85% to 94%. The other symbols follow the same pattern down to 666, which represents between 55% and 65% match. All of the longer matches are with the Y-containing chromosome ends. The short matches at the ends are the X2 repeats to be described in a later section.
A variety of searches indicate that there are no long direct sequence similarities between regions near (within 20 kb) the left and those near the right end of the same or any other chromosome. However, there are a few short and imprecise direct sequence similarities between the opposite ends of different chromosomes (200–400 nt and 70% identity or less). These appear to be members of short repeated sequence families that are near the termini. Their presence is almost certainly the result of a different set of phenomena from the long and precise sequence similarities. Often there are about 5-kb-long precise sequence similarities, both forward and reverse complement, between the TY elements at locations spread through the yeast chromosomes. The terminal regions include isolated long terminal repeats (LTRs) of these elements but none of the complete elements.
Genes in the Duplicated Terminal Regions. The presence or absence of genes in the regions shown in Fig. 1 was explored by using the detailed maps and tabular information of the Stanford Saccharomyces Genome Database (SGD; see acknowledgements). No genes were found that meet the stiff criterion that the gene must be genetically mapped as well as confirmed by the DNA sequence. However, many ORFs, ranging up to more than 5 kb in length, are present in the terminal regions. The 18.5-kb inverted sequence similarity between the left end of chromosome 4 and the right end of chromosome 10 (Fig. 1) includes two genes identified on the basis of very good sequence similarity. These are listed in the SGD as related genes in the respective terminal regions. Only one of the regions listed in Fig. 1 does not have an ORF recognized and listed in the SGD, but the reverse complement on the matching chromosome in this case does have a listed ORF. The regions reported below, such as the extensive reverse complementary region between the left end of chromosome 16 and the right end of chromosome 15, share genes that are recognized as closely related in the SGD. In strains that carry the SUC gene it is found in this region between the X and Y repeats (3). In addition the Y repeats contain conserved ORFs that are expressed in meiosis (3). Recently the duplication of genomic ORFs has been examined (11), and
some of the complementary regions described here are reported in ref. 11 because they include ORFs.
Y Repeat Terminal Patterns. It is a matter of interest how much the known repeated sequences contribute to the terminal sequence similarities. Fig. 2 shows the sequence similarities to a particular Y repeat that occurs at the right terminal region of chromosome 15. All of the left end copies are the reverse complement of the right end copies, and all of the copies appear to be terminal in occurrence except that in some cases, such as the chromosome 12 right end, there are additional inner copies. The Y repeat and other sequences in its region form a part of the similarities observed in this work. In some cases large lengths of Y sequence similarity (reverse complement) occur at both ends (chromosomes 5, 8, 12, and 16), but often there is Y sequence only at one end. The Y repeat occurs at about half of the ends and the expectation, if the chromosome ends that contain Y repeats were a random set, is that they would occur at both ends of four chromosomes, as observed. There are also three chromosomes for which both ends lack the Y repeats, another observation that suggests randomness. Nine chromosomes include a Y repeat at one end but lack it at the other (9).
Terminal Relationships as a Whole. To examine the terminal 20-kb sequence similarities as a whole, a large number of comparisons are required. Each chromosome left end was compared as reverse complement against the full length of all chromosomes. Direct sequence searches were also made with the left and right terminal regions, adding up to 768 comparisons. The results are shown diagrammatically in Fig. 3. To make these comparisons the whole yeast genome was divided into about 12,000 1-kb segments in a library. FASTA was used to compare each of the terminal region 20-kb segments with the library. With this method all occurrences of significant sequence similarity are detected.
Fig. 3 shows the regions of similarity to the reverse complement of the left end of each of the 16 chromosomes. The upper left block is the set of 16 similarities of the chromosome 1 left terminal 20-kb region (reverse complement). The next block below is the same for chromosome 2 and below that chromosome 3, etc. Fig. 3 Right gives this information for chromosomes 9–16. These similarities all occur in the right terminal regions of the 16 chromosomes. With the aid of a hand lens you will notice that the majority of these similarities are symbolized XXX and thus are better than 95% matches. Many are literally 100% matches. There are also a significant number of 999s, meaning 85–95% matches. The overall view without a hand lens shows the patterns very well. For example, the second diagram in Fig. 3 Left is for chromosome 2 left end (reverse complement), which contains the Y repeat, and thus the pattern shows all of the right ends that also contain the Y repeat. Thus it is clear that the left ends of chromosomes 2, 5, 6, 8, 9, 10, 12, 13, 14, and 16 contain the Y repeat, and similarly it is present on the right ends of chromosomes 4, 5, 7, 8, 12, 15, and 16, which have a consistent relationship to the other Y repeats. There is a lot of variation in these patterns and the Y repeat is merely a prominent part.
Chromosome 1 is exceptional and includes the W repeat shown on the first line of Fig. 3 as more than 8 kb of precise reverse complement similarity between the two ends of chromosome 1. In addition there is an extensive region of direct similarity to chromosome 8 right end (not shown). The long and precise direct sequence relationships of the left ends are restricted to the left ends of the other chromosomes and the extensive direct sequence similarities of the right ends all occur on right ends (not shown). Direct terminal region similarity data will appear on my web page, as there is not space here.
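The Y repeat tallies quoted above can be checked directly from the two chromosome lists given in the text. The sketch below also computes the expectation under one simple null model, which is an assumption of this example rather than a calculation given in the paper: the 17 Y-containing ends are treated as a random subset of the 32 chromosome ends.

```python
from fractions import Fraction

N_CHROMOSOMES = 16

# Chromosome ends carrying the Y repeat, as listed in the text.
Y_LEFT = {2, 5, 6, 8, 9, 10, 12, 13, 14, 16}
Y_RIGHT = {4, 5, 7, 8, 12, 15, 16}

all_chromosomes = set(range(1, N_CHROMOSOMES + 1))
both_ends = Y_LEFT & Y_RIGHT                       # Y at both ends
one_end = Y_LEFT ^ Y_RIGHT                         # Y at exactly one end
neither_end = all_chromosomes - (Y_LEFT | Y_RIGHT)  # Y at neither end


def expected_counts(n_chrom: int, n_y_ends: int) -> dict:
    """Expected numbers of chromosomes with Y at both, one, or neither end.

    Assumed null model: the Y-containing ends are a random subset of the
    2 * n_chrom ends (sampling without replacement).
    """
    n_ends = 2 * n_chrom
    n_plain = n_ends - n_y_ends
    p_both = Fraction(n_y_ends, n_ends) * Fraction(n_y_ends - 1, n_ends - 1)
    p_neither = Fraction(n_plain, n_ends) * Fraction(n_plain - 1, n_ends - 1)
    p_one = 1 - p_both - p_neither
    return {
        "both": n_chrom * p_both,
        "one": n_chrom * p_one,
        "neither": n_chrom * p_neither,
    }


n_y_ends = len(Y_LEFT) + len(Y_RIGHT)
expected = expected_counts(N_CHROMOSOMES, n_y_ends)

print("observed:", len(both_ends), len(one_end), len(neither_end))
print("expected:", {k: round(float(v), 2) for k, v in expected.items()})
```

Under this model the expected counts are about 4.4, 8.2, and 3.4 chromosomes with Y at both ends, one end, and neither end, close to the observed 4, 9, and 3.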
If a left end has few reverse complementary similarities to the right ends of other chromosomes it also has few direct similarities to the left ends of other chromosomes, reflecting the presence or absence of the Y repeats. There are many precise matches, including one very extensive match of 18.5 kb between the chromosome 4 left end and the chromosome 10 right end, mentioned earlier. At the very end of Fig. 3 is shown a match of nearly 20 kb between left chromosome 15 (reverse complement) and right chromosome 16. These surely resulted from extensive events of recombination or copying between the ends of inverted chromosome pairs. The extent to which the Y repeats are involved in exchanges is not obvious, but it seems likely that their precise sequence similarities are the result of both exchanges between opposite ends of different chromosomes and exchanges between same ends. There are several examples of extensive relationship separate from the Y sequences. For example, left ends of chromosomes 9 and 10 are direct copies of each other for 20 kb. A separate examination showed that the copying extends for another 2.1 kb toward the centromere.
All of the similarities in the terminal regions are shown in Fig. 3, and the long and precise terminal sequence relationships are restricted to the last 20 kb at each end shown there. In every case there are also examples of short and imperfect sequence similarity occurring in a variety of locations, representing short repetitive sequences that for the most part are members of unknown families. The number of these similarities to a given chromosome terminal region ranges from 3 to 16 except for long terminal repeats (LTRs) of mobile elements, for example on chromosome 2 and chromosome 15 left ends. In these cases delta elements are present that match about 100 other sequences. In addition, at 14 kb on chromosome 15 left end is a 200-bp-long element that matches about 60 other locations and is unknown to me. These short and imprecise similarities and the LTR similarities are quite distinct categories of relationship from the major long and precise similarities that are involved in the pattern of exclusively complementary relationships between the left and right terminal regions.
Just-Subtelomeric Sequences and the “X2” Repeat. In attempting to find some indication of the function of the reverse complementary relationship between the ends it seemed that the very terminal sequence patterns might contain clues. Terminal short sequences (2 kb) of all chromosomes were multiply aligned with CLUSTALW, with the left end sequences present as reverse complement. The resulting alignments are quite good almost to the end. The interesting result is that the just-subtelomeric sequences fall into two classes with very good sequence similarity within the classes but none between the classes. The first class is made up of the telomeric end of the Y repeat. It occurs as the just-subtelomeric sequence of all of the Y-containing chromosome ends as listed above. The second class includes the X repeat (3) but is more extensive and thus it has been named the X2 repeat to avoid confusion with previous descriptions of X repeats. It is present on all of the chromosome ends that do not contain the Y repeat, and Fig. 4 shows the consensus sequence for the 11 best-matching members. Most members agree with the consensus at about 90% of positions, but chromosome 1 right end matches only 62%. There are other occurrences of the X2 sequence, which may be important. It occurs centromeric of the Y repeat on all of the Y-containing chromosome ends. Thus the X2 sequence occurs on all chromosome ends, although the example 6 kb in from the left end of chromosome 5 is quite short (125 nt). It is possible that if the X2 repeat has a function it could be carried out from either location. There are small sequences between the X and Y repeats known as STR sequences, some combination of which is found at most ends.
These probably form a part of the X2 sequence. Most of the X2 repeats that occur centromeric of the Y repeats are well conserved and quite similar to those in just-subtelomeric locations. The conservation of the sequences of 30 of the X2 repeats cannot easily be explained by recombination because of their different locations.
The alignments of the just-subtelomeric sequences are of sufficient precision that they can be used to decide if chromosome sequences are complete to the ends. The analysis shows that half a dozen ends are incomplete, missing nearly 200 nt from the just-subtelomeric sequences. Nevertheless, for every end enough of the just-subterminal region is present to permit clear identification of the X2 repeat or the Y repeat. Most of the missing sequence ends are from Y-containing ends, apparently due to difficulties of cloning. An alternative could be that some other mechanism besides telomerase stabilizes the ends of the chromosomes and the listed sequences are correct, but for the present it seems safer to assume that the sequences are incomplete.
FIG. 3. Partial results of 768 comparisons of 20-kb probes with the complete yeast genome. Twenty-kilobase terminal segments of all 32 chromosome ends are compared as reverse complement with a library of the whole yeast genome divided into 1-kb segments, using FASTA. All of the regions where sequence similarity was recognized are indicated by numbers on the diagrams of the 20-kb terminal regions, each symbol representing 200 bp. The numbers describe the precision of match with the meaning mentioned in the legend of Fig. 2. In each block the 16 chromosomes are arranged in order from top to bottom, numbered at the left. At the top of each block is listed the distance from the end of the chromosomes in kb. The comparisons of left end reverse complements are shown and the other comparisons will be on my web page (www.cco.caltech.edu/~rbritten). This diagram describes the similarities to 20-kb reverse complement probes with a block for each probe, the upper left exhibiting the relationships of a probe from the left end of chromosome 1. The next block below is for a probe from chromosome 2 left end, and this probe contains the Y repeat, so that all of the chromosomes with a Y at their right ends show large regions of sequence similarity in this block. This relationship establishes the major visible pattern of Fig. 3, but there are many other features. Where a probe does not contain the Y repeat, as for 1 or 3, the pattern is simpler and includes the just-subterminal similarities where the right end sequence does not contain the Y repeat. Where the right end sequence does contain the Y repeat, then a small block of similarity appears at about 6 kb in from the end, where the copy of the X2 repeat is present in these chromosomes.
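The match-quality symbols used in Figs. 2 and 3 can be expressed as a small lookup. This is a sketch under stated assumptions: the figure legend gives only the XXX, 999, and 666 bins explicitly, so the 888 and 777 bins are inferred from "the same pattern", the lower bounds of 95, 85, 75, 65, and 55 assume whole-number percentages in 10-point bins, and the `---` placeholder for weaker matches is invented here. The count of 768 comparisons is reproduced on the reading that three kinds of 20-kb probe were each taken from 16 chromosomes and compared with all 16 chromosomes.

```python
def match_symbol(percent_identity: float) -> str:
    """Map a percent identity to the three-character symbol of Figs. 2 and 3.

    The bins are an assumption of this sketch (see the lead-in).
    """
    bins = [(95.0, "XXX"), (85.0, "999"), (75.0, "888"), (65.0, "777"), (55.0, "666")]
    for lower_bound, symbol in bins:
        if percent_identity >= lower_bound:
            return symbol
    return "---"  # hypothetical placeholder for matches below 55%


# Three probe kinds x 16 probe chromosomes x 16 target chromosomes.
PROBE_KINDS = ("left end, reverse complement", "left end, direct", "right end, direct")
N_CHROMOSOMES = 16
n_comparisons = len(PROBE_KINDS) * N_CHROMOSOMES * N_CHROMOSOMES

for pct in (100, 93, 80, 70, 60, 40):
    print(pct, match_symbol(pct))
print("comparisons:", n_comparisons)
```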
FIG. 4. The X2 repeat consensus. The 11 chromosome ends that include very good copies of the just-subtelomeric X2 repeat were aligned by using CLUSTALW, allowing for the fact that left ends are reverse complements of right ends. A consensus sequence was made showing all positions in uppercase where six or more matched. The X2 consensus begins where there is general agreement between all examples and ends early in the telomerase simple sequence. In this 800-nt region all 11 sequences match each other with high quality, between 86% and 93%. They are in complete agreement for 63% of the positions. The conserved core of the X repeat (3) begins at position 27 and agrees well to about position 400. The sequence has structure but is not recognizably a tandem repeat of a simpler sequence.
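The consensus rule in the Fig. 4 legend (uppercase where six or more of the 11 aligned sequences agree) can be sketched as below. The handling of other positions is an assumption of this sketch, since the legend does not say how they were shown: here the most common base is written in lowercase, and a column consisting only of gaps is written as `-`. The toy alignment is invented and is not the X2 alignment.

```python
from collections import Counter
from typing import Sequence


def consensus(aligned: Sequence[str], min_support: int = 6) -> str:
    """Column-wise consensus of equal-length aligned sequences ('-' = gap).

    Uppercase: the most common base is present in at least `min_support` rows.
    Lowercase: most common base with weaker support (assumption of this sketch).
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    out = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column if base != "-")
        if not counts:
            out.append("-")
            continue
        base, support = counts.most_common(1)[0]
        out.append(base if support >= min_support else base.lower())
    return "".join(out)


def percent_agreement(member: str, cons: str) -> float:
    """Percent of non-gap consensus positions at which `member` has the same base."""
    pairs = [(m.upper(), c.upper()) for m, c in zip(member, cons) if c != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(m == c for m, c in pairs) / len(pairs)


# Invented toy alignment of 11 rows (not the X2 repeat alignment).
toy_alignment = [
    "ACT-", "ACT-", "ACT-", "ACT-", "ACT-", "ACG-",
    "AGG-", "AGG-", "AGG-", "AG--", "AG--",
]
toy_consensus = consensus(toy_alignment)
print(toy_consensus)
print([round(percent_agreement(row, toy_consensus)) for row in toy_alignment])
```

In the toy alignment the first column is unanimous, the second is supported by six rows, and the third by only five, so only the third is written in lowercase.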
DISCUSSION
The terminal regions (the last 20 kb) of the yeast chromosomes stand out by sharing blocks of sequence with each other. They share only with each other and not with other regions of the chromosomes, except for short less precise repeats, which apparently have an origin different from the longer high precision sequence similarities. The consistent reverse complementary relationship of the left and right terminal regions suggests that there has been a history of exchange events occurring between inverted pairs of chromosomes.
Consider the third block down in Fig. 3 Left, diagramming the similarities to chromosome 3 left end. There is an extensive block of nearly 7 kb of similarity to chromosome 11 right end that is the product of sequence exchange or copying between inverted pairs of chromosomes. However, the similarities to all the other chromosomes are due to the fact that the left end of chromosome 3 contains the X2 repeat. There are the just-subtelomeric similarities to the right ends of chromosomes 1, 2, 3, 6, 9, 10, 11, 13, and 14. Also there are similarities to the X2 repeats at the inner or centromeric ends of the Y repeats on the other chromosomes. These X2 repeat similarities must be considered to be due to selection on these sequences rather than exchange. That is because they occur in both the inner and terminal locations, and it is difficult to see how exchange could be limited to these short X2 repeats. Nevertheless, all of these relationships are reverse complementary between the left and right ends. The pattern observed for chromosome 3 is frequent and is shown in Fig. 3 for all cases where the probe left end sequence does not include a Y repeat.
What explains the reverse complementary relationship between left and right ends where massive exchange between opposite ends does not seem to have occurred recently? It seems likely that the telomerase simple sequences have a required orientation with respect to the ends and thus must be reverse complements of each other. It is assumed that DNA synthesis proceeds outward toward the telomeres. While this is consistent with the reverse complementary relationships of the whole terminal regions, it does not supply a reason for them except for the telomerase simple sequences. There is probably a functional reason for the orientation of the X2 sequences, having to do with specific protein bindings that are part of the complex control system that stabilizes the chromosome ends.
Candidate models for the exchange process are recombination by breakage and rejoining, sequence conversion, or some unknown mechanism for sequence copying and insertion. The insertions and deletions and mismatches within the similar regions shown in Fig. 1 suggest that in many cases multiple events have occurred and that deletions and base substitutions leading to mismatch have occurred subsequent to the events that created precise inverted sequence matches. It is likely that the process has been regenerative, because once inverted matching sequence regions are present between opposite ends the chromosomes could be aligned by the mechanisms that permit normal recombination to occur between matching pairs of diploid chromosomes. The existence of many near-perfect (95–100%) matching regions in Figs. 1, 2, and 3 suggests that the typical event has been quite recent, and only a little base substitution has occurred since. Thus the events are quite frequent on an evolutionary time scale.
It is interesting that in many cases the terminal simple sequence created by telomerase is included in the sequence similarity. In every alignment for which both sequences extend into the telomerase simple sequences the alignment extends as far as the sequences go and remains good but not perfect. It may be due to patterns created
by the telomerase, but exchange has to be considered as partially responsible. These exchanges very likely influence the stability of the terminal regions of the yeast chromosomes (4). It is clear that the known long repeats of the terminal regions—e.g., Y, X, and W—are an intimate part of the process of terminal region sequence exchange that is probably responsible for the patterns shown in Fig. 3. It seems likely that they originate in this process. That leaves open the question as to whether all of these duplications and multiplications are simply the inevitable result of the process of terminal region sequence exchange. Whether they have useful functions is yet uncertain, though the X2 repeat (Fig. 4) is well conserved and present on every chromosome end. There are no examples of precise and long direct sequence similarity between terminal regions on the opposite ends of chromosomes. This finding suggests that the orientation of these sequences or part of them is important to yeast survival. These sequences point outwards (or inwards depending on point of view) from all yeast chromosomes. They are terminated by the simple sequences generated by telomerase. The telomerase sequences are clearly significant to chromosome stability and replication, but there is good evidence that they carry out other functions. Changing their length affects survival (4). The central issue is, of course, the evolutionary role and potential function of the reverse complementary relationship of the terminal regions, but little can yet be said. Finally, it seems very unlikely that this pattern of asymmetry is restricted to yeast chromosomes. As the human genome project advances so that sufficient lengths of terminal regions are available it will be interesting to see how well the reverse complementary relationship holds in our own genome. 
The prediction is that it will be very similar to the yeast situation, with allowance for different telomerase-synthesized sequences and lengths and distinct sets of repeats in the subterminal regions.

The yeast chromosomal sequences were obtained from the Stanford SGD (http://genome-www.stanford.edu/Saccharomyces/). Thanks to Ed Louis for preprints. Johnny Williams prepared useful software in the Perl language. This work was supported by National Institutes of Health grants.

1. Goffeau, A., Aert, R., Agostini-Carbone, M.L., Ahmed, A., Aigle, M., Alberghina, L., Albermann, K., Albers, M., Aldea, M., Alexandraki, D., et al. (1997) Nature (London) Suppl. 387, 5–105.
2. Pryde, F.E., Gorham, H.C. & Louis, E.J. (1997) Curr. Opin. Genet. Dev. 7, 822–828.
3. Louis, E.J. (1995) Yeast 11, 1553–1573.
4. Zakian, V.A. (1996) Annu. Rev. Genet. 30, 141–172.
5. Kramer, K.M. & Haber, J.E. (1993) Genes Dev. 7, 2345–2356.
6. Wellinger, R.J., Ethier, K., Labrecque, P. & Zakian, V.A. (1996) Cell 85, 423–433.
7. Louis, E.J., Naumova, E.S., Lee, A., Naumov, G. & Haber, J.E. (1994) Genetics 136, 789–802.
8. Flint, J., Bates, G.P., Clark, K., Dorman, A., Willingham, D., Roe, B.A., Micklem, G., Higgs, D.R. & Louis, E.J. (1997) Hum. Mol. Genet. 6, 1305–1314.
9. Louis, E.J. & Borts, R.H. (1995) Genetics 139, 125–136.
10. Pearson, W.R. & Lipman, D.J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444–2448.
11. Coissac, E., Maillier, E. & Netter, P. (1997) Mol. Biol. Evol. 14, 1062–1074.
A UNIFIED STATISTICAL FRAMEWORK FOR SEQUENCE COMPARISON AND STRUCTURE COMPARISON
5913
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5913–5920, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
A unified statistical framework for sequence comparison and structure comparison
(sequence analysis/structure analysis/fold family/database statistics/protein evolution)

MICHAEL LEVITT*† AND MARK GERSTEIN‡

*Department of Structural Biology, Stanford University, Stanford, CA 94305; and ‡Molecular Biophysics and Biochemistry Department, P.O. Box 208114, Yale University, New Haven, CT 06520–8114

ABSTRACT We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be distantly related shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure, whereas many pairs have significant similarity in terms of structure but not sequence.

Comparison is a most fundamental operation in biology. Measuring the similarities between “things” enables us to group them in families, cluster them in trees, and infer common ancestors and an evolutionary progression. Biological comparisons can take place at many levels, from that of whole organisms to that of individual molecules. We are concerned here with comparison on the latter level, specifically, with comparisons of individual protein sequences and structures. (For an example of systematic comparison applied to whole organisms, see refs. 1 and 2.) Our overall aim is to describe these two types of comparisons in a self-consistent, unified framework.

For sequence or structure comparison, each act of comparing one “entity” to another (that is, comparing either two sequences or two structures) involves two steps. First, the two objects are aligned optimally through the introduction of gaps in such a way as to maximize their residue-by-residue similarity. This operation generates some form of total similarity score for the number of residues matched: traditionally, a percent identity for sequences or an rms for structures, although we will use other measures. Second, one has to assess the significance of this score in the context of what is known about the proteins currently in the database.

In earlier papers, Gerstein and Levitt (3, 30) extended the work of Subbiah et al. (4) and Laurents et al. (5) and described an approach for structural alignment in an analogous fashion to the traditional approach for sequence alignment (6–9). Like sequence alignment, this method involves applying dynamic programming to a matrix of similarities between individual residues to optimize their overall correspondence through the introduction of gaps.
In this paper, we tackle the second of the two steps in protein comparison: assessing significance. We developed a simple empirical approach for calculating the significance of an alignment score based on doing an all-vs.-all comparison of the database and then curve fitting to the distribution of scores of true negatives. This allows us to express the significance of a given alignment score in terms of a P value, which is the chance that an alignment of two randomly selected proteins would obtain a score at least this good. We applied our approach consistently to both sequences and structures. For sequences, we could compare our fit-based P values with the differently derived statistical scores from commonly used programs such as BLAST and FASTA (10–13). The agreement we found validated our approach. For structure alignment, we followed a parallel route to derive an expression for the P value of a given alignment in terms of the structural alignment score.

Our work followed on much that recently has been done in assessing the significance of sequence and structure comparison. One of the major developments in the past few years has been the implementation of probabilistic scoring schemes (13–16). These give the significance of a match in terms of a P value rather than an absolute, “raw” score (such as percent identity). This places scores from very different programs in a common framework and provides an obvious way to set a significance cutoff (that is, at P<0.0001, or 0.01%). P values were first used in the BLAST family of programs, where they are derived from an analytic model for the chance of an arbitrary ungapped alignment (10, 17). P values subsequently have been implemented in other programs, such as FASTA and gapped BLAST, by using a somewhat different formalism (13, 18, 19).
†To whom reprint requests should be addressed. e-mail: michael.levitt@stanford.edu.
© 1998 by The National Academy of Sciences 0027–8424/98/955913–8$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviation: scop, Structural Classification of Proteins.
There are currently many methods for structural alignment (20–31). Some of these are associated with probabilistic scoring schemes. In particular, one method (VAST) computes a P value for an alignment based on measuring how many secondary structure elements are aligned as compared with the chance of aligning this many elements randomly (28). Another method (27, 32) expresses the significance of an alignment in terms of the number of standard deviations it scores above the mean alignment score in an all-vs.-all comparison (i.e., a Z-score).

Data Set Used for Testing. One of the most important aspects of our analysis is that we carefully tested it against the known structural relationships. This testing allowed us to decide unambiguously whether a given comparison resulted in a true-positive or a false-positive and to decide objectively between different statistical schemes. In particular, structures were taken from the Protein Data Bank (33, 34), and definitions of domains, structural classes, and structural similarities were taken from the Structural Classification of Proteins (scop) database (version 1.32; refs. 35–37). The creators of scop have clustered the domains in the Protein Data Bank on the basis of sequence identity (38, 39). At a sequence identity level of 40%, this clustering resulted in 941 unique sequences corresponding to the known structural domains. These 941 sequences were what we used as test data for both the sequence and structure comparisons. They contained 390 different superfamilies and 281 different folds. Because superfamily pairs have a considerably closer and more certain relationship than fold pairs, we concentrated here on superfamily pairs. These 2,107 nontrivial, pairwise relationships between the domains formed our set of true-positives.

Sequence Comparison Statistics.
Sequence matching was done with standard approaches. In particular, we used the SSEARCH implementation of the Smith-Waterman algorithm (7) [from the FASTA package, version 3 (12, 40); the URL is ftp://ftp.virginia.edu/pub/fasta], with a gap-opening penalty of –12, a gap-extension penalty of –2, and the BLOSUM50 substitution matrix [which has a maximal match score of 13 (for C to C) and an average match score of –0.36].

A probability-density function for sequence-comparison scores. Each pairwise sequence comparison was best quantified by three numbers, $S_{seq}$, $n$, and $m$, where $S_{seq}$ is the raw sequence alignment score and $n$ and $m$ are the lengths of the two sequences compared. Comparing all possible pairs of sequences allowed us to calculate an observed probability density, $\rho^{o}_{seq}$, for the chance of finding a pair of sequences with particular values for $S_{seq}$ and $\ln(nm)$. Fig. 1A shows the density for pairs between all sequences. This includes the scores for 300 sequence pairs that are closely related, which clearly show up as “spots” on the right side of the plot. These high-scoring “true-positives” are removed in Fig. 1B, which shows the density for just the pairs in different structural classes (42), i.e., the pairs that definitely are unrelated. This is the density distribution that we aim to fit.

FIG. 1. A probability-density distribution for sequence comparison scores, contoured against $S_{seq}$, the sequence alignment score (along the horizontal axis), and $\ln(nm)$, where $n$ and $m$ are the lengths of the pair of sequences (along the vertical axis). This density is related closely to the raw data (via normalization) obtained by counting the number of pairs with particular $S_{seq}$ and $\ln(nm)$ values. Because of the wide range of density values, contours of the logarithm of the density are drawn with an interval of 1 (a full order of magnitude). When contouring the logarithm of a density function, special attention must be paid to the zero values. Here, a zero value is set to 0.001, which effectively lifts the entire surface by 3 log units. The data then are smoothed by averaging with a Gaussian function, $\exp\{-[s/(\Delta S_{seq}/3)]^{2}\}$, over a window 14 units wide along the $S_{seq}$ axis. This smoothing together with the treatment of zeros serves to emphasize the smallest observed counts (values of 1) by surrounding them with three contour levels. (A) Data from all 884,540 pairs between any one of the 941 sequences and any other sequence (pairs A-B and B-A are both included). The significant sequence matches are seen as the isolated spots at high values of the score $S_{seq}$. (B) Data from 352,168 pairs, including only those pairs of sequences in different scop classes. We also exclude pairs between an all-α or all-β domain and an α+β domain, as well as sequences that are not in one of the five main scop classes: α, β, α/β, α+β, and α+β (multidomain). This exclusion is done to ensure that no significant matches will be found, which indeed is seen in the figure by the absence of any outlying spots at high score values. Thus, the density in B is free of any significant matches and shows the underlying density distribution expected for comparison of unrelated sequences.

Fig. 2A shows the density distribution as a function of $S_{seq}$ for sections at constant $\ln(nm)$. The clear linear relationship between $\log\rho^{o}_{seq}$ and $S_{seq}$ at high values of $S_{seq}$ is indicative of an extreme-value distribution,

$$\rho^{c}_{seq}(Z)=\exp(-Z-\exp(-Z)).$$
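The connection between a linear fall-off of the log density and the extreme-value form can be checked numerically. The following is a minimal sketch in Python, assuming the standard reduced form exp(−Z − exp(−Z)) for the density; the function names are illustrative and are not taken from the paper.

```python
import math

def evd_density(z):
    """Reduced extreme-value density exp(-z - exp(-z)); its mode is at z = 0."""
    return math.exp(-z - math.exp(-z))

def evd_log10_density(z):
    """Base-10 logarithm of the reduced extreme-value density."""
    return (-z - math.exp(-z)) / math.log(10.0)

# For large z the exp(-z) term becomes negligible, so the log density
# decreases linearly with slope -1/ln(10) per unit of z.
tail_slope = evd_log10_density(11.0) - evd_log10_density(10.0)
```

Because the reduced variable is linear in the raw score, a straight-line tail in the reduced variable also appears as a straight-line tail when the log density is plotted against the score itself.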
The variable $Z$ was defined in terms of $S_{seq}$ and $\ln(nm)$ by using the “Z-score-like” expression $Z=(S_{seq}-\mu_{seq})/\sigma_{seq}$, where $\mu_{seq}=a\ln(nm)+b$ and $\sigma_{seq}=a$ are the most likely sequence score and the width parameter for the distribution. The two adjustable parameters $a$ and $b$ were obtained by fitting the calculated density $\rho^{c}_{seq}(Z)$ to the observed density $\rho^{o}_{seq}(Z)$ for all values of $S_{seq}$ and $\ln(nm)$. Substituting for $\mu_{seq}$ and $\sigma_{seq}$ in the expression for $Z$ above gave

$$Z=\frac{S_{seq}-a\ln(nm)-b}{a}=\frac{S_{seq}}{a}-\ln(nm)-\frac{b}{a}.$$

To derive specific values for the $a$ and $b$ parameters, we fit the above formulas to the observed density distribution obtained by comparing pairs in different scop classes, getting $a=5.84$ and $b=-26.3$. The fit was done by least-squares optimization by using the simplex minimizer in MATLAB (MathWorks, Natick, MA). It has a residual of 0.084, which was calculated by using the standard relation

$$r=\frac{\sum_{i} w_{i}(O_{i}-C_{i})^{2}}{\sum_{i} w_{i}O_{i}^{2}},$$

where $i$ indexes “bins” with particular $S_{seq}$ and $\ln(nm)$ values, $O_{i}=\log\rho^{o}_{seq}$ is the observed density in a bin, $C_{i}=\log\rho^{c}_{seq}$ is the calculated density in a bin, $w_{i}=1/N_{i}$ is a weighting factor, $N_{i}$ is the number of sequence pairs in a bin, and the summation is over all bins, $i$, with $\ln(nm)$ between 5.9 and 13.5.

A cumulative sequence distribution function, giving the P value. To estimate the statistical significance of a particular comparison in terms of particular $S_{seq}$, $n$, and $m$ values, we needed the cumulative distribution function $P_{seq}(z>Z)$, which is defined as the probability that matching any two random sequences will give a $z$ value greater than or equal to $Z$. This is just the integral of $\rho^{c}_{seq}(z)=\exp(-z-\exp(-z))=\exp(-z)\exp(-\exp(-z))$ from $z=Z$ to $z=\infty$, so that $P_{seq}(z>Z)=1-\exp(-\exp(-Z))$. Writing $Z$ in terms of $S_{seq}$, $n$, and $m$ gives
$$P_{seq}(s>S_{seq})=1-\exp\left(-\exp\left(-\frac{S_{seq}}{a}+\ln(nm)+\frac{b}{a}\right)\right),$$

where the parameters $a$ and $b$ are given above.

Relation to BLAST P value. For sequence comparison without gaps, Karlin and Altschul (10, 11) derived the following cumulative distribution function:

$$P_{K\&A}(S>S_{seq})=1-\exp\left(-\exp\left(-\lambda\left(S_{seq}-\frac{\ln(Kmn)}{\lambda}\right)\right)\right)=1-\exp\left(-\exp\left(-\lambda S_{seq}+\ln(Kmn)\right)\right),$$

where $\lambda$ and $K$ are calculated analytically based on the sequence composition and amino acid scoring matrix. Comparison of their analytical form with our P value expression shows that $\lambda=1/a$ and $K=\exp(b/a)$. Substituting the specific values for $a$ and $b$ that we calculated from the fit, we found that $\lambda=0.171$ and $K=0.011$. For the particular database sequences and amino acid scoring matrix used here, the values for $\lambda$ calculated by Karlin and Altschul’s formula ranged from 0.217 to 0.259, all somewhat larger than our value for $\lambda$.

FIG. 2. Cross-sections of the sequence and structure density distributions show that they are both extreme-value distributions and that the calculated distribution fits the observed distribution well. (A) Plots of the logarithm of the observed, $\log\rho^{o}_{seq}$, and calculated, $\log\rho^{c}_{seq}$, sequence pair densities against the sequence match score $S_{seq}$; $\log\rho^{o}_{seq}$ is taken from the data for pairs in different classes (Fig. 1B). Each panel shows the variation of the density with $S_{seq}$ for a particular value of $\ln(nm)$, where $nm$ is the product of the lengths of the sequences compared; this value is indicated by assuming $n=m$ and showing the value of $n$. The observed density is clearly an extreme-value distribution with a linear fall-off of $\log\rho^{o}_{seq}$ with $S_{seq}$. The calculated distribution obtained with a two-parameter fit (dashed line, see text) is a good fit for all values of $n$ [or $\ln(nm)$]. (B) Plots of the logarithm of the observed, $\log\rho^{o}_{str}$, and calculated, $\log\rho^{c}_{str}$, structure pair densities against the structure match score $S_{str}$; $\log\rho^{o}_{str}$ is taken from the data for pairs in different classes (Fig. 4B). Each panel shows the variation of the density with $S_{str}$ for a particular value of the number of aligned residues, $N$. The calculated distribution obtained with a five-parameter fit (dashed line, see text) is a good fit for all values of $N$.

Relation to FASTA E value. In the FASTA sequence comparison programs (12, 13, 18), the significance of a given alignment score $S_{fa}$ is estimated by fitting an extreme-value distribution to scores resulting from comparison of a given query sequence to each sequence in the database. The distribution is recomputed for each new query so that, unlike our approach, each query sequence is associated with a different distribution function. This type of association has the advantage of allowing for any peculiarities of the query sequence (e.g., composition bias), but it also means that one cannot estimate the significance of a single pairwise comparison of two sequences. The value used by FASTA in judging the significance of a sequence similarity is known as the expectation value or E value (here $E_{fa}$). The P value, defined above, gives the statistical significance of a single comparison, whereas the E value is an estimate of the expected number of false-positives (dissimilar matches with a significant score) for a search of the entire database. With $N_{db}$ entries in the database, the E value $E_{seq}$ is calculated from our $P_{seq}(s>S_{seq})$ as $E_{seq}=N_{db}P_{seq}$. The E values we obtained were very similar to those found by FASTA over a very wide range of values (Fig. 3). When one considers that our closed-form $E_{seq}$ depends on only two parameters for all pairs whereas $E_{fa}$ is optimized separately for each query sequence (941×2=1,882 parameters in all), this agreement is striking.
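The P value and E value expressions above, together with the mapping onto the Karlin and Altschul parameters, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function and constant names are hypothetical, and only the fitted values a=5.84 and b=–26.3 are taken from the text.

```python
import math

A_SEQ = 5.84    # fitted width parameter a quoted in the text
B_SEQ = -26.3   # fitted offset parameter b quoted in the text

def z_seq(score, n, m, a=A_SEQ, b=B_SEQ):
    """Reduced score Z = (S - a*ln(nm) - b) / a for sequence lengths n and m."""
    return (score - a * math.log(n * m) - b) / a

def p_seq(score, n, m, a=A_SEQ, b=B_SEQ):
    """P value 1 - exp(-exp(-Z)); expm1 keeps precision when P is tiny."""
    return -math.expm1(-math.exp(-z_seq(score, n, m, a, b)))

def e_seq(score, n, m, n_db, a=A_SEQ, b=B_SEQ):
    """E value: expected number of chance hits in n_db comparisons."""
    return n_db * p_seq(score, n, m, a, b)

# Equivalent Karlin-Altschul style parameters implied by the fit.
LAMBDA = 1.0 / A_SEQ
K = math.exp(B_SEQ / A_SEQ)

def p_karlin_altschul(score, n, m, lam, k):
    """Same cumulative form written as 1 - exp(-K*m*n*exp(-lambda*S))."""
    return -math.expm1(-k * m * n * math.exp(-lam * score))
```

For a hypothetical pair of 200-residue sequences scoring 100, `e_seq(100, 200, 200, 940)` gives the expected number of chance matches when each query is compared with the 940 other database entries, as in Fig. 3.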
FIG. 3. The statistical significance derived here is shown to be similar to that derived in a completely different way by the sequence comparison program SSEARCH from the FASTA package (13). We plotted the expected number of errors per search of the database obtained by Pearson’s method, $\log(E_{fa})$, against the same value calculated here, $\log(E_{seq})$ (which is a function of the sequence match score $S_{seq}$ and the lengths of the two sequences). To be more specific, $E_{fa}$ is the E value output by the FASTA SSEARCH program whereas $E_{seq}$ is calculated as $940\,P_{seq}(s>S_{seq})$ for score $S_{seq}$. The accuracy of our simple two-parameter fit is confirmed by the fact that most pairs of $\log(E_{fa})$ and $\log(E_{seq})$ values lie close to the line $\log(E_{fa})=\log(E_{seq})$ over the entire range.

Measuring coverage vs. error rate to compare different formalisms for significance-statistics. We have presented two forms of E value statistics for sequence comparison: our method, $E_{seq}$, which is based on fitting a two-parameter model to the observed distribution of alignment scores; and the FASTA method, $E_{fa}$, which is based on fitting different distributions for each query. Now we naturally are led to ask whether there is an objective way to decide which formalism performs the best on some representative test data. The seminal work of Brenner et al. (39) and Brenner (43) provides a framework for such an assessment by using the known true-positives in the scop database and a coverage-vs.-error plot. To compare any two significance-statistics formalisms, we proceeded as follows for each:

(i) For each of the pairs in the all-vs.-all comparison (941×940 pairs), we determined an E value and noted whether the pair was a true-positive or true-negative (for true-positives, both sequences must belong to protein domains with the same fold in the scop classification).
(ii) We sorted the pairs by increasing E value.
(iii) We counted down the list from best to worst until the number of false-positives was 1% of the total number of database entries (here, this was 9 false-positives, which is 1% of 941).
(iv) We got the threshold E value at this point, which ideally should be close to 0.01, so as to correspond to the 1% error rate per query.
(v) Finally, we got the number of entries that were more significant than the threshold E value; this number defined the coverage, which should be as large as possible.

Here, we compared the coverage and error rate of our sequence score statistics with those of FASTA ($E_{seq}$ vs. $E_{fa}$). At the threshold E value, our sequence statistics had $\log E_{seq}=-1.98$ and a coverage of 328, and the FASTA statistics had $\log E_{fa}=-1.68$ and a coverage of 379. The FASTA statistics had better coverage, but our statistics had an almost perfect threshold value, which should be –2 for a 1% error rate.

Structure Comparison Statistics. The procedure we used for pairwise structural alignment is described in detail in Gerstein and Levitt (3, 30) and is summarized only briefly here. Our core method was based on iterative application of dynamic programming. As such, it was a simple application of the Needleman-Wunsch sequence alignment (6). It originally was derived from the ALIGN program of Cohen (21, 31), with many subsequent refinements. One starts with two structures in an arbitrary orientation. Then one computes all pairwise distances between every atom in the first structure and every atom in the second, which results in an interprotein distance matrix in which each entry, $d_{ij}$, corresponds to the distance between residue $i$ in the first structure and residue $j$ in the second (interresidue distances usually are expressed between α-carbons). This distance matrix, $d_{ij}$, can be converted into a similarity matrix, $S_{ij}$, through the relationship $S_{ij}=M/[1+(d_{ij}/d_{0})^{2}]$, where $M=20$ and $d_{0}=5$ Å.
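The conversion from interprotein distances to similarities can be illustrated with a short Python sketch. It assumes α-carbon coordinates supplied as (x, y, z) tuples in ångströms; the function name and input layout are hypothetical, and only M=20 and a distance scale of 5 Å come from the text.

```python
import math

M = 20.0   # maximum similarity for a perfectly superimposed pair (from the text)
D0 = 5.0   # distance scale in angstroms (from the text)

def similarity_matrix(coords_a, coords_b, m=M, d0=D0):
    """Return S[i][j] = m / (1 + (d_ij / d0)**2) for two lists of 3D points.

    coords_a and coords_b are sequences of (x, y, z) tuples, e.g. alpha-carbon
    positions of the two structures in their current relative orientation.
    """
    return [
        [m / (1.0 + (math.dist(p, q) / d0) ** 2) for q in coords_b]
        for p in coords_a
    ]
```

A pair of coincident atoms scores M, a pair separated by the distance scale scores M/2, and the similarity decays smoothly toward zero at larger separations, so badly fitting pairs contribute little to the total.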
One applies dynamic programming to the similarity matrix to get equivalences (using a gap-opening penalty of $M/2=10$ and no gap-extension penalty) and uses them to least-squares fit the first structure onto the second one (44). Then one repeats the procedure, finding all pairwise distances and doing dynamic programming to get new equivalences, until the process converges. After an alignment is determined, it can be “refined” by eliminating the worst-fitting pairs of aligned residues and then refitting to get a new rms, in a similar fashion to the core-finding procedure in Gerstein and Altman (45, 46). This refinement is necessary because the dynamic programming used tries to match as many residues as possible. (It is a global, as opposed to local, method.)

The structural comparison score and the rms. At the end of the procedure, we were left with a number of scores characterizing our final alignment. The score optimized by dynamic programming was the sum of the similarity matrix scores $S_{ij}$ minus the total penalty for opening gaps. We refer to this as “$S_{str}$.” To be more explicit, it was computed from the following formula:
$$S_{str}=M\left(\sum\frac{1}{1+(d_{ij}/d_{0})^{2}}-\frac{N_{gap}}{2}\right),$$

where $N_{gap}$ is the total number of gaps (not including gaps at the end of a chain) and the summation is carried out over all pairs, $ij$, of equivalenced residues. The more traditional score is the rms deviation in α-carbon position after doing a least-squares fit on the aligned atoms (the “rms”). rms-based statistics were used in our earlier work (for example, refs. 3–5) and have been used in almost all other work in structural alignment.

A probability-density function for structural alignment scores. To derive significance-statistics for the structural alignment score $S_{str}$, we proceeded exactly as we did for sequence comparison. Structural alignment of all pairs in the database gave us an observed probability distribution for comparison scores, which was a function of the number of residues matched, $N$, and the comparison score $S_{str}$ (Fig. 4A). This distribution contained the many pairs of structures that were similar, and these pairs stood out with high $S_{str}$ scores. Fig. 4B shows data for pairs that were in different scop structural classes and, therefore, should not have had any structural similarity. Fig. 4B is much “cleaner” than Fig. 4A and shows the underlying distribution expected for the comparison of structures that are not similar. Fig. 2B shows the density distribution as a function of $S_{str}$ for sections at constant $N$. There is a close parallel between the structural alignment score $S_{str}$ and the sequence alignment score, $S_{seq}$, in Fig. 2A, and both can be modeled by an extreme-value distribution. Thus, we fit the calculated structure density by $\rho^{c}_{str}(Z)=\exp(-Z-\exp(-Z))$, where the variable $Z$ is defined in terms of $S_{str}$ and $N$ by using $Z=(S_{str}-\mu_{str})/\sigma_{str}$.
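The structural alignment score defined at the start of this section can be computed directly from the distances between equivalenced residues and the number of internal gaps. The Python sketch below is illustrative only; the function name and argument layout are assumptions, while M=20 and a 5 Å distance scale follow the values quoted for the similarity matrix.

```python
def structural_alignment_score(aligned_distances, n_gaps, m=20.0, d0=5.0):
    """Score = m * (sum over aligned pairs of 1/(1 + (d/d0)**2) - n_gaps/2).

    aligned_distances: distances (angstroms) between equivalenced residues
                       after superposition.
    n_gaps:            number of gaps, excluding gaps at the chain ends.
    """
    pair_term = sum(1.0 / (1.0 + (d / d0) ** 2) for d in aligned_distances)
    return m * (pair_term - n_gaps / 2.0)
```

Each gap costs M/2, which matches the gap-opening penalty used during dynamic programming, and each aligned pair contributes at most M, so well-fitting pairs dominate the total.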
The most likely structure score $\mu_{str}$ and the width parameter $\sigma_{str}$ have a more complicated dependence on the number of matched residues $N$ than was the case for sequences, with

$$\mu_{str}(N)=c\ln(N)^{2}+d\ln(N)+e \ \ (N<120), \qquad \mu_{str}(N)=a\ln(N)+b \ \ (N\geq 120),$$

$$\sigma_{str}(N)=f\ln(N)+g \ \ (N<120), \qquad \sigma_{str}(N)=f\ln(120)+g \ \ (N\geq 120).$$

Continuity of function values and slopes allows $a$ and $b$ to be written in terms of $c$, $d$, and $e$. To be more specific, at $N=120$, $a\ln(N)+b=c\ln(N)^{2}+d\ln(N)+e$ and $a=2c\ln(N)+d$. Thus, the expressions for $\mu_{str}(N)$ and $\sigma_{str}(N)$ involve five independent parameters: $c$, $d$, $e$, $f$, and $g$. We determined these five parameters via least-squares optimization by using the SIMPLEX minimizer in MATLAB, which yielded $c=18.4$, $d=-4.50$, $e=2.64$, $f=21.4$, and $g=-37.5$ ($a=171.8$ and $b=-419.3$ were derived as described above). The residual was 0.288. It was given by the same formula as was used for the residual in the sequence statistics fit, with $O_{i}=\log\rho^{o}_{str}$, $C_{i}=\log\rho^{c}_{str}$, and $w_{i}=1$, and the summation was over bins with any value of $S_{str}$ and $N$ between 30 and 170 residues. The resulting fit of the observed and calculated distribution (Fig. 2B) was good for all values of $N$ and $S_{str}$.
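The continuity conditions can be worked through numerically. The Python sketch below derives the linear-branch coefficients from the quoted values of c, d, and e at N=120 and then evaluates the reduced score; all names are illustrative, and the P value function assumes the same cumulative extreme-value form, 1 − exp(−exp(−Z)), that was used for sequences.

```python
import math

# Fitted parameters quoted in the text for the structural score statistics.
C_STR, D_STR, E_STR, F_STR, G_STR = 18.4, -4.50, 2.64, 21.4, -37.5
N_SWITCH = 120  # number of matched residues at which the branches join

def linear_branch(c, d, e, n_switch):
    """Slope and intercept in ln(N) that match the quadratic's value and slope."""
    log_n = math.log(n_switch)
    slope = 2.0 * c * log_n + d
    intercept = c * log_n ** 2 + d * log_n + e - slope * log_n
    return slope, intercept

A_STR, B_STR = linear_branch(C_STR, D_STR, E_STR, N_SWITCH)

def mu_str(n):
    """Most likely score: quadratic in ln(N) below the switch, linear above."""
    log_n = math.log(n)
    if n < N_SWITCH:
        return C_STR * log_n ** 2 + D_STR * log_n + E_STR
    return A_STR * log_n + B_STR

def sigma_str(n):
    """Width parameter: linear in ln(N), held constant above the switch."""
    return F_STR * math.log(min(n, N_SWITCH)) + G_STR

def p_str(score, n):
    """P value 1 - exp(-exp(-Z)) with Z = (score - mu_str) / sigma_str."""
    z = (score - mu_str(n)) / sigma_str(n)
    return -math.expm1(-math.exp(-z))
```

With the rounded parameters above, the slope comes out near 171.7 and the intercept near –419.1, in line with the magnitudes 171.8 and 419.3 quoted in the text once rounding of c, d, and e is allowed for.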
FIG. 4. The logarithm of the density distribution for structure comparison scores is contoured against $S_{str}$, the structural alignment score (along the horizontal axis), and $N$, the number of aligned residues (along the vertical axis). By following the protocol used for Fig. 1, the raw data obtained by counting the number of pairs with the particular $S_{str}$ and $N$ values are “lifted” and smoothed over a window 90 units wide along the $S_{str}$ axis, and the log value is contoured in intervals of 1 log unit. Given the different scales used for $S_{seq}$ and $S_{str}$, the extent of smoothing is very similar for both. (A) Data from all 884,540 pairs between any one of the 941 sequences and any other sequence. (B) Data from 352,168 pairs, including only those pairs of sequences in different scop classes (described in Fig. 1). Comparison of A and B shows that the true-positive structural matches are seen in the contours at the higher values of the alignment score $S_{str}$ and also at higher values of the number of matches $N$. The density in B is free of these significant matches and shows the underlying density distribution expected for comparison of unrelated structures.
A cumulative structure distribution function, giving the P value. To estimate the statistical significance of a particular structure comparison in terms of its $S_{str}$ and $N$ values, we proceeded as we did for sequence comparison. We integrated the score distribution to determine a cumulative distribution function $P_{str}$, defined as the probability that matching two random structures will give a $z$ value greater than or equal to $Z$. The structure score distribution has the same extreme-value form as the sequence score distribution, so the derivation of $P_{str}$ follows that of $P_{seq}$, with $P_{str}(z>Z)=1-\exp[-\exp(-Z)]$, where $Z$ is expressed in terms of $S_{str}$ and $N$ by using

$$Z=\frac{S_{str}-(c\ln(N)^{2}+d\ln(N)+e)}{f\ln(N)+g}, \quad N<120,$$

$$Z=\frac{S_{str}-(a\ln(N)+b)}{f\ln(120)+g}, \quad N\geq 120,$$

and the seven parameters $a$, $b$, $c$, $d$, $e$, $f$, and $g$ are given above.

Structural comparison statistics based on rms. The traditional characterization of a structural alignment is in terms of the number of residues matched, $N$, and the rms deviation from fitting these matched residues, $R$. It is convenient to focus on $\ln(R)$, which ensures that there is good separation of values for small $R$, where the significant pairs occur. We calculated a probability distribution $\rho^{o}_{rms}[\ln(R),N]$ for the observed rms values of true-negative pairs in the same fashion as we did earlier for the observed distribution of structural alignment scores, $\rho^{o}_{str}(S_{str},N)$. The fact that $\log\rho^{o}_{rms}$ varies very slowly with $\ln(R)$ near the maximum (Fig. 5) led us to fit the calculated density by using $\rho^{c}_{rms}(Z)=\exp(-Z^{4})$, where $Z$ is defined in terms of $\ln(R)$ and $N$ as $Z=(\ln(R)-\mu_{rms}(N))/\sigma_{rms}(N)$, with

$$\mu_{rms}(N)=c\ln(N)^{2}+d\ln(N)+e \ \ (N<60), \qquad \mu_{rms}(N)=a\ln(N)+b \ \ (N\geq 60),$$

$$\sigma_{rms}(N)=f\ln(N)+g \ \ (N<60), \qquad \sigma_{rms}(N)=f\ln(60)+g \ \ (N\geq 60).$$

The values of the five independent parameters $c$, $d$, $e$, $f$, and $g$ were determined by least-squares optimization by using the SIMPLEX minimizer in MATLAB, which yielded $c=0.155$, $d=-0.619$, $e=1.73$, $f=0.0922$, and $g=0.212$. ($a=0.650$ and $b=-0.872$ were determined as before to ensure continuity.)
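A density of the form exp(−Z⁴) has no closed-form cumulative integral, so a lookup table is one practical way to turn a reduced rms score into a probability. The Python sketch below builds such a table by trapezoidal integration on 10,000 points between –5 and 5 and interpolates linearly; the function names, the trapezoid rule, and the optional normalization by the total area are assumptions made for illustration, not details given in the paper.

```python
import bisect
import math

def tabulate_cumulative(z_min=-5.0, z_max=5.0, n_points=10000):
    """Trapezoid-rule running integral of exp(-z**4) on an even grid."""
    step = (z_max - z_min) / (n_points - 1)
    grid = [z_min + i * step for i in range(n_points)]
    density = [math.exp(-z ** 4) for z in grid]
    cumulative = [0.0]
    for i in range(1, n_points):
        cumulative.append(cumulative[-1] + 0.5 * (density[i - 1] + density[i]) * step)
    return grid, cumulative

GRID, CUMULATIVE = tabulate_cumulative()

def p_rms(z, normalize=True):
    """Integral of exp(-t**4) from the lower table edge up to z.

    With normalize=True the value is divided by the total tabulated area so
    that it runs from 0 to 1 (an assumption of this sketch).
    """
    if z <= GRID[0]:
        value = 0.0
    elif z >= GRID[-1]:
        value = CUMULATIVE[-1]
    else:
        i = bisect.bisect_right(GRID, z)  # GRID[i-1] <= z < GRID[i]
        t = (z - GRID[i - 1]) / (GRID[i] - GRID[i - 1])
        value = CUMULATIVE[i - 1] + t * (CUMULATIVE[i] - CUMULATIVE[i - 1])
    return value / CUMULATIVE[-1] if normalize else value
```

Because a small rms indicates a good match, it is the lower tail that matters here: the more negative the reduced score, the smaller the returned probability.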
FIG. 5. The fit to the structure pair density by using the rms score. The logarithms of the observed and calculated structure pair density distributions are plotted against the rms score ln(R) for different numbers of aligned residues, N. The observed structure pair density, which is derived from pairs in different classes, is clearly not an extreme-value distribution because it is symmetrical about the maximum value and falls off faster than a linear function with increasing Z. In fact, it is best fit by $\exp(-Z^4)$. The calculated distribution obtained with a five-parameter fit (dashed line) is a good fit when the number of aligned residues exceeds 50.

To estimate the statistical significance of a particular comparison in terms of its R and N values, we derived a cumulative distribution function Prms(z≤Z), defined as the probability that any z will be less than or equal to a given Z. This is just the integral of $\rho_c^{rms}(z)$ from $z=-\infty$ to $z=Z$. Because the function $\exp(-z^4)$ cannot be integrated analytically, we integrated it numerically for z from –5 to Z and tabulated its value for 10,000 different Z values from –5 to 5.

Comparing structure comparison statistics: Alignment score Sstr vs. rms. Once we had derived structure comparison statistics based on the structural alignment score Sstr and on rms, we could compare them. The same coverage-vs.-error scheme used above to compare the two formulae for sequence alignment significance could be used again here. When assessed in terms of coverage (number of true-positives found) at a given error rate on our test data, the E value statistics based on Sstr performed much better (i.e., had a larger coverage) than those based on rms. To be more specific, we compared the two approaches (Estr vs. Erms) in exactly the same way that we previously had compared our sequence E value to that produced by FASTA (Eseq vs. Efa). We found that, at the 1% error threshold, the rms-based statistics have log(Erms)=–32.8 and a coverage of 202, whereas the structural alignment score statistics have log(Estr)=–1.58 and a coverage of 627. Clearly, the statistics based on Sstr perform much better because the threshold is much more reliable (i.e., closer to the value of –2 for an error rate of 1%) and the true-positive coverage is >3-fold higher. The difference between Estr and Erms is striking and confirms that the structure score is much better than the rms score.

There are other reasons why the structural alignment score Sstr is a more reliable indicator than rms: (i) Sstr depends most strongly on the best-fitting atoms, whereas rms depends most on the worst-fitting atoms; (ii) Sstr penalizes gaps, whereas rms does not; and (iii) Sstr is formally analogous to the score one gets from a standard sequence comparison, Sseq, because both quantities are derived from a “dynamic-programming” similarity matrix.
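The tabulation of the cumulative rms distribution described above can be sketched in a few lines of Python. This is a hedged illustration, not the original procedure: the trapezoid rule, the normalization of the table to a total of 1, and the linear interpolation used for look-up are all assumptions, as are the function names.

```python
import math
from bisect import bisect_left


def build_cdf_table(z_min=-5.0, z_max=5.0, n_points=10_000):
    """Tabulate the running integral of exp(-z**4) on a uniform grid.

    Returns the grid, the integral normalised to end at 1 (an assumption
    made here so the table can be read as a probability), and the raw total.
    """
    step = (z_max - z_min) / (n_points - 1)
    zs = [z_min + i * step for i in range(n_points)]
    dens = [math.exp(-z ** 4) for z in zs]
    cum = [0.0]
    for i in range(1, n_points):
        cum.append(cum[-1] + 0.5 * (dens[i - 1] + dens[i]) * step)  # trapezoid rule
    total = cum[-1]
    return zs, [c / total for c in cum], total


def p_rms(z, zs, cdf):
    """Look up P(z' <= z) by linear interpolation in the table."""
    if z <= zs[0]:
        return 0.0
    if z >= zs[-1]:
        return 1.0
    i = bisect_left(zs, z)  # zs[i - 1] < z <= zs[i]
    t = (z - zs[i - 1]) / (zs[i] - zs[i - 1])
    return cdf[i - 1] + t * (cdf[i] - cdf[i - 1])


ZS, CDF, TOTAL = build_cdf_table()
```

Because exp(–z⁴) is negligible beyond |z| = 5, truncating the integral at –5 and 5 loses essentially nothing; the raw total can be checked against the closed-form value 2Γ(5/4) for the integral over the whole real line.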
As dynamic programming finds a maximum score over many possible alignments, it is reasonable that both Sstr and Sseq should follow an extreme-value distribution. However, this is not a trivial result, as the scores are not independent random variables whose maximum must follow such a distribution.

Relationship Between Sequence Comparison and Structure Comparison. Having derived sequence and structure significance scores by using all-vs.-all comparisons on the same database of 941 sequences and structures, we were in a position to compare structure and sequence significance scores directly. Fig. 6 shows such a comparison for the 2,107 pairs of proteins in our data set that are considered to be related evolutionarily according to scop (i.e., they are the true-positives in the same superfamily). The lines at log(Eseq)=–2 and at log(Estr)=–2 divide the 2,107 true-positive pairs among four quadrants, depending on whether their sequence or structure matches are significant, as follows:

Top right (1,204 pairs; nonsignificant sequence match, nonsignificant structure match). Over half (1,204 of 2,107) of the pairs of domains thought to be evolutionarily related by scop fall into this category of having no significant match, indicating that the combination of manual measures used in scop is more sensitive than either automatic sequence or structure comparison.
FIG. 6. Comparison of structure significance with sequence significance. Plots of the structure significance, log(Estr), against the sequence significance, log(Eseq), for the 2,107 pairs of proteins judged to be homologous in the scop database (in the same superfamily). Pairs are distinguished by the extent of their structural match, with solid squares used for pairs with N≥70 and unfilled diamonds used for N<70. The horizontal and vertical dashed lines, which divide the figure into four quadrants, are at log(Estr)=–2 and at log(Eseq)=–2, respectively. Both of these thresholds correspond to an E value of $10^{-2}$ and a P value of $10^{-2}/941 \approx 10^{-5}$, so that we judge matches with lower values to be significant at the 1% level.

Lower left (244 pairs; significant sequence match, significant structure match). These pairs are evenly distributed in the lower left quadrant, indicating that the sequence and structure significance scores are on the same scale.

Lower right (576 pairs; nonsignificant sequence match, significant structure match). There are many more pairs with good structure matches but without sequence matches than the converse (sequence match but no structure match). This shows objectively that structure is conserved more than sequence in evolution. These 576 pairs are very good test cases for threading algorithms that match a sequence to a structure, and we currently are testing them in this way.

Top left (83 pairs; significant sequence match, nonsignificant structure match). Almost all of the pairs (70 of 83) in this category involve matches with a small number of residues (N<70). For such short matches, the structures may be deformed and may not match well. There are seven labeled pairs that are exceptions because the match is extensive (N>70), but the pairs structurally are less similar than would be expected from the strong sequence match. These seven exceptions involve 11 coordinate sets.
Three of these sets were solved by x-ray crystallography to only medium resolution (>2.9 Å; 1mys, 1scm, and 1tlk), five were solved by NMR (1prr, 1ntr, 2pld, 2pna, and 1tnm), and three are high-resolution x-ray structures (better than 1.7 Å for 1osa, 3chy, and 1sha). None of the seven exceptional pairs involved two high-resolution structures, and it seems likely that some of the seven exceptions would have had a more significant structural match if both structures in the pair had been determined to a high resolution. Furthermore, as determined from consultation of a Database of Macromolecular Movements (ref. 47; see database at http://bioinfo.mbb.yale.edu/MolMovDB), some of the seven exceptions involved proteins that had been solved in different conformational states. In particular, 1osa, 1mys, and 1scm involved proteins with the highly flexible calmodulin fold. These are clearly examples for which one would expect sequence similarity but structural differences.
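The four-quadrant division of Fig. 6 and the bookkeeping behind the quoted pair counts can be illustrated with a short Python sketch. The function names are hypothetical, base-10 logarithms are assumed (as implied by the E value of 10⁻² at log E = –2), and treating a value exactly at the threshold as nonsignificant is an assumption made here; the counts are those quoted in the text.

```python
DB_SIZE = 941      # sequences and structures in the all-vs.-all comparison
LOG_E_CUT = -2.0   # log10(E) threshold corresponding to the 1% level


def quadrant(log_e_seq, log_e_str, cut=LOG_E_CUT):
    """Name the Fig. 6 quadrant of a pair (sequence on x, structure on y)."""
    vertical = "lower" if log_e_str < cut else "top"     # significant structure is lower
    horizontal = "left" if log_e_seq < cut else "right"  # significant sequence is left
    return f"{vertical} {horizontal}"


def p_from_e(e_value, db_size=DB_SIZE):
    """Per-pair P value corresponding to an E value for a database of db_size."""
    return e_value / db_size


# Pair counts quoted in the text for the 2,107 true-positive pairs.
COUNTS = {"top right": 1204, "lower left": 244, "lower right": 576, "top left": 83}

total_pairs = sum(COUNTS.values())
found_by_structure = COUNTS["lower left"] + COUNTS["lower right"]
found_by_sequence = COUNTS["lower left"] + COUNTS["top left"]
missed_fraction = COUNTS["top right"] / total_pairs
```

Summing the four quadrants recovers the 2,107 pairs, and `missed_fraction` is the share of scop superfamily pairs found by neither method at this threshold.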
DISCUSSION AND CONCLUSION Summary. We have presented an approach for assessing in a unified statistical framework the significance of a given comparison of proteins, whether involving sequences or structures. For either sequence or structure we fit an extreme-value distribution to the observed distribution obtained from the all-vs.-all comparison of the database (i.e., between pairs of scop domains in different structural classes). For sequence comparison, this extreme-value distribution is as expected: We empirically observed for gapped alignments what Karlin and Altschul (11) derived for ungapped ones. We also gave a simple formula for the E value that is likely to be useful for pairwise comparisons without involving searches of the entire database. For structure comparison, we found that the score distribution follows an extreme-value distribution when expressed in terms of the structural alignment score Sstr. By using this measure, expressions for statistical significance can be formulated in an almost identical way for structure as they are for sequence. It is important to realize that, although the Sstr is produced naturally by our specific alignment method, it can be calculated from any arbitrary structural alignment. Thus, by using our formulas, a significance can be computed from the results of any structural alignment program. Using the more traditional rms deviation as a score does not lead to as reliable a measure of structural significance. In connection with this, it is interesting that recent work (39, 43) indicates that the significance statistics based on optimized “sum” scores from dynamic programming (i.e., Smith-Waterman scores, which are essentially sums of BLOSUM matrix
values minus gap penalties) perform much better than those based on the traditional measure of sequence similarity, percentage identity, which parallels the poor performance of our structural alignment statistics based on the traditional rms. It is disconcerting that well-established and intuitive measures such as percentage identity or rms perform so much worse than the statistical measures based on the sequence or structure alignment scores. Furthermore, it is surprising that over half of the relationships between distant homologues in scop were not statistically significant (at a rate of 1% error per query) using either pure sequence comparison or pure structure comparison. Almost all of the pairs found by sequence comparison were found by structure comparison, but there were many pairs found by structure comparison that were not found by sequence comparison. Overall, structural comparison was able to detect about twice as many of the scop distant homology superfamily pairs as sequence comparison (at the same rate of error).

Future Directions. The approach we have used to derive statistical significance easily could be generalized to other contexts. In particular, it can be adapted to provide significance statistics for threading. We have not presented a detailed examination of the significance values for specific pairs of sequences or structures. Such an examination could prove to be a useful endeavor in the future, particularly if it focused on pairs of proteins with the same fold but insignificant E values and those with different folds but significant E values. These two classes of pairs characterize the twilight zone for structure, which has yet to be described fully.

We thank S. E. Brenner for carefully reading the manuscript and S. E. Brenner and T. Hubbard for providing the pdb40d-1.32 database. M.G. acknowledges the National Science Foundation for support (Grant DBI-9723182), and M.L. acknowledges the Department of Energy (Grant DE-FG03–95ER62135).

1. Rohlf, F. & Slice, D. (1990) Syst. Zool. 39, 40–59.
2. Bookstein, F.L. (1991) Morphometric Tools for Landmark Data (Cambridge Univ. Press, Cambridge, U.K.).
3. Gerstein, M. & Levitt, M. (1998) Protein Sci. 7, 445–456.
4. Subbiah, S., Laurents, D.V. & Levitt, M. (1993) Curr. Biol. 3, 141–148.
5. Laurents, D.V., Subbiah, S. & Levitt, M. (1994) Protein Sci. 3, 1938–1944.
6. Needleman, S.B. & Wunsch, C.D. (1971) J. Mol. Biol. 48, 443–453.
7. Smith, T.F. & Waterman, M.S. (1981) J. Mol. Biol. 147, 195–197.
8. Doolittle, R.F. (1987) Of URFs and ORFs (Univ. Sci. Books, Mill Valley, CA).
9. Gribskov, M. & Devereux, J. (1992) Sequence Analysis Primer (Oxford Univ. Press, New York).
10. Karlin, S. & Altschul, S.F. (1990) Proc. Natl. Acad. Sci. USA 87, 2264–2268.
11. Karlin, S. & Altschul, S.F. (1993) Proc. Natl. Acad. Sci. USA 90, 5873–5877.
12. Lipman, D.J. & Pearson, W.R. (1985) Science 227, 1435–1441.
13. Pearson, W.R. (1996) Methods Enzymol. 266, 227–259.
14. Karlin, S., Bucher, P., Brendel, V. & Altschul, S.F. (1991) Annu. Rev. Biophys. Biophys. Chem. 20, 175–203.
15. Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. (1994) Nat. Genet. 6, 119–129.
16. Bryant, S.H. & Altschul, S.F. (1995) Curr. Opin. Struct. Biol. 5, 236–244.
17. Altschul, S.F. & Gish, W. (1996) Methods Enzymol. 266, 460–480.
18. Pearson, W.R. (1997) Comput. Appl. Biosci. 13, 325–332.
19. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucleic Acids Res. 25, 3389–3402.
20. Remington, S.J. & Matthews, B.W. (1980) J. Mol. Biol. 140, 77–99.
21. Satow, Y., Cohen, G.H., Padlan, E.A. & Davies, D.R. (1987) J. Mol. Biol. 190, 593–604.
22. Taylor, W.R. & Orengo, C.A. (1989) J. Mol. Biol. 208, 1–22.
23. Artymiuk, P.J., Mitchell, E.M., Rice, D.W. & Willett, P. (1989) J. Inform. Sci. 15, 287–298.
24. Sali, A. & Blundell, T.L. (1990) J. Mol. Biol. 212, 403–428.
25. Vriend, G. & Sander, C. (1991) Proteins 11, 52–58.
26. Russell, R.B. & Barton, G.B. (1992) Proteins 14, 309–323.
27. Holm, L. & Sander, C. (1993) J. Mol. Biol. 233, 123–128.
28. Gibrat, J.F., Madej, T. & Bryant, S.H. (1996) Curr. Opin. Struct. Biol. 6, 377–385.
29. Falicov, A. & Cohen, F.E. (1996) J. Mol. Biol. 258, 871–892.
30. Gerstein, M. & Levitt, M. (1996) in Proc. Fourth Int. Conf. on Intell. Sys. Mol. Biol. (American Association for Artificial Intelligence Press, Menlo Park, CA), pp. 59–67.
31. Cohen, G.H. (1998) J. Appl. Crystallography (in press).
32. Holm, L. & Sander, C. (1996) Science 273, 595–602.
33. Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.E., Jr., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977) J. Mol. Biol. 112, 535–542.
34. Abola, S.J., Prilusky, J. & Manning, N.O. (1997) Methods Enzymol. 277, 556–571.
35. Murzin, A., Brenner, S.E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536–540.
36. Brenner, S., Chothia, C., Hubbard, T.J.P. & Murzin, A.G. (1996) Methods Enzymol. 266, 635–642.
37. Hubbard, T.J.P., Murzin, A.G., Brenner, S.E. & Chothia, C. (1997) Nucleic Acids Res. 25, 236–239.
38. Brenner, S., Hubbard, T., Murzin, A. & Chothia, C. (1995) Nature (London) 378, 140.
39. Brenner, S., Chothia, C. & Hubbard, T. (1998) Proc. Natl. Acad. Sci. USA (in press).
40. Pearson, W.R. & Lipman, D.J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444–2448.
41. Henikoff, S. & Henikoff, J.G. (1993) Proc. Natl. Acad. Sci. USA 19, 6565–6572.
42. Levitt, M. & Chothia, C. (1976) Nature (London) 261, 552–558.
43. Brenner, S.E. (1996) Ph.D. thesis (Cambridge Univ., Cambridge, U.K.).
44. Kabsch, W. (1976) Acta Cryst. A 32, 922–923.
45. Gerstein, M. & Altman, R. (1995) Comput. Appl. Biosci. 11, 633–644.
46. Gerstein, M. & Altman, R. (1995) J. Mol. Biol. 251, 161–175.
47. Gerstein, M., Lesk, A.M. & Chothia, C. (1994) Biochemistry 33, 6739–6749.
FOLDING FUNNELS AND FRUSTRATION IN OFF-LATTICE MINIMALIST PROTEIN LANDSCAPES
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5921–5928, May 1998. Colloquium Paper. This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Folding funnels and frustration in off-lattice minimalist protein landscapes
HUGH NYMEYER,* ANGEL E. GARCÍA,† AND JOSÉ NELSON ONUCHIC*‡

*Department of Physics, University of California at San Diego, La Jolla, California 92093–0319; and †Theoretical Biology and Biophysics Group, T10 MS K710, Los Alamos National Laboratory, Los Alamos, New Mexico 87545

ABSTRACT A full quantitative understanding of the protein folding problem is now becoming possible with the help of the energy landscape theory and the protein folding funnel concept. Good folding sequences have a landscape that resembles a rough funnel where the energy bias towards the native state is larger than its ruggedness. Such a landscape leads not only to fast folding and stable native conformations but, more importantly, to sequences that are robust to variations in the protein environment and to sequence mutations. In this paper, an off-lattice model of sequences that fold into a β-barrel native structure is used to describe a framework that can quantitatively distinguish good and bad folders. The two sequences analyzed have the same native structure, but one of them is minimally frustrated whereas the other one exhibits a high degree of frustration.

The ability of proteins to spontaneously fold into unique three-dimensional structures has amazed scientists for the last few decades. Since the beginning of molecular biology, it has been recognized that proteins are responsible for controlling most functions in living organisms, and that their functionality strongly depends on their shape. How are these biological molecules able to fold? This question has not yet been completely answered, but a lot has been learned in recent years. Energy landscape theory and the funnel concept provide the theoretical framework towards a quantitative understanding of the folding question (1, 2).
This alternative view of the folding mechanism replaced the earlier idea that there must exist a single pathway for the folding event with clearly defined chemical intermediates (3, 4). After early seminal contributions by Go (5), Bryngelson and Wolynes realized in the late 1980s (6, 7) that a full understanding of the folding process would have to involve a global overview of the protein energy landscape. Inspired by this view, Leopold and collaborators (8) introduced the concept of a funnel landscape to describe good folding sequences, a landscape that resembles a partially rough funnel riddled with traps where the protein can transiently reside. In such a funnel there is not a unique folding pathway but a multiplicity of folding routes, all converging towards the native state. Late in the folding process, the protein may be trapped in single pathways but, at this stage, most of the protein has already found its correct folding configuration and the search becomes limited. Several other groups have also participated in the development of this new view, which has flourished in the 1990s. Even though the following list is clearly incomplete, in addition to the previous references, the reviews in refs. 9–20 provide a detailed description of the landscape perspective.

The description that follows provides a qualitative understanding of a funnel landscape. Unlike protein-like heteropolymers, random heteropolymers with a tendency to collapse do not have a well defined three-dimensional conformation, but a collection of completely different low energy structures. How can we differentiate between these two kinds of sequences? Imagine that we want to discover a sequence that favors a particular structure, called the native structure. A major task at this point is to choose a good reaction coordinate (or order parameter) that measures the similarity between this native structure and any other conformation that may be adopted by this heteropolymer.
For lattice minimalist models, a successful coordinate has been Q, the fraction of native tertiary contacts (9, 21–25). For real proteins many other choices are possible and, in most cases, several of them may be necessary, such as fraction of native secondary structures and fraction of native helix caps (1, 26). For our pictorial description we consider only a single Q, varying between 0 and 1 (native structure). As shown in Fig. 1, an ideally designed folding sequence has the energy of its conformations proportional to Q plus some roughness introduced by the nonnative contacts. This correlation between energy and structure not only introduces a bias that favors the native configuration but it also proportionally biases all nonnative conformations, depending on their degree of similarity to the folded state. This correlation is responsible for the funnel shape of the landscape. It is important to notice that even conformations that are completely different but have similar Q (native parts are different) have similar energies. A random sequence would display no such correlation between energy and structure, leading to the rough landscape shown in Fig. 1. For a protein-like heteropolymer to have the energy proportional to the global order parameter Q, its stabilizing contacts should be equally distributed throughout the entire structure. All native interactions should favor folding, and they should be equally important—i.e., the system exhibits no “frustrated” interactions. This is the ideal situation and, although real proteins may not be so perfect, they clearly need to minimize frustration, an idea proposed by Bryngelson and Wolynes (6). Because proteins are finite systems, if they have a single ground state, there is always a temperature below which this lowest-energy state is stable. This temperature is called the folding temperature, Tf. 
On the other hand, because the landscape is rugged, there is also a temperature below which the kinetics is controlled by long-lived low-energy traps
‡To whom reprint requests should be addressed. e-mail: jonuchic@ucsd.edu. © 1998 by The National Academy of Sciences 0027–8424/98/955921–8$2.00/0. PNAS is available online at http://www.pnas.org. Abbreviations: MD, molecular dynamics; MFPT, mean first passage time; MODC, molecule optimal dynamic coordinates.
and not by the bias toward the native conformation. This temperature is called the glass temperature, Tg. Minimally frustrated sequences require sufficient bias to have the folding temperature larger than the glass temperature. Therefore this competition between energetic bias toward native conformation and roughness is fundamental in determining the folding mechanism, and it leads to a diversity of folding scenarios that are discussed elsewhere (2). All these ideas are further explored later in this paper.
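The order parameter Q used in the discussion above, the fraction of native tertiary contacts, is simple to compute once a contact definition is fixed. The sketch below assumes a cubic-lattice chain in which two beads are in contact when they sit on adjacent sites and are not bonded neighbors along the chain; the six-bead conformations and the function names are hypothetical and serve only to illustrate the definition.

```python
from itertools import combinations


def lattice_contacts(coords):
    """Pairs (i, j) of non-bonded beads that occupy adjacent lattice sites."""
    pairs = set()
    for i, j in combinations(range(len(coords)), 2):
        adjacent = sum(abs(a - b) for a, b in zip(coords[i], coords[j])) == 1
        if j - i > 1 and adjacent:
            pairs.add((i, j))
    return pairs


def fraction_native(coords, native_coords):
    """Q: the fraction of native contacts present in a conformation."""
    native = lattice_contacts(native_coords)
    return len(lattice_contacts(coords) & native) / len(native)


# Hypothetical six-bead chain: a compact "native" state, a partly
# unfolded state, and a fully extended state.
NATIVE = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (1, 1, 0), (0, 1, 0)]
PARTIAL = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (1, 1, 0), (1, 2, 0)]
EXTENDED = [(i, 0, 0) for i in range(6)]

q_native = fraction_native(NATIVE, NATIVE)
q_partial = fraction_native(PARTIAL, NATIVE)
q_extended = fraction_native(EXTENDED, NATIVE)
```

In an ideal funnel the energy of each conformation would decrease in proportion to its Q, so the three conformations above would be ordered in energy in the same way as they are ordered in Q.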
FIG. 1. (a) Energy landscape for a random heteropolymer. Notice that the presence of low energy states that are completely dissimilar is a direct consequence of the small energy bias toward the native state relative to the roughness of the landscape. (b) Funnel-like energy landscape for a minimally frustrated heteropolymer. A clearly favored native structure can be observed at the bottom of this funnel. Because of this dominant bias, all the other low energy states are similar to the native one.

Sequences with a good folding funnel not only are fast folders at temperatures around the folding temperature but, most importantly, they are robust folders. Robustness is an essential property in biology. Minor variations in the folding environment such as small changes in pH, temperature, denaturant concentration, or, even more interesting, variations because of mutations may affect the native configuration in favor of other low-energy structures. If these other low-energy structures are similar to the folded one, the consequences are minor: the “new” native conformation is very similar to the “old” one. The observed linear dependence between logarithms of the folding/unfolding rates and the folding free energy is a direct indication that this is the case for proteins (27–31). Frustrated sequences, on the other hand, not only are slow folders but also may have the structure of their native state drastically changed under minor variations of the conditions described above.

This diversity of scenarios suggested by the landscape theory and the funnel concept can be observed by simulations of protein folding in computer models. Such simulations can be carried out at many different levels. Ideally they should be at the atomistic level but, because of computational limitations, this approach has been limited to insights into local aspects of folding (32, 33) and to characterizing ensembles of states for unfolded proteins (34–37).
Thus minimalist models have been of major importance in our understanding of protein folding. Lattice models have been the center of these studies. They include the simple ones exploited in the early 1980s (5, 38, 39) and, more recently, those used in studies by several other groups (8, 12, 15, 16, 20, 40–46). These models have greatly improved our present understanding of protein folding. Off-lattice models have also been studied (47–54), but little has been done in this landscape context, making this point the focus of this paper.

In addition to simulations, new experiments have been devised to probe early folding events and to explore the landscape of small fast-folding proteins (NMR dynamic spectroscopy, protein engineering, laser-initiated folding, and ultrafast mixing; see, for example, refs. 13, 14, 28, 55–67, and 85). Fast-folding proteins fold on millisecond timescales and have a single domain; i.e., they have a single, well defined funnel (68). The combination of landscape theory, simulations, and this new family of experiments is providing the basis for a quantitative understanding of the protein folding mechanism.

In this paper we show results for an off-lattice minimalist model where we explore the behavior of two folding sequences with the same native structure, but with one containing a higher degree of frustration. A quantitative landscape framework for quantifying differences between good and bad folding sequences emerges from this comparison. Because most of the existing landscape analysis has been performed for lattice simulations, we present in the next section a summary of some selected lattice results to help with our discussion of the off-lattice simulations.
A Summary of Lattice Minimalist Models

Minimalist models of protein folding must contain all the features necessary to understand the folding mechanism. In its simplest version a heteropolymer must contain at least two kinds of monomers whose interactions obey some simplified interaction rule; i.e., heteropolymers may be thought of as a necklace of beads of two or more kinds. The question to be answered is what sequences of beads are able to fold into a unique three-dimensional structure. In an effort to mimic the hydrophobic effect, Dill and collaborators (12) proposed the first set of interactions, called the HP model, where the interactions between H (hydrophobic) groups are attractive and all the other ones are zero. Another popular model, which is used for our simulations of 27-mers in a cubic lattice, is the one where the interactions between nearest neighbor beads of the same color are more favorable (strong attractive interaction) than the ones between beads of different colors (weak interaction). Sequences built with two kinds of beads are called two-letter codes, those built with three kinds of beads are called three-letter codes, and so on.

The low-energy states of heteropolymers composed of random sequences of two or more kinds of beads are collapsed states that try to maximize the number of contacts between beads of the same color. The polymeric nature of the chain, however, prohibits all favorable interactions from being satisfied simultaneously, and some contacts occur between beads of different color. These are clearly frustrated interactions, because the polymer would rather have the maximum number of favorable interactions. Thus different low-energy states may have different structures with a different set of frustrated contacts.
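The two-letter interaction rule described above can be made concrete with a small Python sketch. The contact energies below (–3 for beads of the same color, –1 for beads of different colors) are hypothetical values chosen only to respect the stated ordering of strong and weak attraction; the six-bead conformation and the function names are likewise illustrative assumptions, not the 27-mer studied in the text.

```python
from itertools import combinations

E_SAME = -3.0  # hypothetical strong attraction between beads of the same color
E_DIFF = -1.0  # hypothetical weak attraction between beads of different colors


def lattice_energy(coords, sequence, e_same=E_SAME, e_diff=E_DIFF):
    """Sum contact energies over non-bonded beads on adjacent lattice sites."""
    if len(coords) != len(sequence):
        raise ValueError("one coordinate is needed per bead")
    if len(set(coords)) != len(coords):
        raise ValueError("chain overlaps itself")
    energy = 0.0
    for i, j in combinations(range(len(coords)), 2):
        adjacent = sum(abs(a - b) for a, b in zip(coords[i], coords[j])) == 1
        if j - i > 1 and adjacent:
            energy += e_same if sequence[i] == sequence[j] else e_diff
    return energy


# One compact conformation, two sequences: every contact in the first
# sequence joins like beads, every contact in the second joins unlike beads.
COMPACT = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (1, 1, 0), (0, 1, 0)]
e_matched = lattice_energy(COMPACT, "AABBAA")
e_frustrated = lattice_energy(COMPACT, "ABABAB")
```

The same structure is much lower in energy for the first sequence than for the second, which is the sense in which contacts between beads of different color are frustrated.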
The 27-mer in a cubic lattice is a convenient system to simulate because, even though it is not possible to enumerate all its conformations, we can enumerate all 103,000 of its maximally collapsed configurations, which are 3×3×3 cubes. The details of these studies can be found elsewhere (see, for example, refs. 21, 40, and 69). Investigation of several two-letter or three-letter sequences has taught us that most of the sequences are bad folders, and the good folding ones maximize the number of favorable (strong) native contacts and minimize the number of strong nonnative contacts in unfolded conformations. This strategy maximizes the energy bias toward the native state and at the same time reduces the ruggedness of the landscape, which is mostly determined by the nonnative contacts. As expected, by increasing the number of different kinds of beads, it becomes easier to obtain minimally frustrated sequences.

How can we quantify good folders? The simplest measure, proposed by Bryngelson and Wolynes, is to determine the folding temperature (Tf) and the glass temperature (Tg) of a sequence. The folding temperature can be easily determined, and it has been chosen as the temperature where the native state is occupied 50% of the time. For good folding sequences, the protein-like heteropolymer really behaves as a two-state system; that is, depending on the temperature, the protein is mostly folded or unfolded, and it is rarely found in some intermediate conformation. In this case, folding is a cooperative, sharp, first-order-like transition and, therefore, any quantity that is able to distinguish between these two states can be used as a probe of the folding transition. This is not the case for bad folders, where this transition is broad and noncooperative. The discussion in the later section on signatures of folders for our off-lattice models makes this distinction clear.

How is the glass temperature identified? The situation is more problematic, but it can be clearly defined.
Because long-lived traps are the source of the glass transition, Socci and Onuchic provided an operational definition for the glass transition (69). If trapping were not a problem, lowering the temperature would speed up folding because it favors collapse. As the temperature is lowered, however, there is a point at which folding slows down substantially. This temperature has been called the kinetic glass transition and is similar to the “thermodynamic” glass transition proposed by Bryngelson and Wolynes (2, 70). A more sophisticated analysis has been developed recently. It has been shown that, for a good folding sequence around Tf, the kinetics of the folding event can be described as a stochastic motion of a few reaction coordinates (or order parameters) on an effective potential defined by the free energy as a function of these order parameters (7, 22, 25, 71). In the simplest possible representation, this motion can be assumed to be diffusive, with a configurational diffusion coefficient that incorporates, in an average sense, transient occupation of short-lived traps.§ In this regime the folding kinetics is exponential, and the folding time can be estimated by using diffusive reaction rate theory (22, 72). As the temperature approaches the glass temperature, this description breaks down completely. The protein is now caught in long-lived traps, and the folding kinetics is controlled by the escape time from these traps. Because there is a full ensemble of these times, the kinetics of the folding event becomes nonexponential. This behavior is illustrated in Fig. 2 for a minimally frustrated three-letter code 27-mer. Clearly, a lot has been learned about the folding mechanism by investigating these lattice models. The question is how we can use these ideas to understand the folding of real proteins beyond a qualitative level.
Because lattice models include only tertiary contacts, a quantitative correspondence between these models and real proteins needs to consider additional order parameters, particularly secondary structure formation. A step toward this goal has been taken by Onuchic and collaborators (21). Using an analytical theory of the helix-coil transition in collapsed heteropolymers to renormalize the secondary structure, they have proposed a law of corresponding states that relates small fast-folding proteins (around 50–60 amino acids) to lattice simulations of a minimally frustrated three-letter code 27-mer.
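The folding temperature defined above, the temperature at which the native state is occupied 50% of the time, can be estimated from a denaturation curve. The following is a minimal sketch, assuming the native-state occupancy has already been measured at a set of increasing temperatures; the function name, the linear interpolation, and the sample numbers are illustrative assumptions and are not taken from the simulations discussed here.

```python
from typing import Sequence


def folding_temperature(temps: Sequence[float], p_native: Sequence[float]) -> float:
    """Temperature at which the native-state occupancy crosses 0.5.

    temps must be sorted in increasing order and p_native[k] is the fraction
    of time spent in the native state at temps[k]. The crossing is located
    by linear interpolation between the two bracketing temperatures
    (an assumption made for simplicity).
    """
    if len(temps) != len(p_native) or len(temps) < 2:
        raise ValueError("need two equally long sequences with at least two points")
    for k in range(len(temps) - 1):
        lo = p_native[k] - 0.5
        hi = p_native[k + 1] - 0.5
        if lo == 0.0:
            return temps[k]
        if lo * hi < 0.0:
            frac = lo / (lo - hi)
            return temps[k] + frac * (temps[k + 1] - temps[k])
    if p_native[-1] == 0.5:
        return temps[-1]
    raise ValueError("occupancy never crosses 0.5 in the sampled range")


# Hypothetical occupancies, for illustration only.
example_tf = folding_temperature([1.0, 2.0], [0.75, 0.25])
```

For a good folder the occupancy curve is steep, so the estimate is insensitive to the interpolation scheme; for a bad folder the broad transition makes the 50% crossing a much less sharply defined quantity.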
FIG. 2. Log-log plots [as proposed by Frauenfelder and collaborators (84)] of the distribution of folding times for a minimally frustrated three-letter code 27-mer. Time is shown in units of the number of Monte Carlo (MC) steps. The solid lines represent single-exponential fits through the data. Calculations were performed by Socci and collaborators (71). Around Tf=1.509, single exponentials, consistent with the diffusive picture, are a good representation of the data. As the temperature approaches the glass temperature (Tg ≈ 1), escape from long-lived traps starts to control the dynamics, leading to a stretched-exponential (power-law) behavior as expected for glass dynamics. The dashed line at T=1.12 is a double-exponential fit and the dashed ones at T=1.00 and 0.89 are stretched-exponential fits.
This correspondence between lattice models and real proteins, however, is still very limited. To explore all possible folding scenarios, these additional reaction coordinates (order parameters) need to be included explicitly. The off-lattice minimalist models are suited for this task. Simple off-lattice models of proteins can have protein-like shapes with well-defined secondary structural elements, as in real proteins. In addition, the continuum character of the configurational variables forces the unique folded state to be a basin of attraction, with an entropy proportional to the volume of the basin, rather than a single conformation. In this paper we show how the quantitative analysis that has been performed for lattice models to distinguish between good and bad folders can be generalized to off-lattice models. It should become clear how this framework can be used to analyze any other model, including those with a full atomistic description. The system analyzed here has the native conformation of a small four-strand β-barrel protein, and it is investigated for two different sequences, a minimally frustrated one and a frustrated one.
The comparison between the results obtained for both of them makes apparent how the landscape theory and the funnel concept can be used to quantitatively explore the folding of protein-like heteropolymers and even of real proteins.
The β-Barrel Model
Two sequences, one minimally frustrated and one frustrated, are analyzed. Both are Cα protein models, 46 monomers long, that fold into β-barrel-shaped structures but have different potentials. The first sequence, introduced by Honeycutt and Thirumalai (73), is (B)9(N)3(PB)4(N)3(B)9(N)3(PB)5P, with monomers that are labeled hydrophobic (B), hydrophilic (P), or neutral (N). This model, which we refer to as the BPN model, has
§Refs. 1 and 71 provide a detailed description for this formalism, including the dependence of the glass transition on the order parameters.
been studied on several other occasions (10, 49, 50), and similar α-helical models have also been studied (74). The energetics of the BPN model is described by a potential:
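A functional form consistent with the parameters quoted in the text and footnotes (the van der Waals coefficients S1 and S2, the dihedral coefficients A and B, harmonic bond angles with rest value θ0, and a repulsive 1/r^12 term) is sketched here as an assumption. The prefactor convention of the angle term and the exact range of the pair sum cannot be recovered from the text, so they should be read as placeholders.

```latex
V = \sum_{\text{angles}} \frac{k_\theta}{2}\,(\theta-\theta_0)^2
  + \sum_{\text{dihedrals}} \bigl[ A\,(1+\cos\phi) + B\,(1+\cos 3\phi) \bigr]
  + \sum_{\text{nonbonded pairs}} 4\varepsilon\, S_1
    \left[ \left(\frac{\sigma}{r}\right)^{12} - S_2 \left(\frac{\sigma}{r}\right)^{6} \right]
```

With this form, S2=1 gives an attractive well, S2=–1 makes the r^-6 term repulsive, and S2=0 leaves only excluded volume; a threefold term with A=0 gives three equivalent rotamers, whereas A>0 favors trans.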
The van der Waals interaction is used to mimic the hydrophobic/hydrophilic character of the different monomer types. To achieve this, the S1 and S2 coefficients are chosen to create attractive interactions between all BB monomer pairs, repulsive ones between all PP and PB pairs, and only excluded volume interactions between the pairs PN, BN, and NN. BB interactions have S1=1 and S2=1, PP and PB interactions have S1=2/3 and S2=–1, and all interactions involving N monomers have S1=1 and S2=0.¶ As becomes clear further on, this model exhibits a high degree of frustration, probably because of the long-range and nonspecific character of the interactions. To contrast with the BPN model, we developed a minimally frustrated one. In this model only the interactions between monomers that form native contacts (i.e., contacts found in the native β-barrel) are attractive. Doing so removes the roughness created by nonnative contacts and recovers nearly ideal folding behavior (see discussion in the introduction). We refer to this model as the Go-like model because it is similar to the one introduced by Go and collaborators (76).|| To construct the Go-like model, we take a quenched structure from the BPN model and identify all contacts of the type i, j>i+3 within a distance of 1.167σ. This produces 47 pairs of monomers distributed mainly among the B monomers (see Fig. 3); several of the monomers in the turns and at one end have no contacts. All attractive van der Waals interactions between monomers are turned off except for these 47 pairs. All other pairs have only the repulsive 1/r^12 term, responsible for excluded volume. The native pairs have an attractive interaction with a well depth of ε and an energy minimum at 1.2σ. This choice of interactions results in only minor differences between the ground state structure and the original quenched structure. All bond and angle interactions are the same as in the BPN model.
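The BPN sequence and its pair coefficients can be written out explicitly. The sketch below expands the block formula into one letter per monomer and returns the (S1, S2) values stated in the text for each pair type. The helper names are hypothetical, and the pair energy uses an assumed Lennard-Jones-like form, 4εS1[(σ/r)^12 – S2(σ/r)^6], chosen because it reproduces the attractive, repulsive, and excluded-volume cases described above.

```python
import re
from typing import Tuple


def expand_sequence(formula: str) -> str:
    """Expand a block formula such as '(B)9(N)3(PB)4' into one letter per monomer."""
    tokens = re.findall(r"\(([BPN]+)\)(\d+)|([BPN])(\d*)", formula)
    out = []
    for group, group_count, letter, letter_count in tokens:
        if group:
            out.append(group * int(group_count))
        else:
            out.append(letter * (int(letter_count) if letter_count else 1))
    return "".join(out)


def pair_coefficients(a: str, b: str) -> Tuple[float, float]:
    """Return (S1, S2) for a nonbonded pair, using the values given in the text."""
    if "N" in (a, b):
        return 1.0, 0.0           # PN, BN, NN: excluded volume only
    if a == "B" and b == "B":
        return 1.0, 1.0           # BB: attractive
    return 2.0 / 3.0, -1.0        # PP and PB: repulsive


def pair_energy(r: float, s1: float, s2: float,
                eps: float = 1.0, sigma: float = 1.0) -> float:
    """ASSUMED functional form: 4*eps*S1*[(sigma/r)**12 - S2*(sigma/r)**6]."""
    x6 = (sigma / r) ** 6
    return 4.0 * eps * s1 * (x6 * x6 - s2 * x6)


bpn = expand_sequence("(B)9(N)3(PB)4(N)3(B)9(N)3(PB)5P")
```

Expanding the formula gives 46 monomers, in agreement with the chain length quoted in the text. Under the assumed form, a BB pair has a well of depth ε at r = 2^(1/6)σ, whereas PP and PB pairs are repulsive at all distances.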
(There are many possible ways to construct a Go-like model, because the choice of the number of native contacts is somewhat arbitrary. The one we adopted is reasonable for the purpose of building a minimally frustrated sequence with this native conformation, but it is not unique.) Already in the development of these potentials, the differing robustness of the two models is apparent. Although both models are only weakly sensitive to changes in the angle interactions, the BPN model, unlike the Go-like model, is very sensitive to changes in the strength of the dihedral interaction. Weakening the intrinsic trans preferences in the BPN model by 25% makes the original native structure unstable at all temperatures. On the other hand, the dihedral preferences in the Go-like model can be strengthened or weakened while maintaining the same ground state structure. Even total elimination of the backbone rotamer preferences (A=0.0ε and B=0.2ε), the choice adopted in this paper, reduces the stability by only 36%, leaving a wide temperature window between Tf and Tg.
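The contact rule used to build the Go-like model (pairs i, j>i+3 within 1.167σ in a quenched structure) can be sketched as follows. The function name and the toy coordinates are hypothetical; the actual quenched BPN structure, which yields the 47 native pairs, is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def native_contacts(coords: Sequence[Coord],
                    cutoff: float = 1.167,
                    min_sep: int = 4) -> List[Tuple[int, int]]:
    """List pairs (i, j) with j >= i + min_sep (i.e. j > i + 3) within cutoff.

    Distances are in units of the bond length sigma. Only the pairs returned
    here would keep an attractive term in a Go-like model; every other pair
    would retain only the repulsive excluded-volume term.
    """
    pairs = []
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


# Toy six-bead hairpin with unit bond length (illustrative only).
hairpin = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
           (2.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
toy_pairs = native_contacts(hairpin)
```

In the toy hairpin only the two chain ends are both far enough apart in sequence and close enough in space, so a single pair is returned. Changing the cutoff changes the number of native pairs, which is the arbitrariness in the construction noted above.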
FIG. 3. An illustration of the ground state of the Go-like model. Each arrow represents an attractive interaction that exists between two monomers. There are 47 of these interactions. The only nonbonded interaction between two monomers without a connecting arrow is a repulsive 1/r^12 term responsible for excluded volume.
Signatures of Good and Bad Folders
Thermodynamics. The first clear indication of the different degrees of frustration of these models comes from analyzing their thermodynamic properties. Similar to what is observed in lattice simulations (1, 71), minimally frustrated systems are characterized by equivalent folding pathways, and such systems have cooperative folding transitions. Figs. 4 and 5 show the specific heat and the degree of folding
FIG. 4. The specific heat, Cv, of the BPN model (Upper) is contrasted with the collapse and folding denaturation curves (Lower). Compared to the minimally frustrated Go-like model (see Fig. 5), it shows a reduced level of cooperativity. Notice that collapse occurs prior to folding and that, even at the lowest temperature, the number of native contacts is far from maximal. Reliable sampling could not be performed below 0.4ε because at these temperatures the kinetics is controlled by escape from long-lived traps. In particular, the lowest bump in the specific heat is partially an artifact of the low T sampling.
¶For both models, we work in reduced units; i.e., all units are defined in terms of the monomer mass M, the bond length σ, and the energy ε. Time is thus measured in units of τ=(Mσ^2/ε)^(1/2) and friction in units of τ^-1. Also, all bonds are fixed with the shake algorithm (75), and bond angles are set to have a rest value of 105° and a spring constant of 40ε(rad)^-2. The BPN model has stiff local trans preferences for the dihedral angles except at the loop regions. Thus the BPN coefficients for the dihedral interactions are set as A=1.2ε and B=0.2ε for all the dihedral interactions except those involving two or more neutral monomers, in which case A=0.0ε and B=0.2ε, leading to a small barrier but no preference among the three possible backbone rotamers. As a consequence of this choice of dihedral coefficients, rigid strands appear at all temperatures below the collapse temperature.
||This model is also similar to the associative memory hamiltonian used by Wolynes and collaborators (48) in the limit of a single memory.
(Q) and collapse (C) order parameters as a function of temperature for both models. The difference in folding cooperativity between them is noticeable. The BPN model has a broad transition region centered at Tc ≈ 0.72ε that is mainly a collapse transition, although the collapsed structures are rather restricted in their conformations. Nearly all the collapsed structures have a four-stranded topology like the ground state. This similarity is reflected in the folding order parameter increasing simultaneously with the collapse order parameter. Fig. 6 provides strong evidence that most of the entropy is lost upon collapse. Even though states 70% similar to the native one are formed below Tc, the native state itself is not populated until well below this temperature. The temperature at which this occurs is not known exactly because it lies below T=0.4ε, where our sampling is not reliable. Notice from Fig. 6 that for temperatures below 0.5ε, this model “runs out” of entropy at Q ≈ 0.3, indicating that its kinetics is now controlled by escape from long-lived low-energy traps (glassy regime). The structural properties of these low-energy structures are discussed in the subsection on the ground state.
FIG. 5. The specific heat, Cv, of the Go-like model versus temperature (Upper) is contrasted with the mean values of Q and C versus temperature (Lower). Notice the simultaneity of the collapse and folding events as well as the high degree of cooperativity of the folding transition.
FIG. 6. The thermodynamic functions plotted as a function of the folding order parameter, Q, for the BPN model. F is the free energy, TS is the temperature times the entropy, and E is the energy. The temperatures are measured in units of ε (0.6ε is just below the collapse temperature). All curves are in units of kBT for and are shifted relative to the native state. The lack of an energy bias toward the native state is apparent. The entropy plots also illustrate the onset of glassy behavior at temperatures below 0.5ε (the model runs out of entropy at Q ≈ 0.3). At these low temperatures, the dynamics becomes controlled by the escape time from long-lived traps.
In contrast to the BPN model, the Go-like model shows a single sharp peak in the specific heat centered at 0.42ε. This “latent heat” coincides with increases in Q and C, so collapse and folding occur simultaneously at this temperature. Even though several order parameters can monitor collapse and folding [for example, rms deviation from the native conformation, principal component analysis coordinates (77, 78), radius of gyration, secondary structure measures, and contact measures], in all our analyses C and Q are used to probe collapse and folding, respectively. Both have been normalized to 1 (relative to the maximum number of contacts in the quenched native configuration). (Of course this means that there are a few states with C>1.) For the purpose of calculating Q or C, we define contacts to exist between any two monomers with indices i and j>i+3 that are within 1.8σ of each other, even though a shorter cutoff was used to determine the “native” contacts of the native structure.
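The two order parameters can be computed from a single configuration as sketched below. Contacts are counted between monomers i and j>i+3 within 1.8σ, Q is the fraction of native pairs that are in contact, and C is the total number of contacts divided by the number found in the native configuration with the same cutoff. This normalization of C is one reading of the text and should be treated as an assumption; the function names and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Set, Tuple

Coord = Tuple[float, float, float]
Pair = Tuple[int, int]


def contact_set(coords: Sequence[Coord], cutoff: float, min_sep: int = 4) -> Set[Pair]:
    """All pairs (i, j) with j >= i + min_sep (i.e. j > i + 3) within cutoff."""
    return {
        (i, j)
        for i in range(len(coords))
        for j in range(i + min_sep, len(coords))
        if math.dist(coords[i], coords[j]) <= cutoff
    }


def order_parameters(coords: Sequence[Coord],
                     native_coords: Sequence[Coord],
                     native_pairs: Sequence[Pair],
                     cutoff: float = 1.8) -> Tuple[float, float]:
    """Return (Q, C) for one configuration.

    Q: fraction of the native pairs currently in contact.
    C: number of contacts divided by the number of contacts in the native
       configuration at the same cutoff (assumed normalization), so that
       compact nonnative states can have C > 1.
    """
    current = contact_set(coords, cutoff)
    reference = contact_set(native_coords, cutoff)
    q = len(current & set(native_pairs)) / len(native_pairs)
    c = len(current) / len(reference)
    return q, c


# Toy six-bead hairpin used as the "native" structure (illustrative only).
hairpin = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
           (2.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
extended = [(float(k), 0.0, 0.0) for k in range(6)]
q_native, c_native = order_parameters(hairpin, hairpin, [(0, 5)])
q_open, c_open = order_parameters(extended, hairpin, [(0, 5)])
```

Using a looser cutoff for monitoring than for defining the native pairs means a native pair still counts as formed when thermal fluctuations stretch it slightly.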
This flexibility allows the native contacts to fluctuate slightly. For the BPN model, we used a cutoff of 1.2σ to define native contacts, which are exactly the attractive ones in the Go-like model. The details of our results are relatively insensitive to the choice of cutoff for classifying contacts as native. The thermodynamic functions for both models are plotted versus Q in Figs. 6 and 7. The curves have been shifted so that the energy, entropy, and free energy are zero at the native state. The Go-like model shows a very good funnel: the energy and entropy decrease smoothly as Q increases toward the native state. This behavior, as expected from landscape theory (1, 6–8), has also been observed in lattice simulations (22, 71). The individual energy and entropy terms are very large, around 10 to 100 kBT, but they almost cancel each other, yielding a much smaller residual free energy [recall that our potentials already renormalize the effect of the solvent (2)] and, as in lattice models, a small free energy barrier of 3kBTf exists at the folding temperature. Also, since the low-energy states are all very similar to the
FIG. 7. The thermodynamic functions plotted as a function of the folding order parameter, Q, for the Go-like model. The temperatures are measured in units of ε. All curves are in units of kBT for and are shifted relative to the native state. Notice that, even for temperatures far below the folding temperature, this model does not “run out” of entropy, indicating the presence of a very good funnel as expected for minimally frustrated systems.
native configuration (Q ≈ 1), this system is very robust and therefore, as discussed in the introduction, insensitive to mutations and to reasonable changes in the environment (changes in temperature or changes that affect the potential). The behavior of the BPN model (Fig. 6) differs sharply from that of the Go-like model. The free energy plots indicate a noncooperative second-order-like collapse transition near Tc (0.72ε) with little preference among collapsed structures. The native structure is selected from a large ensemble of dissimilar low-energy structures. Most of the energy gain is used upon collapse, leaving almost nothing to bias the search among collapsed states toward the native configuration. As discussed above, the entropy decreases sharply for states with Q>0.3 at temperatures just below the collapse temperature (around 0.5ε). This entropy crisis heralds the onset of glassy dynamics controlled by escape from long-lived low-energy traps. This glassy behavior is supported by three other effects that become prominent near and below this temperature: a rapid increase in the folding time as the temperature is reduced, the existence of nonexponential relaxation, and the occurrence of specific folding trajectories that are unrelated to the underlying free energy surface as plotted versus a few order parameters. Also, because the low-energy states may be so dissimilar, this model shows no robustness. Minor changes in the environment and mutations may cause dramatic changes in the structure of the native state (see further discussion in the next two subsections). Sampling for the determination of the thermodynamic behavior is done using the AMBER (79) program. Molecular dynamics (MD) simulations are performed at constant temperature (80) with a coupling time of 0.1τ and a time step of 0.005τ. Samples taken at several temperatures are combined by using multiple histograms (81). Simulations are done at various temperatures ranging from 0.02ε to 1.2ε.
Each temperature simulation is preceded by a 2-million-step equilibration that starts from the final conformation of the previous higher temperature simulation. At each temperature 4,000 configurations are collected.
Kinetics. To fully explore the dynamics of the folding event, a series of folding simulations is performed for both models at different temperatures. MD simulations are done using a leap-frog Langevin integrator (adapted from ref. 82). Kinetic quantities are measured with a γ of 0.2τ^-1, which is a factor of 10 larger than the measured value for amino acids in water (83). We do not believe the use of a lower friction constant will qualitatively change our results, although folding timescales are probably decreased by a factor of 10. Simulations of the Go-like model for different values of the friction constant show a folding rate that varies linearly for γ greater than 2.0τ^-1, and this variation appears to be temperature independent. No appreciable difference in folding behavior is noticed for the different values of γ. The same dependence is also observed for the BPN model (72). On the order of 100 simulations are performed at each temperature. Each simulation is preceded by 200,000 simulation steps at 1.6ε to unfold and randomize the system. The final coordinates and velocities of this simulation are used as the starting point for the folding simulations. Q is calculated for every tenth structure, and the simulation is halted when a native structure with Q=1 is reached. The length of the folding run is used to calculate the mean first passage time (MFPT) for each temperature. The MFPT increases rapidly at both low and high temperatures. In the BPN model, the minimum MFPT is about 900τ and occurs at 0.6ε. In the Go-like model, the minimum MFPT is about 100τ and occurs at 0.2ε. The increase in the folding time at low and high temperatures is a prediction of the energy landscape theory (2, 7).
As discussed in the preceding section and ref. 22, the increase in the MFPT at high temperatures is caused by the growth of the folding barrier, whereas the increase at low temperatures (before the glass transition) is due to changes in the prefactor of the folding rate, which depends on a configurational diffusion coefficient that averages the effect of short-lived traps.
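The kinetic quantities used in this subsection follow directly from the lengths of the individual folding runs. The sketch below computes an MFPT estimate and the unfolded population as a function of time from a list of first passage times. Treating runs that reach the simulation time limit without folding as censored at that limit (so that the MFPT is a lower bound) is an assumption of this sketch, as are the function names and the sample numbers.

```python
from typing import List, Optional, Sequence, Tuple


def mean_first_passage_time(times: Sequence[Optional[float]],
                            t_max: float) -> Tuple[float, int]:
    """Return (MFPT estimate, number of runs that did not fold by t_max).

    A run that never reached the native state is given as None and is
    counted at t_max, which makes the estimate a lower bound.
    """
    capped = [t_max if t is None else min(t, t_max) for t in times]
    unfinished = sum(1 for t in times if t is None or t > t_max)
    return sum(capped) / len(capped), unfinished


def unfolded_population(times: Sequence[Optional[float]],
                        grid: Sequence[float]) -> List[float]:
    """Fraction of runs still unfolded at each time in grid."""
    n = len(times)
    return [sum(1 for t in times if t is None or t > g) / n for g in grid]


# Hypothetical first passage times in units of tau; None marks a truncated run.
runs = [100.0, 300.0, None, 200.0]
mfpt, n_unfolded = mean_first_passage_time(runs, t_max=1000.0)
population = unfolded_population(runs, [0.0, 150.0, 250.0, 500.0])
```

Plotting the unfolded population on log-log axes distinguishes a single-exponential decay, which falls off sharply, from the slow power-law or stretched-exponential tail produced by a broad distribution of trap escape times.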
FIG. 8. Log-log plots of the unfolded population as a function of time for the BPN model (Upper) and the Go-like model (Lower). The dashed lines are exponential fits to the data, and the single solid line is a power-law fit for the BPN model. Upper shows that the BPN model starts to become nonexponential at temperatures just below collapse (around T=0.6ε). Deviations from single-exponential behavior are caused by a few deep traps with different escape times. Around this temperature, the kinetics is roughly bi-exponential. As the temperature gets lower, the number of these low-energy traps increases substantially, leading to the power-law decay. The onset of nonexponential kinetics in the Go-like model does not occur until temperatures much lower than the folding temperature (around T=0.1ε). All simulations are truncated at 50,000τ.
Similar to lattice models (see the preceding section and ref. 69), a simple way to estimate the glass transition is to use the operational definition of a kinetic glass transition temperature Tg, the temperature at which the MFPT for folding has fallen to 1% of its maximal value. The approximate value of Tg for the BPN model is 0.4ε, and for the Go-like model it is 0.05ε.** This gives for the two models a Tf/Tg ratio of about 0.9 and 8, respectively. These ratios place the BPN and Go-like models squarely in the groups of strongly and minimally frustrated systems, respectively. A hallmark of glassy dynamics is nonexponential relaxation. As in Fig. 2, Fig. 8 shows log-log plots of the unfolded population as a function of time for both sequences. In these plots, an exponentially decaying population falls sharply, whereas glassy dynamics exhibits a power-law (or stretched-exponential) decay (71, 84). The BPN model starts to deviate from exponential folding around 0.6ε, where the decay is bi-exponential. This is evidence that the system is starting to be trapped in nonnative conformations.
At 0.45ε there is a continuum of folding times, controlled by the escape times from a large ensemble of long-lived low-energy traps. This is reflected in a power-law decay with folding times ranging from 500τ to at least 50,000τ, the time limit for individual folding simulations. The second relaxation time, at temperatures where the kinetics is roughly bi-exponential, is most likely caused by a trap in which the first completely hydrophobic strand is bent backwards to contact itself. Although there are several unfavorable dihedrals in this conformation, the large number of BB contacts makes it an exceptionally low-energy trap. On the other hand, the Go-like model decay can be fit by
**The ruggedness for the Go-like model is very small because the energy is roughly proportional to Q. This is apparent from Fig. 7, where the entropy as a function of Q is almost temperature independent.
a single exponential for temperatures much lower than the folding temperature, all the way down to 0.1ε. Therefore, no long-lived traps exist at the relevant temperatures around Tf. The lack of folding events in either system within the first 10–20τ is due to the intrinsic collapse time; systems that fold in this time are collapsing directly into the native structure. Single folding runs of the BPN model (T ≈ 0.5ε) show long-lived traps that are not visible in plots of the potential of mean force. These trapped trajectories individually show little relation to this effective potential. This behavior becomes prominent near and below Tg because the folding kinetics is then controlled by escape from low-energy long-lived traps.
The Nature of the Ground State. The inherent frustration of the BPN model compared to the Go-like one can be visualized by measuring the occupation of the different collapsed states. Using MD trajectories of both models, we perform a cluster analysis of the collapsed states in terms of collective motions that best (in a least-squares sense) represent the system fluctuations. These coordinates are called molecule optimal dynamic coordinates (MODC) (77, 78). The MODCs are obtained by diagonalizing the covariance matrix of selected dynamic variables (in our case, the Cartesian coordinates of the sequence beads). The MODC with the largest eigenvalue best describes the atomic fluctuations and, in this case, is sufficient to differentiate the various long-lived low-energy traps. In Fig. 9 Upper, a low temperature trajectory of the BPN model is plotted, using the two primary MODCs for this trajectory. Fig. 9 Lower shows the free energy as a function of the primary MODC. Superimposed in Lower is the free energy of the Go-like model, when the same MODCs are used, showing that only the native cluster is occupied.
The Go-like model trajectory, not shown here, occupies only the native cluster instead of the ensemble of different structures occupied by the BPN one. Notice that each cluster does not necessarily correspond to a single structure. The rms deviation between structures in different clusters is about 1σ and within a single cluster is less than 1/2σ, whereas crystallographic structures of proteins have backbone rms deviations of about 1/3 the typical Cα–Cα distance, i.e., 1/3σ. Also, structures in different clusters have different packing arrangements of the hydrophobic monomers. Therefore, each cluster corresponds to one or a few different packing arrangements. Most differ by a combined longitudinal translation and 180° rotation of one or more of the strands, and interconversion among them involves “reptation-like” moves (53).
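The MODC construction described above amounts to diagonalizing the covariance matrix of the bead coordinates and projecting the trajectory onto the leading eigenvectors. The sketch below extracts only the mode with the largest eigenvalue, using power iteration in place of a full diagonalization to keep the example dependency-free; that substitution, the function names, and the toy data are assumptions of this sketch. Each frame is a flat list of Cartesian coordinates, assumed to have been superimposed beforehand to remove overall translation and rotation.

```python
import math
from typing import List, Sequence, Tuple


def covariance_matrix(frames: Sequence[Sequence[float]]) -> Tuple[List[float], List[List[float]]]:
    """Return (mean, covariance) of the coordinate vectors in frames."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in frames:
        dev = [f[k] - mean[k] for k in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += dev[a] * dev[b] / n
    return mean, cov


def leading_mode(cov: Sequence[Sequence[float]],
                 iterations: int = 500) -> Tuple[float, List[float]]:
    """Largest eigenvalue and eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / (k + 1) for k in range(d)]   # deterministic, non-symmetric start
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    cv = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * cv[a] for a in range(d))
    return eigenvalue, v


def project(frame: Sequence[float], mean: Sequence[float],
            mode: Sequence[float]) -> float:
    """Coordinate of one frame along a mode, measured from the mean."""
    return sum((frame[k] - mean[k]) * mode[k] for k in range(len(mode)))


# Toy two-coordinate "trajectory" with two well-separated clusters.
toy_frames = [[-2.0, 0.1], [-1.0, -0.1], [1.0, -0.1], [2.0, 0.1]]
toy_mean, toy_cov = covariance_matrix(toy_frames)
toy_value, toy_mode = leading_mode(toy_cov)
toy_projection = [project(f, toy_mean, toy_mode) for f in toy_frames]
```

A histogram of the projections along the leading mode, converted to a free energy, gives a profile of the kind shown in Fig. 9 Lower, with each minimum corresponding to one cluster.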
FIG. 9. Cluster analysis of a low temperature trajectory (T ≈ 0.32ε) for both models. Upper plots a trajectory for the BPN model as a function of the first two MODCs, and it shows that multiple clusters are often occupied. A trajectory for the Go-like model, not shown here, mostly occupies the native cluster. Supporting these observations, Lower shows the potential of mean force (PMF) for both models as a function of the first MODC. Each minimum corresponds to a different cluster. While the PMF for the BPN model has several low-energy minima, the Go-like PMF has a single, well-defined minimum at the native cluster.
Concluding Remarks
A framework based on the energy landscape theory and the funnel concept, which is able to quantitatively estimate the degree of frustration of folding sequences, has been presented. Thermodynamic and kinetic measures are used to distinguish between good folders (minimally frustrated) and bad folders (frustrated). Good folding sequences have a weakly rugged funnel-like landscape with low-energy states that have structurally similar configurations. The folding kinetics is exponential for temperatures around Tf, and the system is very robust to mutations and to reasonable changes in the environment. The situation reverses for frustrated sequences. The landscape is rugged and the low-energy states are dissimilar. Around Tf, the kinetics is controlled by escape from different low-energy traps and therefore is nonexponential. The robustness observed for good folding sequences becomes nonexistent. Also, a comparison between two sequences that fold into the same native conformation, one frustrated and one minimally frustrated, has been presented as an application of this framework. Notice, however, that the landscape theory predicts a diversity of folding scenarios that cannot be covered by a single example. Even though different order parameters may be necessary to describe different systems and their respective folding scenarios, this framework applies to all of them. By departing from the minimalist lattice models and moving to off-lattice ones, we can now develop a much richer collection of folding models and understand the folding conditions for each of them. In addition, this framework is not limited to minimalist models. It can be applied to the folding of proteins at full atomistic representation. At this level the kinetic data will be very limited, but the thermodynamic analysis alone is already very informative.
By comparing these results with the ones obtained for the minimalist models, we should be able to identify the possible folding scenarios and quantitatively understand the folding mechanism for real proteins at an atomic resolution.
We thank Nick Socci, Gerhard Hummer, Jorge Chahine, Peter Wolynes, Joan Shea, and Charlie Brooks for helpful discussions. This work was supported by the National Science Foundation (Grant MCB-9603839). It was also partially supported by Los Alamos/University of California directed research and development (UCDRD) funds and by a molecular biophysics training grant (NIH T32 GM08326) for H.N.
1. Onuchic, J.N., Luthey-Schulten, Z. & Wolynes, P.G. (1997) Annu. Rev. Phys. Chem. 48, 545–600.
2. Bryngelson, J.D., Onuchic, J.N., Socci, N.D. & Wolynes, P.G. (1995) Proteins Struct. Funct. Genet. 21, 167–195.
3. Englander, S.W. & Mayne, L. (1992) Annu. Rev. Biophys. Biomol. Struct. 21, 243–265.
4. Kim, P.S. & Baldwin, R.L. (1990) Annu. Rev. Biochem. 59, 631–660.
5. Go, N. (1983) J. Stat. Phys. 30, 413–423.
6. Bryngelson, J.D. & Wolynes, P.G. (1987) Proc. Natl. Acad. Sci. USA 84, 7524–7528.
7. Bryngelson, J.D. & Wolynes, P.G. (1989) J. Phys. Chem. 93, 6902–6915.
8. Leopold, P.E., Montal, M. & Onuchic, J.N. (1992) Proc. Natl. Acad. Sci. USA 89, 8721–8725.
9. Dill, K.A. & Chan, H.S. (1997) Nat. Struct. Biol. 4, 10–19.
10. Guo, Z.Y. & Thirumalai, D. (1995) Biopolymers 36, 83–102.
11. Garel, T., Orland, H. & Thirumalai, D. (1996) in Recent Developments in Theoretical Studies of Proteins, ed. Elber, R. (World Scientific, Singapore), pp. 197–268.
12. Dill, K.A., Bromberg, S., Yue, K., Fiebig, K.M., Yee, D.P., Thomas, P.D. & Chan, H.S. (1995) Protein Sci. 4, 561–602.
13. Fersht, A.R. (1997) Curr. Opin. Struct. Biol. 7, 3–9.
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5929–5934, May 1998 Colloquium Paper This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Optimizing the stability of single-chain proteins by linker length and composition mutagenesis
CLIFFORD R. ROBINSON* AND ROBERT T. SAUER†
Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139

ABSTRACT Linker length and composition were varied in libraries of single-chain Arc repressor, resulting in proteins with effective concentrations ranging over six orders of magnitude (10 µM–10 M). Linkers of 11 residues or more were required for biological activity. Equilibrium stability varied substantially with linker length, reaching a maximum for glycine-rich linkers containing 19 residues. The effects of linker length on equilibrium stability arise from significant and sometimes opposing changes in folding and unfolding kinetics. By fixing the linker length at 19 residues and varying the ratio of Ala/Gly or Ser/Gly in a 16-residue randomized region, the effects of linker flexibility were examined. In these libraries, composition rather than sequence appears to determine stability. Maximum stability in the Ala/Gly library was observed for a protein containing 11 alanines and five glycines in the randomized region of the linker. In the Ser/Gly library, the most stable protein had seven serines and nine glycines in this region. Analysis of folding and unfolding rates suggests that alanine acts largely by accelerating folding, whereas serine acts predominantly to slow unfolding. These results demonstrate an important role for linker design in determining the stability and folding kinetics of single-chain proteins and suggest strategies for optimizing these parameters.

The construction of single-chain or hybrid proteins is a potentially powerful method for generating proteins with novel functions and improved properties (1–11). A critical element in such efforts is the design of the peptide linkers that connect different protein domains or subunits. Designed linkers are usually glycine-based peptides with lengths calculated to span the minimum distance between the C terminus of one subunit or domain and the N terminus of the next.
How important is linker design in determining the properties of single-chain proteins? Alterations in linker regions have been found to affect the stability, oligomeric state, proteolytic resistance, and solubility of single-chain proteins (12–23), but few systematic investigations of these relationships have been reported. Here, we test the effects of linker design on the stability, protein folding kinetics, and biological activity of single-chain Arc repressor. Wild-type Arc is a dimer with identical subunits, and Arc-L1-Arc is a single-chain variant with a 15-residue linker connecting the subunits (see Fig. 1). The L1 linker of Arc-L1-Arc holds the subunits at an effective concentration (Ceff) of 3 mM. By varying linker length and composition, we have isolated single-chain variants with effective subunit concentrations ranging from 10 µM to 10 M, corresponding to changes in the free energy of unfolding (∆Gu) from 3 to 11 kcal/mol. These differences in stability arise from changes in the folding and unfolding rates, suggesting that linker design can affect protein stability by altering the free energies of both the native and denatured states.
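The correspondence between effective concentration and unfolding free energy quoted above can be checked numerically. The sketch below uses the relation given in Materials and Methods, Ceff = exp[(m2·∆G1/m1 − ∆G2)/RT], with the wild-type values reported there (m2 = 1.48 kcal/mol·M, ∆G2 = 10.3 kcal/mol). The function name, the default m1 = m2 (the text describes the m values only as similar), and the temperature of 298.15 K are illustrative assumptions, so the results are order-of-magnitude estimates rather than the published values.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def effective_concentration(dg1, m1=1.48, dg2=10.3, m2=1.48, temp_k=298.15):
    """Return Ceff (molar) from Ceff = exp[(m2*dG1/m1 - dG2)/RT].

    dg1, m1: unfolding free energy and m value of the single-chain protein.
    dg2, m2: values for wild-type Arc quoted in Materials and Methods.
    The default m1 = m2 is an assumption made only for this illustration.
    """
    rt = R_KCAL * temp_k
    return math.exp((m2 * dg1 / m1 - dg2) / rt)


# Free energies spanning the range quoted in the text (3 to 11 kcal/mol).
for dg in (3.0, 8.4, 11.0):
    ceff = effective_concentration(dg)
    print(f"dGu = {dg:4.1f} kcal/mol -> Ceff ~ {ceff:.2e} M")
```

With m1 set equal to m2, ∆Gu values of 3 and 11 kcal/mol give effective concentrations in the micromolar and molar ranges, respectively, in line with the range quoted in the text.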
MATERIALS AND METHODS Cassettes coding for glycine-rich linkers ranging from 3 to 59 residues (Fig. 3A) were synthesized using an Applied Biosystems 381A DNA synthesizer and were purified as described (9). A precursor plasmid (pLA3), constructed to facilitate subcloning of linker library cassettes, contains tandem arc genes connected by a GGT ACC GGT adapter, which encodes Gly-Thr-Gly and contains unique KpnI and AgeI restriction sites. Cassette libraries coding for 19-residue linkers with different amounts of Gly or Ala were constructed by synthesizing an oligonucleotide that formed a hairpin:

5′-ACACCTTGAGGTACCCGA(GSA)15GGTACCTAACAGGCG AAAA
                         3′-CCATGGATTGTCCGC AAAA

In this layout the A residues at the right form the hairpin loop. The two GGTACC sequences in the upper strand (underlined in the original) are KpnI sites. S represents a mixture of G and C; thus, the GSA codons encode either glycine (GGA) or alanine (GCA). Three otherwise identical oligonucleotides with different G/C ratios at the randomized positions (1:1, 3:1, and 1:3) were synthesized to facilitate identification of a wide range of compositions. A cassette library encoding random combinations of glycine (GGT) and serine (AGT) was constructed in the same manner. Second-strand synthesis was carried out using Sequenase v.2.0 (United States Biochemical) for 2 h at 37°C in Sequenase buffer containing 1 mM dNTPs. Cassettes were digested with KpnI and ligated to the KpnI backbone of pLA3. Following transformation into Escherichia coli strain HB101, colonies were picked randomly and the appropriate region of the single-chain arc gene was sequenced using the dideoxy method. Plasmid DNAs encoding in-frame constructs were transformed into E. coli strain UA2F for assays of activity in vivo (24) and into E. coli X90–λO cells for protein expression. All single-chain Arc proteins contained a (His)6 tail to facilitate purification using Ni-nitrilotriacetic acid chromatography.
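The effect of the three G/C mixing ratios on library composition can be illustrated with a simple binomial model. This is a sketch and not part of the published protocol: it assumes that each randomized codon incorporates G or C independently, with probability equal to the nominal mixing ratio, and it takes the number of randomized positions as a parameter (16 is used here, following the 16 randomized positions described in Results). The function names are hypothetical.

```python
from math import comb


def alanine_count_distribution(g_to_c_ratio, n_positions=16):
    """Return [P(0 Ala), ..., P(n Ala)] under an independent-site binomial model.

    g_to_c_ratio is the nominal G:C ratio at the S position, e.g. (3, 1).
    GGA encodes glycine and GCA encodes alanine, so P(Ala) = C / (G + C).
    """
    g, c = g_to_c_ratio
    p_ala = c / (g + c)
    return [
        comb(n_positions, k) * p_ala**k * (1 - p_ala) ** (n_positions - k)
        for k in range(n_positions + 1)
    ]


def expected_alanines(g_to_c_ratio, n_positions=16):
    """Mean number of alanine codons per cassette under the same model."""
    dist = alanine_count_distribution(g_to_c_ratio, n_positions)
    return sum(k * p for k, p in enumerate(dist))


for ratio in [(1, 1), (3, 1), (1, 3)]:
    dist = alanine_count_distribution(ratio)
    most_likely = max(range(len(dist)), key=dist.__getitem__)
    print(
        f"G:C = {ratio[0]}:{ratio[1]}  "
        f"mean Ala = {expected_alanines(ratio):.1f}  most likely = {most_likely}"
    )
```

Under these assumptions the three oligonucleotides are centered near 8, 4, and 12 alanines, which is consistent with the stated aim of sampling a wide range of compositions.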
*Present address: 3-Dimensional Pharmaceuticals, Exton, PA. †To whom reprint requests should be addressed. e-mail: [email protected]. © 1998 by The National Academy of Sciences 0027–8424/98/955929–6$2.00/0 PNAS is available online at http://www.pnas.org. Abbreviations: Ceff, effective concentration; CD, circular dichroism.

Protein purification, fluorescence and circular dichroism (CD) spectroscopy, analytical ultracentrifugation, and gel mobility-shift assays were performed as described (9, 25). Protein stability was assayed by urea denaturation by following changes in intrinsic tryptophan fluorescence intensity at 337 nm or CD ellipticity at 234 nm. For these experiments, the protein concentration was 10 µM in buffer containing 50 mM Tris·HCl (pH 7.5 at 25°C), 250 mM KCl, and 0.1 mM EDTA (26). Values of ∆Gu and m were obtained by fitting denaturation data to a two-state model by nonlinear least-squares methods (26). Effective concentrations were calculated by using the equation $C_{\mathrm{eff}} = \exp[(m_2\,\Delta G_1/m_1 - \Delta G_2)/RT]$, where m1 and ∆G1 are values for the single-chain protein, and m2 and ∆G2 are values for wild-type Arc (1.48 kcal/mol·M and 10.3 kcal/mol, respectively) (26).

Stopped-flow kinetic experiments of protein folding and unfolding were monitored by changes in fluorescence at protein concentrations between 1 and 10 µM in the buffer used for stability measurements (26). Unfolding was initiated by urea-jump experiments (mixing ratio 1:10) to yield a final urea concentration of 7 or 9.1 M. Refolding was initiated by mixing protein denatured in 6.0–9.6 M urea with low-urea buffer (1:5 ratio) to yield final urea concentrations between 1.0 and 4.5 M. Rate constants were obtained by fitting the kinetic data to single exponentials. In all cases, the residuals of the fits were distributed randomly. For ease of comparison within each library of variants, rates were either measured at a single urea concentration or measured at a series of urea concentrations and extrapolated to this reference concentration by using linear regression of ln(k) vs. [urea] plots (R>0.99).
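The extrapolation of rate constants to a reference urea concentration can be sketched as an ordinary least-squares fit of ln(k) against [urea]. The code below is a minimal illustration under that assumption; the function names are hypothetical, and the data are synthetic, noise-free values generated for the example rather than measurements from this study.

```python
import math


def fit_ln_k_vs_urea(urea_molar, rate_constants):
    """Least-squares line through ln(k) vs. [urea].

    Returns (slope, intercept, r), where r is the correlation coefficient.
    """
    xs = list(urea_molar)
    ys = [math.log(k) for k in rate_constants]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def extrapolate_rate(urea_molar, rate_constants, reference_urea):
    """Rate constant predicted at reference_urea from the fitted line."""
    slope, intercept, _ = fit_ln_k_vs_urea(urea_molar, rate_constants)
    return math.exp(intercept + slope * reference_urea)


# Synthetic refolding data: ln(k) falls linearly with urea (illustration only).
urea = [2.5, 3.0, 3.5, 4.0, 4.5]
k_obs = [math.exp(8.0 - 1.5 * c) for c in urea]

slope, intercept, r = fit_ln_k_vs_urea(urea, k_obs)
print(f"slope = {slope:.2f} per M, |R| = {abs(r):.3f}")
print(f"k at 2 M urea ~ {extrapolate_rate(urea, k_obs, 2.0):.1f} per s")
```

Checking |R| against a threshold such as the 0.99 quoted in the text gives a simple test of whether a linear extrapolation is reasonable for a given variant.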
FIG. 1. (A) Tandem copies of the arc gene connected by DNA encoding a linker region comprise the gene for single-chain Arc repressor. (B) One model of how a linker might connect the two subunits (colored gray and white) of single-chain Arc. The positions of the N and C termini are indicated. Prepared using MOLSCRIPT (34) and coordinates of wild-type Arc (33).
RESULTS Variation of Linker Length. A library of single-chain arc genes with linkers composed of Gly, Ser, and Thr and lengths varying from 3 to 59 aa was constructed (Fig. 3A). The fraction of Gly in different linkers ranges from 66 to 80%. The linkers and corresponding proteins are named LLX and Arc-LLX-Arc (Length Library, X=number of residues), respectively. No intracellular expression of the Arc-LL8-Arc protein was detected. Arc-LL3-Arc was expressed at high levels, but monomers, dimers, and higher-order oligomers were observed following SDS electrophoresis and Western analysis. This behavior may indicate “cross-folding,” as has been observed with single-chain antibodies that have very short linkers (27, 28). The remaining 13 proteins in this library were all expressed at high levels and electrophoresed as monomers.

The Arc-LLX-Arc variants were tested for repression of transcription of the Pant promoter in E. coli strain UA2F, using resistance to streptomycin as an assay of biological activity (24). Arc-LLX-Arc proteins with linkers containing 13 or more residues had wild-type activities. Arc-LL11-Arc was partially active; single-chain molecules with the LL3, LL8, and LL9 linkers were inactive. Modeling studies show that connecting the Arc subunits with linkers shorter than 13 residues would require the linker to cross the DNA-binding surface of the protein and/or require distortion of the structure.

Single-chain Arcs with linkers LL9–LL59 were purified for biophysical characterization. All of these single-chain proteins had CD and fluorescence spectra similar to wild-type Arc. Arc-LL11-Arc, Arc-LL19-Arc, and Arc-LL31-Arc were analyzed by analytical ultracentrifugation and found to be monomeric at concentrations between 10 and 100 µM (data not shown). Proteins containing the three longest linkers (LL47, LL51, and LL59) tended to precipitate at concentrations >100 µM, possibly because of aggregation caused by cross-folding of the Arc subunits.
The thermodynamic stabilities of Arc-LLX-Arc proteins with linkers from 9 to 59 residues were determined by urea denaturation studies, revealing that the 19-residue linker provides maximal stability. As shown in Fig. 2 for a subset of these proteins, there are large changes in the concentration of urea required for denaturation of proteins with different linker lengths, but the curves are roughly parallel, indicating that the denaturant m-values (variation of ∆Gu with urea) are similar. Fig. 3B shows the variations of ∆Gu and Ceff with linker length. For linkers from 9 to 19 residues, stability of the single-chain protein increased with length. Arc-LL9-Arc was the least stable (∆Gu ≈ 3 kcal/mol; Ceff ≈ 6 µM) and Arc-LL19-Arc was the most stable (∆Gu = 8.4 kcal/mol; Ceff = 80 mM) of the proteins examined. Increases in linker length past 19 residues resulted in decreasing stability until a plateau was reached at 4.5 kcal/mol (Ceff ≈ 150 µM) for linkers between 47 and 59 residues.

The linker-dependent changes in stability arise from changes in both the folding and unfolding rates, as measured in urea-jump, stopped-flow kinetic experiments. Fig. 3 C and D show that both the folding and unfolding rate constants vary significantly as the linker length is changed. In 7 M urea, Arc-LL9-Arc unfolds with a rate constant (ku) of 3,000 s⁻¹. As the linker length is increased from 9 to 19, there is a roughly exponential decrease in ku that spans 3–4 orders of magnitude and reaches a value of 1 s⁻¹ for Arc-LL19-Arc. Changes in linker length between 19 and 59 residues do not change ku appreciably. Thus, linkers shorter than 19 residues reduce the free energy barrier between the native state and the transition state.

FIG. 2. Linker length has large effects on the stability of single-chain Arc to urea denaturation. The sequences of linkers LL9, LL11, LL17, LL19, LL31, and LL47 are listed in Fig. 3A. Fraction unfolded was calculated by fitting plots of CD ellipticity (234 nm) vs. urea concentration to a two-state unfolding transition. The solid lines represent the best theoretical fits of the experimental data.
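The two-state analysis used for the denaturation curves of Fig. 2 can be sketched as follows. The code assumes the linear extrapolation model, ∆G([urea]) = ∆Gu − m[urea], and a unimolecular transition for the monomeric single-chain protein. The ∆Gu values of 4.5 and 8.4 kcal/mol are taken from the text, whereas the m value of 1.4 kcal/mol·M is a placeholder chosen for illustration and is not a fitted parameter from this work; the function names are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def fraction_unfolded(urea_molar, dg_water, m_value, temp_k=298.15):
    """Fraction unfolded for a unimolecular two-state transition.

    Assumes dG(urea) = dg_water - m_value * urea_molar (linear extrapolation).
    """
    dg = dg_water - m_value * urea_molar
    k_unfold = math.exp(-dg / (R_KCAL * temp_k))  # equilibrium constant U/N
    return k_unfold / (1.0 + k_unfold)


def midpoint(dg_water, m_value):
    """Urea concentration at which half of the protein is unfolded."""
    return dg_water / m_value


M_ASSUMED = 1.4  # kcal/(mol*M); illustrative placeholder, not a fitted value

for label, dg in [("plateau (long linkers)", 4.5), ("Arc-LL19-Arc", 8.4)]:
    cm = midpoint(dg, M_ASSUMED)
    curve = [fraction_unfolded(c, dg, M_ASSUMED) for c in (0.0, 2.0, 4.0, 6.0, 8.0)]
    print(f"{label}: midpoint ~ {cm:.1f} M urea;",
          "fu at 0,2,4,6,8 M =", [round(f, 3) for f in curve])
```

Because the midpoint equals ∆Gu/m in this model, curves with similar m values but different ∆Gu are shifted along the urea axis while remaining roughly parallel, which is the behavior described for Fig. 2.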
FIG. 3. Properties of linker-length variants of single-chain Arc. (A) Linker sequences. (B) Equilibrium stability and effective concentration vary with linker length. Error bars indicate one SD from three independent experiments. (C) Folding rates in 2 M urea. (D) Unfolding rates in 7 M urea. Experimental conditions: protein 1–10 µM, 25°C, 50 mM Tris·HCl (pH 7.5), 250 mM KCl, and 0.1 mM EDTA.

The refolding rate (kf) in 2 M urea has a maximum value of 1,000 s⁻¹ for the Arc-LL13-Arc protein. Decreasing the linker by four residues to a length of nine causes a 30-fold decrease in the folding rate. As the linker length is increased from 13 to 47 residues, the refolding rate also decreases. Over this range, there is a roughly exponential decrease in kf that spans nearly four orders of magnitude. Little change in kf is seen for linkers between 47 and 59 residues. These results show that linker length can have large effects on the free energy difference between the denatured state and the transition state. Moreover, the length optima for equilibrium stability (19 residues), refolding (13 residues), and unfolding (19–59 residues) are different. The 19-residue linker provides the greatest equilibrium stability because it is the best compromise between reasonably fast refolding and slow unfolding.

Effects of Linker Composition. To assess the effects of varying the number of glycines in the linker, the length of the linker was fixed at 19 residues and 16 internal positions were randomized between Ala and Gly (ALX library) or between Ser and Gly (SLX library) by using the strategy described in Materials and Methods. For these experiments, the libraries were first selected for Arc repressor activity in vivo and then the sequences of individual members were determined. Sixteen proteins comprise the ALX library; the linkers in these proteins contain from 3 to 15 alanines (Fig. 4A). Ten proteins, with 3–11 serines in the linker region, comprise the SLX library (Fig. 5A).
All of the Arc-ALX-Arc and Arc-SLX-Arc proteins were expressed at high levels, were purified, and had CD and fluorescence spectra similar to wild-type Arc. In the ALX library, variants with eight or more linker alanines showed some tendency to aggregate during purification and handling but were monomeric at concentrations of 1–20 µM as judged by analytical ultracentrifugation and the concentration independence of equilibrium stability and refolding rates. All other proteins in the ALX and SLX libraries were highly soluble.

The number of non-glycine residues in the 19-residue linker has a significant effect on the equilibrium stability of proteins in both the ALX and SLX libraries, as determined by urea denaturation. In the ALX library (Fig. 4 A and B), Arc-AL11-Arc, which contains 11 alanines and 5 glycines in the randomized portion of the linker, has the maximum stability (∆Gu ≈ 11 kcal/mol; Ceff ≈ 8 M). Arc-AL3-Arc, with 3 alanines and 13 glycines in the randomized region of the linker, is far less stable (∆Gu ≈ 3 kcal/mol; Ceff ≈ 10 µM), suggesting that too much linker flexibility is detrimental to stability. Fig. 4B shows, however, that stability also decreases when the number of alanines is increased past the optimum value of 11, indicating that linkers that are too inflexible also limit protein stability. The same general trends are observed in the SLX library; proteins with too many or too few glycines are significantly less stable than Arc-SL7-Arc (∆Gu ≈ 7 kcal/mol; Ceff ≈ 7 mM). There are, however, two significant differences between the ALX and SLX results. Maximum stability occurs for a protein containing nine glycines in the randomized portion of the linker in the SLX library but for a protein containing only five glycines in this region in the ALX library. Moreover, the stabilities of the most stable variants in each library also differ significantly; Arc-AL11-Arc has an effective concentration that is 1,000-fold greater than that of Arc-SL7-Arc.
FIG. 4. Properties of ALX variants with 19-residue linkers differing in Ala/Gly composition. (A) Linker sequences. (B) Equilibrium stability and effective concentration vary with number of alanines. For compositional isomers, closed and open symbols represent “a” and “b” variants, respectively. Error bars indicate one SD from three independent experiments. (C) Folding rates in 4.5 M urea. (D) Unfolding rates in 9.1 M urea. See Fig. 3 for conditions.

FIG. 5. Properties of SLX variants with 19-residue linkers differing in Ser/Gly composition. (A) Linker sequences. (B) Equilibrium stability and effective concentration vary with number of serines. For compositional isomers, closed and open symbols represent “a” and “b” variants, respectively. Error bars indicate one SD from three independent experiments. (C) Folding rates in 2.25 M urea. (D) Unfolding rates in 9.1 M urea. See Fig. 3 for conditions.

We interpret these differences as indicating that the identity of the non-glycine residues in the linker is as important as the number of these residues in determining stability. By contrast, the positions of the glycine and non-glycine residues in the randomized portion of the linker seem to be unimportant. Five pairs of variants in the ALX library and three pairs in the SLX library have the same composition but different sequences. In each of these cases, the stabilities of these variants (indicated by open and closed symbols in Figs. 4B and 5B) were found to be within experimental error.

Another significant difference between the ALX and SLX libraries is observed in the unfolding kinetics (Figs. 4D and 5D). In the ALX library, the unfolding rate of different variants changes only by a factor of 20. In the SLX library, the unfolding rates change by >1,000-fold. In addition, the shapes of these plots are very different. The ALX data are concave upward, with a minimum occurring for the protein with seven alanines and nine glycines in the randomized portion of the linker. In the SLX library, by contrast, ku decreases exponentially with the number of serines.

The rate constants for refolding in the ALX library change by more than five orders of magnitude, reaching a maximum for variants with 11 or 12 alanines in the randomized part of the linker (Fig. 4C). Because changes in the unfolding rate are small for the ALX proteins, the changes in equilibrium stability arise almost exclusively from changes in the refolding rate. In the SLX library, variants differ over a 300-fold range in refolding rates with a maximum between four and seven serines. Because much larger changes are seen in the unfolding rates, the changes in equilibrium stability for the SLX proteins are dominated by the changes in unfolding kinetics. These results emphasize once again that the chemical identity of the non-glycine residues in the linker can have a profound effect on the biophysical properties of the single-chain proteins.
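For a two-state system, the equilibrium constant for folding equals kf/ku when both rate constants refer to the same conditions, so a fold change in either rate maps onto a free energy change of RT ln(fold change). The sketch below applies this standard relation to round fold changes of the size reported here. It is illustrative only, because the folding and unfolding rates in this study were measured at different urea concentrations; the function name and the temperature of 298.15 K are assumptions.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def ddg_from_fold_change(fold_change, temp_k=298.15):
    """Free energy (kcal/mol) equivalent to a fold change in a rate constant."""
    return R_KCAL * temp_k * math.log(fold_change)


# Round fold changes of the magnitude quoted for the ALX and SLX libraries.
for label, fold in [("20-fold", 20), ("300-fold", 300), ("1,000-fold", 1_000)]:
    print(f"{label:>10}: {ddg_from_fold_change(fold):.1f} kcal/mol")
```

At 25°C each 10-fold change in a rate constant corresponds to about 1.4 kcal/mol, which gives a sense of how changes of several orders of magnitude in kf or ku can account for stability differences of several kcal/mol.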
DISCUSSION Linker length and composition exert a surprisingly large influence on the stability of single-chain Arc repressor. In the LLX linker length library, the most stable protein has a linker of 19 residues, and adding or deleting a few amino acids decreases stability (Fig. 3B). These length effects on stability arise from changes in the folding and unfolding rates. In the regime from 59 to 13 residues, shortening the linker accelerates folding. This observation is explained most simply if the denatured subunit domains are constrained to smaller and smaller regions of conformation space by shorter linkers and thus require less random sampling before the collisions essential for folding occur. We note, however, that the length dependence of the stability of single-chain Arc variants in this regime is significantly steeper than for loop-length variants of single-chain Rop (29) and is modeled poorly by simple, random-walk, entropic considerations (30). As the linker length decreases from 13 to 11 to 9 residues, there is a decrease in the folding rate of the corresponding Arc-LLX-Arc protein. At some point, the linkers must become too short to connect the subunits in the native conformation without strain. In fact, in the linker length regime from 19 to 9 residues, the unfolding rates of the corresponding Arc-LLX-Arc proteins increase exponentially as the linkers become shorter, suggesting that shorter tethers in this length range introduce more and more strain into the native structure. Presumably, proteins with the LL17, LL15, and LL13 linkers do not show decreased folding rates because of compensating changes in conformational search efficiency.

Glycine is generally used in designed linkers because the absence of a β-carbon permits the polypeptide backbone to access dihedral angles that are energetically forbidden for other amino acids (31).
Thus, a glycine-rich linker will be more flexible than a linker of comparable length composed of non-glycine residues. Our results, however, indicate that too much linker flexibility is detrimental to single-chain protein stability. In the ALX (alanine/glycine) library, maximum stability was observed when the 16-residue randomized region contained 11 alanines and 5 glycines. In the SLX (serine/glycine) library, the most stable protein had seven serines and nine glycines in the randomized portion of the linker. In both libraries, plots of stability vs. the number of non-glycine residues are relatively regular, and proteins with the same linker compositions have comparable stabilities (Figs. 4B and 5B). Both observations suggest that it is the composition rather than the sequence of the linker that is important in determining stability. A single exception to this generalization is provided by Arc-LL19-Arc and Arc-SL3-Arc, which have the same composition but stabilities differing by 3.4 kcal/mol. The first three residues of the linker are Gly-Thr-Ser in Arc-SL3-Arc, which has lower stability, and Gly-Gly-Gly in Arc-LL19-Arc, suggesting that the conformational flexibility imparted by glycine may be important at the junction between the C terminus of the first subunit and the N terminus of the linker.

In the ALX library, the main effects of alanine composition on stability result from changes in the refolding rate. For example, as the number of alanines in the linker increases from 3 to 11, the folding rates of the corresponding proteins increase by 30,000-fold. Alanine restricts the number of allowed conformations of the linker compared with glycine and, in this length regime, probably accelerates the conformational search that occurs during folding. Increasing the number of alanines to 14 or 15 then reduces the folding rate, probably because these linkers become too inflexible.
When serine is substituted for glycine, there are also effects on the refolding rate but with several differences: the optimal number of serines is smaller than the optimal number of alanines (7 Ser vs. 11 Ala), the difference between the fastest and slowest folders is smaller (2,000-fold for SLX vs. 30,000-fold for ALX), and the maximum folding rates are different (in 2.25 M urea, the fastest ALX protein folds 250 times faster than the fastest SLX protein). Clearly, alanine and serine affect linker flexibility in rather different ways. Large differences between alanine and serine are also apparent when comparing effects on the unfolding rate. As the number of serines in the linker increases, the unfolding rate continues to decrease over a 5,000-fold range (Fig. 5D). By contrast, in the alanine library, the minimum unfolding rate is observed for a protein with seven alanines and the total change between the slowest and fastest unfolders is only 15-fold. We presume that the ability of serine to form hydrogen bonds allows formation of new stabilizing interactions in the native state, but whether these interactions are within the linker or involve interactions between the linker and the body of the single-chain protein is unknown. Because alanines in the linker primarily affect folding rates whereas serine has the largest effects on unfolding rates, it seems possible that optimizing the composition of Gly, Ser, and Ala in a linker library might produce single-chain molecules with even greater stabilities than those described here. Preliminary studies also suggest that the effects of length and composition may be interdependent. For example, linkers of different lengths may have different optimal compositions. Variations in linker length or composition caused no significant changes in repressor activity in vivo except in proteins with linkers shorter than 11 residues.
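The fold changes in folding and unfolding rates quoted in this Discussion can be put on a common free-energy scale with the relation ΔΔG = RT ln(ratio). The following is only a back-of-the-envelope sketch: the temperature of 298.15 K is an assumption made for illustration (it is not taken from the experiments), and the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)
T_ASSUMED = 298.15    # K; assumed for illustration, not stated in the text


def ddg_kcal(ratio, temperature=T_ASSUMED):
    """Free-energy difference (kcal/mol) equivalent to a fold change in a
    rate constant or equilibrium constant: ddG = R*T*ln(ratio)."""
    return R_KCAL * temperature * math.log(ratio)


def fold_change(ddg, temperature=T_ASSUMED):
    """Inverse relation: fold change equivalent to a free-energy difference."""
    return math.exp(ddg / (R_KCAL * temperature))


if __name__ == "__main__":
    # Fold changes as quoted in the Discussion.
    quoted = [
        ("ALX folding-rate range", 30000.0),
        ("SLX folding-rate range", 2000.0),
        ("SLX unfolding-rate range", 5000.0),
        ("ALX unfolding-rate range", 15.0),
    ]
    for label, ratio in quoted:
        print(f"{label}: {ratio:g}-fold ~ {ddg_kcal(ratio):.1f} kcal/mol")
    # The 3.4 kcal/mol difference between Arc-LL19-Arc and Arc-SL3-Arc,
    # expressed as a ratio of equilibrium constants.
    print(f"3.4 kcal/mol ~ {fold_change(3.4):.0f}-fold")
```

Under this assumption a 30,000-fold change in rate corresponds to roughly 6 kcal/mol in barrier height, which gives a sense of scale for the 3.4 kcal/mol stability difference noted for Arc-LL19-Arc and Arc-SL3-Arc.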
In gel mobility-shift assays, Arc-LL19-Arc and Arc-LA11-Arc, which have 19-residue linkers, bound operator DNA as strongly as wild-type Arc dimers (data not shown). In earlier work, however, we found that Arc-L1-Arc (which is identical to Arc-LL15-Arc) had a 10-fold enhanced affinity for operator DNA (9, 26). In single-chain Arc, the linker connects the N-terminal arm of the second subunit to the C terminus of the first subunit; in wild-type Arc, this N-terminal arm is disordered in solution (32) but folds against the operator in the protein-DNA complex (33). The L1/LL15 linker may increase operator affinity by helping to restrict the conformation of the arm in solution, thereby reducing the entropic penalty for ordering the arm upon DNA binding (9). By this model, lengthening the linker to 19 residues probably reduces constraints on the arm conformation.
In summary, we find that changes in linker length and composition can produce substantial changes in the stability and folding kinetics of single-chain Arc. Poly-glycine linkers maximize the conformational freedom of the polypeptide backbone but do not result in optimal stability. For single-chain or hybrid protein designs that have folding problems, alterations in linker length and/or composition should provide a useful method for increasing stability.
We thank David Goldenberg for helpful discussions. This work was supported by a National Institutes of Health postdoctoral fellowship (to C.R.R.) and by National Institutes of Health Grant AI-15706 (to R.T.S.).
1. Bird, R.E., Hardman, K.D., Jacobson, J.W., Johnson, S., Kaufman, B.M., Lee, S.-M., Lee, T., Pope, S.H., Riordan, G.S. & Whitlow, M. (1988) Science 242, 423–426.
2. Pomerantz, J.L., Sharp, P.A. & Pabo, C.O. (1995) Science 267, 93–96.
3. Predki, P.F. & Regan, L. (1995) Biochem. 34, 9834–9839.
4. Hallewell, R.A., Laria, I., Tabrizi, A., Carlin, C., Getzoff, E.D., Tainer, J.A., Cousens, L.S. & Mullenbach, G.T. (1989) J. Biol. Chem. 264, 5260–5268.
5. Bizub, D., Weber, I.T., Cameron, C.E., Leis, J.P. & Skalka, A.M. (1991) J. Biol. Chem. 266, 4951–4958.
6. Kim, S.-H., Kang, C.-H., Kim, R., Cho, J.M., Lee, Y.-B. & Lee, T.-K. (1989) Protein Eng. 2, 571–575.
7. Liang, H., Sandberg, W.S. & Terwilliger, T.C. (1993) Proc. Natl. Acad. Sci. USA 90, 7010–7014.
8. Toth, M.J. & Schimmel, P. (1986) J. Biol. Chem. 261, 6643–6646.
9. Robinson, C.R. & Sauer, R.T. (1996) Biochem. 35, 109–116.
10. O’Shea, E.K., Rutkowski, R. & Kim, P.S. (1992) Cell 68, 699–708.
11. Pantoliano, M.W., Bird, R.E., Johnson, S., Asel, E.D., Dodd, S.W., Wood, J.F. & Hardman, K.D. (1991) Biochem. 30, 10117–10125.
12. Mallender, W.D. & Voss, E.W., Jr. (1994) J. Biol. Chem. 269, 199–206.
13. Rumbley, C.A., Denzin, L.K., Yantz, L., Tetin, S.Y. & Voss, E.W., Jr. (1993) J. Biol. Chem. 268, 13667–13674.
14. Stemmer, W.P., Morris, S.K. & Wilson, B.S. (1993) BioTechniques 14, 256–265.
15. Lieschke, G.J., Rao, P.K., Gately, M.K. & Mulligan, R.C. (1997) Nat. Biotech. 15, 35–40.
16. Eustance, R.J. & Schleif, R.F. (1996) J. Bacteriol. 178, 7025–7030.
17. Govindaraj, S. & Poulos, T.L. (1996) Protein Sci. 5, 1389–1393.
18. Kortt, A.A., Lah, M., Oddie, G.W., Gruen, C.L., Burns, J.E., Pearce, L.A., Atwell, J.L., McCoy, A.J., Howlett, G.J., Metzger, D.W., et al. (1997) Protein Eng. 10, 423–433.
19. Whitlow, M., Bell, B.A., Feng, S.-L., Filpula, D., Hardman, K.D., Hubert, S.L., Rollence, M.L., Wood, J.F., Schott, M.E., Milenic, D.E., et al. (1993) Protein Eng. 6, 989–995.
20. Deonarain, M.P., Rowlinson-Busza, G., George, A.J.T. & Epenetos, A.A. (1997) Protein Eng. 10, 89–98.
21. Tang, Y., Jiang, N., Parakh, C. & Hilvert, D. (1996) J. Biol. Chem. 271, 15682–15686.
22. Newton, D.L., Xue, Y., Olson, K.A., Fett, J.W. & Rybak, S.M. (1996) Biochem. 35, 545–553.
23. Huston, J.S., McCartney, J., Tai, M.-S., Mottola-Hartshorn, C., Jin, D., Warren, F., Keck, P. & Oppermann, H. (1993) Int. Rev. Immunol. 10, 195–217.
24. Bowie, J.U. & Sauer, R.T. (1989) Proc. Natl. Acad. Sci. USA 86, 2152–2156.
25. Milla, M.E., Brown, B.M. & Sauer, R.T. (1993) Protein Sci. 2, 2198–2205.
26. Robinson, C.R. & Sauer, R.T. (1996) Biochem. 35, 13878–13884.
27. Poljak, R.J. (1994) Structure 2, 1121–1123.
28. Perisic, O., Webb, P.A., Holliger, P., Winter, G. & Williams, R.L. (1994) Structure 2, 1217–1226.
29. Nagi, A.D. & Regan, L. (1997) Fold. Des. 2, 67–75.
30. Chan, H.S. & Dill, K.A. (1988) J. Chem. Phys. 90, 492–509.
31. Ramachandran, G.N. & Sasisekharan, V. (1968) Adv. Protein Chem. 23, 283–437.
32. Breg, J.N., van Opheusden, J.H.J., Burgering, M.J.M., Roelens, R. & Kaptein, R. (1990) Nature (London) 346, 586–589.
33. Raumann, B.E., Rould, M.A., Pabo, C.O. & Sauer, R.T. (1994) Nature (London) 367, 754–757.
34. Kraulis, P.J. (1991) J. Appl. Cryst. 24, 946–950.
Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5935–5941, May 1998
Colloquium Paper
This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Architecture and mechanism of the light-harvesting apparatus of purple bacteria
XICHE HU, ANA DAMJANOVIĆ, THORSTEN RITZ, AND KLAUS SCHULTEN
Beckman Institute and Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801
ABSTRACT Photosynthetic organisms fuel their metabolism with light energy and have developed for this purpose an efficient apparatus for harvesting sunlight. The atomic structure of the apparatus, as it evolved in purple bacteria, has been constructed through a combination of x-ray crystallography, electron microscopy, and modeling. The detailed structure and overall architecture reveal a hierarchical aggregate of pigments that utilizes, as shown through femtosecond spectroscopy and quantum physics, elegant and efficient mechanisms for primary light absorption and transfer of electronic excitation toward the photosynthetic reaction center.
The prevalent color green in Earth’s biosphere is testimony to the important role that chlorophylls play in harnessing the energy of the Sun to fuel the metabolism of photosynthetic life forms. Chlorophylls are assisted in their light-harvesting role by carotenoids, also widely known through their coloration of petals and fruits in plants. Photosynthetic organisms have evolved intricate aggregates of chlorophylls and carotenoids for efficient light harvesting and exploit in subtle ways the laws of quantum mechanics. This role of chlorophylls and carotenoids has emerged in full detail only recently, when the atomic structures of proteins involved in bacterial photosynthetic light harvesting were solved by a combination of x-ray crystallography, electron microscopy, and molecular modeling. However, the conceptual foundation for our present understanding of light harvesting was laid long ago, when Emerson and Arnold demonstrated that it required hundreds of chlorophylls to reduce one molecule of CO2 under saturating flash light intensity (1, 2).
To explain the cooperative action of these chlorophylls, Emerson and Arnold postulated that only very few chlorophylls in the primary reaction site, termed the photosynthetic reaction center (RC), directly take part in photochemical reactions; most chlorophylls serve as light-harvesting antennae by capturing the sunlight and funneling electronic excitation toward the RC. This notion gave rise to the definition of the photosynthetic unit (PSU) as an ensemble of an RC with associated light-harvesting complexes containing up to 250 chlorophylls, and became widely accepted only when Duysens carried out a critical experiment in which energy transfer between different chlorophylls was observed (3). A wealth of accumulated evidence proves that the organization of PSUs, to surround an RC with aggregates of chlorophylls and associated carotenoids, is universal in both photosynthetic bacteria and higher plants (2, 4–6). Of the known photosynthetic systems, the PSU of purple bacteria is the most studied and best characterized. Fig. 1 depicts schematically the intracytoplasmic membrane of purple bacteria with its primary photosynthetic apparatus. In the PSU, an array of light-harvesting complexes captures light and transfers the excitation energy to the photosynthetic RC. This article focuses on the primary processes of light harvesting and electronic excitation transfer that occur in the PSU, and describes the role of molecular modeling in elucidating the underlying mechanisms. In most purple bacteria, the photosynthetic membranes contain two types of light-harvesting complexes, light-harvesting complex I (LH-I) and light-harvesting complex II (LH-II) (7). LH-I is found surrounding directly the RCs (8, 9), whereas LH-II is not in direct contact with the RC but transfers energy to the RC through LH-I (10, 11). For some bacteria, such as Rhodopseudomonas (Rps.) acidophila and Rhodospirillum (Rs.) 
molischianum strain DSM 120 (12), there exists a third type of light-harvesting complex, LH-III. A 1:1 stoichiometry exists between the RC and LH-I (9); the number of LH-IIs and LH-IIIs varies according to growth conditions such as light intensity and temperature (13). Purple bacteria absorb light in a spectral region complementary to that of plants and algae, mainly at wavelengths of about 500 nm through carotenoids and above 800 nm through bacteriochlorophylls (BChls). Fig. 2 shows the energy levels for the key electronic excitations in the PSU. There exists a pronounced energetic hierarchy in the light-harvesting system: LH-III absorbs light at the highest energy (800 and 820 nm); the LH-II complex, which surrounds LH-I, absorbs maximally at 800 nm and 850 nm; and LH-I, which in turn surrounds the RC, absorbs at a lower energy (875 nm) (11). The energy cascade serves to funnel electronic excitations from the LH-IIIs and LH-IIs through LH-I to the RC. Time resolved picosecond and femtosecond spectroscopy revealed that excitation transfer within the PSU occurs on a subpicosecond time scale and at near unit (95%) efficiency (14, 15). Today, structures of the major components of the bacterial photosynthetic apparatus are available at atomic resolution. Structures of the RC are known for Rps. viridis (16) as well as for Rhodobacter (Rb.) sphaeroides (17). Recently, high resolution crystal structures of LH-II have been determined for Rps. acidophila (18) and for Rs. molischianum (19). Based on a high degree of homology of the αβ-heterodimer of LH-I from Rb. sphaeroides to that of LH-II of Rs. molischianum (12, 20), an atomic structure for LH-I of Rb. sphaeroides has been modeled (21).
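The energetic hierarchy described above is easier to see when the absorption wavelengths are converted to energies. The sketch below does this conversion for the wavelengths quoted in the text; the labels and function names are illustrative choices, and nothing beyond the wavelengths themselves is taken from the article.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm

# Absorption wavelengths (nm) as quoted in the text; labels are illustrative.
ABSORPTION_NM = [
    ("carotenoids", 500.0),
    ("LH-III and LH-II, 800 nm band", 800.0),
    ("LH-III, 820 nm band", 820.0),
    ("LH-II, 850 nm band", 850.0),
    ("LH-I, 875 nm band", 875.0),
]


def wavenumber_cm1(wavelength_nm):
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    return 1.0e7 / wavelength_nm


def photon_energy_ev(wavelength_nm):
    """Convert a wavelength in nm to a photon energy in eV."""
    return HC_EV_NM / wavelength_nm


if __name__ == "__main__":
    for label, nm in ABSORPTION_NM:
        print(f"{label:32s} {nm:5.0f} nm  "
              f"{wavenumber_cm1(nm):8.0f} cm^-1  {photon_energy_ev(nm):.3f} eV")
```

Listed in this order, the energies decrease monotonically from the carotenoid band to the 875 nm band of LH-I, which is the downhill direction of the funnel toward the RC.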
Structure of Light-Harvesting Complexes
Accordingly, a structural model for the bacterial PSU has been established and consists of LH-IIs, LH-I, and the RC; this model provides detailed knowledge of the organization of
© 1998 by The National Academy of Sciences 0027–8424/98/955935–7$2.00/0 PNAS is available online at http://www.pnas.org.
Abbreviations: RC, reaction center; PSU, photosynthetic unit; LH-I and LH-II, light-harvesting complexes I and II; BChl, bacteriochlorophyll; PBS, phycobilisome; PCP, peridinin-chlorophyll-protein.
chromophores in the photosynthetic membrane and opens a door to the study of excitation transfer in the PSU based on a priori principles.
FIG. 1. Schematic representation of the photosynthetic apparatus in the intracytoplasmic membrane of purple bacteria. The RC (red) is surrounded by the light-harvesting complex I (LH-I, green) to form the LH-I-RC complex, which is surrounded by multiple light-harvesting complexes LH-II (green), forming altogether the PSU. Photons are absorbed by the light-harvesting complexes and excitation is transferred to the RC initiating a charge (electron-hole) separation. The RC binds quinone QB, reduces it to hydroquinone QBH2, and releases the latter. QBH2 is oxidized by the bc1 complex, which uses the exothermic reaction to pump protons across the membrane; electrons are shuttled back to the RC by the cytochrome c2 complex (blue) from the ubiquinone-cytochrome bc1 complex (yellow). The electron transfer across the membrane produces a large proton gradient that drives the synthesis of ATP from ADP by the ATPase (orange). Electron flow is represented in blue, proton flow in red, and quinone flow, likely confined to the intramembrane space, in black.
LH-II. The structure of LH-II from Rs. molischianum had been determined to 2.4 Å resolution (19) and is shown in Fig. 3a. The complex is an octameric aggregate of αβ-heterodimers; the latter contains a pair of short peptides (α- and β-apoproteins) noncovalently binding three BChl a molecules and one lycopene (a specific type of carotenoid). Presumably, there exists a second lycopene for each αβ-heterodimer. The electron density map indeed contains a stretch of assignable density, but the stretch is not long enough to positively resolve the entire lycopene (19). Two concentric cylinders of α-helices, with the α-apoproteins inside and the β-apoproteins outside, form a scaffold for BChls and lycopenes. Fig. 3b depicts the 24 BChl molecules and 8 lycopene molecules in LH-II with all other components stripped away.
Sixteen B850 BChl molecules form a continuous overlapping ring of 23 Å radius (based on central Mg atoms of BChls) with each BChl oriented perpendicular to the membrane plane. The Mg–Mg distance between neighboring B850 BChls is 9.2 Å within an αβ-heterodimer (B850a and B850b) and 8.9 Å between adjacent heterodimers. Eight B800 BChls, forming another ring of 28 Å radius, are arranged with their tetrapyrrole rings nearly parallel to the membrane plane and exhibit a Mg–Mg distance of 22 Å between neighboring BChls, i.e., the B800 BChls are coupled only weakly. The ligation sites for the B850 BChls are α-His-34 and β-His-35, and the B800 BChls ligate to α-Asp-6. Eight lycopene molecules span the transmembrane region; each makes contact with a B800 BChl and a B850a BChl.
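The ring radii and neighbor distances quoted above can be cross-checked with elementary geometry. The sketch below idealizes each ring as a regular polygon with evenly spaced Mg atoms, which is an assumption made here for illustration (the B850 ring in particular alternates slightly between two spacings); the function names are hypothetical.

```python
import math


def neighbor_distance(ring_radius, n_sites):
    """Chord length between adjacent sites evenly spaced on a circle."""
    return 2.0 * ring_radius * math.sin(math.pi / n_sites)


def radius_from_distance(neighbor_dist, n_sites):
    """Inverse relation: ring radius implied by an even neighbor spacing."""
    return neighbor_dist / (2.0 * math.sin(math.pi / n_sites))


if __name__ == "__main__":
    # Radii (in angstroms) and site counts as quoted in the text.
    print(f"B850 ring, 16 sites, R = 23 A: {neighbor_distance(23.0, 16):.2f} A")
    print(f"B800 ring,  8 sites, R = 28 A: {neighbor_distance(28.0, 8):.2f} A")
```

The idealized spacings come out at about 9.0 Å for the B850 ring and about 21.4 Å for the B800 ring, close to the quoted 9.2/8.9 Å alternation and the 22 Å B800 spacing, so the reported radii and distances are mutually consistent within rounding.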
FIG. 2. Energy levels of the electronic excitations in the PSU of BChl a containing purple bacteria. The diagram illustrates a funneling of excitation energy toward the photosynthetic RC. The dashed lines indicate (vertical) intracomplex excitation transfer, and the solid lines (diagonal) indicate intercomplex excitation transfer. LH-I exists in all purple bacteria; LH-II exists in most species; LH-III arises in certain species only.
FIG. 3. The octameric LH-II complex from Rs. molischianum (19). (a) The α-helical segments are represented as cylinders with the α-apoproteins (inside) in blue and the β-apoproteins (outside) in magenta. The BChl molecules are in green with phytyl tails truncated for clarity. The lycopenes are in yellow. (b) Arrangement of chromophores with BChls represented as squares, and with carotenoids (lycopenes) in a licorice representation. Bars connected with the BChls represent the Qy transition dipole moments as defined by the vector connecting the N atom of pyrrole I and the N atom of pyrrole III (22). Representative distances between central Mg atoms of B800 BChl and B850 BChl are given in Å. The B850 BChls bound to the α-apoprotein and the β-apoprotein are denoted as B850a and B850b, respectively; BChl B850a′ is bound to the (left) neighboring heterodimer.
It is remarkable that LH-II results from the self-aggregation of a large number of identical, noncovalently bonded transmembrane helices, BChls, and carotenoids. With its simple, symmetric architecture, LH-II constitutes an ideal model system for studying aggregate formation and adhesive interactions of proteins. Mechanical models reveal perfect self-complementarity of the αβ-heterodimers that interlock with each other to form a circular aggregate (23).
LH-I-RC Complex. LH-I of Rb. sphaeroides has been modeled in ref. 21 as a hexadecamer of αβ-heterodimers; the modeling exploited a close homology of these heterodimers to those of LH-II of Rs. molischianum. The resulting LH-I structure yields an electron density projection map that is in agreement with an 8.5 Å resolution electron microscopy projection map for the highly homologous LH-I of Rs. rubrum (24). The LH-I complex contains a ring of 32 BChls referred to as B875 BChls according to their main absorption band. The Mg–Mg distance between neighboring B875 BChls is 9.2 Å within the αβ-heterodimer and 9.3 Å between neighboring heterodimers. The modeled LH-I has been docked to the photosynthetic RC of Rb. sphaeroides by means of a constrained conformational search (21), employing for the RC the structure reported in ref. 17. Fig. 4a presents the LH-I-RC complex. The arrangement of the BChls in the LH-I-RC complex is depicted in Fig. 4b. One can discern the ring of B875 BChls of LH-I that surrounds the RC special pair (PA and PB) and the so-called accessory BChls (BA, BB). The closest distance between the central Mg atom of the RC’s special pair (BChls PA, PB) and the Mg atom of the BChls in LH-I is 42.6 Å. The distance between the Mg atom of the accessory BChls (BA, BB) and the LH-I BChls is shorter, the nearest distance measuring 35.7 Å. Rb. sphaeroides contains an additional PufX gene of unknown function. It has been suggested that the PufX protein may substitute for one or more αβ-heterodimers of LH-I to open up the circular ring shown in Fig. 4a and thereby to facilitate the flow of quinones (QB/QBH2) between the RC and the cytochrome bc1 complex (see Fig. 1) (4, 9).
FIG. 4. Structure of the LH-I-RC complex. (a) Side view of the LH-I-RC complex with three LH-I αβ-heterodimers on the front side removed to expose the RC in the interior. The α-helices are represented as cylinders with the L, M, and H subunits of the RC in yellow, red, and gray, and the α-apoprotein and the β-apoprotein of the LH-I in blue and magenta. BChls and bacteriopheophytins are represented as green and yellow squares, respectively. Carotenoids (spheroidenes) are in a yellow licorice representation, and quinone QB is rendered by gray van der Waals spheres. QB shuttles in and out (as QBH2) of the LH-I-RC complex as indicated in Fig. 1. (b) Arrangement of BChls in the LH-I-RC complex. The BChls are represented as squares with B875 BChls of LH-I in green, and the special pair (PA and PB) and the accessory BChls (BA and BB) of the RC in red and blue, respectively; cyan bars represent Qy transition moments of BChls. [Produced with the program VMD (25)].
The PSU. Fig. 5 presents a model of the PSU for Rb. sphaeroides. Only three LH-IIs are shown. The actual photosynthetic apparatus can contain up to about 10 LH-IIs around each LH-I. Because electron microscopy observations suggest that LH-II of Rb. sphaeroides contains nine αβ-heterodimers (J. Olsen, personal communication), instead of eight as in LH-II of Rs. molischianum, LH-II of Rb. sphaeroides, as shown in Fig. 5, has been constructed as a nonamer of αβ-heterodimers by means of homology modeling by using the αβ-heterodimer of LH-II from Rps. acidophila as a template. For this purpose, the modeling protocol developed and applied successfully in refs. 19–21 was used. Two essential features of the pigment organization of the PSU, as depicted in Fig. 5, are (i) the ring-like aggregates of tightly coupled BChls within LH-I and LH-II, and (ii) the coplanar arrangement of these BChls and of the BChls in the RC. Analysis of the LH-I and LH-II structures as reported in refs.
21 and 26 indicates that each BChl of the B850 ring of LH-II and of the B875 ring of LH-I is noncovalently bound to three side-chain atoms of the α- or β-apoprotein such that the BChls are held in a rigid orientation. The planar organization of the BChls in the PSU is optimal for the transfer of electronic excitation to the RC.
Mechanisms of Excitation Transfer
Photosynthetic bacteria evolved a pronounced energetic hierarchy in the light-harvesting system. The hierarchy, as shown in Fig. 2, furnishes a cascade-like system of excited states that funnels electronic excitation from the outer LH-IIs through LH-I to the RC. The excitation transfer cascading into the RC involves intracomplex and intercomplex processes, defined as excitation transfer within each pigment-protein complex (LH-II, LH-I, RC) and between pigment-protein complexes (LH-II→LH-II, LH-II→LH-I, LH-I→RC), respectively. Intracomplex transfer, for the main part, occurs faster than intercomplex transfer. We will first discuss intercomplex excitation transfer, and then we will describe intracomplex excitation transfer.
FIG. 5. Arrangement of pigment-protein complexes in the modeled bacterial PSU of Rb. sphaeroides. The α-helices are represented as Cα-tracing tubes with α-apoproteins of both LH-I and LH-II in blue and β-apoproteins in magenta, and the L, M, and H subunits of RC in yellow, red, and gray, respectively. All the BChls are in green, and carotenoids are in yellow. [Produced with the program VMD (25)].
Exciton Migration. One of the most intriguing structural features of the bacterial light-harvesting complexes is the circular organization of BChl aggregates (2). To understand the primary processes of light absorption and the subsequent excitation transfer from LH-IIs, through LH-I, to the RC, it is essential to characterize the electronic properties of the excited states of the circular BChl aggregate. The close proximity of the B850 BChls in LH-II implies strong interactions, leading to coherent superpositions, termed excitons (27, 28), of the lower energy excited states of individual BChls, the Qy states, as demonstrated in INDO-CIS level quantum chemical calculations of the complete circular aggregates of 16 B850 BChls and 8 B800 BChls of LH-II from Rs. molischianum (29). As shown in Fig. 6, two bands of excitons arise, with the band splitting reflecting a weakly dimerized form of the aggregate (the BChl-BChl distances alternate slightly along the ring) (26). Only 2 of the 16 exciton states are optically allowed and thus carry an 8-fold enhanced oscillator strength (superradiance); the lowest excited state is optically forbidden and does not fluoresce, which may allow LH-II to preserve excitation energy, though disorder readily confers oscillator strength on this state (26).
FIG. 6. BChl-carotenoid interactions. (Upper Left) Exciton bands of the circular B850 BChl aggregate as determined by quantum chemical (INDO/S) calculations (29) based on coordinates of the crystal structure of LH-II from Rs. molischianum (19). The degenerate states that carry all the oscillator strength are highlighted by thickened lines. (Upper Right) Excitation energies of BChl and carotenoid states in LH-II of Rb. sphaeroides. Solid lines represent spectroscopically measured energy levels. The dashed line indicates the estimated (see refs. 44 and 47) energy for the optically forbidden S1 state of the carotenoid spheroidene. (Lower) Arrangement of spheroidene and the most proximate BChls based on the modeled structure of LH-II from Rb. sphaeroides. Close contacts between BChl and the carotenoid spheroidene are indicated by representative distances (in angstroms).
Other, less extensive calculations, ranging from an effective Hamiltonian representation based on the point dipole treatment (30, 31) to a point monopole treatment (32), and to the quantum mechanical consistent-force-field/π-electron (QCFF/PI) approach (33), yield a similar exciton band structure but differ in detailed exciton levels and band gaps. According to the INDO-CIS calculation (29), the lowest exciton state is significantly lowered in energy through level repulsion with charge resonance states, resulting in an energy gap ∆ of 422 cm⁻¹ (see Fig. 6). The optically allowed exciton states should then be populated 9% at thermal equilibrium at room temperature. To extend the quantum chemical calculations to LH-I and the complete PSU, an effective Hamiltonian H in the basis of single BChl Qy excitations had been established in ref. 26. The matrix elements of the Hamiltonian describe couplings between neighboring Qy states by $\langle j|H|j+1\rangle$, assuming values of $v_1$ ($v_2$) for odd (even) $j$. The diagonal elements $\langle j|H|j\rangle = \varepsilon$ account for the excitation energy of the Qy state of individual BChls.
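The nearest-neighbor part of the effective Hamiltonian just described (site energy ε on the diagonal, alternating couplings v1 and v2 around the ring) can be diagonalized in closed form. The following is a minimal sketch, not the authors' implementation: it uses the values ε = 13,242 cm⁻¹, v1 = 790 cm⁻¹, and v2 = 369 cm⁻¹ that the article reports for this Hamiltonian, deliberately omits all couplings beyond nearest neighbors, and relies on the standard Bloch solution for a dimerized ring, E = ε ± |v1 + v2·exp(ik)|. All function names are hypothetical, and the result is not expected to reproduce the INDO-CIS levels of Fig. 6.

```python
import cmath
import math

EPS = 13242.0  # cm^-1, Qy site energy quoted in the article
V1 = 790.0     # cm^-1, coupling within an alpha-beta heterodimer
V2 = 369.0     # cm^-1, coupling between neighboring heterodimers
N_DIMERS = 8   # 8 heterodimers -> 16 B850 BChls


def build_hamiltonian(n_dimers=N_DIMERS, eps=EPS, v1=V1, v2=V2):
    """Nearest-neighbor ring Hamiltonian as a list of lists (cm^-1)."""
    n = 2 * n_dimers
    h = [[0.0] * n for _ in range(n)]
    for j in range(n):
        h[j][j] = eps
        k = (j + 1) % n                      # ring closure
        coupling = v1 if j % 2 == 0 else v2  # 0-based even j = 1-based odd j
        h[j][k] = coupling
        h[k][j] = coupling
    return h


def exciton_levels(n_dimers=N_DIMERS, eps=EPS, v1=V1, v2=V2):
    """Closed-form levels eps -/+ |v1 + v2*exp(ik)|, k = 2*pi*m/n_dimers."""
    levels = []
    for m in range(n_dimers):
        k = 2.0 * math.pi * m / n_dimers
        g = abs(v1 + v2 * cmath.exp(1j * k))
        levels.extend([eps - g, eps + g])
    return sorted(levels)


def bloch_residual(m, sign, h):
    """Largest component of |H psi - E psi| for one Bloch state.

    A value near zero confirms that the closed-form level really is an
    eigenvalue of the explicit matrix h.
    """
    n = len(h)
    n_dimers = n // 2
    k = 2.0 * math.pi * m / n_dimers
    f = h[0][1] + h[1][2] * cmath.exp(1j * k)  # v1 + v2*exp(ik)
    energy = h[0][0] + sign * abs(f)
    psi = []
    for cell in range(n_dimers):
        phase = cmath.exp(1j * k * cell)
        psi.extend([phase, sign * f / abs(f) * phase])
    residual = 0.0
    for row in range(n):
        h_psi = sum(h[row][col] * psi[col] for col in range(n))
        residual = max(residual, abs(h_psi - energy * psi[row]))
    return residual


if __name__ == "__main__":
    for level in exciton_levels():
        print(f"{level:10.1f} cm^-1")
```

In this truncated model the 16 levels fall into two bands of eight, with a nondegenerate lowest level at ε − (v1 + v2) = 12,083 cm⁻¹ followed by a degenerate pair. Working in pure Python keeps the sketch dependency-free; a numerical eigensolver would be the natural choice once the longer-range coupling terms are included.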
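All other elements of H are approximated by dipole-dipole coupling terms
\[
W_{jk} = C\left[\frac{\mathbf{d}_j\cdot\mathbf{d}_k}{r_{jk}^{3}} - \frac{3\,(\mathbf{r}_{jk}\cdot\mathbf{d}_j)(\mathbf{r}_{jk}\cdot\mathbf{d}_k)}{r_{jk}^{5}}\right],
\]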
where $\mathbf{d}_j$ are unit vectors describing the direction of the transition dipole moments of the ground state→Qy state transition of the j-th BChl and $\mathbf{r}_{jk}$ is the vector connecting the centers of BChl j and BChl k. The adjustable parameters of the effective Hamiltonian were determined in ref. 29 to reproduce the exciton spectrum in Fig. 6: ε = 13,242 cm⁻¹, v1 = 790 cm⁻¹, v2 = 369 cm⁻¹, and C = 505,644 Å³·cm⁻¹. The effective Hamiltonian was extended in ref. 34 to incorporate two exciton states as they arise in pump-probe spectroscopy (35). The effective Hamiltonian can be applied without further modification to describe the circular aggregate of 32 BChls in LH-I (26). The same characteristics of the exciton bands as in the B850 BChl aggregate of LH-II are found, i.e., the second and the third exciton states carry all the oscillator strength, with the lowest energy excitation state being optically forbidden. The exciton states in LH-II and LH-I are completely delocalized over the ring-like B850 and B875 aggregates because of the assumption of perfect symmetry, i.e., absence of disorder. It is widely believed that the B850 BChl excited states, despite
FIG. 7. Excitation transfer in the bacterial photosynthetic unit. LH-II contains two types of BChls, commonly referred to as B800 (dark blue) and B850 (green), which absorb at 800 nm and 850 nm, respectively. BChls in LH-I absorb at 875 nm and are labeled B875 (green). PA and PB refer to the RC special pair, and BA, BB refer to the accessory BChls in the RC. The figure demonstrates the coplanar arrangement of the B850 BChl ring in LH-II, the B875 BChl ring of LH-I, and the RC BChls PA, PB, BA, BB. [Produced with the program VMD (25)].
natural disorder, are delocalized, but the extent of delocalization has been debated (15). The estimate for the number of coherently coupled BChls ranges from two BChl molecules (36) to the entire length of the B850 BChl aggregate (37). In principle, the relative strengths of the disorder and of the coupling between BChls determine the delocalization length. According to the INDO-CIS calculation (29), the effective coupling between nearest neighbor BChls is 790 cm⁻¹ (v1) within the αβ-heterodimer and 369 cm⁻¹ (v2) between the αβ-heterodimers. The effect of static disorder has been modeled in ref. 26 by randomizing the diagonal elements of an effective Hamiltonian. By using a distribution consistent with the inhomogeneous broadening measured by hole-burning spectroscopy, the effect of diagonal disorder on exciton delocalization was found to be noticeable but small. It has long been observed that excitation transfer LH-II→LH-I→RC occurs in the PSU in less than 100 ps and with about 95% efficiency (14). In this respect, it is interesting to note that the transition dipole moments of the Qy excitations of the B850 and B875 BChls are all oriented in the two-dimensional plane that encompasses the ring-like BChl aggregates of LH-II, LH-I, and the RC special pair, an arrangement that is optimally attuned to the desired flow of electronic excitation LH-II→LH-I→RC. There are many potential pathways for photons to be absorbed and for the subsequent excitations to reach the RC. A path may begin with absorption of an 800 nm photon by one of the B800 BChls in LH-II (see Fig. 7). At least three sequential steps are required for the B800 excitation to be transferred to the RC: B800 (LH-II)→B850 (LH-II)→LH-I→RC. Time-resolved picosecond and femtosecond spectroscopy revealed that the B800→B850 excitation transfer proceeds within about 700 fs (14, 38). Two-color pump-probe femtosecond measurements determined a time constant of 3–5 ps for the B850→LH-I step (39).
The final LH-I→RC transfer step requires about 35 ps (40), i.e., this is the slowest step (11, 14). Intercomplex LH-II→LH-II transfer may occur, but a rate for this process has not yet been determined. The effective Hamiltonian for LH-II, as described above, had been extended in refs. 26 and 34 to describe the exciton system of the entire aggregate shown in Fig. 7. One can determine the transfer rates between the different components, i.e., LH-II→LH-I, and LH-I→RC, by using a perturbation scheme (34). The calculated time constants of 3.3 and 65 ps for the excitation transfer processes LH-II→LH-I and LH-I→RC in Rb. sphaeroides, respectively, are in agreement with experimental values of 3~5 ps and 35 ps (39, 40). A startling result from these calculations has been a suggested role of the accessory BChls as mediators of the excitation transfer from LH-I to the RC special pair: the calculated time for LH-I→RC transfer, in the absence of accessory BChls, is about 600 ps, which is an order of magnitude too long compared with observations; the accessory BChls in RC provide a path for the excitation transfer that bridges the large distance of 42 Å or longer between LHI BChls and the RC special pair. Role of B800 BChls and Carotenoids. B800 BChls absorb light in a slightly higher spectral region than the B850 BChls and are oriented such that they absorb in a direction perpendicular to that of the B850 BChls. Quantum chemical calculations in ref. 29 have demonstrated that the B800 BChls are only weakly coupled with each other and with the B850 BChls. The individual B800 BChls transfer the resulting excitation energy to the B850 ring through the so-called Forster mechanism (41–43). The transfer proceeds within 700 fs (38). Quantum mechanical calculations show that this short transfer time, to a large degree, results through the exciton splitting of the accepting B850 exciton levels shown in Fig. 
6; the exciton splitting greatly improves the resonance of the excitations of B800 and B850 BChls (44). Carotenoids absorb light at 500 nm into a strongly allowed state and transfer the excitation energy within 200 fs and with nearly 100% efficiency to the Qy exciton states of the B850 ring (45). The question arises by which pathways and by which mechanism such an efficient excitation transfer is achieved. Fig. 6 presents also the excitation energies of the spheroidene and BChl states in LH-II of Rb. sphaeroides. Spheroidene features two lowlying singlet excited states. A strongly allowed state absorbing at 500 nm is labeled S2. It decays within <200 fs into an optically forbidden electronic state labeled S1, which has been characterized in refs. 46 and 47. The S1 state is in resonance with the accepting Qy exciton states and, thus, provides a possible gateway for transfer to the B850 ring. The optically forbidden character of the S1 state of spheroidene precludes its coupling to the B850 ring through the Forster mechanism, thus limiting potential mechanisms to coupling through Coulomb interaction including higher-order multipoles (generalized Forster mechanism) or coupling through electron exchange [Dexter mechanism (48)]. The Dexter mechanism requires an overlap of donor and acceptor wave functions and, thus, is only efficient when donor and acceptor are in van der Waals contact. Because spheroidene and BChls are indeed in close contact, as shown in Fig. 6, one is tempted to suggest that the mechanism underlying singlet excitation transfer is electron exchange. Recent calculations (44, 49), however, do not support this assumption. Based on the geometric arrangement of carotenoids and BChls in LH-II (Fig. 6) and on CI expansions of the electronic states of carotenoids and chlorophylls, calculations in ref. 
44 showed that the generalized Förster mechanism governs the transfer of singlet excitations, resulting in a transfer time of 260 fs through the S1 (carotenoid)→B850 (exciton states) pathway. The transfer through the optically forbidden S1 state is strongly accelerated by the splitting of the B850 exciton levels as seen also in the case of the B800→B850 transfer (44). Without the exciton splitting, the calculated transfer time is as slow as 2.5 ps. This suggests that purple bacteria may have evolved the ring structure of LH-II to improve resonance between acceptor and donor systems. In addition to transfer through the forbidden S1 state of spheroidene, the absorbing S2 state of carotenoids is likely to transfer some excitation also directly to the Qx state of BChl as suggested by the calculated transfer time of 330 fs (44) and the shortened (60 fs) in vivo lifetime of the S2 state (50). In addition to the light-harvesting function, carotenoids protect the light-harvesting system from the damaging effect of BChl triplet states that arise with a small, but finite, probability and can generate highly reactive singlet oxygen according to the reaction 3O2+3BChl*→ 1 + BChl. Carotenoids prevent this reaction by quenching the BChl triplet states through triplet excitation transfer from the BChls. This transfer involves a spin change and can only proceed through the electron exchange or Dexter mechanism (48). The triplet excitation transfer in LH-II of Rs. molischianum has been described in detail (44). The calculations showed that B850a and B800 are well protected by one of the eight lycopenes seen in the crystal structure of LH-II of Rs. molischianum (see Figs. 3 and 6), whereas B850b is not directly protected but can transfer triplet excitation within a few picoseconds to the well protected B850a BChl.
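The step times quoted above for the B800→B850→LH-I→RC cascade can be combined in a very simple kinetic picture. The following is a minimal sketch, not the perturbation scheme of ref. 34: it assumes irreversible first-order steps in series, takes 4 ps as an illustrative midpoint of the measured 3–5 ps, and introduces a hypothetical competing loss lifetime of 1 ns at every stage. The function name and all parameter values other than the quoted step times are assumptions made for illustration.

```python
def sequential_transfer(step_times_ps, loss_time_ps):
    """Summed time constant and overall yield for irreversible steps in series.

    Each step competes with a single loss channel of lifetime loss_time_ps
    (an assumed value, not taken from the text).
    """
    total_time = sum(step_times_ps)  # sum of the transfer time constants
    k_loss = 1.0 / loss_time_ps
    efficiency = 1.0
    for tau in step_times_ps:
        k_transfer = 1.0 / tau
        # Branching ratio: fraction of excitations that move on at this step.
        efficiency *= k_transfer / (k_transfer + k_loss)
    return total_time, efficiency


# Quoted step times: B800->B850 (0.7 ps), B850->LH-I (3-5 ps, midpoint assumed),
# LH-I->RC (35 ps). The 1000 ps loss lifetime is a hypothetical input.
steps_ps = [0.7, 4.0, 35.0]
total_ps, overall_yield = sequential_transfer(steps_ps, loss_time_ps=1000.0)
print(f"summed time constant: {total_ps:.1f} ps, yield: {overall_yield:.3f}")
```

With these assumed inputs the summed time constant is 39.7 ps, below the 100 ps bound quoted above, and the yield falls in the mid-90% range; the slowest step (LH-I→RC) dominates both numbers.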
Other Photosynthetic Organisms
Photosynthetic organisms have developed, from a few common components, a rather divergent set of antenna systems. The divergence is demonstrated in Fig. 8, which compares antenna systems of green bacteria, cyanobacteria, dinoflagellates, and green plants; to these examples should be added the apparatus of purple bacteria shown in Fig. 1.
FIG. 8. Schematic representation of proposed models of the PSUs in other photosynthetic systems. The figure displays inter- and extramembrane light-harvesting complexes, together with the RCs (RC in green bacteria, and PS-I and PS-II in cyanobacteria, dinoflagellates, and green plants). (a) Green bacteria: The major light-harvesting complex, the chlorosome, contains rod-like BChl c aggregates surrounded by a layer of protein-embedding lipids. Excitation energy harvested by the rod-like aggregates reaches the RC through a BChl a-containing baseplate and membrane-bound light-harvesting BChl a complexes. (b) Cyanobacteria: The dominant light-harvesting complex of cyanobacteria and red algae, the phycobilisome (PBS), is unique in choosing linear tetrapyrroles as pigments. Several types of disk-like pigment-protein complexes such as R-phycoerythrin (51) constitute the phycobilisome rods and core. (c) Dinoflagellates: The photosynthetic unit of dinoflagellates consists of several membrane-bound pigment-protein complexes and an extramembrane light-harvesting complex, the peridinin-chlorophyll-protein (PCP). (d) Green plants: Chloroplasts of green plants possess the chlorophyll-carotenoid-containing LHCII (6) as the most abundant light-harvesting complex. [Images of R-phycoerythrin and PCP were produced with the program VMD (25).]
Anoxygenic photosynthetic (purple and green) bacteria employ a single RC. Oxygen-evolving photosynthetic organisms, e.g., cyanobacteria, dinoflagellates, and plants, possess in their PSUs two RCs of different types, namely PS-I and PS-II (see Figs. 8 b, c, and d). PS-I shows similarity to the RCs of green sulfur bacteria, whereas PS-II is thought to be evolutionarily related to the RC of purple bacteria. PS-I and PS-II have integral light-harvesting pigments associated with them (5).
Apart from those integral light-harvesting pigments, oxygen-evolving photosynthetic organisms possess additional light-harvesting complexes that display significant structural variability among species. To illustrate the common components of the light-harvesting systems in Figs. 1 and 8, we summarize the properties of the antenna systems of purple bacteria as far as they are relevant to photosynthetic life forms in general. The chromophores of purple bacteria, i.e., BChls and carotenoids, are attuned to their ambient light. In the case of lycopene/spheroidene and B800/B850 BChls, the combined absorption spectrum is complementary to that of chlorophyll a or b in green plants, i.e., adjusted to a habitat below plants. The purple bacteria exploit the low-lying excited states of polyenes (46, 47) to couple the carotenoid excitations to BChls. The carotenoids are entrusted with the excitation energy for only a few hundred femtoseconds, after which time BChls are the wardens of the energy. The spectra of BChls are tuned only to a limited degree through interaction with the protein environment, e.g., through formylmethionine-Mg ligation in the case of B800 of LH-II from Rps. acidophila (18) or through an Asp-Mg ligation in the case of B800 of LH-II from Rs. molischianum (19); the observed spectra result mainly from intrinsic properties of BChls and excitonic interactions (26, 29). Excitonic coupling splits the excited state energies, thus improving the overlap between donor and acceptor spectra in the excitation cascade (26, 41, 44). The BChls have the disadvantage that their lowest-energy triplet state lies high enough to excite molecular oxygen. Their companion carotenoids quench the triplet excitations of BChls.
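The exciton splitting invoked above can be illustrated with the nearest-neighbor couplings quoted earlier in this paper (790 cm⁻¹ within and 369 cm⁻¹ between αβ-heterodimers). The sketch below is an idealization and not the effective Hamiltonian of ref. 26: it assumes identical site energies (set to zero), no disorder, couplings only between nearest neighbors, and a ring of eight heterodimers (16 BChls, as in LH-II of Rs. molischianum). Under those assumptions the ring is a two-site-per-cell periodic chain whose levels follow analytically as E = ±sqrt(v1² + v2² + 2·v1·v2·cos k) with k = 2πj/N. The function name and the choice of energy zero are hypothetical.

```python
import math


def dimerized_ring_levels(v1, v2, n_dimers, site_energy=0.0):
    """Exciton levels of a ring with alternating nearest-neighbor couplings.

    Assumes equal site energies and no disorder, so that Bloch's theorem
    gives E(k) = site_energy +/- |v1 + v2 * exp(i k)|, k = 2*pi*j/n_dimers.
    """
    levels = []
    for j in range(n_dimers):
        k = 2.0 * math.pi * j / n_dimers
        split = math.sqrt(v1 * v1 + v2 * v2 + 2.0 * v1 * v2 * math.cos(k))
        levels.extend([site_energy - split, site_energy + split])
    return sorted(levels)


# Couplings in cm^-1 as quoted in the text; ring size is an assumption.
levels = dimerized_ring_levels(790.0, 369.0, n_dimers=8)
bandwidth = levels[-1] - levels[0]        # equals 2 * (v1 + v2)
gap = 2.0 * min(abs(e) for e in levels)   # equals 2 * |v1 - v2|
print(f"{len(levels)} levels, bandwidth {bandwidth:.0f} cm^-1, gap {gap:.0f} cm^-1")
```

In this idealized model the 16 levels span 2(v1 + v2) = 2318 cm⁻¹, which gives a sense of how widely strong coupling spreads the acceptor levels; diagonal disorder, which the text reports as a noticeable but small effect, is deliberately left out.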
The efficient flow of excitation through the chromophore system requires highly ordered aggregates, the geometry of which is adapted to the needed interactions; carotenoids must be in close (van der Waals) contact with BChls for triplet quenching and must be proximate within a few angstroms for transfer of optically forbidden excitations. Chlorophylls, to achieve significant exciton splitting, must have Mg–Mg distances of about 10 Å; for energy transfer on a picosecond time scale, Mg–Mg distances must be of the order of 20 Å. It is possible that BChls form aggregates to achieve coherence over many chromophores, such that the lowest-energy state becomes optically forbidden, increasing its lifetime. A multiprotein architecture is necessary to provide a large enough scaffold for the number of chromophores employed in light harvesting. Because of this architecture, antenna systems employ a hierarchy of chromophore aggregates; the chromophores are closer and more tightly coupled in the individual pigment-protein complex, e.g., in LH-II, and more loosely coupled between different pigment-protein complexes. The control of the overall aggregation of the multiprotein system is in itself an impressive achievement worthy of study (23). To direct the flow of excitation to the RC, the antenna system of purple bacteria assumes a spatial organization in which the BChls with lower-energy excitations are closer to the RC. Such an arrangement, as shown in Fig. 2, yields an energy funnel that prevents detours in the excitation flow, enhancing the overall efficiency of light harvesting as measured by the quantum yield, i.e., the probability that the excitation from an absorbed photon reaches an RC. The features of light harvesting in purple bacteria can serve as a background to a comparison of the alternative antenna systems shown in Fig. 8. As their primary light-harvesting complexes, green bacteria use extramembrane sac-like aggregates of BChl c (d or e in some species) called chlorosomes (Fig. 8a).
Chlorosomes consist of pigment oligomers, which in some species appear to be rod-shaped aggregates of BChls. It has been suggested that the rod-shaped BChl aggregates are stabilized solely through pigment-pigment interactions between the BChls. Chlorosomes are positioned external to the membrane, on top of the RC as shown in Fig. 8a. In cyanobacteria and red algae, the dominant light-harvesting complexes, shown in Fig. 8b, are extramembrane PBSs with discoidal pigment-protein complexes exhibiting an energy cascade from the outer rod disks toward the core and the RC. The PSU of dinoflagellates, presented in Fig. 8c, contains, among other pigment-protein complexes, the extramembrane PCP. Recently, the structure of PCP has been solved at 2.0 Å resolution (52). PCP distinguishes itself from other light-harvesting complexes in using carotenoids as the predominant light absorbers, exhibiting a chlorophyll-to-carotenoid ratio of
1:4. Efficient excitation transfer between carotenoids and chlorophylls, based on the structure of the aggregate, i.e., close contacts between peridinins and chlorophylls, has been confirmed by quantum chemical calculations (T.R., A.D., and K.S., unpublished work). The most abundant light-harvesting complex located in chloroplasts of green plants is LHCII, shown in Fig. 8d. The structure of LHCII, resolved at 3.4 Å (6), features two carotenoids, seven Chls a, and five Chls b as light-absorbing agents. LHCII is located within the thylakoid membrane in the vicinity of PS-II. It has been suggested that LHCII can, according to light conditions, physically move toward PS-I, thereby regulating the relative flow of energy into PS-II and PS-I. The multiprotein photosynthetic apparatus as shown in Fig. 1 poses the challenge of eventually modeling the conversion of light into ATP in its entirety. Few would have predicted that the protein constituents of the photosynthetic apparatus would, in principle, already be structurally known today, but many expect that biologists will more and more often see entire protein systems engaged in complex overall functions resolved at atomic resolution. The questions posed by the photosynthetic apparatus will then be typical for biology of the 21st century: how are multiprotein systems genetically controlled, how do they physically aggregate, how did they evolve, and how do they compare between species? The PSU constitutes an ideal subsystem of the photosynthetic apparatus that, because of its smaller size, is more amenable to study while posing the same principal challenges: how do LH-I and LH-II form from their many independent components, what determines the ring size and stability, and how do the completed LH-IIs aggregate around the LH-I-RC complex? The function of the PSU emerges as a true system property, all components being designed to cooperate in absorbing light effectively and channeling its energy to the RC.
The common origin of photosynthetic, respiratory, and other organisms makes the PSU and the photosynthetic apparatus a valuable model for understanding, at the level of multiprotein systems, not only photosynthesis but also life in general.
We acknowledge financial support from the National Institutes of Health [Grant P41RR05969], the National Science Foundation [Grants NSF BIR 9318159 and NSF BIR-94–23827(EQ)], and the Carver Charitable Trust.
1. Emerson, R. & Arnold, W. (1932) J. Gen. Physiol. 16, 191–205.
2. Hu, X. & Schulten, K. (1997) Physics Today 50, 28–34.
3. Duysens, L.N.M. (1952) Ph.D. thesis (Utrecht, The Netherlands).
4. Cogdell, R., Fyfe, P., Barrett, S., Prince, S., Freer, A., Isaacs, N., McGlynn, P. & Hunter, C. (1996) Photosynth. Res. 48, 55–63.
5. Krauss, N., Schubert, W.-D., Klukas, O., Fromme, P., Witt, H.T. & Saenger, W. (1996) Nat. Struct. Biol. 3, 965–973.
6. Kühlbrandt, W., Wang, D.-N. & Fujiyoshi, Y. (1994) Nature (London) 367, 614–621.
7. Zuber, H. & Brunisholz, R.A. (1991) in Chlorophylls, ed. Scheer, H. (CRC, Boca Raton, FL), pp. 627–692.
8. Miller, K. (1982) Nature (London) 300, 53–55.
9. Walz, T. & Ghosh, R. (1997) J. Mol. Biol. 265, 107–111.
10. Monger, T. & Parson, W. (1977) Biochim. Biophys. Acta 460, 393–407.
11. van Grondelle, R., Dekker, J., Gillbro, T. & Sundstrom, V. (1994) Biochim. Biophys. Acta 1187, 1–65.
12. Germeroth, L., Lottspeich, F., Robert, B. & Michel, H. (1993) Biochemistry 32, 5615–5621.
13. Aagaard, J. & Sistrom, W. (1972) Photochem. Photobiol. 15, 209–225.
14. Pullerits, T. & Sundstrom, V. (1996) Acc. Chem. Res. 29, 381–389.
15. Fleming, G.R. & van Grondelle, R. (1997) Curr. Opin. Struct. Biol. 7, 738–748.
16. Deisenhofer, J., Epp, O., Miki, K., Huber, R. & Michel, H. (1985) Nature (London) 318, 618–624.
17. Ermler, U., Fritzsch, G., Buchanan, S.K. & Michel, H. (1994) Structure 2, 925–936.
18. McDermott, G., Prince, S., Freer, A., Hawthornthwaite-Lawless, A., Papiz, M., Cogdell, R. & Isaacs, N. (1995) Nature (London) 374, 517–521.
19. Koepke, J., Hu, X., Münke, C., Schulten, K. & Michel, H. (1996) Structure 4, 581–597.
20. Hu, X., Xu, D., Hamer, K., Schulten, K., Koepke, J. & Michel, H. (1995) Protein Sci. 4, 1670–1682.
21. Hu, X. & Schulten, K. (1998) Biophys. J., in press.
22. Gouterman, M. (1961) J. Mol. Spectrosc. 6, 138–163.
23. Bailey, M., Schulten, K. & Johnson, J.E. (1998) Curr. Opin. Struct. Biol., in press.
24. Karrasch, S., Bullough, P.A. & Ghosh, R. (1995) EMBO J. 14, 631–638.
25. Humphrey, W.F., Dalke, A. & Schulten, K. (1996) J. Mol. Graphics 14, 33–38.
26. Hu, X., Ritz, T., Damjanović, A. & Schulten, K. (1997) J. Phys. Chem. B 101, 3854–3871.
27. Frenkel, J. (1931) Phys. Rev. 37, 17–44.
28. Knox, K. (1963) Theory of Excitons (Academic, New York).
29. Zerner, M.C., Cory, M.G., Hu, X. & Schulten, K. (1998) J. Phys. Chem. B, in press.
30. Dracheva, T.V., Novoderezhkin, V.I. & Razjivin, A. (1996) FEBS Lett. 387, 81–84.
31. Hu, X., Xu, D., Hamer, K., Schulten, K., Koepke, J. & Michel, H. (1995) in Biological Membranes: A Molecular Perspective from Computation and Experiment, eds. Merz, K. & Roux, B. (Birkhäuser, Cambridge, MA), pp. 503–533.
32. Sauer, K., Cogdell, R.J., Prince, S.M., Freer, A., Isaacs, N.W. & Scheer, H. (1996) Photochem. Photobiol. 64, 564–576.
33. Alden, R., Johnson, E., Nagarajan, V., Parson, W., Law, C. & Cogdell, R. (1997) J. Phys. Chem. B 101, 4667–4680.
34. Ritz, T., Hu, X., Damjanović, A. & Schulten, K. (1998) J. Lumin. 76–77, 310–321.
35. Pullerits, T. & Sundstrom, V. (1996) J. Phys. Chem. 100, 10787–10792.
36. Jimenez, R., Dikshit, S., Bradforth, S. & Fleming, G. (1996) J. Phys. Chem. 100, 6825–6834.
37. Wu, H.-M., Reddy, N.R.S. & Small, G.J. (1997) J. Phys. Chem. B 101, 651–656.
38. Shreve, A.P., Trautman, J.K., Frank, H.A., Owens, T.G. & Albrecht, A.C. (1991) Biochim. Biophys. Acta 1058, 280–288.
39. Hess, S., Chachisvilis, M., Timpmann, K., Jones, M.R., Fowler, G.J.S., Hunter, C.N. & Sundstrom, V. (1995) Proc. Natl. Acad. Sci. USA 92, 12333–12337.
40. Visscher, K.J., Bergstrom, H., Sundstrom, V., Hunter, C.N. & van Grondelle, R. (1989) Photosynth. Res. 22, 211–217.
41. Arnold, W. & Oppenheimer, J.R. (1950) J. Gen. Physiol. 33, 423–435.
42. Oppenheimer, J.R. (1941) Phys. Rev. 60, 158.
43. Förster, T. (1948) Ann. Phys. (Leipzig) 2, 55–75.
44. Damjanović, A., Ritz, T. & Schulten, K. (1998) Phys. Rev. E, in press.
45. Chadwick, B.W., Zhang, C., Cogdell, R.J. & Frank, H.A. (1987) Biochim. Biophys. Acta 893, 444–457.
46. Hudson, B.S., Kohler, B.E. & Schulten, K. (1982) in Excited States, ed. Lim, E.C. (Academic, New York), Vol. 6, pp. 1–95.
47. Tavan, P. & Schulten, K. (1987) Phys. Rev. B 36, 4337–4358.
48. Dexter, D.L. (1953) J. Chem. Phys. 21, 836–850.
49. Nagae, H., Kakitani, T., Katoh, T. & Mimuro, M. (1993) J. Chem. Phys. 98, 8012–8023.
50. Ricci, M., Bradforth, S.E., Jimenez, R. & Fleming, G.R. (1996) Chem. Phys. Lett. 259, 381–390.
51. Chang, W., Jiang, T., Wan, Z., Zhang, J., Yang, Z. & Liang, D. (1996) J. Mol. Biol. 262, 721–731.
52. Hofmann, E., Wrench, P., Sharples, F., Hiller, R., Welte, W. & Diederichs, K. (1996) Science 272, 1788–1791.
ELECTROSTATIC STEERING AND IONIC TETHERING IN ENZYME-LIGAND BINDING: INSIGHTS FROM SIMULATIONS
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5942–5949, May 1998
Colloquium Paper
This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Electrostatic steering and ionic tethering in enzyme-ligand binding: Insights from simulations
REBECCA C. WADE*, RAZIF R. GABDOULLINE, SUSANNA K. LÜDEMANN, AND VALÈRE LOUNNAS
European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
ABSTRACT To bind at an enzyme’s active site, a ligand must diffuse or be transported to the enzyme’s surface, and, if the binding site is buried, the ligand must diffuse through the protein to reach it. Although the driving force for ligand binding is often ascribed to the hydrophobic effect, electrostatic interactions also influence the binding process of both charged and nonpolar ligands. First, electrostatic steering of charged substrates into enzyme active sites is discussed. This is of particular relevance for diffusion-influenced enzymes. By comparing the results of Brownian dynamics simulations and electrostatic potential similarity analysis for triose-phosphate isomerases, superoxide dismutases, and β-lactamases from different species, we identify the conserved features responsible for the electrostatic substrate-steering fields. The conserved potentials are localized at the active sites and are the primary determinants of the bimolecular association rates. Then we focus on a more subtle effect, which we will refer to as “ionic tethering.” We explore, by means of molecular and Brownian dynamics simulations and electrostatic continuum calculations, how salt links can act as tethers between structural elements of an enzyme that undergo conformational change upon substrate binding, and thereby regulate or modulate substrate binding. This is illustrated for the lipase and cytochrome P450 enzymes. Ionic tethering can provide a control mechanism for substrate binding that is sensitive to the electrostatic properties of the enzyme’s surroundings even when the substrate is nonpolar.
Conceptually, the process of ligand-protein binding may be considered to consist of the following consecutive steps: 1, diffusion of the ligand to the entrance to the binding site on the protein surface; 2, diffusion of the ligand through the protein to the binding site; 3, rearrangement of the ligand in the binding site into its bound orientation (see Fig. 1). In some cases, the binding site is situated on the protein surface and diffusion through the protein is not necessary. Step 1 may involve diffusion in reduced dimensions or the ligand may be actively transported to the protein surface. However, in general the above three steps should be considered and electrostatic interactions can influence all three. Here, their role in the first two steps is considered. The important influence of electrostatic interactions on the rates of diffusion of charged substrates toward the active sites of enzymes is now well established (1–3). Electrostatic steering is of greatest importance for diffusion-controlled enzymes because it is one of the main factors determining the catalytic rate. For these enzymes, the postdiffusional steps of the reaction have been so optimized that the diffusional association of substrate and protein has become the rate-limiting step. Enhancement of the diffusional association rates can be achieved by attractive electrostatic interactions between the substrate and the protein binding site. Here, we ask what enzyme features are necessary for electrostatic steering resulting in rapid ligand-protein association rates. By examining orthologs from different species, by means of Brownian dynamics (BD) simulations (4) and electrostatic potential similarity analysis, we identify common features important for their shared molecular function and find that these are confined to the close vicinity of the active site.
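To give a sense of scale for diffusion-controlled association and its electrostatic enhancement, the classical Smoluchowski and Debye expressions can be evaluated directly. This is a textbook idealization and not the BD model used in this paper: it assumes a uniformly reactive spherical target, a point-like ligand, and an unscreened Coulomb interaction in water (Bjerrum length taken as about 7.1 Å near room temperature). The diffusion coefficient, encounter radius, and unit charges below are hypothetical illustrative values, as are the function names.

```python
import math

AVOGADRO = 6.02214076e23  # 1/mol


def smoluchowski_rate(diff_m2_s, radius_m):
    """Diffusion-limited rate k = 4*pi*D*a*N_A, returned in M^-1 s^-1."""
    # Factor 1e3 converts m^3/(mol s) to L/(mol s).
    return 4.0 * math.pi * diff_m2_s * radius_m * AVOGADRO * 1.0e3


def debye_rate(diff_m2_s, radius_m, z_ligand, z_target, bjerrum_m=7.1e-10):
    """Smoluchowski rate corrected for an unscreened Coulomb interaction.

    Uses Debye's result with an effective radius r_c / (exp(r_c / a) - 1),
    where r_c = z_ligand * z_target * l_B is negative for attraction.
    """
    r_c = z_ligand * z_target * bjerrum_m
    if r_c == 0.0:
        return smoluchowski_rate(diff_m2_s, radius_m)
    effective_radius = r_c / math.expm1(r_c / radius_m)
    return smoluchowski_rate(diff_m2_s, effective_radius)


# Hypothetical inputs: D = 1e-9 m^2/s, encounter radius 5 Angstrom.
D, a = 1.0e-9, 5.0e-10
k_free = smoluchowski_rate(D, a)
k_attractive = debye_rate(D, a, z_ligand=-1, z_target=+1)
k_repulsive = debye_rate(D, a, z_ligand=-1, z_target=-1)
print(f"{k_free:.2e} {k_attractive:.2e} {k_repulsive:.2e}")
```

For these assumed inputs the uncharged limit is a few times 10⁹ M⁻¹·s⁻¹, and opposite unit charges raise it while like charges lower it. Real enzymes are reactive only over a small patch and carry nonuniform charge distributions, which is why the anisotropic steering discussed here requires simulation rather than a closed-form expression.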
FIG. 1. Schematic diagram showing how electrostatic interactions can influence the binding of a ligand (shaded) to a protein (outline). Step 1, electrostatic forces and torques can steer the ligand into its binding site on the protein. Step 2, electrostatic interactions such as salt links can affect the protein dynamics necessary for ligand access to binding sites shielded from solvent in “gated” binding. Step 3, electrostatic interactions, particularly salt links and hydrogen bonds, between ligand and protein can contribute to binding affinity and specificity and to the structural binding mode of the complex formed.
The role of electrostatic interactions in the next two steps of the ligand-protein binding process is more complex. Here, we focus on one type of interaction, which we shall term “ionic tethering.” This entails the formation of salt links between charged residues in the protein that affect conformational changes in the protein associated with or necessary for ligand binding. Salt-link formation between charged groups in the ligand and the protein can also contribute to ligand binding, but this will not be considered here. Instead, uncharged ligands, which cannot themselves engage in salt links, will be examined. Nevertheless, we show the importance of electrostatic interactions for the binding of nonpolar ligands and for making binding sensitive to the surrounding environment. We
*To whom reprint requests should be addressed. e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955942–8$2.00/0 PNAS is available online at http://www.pnas.org.
Abbreviations: BD, Brownian dynamics; BLAC, β-lactamase; SOD, superoxide dismutase; TIM, triose-phosphate isomerase.
discuss possible mechanisms in the light of two examples that have been the subjects of recent calculations.
ELECTROSTATIC STEERING
One way to identify the features important for the electrostatic steering of a substrate toward its binding site on an enzyme is by site-directed mutagenesis. This approach has been employed to examine the fast association rates of superoxide dismutase (SOD) and acetylcholinesterase with their substrates, and of barnase with the inhibitor barstar.
• For human SOD, mutations expected to improve electrostatic steering were identified by using BD simulations, and, indeed, when the mutations were made, greater electrostatic enhancement of the rate was observed (5), resulting in a “superperfect” enzyme (6).
• For acetylcholinesterase, a large number of mutations of charged residues were made, and these were shown to have little effect on the rate of substrate binding (7). This result was interpreted as evidence of lack of diffusion control and electrostatic steering. However, the rates for the mutants could be well reproduced in BD simulations, which demonstrated enhancement of rates because of electrostatic steering of substrate toward and inside the substrate-binding gorge (8, 9).
• Barstar is the intracellular protein inhibitor of the extracellular ribonuclease barnase, and it binds very tightly with high on-rates. Even so, the rate of binding could be improved by mutation (10). The effects of mutations and ionic strength on the association rates could be well reproduced by BD simulations (11). The data show the dominance of certain residues on the protein binding faces in determining the electrostatic enhancement of the association rate.
Together, these results indicate that electrostatic enhancement of association rates arises mostly from the presence of a few charged residues close to the binding site. Here, we take an alternative approach to site-directed mutagenesis, namely, comparison of diffusion-influenced enzymes from different species to find out what is required for electrostatically enhanced substrate binding rates.
By relying on natural evolution, we are assured of examining fully functioning enzymes, although they may not be fully optimized for electrostatic enhancement of substrate on-rates or fast reaction, as this may not be desirable in their in vivo environment. We examine three families of diffusion-influenced proteins, triose-phosphate isomerases (TIM), Cu,Zn-superoxide dismutases (SOD), and class A β-lactamases (BLAC), for which crystal structures are available from several organisms and whose kinetic properties have been measured (see Table 1).
Diffusional Control of Catalytic Rates. The primary indicator for diffusion control of an enzyme reaction is a fast catalytic rate that is dependent on the viscosity and ionic strength of the solvent. Both TIM and SOD are extremely fast, efficient enzymes with the rate-limiting step of their reactions under physiological conditions being the diffusion of substrate, glyceraldehyde 3-phosphate and superoxide, respectively, to the active site. Indeed, TIM has been described as a “perfect enzyme” (12, 13). The catalytic rates measured for TIMs from more than five species are all about 10⁸ M⁻¹·s⁻¹ at 100 mM ionic strength (see ref. 14 and references in ref. 15), and viscosity dependence of the rates has been demonstrated (16). The catalytic rate has been measured for SODs from more than eight species, and all have rates of about 3×10⁹ M⁻¹·s⁻¹ at 20 mM ionic strength (see references in ref. 17). The rates of SODs exhibit ionic strength dependence and decrease as the ionic strength increases (18). BLACs have been characterized as fully efficient enzymes with no single rate-determining step (19). They are partly diffusion-controlled for good substrates, such as benzylpenicillin with a single negative charge, and most have catalytic rates of 10⁷ to 10⁸ M⁻¹·s⁻¹ for such substrates at 100 mM ionic strength (20, 21).
BD Simulations. Experimental association rates were reproduced well for six variants of SOD (17) and four variants of TIM (15) by BD simulation. These results show that the main features influencing the catalytic rates are represented in the simulation model. The protein is represented by all atoms observed crystallographically plus modeled polar hydrogen atoms, with each atom assigned a partial charge and a van der Waals radius. The protein is immersed in a uniform solvent continuum. The electrostatic potential of the protein is computed by numerical solution of the finite-difference linearized Poisson-Boltzmann equation (22). The substrate is represented by a charged sphere (for SOD) or dumbbell (for TIM). The molecules are treated as rigid, and intermolecular hydrodynamic interactions are neglected. Comparison of simulations with and without a net charge on the substrate shows that electrostatic interactions enhance the association rates for all the enzyme variants studied.
Electrostatic Potential Similarity Analysis. To quantify the common features in the electrostatic potentials of different variants of the enzymes, we carried out an electrostatic potential similarity analysis. The members of each family of enzymes were superimposed by matching α-carbons. Then
Table 1. Properties of the diffusion-influenced enzymes triose-phosphate isomerase (TIM), superoxide dismutase (SOD), and β-lactamase (BLAC)
Substrate: TIM, glyceraldehyde 3-phosphate; SOD, superoxide; BLAC, benzylpenicillin
Net charge of substrate, e: TIM, –1/–2; SOD, –2; BLAC, –1
No. of protein variants compared: TIM, 4; SOD, 6; BLAC, 4
Variants compared together with Protein Data Base identifier code (listed in order of increasing net charge):
  TIM: E. coli (1tre), yeast (1ypi), chicken muscle,* T. brucei (5tim)
  SOD: spinach (1srd), frog (1xso), yeast (1sdy), human (1spd), bovine (2sod), P. leiognathi (1yai)
  BLAC: E. coli (TEM-1) (1xpb), B. licheniformis (4blm), S. albus G,† S. aureus (3blm)
Net charges of proteins at neutral pH, e: TIM, –12, –6, –2, +12; SOD, –8, –6, –4, –4, –2, +2; BLAC, –6, –6, –4, +16
Range of sequence identity between proteins, %: TIM, 50; SOD, 30–55; BLAC, 30–45
Measured kcat/Km, M⁻¹·s⁻¹ × 10⁻⁸: TIM, 1.0–8.4; SOD, 25–39; BLAC, 0.03–0.8
Ionic strength for rate measurement, mM: TIM, 100; SOD, 20; BLAC, 100
E. coli, Escherichia coli; T. brucei, Trypanosoma brucei; P. leiognathi, Photobacterium leiognathi; B. licheniformis, Bacillus licheniformis; S. albus, Streptomyces albus; S. aureus, Staphylococcus aureus.
*Coordinates provided by P. Artymiuk.
†Coordinates provided by O. Dideberg.
average potentials, sign conservation, and similarity indices, as defined in the legend to Fig. 2, were computed as a function of position around the superimposed molecules. The results are shown in Fig. 2. The substrates considered are all negatively charged. The majority of the proteins are also net negatively charged. Thus, electrostatic enhancement of the association rates does not arise from nonspecific attraction between molecules due to monopole interactions. Instead, it arises from the nonuniform charge distribution of the proteins, which results in steering of the substrates toward the positively charged regions of the active sites.
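The three point-wise quantities named above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the input is assumed to be one flattened potential grid per superimposed variant, and the similarity index uses an assumed N-potential generalization of the Hodgkin index, chosen only because it reproduces the limiting values quoted in the legend to Fig. 2.

```python
def similarity_index(values):
    """Assumed generalized Hodgkin index at one grid point (needs N >= 2).

    SI = 2/(N-1) * sum_{l<m} phi_l*phi_m / sum_l phi_l**2  (an assumption,
    not a transcription of the published formula).
    """
    n = len(values)
    sum_sq = sum(v * v for v in values)
    if sum_sq == 0.0:
        return None  # undefined where every potential vanishes
    total = sum(values)
    # Sum over pairs l < m of phi_l * phi_m, from (sum phi)^2 = sum phi^2 + 2 * pairs
    pair_sum = 0.5 * (total * total - sum_sq)
    return 2.0 * pair_sum / ((n - 1) * sum_sq)


def sign_conservation(values):
    """+1 if all potentials are positive, -1 if all are negative, else 0."""
    if all(v > 0 for v in values):
        return 1
    if all(v < 0 for v in values):
        return -1
    return 0


def compare_at_points(grids):
    """grids: list of N equal-length lists, one flattened grid per variant.

    Returns one (average, sign_conservation, similarity_index) tuple per point.
    """
    results = []
    for values in zip(*grids):
        average = sum(values) / len(values)
        results.append((average, sign_conservation(values), similarity_index(values)))
    return results
```

In practice the grid points inside the molecules would be masked out before the comparison; that step is omitted here.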
FIG. 2. Electrostatic potential comparison for variants of TIM (Left), SOD (Center), and BLAC (Right). (Top) Average potential contoured at ±0.4 kcal·mol⁻¹·e⁻¹ (1 kcal=4.184 kJ). (Middle) Similarity index with most conserved regions within contours at a level of 0.75 in all cases, except for the red contours in Center, which are at 0.85. (Bottom) Contours enclose regions where the sign of the electrostatic potential is conserved. In all cases, red represents regions of negative potential and blue represents regions of positive potential. Magenta solid spheres represent important active-site atoms: carboxylate oxygens of Glu-165 and amino nitrogen of Lys-13 in TIM, Cu and Zn ions in SOD, and the side chain of the catalytic Ser-70 in BLAC. The proteins are represented by ribbon plots of representative variants: chicken muscle TIM, bovine (yellow) and P. leiognathi (green) SOD, and TEM-1 BLAC. The dimers of all SODs studied except that from P. leiognathi superimpose well on the bovine SOD. Consequently, one monomer of P. leiognathi was superimposed on one monomer of bovine SOD instead of the complete dimer as done for the other SODs. Negative contours are not shown for the sign conservation plot in SOD (Center Bottom) for clarity. The similarity index, SI, is computed at points (i,j,k) around the proteins from the following formula, which is generalized to the comparison of N potentials, φ_l (l=1, 2, ..., N), from the Hodgkin formula for the comparison of two potentials (59):
SI=+1 when the potentials are all identical. SI=–1 when two potentials are opposite (N<3). For small deviations, Δn, from the average potential, the decrease in the SI from its maximum (=1) is proportional to (Δn)². This is because the SI can be rewritten as:
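The displayed formulas are not reproduced in this copy of the legend. One form consistent with every property stated here is sketched below as an assumption, not as a transcription of the published equations: it reduces to the Hodgkin index for N=2, gives SI=+1 for identical potentials, gives SI=–1 for two opposite potentials, and falls off quadratically with the deviations Δ_n from the average potential.

```latex
% Assumed N-potential generalization of the Hodgkin index at grid point (i,j,k)
\[
\mathrm{SI}_{ijk} \;=\; \frac{2}{N-1}\,
  \frac{\sum_{l<m} \phi_l\,\phi_m}{\sum_{l=1}^{N} \phi_l^{2}}
\]
% Equivalent form in terms of deviations from the average potential
\[
\mathrm{SI}_{ijk} \;=\; 1 \;-\; \frac{N}{N-1}\,
  \frac{\sum_{n=1}^{N} \Delta_n^{2}}{\sum_{l=1}^{N} \phi_l^{2}},
\qquad
\Delta_n = \phi_n - \bar{\phi},
\quad
\bar{\phi} = \frac{1}{N}\sum_{l=1}^{N}\phi_l
\]
```

The second form follows from the first by using the identity that the square of the summed potentials equals the sum of squares plus twice the pair sum, together with the fact that the Δ_n sum to zero. Under this assumed form, SI=0.85 with N=4 would correspond to a ratio of summed squared deviations to summed squared potentials of 0.15 × 3/4 ≈ 0.11.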
For example, when SI=0.85 for four potentials, The SI (and the average potential and sign conservation) are computed outside the molecules' combined van der Waals volume as defined with atomic radii set at twice their normal values.
The TIM and SOD enzymes considered here are homodimeric enzymes with two active sites, whereas BLAC is a monomer with one active site. The average potentials (Fig. 2 Top) all show attractive positive regions of potential over the active sites. On average, the electrostatic potential of the TIMs confines the substrate to a ring around the protein that includes both active sites. For BLAC, the positive active site potential is on average surrounded by a ring of negative potential and isolated from other positively charged regions of the protein surface. The similarity index plots (Fig. 2 Middle) show the most conserved regions of the potentials for the variants of each
enzyme. The largest regions of positive potential are situated over the active sites for all enzymes. Regions of negative potential are also conserved away from the active sites but, as can be seen by comparison with the average potential maps, the potential in these regions is generally smaller in magnitude. Fig. 2 Bottom shows the regions where the sign of the potential is the same in all variants of each enzyme. This gives an indication of the extent of the attractive potential acting on the substrate.
TIM. In the TIMs, the positive potential at the active site is mostly due to the conserved active-site Lys-13 and the conserved Lys-237 (chicken muscle TIM numbering). This conserved region of positive potential has a volume of about 6,000 Å³ and extends about 30 Å from the amino group of Lys-13 at 100 mM ionic strength. The large pincer-like regions of conserved, but small in magnitude, negative potential are mostly due to Glu-23, which is conserved in all four TIMs examined but not in all known TIM sequences.
SOD. In the SODs, the positive potential over the active site channel is due primarily to the copper and zinc ions, the conserved Arg-141 (bovine SOD numbering), and a few other nearby positively charged residues that are not totally conserved. One of the SOD variants studied, that from the prokaryote Photobacterium leiognathi, displays a different dimerization mode from the other enzymes, which are eukaryotic (23), as shown in Fig. 2. It has a looser dimer interface than the other SODs, indicating that it may act partially as a monomer like the SOD from E. coli, whose structure was solved very recently (24). Nevertheless, the structures of the monomers of all SODs are similar, and thus we compare the potentials around a superimposed monomer of each dimer. The region of attractive potential for the substrate is elongated, roughly following the shape of the active site cleft. It has a volume of 3,000 Å³ and extends up to 35 Å from the copper ion.
While there is a common region of conserved positive potential as shown by the similarity index plot, the regions of positive potential in each of the proteins do not superimpose exactly. In particular, the P. leiognathi enzyme has a loop insertion known as the SS loop (green loop in the middle of Fig. 2 Top Center) containing two lysine residues, an aspartic acid, and a glutamic acid, which may compensate for the deletion of the 7,8 loop that is present in the other enzymes and contains one lysine and two glutamic acids. This increases the attractive potential near the SS loop (which is present to a lesser extent in the other SODs, as can be seen from the sign conservation map). It should also be noted that the magnitude of the positive potential in the active site is greater for the P. leiognathi enzyme than for the bovine enzyme, although its rate is identical (25). Although mutations of charged residues in the SS loop have not yet been reported, several studies of mutations of the charged residues in the 7,8 loop (Asp-130, Glu-131, Lys-134) and of the nearby Lys-120 (not conserved in prokaryotic SODs) and Arg-141 have been made. Arg-141 has been shown to be particularly important electrostatically and mechanistically (26). The other residues are of lesser but significant importance for the catalytic rate, but the relative importance of each of the residues has been shown to differ in the different variants (5, 27).
BLAC. In the BLACs, the conserved positive potential runs along the active site cleft and is due primarily to Lys-234, which is conserved in all four enzymes, and Arg-244, which is present in all but the Streptomyces albus G enzyme (ABL numbering scheme). However, the S. albus G BLAC has an arginine not present in the other three enzymes, Arg-220, whose guanidino group occupies a very similar position in the three-dimensional structure to that of Arg-244 in the other enzymes.
While this appears to be a largely compensatory mutation, and is present in other BLAC sequences, it may be one of the reasons why the rate for the S. albus enzyme is lower than that of the other BLACs for benzylpenicillin. Lys-73, which is in the active site close to the catalytic Ser-70, is of less importance for the conserved attractive steering potential. The pKa of this residue is a subject of controversy (see ref. 28 and references therein), but here we note that neutralization of Lys-73 makes very little difference to the conserved positive potential region shown near the active site in Fig. 2 Middle Right. Lys-234, Arg-220, and Arg-244 have been the subject of mutational studies in several BLACs (20). These have shown that Lys-234 plays a role both in the initial recognition of the substrate and in the stabilization of the transition state, with the latter being dominant. Mutation of Arg-244 or Arg-220 to a nonpolar residue in the corresponding protein had a deleterious effect on the catalytic rate for charged substrates and indicated an important role for substrate binding. The conserved attractive potential region over the active site is much smaller for BLAC than for the other enzymes: it has a volume of about 200 Å³ and extends only 14 Å from the hydroxyl group of Ser-70.
Mechanistic Implications. Overall, these data show that the largest conserved regions of substrate-attracting electrostatic potential are near the active sites in all of the enzymes studied. This fact implies that the localized potentials at the binding site are sufficient for efficient electrostatic steering of substrate into the binding site. This is consistent with recent energetic analysis showing that the rate enhancement due to electrostatic interactions can be approximately estimated from the Boltzmann weighted average of the interaction energy in the binding site (29–31). However, the conserved local potentials vary in size and extent between the enzymes studied.
While the volume of conserved attractive potential at the active site is twice as large for TIM as for SOD, it is much smaller for BLAC than for the other enzymes, indicating less electrostatic enhancement of rates, which is consistent with the lower measured rates for the BLACs. The local attractive potentials are largely provided by a few charged residues that are mostly highly conserved between orthologs. This permits enzymes with the same activity to have very different net charges (ranging from –12 to +12 e for the TIMs and –6 to +16 e for the BLACs examined). Thus, they can be highly efficient enzymes and fulfill different secondary functions or survive in different cellular environments.
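The Boltzmann-weighted estimate of electrostatic rate enhancement cited above (29–31) can be illustrated numerically. The sketch below takes one reading of that estimate as an assumption: the enhancement factor is the mean of exp(–U/kBT) over sampled substrate positions in the binding-site region. The function name and the sample energies are hypothetical and are not data from this study.

```python
import math

KB_KCAL = 0.0019872  # Boltzmann constant in kcal/(mol*K)


def rate_enhancement(energies_kcal, temperature=298.15):
    """Mean Boltzmann factor over sampled binding-site interaction energies.

    energies_kcal: electrostatic interaction energies (kcal/mol) of the
    substrate at sampled positions in the binding-site region (assumed input).
    Returns the estimated ratio of the rate with electrostatics to the rate
    without; 1.0 means no enhancement.
    """
    kt = KB_KCAL * temperature
    factors = [math.exp(-u / kt) for u in energies_kcal]
    return sum(factors) / len(factors)


if __name__ == "__main__":
    # Hypothetical case: a substrate of charge -1 e sitting at the
    # +0.4 kcal/(mol*e) contour level used in Fig. 2, i.e. U = -0.4 kcal/mol.
    print(rate_enhancement([-0.4]))
```

At room temperature an attractive energy of 0.4 kcal/mol gives an enhancement of roughly a factor of 2 under this assumption, which shows why even modest conserved potentials near the active site can matter for the rate.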
IONIC TETHERING
Ionic Tethering Mechanisms. Ligand binding to proteins is frequently accompanied by conformational changes in the proteins such as loop motions, channel openings, or side-chain rotations. These may result in energy barriers to ligand binding and affect binding rates and their time dependence as described by gating theory (32, 33). We investigate the role of salt links as ionic tethers providing the protein with a means of controlling such conformational “gating.” Some of the possible mechanisms by which ionic tethers might act are as follows:
• by thermodynamic stabilization of a conformation of the protein;
• by kinetic stabilization of a conformation of the protein;
• as devices to allow a certain degree of flexibility in the protein structure;
• as devices dependent on pH, ionic strength, and dielectric of the environment; and
• as devices to ensure specific interactions.
Protein folding is generally considered to be driven by hydrophobic interactions, and it is these interactions rather than interactions between charged groups (whose desolvation is unfavorable) that are thought to stabilize the folded states of proteins (34, 35). Nevertheless, comparison of the crystal structures of a number of proteins from mesophilic and thermophilic organisms shows that the latter often contain significantly more hydrogen bonds and salt links (36). Moreover, calculations of the electrostatic free energy contribution of salt links to protein stability, by using a classical continuum electrostatic model (34), show an overall tendency for salt links to be more stable in thermostable proteins than in their mesostable counterparts (I. Shrivastava, V.L., and R.C.W., unpublished data). Stabilization due to the formation of ionic networks is indicated by mutational data (37) and calculations (38) showing that removal of salt links by mutation of their side chains tends to be more destabilizing when they participate in salt-link networks. These data suggest that, under certain conditions, salt links can thermodynamically stabilize a folded conformation of the protein relative to its fully or partially unfolded state. However, their increased presence in proteins from thermophilic organisms has also been attributed to “resilience” (39) or kinetic stabilization (40), which increases the kinetic barrier to unfolding. That is, if the protein is perturbed from its equilibrium structure, the long-range nature of charge-charge interactions in salt links will facilitate return to the original structure. Charge-charge interactions can be thought of as providing a smoother energy landscape funnel than short-range hydrophobic interactions. Thus, they can allow greater structural flexibility in proteins than hydrophobic interactions, although this is combined with greater specificity in the actual interactions formed. Moreover, the strength of salt links is dependent on the physical properties of the environment. Thus, protein conformational changes controlled by ionic tethers may be triggered by a change in environment (e.g., pH, ionic strength, or dielectric screening) that alters the strength of a salt link.
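The environmental sensitivity of a salt link can be made concrete with a back-of-the-envelope estimate. The sketch below is a crude screened-Coulomb (Debye-Hückel) model of two point charges in a uniform medium. It is an illustrative assumption only, not the continuum electrostatic model used for the calculations cited above, and it ignores desolvation, pH effects, and the protein/solvent dielectric boundary. All function names are hypothetical.

```python
import math

COULOMB_KCAL = 332.06  # Coulomb constant in kcal*Angstrom/(mol*e^2)


def debye_length_angstrom(ionic_strength_molar, dielectric=78.5, temperature=298.15):
    """Debye screening length in Angstrom for a 1:1 electrolyte.

    Uses the familiar value of about 3.04/sqrt(I) Angstrom for water near
    25 C, scaled by sqrt(dielectric * temperature) for other conditions.
    """
    scale = math.sqrt(dielectric * temperature / (78.5 * 298.15))
    return 3.04 * scale / math.sqrt(ionic_strength_molar)


def salt_link_energy(distance, dielectric, ionic_strength_molar,
                     q1=1.0, q2=-1.0, temperature=298.15):
    """Screened Coulomb energy (kcal/mol) of two point charges.

    distance is in Angstrom; q1 and q2 are in units of the elementary charge.
    """
    energy = COULOMB_KCAL * q1 * q2 / (dielectric * distance)
    if ionic_strength_molar > 0.0:
        kappa_inv = debye_length_angstrom(ionic_strength_molar, dielectric, temperature)
        energy *= math.exp(-distance / kappa_inv)
    return energy


if __name__ == "__main__":
    # Same 4-Angstrom ion pair in a low-dielectric medium and in salty water
    print(salt_link_energy(4.0, 4.0, 0.0))
    print(salt_link_energy(4.0, 78.5, 0.1))
```

Even in this simplified picture, moving the same ion pair from a high-dielectric, salt-containing solvent to a low-dielectric medium strengthens the interaction by more than an order of magnitude, which is the kind of environmental switch discussed for the lipase lid below.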
Consider two situations: one in which a salt link tethers and stabilizes an enzyme in its active conformation under certain environmental conditions, and the other in which a salt link affects the opening of substrate-access and product-exit channels—i.e., gating perturbations from the equilibrium structure. In the former case, the ionic tether stabilizes one of several low-energy conformations. In the latter case, the ionic tether acts as a control of structural deviations from a low-energy conformation. We will illustrate the former by interfacially activated lipases and the latter by cytochrome P450s, which have buried active sites. Lipase. Lipases catalyze the hydrolysis of uncharged ester substrates. Lipases undergo interfacial activation—i.e., their activity is greatly increased when they act on substrate at a lipid/water interface. Crystallographic studies (41–43) have shown that lipases possess a surface loop or “lid” over the active site that, upon activation, opens up to permit the binding of substrate. Upon opening, the lid’s hydrophobic face is exposed and its hydrophilic side is buried. The properties of the active-site lid are probably best characterized in the lipase from Rhizomucor miehei. Crystal structures (41, 44) show that the main difference between the open and closed forms is the displacement of a helical lid of about 12 residues (see Fig. 3). The lid contains two charged residues, Arg-86 and Asp-91. In the open, inhibitor-bound form, Arg-86 is close to Asp-61. Evidence that these residues form an ionic tether that stabilizes the open form relative to the closed form in certain environments is provided by the following theoretical and experimental studies. Molecular and Brownian dynamics simulations. The opening of the active-site lid has been simulated by molecular and Brownian dynamics (45–47). 
The time scale of the opening means that the loop must be artificially guided from closed to open states during the molecular dynamics simulations, which give information about the relative energies of the open and closed states. On the other hand, BD simulations, in which the lid is modeled as a simple chain of spherical residues rather than with an all-atom model, can be carried out on the relevant time scales and permit opening times to be estimated. The simulations show that opening of the lid is facilitated when the dielectric constant and polarity of the environment are reduced, indicating that the open state is stabilized by electrostatic interactions. In BD simulations (46), the lid opens up in times on the order of 100 ns in a nonpolar low-dielectric medium, whereas the lid does not always open during simulations of 900 ns in a polar high-dielectric medium. BD simulations with Arg-86 and/or Asp-91 neutralized show their importance in activation, with the effect of Arg-86 being dominant. In the closed inactive conformation, Asp-91 experiences repulsive forces that tend to push the lid toward the active, open conformation. On opening, Arg-86 approaches Asp-61 to make a favorable ionic interaction stabilizing the open conformation. The crystal structure of the open form shows that the side chain of Arg-86 is disordered (41). Thus the interaction with Asp-61 does not result in the formation of a highly ordered hydrogen-bonded salt link but a less specific charge-charge interaction. The model in the BD simulations does not explicitly represent the side-chain atoms of Arg-86 and shows, therefore, that such less specific interactions can stabilize the open conformation of the lid.
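How a BD simulation yields an opening-time estimate can be sketched with a one-dimensional toy model. The code below applies the standard Brownian dynamics update without hydrodynamic interactions (drift D·F·Δt/kBT plus Gaussian noise of variance 2·D·Δt) to a single hypothetical "lid opening" coordinate and records the first time it crosses a threshold. The coordinate, force functions, and parameter values are illustrative assumptions in reduced units; the published simulations (46) use a chain of spherical residues moving in the field of the protein.

```python
import math
import random


def bd_step(x, force, diffusion, dt, kt, rng):
    """One Brownian dynamics step for a single coordinate x.

    force: callable returning the systematic force at x.
    """
    drift = diffusion * force(x) * dt / kt
    noise = math.sqrt(2.0 * diffusion * dt) * rng.gauss(0.0, 1.0)
    return x + drift + noise


def first_passage_time(x0, x_open, force, diffusion, dt, kt, max_steps, seed=0):
    """Simulated time at which x first reaches x_open.

    Returns None if the coordinate does not reach x_open within max_steps,
    mirroring a lid that fails to open during the simulated time.
    """
    rng = random.Random(seed)
    x = x0
    for step in range(1, max_steps + 1):
        x = bd_step(x, force, diffusion, dt, kt, rng)
        if x >= x_open:
            return step * dt
    return None


if __name__ == "__main__":
    # Hypothetical environments: a constant force favoring opening versus a
    # stiff harmonic restraint holding the lid closed.
    def favoring(x):
        return 10.0

    def restraining(x):
        return -100.0 * x

    print(first_passage_time(0.0, 5.0, favoring, 1.0, 0.01, 1.0, 10000))
    print(first_passage_time(0.0, 5.0, restraining, 1.0, 0.01, 1.0, 1000))
```

Averaging the first-passage time over many seeds gives a mean opening time. Runs that return None correspond to the cases described above in which the lid does not open within the simulated interval.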
FIG. 3. Part of the α-carbon trace of two crystal structures of the lipase from R. miehei showing the positions of the helical lid in open (black) and closed (gray) forms of the enzyme. In the crystal structure of the open form of the enzyme, an inhibitor is bound in the active site. All non-hydrogen atoms are shown for selected titratable residues (numbered) involved in electrostatic interactions affecting the position of the helical lid. In the open form, Arg-86 in the lid is close to Asp-61.
Chemical modification and inhibition assays. Experimentally, the activity of the R. miehei lipase has been shown (48) to be reduced by chemical modification of arginines and by the addition of guanidine before substrate. Chemical modification was shown to be greater for Arg-86 than for any other arginine in R. miehei. Inhibition by guanidine was not observed when the guanidine was added after addition of substrate, indicating that arginine residues are important only during activation. Inhibition experiments with guanidine also showed reduced activity for Humicola lanuginosa and porcine pancreas lipases, although the reduction was smaller than for R. miehei lipase (70–88% vs. 26% residual activity) (48). Both these enzymes have arginine residues in the active-site lids, although at different positions from Arg-86 in the R. miehei lipase. No reduction in activity was observed for lipases without arginine
in the lid (e.g., that from Candida rugosa). The H. lanuginosa lipase has a glutamic acid (Glu-87) at the position equivalent to Arg-86 in the R. miehei lipase. Molecular dynamics simulations (47) indicate that this residue will make unfavorable electrostatic interactions in the open state that can be removed by neutralizing it. However, experiments show that the Glu-87→Ala mutant has decreased activity (35–70% residual activity) compared with wild type (49, 50). A possibility is that Arg-84 in the H. lanuginosa lipase approximately counters the effects of Glu-87 and makes favorable ionic interactions with the rest of the protein in the open state.
Role of ionic tethering in lipases. There are no strictly conserved charged residues in the lids that act as ionic tethers stabilizing the open form in the presence of a low-dielectric lipid and the closed form in high-dielectric aqueous solution. The lid residues, however, contribute to the different substrate specificities in different lipases. Thus, with the dual requirements of different substrate specificities and interfacial activation, lipases appear to have evolved alternative arrangements of charged residues (often arginines) in the lid to act as ionic tethers to control interfacial activation. Stabilization of the active lipase conformation is brought about, not by the formation of strong hydrogen bonds, but by longer range precise charge-charge interactions, which may permit the enzyme more flexibility for efficient turnover. Such ionic tethers enable the binding of nonpolar substrates by the lipases to be sensitive to the electrostatic properties of their environment and contribute to the phenomenon of interfacial activation.
Cytochrome P450. Crystal structures show that the active site of cytochrome P450cam from Pseudomonas putida is buried in the protein, isolated from the solvent (51).
Data for other cytochrome P450s (52) show that the active site is sometimes isolated from solvent and sometimes has an open channel to the active site lined on one side by a rather mobile F-G helix-loop-helix segment (see Fig. 4A). Clearly, in the case of cytochrome P450cam, protein motions are necessary for the substrate, camphor, to enter the active site. The binding of camphor to cytochrome P450cam can be considered as a two-step process: camphor first diffuses from the outside of the protein to the binding site, and then there is a low- to high-spin transition at the heme iron. Experiments show that the equilibrium constant for the diffusion step, and the accompanying enthalpy and entropy changes, are dependent on the dielectric constant and ionic strength of the surrounding solvent (53). The equilibrium constant is less sensitive to these properties when Asp-251 is mutated to Asn. Asp-251 participates in a tetrad of salt links in the crystal structure that connect the I helix, on which it sits, to the F helix (see Fig. 4A). This finding suggests that Asp-251 may influence substrate binding by participating in ionic tethers that regulate the opening and closing of the substrate access channel, which may involve motion of the F-G helix-loop-helix segment of the protein. Electrostatic calculations of salt-link stability. To obtain an indicator of the energetic cost of perturbing the salt links to Asp-251, we computed their electrostatic contribution to protein folding stability by using a classical electrostatic continuum model (38). On average, the salt links in cytochrome P450cam [and in other proteins for which calculations have been done (34)], are neither stabilizing nor destabilizing. However, we found that the salt links to Asp-251 are exceptionally stable. The only salt links that were more stable were those to the propionate groups of the heme. 
This suggests that cytochrome P450cam has evolved particularly stable salt links to perform functional roles: keeping the heme group bound and regulating the opening and closing of the substrate/product access channel to the active site. Thermal pathway analysis and molecular dynamics simulation. To probe the conformational changes for substrate access to and exit from the active site more explicitly, we performed two types of analysis: thermal pathway analysis and molecular dynamics simulation (54). In thermal pathway analysis, the measured temperature factors in the crystal structures are analyzed to identify flexible regions where ligand channels may open (55). This analysis for cytochrome P450cam indicates three particularly mobile regions of the protein as candidates for ligand channels. Similar regions were located as exit channels for expulsion of camphor from the binding site during the molecular dynamics simulations. Representative trajectories for each of the three main channels are shown in Fig. 4B. These simulations were performed for times of approximately 100 ps. This is orders of magnitude less than the time it would actually take for camphor to escape from the active site. Therefore, simulations were performed with an additional artificial randomly oriented force applied to camphor to improve its sampling and enable it to find an exit channel in a short simulation time (54). The simulations show that perturbation of the salt links to Asp-251 is not necessary for expulsion of substrate from the active site, as their geometry is perturbed in only about half the trajectories generated. Surprisingly, relatively small and localized displacements of protein atoms, involving 0.5– to 2–Å shifts of backbone atoms and rotation
FIG. 4. Ribbon diagram of the crystal structure (51) of cytochrome P450cam with the buried heme and camphor substrate shown in bold. (A) The salt-link tetrad of residues involving Asp-251 is shown in bold. The region where a channel has been proposed, on the basis of crystallographic data (51, 60), to open up to allow ligand access to the active site is indicated. This channel is lined by aromatic residues whose side chains are shown (Tyr-96, Phe-87, and Phe-193). (B) Three representative camphor exit pathways derived by molecular dynamics simulation (54) are shown by thick lines that follow the position of the center of mass of the camphor as it escapes from the active site during the trajectories. The other trajectories simulated are clustered in the vicinity of each of these trajectories.
of a few side chains, are sufficient to permit camphor to escape from the protein. Role of Asp-251 ionic tethers in substrate binding. What then is the reason for the dependence of camphor binding rates on salt links to Asp-251? Comparison of cytochrome P450cam with other cytochrome P450s indicates considerable flexibility in the F-G loop and the opening of a channel next to it as shown in Fig. 4A. This channel is the most often observed of the three classes of exit channel identified in the molecular dynamics simulations. Exit of camphor here during the simulations usually involves either perturbation of the F-G loop and Phe-193 or perturbation of the B' helix with rotation of Phe-87. Further evidence for this ligand channel is data from site-directed mutagenesis, which shows that mutation of Phe-87 to Trp reduces the camphor on-rate, mutation of Phe-193 to Cys increases the off-rate, and the introduction of Cys-Cys tethers to reduce the dynamics of the G helix affects on- and off-rates (56). A possible mechanism to explain the involvement of Asp-251 is that the salt links to Asp-251 regulate slower breathing motions of the protein by tethering the F-G helix-loop-helix to the I helix. These breathing motions allow for opening of the access channel, thus making the entrance and exit of substrate via the channel next to the F-G loop preferred over passage through the other channels identified in the calculations. Evidence for the modulation of the general dynamics of the protein by the salt links to Asp-251 also comes from photoacoustic calorimetry measurements of CO rebinding in cytochrome P450cam (57). Role of ionic tethers in cytochrome P450s. Although residue 251 is conserved as Asp—or Glu—in cytochrome P450s, the salt links to Asp-251 are not conserved in the cytochrome P450s whose structures are available. 
However, in cytochrome P450cryf, Arg-185 in the G helix makes hydrogen bonds to the protein core that could affect the dynamics of the F-G flap in a fashion analogous to the salt links to Asp-251 in cytochrome P450cam. This observation suggests that the control of protein dynamics permitting substrate access to the active site is tuned in each cytochrome P450 according to the particular substrates it acts upon and the efficiency and regio- and stereoselectivity required. In cytochrome P450s that bind large substrates, it may be possible to achieve sufficient desolvation of the catalytic site without the need to isolate the active site from the solvent. In cytochrome P450cam, a mechanism to bury the active site is likely to be crucial to its ability to catalyze a highly regio- and stereospecific reaction with remarkably few uncoupling side reactions.
Insights into Ionic Tethering. The present examples demonstrate a role for ionic tethers in stabilizing an active conformation of an enzyme (lipase) under certain environmental conditions and in regulating deviations from an enzyme’s (cytochrome P450’s) equilibrium structure that affect substrate binding. They make ligand binding sensitive to the electrostatic properties of the protein’s surroundings even when the ligand is nonpolar. The preceding example shows that the way in which ionic tethers affect ligand binding may be rather subtle. How widespread the phenomenon of ionic tethering is in proteins remains to be investigated. Ionic tethers are not involved in all major protein conformational transitions on ligand binding, but they can play a role in the binding of charged as well as uncharged ligands. For example, for sulfate-binding protein, it has been observed that two salt links affect kinetic dissociation rate constants by stabilizing the closed liganded form and modulating the rate of cleft opening (58).
Further experimental and theoretical studies are necessary to fully understand the mechanisms of ionic tethering and its relation to ligand binding to proteins.
We thank Drs. P. Artymiuk and O. Dideberg for provision of coordinate sets. This work was partially supported by the European Union (Biotech CT94–2060). S.K.L. acknowledges an Erwin Schrödinger Fellowship granted by the Austrian Fonds zur Förderung der Wissenschaftlichen Forschung (JO1379-CHE).
1. Davis, M.E., Madura, J.D., Sines, J., Luty, B.A., Allison, S.A. & McCammon, J.A. (1991) Methods Enzymol. 202, 473–497. 2. Tan, R.C., Truong, T.N., McCammon, J.A. & Sussman, J.L. (1993) Biochemistry 32, 401–403. 3. Wade, R.C. (1996) Biochem. Soc. Trans. 24, 254–259. 4. Madura, J.D., Briggs, J.M., Wade, R.C. & Gabdoulline, R.R. (1998) in Encyclopedia of Computational Chemistry, eds. von Rague Schleyer, P., Allinger, N.L., Clark, T., Gasteiger, J., Kollman, P.A. & Schaefer, H.F. (Wiley, Chichester, U.K.), in press. 5. Getzoff, E.D., Cabelli, D.E., Fisher, C.L., Parge, H.E., Viezzoli, M.S., Banci, L. & Hallewell, R.A. (1992) Nature (London) 358, 347–351. 6. McCammon, J.A. (1992) Curr. Biol. 2, 585–586. 7. Shafferman, A., Ordentlich, A., Barak, D., Kronman, C., Ber, R., Bino, T., Ariel, N., Osman, R. & Velan, B. (1994) EMBO J. 13, 3448–3455. 8. Antosiewicz, J., McCammon, J.A., Wlodek, S.T. & Gilson, M.K. (1995) Biochemistry 34, 4211–4219. 9. Antosiewicz, J., Wlodek, S.T. & McCammon, J.A. (1996) Biopolymers 39, 85–94. 10. Schreiber, G. & Fersht, A.R. (1996) Nat. Struct. Biol. 3, 427–431. 11. Gabdoulline, R.R. & Wade, R.C. (1997) Biophys. J. 72, 1917–1929. 12. Albery, W.J. & Knowles, J.R. (1976) Biochemistry 15, 5631–5640. 13. Knowles, J.R. (1991) Nature (London) 350, 121–124. 14. Albery, W.J. & Knowles, J.R. (1976) Biochemistry 15, 5627–5631. 15. Wade, R.C., Gabdoulline, R.R. & Luty, B.A. (1998) Proteins, in press. 16. Blacklow, S.C., Raines, R.T., Lim, W.A., Zamore, P.D. & Knowles, J.R. (1988) Biochemistry 27, 1158–1167. 
17. Sergi, A., Ferrario, M., Polticelli, F., O’Neill, P. & Desideri, A. (1994) J. Phys. Chem. 98, 10554–10557. 18. Argese, E., Viglino, P., Rotilio, G., Scarpa, M. & Rigo, A. (1987) Biochemistry 26, 3224–3228. 19. Christensen, H., Martin, M.T. & Waley, S.G. (1990) Biochem. J. 266, 853–861. 20. Matagne, A. & Frere, J. (1995) Biochim. Biophys. Acta 1246, 109–127. 21. Qi, X. & Virden, R. (1996) Biochem. J. 315, 527–541. 22. Davis, M.E. & McCammon, J.A. (1989) J. Comput. Chem. 10, 386–391. 23. Bourne, Y., Redford, S.M., Steinman, H.M., Lepock, J.R., Tainer, J.A. & Getzoff, E.D. (1996) Proc. Natl. Acad. Sci. USA 93, 12774–12779. 24. Pesce, A., Capasso, C., Battistoni, A., Folcarelli, S., Rotilio, G., Desideri, A. & Bolognesi, M. (1997) J. Mol. Biol. 274, 408–420. 25. Foti, D., Curto, B.L., Cuzzocrea, G., Stroppolo, M.E., Polizio, F., Venanzi, M. & Desideri, A. (1997) Biochemistry 36, 7109–7113. 26. Fisher, C.L., Cabelli, D.E., Tainer, J.A., Hallewell, R.A. & Getzoff, E.D. (1994) Proteins 19, 24–34. 27. Polticelli, F., Bottaro, G., Battistoni, A., Carri, M.T., Djinovic-Carugo, K., Bolognesi, M., O’Neill, P., Rotilio, G. & Desideri, A. (1995) Biochemistry 34, 6043–6049. 28. Raquet, X., Lounnas, V., Lamotte-Brasseur, J., Frere, J.M. & Wade, R.C. (1997) Biophys. J. 73, 2416–2426. 29. Zhou, H.-X. (1996) J. Chem. Phys. 105, 7235–7237. 30. Zhou, H.-X., Briggs, J.M. & McCammon, J.A. (1996) J. Am. Chem. Soc. 118, 13069–13070. 31. Zhou, H.-X., Wong, K.-Y. & Vijayakumar, M. (1997) Proc. Natl. Acad. Sci. USA 94, 12373–12377. 32. McCammon, J.A. & Northrup, S.H. (1981) Nature (London) 293, 316–317. 33. Northrup, S.H., Zarin, F. & McCammon, J.A. (1982) J. Phys. Chem. 86, 2314–2321. 34. Hendsch, Z.S. & Tidor, B. (1994) Protein Sci. 3, 211–226. 35. Wimley, W.C., Gawrisch, K., Creamer, T.P. & White, S.H. (1996) Proc. Natl. Acad. Sci. USA 93, 2985–2990. 36. Vogt, G., Woell, S. & Argos, P. (1997) J. Mol. Biol. 269, 631–643. 37. Waldburger, C.D., Schilbach, J.E. & Sauer, R.T. (1995) Nat. Struct. Biol. 2, 122–128. 38. Lounnas, V. & Wade, R.C. (1997) Biochemistry 36, 5402–5417. 39. Aguilar, C.F., Sanderson, I., Moracci, M., Ciaramella, M., Nucci, R., Rossi, M. & Pearl, L.H. (1997) J. Mol. Biol. 271, 789–802. 40. Pappenberger, G., Schurig, H. & Jaenicke, R. (1997) J. Mol. Biol. 274, 676–683.
41. Brzozowski, A.M., Derewenda, U., Derewenda, Z.S., Dodson, G.G., Lawson, D.M., Turkenburg, J.P., Bjorkling, F., Huge-Jensen, B., Patkar, S.A. & Thim, L. (1991) Nature (London) 351, 491–494. 42. Derewenda, U., Brzozowski, A.M., Lawson, D.M. & Derewenda, Z.S. (1992) Biochemistry 31, 1532–1541. 43. van Tilbeurgh, H., Egloff, M.-P., Martinez, C., Rugani, N., Verger, R. & Cambillau, C. (1993) Nature (London) 362, 814–820. 44. Derewenda, Z.S., Derewenda, U. & Dodson, G.G. (1992) J. Mol. Biol. 227, 818–839. 45. Norin, M., Olsen, O.H., Svendsen, A., Edholm, O. & Hult, K. (1993) Protein Eng. 6, 855–863. 46. Peters, G.H., Olsen, O.H., Svendsen, A. & Wade, R.C. (1996) Biophys. J. 71, 119–129. 47. Peters, G.H., Toxvaerd, S., Olsen, O.H. & Svendsen, A. (1997) Protein Eng. 10, 137–147. 48. Holmquist, M., Norin, M. & Hult, K. (1993) Lipids 28, 721–726. 49. Holmquist, M., Martinelle, M., Berglund, P., Clausen, I.G., Patkar, S., Svendsen, A. & Hult, K. (1993) J. Protein Chem. 12, 749–757. 50. Martinelle, M., Holmquist, M., Clausen, I.G., Patkar, S., Svendsen, A. & Hult, K. (1996) Protein Eng. 9, 519–524. 51. Poulos, T.L., Finzel, B.C. & Howard, A.J. (1987) J. Mol. Biol. 195, 687–700. 52. Hasemann, C.A., Kurumbail, R.G., Boddupalli, S., Peterson, J.A. & Deisenhofer, J. (1995) Structure 3, 41–62. 53. Deprez, E., Gerber, N.C., Di Primo, C., Douzou, P., Sligar, S.G. & Hui Bon Hoa, G. (1994) Biochemistry 33, 14464–14468. 54. Luedemann, S.K., Carugo, O. & Wade, R.C. (1997) J. Mol. Model. 3, 369–374. 55. Carugo, O. & Argos, P. (1998) Proteins, in press. 56. Sligar, S.G. (1995) in Cytochrome P450: Structure, Mechanism and Biochemistry, ed. Ortiz de Montellano, P.R. (Plenum, New York), pp. 83–124. 57. Di Primo, C., Deprez, E., Sligar, S.G. & Hui Bon Hoa, G. (1997) Biochemistry 36, 112–118. 58. Jacobson, B.L., He, J.J., Lemon, D.D. & Quiocho, F.A. (1992) J. Mol. Biol. 223, 27–30. 59. Hodgkin, E.E. & Richards, W.G. (1987) Int. J. Quantum Chem.: Quant. Biol. Symp. 14, 105–110. 60. Raag, R., Li, H., Jones, B.C. & Poulos, T.L. (1993) Biochemistry 32, 4571–4578.
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5950–5955, May 1998 Colloquium Paper This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
Computer simulations of enzyme catalysis: Finding out what has been optimized by evolution
ARIEH WARSHEL† AND JAN FLORIÁN Department of Chemistry, University of Southern California, Los Angeles, CA 90089–1062 ABSTRACT The origin of the catalytic power of enzymes is discussed, paying attention to evolutionary constraints. It is pointed out that enzyme catalysis reflects energy contributions that cannot be determined uniquely by current experimental approaches without augmenting the analysis by computer simulation studies. The use of energy considerations and computer simulations allows one to exclude many of the popular proposals for the way enzymes work. It appears that the standard approaches used by organic chemists to catalyze reactions in solutions are not used by enzymes. This point is illustrated by considering the desolvation hypothesis and showing that it cannot account for a large increase in kcat relative to the corresponding kcage for the reference reaction in a solvent cage. The problems associated with other frequently invoked mechanisms also are outlined. Furthermore, it is pointed out that mutation studies are inconsistent with ground state destabilization mechanisms. After considering factors that were not optimized by evolution, we review computer simulation studies that reproduced the overall catalytic effect of different enzymes. These studies pointed toward electrostatic effects as the most important catalytic contributions. The nature of this electrostatic stabilization mechanism is far from being obvious because the electrostatic interaction between the reacting system and the surrounding area is similar in enzymes and in solution. However, the difference is that enzymes have a preorganized dipolar environment that does not have to pay the reorganization energy for stabilizing the relevant transition states. Apparently, the catalytic power of enzymes is stored in their folding energy in the form of the preorganized polar environment. Enzymatic reactions are involved in the acceleration and control of most biological processes. 
Thus, understanding the origin of the enormous catalytic power of enzymes is one of the important goals in molecular biology. Unfortunately, despite the enormous progress in structural and biochemical studies of enzymes, we still cannot use direct experiments to determine uniquely which factors are most important in enzyme catalysis. It is quite obvious that enzymes reduce the activation free energies of their reactions, but, as will be shown in this work, it does not follow that evolution can do “everything” and that all possible mechanisms (e.g., entropy, strain, dynamic effects, etc.) can provide effective ways of catalyzing enzymatic reactions. Finding out what free-energy factors can help in catalysis is far from trivial because no current experimental technique can provide direct correlation between the structure of an enzyme-substrate complex (ES) and the detailed contributions to its transition state energy. Such correlation can be established, at least in principle, by using computer simulation approaches. This work will address the general problem of enzyme catalysis and the importance of using energy-based considerations for resolving this problem. It will be pointed out that many proposals about the catalytic power of enzymes cannot be addressed in a meaningful way without using the relevant thermodynamic cycles. This point will be illustrated by considering the desolvation hypothesis, showing that desolvation effects do not provide a useful catalytic advantage. We will also review energy considerations and computational studies of other catalytic proposals. Special attention will be given to electrostatic energies, emphasizing that such contributions appear to account for the catalytic effects of all of the enzymes that were examined by consistent computational studies. Finally, the nontrivial nature of the electrostatic catalysis will be discussed. 
It will be pointed out that enzymes stabilize transition states more than water does because their active sites contain dipoles that have been specifically ordered by the protein folding process. The presence of such preoriented dipoles in the enzyme active sites greatly reduces the destabilizing contribution of the so-called “reorganization energy” to the enzyme transition state binding energy.
Establishing the Key Problem in Understanding Enzyme Catalysis. To address the nature of enzyme catalysis, it is crucial to analyze the corresponding energetics in a clear way. In doing so, we start by considering the fact that most enzymes evolved to optimize kcat/Km. As shown in Fig. 1, this evolutionary constraint is equivalent to the requirement of reducing the apparent activation free energy, ∆g, which corresponds to the difference between the energy of ES‡ and E+S. A part of this reduction can be accomplished by binding the parts of the substrate that are far from the reacting region, thus stabilizing ES and ES‡, the activated complex, by the same amount. This binding effect has been obvious for a long time (see, e.g., refs. 1 and 2), but, in most instances, the binding contribution alone could not account for the large observed catalytic efficiencies of enzymes. It also has been obvious for >50 years (3, 4) that enzymes must reduce the activation barrier by interacting differently with the substrate in the ES and ES‡ states. What was not clear, and what is still one of the most fundamental problems in molecular biology, is the actual mechanism for the reduction of the activation barrier and whether this reduction involves ground state destabilization or transition state stabilization.
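The energy bookkeeping described here can be illustrated with a minimal numerical sketch. The free-energy values and the function name `barriers` are hypothetical placeholders chosen for illustration, not data or notation from this work. The sketch only shows that the apparent barrier (the activated complex, ES‡, measured from E+S) is lowered by uniform binding of remote substrate parts, whereas raising the energy of ES alone changes the barrier measured from ES but leaves the apparent barrier untouched.

```python
def barriers(g_free, g_es, g_ts):
    """Return (binding free energy, barrier measured from ES,
    apparent barrier measured from E+S), all in kcal/mol."""
    dg_bind = g_es - g_free       # E+S -> ES
    dg_cat = g_ts - g_es          # ES -> ES‡ (controls kcat)
    dg_apparent = g_ts - g_free   # E+S -> ES‡ (controls kcat/Km)
    return dg_bind, dg_cat, dg_apparent

# Hypothetical free energies (kcal/mol) of E+S, ES, and ES‡.
G_FREE, G_ES, G_TS = 0.0, -5.0, 10.0

reference = barriers(G_FREE, G_ES, G_TS)

# Extra binding of a remote part of the substrate:
# ES and ES‡ are stabilized by the same amount.
uniform_binding = barriers(G_FREE, G_ES - 3.0, G_TS - 3.0)

# Ground state destabilization: only ES is raised.
destabilized = barriers(G_FREE, G_ES + 3.0, G_TS)

for label, result in [("reference", reference),
                      ("uniform binding", uniform_binding),
                      ("ground state destabilization", destabilized)]:
    print(label, result)
```

In this toy model the last case lowers the barrier measured from ES while the apparent barrier stays where it was, which is the sense in which ground state destabilization offers no advantage to an enzyme optimized for kcat/Km.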
†To whom reprint requests should be addressed. e-mail: [email protected]. © 1998 by The National Academy of Sciences 0027–8424/98/955950–6$2.00/0 PNAS is available online at http://www.pnas.org. Abbreviations: OMP, orotidylic acid; ODCase, orotidine monophosphate decarboxylase; ES, enzyme-substrate complex.
FIG. 1. A free-energy profile along a reaction coordinate for the enzymatic reaction in the regime of low substrate concentration. The reaction involves the formation of the ES from the free enzyme (E) and substrate (S) in solution and the formation of the activated complex [(ES)‡]. The profile marks the activation, binding, and apparent activation free energies; Km and kcat/Km are, respectively, the equilibrium constant for the dissociation of the ES complex and the second-order rate constant for the enzyme-catalyzed reaction. The symbol A is the pre-exponential factor in the expression for the rate constant (5). The reason for using the lowercase “g” as a symbol for activation barriers is explained in ref. 5. The figure demonstrates that ∆g is independent of the magnitude of the ground state destabilization (∆∆Gb).
The difficulty of finding logical explanations for the reduction of the activation barrier led‡ to many proposals that can be eliminated in hindsight by considering the evolutionary pressure on enzymes that evolved to optimize kcat/Km. That is, as seen from Fig. 1, no ground state destabilization (∆∆Gb) will help to reduce ∆g. Thus, it is not useful, at least from an evolutionary point of view, to use ground state destabilization mechanisms. This point can be verified easily by going backward in evolution and considering the effect of mutations on ∆g, kcat, and Km (see below). Before examining which mechanisms work and which do not work, it is important to realize that many of the outstanding questions in this field cannot be resolved uniquely by current experimental approaches. That is, enzyme transition states cannot be isolated experimentally and, although indirect experiments are very valuable, they cannot be interpreted without some model for structure-function correlation. 
In addition, it is important to realize that the issue of enzyme catalysis is an energy issue, and, as such, it cannot be resolved without the ability to dissect the observed energy into its individual contributions. Finally, in analyzing the effect of enzymes on the activation barrier, it is essential to focus on the proper reference state, thus avoiding considerations of irrelevant factors. One of the most effective ways of doing so involves comparison of the given assumed mechanism in the enzyme active site with the same mechanism in a solvent cage, where all of the reactants are at a contact distance (5) (Fig. 2). This definition allows one to avoid the rather trivial question associated with bringing the reactants to the same solvent cage (5) and to focus on the origin of the difference between kcat and kcage. In other words, such an analysis forces one to focus on the true reason why kcat is much larger than kcage. The rate constant kcage can be evaluated from experimental information about elementary reactions in solutions (5, 6) and/or ab initio calculations in solution (7, 8), but such studies are not practiced by most workers in the field, in part because of the difficulties in estimating the energetics of some reaction intermediates in aqueous solution and the frequent reluctance to ask quantitative questions about energetics. Thus, in many cases, the discussion of the catalytic power of enzymes overlooks the most important question: How large is the effect of the enzyme environment? Instructive works documented the large acceleration of the reaction rate in different enzymes (9, 10) by comparing kcat/Km to the second-order rate constant in water. 
However, such a comparison includes the effect of the binding energy (∆Gbind) and does not tell us about the effect of the enzyme environment on the activation barrier. For example, our recent analysis of the catalytic reaction of ribonuclease (T. M. Glennon and A.W., unpublished work) indicated that this enzyme provides transition state stabilization as large as 24 kcal/mol. This fact (which is not mentioned in the vast literature about ribonucleases) presents a major theoretical challenge because it is hard to see how simple environmental effects can lead to such a large free-energy change. Trying to address such problems quantitatively forces one to quantify the effects of different catalytic factors and to offer a concrete explanation for the overall reduction of the activation barrier.
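To put such free-energy changes on a rate scale, the following minimal sketch converts a barrier reduction into a rate ratio and back. It assumes a simple exponential (transition-state-theory-like) rate expression with the same pre-exponential factor in the enzyme and in the solvent cage and a temperature of 298.15 K; these assumptions and the function names are illustrative and are not taken from this work.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def rate_ratio(barrier_reduction_kcal, temperature_k=298.15):
    """Rate ratio (e.g., kcat/kcage) implied by a barrier reduction,
    assuming identical pre-exponential factors (an assumption)."""
    return math.exp(barrier_reduction_kcal / (R_KCAL * temperature_k))

def barrier_reduction(ratio, temperature_k=298.15):
    """Barrier reduction (kcal/mol) implied by a given rate ratio."""
    return R_KCAL * temperature_k * math.log(ratio)

# Transition state stabilization of 24 kcal/mol quoted for ribonuclease.
ribonuclease_factor = rate_ratio(24.0)
print(f"24 kcal/mol corresponds to a rate factor of ~{ribonuclease_factor:.1e}")

# Conversely, a thousandfold rate increase needs only a few kcal/mol.
print(f"A factor of 1e3 corresponds to {barrier_reduction(1e3):.2f} kcal/mol")
```

Under these assumptions, 24 kcal/mol corresponds to a rate factor of more than seventeen orders of magnitude, which gives a sense of why such a value is a demanding target for any proposed mechanism.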
FIG. 2. A comparison of the free-energy profiles for an enzymatic reaction and for a reaction proceeding via an identical mechanism in a reference solvent cage. The symbols E, S, and Saq designate the enzyme, the substrate, and the substrate in the bulk solvent, respectively. The activation free energy of the cage reaction corresponds to the same reaction mechanism assumed for the given enzymatic reaction (i.e., it does not necessarily correspond to the actual reaction in solution). This barrier can be determined by using experimental information for the related elementary reaction(s) or by using ab initio calculations.
‡Trying, for example, to explain the differential binding of the ground state and the transition state by van der Waals interactions can be accomplished only by invoking the repulsive part of these forces. Consequently, these forces can be involved only in ground state destabilization effects (which eventually were found to be inconsistent with the flexibility of proteins). As far as the van der Waals attraction between the enzyme and substrate is concerned, it is very similar for the ground and transition states. This insensitivity to the exact structure is caused by the nature of the London dispersion forces that are approximately proportional to the number of interacting atoms. Similarly, hydrophobic forces cannot provide large differential binding for the ground and transition state of the reacting fragments. Finally, even electrostatic effects, which do contribute to the transition state stabilization, accomplish this stabilization in a complex way (see below) that was not realized by early workers in the field.
Finding Out What Was Not Optimized by Evolution: Desolvation and Other Proposals. Numerous seemingly reasonable proposals for the origin of the catalytic power of enzymes have appeared in the literature (see, e.g., refs. 1 and 11–23). However, some of these proposals turn out to be ineffective after one considers the relevant thermodynamic cycle and when one uses computer simulation approaches. As an example of this point, we will first consider the desolvation hypothesis. This hypothesis, as introduced by Cohen et al. (13) and Crosby et al. (14), suggested that enzyme active sites become basically nonpolar after the removal of water molecules and that such nonpolar sites help in accelerating enzymatic reactions. Realizing that a polar substrate would not bind in a nonpolar cavity, these authors pointed out that the binding energy of the nonreacting part of the substrate could be used as a driving force for the ground state destabilization by desolvation. By comparing activation energies measured for the reaction catalyzed by lyophilized hydrogenase in the dry state and in solution (24), they estimated the desolvation contribution to kcat to be ≥10³. Later, Jencks (1, 25) included the desolvation mechanism in the family of enzyme mechanisms that are based on the ground state destabilization concept. The “geometric destabilization” (strain, substrate distortion) and “induced destabilization” were considered to belong in the same class. The common denominator of these mechanisms was that they increased kcat by destabilizing the ground state (ES in Figs. 1 and 2) for the given enzymatic reaction (1, 25). Recently, large desolvation contributions to kcat were predicted by quantum mechanical calculations (26–29) that considered enzymatic reactions involving reactive ionized groups in their ground state. 
This finding led several research groups to the proposal that enzyme proficiency primarily is caused by the nonpolar enzyme active sites evolved to stabilize the gas-phase transition states (26–29). Although it is possible that small ground state destabilization accompanied by the strong binding of a distant part of a substrate could in principle increase kcat by up to three orders of magnitude, it is unclear how the ground state destabilization could lead to an increase of kcat/Km and, consequently, to any evolutionary advantage. In addition to this general point, which pertains to all ground state destabilization mechanisms, there are problems related specifically to the desolvation hypothesis. These points were established clearly by the quantitative analysis of the reaction profile of amide hydrolysis in the gas phase, in aqueous solution, and in the enzyme (30). Here, we reiterate general problems associated with the idea of catalysis by nonpolar enzyme active sites and illustrate these concepts for the particular case of decarboxylation of orotidylic acid (OMP) to uridylic acid by orotidine monophosphate decarboxylase (ODCase) (28) and the hydrolysis of alkyl halides by haloalkane dehalogenase (29). To explain the enormous proficiency of ODCase (10), which participates in the biosynthesis of pyrimidine nucleotides, Lee and Houk suggested that the enzymatic reaction involves a transformation from a ground state composed of a lysine+–OMP– ion pair to a neutral transition state (Fig. 3) “in a nonpolar enzyme environment” (28). In the absence of the experimental structure of this enzyme, the proposal of Lee and Houk was based on the results of ab initio calculations that modeled the enzyme environment by a dielectric continuum model. 
To be more specific, they found that the decarboxylation reaction is barrierless in the gas phase and that the experimentally observed magnitude of kcat is reproduced by a model that considers the enzyme as a uniform medium with a dielectric constant, ε=4 (Fig. 4). Consequently, they concluded that the ODCase works by providing a nonpolar environment for the decarboxylation reaction.
FIG. 3. The mechanism for the enzymatic decarboxylation of the OMP suggested by Lee and Houk (28). The ground and transition states for this reaction are denoted as EH+S– and ESH, respectively. The –NH3+ and –NH2 groups belong to the catalytic lysine residue in the hypothetical nonpolar active site of ODCase.
FIG. 4. An inconsistent free-energy diagram for the decarboxylation of OMP by ODCase in hypothetical active sites characterized by different values of the dielectric constant (ε). Note also that, in deriving this diagram, Lee and Houk modeled the EH+S– ground state as EH+ +S– (infinitely separated ions).
Unfortunately, the energy of the EH+S– state in ε=1 and ε=4 is not identical at all to the corresponding energy at ε=78.§ For example, the gas-phase energy should be pushed up by the corresponding absolute value of the solvation free energy. Considering the uninteracting enzyme and substrate molecules in aqueous solution as a correct reference state (or simply using any single correct reference state), one obtains a qualitatively different reaction profile (Fig. 5). Here, the highest barrier corresponds to the formation of the R–NH3+ +S– ion pair in a vacuumlike environment. This barrier and the related barrier at ε=2 reflect the fact that ion pairs are less stable in a nonpolar environment (32). In addition, the uncharged transition state (EHS) now has the same energy in all three environments. Thus, the transition state stabilization of Fig. 4 disappears. Furthermore, the proposal of Lee and Houk (28) is undermined by the unrealistically low ε needed to obtain the experimental kcat (Fig. 5) and the fact that the NH3+ group will be deprotonated in such a nonpolar environment. This unprotonated alternative (NH2 +S–) reference state is denoted in Fig. 5 as E+S–. In fact, even the NH3+S– ion pair will become a neutral pair (NH2SH) in a nonpolar environment (for the sake of simplicity, this neutral pair is not shown in Fig. 5). At any rate, the enormous catalytic effect found by Lee and Houk will disappear once fully consistent thermodynamic and electrostatic considerations are invoked.
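The reference-state argument can be made concrete with a minimal continuum-electrostatics sketch. It uses the simple Born expression for a single spherical ion and Coulomb's law for a point-charge pair. The ionic radius of 2.0 Å is a hypothetical value chosen only for illustration, and this is not the PCM calculation used for Fig. 5. The point is merely that the same ion sits at very different energies in ε=1, ε=4, and ε=78 once all states are referred to water.

```python
COULOMB = 332.06  # Coulomb constant in kcal*Angstrom/(mol*e^2)

def born_solvation(charge, radius, eps):
    """Born free energy (kcal/mol) of moving a spherical ion of the given
    charge (e) and radius (Angstrom) from vacuum into a dielectric eps."""
    return -0.5 * COULOMB * charge**2 / radius * (1.0 - 1.0 / eps)

def coulomb_energy(q1, q2, distance, eps=1.0):
    """Coulomb interaction (kcal/mol) of two point charges in a uniform medium."""
    return COULOMB * q1 * q2 / (eps * distance)

RADIUS = 2.0  # hypothetical ionic radius in Angstrom (assumption)

# Energy of one monovalent ion in each medium, relative to the same ion in water.
relative_to_water = {
    eps: born_solvation(1.0, RADIUS, eps) - born_solvation(1.0, RADIUS, 78.0)
    for eps in (1.0, 4.0, 78.0)
}
for eps, energy in relative_to_water.items():
    print(f"eps = {eps:>4}: {energy:6.1f} kcal/mol above water")

# Gas-phase attraction of a +1/-1 pair held 6 Angstrom apart.
pair_gas = coulomb_energy(+1.0, -1.0, 6.0)
print(f"ion pair at 6 A in vacuum: {pair_gas:.1f} kcal/mol")
```

With these placeholder inputs the ion lies far above its aqueous energy in vacuum and still well above it at ε=4, so drawing the ground state at the same level in all three media cannot be correct. The Coulomb value for the 6 Å pair is close to the −55 kcal/mol gas-phase interaction energy quoted in the caption of Fig. 5.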
FIG. 5. A consistent free-energy diagram for the decarboxylation of OMP by ODCase. Note that the free energies of the ground state (EH+S–) in each environment involve the contribution of the corresponding solvation free energies. [The magnitude of this contribution was determined by the polarized continuum model (PCM HF/6–31G*) implemented in the Gaussian 94 program (59). The default Pauling’s atomic van der Waals radii scaled by 1.2 were used. The calculated solvation free energies for orotate, CH3NH3+, carbene-methylamine complex, and CO2 were –68, –76, –13.8, and –1.3 kcal/mol, respectively. For the EH+S– state, in which the orotate and CH3NH3+ ions were assumed to lie 6 Å apart, solvation free energy was estimated by using the generalized Born formula, in which the gas phase interaction energy of –55 kcal/mol was assumed. In addition, experimental binding free energy of the ODCase-OMP complex of –9 kcal/mol was taken into account.] As is clear from the figure, the ground state energies are very different at different values of ε (in contrast to the free-energy diagram presented in Fig. 4). Now the transition state energy is nearly identical in different environments, and the experimental activation barrier of 16 kcal/mol (the value corresponding to the observed kcat) is obtained for ε as low as 1.5 (this value is unrealistic and will not support an ionized NH3+ group).
Without having the crystal structure of this protein, we can only point out major inconsistencies in the model of Lee and Houk; we cannot show how the enzyme stabilizes its transition state in this specific case. However, every case that we have examined (see also below) was found to involve transition state stabilization by a very polar active site. In this respect, it is very instructive to consider the pioneering work of Lienhard and coworkers (14), who introduced the desolvation idea in their study of pyruvate decarboxylase. 
These authors postulated that the active site of the enzyme must be nonpolar, and this view was later adopted (1) as a major support of the desolvation hypothesis. Recently, the crystal structure of pyruvate decarboxylase (33) revealed that its active site contains many polar residues. In particular, the substrate binding site is located in the region of the contact of the a and b domains of the protein and contains the Glu91, His92, Cys221, and Cys222 residues. Although a part of the thiamin cofactor for this reaction is surrounded by nonpolar amino acid residues, the activation of this cofactor occurs via general base catalysis involving a water molecule and several polar residues. Moreover, the current mechanism of the substrate activation by the cofactor (34), which is supported by the available structural information (33), differs from the mechanism using the formation of the neutral pyruvic acid in the nonpolar environment, as suggested by Crosby et al. (14). Although the overall pyruvate decarboxylation reaction involves several steps for which the structures of the relevant intermediates in the enzyme have not yet been determined by x-ray crystallography, it is clear that the enzyme active site is very heterogeneous, with numerous preoriented polar and ionic groups. Another theoretical study that invoked the desolvation hypothesis is a recent investigation of the SN2 displacement of Cl– from dichloroethane in the active site of haloalkane dehalogenase by Bruice and coworkers (29). In this case, unlike the ODCase discussed above, it was possible to base the theoretical analysis on the known crystal structures (at 1.9- and 2.4-Å resolution, respectively) of this enzyme (35) and its complex with dichloroethane (36). The assumed reaction mechanism involves the nucleophilic attack of the ionized aspartate residue (Asp124) on the carbon atom of the substrate, which results in the displacement of the Cl– ion and formation of the alkyl ester intermediate (Fig. 
6). In the next reaction step, the product (chloroethanol) is formed by the nucleophilic attack of the water molecule on the alkyl intermediate, recovering the aspartate residue. The nucleophilic Asp124 is, together with the nearby His289 and Asp260 residues, situated in a cavity that also contains two Trp residues that form hydrogen bonds with the displaced chloride ion. The structure of the corresponding transition state was calculated by the semiempirical PM3 method for the reaction occurring in the gas phase, in a dielectric continuum (ε=80), and in the enzyme active site (29). The enzyme active site was modeled by 14 amino acids, including the Trp125 and Trp175 residues, and a catalytic water molecule. As in the case of ODCase, the nonenzymatic reaction (37) was calculated to be extremely slow in aqueous solution, and the gas-phase reaction was found to be very fast. Moreover, the geometry and the relative energy of the transition state inside the enzyme active site were found to be similar to the transition state energy and geometry obtained at the same computational level for the gas-phase reaction. Lightstone et al. (29) concluded that the hydrogen bonds of two Trp residues are important for stabilizing the transition state [note that such hydrogen bonds are the dipoles considered in our early studies (5, 22)]. They also pointed out the importance of small reorganization energy that will be discussed below. However, reviving the arguments of Dewar and Storch (26), Lightstone et al. (29) also concluded that the enzyme operates by a desolvation mechanism, destabilizing the reactants in a gas-phase-like environment.
FIG. 6. The proposed (29, 36) mechanism for the enzymatic dehalogenation of dichloroethane. The ground and transition states for this reaction are denoted as ES and (ES)‡, respectively. The –COO– group belongs to the catalytic Asp residue in the active site of haloalkane dehalogenase (29, 36).
Although the above proposal contains correct elements, it has several problems. First, a detailed examination of the active site of haloalkane dehalogenase reveals a very polar environment at the chemically relevant sites.¶ In fact, this environment is already obvious from inspection of the x-ray structure, where two dipoles (two hydrogen bonds)
§Note that using macroscopic concepts in describing electrostatic effects in proteins is an inadequate approach (see, e.g., ref. 31). For example, the use of a uniform dielectric constant is inadequate as a measure of the polarity of the enzyme active site because an active site containing fixed dipoles is polar, but its dielectric constant can be rather small. However, Lee and Houk (28) clearly meant the low dielectric active site to be a homogeneous, nonpolar environment analogous to the nonpolar solvents used by organic chemists to accelerate chemical reactions. Here, we will invoke macroscopic concepts (i.e., dielectric constants) just to show that enzyme active sites do not use such an environment.
are provided by Trp125 and Trp175, two dipoles by the main chain peptide bonds of Glu56 and Trp125, and two by the end of the α-helix (36). Second, the same energy considerations applied above for ODCase show that the desolvation hypothesis does not work in the present case. To be more specific, if the active site of haloalkane dehalogenase really were nonpolar, the nucleophile Asp124 would not be ionized in the ground state. An attack by the neutral Asp on dichloroethane [not considered by Lightstone et al. (29)], which is possible in principle, would involve a zwitterionic transition state and, as such, would proceed more slowly in a nonpolar than in a polar environment. Third, the low activation barrier calculated for the gas-phase reaction reflects the assumption that the ground state of this system involves the ionized Asp124. Although Asp124 probably is ionized in the real enzyme, using an ionized nucleophile in the gas-phase calculation amounts to starting the reaction from a high energy intermediate (30). Finally, the correct inclusion of some of the active site dipoles (e.g., the Trp125 and Trp175 dipoles) amounts to a study of the given reaction in a polar environment and not in the gas phase. Saying that the enzyme works by using the gas-phase environment plus hydrogen bonds to stabilize the transition state is problematic. In the same way, one might include the entire enzyme explicitly while neglecting the solvent and state that the system works like a gas-phase system.
Apparently, the desolvation hypothesis reflects the realization that organic chemists can accelerate chemical reactions by moving them from polar to nonpolar solvents (thus, one would assume that enzymes can accomplish the same trick). However, what is missing in this analysis is the energy of moving the reactants from a test tube with a polar solvent to the test tube with a nonpolar solvent.
This energy contribution is equivalent to the energy lost when the substrate is moved from water to the hypothetical nonpolar active site. Another point that is overlooked is the absence of any evolutionary pressure to use binding energy in distant regions to destabilize the reacting parts in their ground state. To be more specific, one may design a catalytic antibody that will gain some increase in kcat by a desolvation mechanism, but enzymes do not have the incentive to do so (this mechanism will not help in increasing kcat/Km). Furthermore, as was mentioned above, enzymes cannot destabilize the ionized form of natural amino acids and still use them as nucleophiles because such ionized groups will become un-ionized in nonpolar sites.
In addition to the desolvation hypothesis, there are other proposed mechanisms for reducing the activation free energy that can be excluded by using energy considerations and computer simulations. These include the following mechanisms: (i) The strain mechanism (1, 11). This mechanism involves ground state destabilization caused by steric strain. However, force field calculations (5, 39, 40) have established that such a mechanism cannot account for a large reduction in the activation free energy. (ii) The orbital steering mechanism (16). Such a mechanism requires that the approach of the reacting molecules is restricted to a very narrow angular range by a steep potential energy surface in the direction perpendicular to the reaction coordinate. This proposal can be excluded once the actual energetics are estimated (5, 41). (iii) The low-barrier hydrogen bond mechanism (20, 21). This mechanism implies that hydrogen bonds catalyze reactions by forming partial covalent bonds to the corresponding transition state. This proposal has been shown to be anticatalytic (relative to the corresponding regular hydrogen bond) by considering the relationship between the solvation energy and charge delocalization (42).
(iv) The idea that enzyme catalysis involves significant dynamic effects (17, 18) has been excluded by computer simulation studies (5, 43). Other proposals, such as the idea that ground state destabilization by entropic effects is the origin of the reduction in the activation free energy (1, 12), cannot be excluded yet by computer simulation studies because of convergence problems. However, qualitative estimates (5) do not support this idea (see also ref. 44). Of interest, none of the suggestions that involve ground state destabilization are supported by mutation experiments. That is, the ground state destabilization mechanisms require that at least some of the mutations that change the activity of the enzyme in a significant way will involve large ground state stabilization (this means that the native enzymes evolved by using the specific group to induce ground state destabilization). However, mutations that lead to a large loss in catalysis (see, e.g., refs. 45–47) involve mainly either a large reduction of kcat accompanied by small changes in Km (small changes in the ground state free energy) or an increase in Km with no change in kcat (equal destabilization of the ground and transition states).
Using Computer Modeling To Determine How Enzymes Really Work. As clarified in the previous section, it is possible to exclude some major catalytic proposals by simple energy considerations. Doing so, however, is not sufficient when one wishes to determine the real origin of enzyme catalysis. It seems to us that the only way of resolving this issue is to take crystal (or solution) structures of different enzymes and to reproduce the observed reduction in activation free energy. Once this is accomplished, it is simple to examine which energy contributions are responsible for the overall effect. The selection of the proper computational strategy is not completely obvious. In principle, one can use the hybrid quantum mechanical/molecular mechanics approach introduced by Warshel and Levitt (40).
However, despite the recent popularity of this approach (48–50), it does not yet provide sufficiently quantitative answers. The problems are that (i) current semiempirical molecular orbital models are not accurate enough; (ii) most quantum mechanical/molecular mechanics approaches do not properly treat the complete enzyme-substrate environment; and (iii) ab initio quantum mechanical/molecular mechanics approaches are too time consuming and, with the exception of one approach (51), do not involve calculation of activation free energies. At present, the most effective and consistent way of using computer simulations in studies of enzyme catalysis is provided by the empirical valence bond method (5, 52, 53). This method does not try to evaluate the energy surface of the substrate from first principles but rather focuses on the change in this free energy on moving from the reference solvent cage to the enzyme active site. Thus, this method focuses directly on the difference between the activation free energies in the enzyme and in solution. Extensive empirical valence bond studies of many enzymatic reactions have been reported in the literature (5, 6, 53–56). Many of these simulation studies provided quantitative or semiquantitative results, frequently reproducing the overall catalytic effect of the enzyme. The most significant finding of all these studies is that the largest catalytic effect always is associated with electrostatic contributions. In other words, the electrostatic stabilization of the transition state is larger in the enzyme active site than in water.
The finding that enzymes provide large electrostatic stabilization is far from trivial and, in fact, seems at first sight to be inconsistent with all studies before the emergence of computer modeling. For example, studies of model compounds in solution have not reproduced large electrostatic effects even with covalently linked ionized groups that are aligned properly to stabilize ionic transition states (57).
¶The statement that the active site is polar may sound unreasonable in view of the fact that the active site was found to contain mostly hydrophobic residues (35). However, considering the residue type without proper computational tools for structure-function correlation may be quite deceiving. As is clear now to most workers who are involved in studies of electrostatic effects in proteins (see, e.g., refs. 31 and 38), even hydrophobic residues have very polar main-chain dipoles. Thus, the main-chain dipoles of hydrophobic groups and the side-chain dipoles of a few selected residues are sufficient to create high polarity in the proper places. Whether the active site environment is polar therefore should be determined by calculating the interaction energy of this environment with the particular substrate and not by counting amino acids.
This fact can be rationalized by saying that electrostatic effects cannot be large in aqueous solution because the dielectric constant is large in such an environment even at a short interaction distance (22). It thus can be argued that protein active sites with low dielectric constant should be able to enhance electrostatic effects (58). However, ionized groups that were supposed to be the source of electrostatic effects in enzymes would not be ionized in low dielectric sites. Of course, the argument that electrostatic effects are small in high dielectric environments and cannot exist in low dielectric environments (which is the best that could have been concluded before the emergence of crystal structures of enzymes) is not correct. Protein active sites are neither homogeneous low dielectric nor homogeneous high dielectric media. They are usually very polar heterogeneous sites (22, 31). This fact, however, cannot explain the finding that protein active sites provide larger stabilization to ionic transition states than water does. Here, the most puzzling problem is associated with the fact that the actual average electrostatic interaction between the transition states of an enzyme and the surrounding dipoles is not larger than the corresponding interaction with the dipoles of the reference solvent cage. The solution to this problem has been given before in a work (22) that pointed out that, in polar solvents, about half of the charge-dipole interaction (∆VQµ) is spent on dipole-dipole interaction (∆Vµµ) so that
∆Gsol ≈ ∆VQµ + ∆Vµµ ≈ ½∆VQµ
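The bookkeeping behind this statement can be made concrete with a toy calculation. The sketch below assumes the linear-response picture described in the text, in which the dipole-dipole penalty is about half the size of the charge-dipole interaction and opposite in sign. The function name and the `prepaid_fraction` parameter are hypothetical devices introduced here to represent how much of that penalty has already been paid before the charge appears (zero for a freely reorienting solvent, larger for the preoriented protein dipoles discussed in the next paragraph). The energies are illustrative, not computed values.

```python
def solvation_free_energy(v_q_mu, prepaid_fraction=0.0):
    """Toy linear-response estimate of the solvation free energy.

    v_q_mu           -- charge-dipole interaction energy (negative = stabilizing)
    prepaid_fraction -- hypothetical fraction (0..1) of the dipole-dipole
                        penalty already paid before the charge is formed
    Returns the free energy in the same units as v_q_mu.
    """
    if not 0.0 <= prepaid_fraction <= 1.0:
        raise ValueError("prepaid_fraction must lie between 0 and 1")
    # Dipole-dipole penalty: half the charge-dipole term, with opposite sign.
    v_mu_mu = -0.5 * v_q_mu
    # Only the part of the penalty that is not prepaid is charged to the reaction.
    return v_q_mu + (1.0 - prepaid_fraction) * v_mu_mu


if __name__ == "__main__":
    v_q_mu = -100.0  # illustrative value, kcal/mol
    for label, fraction in [("water-like cage", 0.0), ("partly preorganized site", 0.5)]:
        dg = solvation_free_energy(v_q_mu, fraction)
        print(f"{label}: dG_sol = {dg:.1f} kcal/mol")
```

The point of the sketch is that the same charge-dipole interaction can yield a more negative solvation free energy when the environment does not have to pay the full reorientation cost during the reaction, which is how an active site can stabilize a transition state better than water without interacting with it more strongly.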
In proteins, however, a significant part of ∆Vµµ (or the corresponding reorganization energy) is already paid during the folding process, in which the folding energy is used to compensate for dipole-dipole repulsion and to align the active site dipoles in a way that will maximize ∆Gsol. The preoriented environment allows the protein to minimize the reorganization energy associated with the formation of the charged transition state (see Figure 9.7 of ref. 5).
The idea of preorganized dipoles as the source of the catalytic power of enzymes explains what really was done by evolution in optimizing enzymes. That is, the active site has to interact with the changes that occur in the substrate during the formation of the transition state to reduce the activation free energy. The structural changes of the substrate during the given reaction could not be used effectively because enzymes are rather flexible; changes in entropy were not useful because they can only help in ground state destabilization. Thus, the most effective way was to interact with the changes in charges during the reaction. Here, the requirement of providing a large stabilization to the transition state charges forced the enzyme to be a better “solvent” for the transition state than aqueous solution (22), which can be accomplished by preorienting the active site dipoles. Note that the resulting environment is exactly the opposite of the nonpolar environment envisioned as a source of catalytic energy in the desolvation models above. Finally, the difficulty in realizing that enzymes use preorganized dipoles can be traced in part to the fact that the catalytic energy is already invested in the folding process and not in the enzyme-substrate interaction.
This work was supported by National Institutes of Health Grant GM24492.
1. Jencks, W.P. (1987) Catalysis in Chemistry and Enzymology (Dover, New York).
2. Fersht, A.R. (1985) Enzyme Structure and Mechanism (Freeman, New York).
3. Haldane, J.B.S. (1930) Enzymes (Longman, New York), p. 182.
4. Pauling, L. (1946) Chem. Eng. News 24, 1375.
5. Warshel, A. (1991) Computer Modeling of Chemical Reactions in Enzymes and Solutions (Wiley, New York).
6. Fuxreiter, M. & Warshel, A. (1998) J. Am. Chem. Soc. 120, 183–194.
7. Florián, J. & Warshel, A. (1997) J. Am. Chem. Soc. 119, 5473–5474.
8. Florián, J. & Warshel, A. (1998) J. Phys. Chem. B 102, 719–734.
9. Radzicka, A. & Wolfenden, R. (1996) J. Am. Chem. Soc. 118, 6105–6109.
10. Radzicka, A. & Wolfenden, R. (1995) Science 267, 90–93.
11. Phillips, D.C. (1966) Sci. Am. 215, 78–90.
12. Page, M.I. (1977) Angew. Chem. Int. Ed. Engl. 16, 449–459.
13. Cohen, S.G., Vaidya, V.M. & Schultz, R.M. (1970) Proc. Natl. Acad. Sci. USA 66, 249–256.
14. Crosby, J., Stone, R. & Lienhard, G.E. (1970) J. Am. Chem. Soc. 92, 2891–2900.
15. Menger, F.M. (1985) Acc. Chem. Res. 18, 128–134.
16. Storm, D.R. & Koshland, D.E. (1970) Proc. Natl. Acad. Sci. USA 66, 445–452.
17. Careri, G., Fasella, P. & Gratton, E. (1979) Annu. Rev. Biophys. Bioeng. 8, 69–97.
18. Gavish, B. & Werber, M.M. (1979) Biochemistry 18, 1269–1275.
19. McCammon, J.A., Wolynes, P.G. & Karplus, M. (1979) Biochemistry 18, 927–942.
20. Frey, P.A., Whitt, S.A. & Tobin, J.B. (1994) Science 264, 1927–1930.
21. Cleland, W.W. & Kreevoy, M.M. (1994) Science 264, 1887–1890.
22. Warshel, A. (1978) Proc. Natl. Acad. Sci. USA 75, 5250–5254.
23. Cha, Y., Murray, C.J. & Klinman, J.P. (1989) Science 243, 1325–1330.
24. Yagi, T., Tsuda, M., Mori, Y. & Inokuchi, H. (1969) J. Am. Chem. Soc. 91, 2801–2807.
25. Jencks, W.P. (1975) in Advances in Enzymology and Related Areas of Molecular Biology, ed. Meister, A. (Wiley, New York), Vol. 43, pp. 219–410.
26. Dewar, M.J.S. & Storch, D.M. (1985) Proc. Natl. Acad. Sci. USA 82, 2225–2229.
27. Dewar, M.J.S. & Dieter, K.M. (1988) Biochemistry 27, 3302–3308.
28. Lee, J.K. & Houk, K.N. (1997) Science 276, 942–945.
29. Lightstone, F.C., Zheng, Y.J., Maulitz, A.H. & Bruice, T.C. (1997) Proc. Natl. Acad. Sci. USA 94, 8417–8420.
30. Warshel, A., Åqvist, J. & Creighton, S. (1989) Proc. Natl. Acad. Sci. USA 86, 5820–5824.
31. Warshel, A. & Russell, S.T. (1984) Q. Rev. Biophys. 17, 283–421.
32. Warshel, A. (1981) Biochemistry 20, 3167–3177.
33. Arjunan, P., Umland, T., Dyda, F., Swaminathan, S., Furey, W., Sax, M., Farrenkopf, B., Gao, Y., Zhang, D. & Jordan, F. (1996) J. Mol. Biol. 256, 590–600.
34. Metzler, D.E. (1977) Biochemistry (Academic, New York).
35. Verschueren, K.H.G., Franken, S.M., Rozenboom, H.J., Kalk, K.H. & Dijkstra, B.W. (1993) J. Mol. Biol. 232, 856–872.
36. Verschueren, K.H.G., Seljee, F., Rozenboom, H.J., Kalk, K.H. & Dijkstra, B.W. (1993) Nature (London) 363, 693–698.
37. Maulitz, A.H., Lightstone, F.C., Zheng, Y.-J. & Bruice, T.C. (1997) Proc. Natl. Acad. Sci. USA 94, 6591–6595.
38. Sharp, K.A. & Honig, B. (1990) Annu. Rev. Biophys. Biophys. Chem. 19, 301–332.
39. Levitt, M. (1974) in Peptides, Polypeptides and Proteins, eds. Blout, E.R., Bovey, F.A., Goodman, M. & Lotan, N. (Wiley, New York), pp. 99–113.
40. Warshel, A. & Levitt, M. (1976) J. Mol. Biol. 103, 227–249.
41. Bruice, T.C., Brown, A. & Harris, D.O. (1971) Proc. Natl. Acad. Sci. USA 68, 658–661.
42. Warshel, A. & Papazyan, A. (1996) Proc. Natl. Acad. Sci. USA 93, 13665–13670.
43. Warshel, A., Sussman, F. & Hwang, J.-K. (1988) J. Mol. Biol. 201, 139–159.
44. Lightstone, F.C. & Bruice, T.C. (1996) J. Am. Chem. Soc. 118, 2595–2605.
45. Wilks, H.M., Hart, K.W., Feeney, R., Dunn, C.R., Muirhead, H., Chia, W.N., Barstow, D.A., Atkinson, T., Clarke, A.R. & Holbrook, J.J. (1988) Science 242, 1541–1544.
46. Leatherbarrow, R.J., Fersht, A.R. & Winter, G. (1985) Proc. Natl. Acad. Sci. USA 82, 7840–7844.
47. Carter, P. & Wells, J.A. (1990) Proteins 7, 335–342.
48. Waszkowycz, B., Hillier, I.H., Gensmantel, N. & Payling, D.W. (1990) J. Chem. Soc. Perkin Trans. 2, 1259–1264.
49. Bash, P.A., Field, M.J., Davenport, R.C., Petsko, G.A., Ringe, D. & Karplus, M. (1991) Biochemistry 30, 5826–5832.
50. Mulholland, A.J., Grant, G.H. & Richards, W.G. (1993) Protein Eng. 6, 133–147.
51. Bentzien, J., Muller, R.P., Florián, J. & Warshel, A. (1998) J. Phys. Chem. B 102, 2293–2301.
52. Warshel, A. & Weiss, R.M. (1980) J. Am. Chem. Soc. 102, 6218–6226.
53. Åqvist, J. & Warshel, A. (1993) Chem. Rev. (Washington, D.C.) 93, 2523–2544.
54. Åqvist, J. & Fothergill, M. (1996) J. Biol. Chem. 271, 10010–10016.
55. Fothergill, M., Goodman, M.F., Petruska, J. & Warshel, A. (1995) J. Am. Chem. Soc. 117, 11619–11627.
56. Hwang, J.K. & Warshel, A. (1996) J. Am. Chem. Soc. 118, 11745–11751.
57. Dunn, B.M. & Bruice, T.C. (1973) Adv. Enzymol. Relat. Areas Mol. Biol. 37, 1–60.
58. Fife, T.H., Jaffe, S.H. & Natarajan, R. (1991) J. Am. Chem. Soc. 113, 7646–7653.
59. Frisch, M.J., Trucks, G.W., Schlegel, H.B., Gill, P.M.W., Johnson, B.G., Robb, M.A., Cheeseman, J.R., Keith, T., Petersson, G.A., Montgomery, J.A., et al. (1995) Gaussian 94, Revision C.2 (Gaussian, Pittsburgh).