Multiscale Approaches to Protein Modeling


Author: Andrzej Kolinski


Multiscale Approaches to Protein Modeling

Andrzej Kolinski Editor


Editor
Andrzej Kolinski, Department of Chemistry, University of Warsaw, ul. Pasteura 1, 02-093 Warszawa, Poland

ISBN 978-1-4419-6888-3 e-ISBN 978-1-4419-6889-0 DOI 10.1007/978-1-4419-6889-0 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2010934732 © Springer Science+Business Media, LLC 2011 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Thanks to enormous progress in genome sequencing, millions of protein sequences are now known. The number of experimentally solved protein structures is much smaller, ca. 60,000, because structure determination is costly. Thus, theoretical in silico prediction of protein structures and dynamics is essential for understanding the molecular basis of drug action and of metabolic and signaling pathways in living cells, and for designing new technologies in the life and material sciences. Unfortunately, a “brute force” approach remains impractical. Folding of a typical protein (in vivo or in vitro) takes milliseconds to minutes, while state-of-the-art all-atom molecular mechanics simulations of protein systems can cover only nanoseconds to microseconds. This gap is the reason for the enormous progress in the development of various multiscale modeling techniques applied to protein structure prediction, modeling of protein dynamics and folding pathways, in silico protein engineering, model-aided interpretation of experimental data, modeling of macromolecular assemblies, and theoretical studies of protein thermodynamics. Coarse-graining of the proteins’ conformational space is a common feature of all these approaches, although the details and the underlying physical models span a very broad spectrum. This book contains comprehensive reviews of the most advanced multiscale modeling methods in protein structure prediction, computational studies of protein dynamics, folding mechanisms, and macromolecular interactions. The presented approaches span a wide range of levels of coarse-grained representation, various sampling techniques, and a variety of applications to biomedical and biophysical problems.
It was our intention to provide a collection of comprehensive reviews that could serve as a reference book for those who are just beginning their adventure with biomacromolecular modeling, and also as a valuable source of more detailed information for those who are already experts in this field and in related areas of computational biology or biophysics. Proteins are linear copolymers composed of amino acids. Important ideas of polymer physics inspired the field of protein modeling. Chapter 1 explains some basic concepts of polymer conformational statistics and dynamics of chain molecules in the context of simple lattice models. This chapter demonstrates how these ideas could be employed in protein modeling. Chapter 2 describes the application of a lattice-based protein model to the very challenging problem of protein docking. Chapter 3 provides a comprehensive overview of various coarse-grained protein-like and protein models. This chapter describes (among other approaches) probably the most rigorous system of physics-based reduced modeling of proteins. Coarse-grained, multiscale protein modeling requires specific designs of interaction schemes. Chapters 4–6 provide in-depth overviews of force fields of various levels for the reduced representations of protein conformational space, including knowledge-based statistical potentials. Chapters 7 and 8 (but also, in part, Chapters 3–5 and 12) describe a variety of applications of reduced models in the study of protein dynamics, folding pathways, molecular mechanisms of mechanical unfolding, and protein interactions. Chapter 9 gives an overview of the most effective sampling strategies in a reduced, although unrestricted, conformational space. Chapters 10 and 11 present a very efficient philosophy of conformational search, where the target structures are assembled from fragments excised from already known protein structures. These strategies have proven to be very effective in large-scale, automated in silico structure prediction. Chapter 12 describes a multiscale method, based on a high-resolution lattice model, for modeling protein folding pathways. Chapters 13 and 14 discuss the most important ideas and techniques of comparative modeling, the most effective and the most popular method for theoretical prediction of protein structures. These chapters also provide reviews of model-quality assessment methods. The contributing authors are experts recognized worldwide. Some of them (Bujnicki and Zhang) are leaders in the field of protein structure prediction, as assessed by the recent (CASP6–CASP8) community-wide experiments in blind structure prediction.
Others have also developed very successful methods for protein structure prediction (Scheraga, Liwo, Feig, and Kihara). Several of the authors of this book developed very efficient coarse-grained interaction schemes for protein models based on either an evolutionary knowledge approach (Jernigan and Scheraga have built the theoretical foundations of this class of approaches, but others also contributed significantly: Feig and Micheletti) or a physics-based approach (Scheraga, Liwo, Feig, and Irback). Among the authors are also the world's leading experts in comparative modeling (Bujnicki, Zhang, Tramontano, and Kihara) and automated structure prediction (Zhang and Bujnicki); the structure prediction server created by Zhang is the best to date. The book also presents state-of-the-art methods for evaluating the quality of theoretical protein models (Tramontano and Kihara). Recently, significant progress has been achieved in multiscale modeling of protein dynamics and folding mechanisms. The authors of the chapters dealing with this class of problems are also world-class leaders (Scheraga, Liwo, Irback, Feig, Cieplak, Jernigan, and Micheletti). Conformational search strategies are crucial in protein modeling. Developers of the most efficient computational techniques and strategies are also among the authors (Hansmann, Scheraga, and others).

Warsaw, Poland

Andrzej Kolinski

Contents

1 Lattice Polymers and Protein Models
Andrzej Kolinski

2 Multiscale Protein and Peptide Docking
Mateusz Kurcinski, Michał Jamroz, and Andrzej Kolinski

3 Coarse-Grained Models of Proteins: Theory and Applications
Cezary Czaplewski, Adam Liwo, Mariusz Makowski, Stanisław Ołdziej, and Harold A. Scheraga

4 Conformational Sampling in Structure Prediction and Refinement with Atomistic and Coarse-Grained Models
Michael Feig, Srinivasa M. Gopal, Kanagasabai Vadivel, and Andrew Stumpff-Kane

5 Effective All-Atom Potentials for Proteins
Anders Irbäck and Sandipan Mohanty

6 Statistical Contact Potentials in Protein Coarse-Grained Modeling: From Pair to Multi-body Potentials
Sumudu P. Leelananda, Yaping Feng, Pawel Gniewek, Andrzej Kloczkowski, and Robert L. Jernigan

7 Bridging the Atomic and Coarse-Grained Descriptions of Collective Motions in Proteins
Vincenzo Carnevale, Cristian Micheletti, Francesco Pontiggia, and Raffaello Potestio

8 Structure-Based Models of Biomolecules: Stretching of Proteins, Dynamics of Knots, Hydrodynamic Effects, and Indentation of Virus Capsids
Marek Cieplak and Joanna I. Sułkowska

9 Sampling Protein Energy Landscapes – The Quest for Efficient Algorithms
Ulrich H. E. Hansmann

10 Protein Structure Prediction: From Recognition of Matches with Known Structures to Recombination of Fragments
Michal J. Gajda, Marcin Pawlowski, and Janusz M. Bujnicki

11 Genome-Wide Protein Structure Prediction
Srayanta Mukherjee, Andras Szilagyi, Ambrish Roy, and Yang Zhang

12 Multiscale Approach to Protein Folding Dynamics
Sebastian Kmiecik, Michał Jamroz, and Andrzej Kolinski

13 Error Estimation of Template-Based Protein Structure Models
Daisuke Kihara, Yifeng David Yang, and Hao Chen

14 Evaluation of Protein Structure Prediction Methods: Issues and Strategies
Anna Tramontano and Domenico Cozzetto

Index

Contributors

Janusz M. Bujnicki, Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, Poland; Laboratory of Bioinformatics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland
Vincenzo Carnevale, Institute for Computational Molecular Science, Temple University, Philadelphia, PA, USA
Hao Chen, Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN, USA
Marek Cieplak, Institute of Physics, Polish Academy of Sciences, Warsaw, Poland
Domenico Cozzetto, Department of Biochemical Sciences, “Sapienza” University of Rome, Rome, Italy
Cezary Czaplewski, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA
Michael Feig, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA; Department of Chemistry, Michigan State University, East Lansing, MI, USA
Yaping Feng, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
Michal J. Gajda, European Molecular Biology Laboratories, Hamburg Outstation, Hamburg, Germany
Pawel Gniewek, L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA; Laboratory of Theory of Biopolymers, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Srinivasa M. Gopal, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
Ulrich H. E. Hansmann, Department of Physics, Michigan Technological University, Houghton, MI, USA
Anders Irbäck, Computational Biology & Biological Physics, Department of Theoretical Physics, Lund University, Lund, Sweden
Michał Jamroz, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Robert L. Jernigan, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
Daisuke Kihara, Department of Biological Sciences, College of Science; Department of Computer Science, College of Science; Markey Center for Structural Biology, Purdue University, West Lafayette, IN, USA
Andrzej Kloczkowski, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
Sebastian Kmiecik, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Andrzej Kolinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Mateusz Kurcinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Sumudu P. Leelananda, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
Adam Liwo, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA
Mariusz Makowski, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA
Cristian Micheletti, Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy; Democritos CNR-IOM and Italian Institute of Technology (SISSA Unit), Trieste, Italy
Sandipan Mohanty, Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Jülich, Germany
Srayanta Mukherjee, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA
Stanisław Ołdziej, Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA; Laboratory of Biopolymer Structure, Intercollegiate Faculty of Biotechnology, University of Gdańsk and Medical University of Gdańsk, Gdańsk, Poland
Marcin Pawlowski, Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, Poland
Francesco Pontiggia, Department of Biochemistry, Brandeis University, Waltham, MA, USA
Raffaello Potestio, Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy
Ambrish Roy, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA
Harold A. Scheraga, Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA
Andrew Stumpff-Kane, Department of Biochemistry and Molecular Biology, Michigan State University, Michigan, USA
Joanna I. Sułkowska, Institute of Physics, Polish Academy of Sciences, Warsaw, Poland; CTBP, University of California, Gilman Drive 9500, La Jolla, San Diego, CA, USA
Andras Szilagyi, Center for Bioinformatics, University of Kansas, Lawrence, KS, USA; Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary
Anna Tramontano, Department of Biochemical Sciences, “Sapienza” University of Rome, Rome, Italy; Istituto Pasteur – Fondazione Cenci Bolognetti, “Sapienza” University of Rome, Rome, Italy
Kanagasabai Vadivel, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
Yifeng David Yang, Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN, USA
Yang Zhang, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA

Chapter 1

Lattice Polymers and Protein Models Andrzej Kolinski

Abstract The size of the conformational space of chain polymers is enormous. Much has been learned about polymer structure, thermodynamics, and dynamics from theoretical considerations and numerical studies of simple lattice models. Self-avoiding random walks on a lattice provide a good approximation of the excluded volume effect and of the nature of the coil–globule transition. Semiflexible polymers on a lattice exhibit a two-state collapse transition that captures some essential features of the all-or-none folding transition of small globular proteins. More complex lattice polymers, decorated with some structural details, provide a very powerful means for studying protein dynamics and thermodynamics and for protein structure prediction.

1.1 Reduced Models of Chain Molecules

Torsional rotations around the main-chain backbone bonds are by themselves enough to make the conformational space of chain molecules enormous (Flory 1969). For a chain containing N single bonds, the number of conformations is of the order of $q^N$, where q is approximately equal to the number of distinct low-energy regions of the rotational potential. For a polyethylene chain, q would be 3. Obviously, when N is in the hundreds or many thousands, a detailed conformational analysis becomes impractical. Detailed all-atom computer simulations are also impractical, unless only very local conformational changes require examination. Thus, in order to make the problem tractable, simplified models have often been designed and studied (Milik et al. 1990; Kolinski and Skolnick 1996), by statistical analysis, by computer simulation, or both. As will become apparent later, statistical analysis by itself is of rather limited utility and in typical cases requires quite drastic simplifications. Usually, it is difficult to estimate a priori the effect of such simplifications on the final results.
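To get a feel for these numbers, the estimate $q^N$ can be evaluated directly. The short Python sketch below is only an illustration: the function name `log10_conformations` is introduced here and does not come from the chapter, and q = 3 is the polyethylene value quoted above.

```python
import math


def log10_conformations(n_bonds: int, q: int = 3) -> float:
    """Return log10 of the estimated number of conformations q**n_bonds."""
    return n_bonds * math.log10(q)


# Order-of-magnitude estimates for chains of increasing length.
for n_bonds in (10, 100, 1000):
    exponent = log10_conformations(n_bonds)
    print(f"N = {n_bonds:4d}: roughly 10^{exponent:.1f} conformations")
```

Already for N = 100 and q = 3 the estimate is about 5 × 10^47 conformations, which is why exhaustive enumeration is out of the question for realistic chains.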

A. Kolinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_1, © Springer Science+Business Media, LLC 2011


Let us consider two extremely simple models of polymers, one for idealized conformational statistics and the second for the first level of approximation of chain dynamics. These models can be solved rigorously by simple analytical considerations (Flory 1969). The first is the freely jointed chain (sometimes also called “the random flight model”). The freely jointed chain (see Fig. 1.1) consists of n segments of equal length l. Mutual orientations of the segments are completely uncorrelated. It is well known that for a sufficiently large number of segments, the mean-square end-to-end distance of such a chain scales with the number of segments as $\langle R^2 \rangle = l^2 n$. This result closely resembles the central formula of the theory of Brownian motion, where the mean-square displacement of a particle is proportional to time. It is also easy to show that the mean-square radius of gyration $\langle S^2 \rangle$ (a quantity that is easier to measure experimentally than $\langle R^2 \rangle$) is related to the end-to-end distance as $\langle S^2 \rangle = \langle R^2 \rangle / 6$. The distributions of the end-to-end distance and of the segment density are Gaussian. Such an ideal polymer random coil is frequently called the Gaussian chain, although the freely jointed chain is not uniquely Gaussian since other types of chains can also follow Gaussian statistics.

Fig. 1.1 An example of the freely jointed chain
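The two scaling relations for the freely jointed chain can be checked numerically. The following Python sketch is a minimal illustration under stated assumptions: it samples independent chains in three dimensions with uniformly distributed bond directions, and all function names and parameter values (n = 100, 2000 samples, the random seed) are choices made here, not taken from the chapter.

```python
import math
import random


def random_unit_vector(rng):
    """Draw a direction uniformly on the unit sphere (normalized Gaussian triple)."""
    while True:
        x, y, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 1e-12:
            return x / norm, y / norm, z / norm


def sample_chain(n, l, rng):
    """Return the n + 1 bead positions of one freely jointed chain of n segments."""
    beads = [(0.0, 0.0, 0.0)]
    for _ in range(n):
        ux, uy, uz = random_unit_vector(rng)
        x, y, z = beads[-1]
        beads.append((x + l * ux, y + l * uy, z + l * uz))
    return beads


def chain_averages(n, l=1.0, samples=2000, seed=1):
    """Estimate <R^2> and <S^2> by averaging over independent chains."""
    rng = random.Random(seed)
    r2_sum = 0.0
    s2_sum = 0.0
    for _ in range(samples):
        beads = sample_chain(n, l, rng)
        # The first bead sits at the origin, so the last bead is the end-to-end vector.
        r2_sum += sum(c * c for c in beads[-1])
        center = [sum(b[k] for b in beads) / len(beads) for k in range(3)]
        s2_sum += sum(
            sum((b[k] - center[k]) ** 2 for k in range(3)) for b in beads
        ) / len(beads)
    return r2_sum / samples, s2_sum / samples


r2, s2 = chain_averages(n=100)
print(f"<R^2> = {r2:.1f}  (ideal value l^2 n = 100)")
print(f"<S^2> = {s2:.1f}  (ideal value <R^2>/6 = {100 / 6:.1f})")
```

With a few thousand samples the estimates should agree with the ideal values to within a few percent; the relation for the radius of gyration is only asymptotic, so small systematic deviations are expected for short chains.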

The simplifications of the physical properties of real polymers assumed in the freely jointed chain model are essentially of two types. First, the correlations between the chain segments, especially between those that are close to one another along the chain contour, are an important property of polymers and strongly depend on their chemical structure. As long as these correlations extend only to a distance small in comparison with the chain length, it is relatively straightforward to generalize the model by introducing various approximations of the local chain stiffness, related to the sometimes complex profiles of the rotational potential energy. Such short-range correlations (short distance along the chain contour) do not change the general picture. For all such ideal models $\langle R^2 \rangle = C l^2 n$, and the value of the prefactor C depends on the shape of the rotational potential and the temperature. Approximations of the second type are much more significant and much more difficult to deal with. Namely, all ideal chains neglect the effective interactions between the chain segments that are far away from one another along the chain but close to each other in space. On the most trivial level, the fact that two segments cannot occupy the same element of space must be taken into account. A rigorous analytical treatment of such “real” chains is not possible, although approximate theories exist (de Gennes 1979). Probably the most famous is Flory’s mean-field theory (Flory 1953). The theory assumes that a balance between intramolecular interactions and those with solvent defines the average coil size. A quasi-chemical approximation is employed and an average Gaussian density of segments is assumed. The resulting formula describes the chain dimension as a function of temperature:

$$\alpha^5 - \alpha^3 = \mathrm{const} \cdot (1 - \Theta/T)\, n^{1/2} \qquad (1.1)$$

where α is the so-called expansion factor, defined as

$$\alpha^2 = \langle R^2 \rangle / \langle R^2 \rangle_0 \qquad (1.2)$$

with $\langle R^2 \rangle_0$ denoting the ideal chain dimensions. Note that for T = Θ the chain dimensions become identical with the dimensions of the ideal chain. Thus, the idea behind Flory’s “theta” (Θ) temperature closely resembles the Boyle temperature for real gases. At temperatures below Θ, the chain undergoes a transition to a dense globular state, and this transition is somewhat similar to the gas–liquid transition of small molecule systems. However, the transition for flexible polymers is continuous and has most of the features of a second-order phase transition (Kolinski et al. 1987b). At high temperatures the $\alpha^5$ term dominates Eq. (1.1), so $\alpha^5 \sim n^{1/2}$ and therefore $\langle R^2 \rangle = \alpha^2 \langle R^2 \rangle_0 \sim n^{6/5}$; the average chain dimensions are much larger than for an equivalent ideal chain. Interestingly, despite a rather poor estimation of chain entropy and internal energy, Flory’s theory gives quite an accurate estimation of the free energy and conformational properties of chain molecules. Such a cancellation of errors is quite typical of mean-field-type theories. Ideal chain statistics provides a zero-order picture of the protein denatured state, while Flory’s theory is a zero-order approximation for the folding (or collapse) transition. The approximation is quite crude for several reasons. First, protein chains are relatively stiff polymers and the limit of infinitely long chains is hardly satisfied even for large proteins (Creighton 1993). Second, proteins are heteropolymers with highly specific patterns of intramolecular interactions (Branden and Tooze 1991). Even in the random coil state, there is a significant extent of residual structure. Thus, the mean-field theory is hardly applicable. We will address these issues later in more detail.

Somewhat analogous to ideal chain statistics, models for ideal chain dynamics were designed. Probably the best known of these is the Rouse model (Rouse 1953), shown in a schematic fashion in Fig. 1.2.
It assumes that a flexible polymer chain can be represented as a chain of points joined by harmonic springs of equal strength. This model is analytically solvable. The results are quite interesting. For short times, when the average displacements of chain segments are small in comparison with the coil size, a single segment moves according to

$$\langle (\Delta r)^2 \rangle \sim t^{1/2} \quad \text{for} \quad l^2 < \langle (\Delta r)^2 \rangle < \langle S^2 \rangle \qquad (1.3)$$

while at longer times “regular” diffusion is recovered and the mean-square displacement of a segment follows the mean-square displacement of the center of mass of the coil, $\langle (\Delta r)^2 \rangle \sim t$, with the diffusion coefficient proportional to $n^{-1}$ and the longest relaxation time proportional to $n^2$. It is easy to see that the Rouse model neglects several basic aspects of the physics: it ignores chain volume and the resulting topological restrictions (the “phantom” chain approximation), does not take into account the non-uniform flexibility of copolymers, and does not account for hydrodynamic interactions (although some extensions of the Rouse model can do this in a highly approximate way). These approximations are more serious for proteins than for flexible long polymers.

Fig. 1.2 Schematic drawing of beads-and-springs Rouse chain
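The chain-length dependence quoted for the Rouse model can be illustrated with a few lines of Python. This is a sketch under an assumption not spelled out in the text: it uses the standard normal-mode result for a discrete bead-spring chain with free ends, in which mode p of a chain of n beads has the eigenvalue 4 sin²(pπ/2n) and relaxation time ζ/(kλ), and the center of mass diffuses with D = kT/(nζ). The function names and the unit values of the friction coefficient `friction`, spring constant `spring`, and `kT` are illustrative choices.

```python
import math


def rouse_relaxation_time(n_beads, mode=1, friction=1.0, spring=1.0):
    """Relaxation time of a normal mode of a free-end bead-spring chain."""
    eigenvalue = 4.0 * math.sin(mode * math.pi / (2.0 * n_beads)) ** 2
    return friction / (spring * eigenvalue)


def rouse_diffusion_coefficient(n_beads, kT=1.0, friction=1.0):
    """Center-of-mass diffusion coefficient; the total friction grows linearly with n."""
    return kT / (n_beads * friction)


# Doubling the chain length should roughly quadruple the longest relaxation
# time and halve the diffusion coefficient.
for n in (50, 100, 200, 400):
    tau = rouse_relaxation_time(n)
    ratio = rouse_relaxation_time(2 * n) / tau
    print(f"n = {n:3d}  tau_1 = {tau:10.1f}  tau_1(2n)/tau_1(n) = {ratio:.4f}  "
          f"D = {rouse_diffusion_coefficient(n):.4f}")
```

The printed ratio approaches 4 as n grows, which is the $n^2$ scaling of the longest relaxation time, while D falls off as $n^{-1}$.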

1.2 Simple Lattice Polymers

Lattice models of simple liquids, except for providing a clear explanation of the entropy of mixing for an ideal solution, are not particularly useful. The opposite is the case for polymers. In polymers, complex correlations can extend to distances many times larger than the size of a single monomer. Thus, the local details are of less importance. The two ideal models described in the previous section have close lattice analogs. The freely jointed chain can be represented on a regular lattice, and the asymptotic properties remain unchanged (de Gennes 1979). Since on a lattice the allowed angles between consecutive segments belong to a discrete set, the only differences would be seen for short distances along the chain. The Flory type of real chain can be modeled on a lattice in a straightforward and efficient fashion. The simplest possibility is a chain with excluded volume (double occupancy of lattice sites is prohibited) having attractive interactions between non-bonded nearest neighbors. The idea is explained in Fig. 1.3. Such a model enables detailed Monte Carlo studies of polymer collapse transitions for various patterns of interactions and for various topologies of the model polymers (branched chains, macrocycles, etc.). Interestingly, the critical exponent for an athermal linear chain with excluded volume (the limiting case of a high-temperature system) estimated from numerous computational experiments is close (however, clearly not identical) to the 6/5 obtained from Flory’s theory. It has also been proven that the collapse transition for long flexible polymers is continuous and that the observed physics does not depend on the particular type of lattice used.

Fig. 1.3 Square lattice polymers. An ideal chain (left) and a real (with excluded volume) chain (right)

Usually, sampling of the conformational space of a lattice polymer is carried out with the use of various Monte Carlo techniques (Binder et al. 2004; Smith and Lisal 2002; Pakula 2004). A Monte Carlo procedure could be employed to build a large number of completely independent random conformations. Such a statistical ensemble can then be used for the statistical analysis of the conformational and thermodynamic properties of the model. Alternatively, the ensemble can be constructed in a long iterative process of conformational transformations of a single chain or of a collection of chains. For technical reasons, the second possibility is recommended for studies of multichain systems, where growing all chains in parallel without introducing a statistical bias would be a difficult task (Binder et al. 2004; Smith and Lisal 2002; Frenkel and Smit 2001). Computer simulations of single long flexible polymers have provided some important insights into the nature of the coil–globule collapse transition. It has been shown that with increasing chain length the average size of the polymer coil changes faster with temperature and that the collapse transition becomes sharper, although it is always continuous (Kolinski et al. 1987b).
Critical exponents describing the chain dimensions at various conditions of solvent quality and temperature have been determined from extensive Monte Carlo simulations for polymers of various topology of the main chain: simple linear, branched (especially star-branched) (Hsu et al., 2004; Sikorski and Romiszowski 1996; Sikorski 1993), and ring polymers. Also, computer modeling of polymers stimulated development of new computational techniques (Grest et al. 1996; Freire 1999; Likos 2006).
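For very short chains the excluded volume effect can be seen without any sampling at all, by enumerating every self-avoiding walk exactly. The Python sketch below does this on the simple cubic lattice with unit bond length and reports the expansion factor $\alpha^2$ of Eq. (1.2), using the ideal value $\langle R^2 \rangle_0 = n$. The function name `saw_statistics` is introduced here for illustration, and chains this short are far from the asymptotic regime, so the output should not be read as an estimate of the critical exponent.

```python
# The six unit steps of the simple cubic lattice.
STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def saw_statistics(n_steps):
    """Exactly enumerate n-step self-avoiding walks; return (count, mean R^2)."""
    count = 0
    r2_total = 0

    def extend(pos, steps_left, visited):
        nonlocal count, r2_total
        if steps_left == 0:
            count += 1
            r2_total += pos[0] ** 2 + pos[1] ** 2 + pos[2] ** 2
            return
        for dx, dy, dz in STEPS:
            nxt = (pos[0] + dx, pos[1] + dy, pos[2] + dz)
            if nxt not in visited:  # double occupancy of lattice sites is prohibited
                visited.add(nxt)
                extend(nxt, steps_left - 1, visited)
                visited.remove(nxt)

    origin = (0, 0, 0)
    extend(origin, n_steps, {origin})
    return count, r2_total / count


for n in range(1, 8):
    walks, mean_r2 = saw_statistics(n)
    alpha_sq = mean_r2 / n  # ideal lattice chain with unit bonds: <R^2>_0 = n
    print(f"n = {n}: {walks:6d} walks, <R^2> = {mean_r2:.3f}, alpha^2 = {alpha_sq:.3f}")
```

Exact enumeration becomes impractical after a few dozen steps because the number of walks grows exponentially, which is one reason Monte Carlo sampling is the usual tool for longer chains.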


The ideal dynamics of the Rouse chain (see the previous section) also has a lattice analog. Imagine a simple lattice chain that is ideal (lacking excluded volume) and consequently is also a “phantom” chain: fragments of the chain can cross its own path during a random motion. Stochastic dynamics can be simulated as a long sequence of local conformational transitions (involving a few bonds) at randomly selected positions along the chain (see Fig. 1.4). It was shown long ago by Verdier and Stockmayer (1962) that the long-time dynamics of such a model is equivalent to Rouse dynamics. Lattice chains can also easily be modeled as “real” chains, having excluded volume and topological constraints on their motion. This opens the possibility of computational studies of various complex dynamic phenomena, including the mechanism of polymer collapse (protein folding), diffusion in a restricted space, and diffusion in dense solutions. It has been shown that the excluded volume of chain molecules (and long-range interactions in general) leads to somewhat stronger dependencies of the diffusion coefficient and the longest relaxation time on the chain length, when compared to the corresponding relations for ideal chains in an infinitely dilute solution. Computer simulations were especially helpful in understanding the mechanism of polymer diffusion in gels, concentrated solutions, and polymer melts. A system of many mutually entangled polymers is probably one of the most complex (if not the most complex) examples of classical multibody problems. It has been shown that the famous “reptation” theory of de Gennes (1979) describes very well the motion of flexible polymers in a gel. The term “reptation” refers to the snake-like motion of a polymer chain through the net of obstacles imposed by the crosslinked gel. At the same time, many computer simulations demonstrated that the situation in solutions and melts is more complex (Kolinski et al. 1987a, c, d) and that the mechanism of diffusion cannot be described in the framework of simple “reptation” theory (Skolnick and Kolinski 1990; Sikorski et al. 1994; Di Cecca and Freire 2002).

Fig. 1.4 Verdier–Stockmayer dynamics of a short simple cubic lattice chain showing typical lattice moves: (a) the corner flip, (b) three-bond permutation, (c) the crankshaft move, and (d) the chain-end move
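A stripped-down version of this kind of local-move dynamics can be sketched in Python. The example is a simplification made here, not the published Verdier–Stockmayer algorithm: it treats a phantom chain (no excluded volume, so every move is accepted) on the simple cubic lattice and implements only two of the moves listed in Fig. 1.4, the corner flip and the chain-end move. All function names and the chain length, number of moves, and random seed are illustrative.

```python
import random

# The six unit steps of the simple cubic lattice.
UNIT_STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def add(a, b):
    return tuple(p + q for p, q in zip(a, b))


def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))


def stretched_chain(n_beads):
    """Start from a fully extended chain along the x axis."""
    return [(i, 0, 0) for i in range(n_beads)]


def attempt_move(chain, rng):
    """Apply one local move at a randomly selected bead of a phantom chain."""
    i = rng.randrange(len(chain))
    if i == 0:
        # Chain-end move: redraw the direction of the first bond.
        chain[0] = add(chain[1], rng.choice(UNIT_STEPS))
    elif i == len(chain) - 1:
        # Chain-end move at the other end.
        chain[-1] = add(chain[-2], rng.choice(UNIT_STEPS))
    else:
        # Corner flip: exchange the two bonds meeting at bead i. For collinear
        # bonds this leaves the bead where it is.
        chain[i] = sub(add(chain[i - 1], chain[i + 1]), chain[i])


def bonds_are_unit_steps(chain):
    """Check that chain connectivity is preserved."""
    return all(sub(b, a) in UNIT_STEPS for a, b in zip(chain, chain[1:]))


def end_to_end_sq(chain):
    return sum(c * c for c in sub(chain[-1], chain[0]))


rng = random.Random(7)
chain = stretched_chain(20)
r2_start = end_to_end_sq(chain)
for _ in range(200_000):
    attempt_move(chain, rng)
r2_end = end_to_end_sq(chain)
print(f"R^2 at start: {r2_start}, R^2 after relaxation: {r2_end}")
print("connectivity preserved:", bonds_are_unit_steps(chain))
```

Because interior moves only exchange neighboring bonds, new bond directions enter the chain solely through the end moves and then spread inward, which gives an intuitive picture of why relaxation slows down quickly with chain length. Adding excluded volume would require rejecting any move that lands on an occupied site.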


1.3 Simple Lattice Polymers with Protein-Like Features

The collapse transition of a long flexible polymer chain is continuous. Relatively short natural polypeptide chains, however, undergo reversible pseudo-first-order cooperative transitions from a random coil denatured state to a structurally organized, dense globular state (Anfinsen 1973). Since the single protein–solvent system is rather small, it is probably better to describe this transition as all-or-none, which avoids an implicit reference to the thermodynamic limit. At the same time, the term all-or-none refers to a negligible population of folding intermediates at the transition temperature.

It has been pointed out that polypeptides are relatively stiff polymers. Perhaps the chain stiffness itself can induce a more cooperative collapse transition. To check this hypothesis, extensive simulations were done almost 25 years ago (Kolinski et al. 1986a). The model chains studied were relatively short and consisted of 50–400 segments restricted to the diamond lattice (see Fig. 1.5). The diamond lattice was chosen because of its tetrahedral valence angle and the qualitative similarity of its stretched (trans-conformation) segments to the β-strands in globular proteins. Short-range and long-range interactions were modeled in a simple way. Local stiffness was controlled by a potential energy preference for the expanded trans-conformation with respect to the two gauche conformations. Attractive long-range interactions were accounted for with a simple contact potential for the nearest non-bonded neighbors on the lattice. It is probably the simplest possible model of a semiflexible homopolymer in a thermodynamically poor solvent. The degree of stiffness can be controlled by changing the ratio of the stiffness parameter to the contact energy parameter.

Monte Carlo simulations revealed interesting behavior for such a simple system. For moderately stiff polymers, the collapse transition was continuous, qualitatively similar to the collapse of a chain with unrestricted flexibility. However, at some critical ratio of the stiffness parameter to the segment attraction parameter, the collapse transition became highly cooperative, with all-or-none thermodynamic characteristics. At the transition temperature, the semiflexible polymers exhibited metastable states, characteristic of first-order phase transitions.
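The two energy terms of the semiflexible diamond-lattice chain described above can be illustrated with a minimal sketch. It is not the code of the original study: the function names, the parameter names `eps_gauche` and `eps_contact`, and their values are assumptions made for illustration, and the trans preference is expressed here as a penalty on gauche states. The sketch relies on a geometric property of the diamond lattice: a three-bond fragment is trans exactly when its first and third bond vectors coincide.

```python
# Sketch of the energy of a semiflexible homopolymer on the diamond lattice.
# All names and parameter values are illustrative assumptions.

# Bond vectors leaving a site of sublattice A; sites of sublattice B use the negatives.
A_BONDS = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def sub(p, q):
    """Component-wise difference p - q of two lattice points."""
    return tuple(a - b for a, b in zip(p, q))


def build_chain(bonds, start=(0, 0, 0)):
    """Turn a list of bond vectors into a list of bead coordinates."""
    chain = [start]
    for b in bonds:
        chain.append(tuple(c + d for c, d in zip(chain[-1], b)))
    return chain


def count_gauche(chain):
    """Count three-bond fragments whose first and third bonds differ (gauche states)."""
    bonds = [sub(chain[i + 1], chain[i]) for i in range(len(chain) - 1)]
    return sum(1 for i in range(len(bonds) - 2) if bonds[i] != bonds[i + 2])


def count_contacts(chain):
    """Count non-bonded bead pairs that are nearest neighbors on the lattice.

    The chain is assumed to start on sublattice A, so even-indexed beads
    sit on A and odd-indexed beads on B.
    """
    contacts = 0
    for i in range(0, len(chain), 2):        # beads on sublattice A
        for j in range(1, len(chain), 2):    # beads on sublattice B
            if abs(i - j) > 1 and sub(chain[j], chain[i]) in A_BONDS:
                contacts += 1
    return contacts


def chain_energy(chain, eps_gauche=1.0, eps_contact=-1.0):
    """Stiffness penalty for gauche states plus attractive contact energy."""
    return eps_gauche * count_gauche(chain) + eps_contact * count_contacts(chain)


def neg(v):
    return tuple(-c for c in v)


a0, a1, a2, _ = A_BONDS
all_trans = build_chain([a0, neg(a1), a0, neg(a1), a0])   # fully extended zigzag
bent = build_chain([a0, neg(a1), a2, neg(a0), a1])        # five bonds of a six-membered ring

print(chain_energy(all_trans))  # no gauche states, no contacts
print(chain_energy(bent))       # three gauche states, one contact
```

Changing the ratio `eps_gauche / abs(eps_contact)` in such a sketch corresponds to tuning the degree of stiffness relative to the segment attraction, which is the control parameter discussed in the text.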

Fig. 1.5 A short fragment of a model chain restricted to the diamond lattice. Semiflexible chains have a preference for the expanded trans-type conformations. The long-range interactions are modeled by a contact potential

A. Kolinski

Fig. 1.6 Dimensions of short polymeric chains as a function of temperature. The solid line corresponds to a flexible chain and the dashed line describes the behavior of a semiflexible polymer. Tf indicates the collapse (or folding) temperature

This is illustrated in Fig. 1.6. Two types of structures coexisted at the transition temperature. The swollen random coil state exhibited a low density of contacts between the chain segments and a relatively low average length of the fully expanded segments. Upon collapse, the length of the expanded strands increased abruptly, accompanied by an abrupt increase in the number of polymer–polymer contacts. In this way the entropy of the random coil compensates for the low potential energy of the globule. For relatively short chains, the globule had the structure of a bundle of parallel strands.

There was a hypothesis that the collapse transition itself induces formation of secondary structures in proteins. This is true, but only when there is an interplay between the short-range conformational stiffness and the long-range interactions. Such interplay seems to be a fundamental feature of proteins and the major factor responsible for folding cooperativity. For highly flexible polymers, the local ordering in the globular state was undetectable. Here, however, a word of caution is needed. The results described were obtained for a model of a homopolymer. Strong, specific sequence patterns of the long-range interactions may actually lead to some ordering of the globular structure. Specific interactions of side chains can also augment folding cooperativity (Pande et al. 1996).

The model of a semiflexible polymer has one more striking feature. With increasing chain length (somewhere between 200 and 400 segments, depending on the degree of local stiffness), the globular structure divides into domains of bundles having different orientations of their axes (Rutkowska and Kolinski 2007). This again resembles globular proteins, where longer polypeptide chains fold into two or more separate domains. Obviously, in the limit of a very long chain, where the persistence length becomes small with respect to the chain length of the model homopolymer, a continuous collapse transition should be recovered.

The homopolymeric model of protein collapse has, however, an important non-protein-like feature. The globular structure, although highly ordered, is not unique. The average length of the extended strands and the distribution of their sizes may differ quite a bit between particular simulations. The topology of connections between the strands is not unique either. It is interesting that the cooperative collapse transition of a semiflexible polymer could actually be predicted in the framework of a mean-field-type theory, as was demonstrated some time ago by Post and Zimm (1979). Nevertheless, the picture emerging from the computer simulations is much deeper, and it is exact within the limits of the model simplifications and the small statistical errors of the Monte Carlo simulations. Models of homopolymeric semiflexible chains provide the zero-order approximation of the physics of the globular protein collapse transition, where the interplay between secondary structure preferences (here the local stiffness) and the long-range interactions leads to the characteristic cooperative behavior (Kolinski et al. 1996).

1.4 Minimal Protein-Like Models

The homopolymer model of a semiflexible chain, described in the previous section, has several important protein-like features, but it lacks the uniqueness of the globular structure. Real protein chains are heteropolymers, with amino acid units that differ in the strength and the physical nature of their short-range and long-range interactions. In the simplest possible approximation, there are two types of amino acids with respect to their hydrophobicity: polar (P), which tend to be exposed to the solvent on the protein surface, and hydrophobic (H), which tend to be buried inside the globule (Lau and Dill 1989). From the point of view of chain flexibility, one may distinguish three main classes (Skolnick et al. 1989): amino acids (or short sequences of amino acids) that tend to adopt extended (e) conformations, residues that tend to build helical structures (h), and, usually more flexible, residues that prefer coil or turn-type local structures (c). Assuming a β-barrel-type target structure, it is natural to limit these possibilities to two cases: e and c. Thus, in a very crude approximation, the β-type proteins are built from four types of amino acids: He, Hc, Pe, and Pc.

By using the diamond lattice approximation described previously, it is relatively easy to design a number of sequences based on these four types of residues that undergo the all-or-none transition to a unique three-dimensional structure, although the level of folding cooperativity is rather low. Even a quite complex Greek-key topology of the globule (seen in many real proteins) could be designed and obtained, with high reproducibility, in computer simulations (Kolinski et al. 1986b; Skolnick et al. 1988). This, however, requires somewhat more complex patterns of chain hydrophobicity and flexibility (Skolnick et al. 1989). Simple helical motifs could also be designed and folded in silico employing these simple rules. Of course, instead of e-type residues, the h-type had to be used, for which the right-handed three-bond turns are energetically preferred in the short-range interactions (Grosberg and Khokhlov 1994; Kolinski et al. 1996).

Shortly after the studies outlined above, a different approach to the design of a minimal protein-like model was proposed by Chan and Dill and pursued by many others (Dill et al. 1995; Dinner et al. 1994, 1996). In its classic form, the model assumed the simple cubic lattice representation of the chain conformation and two types of residues: polar (P) and hydrophobic (H). In many applications, the target structure had the form of a 3×3×3 cube consisting of 27 model amino acids. It has been shown that it is possible to design sequences that have a single minimum of the chain conformational energy, which is consistent with the cube. This result was exact, since it was feasible to enumerate all possible compact conformations of such short chains (Dill 1999). The model stresses hydrophobic collapse as the main feature of protein folding and was used in studies of protein-folding thermodynamics, stability, and folding pathways. Many modifications of this type of model have been proposed and studied in great detail (Sun et al. 1995; Kolinski and Skolnick 1996; Chen et al. 2004; Abkevich et al. 1994, 1996; Sali et al. 1994; Li et al. 1996; Micheletti et al. 1998). The simple polymer approach to protein folding has proven to be very productive: through these models the most general features of protein folding have become better understood.

In spite of many successes, the cubic lattice hydrophobic-polar (HP) models (and closely related models) have some intrinsic shortcomings that are difficult to ignore. First of all, the notion of secondary structure, so important in real proteins, is quite unclear in these models. Second, the cube geometry has a peculiar pattern of exposed and buried residues. The 27-mer cube has eight highly exposed corner residues and, unlike in real proteins, all of them are far apart from each other in space. There is only one completely buried residue, in the center of the cube, whereas a typical protein domain has about half of its residues buried inside. These shortcomings can be remedied to some extent by a different definition of interactions.
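The exposure pattern of the 27-mer cube mentioned above can be verified by counting, for every site of a 3×3×3 block of the simple cubic lattice, how many of its six lattice neighbors are also occupied. The following is a small sketch written for this purpose only; the function names are hypothetical, and the count includes neighbors that would be bonded along the chain, since it depends only on the cube geometry and not on a particular chain path.

```python
from collections import Counter
from itertools import product

# The six nearest-neighbor steps on the simple cubic lattice.
NEIGHBOR_STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def cube_sites(n=3):
    """All lattice sites of an n x n x n compact cube."""
    return set(product(range(n), repeat=3))


def occupied_neighbors(site, sites):
    """Number of nearest-neighbor sites of `site` that belong to `sites`."""
    return sum(
        1
        for step in NEIGHBOR_STEPS
        if tuple(a + b for a, b in zip(site, step)) in sites
    )


sites = cube_sites(3)
histogram = Counter(occupied_neighbors(s, sites) for s in sites)
print(sorted(histogram.items()))  # [(3, 8), (4, 12), (5, 6), (6, 1)]

# Corners are the sites with only three occupied neighbors.
corners = [s for s in sites if occupied_neighbors(s, sites) == 3]
min_sq_dist = min(
    sum((a - b) ** 2 for a, b in zip(p, q))
    for p in corners
    for q in corners
    if p != q
)
print(min_sq_dist)  # 4, i.e. corners are at least two lattice units apart
```

The histogram shows eight corner sites with three neighbors, twelve edge sites with four, six face-center sites with five, and a single fully buried site with six, which matches the counts given in the text.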
Recent studies by Kaya and Chan (2002) have shown that in order to reproduce true two-state all-or-none folding transitions, more than two types of amino acids need to be included in the model sequences. Their findings also indicated the high significance of the interplay between short-range and long-range interactions. This is in qualitative agreement with the results from the diamond lattice models described in the previous section.

Recently, the problem of the minimal model of protein structure and protein folding has been revisited (Pokarowski et al. 2003, 2005). Chains restricted to the face-centered cubic (fcc) lattice were used to represent protein conformational space. This lattice has a higher coordination number, z = 12, and allows for more flexibility than other simple lattices. Thus, the lattice anisotropy effects are perhaps less severe. Moreover, this lattice allows for a crude, although not trivial, representation of all basic protein-folding motifs: β-type, α-type, and mixed α/β motifs. Representatives of all these motifs have been designed and studied in computational experiments. It was assumed a priori that a minimal model might require three types of potentials to mimic the complex network of molecular interactions in real proteins. In agreement with previous findings, the short-range interactions (sequence-dependent local conformational stiffness) and the long-range contact interactions (with two types of residues, polar and hydrophobic) were implemented. Additionally, the effect of the main-chain hydrogen bonds was taken into account in the form of a directional component in the pairwise potentials. In the simplest form, in the β-type models, it seems to be enough to make the polar-polar (PP) interactions orientation dependent. Indeed, in globular proteins the contacting polar side groups on the surface of a globule are almost always approximately parallel. Without an explicit model of the side groups, their hypothetical orientation can be determined easily from the mutual orientation of the two-segment fragments of the interacting nearest neighbors. A more general definition of the ersatz hydrogen bonds could be designed, with the same meaning and the same effect on the system behavior.

Let us focus on the example of the relatively complex Greek-key motif of a two-sheet, six-stranded β-barrel. In order to perform a detailed analysis of the system thermodynamics, the Replica Exchange Monte Carlo (REMC) sampling method (Hukushima and Nemoto 1996), using a carefully designed set of local conformational modifications, was combined with the histogram analysis of the density of states (Ferrenberg and Swendsen 1989). A large number of simulations were performed, spanning a wide range of the relative strengths (various scaling factors) of the short-range, pairwise hydrophobic, and orientation-dependent polar group interactions. The low-energy structures, including the putative native-like structure and a set of partly folded structures, some of them near-native, were extracted from the REMC pseudo-trajectories. For each of these structures, the potential energy could be calculated as a function of the scaling factors for the particular types of interactions. The assumption that the potential energy of the native structure has to be the lowest leads to a set of inequalities, with the interaction scaling parameters as the free variables. The solution of these inequalities determines a set of “good” parameters for the model. An important result of such analysis is that all energy parameters must have non-zero contributions.
Otherwise, the native structure would not have the minimum energy. Within the set of allowed interaction parameters, there are many possibilities. This parameter space has been explored in additional long simulations, and the level of folding cooperativity has been estimated for every set of scaling parameters from a relatively dense grid within the allowed subspace. The highest, essentially purely two-state, cooperativity has been observed for the system with relatively strong pairwise interactions (hydrophobic and polar, orientation dependent) and a moderate short-range conformational stiffness. The same has been observed for the other structural motifs studied. Thus, it has been proven that the proposed model is a minimal one: the three types of interactions are necessary for the protein-like uniqueness of the native structure and the highly cooperative all-or-none folding transition.

Interestingly, all designed motifs exhibited some degeneracy of the native structure. For instance, for the Greek-barrel model, 20 structures have exactly the same topology and exactly the same patterns of interactions for all components of the model force field. The only differences were geometrical details, including mirror-image structures. This is probably physical (except for the mirror-image structures): there are fluctuations in the native structure of real proteins, and there are known examples where mobile parts contribute to the entropic stabilization of the native state. The simulations for various motifs have shown that a higher degeneracy of the native state leads to a higher cooperativity of the folding transition. Due to the higher entropy of the globular state, its free energy is lower and consequently the free energy gap between the globular state and the manifold of random structures is larger. This leads to a very clear two-state behavior of the model system (Pokarowski et al. 2003, 2005).

The minimal protein-like models capture the most general physics of globular protein folding. Nevertheless, they are only generic models, which are of quite limited use for addressing the more detailed problems of specific folds and protein interactions. Possibly, the fcc model described above, with a larger alphabet for the amino acid sequences, can be used for the crude modeling of specific proteins, although the expected accuracy would be rather low and only the overall topology might be reproduced correctly.
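The parameter-selection step described in this section (the native energy must lie below that of every competing structure for an acceptable set of scaling factors) can be sketched as a scan over a grid of weights. The sketch below assumes that the total energy is a weighted sum of three components; the component energies are invented toy numbers chosen only to make the logic visible, and neither they nor the function names come from the cited studies.

```python
from itertools import product

# Energy components per structure:
# (short-range stiffness, hydrophobic contacts, oriented polar contacts).
# Invented toy numbers for illustration only; not data from the cited work.
NATIVE = (-10.0, -20.0, -8.0)
DECOYS = [
    (-6.0, -20.0, -8.0),    # differs from the native only in the short-range term
    (-10.0, -12.0, -8.0),   # differs only in the hydrophobic term
    (-10.0, -20.0, -3.0),   # differs only in the polar term
    (-14.0, -17.0, -8.0),   # better local geometry, worse hydrophobic packing
]


def total_energy(components, weights):
    """Weighted sum of the energy components (assumed linear in the scaling factors)."""
    return sum(w * e for w, e in zip(weights, components))


def native_is_lowest(weights):
    """True if the native energy is strictly below the energy of every decoy."""
    e_native = total_energy(NATIVE, weights)
    return all(e_native < total_energy(d, weights) for d in DECOYS)


def feasible_weights(grid):
    """All weight triplets on the grid that satisfy every inequality."""
    return [w for w in product(grid, repeat=3) if native_is_lowest(w)]


GRID = [0.0, 0.5, 1.0, 1.5, 2.0]
good = feasible_weights(GRID)

print((0.5, 1.0, 1.0) in good)          # True
print((0.0, 1.0, 1.0) in good)          # False: a zero weight removes the gap to one decoy
print(all(min(w) > 0.0 for w in good))  # True
```

With these toy inputs every acceptable triplet has all three weights non-zero, and the last decoy additionally caps the short-range weight relative to the hydrophobic one. This only mimics, in a qualitative way, the kind of conclusion reported in the text; real applications would use energies of structures extracted from the simulations.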

1.5 High-Coordination Lattice Protein Models

For many reasons, lattice approaches to polymer and biopolymer modeling are very appealing. Conformational transitions can be rapidly calculated in the discrete space of the lattice. The energy landscape is smoother; many local energy barriers are eliminated due to the simplification of the interaction schemes. Moreover, energy calculations are much faster due to the discrete set of allowed distances and angles. This is especially true for proteins, where the interactions are actually rather complex. On the other hand, a higher geometrical accuracy is needed for the study of specific proteins and for protein structure prediction. For this reason, several lattice models of intermediate-to-high resolution were developed in the past, and some of them have proven to be rather effective tools for protein molecular modeling (Kolinski and Skolnick 1996; Kolinski 2004; Kolinski et al. 1995, 1996).

In several studies of proteins and protein-like models, the three-dimensional “chess-knight” representation of the alpha-carbon trace was used (Kolinski et al. 1996). The chess-knight lattice is built upon a set of vectors of the type [2,1,0]. There are 24 such vectors: 6 permutations of the coordinates times 4 combinations of the signs. Due to the restrictions on the values of the planar angles of the alpha-carbon trace in real proteins, the number of possible orientations of a Cα–Cα virtual bond should be smaller and dependent on the orientation of the preceding and following bonds. The chess-knight representation of protein structures is more realistic than chains on simple lattices (compare Fig. 1.7 with Fig. 1.5). However, this model is still intermediate between the protein-like models and models applicable to real proteins, because the effects of the lattice anisotropy remain large. For instance, the geometrical fidelity of a projection of short β-strands onto the chess-knight lattice depends on the orientation of the projected fragment with respect to the principal axes of the Cartesian coordinate system. Lattice effects are particularly harmful for the simulated dynamics of lattice systems, where the relaxation processes could be significantly distorted.

Fig. 1.7 A short fragment of the “chess-knight” chain with side groups restricted to the lattice

To overcome this problem, a modification of the 210 representation was proposed. The set of Cα-trace vectors was expanded by adding vectors of the types [1,1,1] and [2,1,1]. The total number of allowed orientations is thereby increased to 56. Vectors of the type [2,0,0] were excluded for a technical reason relating to the convenience of coding the excluded volume using additional vertices of the underlying simple cubic lattice. The model becomes significantly more flexible, and the effect of the lattice anisotropy decreases. In spite of rather non-physical fluctuations of the bond length, the overall accuracy and precision of the protein representation improve considerably. As a result, the model produces a plausible picture of polypeptide chain dynamics and enables the de novo prediction of simple low-resolution protein structures (Godzik et al. 1993). Obviously, the structure prediction algorithm requires a properly designed force field based on statistical potentials derived from known protein structures. Basic principles for the design of the interaction schemes for reduced models will be outlined later for a different lattice representation.

Fluctuating bond models proved to be a milestone in lattice modeling of protein structures. Interestingly, in parallel the fluctuating bond concept was extensively employed in studies of generic polymeric systems (Carmesin and Kremer 1988). Several specific representations were developed. It has been proven that the dynamics of the fluctuating bond lattice models reproduces well the Rouse dynamics of continuous space models. The flexibility and computational efficiency of the fluctuating bond models enabled detailed studies of the thermodynamics and dynamics of long polymers, including the extremely complex dynamics of multichain systems. These findings are important for protein modeling. They provide a justification for applications of flexible lattice models in studies not only of protein structures but also of protein dynamics and protein-folding mechanisms (Kolinski and Skolnick 2004).

Proteins are complex heteropolymers with 20 different side chains attached to the main-chain backbone, which is rather generic (with the exception of the proline residues), as in synthetic homopolymers or simple copolymers. Thus a satisfactory model of the main chain is just a starting point for an acceptable protein model.
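The vector counts quoted in this section for the chess-knight lattice (24) and for its extended, fluctuating-bond variant (56) can be checked by direct enumeration. The helper name `vector_class` below is a hypothetical choice made for this sketch; it simply generates every distinct vector obtained from a pattern by permuting the coordinates and changing signs.

```python
from itertools import permutations, product


def vector_class(pattern):
    """All distinct integer vectors obtained from `pattern` by coordinate
    permutations and sign changes."""
    vectors = set()
    for perm in set(permutations(pattern)):
        for signs in product((1, -1), repeat=3):
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return vectors


knight = vector_class((2, 1, 0))
extended = knight | vector_class((1, 1, 1)) | vector_class((2, 1, 1))

print(len(knight), len(extended))  # 24 56

# Squared bond lengths present in the extended set, in lattice units.
print(sorted({sum(c * c for c in v) for v in extended}))  # [3, 5, 6]
```

The three distinct squared lengths make the bond-length fluctuation of the extended set explicit: the virtual bond is no longer of fixed length, which is the price paid for the reduced lattice anisotropy.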
Let us consider a more exact fluctuating bond model than the one described above (Kolinski and Skolnick 1994). In this model the main-chain backbone is also reduced to the alpha-carbon trace. The number of backbone vectors is equal to 90. These vectors belong to the following set: {[3,1,1], ..., [3,1,0], ..., [3,0,0], ..., [2,2,1], ...}. The amplitude of the bond fluctuations in this model is relatively small, and the lattice anisotropy effects are essentially negligible. The excluded volume of the main chain can be modeled in a convenient way. It is enough to associate with every lattice position of the alpha carbon the 18 closest points of the underlying simple cubic lattice. These 18 lattice vertices (plus the central one) are inaccessible to other alpha-carbon units, which are also clusters of 19 lattice points. Such lattice coding of the excluded volume simplifies the simulation process immensely, because main-chain overlaps can be detected at small computational cost.

The discrete main-chain geometry provides a convenient reference frame for the definition of the side-chain positions. For each amino acid, a database of known protein structures can be scanned and a database of the observed side-chain rotamers created, assuming a certain level of resolution of the model. During the simulations every update of the main-chain conformation has to be associated with an update of the side-chain positions. This can be done efficiently with the help of “prefabricated” sets of allowed side-chain coordinates. The side chains can be modeled as single or multiple interaction centers. They can be restricted to the underlying lattice or be off-lattice, but still defined in the lattice-bound reference frame. In the published applications of this type of fluctuating bond model, a single off-lattice sphere for rotamers was used. The model enabled reproducible de novo folding of several small proteins, with an accuracy of 2–4 Å with respect to their crystallographic structures after the best superposition with the computational models (Kolinski and Skolnick 1994).
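The lattice coding of the main-chain excluded volume described above can be sketched with an occupancy set. On the simple cubic lattice the 18 points closest to a vertex are the 6 at squared distance 1 and the 12 at squared distance 2, so each alpha-carbon unit is represented by a 19-point cluster. The function names and the use of a Python set for the occupancy map are assumptions of this sketch, not details of the published model.

```python
from itertools import product

# Central vertex plus its 18 closest simple cubic lattice points:
# 6 at squared distance 1 and 12 at squared distance 2.
CLUSTER = [d for d in product((-1, 0, 1), repeat=3) if sum(c * c for c in d) <= 2]


def cluster_points(center):
    """Lattice vertices blocked by an alpha-carbon unit placed at `center`."""
    return {tuple(c + d for c, d in zip(center, offset)) for offset in CLUSTER}


def clusters_overlap(center_a, center_b):
    """True if the two 19-point clusters share at least one lattice vertex."""
    return not cluster_points(center_a).isdisjoint(cluster_points(center_b))


def place(center, occupied):
    """Try to add a unit at `center`; leave `occupied` unchanged on a clash."""
    points = cluster_points(center)
    if not occupied.isdisjoint(points):
        return False
    occupied |= points
    return True


occupied = set()
print(len(CLUSTER))                 # 19
print(place((0, 0, 0), occupied))   # True: the lattice is empty
print(place((2, 0, 0), occupied))   # False: the clusters share the vertex (1, 0, 0)
print(place((3, 0, 0), occupied))   # True: no shared vertex
```

Because a clash test reduces to set lookups on integer triplets, its cost does not depend on floating-point distance calculations, which is the computational advantage the text refers to.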
The CABS (Cα–Cβ–side group) lattice-based model employs a high-resolution discretization of the polypeptide conformational space (Kolinski 2004). As in the previously described models, the framework of the polypeptide chain representation is the alpha-carbon trace (see Fig. 1.8 for an explanation of the CABS reduced representation). The alpha carbons are located on the vertices of a simple cubic lattice with the mesh size equal to 0.61 Å. The virtual bonds connecting the alpha carbons belong to a set of 800 vectors of the type v = [i, j, k], where the integer coordinates i, j, k are all possible triplets for which 29 ≤ |v|² ≤ 49. This set of vectors reproduces the Cα–Cα distance, equal to 3.78 Å, with fluctuations in the range of ±10%. Protein structures can be approximated with an accuracy of about 0.35 Å cRMSD (coordinate root-mean-square deviation) after the best superposition of the model Cα-trace onto the corresponding coordinates of the experimental structure. An example is given in Fig. 1.9. The side groups are modeled by two centers of interaction: the beta carbons and the centers of the remaining part of the side chain (where applicable). These are not restricted to the lattice, and their positions with respect to the backbone are derived from the statistics of known protein structures. The two most probable rotamers are defined for each residue. The excluded volume is modeled by a set of hard spheres centered on the alpha and beta carbons and in the middle of the Cα–Cα virtual bonds. The side groups are treated as soft spheres.

Fig. 1.8 Schematic drawing of a short fragment of the CABS model

Fig. 1.9 High-resolution lattice representation of the alpha-carbon trace of a small globular protein (domain B of protein G) in the CABS model. The average accuracy with respect to the crystallographic coordinates is about 0.35 Å. The most probable coordinates of the side-chain united atoms are calculated based on the Cα-trace geometry and the statistics of high-resolution protein structures
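The size of the CABS virtual-bond set follows directly from the condition 29 ≤ |v|² ≤ 49 and can be reproduced by enumeration. The sketch below uses the 0.61 Å mesh size quoted in the text; the function name `cabs_bond_vectors` is a hypothetical label chosen here and is not part of any CABS software.

```python
from itertools import product
from math import sqrt

MESH = 0.61  # lattice spacing in angstroms, as quoted in the text


def cabs_bond_vectors(min_sq=29, max_sq=49):
    """All integer vectors [i, j, k] with min_sq <= i^2 + j^2 + k^2 <= max_sq."""
    reach = int(sqrt(max_sq))
    return [
        v
        for v in product(range(-reach, reach + 1), repeat=3)
        if min_sq <= sum(c * c for c in v) <= max_sq
    ]


vectors = cabs_bond_vectors()
lengths = [MESH * sqrt(sum(c * c for c in v)) for v in vectors]

print(len(vectors))                                     # 800
print(round(min(lengths), 2), round(max(lengths), 2))   # 3.28 4.27
```

The shortest and longest vectors correspond to about 3.28 Å and 4.27 Å, bracketing the 3.78 Å Cα–Cα distance quoted above.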

The force field of the CABS model consists of several components mimicking the real physical interactions in proteins (Kolinski 2004). The generic short-range conformational biases simulate the characteristic protein-like chain stiffness. A set of sequence-specific potentials simulates the short-range conformational propensities. Directional potentials of interactions between alpha carbons and between the centers of Cα–Cα virtual bonds simulate the structure-ordering effect of the main-chain hydrogen bonds. Pairwise interactions between the side-group united atoms are “context-dependent,” i.e., they depend on the mutual orientation of the interacting side groups and on the conformations of the corresponding two-bond segments of the main chain. In an implicit way, the pairwise interactions take approximate account of the average effects of the surrounding solvent. This force field is knowledge-based: the statistical potentials of mean force are derived from structural regularities seen in known high-resolution protein structures (Skolnick et al. 1997a; Kolinski 2004). It is perhaps worth noting that the CABS modeling tool performs no worse (and is computationally more efficient) than a similar continuous-space reduced model (Boniecki et al. 2003). This shows that the high-resolution lattice approximations are free of lattice artifacts and, due to their computational efficiency, well suited for large-scale applications. Some practical applications of the CABS model are described in the next section.

All the high-coordination lattice models described so far have focused on the design of a convenient representation of the main chain of polypeptides, which is subsequently “decorated” with the side chains, using the main-chain backbone as a reference frame. The SICHO (SIde CHain Only) model is based on a completely different concept (Kolinski et al. 2001). In this approach, the explicit lattice approximation uses the fluctuating bond framework for modeling the virtual chain connecting the positions of the centers of mass of the polypeptide side groups. In contrast to the CABS model, in the SICHO model the positions of the main-chain atoms are defined in the reference frame of the pseudo-chain connecting the side groups. The SICHO concept is based on the fact that the packing of the side chains is probably the most sequence-specific property of globular proteins. In principle, the side-chain-based models should be computationally faster than the main-chain-based ones.

1.6 Protein Folding and Structure Prediction with Lattice Models

High-resolution SICHO and CABS models (and their clones) have been used in a variety of applications. These include ab initio structure prediction (Ortiz et al. 1999; Skolnick et al. 2001), studies of protein dynamics, folding pathways and thermodynamics, prediction of protein structure from sparse experimental data, and distant homology, or comparative, structure modeling (Kolinski et al. 1999; Kolinski and Skolnick 2004; Pierri et al. 2008; Skolnick et al. 1997b, 1998, 2003). Ab initio folding with the lattice models is practical only for relatively small (say, up to 150 residues) and topologically not too complex proteins. With increasing size of the query protein, the success ratio, as well as the accuracy of the produced structures, decreases. Small proteins (50–75 residues) can be folded to a resolution approaching 2 Å, while for a 150-residue structure the accuracy would be nearer 3–6 Å, depending on fold complexity. It should be pointed out that the bottleneck for the accuracy of reduced models is not their reduced representation but rather the deficiencies of their force fields. The force fields of the reduced models are being continually updated, and in the future this should be the main factor leading to improvements in the algorithms' performance. There is a suggestive result partially confirming this. The CABS model has been used many times for comparative modeling, where in addition to the force field, the folding process is guided by a set of spatial restraints extracted from structures of homologous (or structurally analogous) proteins. In these circumstances the resulting models could be even as good as the best crystallographic structures with a resolution of 1.0–1.5 Å, at least for the main-chain atoms (Kolinski and Bujnicki 2005).

Every 2 years, the Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP) is organized in order to assess the current status of in silico protein structure prediction. The idea of the experiment is simple. Experimentalists provide the amino acid sequences of a number of proteins for which the structures are expected to be solved in the next few months. During this time theoreticians try to predict these structures and deposit them in the CASP databank. A variety of modeling techniques are used by researchers from around the world. Afterward, when the experimental structures become available, a group of experts assesses the quality of the predictions. Interestingly, the high-resolution lattice-based modeling methods systematically perform very well and are among the best methods for protein structure prediction. An example of a CASP6 prediction using CABS lattice modeling is given in Fig. 1.10. The accuracy of this particular prediction is about 3.0 Å after the best superposition of the predicted structure onto the experimental one (Kolinski and Bujnicki 2005).

Fig. 1.10 Side-by-side view of the crystallographic structure and the predicted structure (the CABS model) for one of the CASP6 targets (target # T0223)

Finally, it should be noted that the reduced structures of SICHO or CABS models can be used as a meaningful starting point for all-atom reconstruction and structure refinement (Feig et al. 2000; Kolinski and Bujnicki 2005). Such procedures are now pursued in several laboratories, opening the possibility of multiscale modeling of large biomolecular systems.

References Abkevich VI, Gutin AM, Shakhnovich EI (1994) Free energy landscape for protein folding kinetics: Intermediates, traps, and multiple pathways in theory and lattice model simulations. J Chem Phys 101:6052–6062 Abkevich VI, Gutin AM, Shakhnovich EI (1996) Improved design of stable and fast-folding model proteins. Fold Des 1:221–230 Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230

18

A. Kolinski

Binder K, Müller M, Baschnagel J (2004) Polymers models on the lattice. In: Kotelyanskii MJ, Theodorou DN (eds) Simulation methods for polymers, M. Dekker, New York, NY Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A (2003) Protein fragment reconstruction using various modeling techniques. J Comp Aid Mol Des 17:725–738 Branden C, Tooze J (1991) Introduction to protein structure. Garland, New York, NY Carmesin I, Kremer K (1988) The bond fluctuation method: a new effective algorithm for the dynamics of polymers in all spatial dimensions. Macromolecules 21:2819–2823 Chen H, Zhou X, Chih YL, Chan GK (2004) Kinetic analysis of protein folding lattice models. Mod Phys Lett B 18:163–172 Creighton TE (1993) Proteins: structures and molecular properties. W. H. Freeman, New York, NY De Gennes PG (1979) Scaling concepts in polymer physics, 1st edn. Cornell University Press, New York, NY Di Cecca A, Freire JJ (2002) Monte Carlo simulation of star polymer systems with the bond fluctuation model. Macromolecules 35:2851–2858 Dill KA (1999) Polymer principles and protein folding. Prot Sci 8:1166–1180 Dill KA, Bromberg S, Yue K, Fiebig KM, Yee DP, Thomas PD, Chan HS (1995) Principles of protein folding – a perspective from simple exact models. Prot Sci 4:561–602 Dinner A, Sali A, Karplus M, Shakhnovich E (1994) Phase diagram of a model protein derived by exhaustive enumeration of the conformations. J Chem Phys 101:1444–1451 Dinner AR, Sali A, Karplus M (1996) The folding mechanism of larger model proteins: role of native structure. Proc Natl Acad Sci USA 93:8356–8361 Feig M, Rotkiewicz P, Kolinski A, Skolnick J, Brooks CL 3rd (2000) Accurate reconstruction of all-atom protein representations from side-chain-based low-resolution models. Proteins 41: 86–97 Ferrenberg AM, Swendsen RH (1989) Optimized Monte Carlo data analysis. Phys Rev Lett 63:1195–1198 Flory PJ (1953) Principles of polymer chemistry. 
Cornell University Press, New York, NY Flory PJ (1969) Statistical mechanics of chain molecules. Wiley, New York, NY Freire J (1999) Conformational properties of branched polymers: theory and simulations. Branched polymers II. Advances in polymer science, vol 143/1999. Springer, Berlin, pp 35–112 Frenkel D, Smit B (2001) Understanding molecular simulation. From algorithms to applications. Computational science series, vol 1, 2nd edn. Academic, New York, NY Godzik A, Kolinski A, Skolnick J (1993) De novo and inverse folding predictions of protein structure and dynamics. J Comp Aid Mol Des 7:397–438 Grest GS, Fetters LJ, Huang JS, Richter D (1996) Star polymers: experiment, theory, and simulation. Adv Chem Phys 104:67–163 Grosberg AY, Khokhlov AR (1994) Statistical physics of macromolecules. American Institutes of Physics Press, New York, NY Hsu HP, Nadler W, Grassberger P (2004) Scaling of star polymers with 1–80 arms. Macromolecules 37:4658–4663 Hukushima K, Nemoto K (1996) Exchange Monte Carlo method and application to Spin Glass Simulations. J Phys Soc Jpn 65:1604–1608 Kaya H, Chan HS (2002) Origins of chevron rollovers in non-two-state protein folding kinetics. Phys Rev Lett 90:258104 Kolinski A (2004) Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 51:349–371 Kolinski A, Skolnick J (2004) Reduced models of proteins and their applications. Polymer 45: 511–524 Kolinski A, Betancourt MR, Kihara D, Rotkiewicz P, Skolnick J (2001) Generalized comparative modeling (GENECOMP): a combination of sequence comparison, threading, and lattice modeling for protein structure prediction and refinement. Proteins 44:133–149 Kolinski A, Bujnicki JM (2005) Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins Suppl 7(61):84–90


Kolinski A, Galazka W, Skolnick J (1996) On the origin of the cooperativity of protein folding: implications from model simulations. Proteins 26:271–287 Kolinski A, Milik M, Rycombel J, Skolnick J (1995) A reduced model of short range interactions in polypeptide chains. J Chem Phys 103:4312–4323 Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J (1999) A method for the improvement of threading-based protein models. Proteins 37:592–610 Kolinski A, Skolnick J (1996) Lattice models of protein folding, dynamics and thermodynamics. Molecular biology intelligence unit. Chapman & Hall, New York, NY Kolinski A, Skolnick J (1994) Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins 18:338–252 Kolinski A, Skolnick J, Yaris R (1986a) The collapse transition of semiflexible polymers. A Monte Carlo simulation of a model system. J Chem Phys 85:3585–3597 Kolinski A, Skolnick J, Yaris R (1986b) Monte Carlo simulations on an equilibrium globular protein folding model. Proc Natl Acad Sci USA 83:7267–7271 Kolinski A, Skolnick J, Yaris R (1987a) Does reptation describe the dynamics of entangled, finite length polymer systems? A model simulation. J Chem Phys 86:1567–1585 Kolinski A, Skolnick J, Yaris R (1987b) Dynamic Monte Carlo study of the conformational properties of long flexible polymers. Macromolecules 20:438–440 Kolinski A, Skolnick J, Yaris R (1987c) Monte Carlo studies on the long time dynamic properties of dense cubic lattice multichain systems. I. The homopolymeric melt. J Chem Phys 86: 7164–7173 Kolinski A, Skolnick J, Yaris R (1987d) Monte Carlo studies on the long time dynamic properties of dense cubic lattice multichain systems. II. Probe polymer in a matrix of different degrees of polymerization. J Chem Phys 86:7174–7180 Lau KF, Dill KA (1989) A lattice statistical mechanics model of the conformational and sequence spaces of proteins. 
Macromolecules 22:3986–3997 Li H, Helling R, Tang C, Wingreen N (1996) Emergence of preferred structures in a simple model of protein folding. Science 273:666–669 Likos CN (2006) Soft matter with soft particles. Soft matter 2:478–498 Micheletti C, Seno F, Maritan A, Banavar JR (1998) Protein design in a lattice model of hydrophobic and polar amino acids. Phys Rev Lett 80:2237–2240 Milik M, Kolinski A, Skolnick J (1990) Monte Carlo dynamics of a dense system of chain molecules constrained to lie near an interface. A simplified membrane model. J Chem Phys 93:4440–4446 Ortiz AR, Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J (1999) Ab initio folding of proteins using restraints derived from evolutionary information. Proteins 37:177–185 Pakula T (2004) Simulations of completely occupied lattice. In: Kotelyanskii MJ, Theodorou DN (eds) Simulation methods for polymers. M. Dekker, New York, NY Pande VS, Grosberg AY, Tanaka T, Rokhsar DS (1996) Pathways for protein folding: is a new view needed? Curr Opin Struct Biol 8:68–79 Pierri CL, De Grassi A, Turi A (2008) Lattices for ab initio protein structure prediction. Proteins 73:351–361 Pokarowski P, Droste K, Kolinski A (2005) A minimal protein-like lattice model: an alpha-helix motif. J Chem Phys 122:214915 Pokarowski P, Kolinski A, Skolnick J (2003) A minimal physically realistic protein-like lattice model: designing an energy landscape that ensures all-or-none folding to a unique native state. Biophys J 84:1518–1526 Post CB, Zimm BH (1979) Internal condensation of a single DNA molecule. Biopolymers 18:1487–1501 Rouse PE (1953) A theory of the linear viscoelastic properties of dilute solutions of coiling polymers. J Chem Phys 21:1272–1280 Rutkowska A, Kolinski A (2007) Why do proteins divide into domains? Insights from lattice model simulations. Biomacromolecules 8:3519–3524


Sali A, Shakhnovich E, Karplus M (1994) How does a protein fold? Nature 369:248–251 Sikorski A (1993) Monte Carlo study of the dynamics of star-branched polymers. Macromol Theory Simul 2:309–318 Sikorski A, Kolinski A, Skolnick J (1994) Dynamics of star branched polymers in a matrix of linear chains: a Monte Carlo study. Macromol Theory Simul 3:715–729 Sikorski A, Romiszowski P (1996) Motion of star-branched vs. linear polymer: A Monte Carlo study. J Chem Phys 104:8703–8712 Skolnick J, Jaroszewski L, Kolinski A, Godzik A (1997a) Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Prot Sci 6:676–688 Skolnick J, Kolinski A (1990) Dynamics of dense polymer systems: computer simulations and analytic theories. In: Advances in chemical physics, vol 78. Wiley, New York, NY Skolnick J, Kolinski A, Kihara D, Betancourt M, Rotkiewicz P, Boniecki M (2001) Ab initio protein structure prediction via a combination of threading, lattice folding, clustering, and structure refinement. Proteins Suppl 5:149–156 Skolnick J, Kolinski A, Ortiz AR (1997b) MONSSTER: a method for folding globular proteins with a small number of distance restraints. J Mol Biol 265:217–241 Skolnick J, Kolinski A, Yaris R (1988) Monte Carlo simulations of the folding of beta-barrel globular proteins. Proc Natl Acad Sci USA 85:5057–5061 Skolnick J, Kolinski A, Yaris R (1989) Dynamic Monte Carlo study of the folding of a six-stranded Greek key globular protein. Proc Natl Acad Sci USA 86:1229–1233 Skolnick J, Zhang Y, Arakaki AK, Kolinski A, Boniecki M, Szilágyi A, Kihara D (2003) TOUCHSTONE: a unified approach to protein structure prediction. Proteins 53:469–479 Smith WR, Lisal M (2002) Direct Monte Carlo simulation methods for nonreacting and reacting systems at fixed total internal energy or enthalpy. Phys Rev E 66:011104 Sun S, Brem R, Chan HS, Dill KA (1995) Designing amino acid sequences to fold with good hydrophobic cores. 
Protein Eng 8:1205–1213 Verdier PH, Stockmayer WH (1962) Monte Carlo calculations on the dynamics of polymers in dilute solution. J Chem Phys 36:227–235

Chapter 2

Multiscale Protein and Peptide Docking

Mateusz Kurcinski, Michał Jamroz, and Andrzej Kolinski

Abstract The number of functional protein complexes in a cell is larger by an order of magnitude than the number of proteins. Experimentally determined three-dimensional structures exist for only a very small fraction of these complexes. Thus, methods for theoretical prediction of the structures of protein assemblies are extremely important for molecular biology. Association of two (or more) proteins always induces conformational changes of the individual components. In many cases, these induced changes are relatively small and involve mostly the side chains at the association interface. In such cases, rigid-body docking of two (or more) structures is quite successful. Quite frequently, however, the docking-induced conformational changes are significant. In such cases, prediction of the resulting structures is extremely challenging. Cases where experimental structures of some components do not exist are even more difficult. In this chapter, we briefly review the existing in silico docking methods and describe a multiscale strategy for unrestricted flexible docking of proteins and peptides.

2.1 Introduction

In eukaryotic cells, an average protein can participate in several protein–protein (or protein–nucleic acid) complexes. The number of such complexes is larger by an order of magnitude than the number of proteins. Since the number of experimentally solved protein structures (about 60,000) is a small fraction of all proteins, the fraction of structurally annotated protein complexes is very small. Thus, the theoretical, in silico, prediction of molecular structures of multimeric protein assemblies is one

A. Kolinski (corresponding author) Faculty of Chemistry, University of Warsaw, Warsaw, Poland e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_2, © Springer Science+Business Media, LLC 2011


of the most important tasks of bioinformatics and computational biology (Wodak and Janin 1978; Valencia and Pazos 2002; Salwinski and Eisenberg 2003; Aloy and Russell 2004). There are relatively dependable computational methods for so-called rigid docking. These methods are applicable provided that the structures of the individual components are known and the conformational changes of these components induced upon docking are small. For proteins of known structure, the latter requirement is approximately fulfilled quite frequently (Ritchie 2008). The problem then reduces to generating a large number of possible poses according to the shape complementarity of the components and scoring the binding poses by the interaction patterns of the interfaces. The scoring task is by no means trivial, since at least some of the side chains at the interface certainly change their conformations with respect to those seen in the monomeric state or in different complexes. At the moment, knowledge-based statistical potentials, either atom-wise or united-atom-wise, seem to be the most productive in scoring protein–protein interactions. The ultimate goal of protein docking could be described as follows: given just a set of sequences, find the structure (or structures) of the possible assemblies. In general, this may not be feasible, at least at present, but the situation is perhaps not hopeless. Firstly, with the constant progress in protein structure prediction, mainly via refinement of comparative modeling, it is now possible to predict monomeric structures for at least half of protein sequences with at least moderate resolution. Secondly, provided that bioinformatics methods are developed for identifying structure fragments that may change upon docking, it should be possible to design methods for semi-flexible docking that account for the allowed conformational changes of fragments of the components’ structures.
A step toward such a docking methodology is described in this chapter. The method employs multiscale modeling of proteins and peptides and is based on the CABS modeling software. CABS is a high-resolution, coarse-grained protein modeling tool (the acronym stands for the united atoms representing a residue in a polypeptide chain: CA, the alpha carbon of the main chain; CB, the beta carbon; and S, the center of the side group). The CABS representation is based on a united-atom description of protein structure, in which a single residue is represented by several (three or four, depending on the size of the side chain) united atoms. The conformational space of CABS polypeptide chains is sampled by means of very efficient Monte Carlo schemes. Details of the CABS design are described in the first chapter of this book and in previous publications (Kolinski 2004). The spatial resolution of CABS allows for quite precise reconstruction of atomic details. For main-chain atoms, this reconstruction process (Gront et al. 2007) is very fast and accurate to within a few tenths of an angstrom. The reconstruction of side groups is less accurate and depends on the achieved accuracy of the Cα-trace fold. Below, we briefly review the techniques for rigid docking and for docking with highly limited flexibility of some structural elements, and then we outline the more flexible (and fully flexible) docking based on a multiscale approach in which the CABS-based structure assembly is the key step of the procedure for building a molecular complex.


2.2 Rigid Docking Procedures

Suppose we know the three-dimensional structures of two proteins that form a dimer, but we do not know how the two proteins are posed in the complex or which residues form the protein–protein interface. Finding the structure of the resulting complex is not a trivial task. Classical rigid docking consists of two or three fundamental steps (Vajda and Kozakov 2009). The first is the generation of a large number of binary structures. The second is scoring these structures according to shape complementarity and interactions at the interface. Finally, one may refine the best structures by adjusting the conformations of the side chains at the interface. Finding plausible poses in rigid docking requires a very effective search algorithm. Fast Fourier Transform makes it possible to reduce the six-dimensional problem to a one-dimensional problem. A number of algorithms have been developed for this purpose (Katchalski-Katzir et al. 1992; Vakser and Aflalo 1994; Vakser 1995; Mandell et al. 2001; Del Carpio-Muñoz et al. 2002; Chen et al. 2003; Carter et al. 2005; Kozakov et al. 2006; Sternberg et al. 1998). Alternatively, various geometric hashing procedures could be used (Fischer et al. 1995). The resulting poses, usually several thousand of them, need to be scored in order to produce a small number of plausible structures. Scoring functions span a wide range, from simple shape complementarity (Chen et al. 2003), through knowledge-based statistical potentials (Kozakov et al. 2006; Tobi and Bahar 2006; Zhang et al. 2005; Cerutti et al. 2005) and physics-based force fields (Koehl 2006; Sheinerman et al. 2000; Jiang et al. 2002), to data-driven docking supported by available biochemical information (Res and Lichtarge 2005; Res et al. 2005; Lichtarge et al. 1996; Dominguez et al. 2003; Nilges 1995; Anand et al. 2003; van Dijk et al. 2005).
It has also been noted that rigid docking could be achieved in a different way. Instead of computing the assembly structure ab initio, it is sometimes more effective to predict the binding interfaces of the interacting proteins and then perform docking in a limited conformational space. In some sense, this is yet another variant of data-driven docking (Jones and Thornton 1997; Burgoyne and Jackson 2006). The efficiency of various approaches to protein docking is systematically evaluated within the framework of the community-wide Critical Assessment of PRediction of Interactions (CAPRI) experiments (Carter et al. 2005; Janin et al. 2003).
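The FFT-based pose search mentioned in this section can be illustrated with a minimal sketch in the spirit of grid-correlation docking. It assumes that both molecules have already been discretized on the same cubic grid and that one fixed orientation of the ligand is being tested; the grid size, the penalty of -10 for the receptor core, and the function name `fft_translation_scan` are illustrative assumptions, not values or names taken from any of the cited programs.

```python
import numpy as np

def fft_translation_scan(receptor, ligand):
    """Score every cyclic translation of `ligand` against `receptor`.

    Both inputs are real 3D arrays of identical shape. The returned array c
    satisfies c[t] = sum_x receptor[x] * ligand[x - t], with indices wrapping
    around the grid, so one FFT pair replaces an explicit loop over shifts.
    """
    spectrum = np.fft.fftn(receptor) * np.conj(np.fft.fftn(ligand))
    return np.real(np.fft.ifftn(spectrum))

# Toy grids (hypothetical): the receptor core is penalized, a slab of
# favourable "surface" cells sits next to it, and the ligand is a small block.
receptor_grid = np.zeros((16, 16, 16))
receptor_grid[4:8, 4:8, 4:8] = -10.0   # interior: overlap is penalized
receptor_grid[8:10, 4:8, 4:8] = 1.0    # surface layer: overlap is rewarded
ligand_grid = np.zeros((16, 16, 16))
ligand_grid[0:2, 0:4, 0:4] = 1.0

scores = fft_translation_scan(receptor_grid, ligand_grid)
best_shift = np.unravel_index(np.argmax(scores), scores.shape)
```

In a full search this scan would be repeated for each sampled rotation of the ligand grid, and the top-scoring shifts would be kept as candidate poses for rescoring.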

2.3 Flexible Docking

Usually, although not always, protein association induces conformational changes of the components (Bonvin 2006; Camacho and Vajda 2001; May and Zacharias 2005). In many cases, these conformational changes are essentially limited to the interface side chains (Andrusier et al. 2008). The RosettaDock algorithm is well suited to deal with such cases (Gray et al. 2003; Daily et al. 2005; Wang et al. 2005). The procedure starts from rigid docking and then the side chains are optimized using


either a rotamer library or free-space side-chain optimization. Such an approach has proven to be very successful in blind predictions within CAPRI (Wang et al. 2007; Schueler-Furman et al. 2005). Recently, Rosetta modeling technology has been applied to fully flexible docking, or rather “folding and docking” (Das et al. 2009), of small homo-oligomeric protein assemblies. The method utilizes available experimental data: nuclear magnetic resonance (NMR) chemical shifts and residual dipolar couplings (RDCs). A somewhat similar strategy for semi-flexible docking is adopted in the ATTRACT algorithm (the name comes from the attractive interactions of the interface residues) (Zacharias 2003; May and Zacharias 2007). The method employs a coarse-grained representation of side chains, and the docking procedure consists of two steps: rigid docking and optimization of the resulting poses, allowing for flexibility of the side chains of interface residues. The ATTRACT algorithm has also been used in docking simulations allowing for backbone flexibility of the loop regions. It has been demonstrated that even limited flexibility improves the docking results for most of the tested cases. Small conformational changes induced by docking could be accommodated to some extent by reduced representations and/or coarse-grained potentials describing the interface residues. This is probably one of the reasons for the surprisingly good performance of docking procedures based on just shape complementarity and smoothed details of the surface. Recently it has been shown that a smoothed low-resolution representation of the surface residues leads to more consistent shape complementarity (Zhang et al. 2009). The recognition specificity of the interfaces could also be increased by the use of multibody knowledge-based potentials. For instance, four-body statistical pseudo-potentials have proven to be useful in protein–peptide docking, allowing for full flexibility of the peptide moieties (Aita et al.
2010). Before their application in protein docking, four-body potentials had proven to be very effective in scoring protein decoys (Krishnamoorthy and Tropsha 2003; Feng et al. 2010). In summary, while small, local conformational changes accompanying protein docking are handled relatively well by a variety of docking algorithms, large deformations of the components are more difficult to predict. RosettaDock is one of the few exceptions, where de novo prediction of new structures (compared to the known structures of the components) is sometimes feasible. The multiscale approach described below, based on the CABS modeling software and bootstrapped with all-atom molecular mechanics, is a step toward fully flexible docking of proteins and peptides.

2.4 Multiscale Flexible Docking with CABS

The acronym CABS stands for the united atoms representing a residue in a polypeptide chain: CA, the alpha carbon of the main chain; CB, the beta carbon; and S, the center of the side group. The Cα trace in CABS is restricted to a high-resolution cubic lattice with a spacing of 0.61 Å. Cα−Cα distances in CABS are allowed to fluctuate around 3.8 Å. An additional pseudo-atom is located in the center of


the Cα−Cα bond and supports a model of main-chain hydrogen bonds. The accuracy of a projection of a high-resolution protein structure onto the lattice is about half of the lattice spacing. The coordinates of the beta carbons and side chains are not restricted to the lattice and are defined in the reference frame given by the Cα trace. Due to the lattice representation, computations of local conformational transitions of the model chains are extremely fast and, in most cases, reduce to straightforward shuffling of integer numbers. Similarly, most of the interactions can be computed via simple references to large hashing tables. Such computations with CABS are about two orders of magnitude faster than would be possible for an otherwise equivalent continuous-space model (Boniecki et al. 2003). It should be pointed out that, owing to the fine grid of the lattice representation, the model does not exhibit any lattice artifacts. The actual accuracy of the molecular models generated by CABS is lower than the resolution resulting from the lattice representation. The results of free modeling, when successful, are accurate within a low-resolution range of 2.5–5 Å. Comparative models are more accurate, and their accuracy depends on the quality of the templates from which the distance restraints between Cα atoms are extracted. The best models have an accuracy of about 1 Å. CABS allows for easy implementation of various restraints, not only Cα−Cα distances from templates but also restraints from sparse experimental data, such as chemical shifts, residual dipolar couplings, side chain–side chain contacts from mutagenesis, etc. This provides a convenient framework for the treatment of docking flexibility at various levels. The geometric fidelity of the CABS representation is sufficient for reasonably accurate all-atom reconstruction (Gront et al. 2007).
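The statement that the lattice projection is accurate to about half of the lattice spacing can be checked with a small sketch. It assumes the simplest possible projection, in which each coordinate is rounded independently to the nearest lattice node; the actual CABS fitting procedure also has to respect the allowed Cα−Cα distances and is not reproduced here. Under this assumption each coordinate error is uniform between -a/2 and a/2, with variance a²/12, so the three-dimensional root-mean-square error is a/2, or about 0.3 Å for a = 0.61 Å. The function names are hypothetical.

```python
import math
import random

LATTICE_SPACING = 0.61  # angstroms, the grid spacing quoted in the text

def project_to_lattice(coords, spacing=LATTICE_SPACING):
    """Snap each (x, y, z) point to the nearest lattice node (integer triples)."""
    return [tuple(round(c / spacing) for c in point) for point in coords]

def lattice_to_cartesian(nodes, spacing=LATTICE_SPACING):
    """Convert integer lattice nodes back to Cartesian coordinates."""
    return [tuple(i * spacing for i in node) for node in nodes]

def rms_error(points_a, points_b):
    """Root-mean-square distance between corresponding points."""
    squared = [sum((x - y) ** 2 for x, y in zip(p, q))
               for p, q in zip(points_a, points_b)]
    return math.sqrt(sum(squared) / len(squared))

# Random points stand in for Calpha coordinates (an assumption for the demo).
rng = random.Random(0)
points = [tuple(rng.uniform(-50.0, 50.0) for _ in range(3)) for _ in range(5000)]
nodes = project_to_lattice(points)
projection_error = rms_error(points, lattice_to_cartesian(nodes))
```

Because the chain is stored as integer triples, a local move amounts to adding small integer vectors, which is the "shuffling of integer numbers" referred to above.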
A good measure of this fidelity is an experiment in which a structure is projected onto the CABS lattice and its atomic details are subsequently reconstructed. The reconstruction consists of two stages: the first is a very fast rebuilding of the main chain and beta carbons, executed by the Backbone Building from Quadrilaterals (BBQ) program; the second could be side-chain fitting with the side-chain replacement with rotamer libraries (SCWRL) program by Dunbrack (Canutescu et al. 2003). The accuracy of such a reconstruction cycle is within a few tenths of an angstrom for the main-chain atoms and within 1.5–2 Å for the side-chain atoms, depending on the structure. The all-atom structures could be generated at any stage of the docking and scored by force fields other than CABS. Details of the CABS knowledge-based force field (Kolinski 2004) and a description of combinations of CABS with all-atom molecular mechanics (Kmiecik and Kolinski 2007, 2008; Kmiecik et al. 2007) can be found in earlier publications. Sampling protocols of CABS employ various Monte Carlo-based algorithms. When the folding mechanisms are of interest, simple simulated annealing or isothermal Monte Carlo dynamics could be appropriate. Since CABS conformational updating employs various local rearrangements controlled by a pseudo-random mechanism, the trajectories from such simulations represent solutions of a certain master equation of motion and thereby provide a coarse-grained picture of the system dynamics. In this respect, CABS differs from most other reduced models for structure assembly, such as Rosetta (Rohl et al. 2004). There are, however, reduced-space models enabling similar studies. The continuous models (like united


residues (UNRES), Ołdziej et al. 2005) allow for Molecular Dynamics simulations and, similarly to CABS, for Monte Carlo dynamics. When structure prediction alone is the goal, various multicopy MC algorithms are more effective than simulated annealing. Most docking experiments with CABS proceed according to a combination of Replica Exchange Monte Carlo (REMC; also known as parallel tempering) (Hukushima and Nemoto 1996) with simulated annealing. During a typical simulation, a large number of replicas (50) are subject to slow annealing of the entire stack.
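The combination of replica exchange with annealing of the whole temperature stack can be sketched on a toy problem. The sketch below is only an illustration under stated assumptions: a one-dimensional double-well energy stands in for the CABS force field, and the temperature ladder, step size, number of sweeps, and annealing schedule are arbitrary choices, not the settings used in the docking simulations. The exchange test is the standard Metropolis criterion for swapping configurations between two temperatures.

```python
import math
import random

def swap_probability(e_i, e_j, t_i, t_j):
    """Metropolis probability of exchanging replicas i and j (k_B = 1)."""
    delta = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j)
    return 1.0 if delta >= 0 else math.exp(delta)

def energy(x):
    """Toy double-well energy with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def remc_anneal(n_replicas=8, n_cycles=200, sweeps=20, seed=0):
    rng = random.Random(seed)
    base_temps = [0.05 * 1.6 ** k for k in range(n_replicas)]  # geometric ladder
    xs = [rng.uniform(-2.0, 2.0) for _ in base_temps]
    for cycle in range(n_cycles):
        # Anneal the entire stack: scale every temperature from 100% to 10%.
        scale = 1.0 - 0.9 * cycle / max(1, n_cycles - 1)
        temps = [t * scale for t in base_temps]
        # Local Metropolis moves within each replica.
        for k, t in enumerate(temps):
            for _ in range(sweeps):
                trial = xs[k] + rng.uniform(-0.3, 0.3)
                d = energy(trial) - energy(xs[k])
                if d <= 0 or rng.random() < math.exp(-d / t):
                    xs[k] = trial
        # Attempt exchanges between neighbouring temperatures.
        for k in range(n_replicas - 1):
            p = swap_probability(energy(xs[k]), energy(xs[k + 1]),
                                 temps[k], temps[k + 1])
            if rng.random() < p:
                xs[k], xs[k + 1] = xs[k + 1], xs[k]
    return xs  # xs[0] is the lowest-temperature replica

final_states = remc_anneal()
```

In a docking run, the pseudo-trajectory collected for analysis would be read from the lowest-temperature replica, here `xs[0]`.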

2.4.1 Treatment of Flexibility

CABS facilitates various levels of docking flexibility. Schematically, different instances of docking could be outlined as follows:

A. Semi-flexible docking of two or more proteins of known structures.
B. Docking of a fully flexible (unrestrained folding) protein or peptide on a semi-flexible scaffold of other proteins.
C. De novo assembly (fully flexible, free folding) of a protein (peptide) complex.

Obviously, the success rate and accuracy of the resulting structures decrease from A to C. The unrestricted de novo assembly of a protein complex (C) is at present feasible only for relatively small and structurally not too complex proteins. Various restraints resulting from symmetry (as for homo-dimers, trimers, etc.) could be easily implemented (similarly to what was done before for RosettaDock), increasing the docking accuracy significantly. In semi-flexible docking (A), intra-protein restraints are read from the unbound structures and the corresponding distances are allowed to fluctuate around their unbound values. The poses could be generated by the CABS REMC in an unrestricted fashion, or the initial poses (structures placed at starting replicas) could be obtained from various fast docking algorithms based on shape complementarity. To speed up assembly, the centers of gravity of the assembled molecules are subject to a weak generic attractive force acting at distances larger than the plausible estimated distance within the complex. In such procedures, the trivial flexibility related to different conformations of the interface side chains is approximately accounted for at the stage of the all-atom reconstruction. Flexibility of the interacting proteins does not need to be treated in a uniform fashion. It is easy to restrain just parts of each molecule, allowing the remaining portions to adjust freely during the docking.
Prediction of flexible fragments could be achieved in various ways, including structural comparison of different complexes of the proteins of interest, normal mode or Gaussian network analysis of these proteins, etc. The relative strength of the restraints could also be included in the input data for the docking. Docking of a fully flexible small protein (or a peptide) to a semi-flexible scaffold of another protein (a receptor) is almost always successful, without assuming anything a priori about the pose (except the penalty for large distances between molecules) or the internal conformation of the free molecule.
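The restraint scheme described above can be sketched as follows. This is a minimal illustration, not the CABS implementation: the flat-bottom harmonic form, the cutoff of 10 Å, the tolerance, the weights, and all function names are assumptions made for the example. Residues flagged as flexible contribute no restraints, and the centroid term acts only when the two molecules are further apart than an estimated contact distance.

```python
import math

def build_restraints(ca_coords, flexible, max_dist=10.0, min_sep=3):
    """Collect (i, j, distance) restraints between residues not flagged flexible."""
    restraints = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if flexible[i] or flexible[j]:
                continue  # flexible fragments are left free to adjust
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= max_dist:
                restraints.append((i, j, d))
    return restraints

def restraint_energy(ca_coords, restraints, tolerance=1.0, weight=1.0):
    """Flat-bottom penalty: distances may fluctuate freely within +/- tolerance."""
    total = 0.0
    for i, j, d0 in restraints:
        excess = abs(math.dist(ca_coords[i], ca_coords[j]) - d0) - tolerance
        if excess > 0:
            total += weight * excess ** 2
    return total

def centroid_attraction(coords_a, coords_b, contact_dist=20.0, weight=0.1):
    """Weak pull that is active only beyond the estimated contact distance."""
    center_a = [sum(axis) / len(coords_a) for axis in zip(*coords_a)]
    center_b = [sum(axis) / len(coords_b) for axis in zip(*coords_b)]
    gap = math.dist(center_a, center_b) - contact_dist
    return weight * gap if gap > 0 else 0.0

# Hypothetical five-residue chain; the last residue is marked as flexible.
chain = [(3.8 * i, 0.0, 0.0) for i in range(5)]
mask = [False, False, False, False, True]
chain_restraints = build_restraints(chain, mask, max_dist=12.0)
```

A per-restraint weight could replace the single `weight` argument if the relative strength of the restraints is supplied with the input data.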


In principle, the interactions between the interface amino acids do not need to be the same as the intra-protein interactions. The interactions between the side chains in the CABS model are described by statistical pseudo-potentials derived from regularities seen in known structures. These potentials are context dependent (accounting for the mutual orientations of the side chains and the local conformations of the main chain). Thus, the potentials account in an implicit way for complex multi-body packing effects. The averaged solvent effect is also encoded in these potentials. Interestingly, potentials derived separately for the interfaces in known complexes do not differ significantly from the potentials derived for monomeric proteins. In the example docking simulations described in this chapter, generic CABS potentials were used.

2.4.2 Example of Peptide Docking to Receptor Protein

Small peptides frequently act as coactivators for larger proteins. Below we describe a typical example of such a docking experiment (Kurcinski and Kolinski 2007). The receptor protein is the vitamin D receptor (or rather the receptor part of the entire protein). The receptor is treated in a semi-flexible fashion. A large number of intra-molecular Cα−Cα distances are extracted from the crystallographic structure of the protein. Additionally, the secondary structure defined according to the define secondary structure of proteins (DSSP) assignment is part of the simulation input. The assigned secondary structure provides a bias toward the proper short-range geometry and favors the hydrogen-bonding patterns consistent with this secondary structure. The simulation set-up for the receptor is schematically depicted in the left-hand side of the flow chart given in Fig. 2.1. During the simulation, the receptor structure oscillates around its native structure. The initial set of 50 replicas for the REMC simulations is generated by replication of the receptor structure with peptide chains placed randomly near the protein surface. The internal conformation of the peptide and its location with respect to the receptor are both selected in a random fashion. The set-up for the peptide is illustrated in the top right-hand part of the flow chart. The starting replica with the superimposed receptor structure is shown in the top central panel of the flow chart. The main part of the docking simulation is executed by the CABS algorithm. CABS produces a large number of conformations, stored in a pseudo-trajectory read from the lowest-temperature replica. Typically, the CABS output contains some thousands of structures. A single run generates millions of states and requires several hours on a single Linux computing unit.
The structures stored in the pseudo-trajectory are subject to a clustering procedure (hierarchical clustering or K-means clustering). In the example illustrated here, there is only one well-defined cluster of solutions, containing the majority of the structures. The remaining structures are scattered in an apparently random fashion. The main cluster is very dense, with well-superimposed receptor structures. Only the terminal and some loop residues deviate a little (0.5–1.0 Å) from the mean structure, which is almost identical to the crystallographic structure. The cloud of the peptide structures is also very well defined (bottom, right-hand panel of Fig. 2.1), with


Fig. 2.1 Flow chart of multiscale hierarchical peptide–protein docking. See the text for details

the root-mean-square dispersion below 1 Å. The centroid structure from the main cluster provides a scaffold for the all-atom reconstruction (left-hand panels) of the complex. The reconstructed structure is optimized with an all-atom force field and refined by Molecular Dynamics. The final structures obtained in such test docking are of crystallographic resolution. In several tests of peptide docking to various receptors, the proper pose has always been found within the main cluster of solutions. For longer peptides (25–30 amino acids), the resulting coordinates of the end residues of the peptide were usually of lower accuracy (2–3 Å), although the interface contact maps were always predicted (or rather postdicted) with high accuracy.
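The selection of the main cluster and its centroid from the pseudo-trajectory can be illustrated with a deliberately simple stand-in. The text names hierarchical and K-means clustering; the sketch below instead uses a plain neighbour count within an RMSD cutoff, which is a different and cruder technique chosen only because it fits in a few lines. It assumes that all frames have already been superimposed on the receptor, so the peptide coordinates can be compared directly; the cutoff value and the function names are hypothetical.

```python
import math

def rmsd(frame_a, frame_b):
    """Coordinate RMSD of two equally long point lists, without re-superposition."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(frame_a, frame_b))
    return math.sqrt(total / len(frame_a))

def largest_cluster(frames, cutoff=1.0):
    """Return (centre_index, member_indices) of the most populated neighbourhood."""
    best_centre, best_members = None, []
    for i, frame_i in enumerate(frames):
        members = [j for j, frame_j in enumerate(frames)
                   if rmsd(frame_i, frame_j) <= cutoff]
        if len(members) > len(best_members):
            best_centre, best_members = i, members
    return best_centre, best_members

# Hypothetical one-point "peptides": three similar poses and two outliers.
toy_frames = [[(0.0, 0.0, 0.0)], [(0.5, 0.0, 0.0)], [(1.0, 0.0, 0.0)],
              [(10.0, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
centre, members = largest_cluster(toy_frames, cutoff=0.6)
```

The frame at index `centre` plays the role of the centroid structure that is passed on to the all-atom reconstruction. The all-against-all comparison scales quadratically with the number of frames, which is acceptable for the few thousand structures typical of a pseudo-trajectory.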

2.4.3 Protein–Protein Docking

The methodology described in the previous sections could be used for protein–protein docking where one or two (or more) proteins could be treated in a fully


flexible fashion, without assuming anything about their structures within the complex. Good results of fully flexible, unrestricted docking can be expected only for relatively small and structurally not too complex proteins, of the size of the Rop homo-dimer, which consists of two antiparallel long-helical hairpins, or the crambin pseudo-dimer. For larger proteins, properly folded complex structures are not always obtained. At the present status of the CABS methodology, semi-flexible docking simulations are more dependable; in these, at least parts of the modeled structures are controlled by weaker or stronger intra-molecular restraints derived from the unbound structures or from different complexes of the proteins of interest. Two examples of semi-flexible docking results are illustrated in Figs. 2.2 and 2.3. In both cases, the structures of the larger proteins in the complex were strongly restrained to their unbound native structures. Docking simulations modified these structures very little (deviation range of 0.3–0.9 Å). Actually, this is very close to the structural differences seen between the individual proteins in the complexes and their unbound structures. The structures of the larger proteins are shown in gray. The second components of the complexes had higher flexibility: the restraints were much weaker, allowing for large fluctuations in the range of 5–10 Å. In both cases, near-native structures were found in the largest clusters. The resulting poses are qualitatively correct, although the details of the internal structures of these proteins contain several errors. For easy comparison, Figs. 2.2 and 2.3 show both experimental (green) and calculated (red) structures. The drawings were made assuming the best superimposition of the larger proteins in the complex. The resulting superimposition of the second proteins illustrates the sum of the errors of the pose and the errors of the internal coordinates.
This combined error (the coordinate root-mean-square deviation of the smaller protein after the best superimposition of the larger protein) is 2.6 Å in the first case and 3.8 Å in the second. Thus, qualitatively correct poses have been predicted (nothing has been assumed about the mutual orientation

Fig. 2.2 Structure obtained from the docking procedure. The “receptor protein” (PDB code 1ppn) is shown in gray; the “ligand” protein (PDB code 2oct) is shown in green for the crystallographic structure (PDB code 1stf) and in red for the final model. See the text for details


Fig. 2.3 Structure obtained from the docking procedure. The “receptor protein” (PDB code 2hnt) is shown in gray; the “ligand” protein (PDB code 5hir) is shown in green for the crystallographic structure (PDB code 4htc) and in red for the final model. See the text for details

of the components), although the structural details have been distorted, especially in the second case. The initial 50 replicas for the REMC simulations were generated by the FTdock program.
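The error measure used in this section, the RMSD of the smaller protein after the best superimposition of the larger one, can be sketched with the standard Kabsch superposition. The sketch assumes that the model and the reference share the same atom ordering and that coordinates are given as N×3 NumPy arrays; the function names are illustrative and do not come from the CABS software.

```python
import numpy as np

def kabsch(mobile, target):
    """Rotation and translation that best map `mobile` onto `target` (least squares)."""
    mobile_centre = mobile.mean(axis=0)
    target_centre = target.mean(axis=0)
    covariance = (mobile - mobile_centre).T @ (target - target_centre)
    u, _, vt = np.linalg.svd(covariance)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = target_centre - rotation @ mobile_centre
    return rotation, translation

def ligand_rmsd(model_receptor, model_ligand, ref_receptor, ref_ligand):
    """RMSD of the ligand after superimposing the model onto the reference receptor."""
    rotation, translation = kabsch(model_receptor, ref_receptor)
    moved = model_ligand @ rotation.T + translation
    return float(np.sqrt(((moved - ref_ligand) ** 2).sum(axis=1).mean()))
```

Because only the receptor atoms determine the transformation, the resulting value combines the error of the pose with the error of the internal coordinates of the smaller protein, as described above.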

2.5 Perspectives

The problem of in silico flexible docking, especially in cases where the docking-induced conformational changes are large, is far from being solved. Nevertheless, there are numerous encouraging small steps toward a partial solution of this problem. Multiscale procedures, in which flexible docking is performed using various coarse-grained protein models and the resulting poses are then refined by more detailed molecular mechanics, seem to be very promising. Here, we described combinations of the CABS reduced-space modeling methodology with all-atom refinements applied to flexible and semi-flexible protein–protein and protein–peptide docking. The method is now mature enough for large-scale predictions of protein interactomes. In large-scale applications, it may be necessary to introduce a pre-screening phase employing fast docking procedures based on shape complementarity. At present, the described method is limited to proteins and peptides. An extension to nucleic acids will require development of their coarse-grained representation, consistent with the CABS representation of proteins. The treatment of small ligands within such multiscale docking procedures also requires significant extensions of the knowledge-based force fields, and this is still an open problem.


Another possible application of the outlined method is the study of the assembly mechanisms of protein complexes. The CABS sampling techniques and force field enable meaningful simulations of folding pathways (see Chapter 12). Extension of the method to multimeric assemblies is straightforward, although it will require larger computing resources because of the higher complexity of the problem. Finally, we would like to note that coarse-grained protein models are potentially very interesting in the context of a different class of docking problems, namely, fitting molecular structures into low-resolution experimental data from cryo-electron microscopy (EM) or similar techniques (Lindert et al. 2009; Orzechowski and Tama 2008; Jolley et al. 2008). This problem, however, is beyond the scope of this chapter.

References

Aita T, Nishigaki K, Husimi Y (2010) Toward the fast blind docking of a peptide to a target protein by using a four-body statistical pseudo-potential. Comput Biol Chem 34:53–62
Aloy P, Russell RB (2004) Ten thousand interactions for the molecular biologist. Nat Biotechnol 22:1317–1321
Anand GS, Law D, Mandell JG, Snead AN, Tsigelny I, Taylor SS, Ten Eyck LF, Komives EA (2003) Identification of the protein kinase A regulatory RIalpha-catalytic subunit interface by amide H/2H exchange and protein docking. Proc Natl Acad Sci USA 100:13264–13269
Andrusier N, Mashiach E, Nussinov R, Wolfson HJ (2008) Principles of flexible protein–protein docking. Proteins 73:271–289
Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A (2003) Protein fragment reconstruction using various modeling techniques. J Comput Aid Mol Des 17:725–738
Bonvin AM (2006) Flexible protein–protein docking. Curr Opin Struct Biol 16:194–200
Burgoyne NJ, Jackson RM (2006) Predicting protein interaction sites: binding hot-spots in protein–protein and protein–ligand interfaces. Bioinformatics 22:1335–1342
Camacho CJ, Vajda S (2001) Protein docking along smooth association pathways. Proc Natl Acad Sci USA 98:10636–10641
Canutescu AA, Shelenkov AA, Dunbrack RL (2003) A graph-theory algorithm for rapid protein side-chain prediction. Prot Sci 12:2001–2014
Carter P, Lesk VI, Islam SA, Sternberg MJ (2005) Protein–protein docking using 3D-dock in rounds 3, 4, and 5 of CAPRI. Proteins 60:281–288
Cerutti DS, Ten Eyck LF, McCammon JA (2005) Rapid estimation of solvation energy for simulations of protein–protein association. J Chem Theory Comput 1:143–152
Chen R, Li L, Weng Z (2003) ZDOCK: an initial-stage protein-docking algorithm. Proteins 52:80–87
Daily MD, Masica D, Sivasubramanian A, Somarouthu S, Gray JJ (2005) CAPRI rounds 3–5 reveal promising successes and future challenges for RosettaDock. Proteins 60:181–186
Das R, André I, Shen Y, Wu Y, Lemak A, Bansal S, Arrowsmith CH, Szyperski T, Baker D (2009) Simultaneous prediction of protein folding and docking at high resolution. Proc Natl Acad Sci USA 106:18978–18983
Del Carpio-Muñoz CA, Ichiishi E, Yoshimori A, Yoshikawa T (2002) MIAX: a new paradigm for modeling biomacromolecular interactions and complex formation in condensed phases. Proteins 48:696–732
Dominguez C, Boelens R, Bonvin AM (2003) HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J Am Chem Soc 125:1731–1737
Feng Y, Kloczkowski A, Jernigan RL (2010) Potentials ‘R’ Us web-server for protein energy estimations with coarse-grained knowledge-based potentials. BMC Bioinformatics 11:92
Fischer D, Lin SL, Wolfson HJ, Nussinov R (1995) A geometry-based suite of molecular docking processes. J Mol Biol 248:459–477
Gray JJ, Moughon S, Wang C, Schueler-Furman O, Kuhlman B, Rohl CA, Baker D (2003) Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331:281–299
Gront D, Kmiecik S, Kolinski A (2007) Backbone building from quadrilaterals: a fast and accurate algorithm for protein backbone reconstruction from alpha carbon coordinates. J Comput Chem 28:1593–1597
Hukushima K, Nemoto K (1996) Exchange Monte Carlo method and application to spin glass simulations. J Phys Soc Jpn 65:1604–1608
Janin J, Henrick K, Moult J, Eyck LT, Sternberg MJ, Vajda S, Vakser I, Wodak SJ (2003) CAPRI: a Critical Assessment of PRedicted Interactions. Proteins 52:2–9
Jiang L, Gao Y, Mao F, Liu Z, Lai L (2002) Potential of mean force for protein–protein interaction studies. Proteins 46:190–196
Jolley CC, Wells SA, Fromme P, Thorpe MF (2008) Fitting low-resolution cryo-EM maps of proteins using constrained geometric simulations. Biophys J 94:1613–1621
Jones S, Thornton JM (1997) Prediction of protein–protein interaction sites using patch analysis. J Mol Biol 272:133–143
Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA (1992) Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci USA 89:2195–2199
Kmiecik S, Gront D, Kolinski A (2007) Towards the high-resolution protein structure prediction. Fast refinement of reduced models with all-atom force field. BMC Struct Biol 7:43
Kmiecik S, Kolinski A (2007) Characterization of protein-folding pathways by reduced-space modeling. Proc Natl Acad Sci USA 104:12330–12335
Kmiecik S, Kolinski A (2008) Folding pathway of the B1 domain of protein G explored by multiscale modeling. Biophys J 94:726–736
Koehl P (2006) Electrostatics calculations: latest methodological advances. Curr Opin Struct Biol 16:142–151
Kolinski A (2004) Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 51:349–371
Kozakov D, Brenke R, Comeau SR, Vajda S (2006) PIPER: an FFT-based protein docking program with pairwise potentials. Proteins 65:392–406
Krishnamoorthy B, Tropsha A (2003) Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics 19:1540–1548
Kurcinski M, Kolinski A (2007) Steps towards flexible docking: modeling of three-dimensional structures of the nuclear receptors bound with peptide ligands mimicking co-activators’ sequences. J Steroid Biochem 103:357–360
Lichtarge O, Bourne HR, Cohen FE (1996) An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257:342–358
Lindert S, Staritzbichler R, Wötzel N, Karakaş M, Stewart PL, Meiler J (2009) EM-fold: de novo folding of alpha-helical proteins guided by intermediate-resolution electron microscopy density maps. Structure 17:990–1003
Mandell JG, Roberts VA, Pique ME, Kotlovyi V, Mitchell JC, Nelson E, Tsigelny I, Ten Eyck LF (2001) Protein docking using continuum electrostatics and geometric fit. Protein Eng 14:105–113
May A, Zacharias M (2005) Accounting for global protein deformability during protein–protein and protein–ligand docking. Biochim Biophys Acta 1754:225–231
May A, Zacharias M (2007) Protein–protein docking in CAPRI using ATTRACT to account for global and local flexibility. Proteins 69:774–780
Nilges M (1995) Calculation of protein structures with ambiguous distance restraints. Automated assignment of ambiguous NOE crosspeaks and disulphide connectivities. J Mol Biol 245:645–660
Orzechowski M, Tama F (2008) Flexible fitting of high-resolution X-ray structures into cryo-electron microscopy maps using biased molecular dynamics simulations. Biophys J 95:5692–5705
Ołdziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA, Jagielska A, Makowski M, Schafroth HD, Kaźmierkiewicz R, Ripoll DR, Pillardy J, Saunders JA, Kang YK, Gibson KD, Scheraga HA (2005) Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proc Natl Acad Sci USA 102:7547–7552
Res I, Lichtarge O (2005) Character and evolution of protein–protein interfaces. Phys Biol 2:S36–S43
Res I, Mihalek I, Lichtarge O (2005) An evolution based classifier for prediction of protein interfaces without using protein structures. Bioinformatics 21:2496–2501
Ritchie DW (2008) Recent progress and future directions in protein–protein docking. Curr Protein Pept Sci 9:1–15
Rohl CA, Strauss CE, Misura KM, Baker D (2004) Protein structure prediction using Rosetta. Methods Enzymol 383:66–93
Salwinski L, Eisenberg D (2003) Computational methods of analysis of protein–protein interactions. Curr Opin Struct Biol 13:377–382
Schueler-Furman O, Wang C, Baker D (2005) Progress in protein–protein docking: atomic resolution predictions in the CAPRI experiment using RosettaDock with an improved treatment of side-chain flexibility. Proteins 60:187–194
Sheinerman FB, Norel R, Honig B (2000) Electrostatic aspects of protein–protein interactions. Curr Opin Struct Biol 10:153–159
Sternberg MJ, Gabb HA, Jackson RM (1998) Predictive docking of protein–protein and protein–DNA complexes. Curr Opin Struct Biol 8:250–256
Tobi D, Bahar I (2006) Optimal design of protein docking potentials: efficiency and limitations. Proteins 62:970–981
Vajda S, Kozakov D (2009) Convergence and combination of methods in protein–protein docking. Curr Opin Struct Biol 19:164–170
Vakser IA, Aflalo C (1994) Hydrophobic docking: a proposed enhancement to molecular recognition techniques. Proteins 20:320–329
Vakser IA (1995) Protein docking for low-resolution structures. Protein Eng 8:371–377
Valencia A, Pazos F (2002) Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 12:368–373
van Dijk AD, de Vries SJ, Dominguez C, Chen H, Zhou H, Bonvin AM (2005) Data-driven docking: HADDOCK’s adventures in CAPRI. Proteins 60:232–238
Wang C, Schueler-Furman O, Andre I, London N, Fleishman SJ, Bradley P, Qian B, Baker D (2007) RosettaDock in CAPRI rounds 6–12. Proteins 69:758–763
Wang C, Schueler-Furman O, Baker D (2005) Improved side-chain modeling for protein–protein docking. Prot Sci 14:1328–1339
Wodak SJ, Janin J (1978) Computer analysis of protein–protein interaction. J Mol Biol 124:323–342
Zacharias M (2005) ATTRACT: protein–protein docking in CAPRI using a reduced protein model. Proteins 60:252–256
Zacharias M (2003) Protein–protein docking with a reduced protein model accounting for side-chain flexibility. Prot Sci 12:1271–1282
Zhang C, Liu S, Zhu Q, Zhou Y (2005) A knowledge-based energy function for protein–ligand, protein–protein, and protein–DNA complexes. J Med Chem 48:2325–2335
Zhang Q, Sanner M, Olson AJ (2009) Shape complementarity of protein–protein complexes at multiple resolutions. Proteins 75:453–467

Chapter 3

Coarse-Grained Models of Proteins: Theory and Applications Cezary Czaplewski, Adam Liwo, Mariusz Makowski, Stanisław Ołdziej, and Harold A. Scheraga

Abstract In this chapter, reduced (coarse-grained) protein models are discussed. Emphasis is given to those models which can be used in simulating the structure, thermodynamics, and dynamics of real proteins and are, at the same time, transferable. The coarse-grained force fields are introduced in a physics-based way as potentials of mean force of polypeptide chains in reduced representations, in which the secondary degrees of freedom have been averaged out. Based on this general formula, three categories of coarse-grained potentials are introduced: (i) statistical potentials derived from structural databases, (ii) potentials obtained by factorization of the parent potential of mean force, which enables us to split the system into smaller subsystems and derive each effective energy contribution independently, and (iii) potentials obtained by the force-matching method. Optimization of the potential function to achieve foldability is discussed. Applications of coarse-grained potentials to predict protein structures and simulate long-time protein dynamics are presented. We conclude that while, with the aid of massively parallel computers, coarse graining enables us to reach millisecond simulation timescales of real-size proteins, and case studies indicate that the results of these simulations are realistic, much work remains to be done to improve the force fields.

A. Liwo, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA; e-mail: [email protected] This chapter is dedicated to the memory of Urszula Kozłowska, our long-time colleague and coworker. It is unfortunate that her early passing prevented her from participating in writing this chapter.

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_3, © Springer Science+Business Media, LLC 2011

3.1 Introduction

There are two aspects to the protein-folding problem. These are the determination of the folding pathways and the resulting native structure. Both experimental and theoretical methods are used to solve this problem. This chapter is concerned only with the theoretical approach, given the amino acid sequence of the protein. The theoretical approach is based on the thermodynamic hypothesis enunciated by Anfinsen (1973), according to which the native protein adopts the conformation in which the protein plus the surrounding solvent is a system whose free energy is at the global minimum. There are two basic ingredients of the theoretical approach: formulation of an appropriate potential energy function with which to compute the interaction between every pair of atoms in the polypeptide chain, and development of an algorithm to identify its global minimum. Originally, the focus was on locating the global minimum of the potential energy, based on a large menu of procedures to search conformational space (Scheraga 1988, 1996; Scheraga et al. 2004), but entropic effects were later introduced (Liwo et al. 2007) to locate the global minimum of the free energy. The initial applications made use of an all-atom potential energy function but, with the computer resources available at that time, the largest structure that could be simulated was the 46-residue protein A, with one of the procedures in the aforementioned menu, namely Electrostatically Driven Monte Carlo (EDMC) (Vila et al. 2003). To make further progress, a hierarchical procedure was adopted (Liwo et al. 1993b; Scheraga et al. 2004) in which the initial search of conformational space was carried out with a coarse-grained United-Residue (UNRES) model of the polypeptide chain to locate the region of the global minimum of the UNRES potential energy. This was followed by conversion of the UNRES model to an all-atom one (Kaźmierkiewicz et al. 2002, 2003) and subsequent optimization of the all-atom model with the EDMC procedure.

Coarse-grained representations of proteins have long been of interest in theoretical simulations of protein structure and dynamics (Koliński and Skolnick 2004; Tozzini 2005; Colombo and Micheletti 2006; Clementi 2008).
The primary reason for this is that they involve much less computational effort than all-atom or united-atom representations of the polypeptide chain; this speeds up simulations of dynamics, folding pathways, and thermodynamics by four orders of magnitude compared to all-atom simulations with explicit solvent (Liwo et al. 2005) and, in turn, enables simulation of biomolecular processes at the millisecond timescale. Another application of coarse-grained models is prediction of protein structure from amino acid sequence, which is becoming increasingly important because of the growing gap between the number of known protein sequences and structures; this gap is not likely to diminish in the foreseeable future even with the improvement of experimental methods such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy.

A number of comprehensive reviews of coarse-grained models applied to biomolecular and soft-material systems (Ayton et al. 2007a; Pincus et al. 2008) and specifically to proteins (Koliński and Skolnick 2004; Tozzini 2005; Colombo and Micheletti 2006; Gront et al. 2009) have been published recently. A book on coarse-grained models has been published recently under the editorship of Voth (2008) with contributions from over 30 research groups; this book offers a comprehensive survey of the state of the art of predominantly physics-based coarse-grained models.


3.2 History of Coarse-Grained Protein Models

The history of coarse-grained models of proteins began with the pioneering work of Levitt and Warshel (1975) continued by Levitt (1976). These investigators used one or two centers to represent a side chain and three centers (Cα and two pseudo-atoms for the peptide group). The virtual-bond geometry was derived by averaging protein crystal data, and a large part of the potential energy function was obtained by Boltzmann-averaging the all-atom energy of model systems, while the parameters of hydrophobic/hydrophilic interaction potentials were based on amino acid solubility and partition coefficient data. The search procedure consisted of a series of local minimizations, performed in angular variables, with a pushing potential that prevented a system from returning to the already-found energy minimum, thereby allowing a larger-scale search of conformational space. The method was tried on bovine pancreatic trypsin inhibitor (BPTI) and found to produce protein-like structures when simulated folding was started from an extended chain; however, the native structure of the test protein was not reached. Although the Levitt and Warshel work was not developed further, it was the first attempt to construct a physics-based reduced model of polypeptide chains and laid solid foundations for the development of later physics-based models (Pincus and Scheraga 1977; Gerber 1992; Liwo et al. 1993b, 2008a; Wallqvist and Ullner 1994; Maupetit et al. 2007; Voth 2008; Chebaro et al. 2009), some of which (Liwo et al. 1993b, 2008a; Derreumaux 1997, 1999; Derreumaux and Mousseau 2007; Maupetit et al. 2007; Chebaro et al. 2009) were implemented successfully in energy-based protein structure prediction (Ołdziej et al. 2005) and ab initio protein-folding simulations (Derreumaux 1999; Liwo et al. 2005; Voth 2008). Potentials of this class are discussed in detail in Section 3.5.3.
At nearly the same time, Tanaka and Scheraga (1976) introduced the first knowledge-based (statistical) protein energy function. These investigators determined a residue–residue interaction matrix from the database of protein structures known at that time by applying the Boltzmann principle to the frequency of contacts between pairs of residues of given types. Two residues were considered to be in contact when the distance between their Cα atoms was less than 7 Å (Tanaka and Scheraga 1976). Miyazawa and Jernigan (1985) developed a residue–residue contact potential by using a more refined method, in which the random-flight-chain component was removed from the effective contact energies and the quasi-chemical approximation (Fowler and Guggenheim 1949) was used instead of the simple Boltzmann inversion of contact frequency. Later they revised this potential (Miyazawa and Jernigan 1996) by using a larger database of protein structures and more refined approximations (Miyazawa and Jernigan 1999). Burgess and Scheraga (1975) introduced a residue–residue five-state model to examine the conformational states of bovine pancreatic trypsin inhibitor (BPTI). Wang and Wang (1999) proposed reducing the 20×20 contact-potential matrix to one with only five distinct residue types. Structure-based contact potentials were implemented in simulations of protein packing (Gregoret and Cohen 1990), in on-lattice folding simulations (Covell 1992; Pincus et al. 2008), and in fold recognition (Maiorov and Crippen 1992).
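The simple Boltzmann inversion of contact frequencies described above can be illustrated with a short sketch. Only the 7 Å Cα–Cα cutoff comes from the text; the random-mixing reference state, the exclusion of directly bonded neighbors (`min_sep=2`), and the function names are assumptions made for illustration and are not the procedure of the original papers.

```python
import math
from collections import Counter
from itertools import combinations

CONTACT_CUTOFF = 7.0  # Angstrom, C-alpha to C-alpha cutoff quoted in the text

def count_contacts(residues, min_sep=2):
    """Count contacts by residue-type pair for one chain.
    `residues` is a list of (type, (x, y, z)) entries for the C-alpha atoms.
    Pairs closer than `min_sep` in sequence are skipped (an assumption)."""
    contacts = Counter()
    for (i, (ti, ri)), (j, (tj, rj)) in combinations(enumerate(residues), 2):
        if j - i >= min_sep and math.dist(ri, rj) < CONTACT_CUTOFF:
            contacts[tuple(sorted((ti, tj)))] += 1
    return contacts

def contact_energies(contacts, composition, kT=1.0):
    """e_ab = -kT ln(N_obs / N_exp), with N_exp taken from a random-mixing
    reference state built from the residue composition (an assumption)."""
    total = sum(contacts.values())
    n_res = sum(composition.values())
    energies = {}
    for (a, b), n_obs in contacts.items():
        xa, xb = composition[a] / n_res, composition[b] / n_res
        p_exp = xa * xb if a == b else 2.0 * xa * xb
        energies[(a, b)] = -kT * math.log(n_obs / (total * p_exp))
    return energies
```

Pairs observed more often than the reference state predicts receive negative (favorable) energies, and under-represented pairs receive positive ones; in practice the counts would be accumulated over a whole structural database rather than one chain.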


Work on continuous knowledge-based potentials for off-lattice simulations was initiated by Kuntz et al. (1976), who constructed a minimal model with Cβ atoms as interaction sites. The pseudo-energy consisted of a series of quadratic penalty terms accounting for the violation of allowed distances between the Cβ atoms close in sequence, favoring close distances between hydrophobic or oppositely charged residues and penalizing close distances between polar or like-charged residues as well as between polar or charged and hydrophobic residues. Additionally, native disulfide bond topology and the topology of cysteine residues in Fe–S clusters were enforced, if applicable. The residue–residue interaction penalty terms were scaled by interaction-matrix elements, which were assigned in an arbitrary manner according to the number of carbon atoms and the presence of charged or polar groups. When applied to rubredoxin and BPTI, the approach produced low-resolution structures with native-like topology.

Crippen and coworkers (Obatake and Crippen 1981; Crippen and Viswanadhan 1984) implemented the Boltzmann principle to determine continuous residue–residue potentials from protein crystal structures and fitted them to a combination of a Lennard–Jones-like and a Gaussian (Obatake and Crippen 1981) or a Lennard–Jones-like functional form (Crippen and Viswanadhan 1984). However, these potentials located near-native structures as stable local minima only when local energy minimization was started from the experimental structure of a protein.

Yčas and coworkers (Yčas et al. 1978; Goel and Yčas 1979) and Wako and Scheraga (1982a,b) developed approaches in which distances between Cα atoms were restrained to average values determined from the Protein Data Bank (PDB) (Berman et al. 2000), depending on residue types and separation in sequence; therefore, the corresponding target functions could be considered knowledge-based potentials.
These approaches had some success in reproducing the native-like structures of BPTI, lysozyme, and staphylococcal nuclease. Another successful knowledge-based potential was developed by Sun (1993), who was able to locate the lowest-energy structures of melittin, apamin, and avian pancreatic polypeptide inhibitor by using a genetic algorithm with the experimental radius of gyration of the protein under study as a restraint.

Following the above-mentioned works, knowledge-based potentials applicable to fold recognition were developed (Jones and Thornton 1993). By applying the Boltzmann principle and using protein crystal data, Sippl and coworkers (Hendlich et al. 1990; Sippl 1990a,b, 1993; Casari and Sippl 1992) developed continuous potentials of residue–residue interactions dependent on inter-residue distance, residue types, and residue separation in sequence. These potentials were able to recognize the folds of proteins and protein fragments. This approach was later continued by a number of investigators (Reva et al. 1997; Samudrala and Moult 1998) and also applied to protein complexes (Jiang et al. 2002) and protein–ligand complexes (Mitchell et al. 1999). More complex pseudo-energy functions for sequence threading, which include explicit local-interaction and solvation terms (Bryant and Lawrence 1993; Jones and Thornton 1993; Godzik et al. 1993; Miller et al. 1996), with parameters optimized by using a set of training proteins (Meller and Elber 2001), were developed later. Buchete et al. (2003, 2004) developed distance-dependent and orientation-dependent statistical potentials for protein-fold recognition; they demonstrated that introducing orientation dependence greatly improves the capability of fold recognition. The orientation dependence was introduced with spherical harmonics. Statistical orientation-dependent side-chain–side-chain interaction potentials, in which each side chain is represented by an ellipsoid, were constructed by Liwo et al. (1997a) and Mukherjee et al. (2005). The statistical potentials based on the Boltzmann principle are discussed in Section 3.5.2.

A different approach to deriving knowledge-based potentials was developed by Wolynes and coworkers (Sasai and Wolynes 1990; Friedrichs et al. 2001; Goldstein et al. 1992a,b; Hardin et al. 2002; Eastwood et al. 2002, 2003). Instead of using the Boltzmann principle to determine the potentials, these investigators developed energy functions termed associative Hamiltonians, which linked the parameters of the potentials to the structures of a number of proteins from the database, the strength of coupling depending on the homology between the sequence under study and a sequence from the database. This approach can be considered sophisticated comparative modeling, in which sequence homology is incorporated into an energy function. Later versions of the method (Hardin et al. 2002; Eastwood et al. 2002, 2003) contain hydrogen-bonding and long-range side-chain contact energy. Initially a polypeptide chain was represented as a sequence of Cα atoms; in later versions, the Cα atoms, backbone oxygen atoms, and side-chain centers were defined as interaction sites. The approach had some success in protein structure prediction (Hardin et al. 2002; Prentiss et al. 2006, 2008) and in Brownian dynamics simulations of protein folding (Wolynes 2005).
Until the early 1990s, both physics-based and knowledge-based coarse-grained potentials were constructed as sums of individual terms. It has, however, become clear that the simplifications inherent in coarse graining make the resulting energy function too inaccurate to fold proteins from an arbitrary starting conformation unless the energy-term parameters are adjusted. Work on this problem was initiated by Crippen and coworkers (Crippen and Viswanadhan 1987; Crippen and Snow 1990; Seetharamulu and Crippen 1991), who defined potential-function optimization as a linear programming problem in which the difference between the energy of the native structure and that of the lowest-energy non-native structure of a training protein was maximized. They used the energy-embedding method as a global optimization algorithm to search for low-energy structures. With this method and with the use of avian pancreatic polypeptide (APP) (Crippen and Snow 1990) and APP and crambin (Seetharamulu and Crippen 1991) as training proteins, the optimized potential energy function was able to locate the native-like structures of apamin and melittin as the lowest in energy. However, later work in which simulated annealing was implemented as a search method (Snow 1992) found non-native structures lower in energy, showing the critical role of the quality of sampling in potential-function optimization. The physics-based justification of the methodology initiated by Crippen and coworkers was provided by Wolynes and coworkers (Bryngelson and Wolynes 1987; Hardin et al. 2002; Goldstein et al. 1992a,b) and Shakhnovich and coworkers (Sali et al. 1994a,b). Force-field optimization is discussed in Section 3.5.5.

The first successful application of coarse-grained potentials in ab initio protein folding was made by Koliński and Skolnick, who developed a high-resolution lattice model of proteins and a statistical potential which included side chain–side chain, local, hydrogen-bonding, and multibody terms accounting for cooperative formation of backbone hydrogen bonds and side-chain contact patterns (Koliński and Skolnick 1992, 1994a; Koliński et al. 1993). A residue was represented by a Cα atom serving as the center of hydrogen-bonding interactions and by a side-chain center. Using a Monte Carlo dynamics algorithm, these investigators performed successful folding simulations of protein A, the repressor of primer (ROP) dimer, crambin (Koliński and Skolnick 1994b), and a leucine zipper (Vieth et al. 1994), as well as of model α-helical (Rey and Skolnick 1993; Olszewski et al. 1996; Sikorski et al. 1998) and β-sheet proteins (Koliński et al. 1995), and other folding simulations (Koliński et al. 1996, 2003; Koliński and Skolnick 1997). Later versions of the Koliński–Skolnick model are the side chain-only (SICHO) model (Koliński and Skolnick 1998), in which only the side-chain centroids are interaction sites, and the CABS (Cα, Cβ, side chain) model (Koliński 2004). These models served as a basis of the MONSSTER (Skolnick et al. 1997b) and TOUCHSTONE (Skolnick et al. 2003) approaches to protein structure prediction, which combine multiple sequence alignment, secondary-structure prediction, threading, and coarse-grained simulations. These approaches have been very successful in community-wide experiments on the Critical Assessment of Techniques for Protein Structure Prediction (CASP4–CASP8) (Skolnick et al. 2001, 2003; Zhou et al.
2007a) and have also been applied to structure determination from NMR data (Lee et al. 2006; Latek et al. 2007). The CABS model has recently been applied to study protein-folding pathways (Kmiecik et al. 2006; Kmiecik and Koliński 2007, 2008).

Following the concept of averaging out the less important degrees of freedom, the United-Residue (UNRES) force field has been developed by Liwo, Scheraga, and coworkers (Liwo et al. 1993a,b, 1997a,b, 1998, 2001, 2004b, 2007, 2008a,b; Ołdziej et al. 2003; Czaplewski et al. 2004b; Kozłowska et al. 2007; Chinchio et al. 2007; Makowski et al. 2007a,b,c, 2008; Rojas et al. 2007; Kozłowska et al. 2010a,b). In this model, the interaction sites are side-chain centers and peptide groups placed halfway between the consecutive α-carbon atoms, while the Cα atoms assist only in the definition of the geometry and peptide-group orientation. The early version of UNRES (Liwo et al. 1993a,b) included only pairwise interactions between sites and virtual-bond-torsional terms. Backbone hydrogen-bonding interactions were described by mean-field-based analytical formulas obtained by Boltzmann-averaging the energy of peptide-group dipoles, with parameters derived by averaging the all-atom ECEPP/2 (Momany et al. 1975) energy; they correctly reproduced the directionality of backbone-group hydrogen bonding while retaining only one interaction site per peptide group. The interactions between the side chains were described by Lennard–Jones-like potentials with radii taken from Levitt and Chothia (1976) and well depths computed from the Miyazawa–Jernigan (1985) interaction energies. Later (Liwo et al. 1997a), the side-chain interaction potentials were revised to include anisotropy by using the Gay–Berne model and were reparameterized using a database of 197 high-resolution non-homologous protein structures (Liwo et al. 1997a); the local-interaction parameters were also determined from the PDB (Liwo et al. 1997b), and the weights of the energy terms were determined by using a Z-score optimization approach (Liwo et al. 1997b). Subsequently (Liwo et al. 1998, 2001), the force field was defined rigorously as a restricted free-energy (RFE) function of a united peptide chain in which the secondary degrees of freedom were averaged out; this definition can also be used to derive any coarse-grained force field. Using this definition, Liwo et al. (1998, 2001) derived multibody terms, which are necessary for regular secondary structure to form spontaneously in united-residue simulations (Koliński et al. 1993). The multibody terms, as well as the other terms of the force field, were gradually parameterized using ab initio energy surfaces of model peptide systems (Ołdziej et al. 2003; Liwo et al. 2004b; Kozłowska et al. 2007, 2010a,b), so that all knowledge-based local-interaction terms have now been replaced with physics-based terms, and the knowledge-based side-chain interaction potentials are presently being replaced with potentials of mean force determined from all-atom simulations of models of pairs of side chains in water (Makowski et al. 2007a,b,c, 2008). A hierarchical method was developed for force-field optimization (Liwo et al. 2002, 2007; Ołdziej et al. 2004), which extends the original concept of Wolynes and coworkers (Bryngelson and Wolynes 1987; Hardin et al. 2002; Goldstein et al. 1992a) and Shakhnovich and coworkers (Sali et al. 1994a,b) to energy-ranking partially native states.
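The idea of a restricted free-energy function, in which secondary degrees of freedom are averaged out, can be made concrete with a small numerical sketch. It assumes the commonly used form F(X) = −kT ln[(1/V_Y) ∫ exp(−E(X,Y)/kT) dY], with X the coarse-grained variables and Y the secondary ones, and evaluates the integral on a uniform grid; the toy energy functions, the grid, and the name `restricted_free_energy` are illustrative assumptions, not part of UNRES.

```python
import math

def restricted_free_energy(energy, x, y_grid, kT=1.0):
    """Approximate F(x) = -kT ln[(1/V_Y) * integral exp(-E(x, y)/kT) dy]
    by an average over a uniform grid of the secondary variable y."""
    e = [energy(x, y) for y in y_grid]
    e_min = min(e)  # shift energies for numerical stability
    z = sum(math.exp(-(ei - e_min) / kT) for ei in e) / len(e)
    return e_min - kT * math.log(z)

# Usage: one coarse variable x and one angular secondary variable y.
y_grid = [2.0 * math.pi * k / 360 for k in range(360)]
toy_energy = lambda x, y: x * x + x * math.cos(y)  # hypothetical all-atom energy
profile = [restricted_free_energy(toy_energy, 0.1 * i, y_grid) for i in range(20)]
```

When the energy does not depend on y, F(x) reduces to E(x); otherwise F(x) lies between the minimum and the plain average of E over y, which is how entropic contributions of the eliminated variables enter the coarse-grained potential.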
The UNRES force field was initially used for energy-based protein structure prediction formulated as a global minimum search and had considerable success in CASP experiments (Liwo et al. 1999; Pillardy et al. 2001a; Ołdziej et al. 2005). Recently (Liwo et al. 2005; Khalili et al. 2005a,b), a molecular dynamics (MD) algorithm was implemented in UNRES which extended the scope of the force field to study protein-folding pathways (Khalili et al. 2006), thermodynamic properties (Nanias et al. 2006; Liwo et al. 2007), and also to reformulate physics-based protein structure prediction as a search for the most probable conformational ensemble at temperatures below the folding temperature (Liwo et al. 2007). UNRES was also extended to simulate multichain proteins (Saunders and Scheraga 2003a,b; Rojas et al. 2007) and dynamic formation and breaking of disulfide bonds during protein folding and unfolding (Czaplewski et al. 2004b; Chinchio et al. 2007). A semi-coarse-grained model, with all-atom backbone and united-residue side chains, was developed by Derreumaux and coworkers (Derreumaux 1997, 1999; Wei and Derreumaux 2002). Side-chain interactions were represented by a 4–8 potential with well depths computed from the Miyazawa–Jernigan (1985) interaction energies and backbone long-range interactions were focused to reproduce hydrogen bonding. Later (Maupetit et al. 2007), the force field was enhanced in hydrogen-bonding correlation terms, which was motivated by the presence of multibody contributions to the free energy of desolvation of groups of hydrogen bonds.

The force field has been applied to Monte Carlo (Derreumaux 1997, 1999; Wei and Derreumaux 2002) and molecular dynamics (Derreumaux and Mousseau 2007; St-Pierre et al. 2008) folding simulations of peptides and small proteins; recent applications include prediction of protein structure (Maupetit et al. 2007) and a study of protein aggregation (Wei et al. 2007; Mousseau and Derreumaux 2008). The knowledge-based force field of Takada and coworkers (Takada 2001; Chikenji et al. 2001; Fujitsuka et al. 2004) is similar in spirit. Just recently (Ayton et al. 2007b; Zhou et al. 2007b; Noid et al. 2008; Thorpe et al. 2008; Wang et al. 2009), a new general approach to coarse-graining has been developed by Voth and coworkers. This approach is based on matching the forces computed in the coarse-grained model of a given system to the mean forces computed by all-atom MD simulations for the same system and is discussed in more detail in Section 3.5.4. Preliminary applications to a model α-helical and β-hairpin peptide resulted in energy landscapes funneled to the respective native structure. The general-purpose coarse-graining scheme and force field of Monticelli et al. (2008), termed MARTINI, is similar in spirit; it is less rigorously derived but incorporates experimental thermodynamic and structural data in its parameterization. When applied to proteins, it needs the native secondary structure to work. At the time that the physics-based model of Levitt and Warshel (1975) appeared, Gō and coworkers (Taketomi et al. 1975; Ueda et al. 1978; Cieplak and Sulkowska, Chapter 8 of this book) laid foundations of the now-popular structure-based models, usually termed Gō-like models (Das et al. 2005; Schug et al. 2008; Hills and Brooks 2009). These models overemphasize the interactions in the native state (in rigorous Gō-like models only native contacts result in attractive interactions, all other interactions being repulsive). These models exhibit minimal frustration of the energy landscape. 
Use of Gō-like models is based on the assumption that the topology of the native state determines the major features of protein-folding pathways and kinetics. Other structure-based models contain potential energy terms biasing toward the native secondary structure (Brown et al. 2003; Brown and Head-Gordon 2004; Eskow et al. 2004). The elastic network coarse-grained models, developed relatively late (Bahar et al. 1997; Hinsen 1998; Atilgan et al. 2001; Tobi and Bahar 2005; Ahmed and Gohlke 2006; Chu and Voth 2007; Moritsugu and Smith 2008), can be considered another class of structure-based models. In these models, a polypeptide chain is treated as a network of Cα atoms connected by springs of equilibrium length corresponding to the distance in the experimental structure of the protein studied and force constants depending on distance in the experimental structure; in more advanced applications (Chu and Voth 2007), a double-well potential is imposed on each spring. In the simplest model (Bahar et al. 1997), the force constant is zero if the distance exceeds 7 Å and 1 otherwise (thus the spring network is described by the Kirchhoff adjacency matrix); in more sophisticated approaches, the force constants are Gaussians in distance (Hinsen 1998) or are calculated based on molecular dynamics simulations (Chu and Voth 2007; Moritsugu and Smith 2008). The elastic network models are used to study low-frequency motions of proteins, including thermal fluctuations (Bahar et al. 1997; Atilgan et al. 2001), domain motions (Hinsen 1998),

conformational changes upon folding and unfolding (Chu and Voth 2007; Moritsugu and Smith 2008), and protein–protein binding (Tobi and Bahar 2005). Finally, coarse-grained protein-like models that capture only general features of protein structures, such as the existence of a single native state, are at the other extreme. The oldest are the HP lattice models developed by Dill and coworkers (Chan and Dill 1989, 1990, 1991, 1994; Dill et al. 1995) in which only two types of beads, hydrophobic (H) and polar (P), are present with three contact interaction energy values and then the NPH lattice models developed by Hao and Scheraga (1994) with three types of beads: neutral (N), polar (P), and hydrophobic (H). These models proved invaluable in determining the origin of the general features of protein structures, such as compactness and formation of a hydrophobic core and a hydrophilic exterior, and cooperativity in folding. Protein-like lattice models with more complex interaction patterns were used by Shakhnovich and coworkers (Sali et al. 1994a; Shakhnovich 1997) and Thirumalai and coworkers (Camacho and Thirumalai 1996; Klimov and Thirumalai 1996a,b, 1998) to study foldability criteria.
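
As a concrete illustration of the simplest elastic network model mentioned above (Bahar et al. 1997), the following minimal Python sketch builds the Kirchhoff matrix from Cα coordinates using the 7 Å cutoff quoted in the text. The function name `kirchhoff_matrix` and the four-site toy chain are assumptions made for illustration only; they are not taken from any published implementation.

```python
import math

def kirchhoff_matrix(coords, cutoff=7.0):
    """Build the Kirchhoff (connectivity) matrix for C-alpha coordinates given in Å."""
    n = len(coords)
    gamma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # unit force constant within the cutoff, zero beyond it
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1.0
                gamma[i][i] += 1.0  # each diagonal element counts the contacts of a site
                gamma[j][j] += 1.0
    return gamma

# Hypothetical toy chain: four sites on a straight line, 3.8 Å apart
coords = [(3.8 * k, 0.0, 0.0) for k in range(4)]
gamma = kirchhoff_matrix(coords)
```

In this toy chain only consecutive sites are within the cutoff, so the diagonal holds one contact for the terminal sites and two for the interior ones, and every row sums to zero.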

3.3 Choice of Conformational Space Representation

The geometry of united-residue polypeptide chains is represented in continuous or discretized space. For continuous space, the Cartesian coordinates of the interacting sites or the virtual-bond vectors are usually the variables of choice, although curvilinear (angular) coordinates were implemented in the early model of Levitt and Warshel (Levitt and Warshel 1975; Levitt 1976), in the UNRES model (Liwo et al. 1993b, 1997a,b, 1999), by Hoffman and Knapp (1996) (who considered collective coordinates composed of the ϕ and ψ torsional angles of several adjacent peptide groups), and by He and Scheraga (1998). Angular coordinates are used in energy minimization or Monte Carlo search rather than in molecular dynamics simulations; one of a few counter-examples is the work of He and Scheraga (1998), who carried out Brownian dynamics of model polypeptides using virtual-bond angles and virtual-bond dihedral angles as variables. Use of curvilinear coordinates makes it possible to reduce the number of degrees of freedom by treating the virtual-bond lengths and, sometimes, the virtual-bond angles as fixed, which reduces the cost of minimization and Monte Carlo algorithms. However, curvilinear coordinates are less convenient in molecular dynamics because they introduce a non-diagonal, conformation-dependent inertia tensor (so that a system of linear equations has to be solved in every MD step to compute accelerations from forces, which requires on the order of $N^3$ operations, N being the number of variables) and also cause singularity problems when mapping the curvilinear to Cartesian coordinates. Regardless of the representation (Cartesian or curvilinear), the obvious advantage of a continuous-space representation is the possibility of applying conformational-space exploration algorithms that require an energy gradient (local energy minimization with gradient minimizers and molecular dynamics and its variations).

Discrete representations of conformational space are identified, in the first place, with lattice models (Koliński and Skolnick 2004; Chapters 1 and 12), in which the interaction sites are always located on lattice nodes. Depending on the lattice resolution, the lattice models are divided into low-resolution lattice models, in which the sites connected by site–site virtual bonds are located on neighboring lattice nodes; intermediate-resolution lattices (e.g., the chess-knight lattice), in which the sites connected by a virtual bond are located on second-neighboring lattice nodes; and high-resolution lattices, in which the nodes are spaced by 1.45 Å (cubic lattice) or even by 0.61 Å. Low-resolution lattices can be used only to study protein-like polymers while, with high-coordination lattices, the accuracy of chain representation can be as high as 0.35 Å, which is comparable to the inaccuracy inherent in coarse-grained force fields. Various types of lattice models have been discussed extensively in the excellent review by Koliński and Skolnick (2004). The advantage of lattice models over continuous-space models is the possibility of pre-computing and storing the contributions to energy corresponding to chain fragments at certain conformations, which saves CPU time. However, while this advantage was substantial a decade ago, when processors were slow compared to the present ones and memory access was relatively cheap, memory access is now the bottleneck [the so-called memory wall (Flynn 1999)]. For example, accessing elements of a 10,000,000-element array 100,000,000 times at random takes 11 CPU seconds with an Intel Q9400 2.66 GHz processor, a cost that has to be taken into account. Moreover, conformational search algorithms that require an energy gradient (local minimization using gradient minimizers, molecular dynamics, and related techniques) cannot be used with the lattice approach. 
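
The pre-computation idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the scheme of any particular lattice model: it tabulates a site–site energy by squared lattice distance rather than by chain-fragment conformation, and the Lennard–Jones-like function `pair_energy`, its parameters, and the five-site chain are hypothetical. Only the 1.45 Å spacing is taken from the text.

```python
import math

SPACING = 1.45  # lattice spacing in Å (the cubic-lattice value quoted in the text)

def pair_energy(r, sigma=5.0, eps=1.0):
    """Hypothetical Lennard-Jones-like site-site energy (illustration only)."""
    s6 = (sigma / r) ** 6
    return 4.0 * eps * (s6 * s6 - s6)

def build_table(max_d2):
    """Pre-compute the pair energy for every squared lattice distance up to max_d2."""
    table = [0.0] * (max_d2 + 1)  # index 0 is unused: two sites never coincide
    for d2 in range(1, max_d2 + 1):
        table[d2] = pair_energy(SPACING * math.sqrt(d2))
    return table

def lattice_energy(sites, table):
    """Sum tabulated energies over all site pairs separated by at least two bonds."""
    total = 0.0
    for i in range(len(sites)):
        for j in range(i + 2, len(sites)):
            d2 = sum((a - b) ** 2 for a, b in zip(sites[i], sites[j]))
            if d2 < len(table):  # pairs beyond the tabulated range contribute nothing
                total += table[d2]
    return total

# Toy chain on integer lattice nodes, built from chess-knight-like steps
sites = [(0, 0, 0), (2, 1, 0), (4, 0, 0), (4, 2, 1), (2, 3, 1)]
table = build_table(max_d2=100)
energy = lattice_energy(sites, table)
```

Because all site coordinates are integers, the squared distance is an exact integer index, so the energy evaluation reduces to table lookups; whether this is faster than direct evaluation depends on the memory-access cost discussed above.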
Another discretization of the conformational space is accomplished by restricting it to decoys derived from structural databases or fragments derived from the sequences homologous to a target sequence. The first is the so-called threading or fold recognition approach (Bryant and Lawrence 1993; Godzik et al. 1993; Jones and Thornton 1993; Miller et al. 1996; Meller and Elber 2001; Buchete et al. 2003, 2004), while the second one is the fragment approach applied recently by Baker and coworkers (Simons et al. 1997; Rohl et al. 2004). These representations are applied mainly in protein structure prediction, although the fragment approach was also applied in protein folding by Monte Carlo dynamics (Fujitsuka and Takada 2004).

3.4 Interaction Schemes

Due to the great diversity of coarse-grained models and force fields, it is very difficult, if not impossible, to provide a unique general formula that covers all of them. If we restrict the discussion to force fields that refer to physical interactions, a general formula for the effective energy, U, could be expressed by Eq. (3.1).

$$U = \sum_i u_i^{\mathrm{local}} + \sum_i \sum_j u_{ij} + \sum_{ijkl\ldots} u_{ijkl\ldots} \tag{3.1}$$

where $u_i^{\mathrm{local}}$ denotes a local-interaction term dependent on a single or a number of adjacent sites, $u_{ij}$ denotes the effective interaction energy between sites i and j, and $u_{ijkl\ldots}$ denotes a multibody interaction which extends over sites i, j, k, l, .... The terms of the first two sums resemble those of all-atom force fields, although their physical origin is usually different. The local-interaction terms consist not only of those describing the energetics of virtual-bond stretching, virtual-bond angle bending, and virtual-bond torsional terms but also include more complex terms such as those describing the rotameric states of united side chains and terms that depend on a number of consecutive virtual-bond dihedral angles. Gerber (1992) introduced peptide-geometry restoring terms, which penalize the deviations of virtual-bond length and virtual-bond angles from boundary values resulting from the ideal valence geometry of polypeptide chains; these terms are also local-interaction terms. The $u_{ij}$ terms corresponding to interactions between united side chains usually encode the totality of such interactions, including the effect of the surrounding solvent. The side-chain–side-chain interaction terms depend only on the distance (in most force fields) or also on side-chain orientation. Usually backbone hydrogen bonding is accounted for separately through interactions between backbone sites located on Cα atoms, between consecutive Cα atoms, or a number of sites, each representing a peptide group. Unless the peptide groups are represented at atomistic or nearly atomistic detail, the backbone hydrogen-bonding terms include dependence on peptide-group orientation. The presence of multibody terms is the main difference between the interaction scheme of all-atom force fields and those of coarse-grained force fields. 
Although the multibody terms are also present in some specific and very accurate all-atom force fields, their presence there is not required for a force field to work reasonably well. Conversely, coarse-grained force fields without multibody terms have very restricted application and cannot be used for ab initio folding unless specific biases toward elements of the native structure are introduced. The reason for this becomes clear in Section 3.5.3. One type of multibody terms, which is also present in an all-atom force field with implicit solvent representation, is the solvation term computed from the solvent-accessible surface area or related quantities; another one is the centrosymmetric potential implemented by Koli´nski, Skolnick, and coworkers (Koli´nski et al. 1993; Koli´nski and Skolnick 1994a).
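
The structure of Eq. (3.1) can be made concrete with a short Python sketch. Everything below is hypothetical: the three toy terms are one-dimensional placeholders with no physical meaning, the local term depends on a single site only, each unordered pair is counted once, and the multibody term is summed over all triplets, whereas real force fields restrict such terms to specific groups of sites.

```python
from itertools import combinations

def effective_energy(sites, u_local, u_pair, u_multi, order=3):
    """Sum local, pairwise, and multibody contributions in the spirit of Eq. (3.1)."""
    n = len(sites)
    local = sum(u_local(sites[i]) for i in range(n))
    # each unordered pair (i, j) is counted once (an assumption of this sketch)
    pair = sum(u_pair(sites[i], sites[j]) for i, j in combinations(range(n), 2))
    multi = sum(u_multi(*(sites[k] for k in group))
                for group in combinations(range(n), order))
    return local + pair + multi

# Hypothetical one-dimensional toy terms, chosen only to make the sketch runnable
def u_local(x):
    return 0.1 * x * x

def u_pair(a, b):
    return -1.0 / abs(a - b)

def u_multi(a, b, c):
    return 0.05 * (max(a, b, c) - min(a, b, c))

sites = [0.0, 1.0, 2.5]
total = effective_energy(sites, u_local, u_pair, u_multi)
```

Passing the terms in as functions keeps the decomposition explicit: switching off the multibody part amounts to supplying a function that returns zero, which corresponds to the restricted pairwise-only schemes discussed above.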

3.5 Derivation of Coarse-Grained Force Fields

In this section, three leading methods for developing coarse-grained potentials [statistical potentials (Boltzmann principle), factorization of potentials of mean force, and the force-matching method] are discussed. These three approaches result in potentials that can be used to simulate the structure and dynamics of real proteins and are, in theory, transferable. We omit from this discussion (i) the arbitrary potentials designed to simulate protein-like systems in order to study general properties of protein folding and dynamics (Chan and Dill 1989, 1990, 1991, 1994; Dill et al. 1995; Camacho and Thirumalai 1996; Cieplak et al. 2002; Chapter 8), (ii) the elastic network potentials (Bahar et al. 1997; Hinsen 1998; Atilgan et al. 2001; Tobi and Bahar 2005; Ahmed and Gohlke 2006; Chu and Voth 2007; Moritsugu and Smith 2008), and (iii) the structure-based potentials (Das et al. 2005; Schug et al. 2008; Hills and Brooks 2009; Brown et al. 2003; Brown and Head-Gordon 2004; Eskow et al. 2004; Chapter 8). The reader is referred to the literature for information regarding these three other categories.

3.5.1 Basic Formulations

The coarse-grained energy function can generally be defined as an average of the energy of the corresponding all-atom system, the average being computed over the degrees of freedom that are not present in the coarse-grained representation (the secondary degrees of freedom). An illustration of the correspondence between a united-residue chain and the parent all-atom chain is presented in Fig. 3.1.

Fig. 3.1 Illustration of the correspondence between the all-atom polypeptide chain in water (a) and its coarse-grained (UNRES) representation (b). The side chains in part (b) are represented by ellipsoids of revolution and the peptide groups are represented by small spheres in the middle between consecutive α-carbon atoms. The solvent is implicit in the UNRES model. Reproduced with permission from figure 1 of Czaplewski et al. (2009)

The most physical definition corresponds to the restricted free energy (RFE) or potential of mean force (PMF) obtained by computing the part of the configurational integral of a system corresponding to integrating over the secondary degrees of freedom. If X = (x1 , x2 , . . . , xM ) and Y = (y1 , y2 , . . . , ym ) denote the coarse grained and secondary degrees of freedom (orthogonal to X), respectively, M and m being the dimensions of the space spanned by these coarse-grained and secondary variables, respectively, and E(X;Y) denotes the all-atom energy function, RFE can be expressed by Eq. (3.2) (Liwo et al. 1998, 2001; Izvekov and Voth 2005a,b; Ayton et al. 2007a).

$$F(X) = -RT \ln\left\{ \int_{\Omega_Y} \exp\left[-\frac{E(X;Y)}{RT}\right] \mathrm{d}V_Y \right\} + C \tag{3.2}$$

where $\Omega_Y$ is the space spanned by Y, R is the universal gas constant, T is the absolute temperature, and C is an additive constant. The choice of C varies from approach to approach; for example, Liwo et al. (1998, 2001) chose $C = RT \ln V_Y$, which makes F(X) a restricted excess free energy, while Voth and coworkers (Izvekov and Voth 2005a,b; Ayton et al. 2007a,b) chose $C = -RT \ln (Z_N/z_n)$, with $Z_N$ and $z_n$ denoting the configuration integrals computed over the coarse-grained and all-atom conformations, respectively. Equation (3.2) provides the best physical connection between the coarse-grained and the corresponding all-atom system, because exp[−F(X)/RT] is proportional to the probability of a coarse-grained conformation defined by X. Consequently, the ensemble averages computed over F(X) are theoretically equal to those computed over the parent all-atom energy function E(X;Y). Equation (3.2) also makes it clear that the effective coarse-grained energy function depends on temperature. This point has recently been addressed by Liwo et al. (2007). Earlier coarse-grained models (Levitt and Warshel 1975; Levitt 1976; Pincus and Scheraga 1977; Liwo et al. 1993b) adopted the Boltzmann-averaged energy as the effective energy function. However, the average energy does not have a direct connection to the probability of a coarse-grained conformation, and it is not straightforward to compute ensemble averages using this quantity. Equation (3.2) defines a multidimensional potential of mean force, the computation or determination of which by direct integration, simulation, or from experimental data for the entire protein is infeasible. In the next three sections, the determination of this integral by making necessary simplifications is addressed. It should be noted that Sections 3.5.2, 3.5.3, and 3.5.4 do not imply different physical origins; they discuss different approaches to deriving the coarse-grained potentials based on Eq. (3.2).
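
The difference between the restricted free energy of Eq. (3.2) and the Boltzmann-averaged energy can be seen in a toy calculation. The Python sketch below assumes a hypothetical system with one coarse-grained variable x taking two values and one secondary variable y, with a harmonic energy that is stiff in y for x = 0 and soft for x = 1; the force constants, the integration range, and the value of RT are illustrative assumptions. The constant C is set to RT ln V_Y, with V_Y taken as the length of the y interval.

```python
import math

RT = 0.596  # kcal/mol, roughly 300 K

def energy(x, y):
    """Hypothetical 'all-atom' energy: a stiff well in y for x = 0, a soft one for x = 1."""
    stiffness = 50.0 if x == 0 else 0.5
    return 0.5 * stiffness * y * y

def boltzmann_sums(x, y_min=-5.0, y_max=5.0, n=20001):
    """Trapezoid-rule integrals of exp(-E/RT) and E*exp(-E/RT) over y."""
    h = (y_max - y_min) / (n - 1)
    z = ze = 0.0
    for k in range(n):
        y = y_min + k * h
        e = energy(x, y)
        w = (0.5 if k in (0, n - 1) else 1.0) * h * math.exp(-e / RT)
        z += w
        ze += w * e
    return z, ze, y_max - y_min

def restricted_free_energy(x):
    """F(x) as in Eq. (3.2), with C = RT ln V_Y."""
    z, _, volume = boltzmann_sums(x)
    return -RT * math.log(z / volume)

def averaged_energy(x):
    """Boltzmann average of the energy over y at fixed x."""
    z, ze, _ = boltzmann_sums(x)
    return ze / z

f0, f1 = restricted_free_energy(0), restricted_free_energy(1)
e0, e1 = averaged_energy(0), averaged_energy(1)
ratio = math.exp(-(f0 - f1) / RT)  # relative probability of x = 0 versus x = 1
```

In this toy model the two averaged energies are nearly identical (close to RT/2 for both wells), so the averaged energy cannot distinguish the two coarse-grained states, whereas F is higher for the stiff well and the resulting probability ratio is close to 0.1, the ratio of the two configurational integrals.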

3.5.2 Statistical Potentials (Boltzmann Principle)

Knowledge-based potentials, also known as statistical potentials, are commonly used to predict protein structures as well as to simulate protein-folding pathways. The basic purpose of this approach (Tanaka and Scheraga 1976; Miyazawa and Jernigan 1985; Sippl 1990a,b, 1993; Covell 1992; Casari and Sippl 1992; Maiorov and Crippen 1992) is to construct an effective energy function [the prototype of which is given by Eq. (3.2)] based on the distributions of inter-residue distances, virtual-bond lengths, bond angles, dihedral angles, geometric parameters characteristic of short-sequence fragments, etc., derived from structures deposited in the Protein Data Bank (PDB) (Berman et al. 2000). The basic equation used in deriving statistical potentials is given by Eq. (3.3)

$$W(x; c; s) = -RT \ln \frac{N^{\mathrm{obs}}(x; c; s)}{N^{\mathrm{ref}}(x; c; s)} \tag{3.3}$$

where W(x; c; s) is the estimated potential of mean force of a fragment with geometry expressed by the vector x, composition (the kinds of residues involved) expressed by the vector c, and sequence and/or secondary-structure context expressed by the vector s, R is the universal gas constant, T is the absolute temperature, $N^{\mathrm{obs}}(x; c; s)$ is the number of counts of fragments of a given composition and sequence context and geometry close to x observed in the database, and $N^{\mathrm{ref}}(x; c; s)$ is the reference number of counts (in the absence of any interactions except those imposed by chain-connectivity and excluded-volume constraints). In simple residue–residue contact potentials (Tanaka and Scheraga 1976; Miyazawa and Jernigan 1985; Covell 1992; Maiorov and Crippen 1992; Rooman et al. 1992; Zhou and Zhou 2004), c consists of the kinds of the first and the second residues involved, x is 1 if the distance between the two selected atoms (Cα, Cβ, mass centers) of the residues is less than $r_{\mathrm{cut}}$ (usually equal to 7 Å) and 0 otherwise, and the sequence context is ignored. In more refined pair potentials, residues are split into more centers of interaction, and different potentials are developed for the local and non-local interactions or the order of the two residues in the chain; this implies sequence context (Sippl 1990a,b, 1993; Koliński and Skolnick 1992; Koliński et al. 1993; Godzik et al. 1993). It is clear from Eq. (3.3) that the statistical potentials depend on the database from which the $N^{\mathrm{obs}}(x; c; s)$ values were derived. For example, it was shown that statistical potentials derived from the structures of all α-helical proteins are significantly different from those obtained from all-β proteins (Furuichi and Koehl 1998); a similar situation is observed when single-chain and multichain proteins are used for development of potentials (Moont et al. 1999; Lu et al. 2003). 
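
A minimal Python sketch of Eq. (3.3) for a residue–residue contact potential is given below. The contact counts and composition are invented numbers, and the reference state is a deliberately simple random-mixing assumption (reference counts proportional to the product of residue fractions); it is not the reference state of Miyazawa and Jernigan or of any other published potential.

```python
import math

RT = 0.596  # kcal/mol, roughly 300 K

def contact_potential(n_obs, composition):
    """W(a, b) = -RT ln(N_obs / N_ref) with a random-mixing reference state (assumed)."""
    total_contacts = sum(n_obs.values())
    total_residues = sum(composition.values())
    potential = {}
    for (a, b), observed in n_obs.items():
        fa = composition[a] / total_residues
        fb = composition[b] / total_residues
        multiplicity = 1 if a == b else 2  # an unlike pair can occur as a-b or b-a
        reference = total_contacts * multiplicity * fa * fb
        potential[(a, b)] = -RT * math.log(observed / reference)
    return potential

# Hypothetical counts for a two-letter alphabet
n_obs = {("LEU", "LEU"): 40, ("LEU", "LYS"): 40, ("LYS", "LYS"): 20}
composition = {"LEU": 50, "LYS": 50}
w = contact_potential(n_obs, composition)
```

With these invented counts the Leu-Leu contact is over-represented relative to random mixing and receives a negative (attractive) energy, while the other two pairs are under-represented and come out repulsive. Pairs with zero observed counts would need separate treatment (for example, pseudo-counts), which this sketch does not attempt.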
Following the work of Miyazawa and Jernigan (1985), non-homologous protein structures with different types of folds are selected for a database. The complete statistical energy function is a sum of terms determined using Eq. (3.3). It should be noted that each of the terms can be identified with the potential of mean force of a protein fragment, and each of them is evaluated independently. Consequently, Eq. (3.3) does not provide a rigorous connection to the PMF of the polypeptide chain under consideration [defined by Eq. (3.2)], because this PMF is not a sum of fragment PMFs, each determined in the context of the whole structures of database proteins. Moreover, there is a possibility that the same contributions will be included in different terms. The statistical potentials have been discussed in detail in the review by Shen and Sali (2006). They can be classified according to the characteristics a, b, c, and d: (a) Protein structure representation: The representations differ by the number, location (e.g., on Cα atoms, Cβ atoms, centers of masses of selected fragments, etc.), and types of sites (point masses, rods, or rigid bodies), and choice of the representation of the conformational space (continuous, lattice, decoys, and fragments).

(b) Interaction scheme: A number of knowledge-based force fields include only residue–residue interactions that depend only on the distance (Miyazawa and Jernigan 1985; Sippl 1990a,b, 1993; Casari and Sippl 1992; Skolnick et al. 1997a; Samudrala and Moult 1998; Rojnuckarin and Subramaniam 1999; Zhang et al. 2004; Chen and Shakhnovich 2005) or on distance and orientation (Buchete et al. 2003, 2004; Miyazawa and Jernigan 2005). Some of these simple potentials are enhanced by including a solvation term dependent on solvent-accessible area (Sippl 1993; Melo and Feytmans 1998). Potentials based on residue–residue interactions are good only for threading and perform poorly in ab initio folding. Potentials containing local-only interactions in the form of virtual-bond dihedral angle terms (Rooman et al. 1991, 1992; Zhou and Zhou 2004; Betancourt 2008) or those containing only hydrogen-bonding interactions (Kortemme et al. 2003) are also known. However, the majority of working potentials contain side-chain–side-chain, hydrogen-bonding, local, and multibody interactions (Koliński and Skolnick 1992, 1994a,b; Koliński et al. 1993; Sun 1993; Koliński 2004). (c) Representation of interactions: The interactions can be represented either in a tabular form (i.e., a PMF value for each different geometry; for contact potentials this is just a single number) (Sippl 1990a,b, 1993; Casari and Sippl 1992; Koliński and Skolnick 1992, 1994a,b; Koliński et al. 1993; Koliński 2004) or by functional forms which are determined by fitting to the PMF values (Obatake and Crippen 1981; Crippen and Viswanadhan 1984, 1987; Sun 1993; Liwo et al. 1997a,b; Buchete et al. 2003, 2004). (d) Reference state definition: Generally the reference state is defined as one in which no specific interactions occur within the fragment under consideration, and the interactions result only from chain-connectivity and excluded-volume constraints. 
However, the level of sophistication of choosing the reference state varies significantly among the statistical potentials (Sippl 1990a,b, 1993; Casari and Sippl 1992; Jernigan and Bahar 1996; Moult 1997; Skolnick et al. 1997; Samudrala and Moult 1998; Lu and Skolnick 2001; Buchete et al. 2004; Zhou et al. 2006). As an example of the derivation of statistical potentials, the normalized distribution function $N^{\mathrm{obs}}(r)$, the reference distribution function $N^{\mathrm{ref}}(r)$ (r being the distance between the side-chain centers), and their ratio (the correlation function) for the Leu-Leu pair determined from the PDB by Liwo et al. (1997a) are shown in Fig. 3.2a, while the respective PMF together with the fit to a Lennard–Jones (6–12) functional form is shown in Fig. 3.2b. The statistical potentials were the subject of extensive criticism, especially related to the fact that they are derived from equilibrium structures and their relationship to potentials of mean force is not clear (Thomas and Dill 1996; Ben Naim 1997). The correctness of reference state definition and the quality of statistical data derived from the databases were also challenged (Thomas and Dill 1996; Ben Naim 1997; Tobi et al. 2000; Summa et al. 2005). Implementers and developers of statistical potentials recognized most of the criticisms related to their methodology and

Fig. 3.2 Upper graph: sample pair-distribution and pair-correlation functions for the Leu-Leu pair averaged over consecutive 0.5-Å shells used to determine the statistical side-chain–side-chain interaction potentials for this residue pair by Liwo et al. (1997a). (a) Radial normalized pair-correlation function [$N^{\mathrm{obs}}/N^{\mathrm{ref}}$ in Eq. (3.3)]; (b) the reference normalized number of pair counts [$N^{\mathrm{ref}}$ in Eq. (3.3)]; (c) the total normalized number of pair counts [$N^{\mathrm{obs}}$ in Eq. (3.3)] determined from the PDB. All curves were normalized to the maximum value of 1.0. Lower graph: the potential of mean force computed from the correlation function [W of Eq. (3.3); dashed line] and the Lennard–Jones fit (solid line). It should be noted that the Lennard–Jones fit does not reproduce the desolvation maximum. The upper graph has been reproduced with permission from figure 3 of Liwo et al. (1997a) and the lower graph was constructed based on the data from Liwo et al. (1997a)

raised further important points. One of the most important extensions of the statistical potentials was the observation that pairwise energies used there could be insufficient to describe internal protein energy correctly and, therefore, introduction of multibody expansion in the potential became necessary (Koli´nski and Skolnick 1992, 1994a,b; Koli´nski et al. 1993; Godzik et al. 1993; Vendruscolo and Domany 1998; Vendruscolo et al. 1999; Chapter 6). On the other hand, this extension led to greater problems with the quality of the statistical data derived from the PDB (Vendruscolo et al. 1999). Despite the conceptual and technical problems, statistical-based potentials enjoyed a significant degree of success in many applications related to the biophysics of proteins. The primary application of the statistical potentials was prediction of the three-dimensional structure of proteins from amino acid sequences (Koli´nski and Skolnick 1992, 1994a,b; Koli´nski et al. 1993; Panchenko et al. 2000; Skolnick et al. 2000; Tobi and Elber 2000; Tobi et al. 2000; Vendruscolo et al. 2000, Koli´nski 2004), protein thermodynamical stability (Gilis and Rooman 1996, 1997), and the structure and stability of protein–protein and protein–ligand complexes (Gohlke and Klebe 2001). However, the statistical potentials based on the Boltzmann principle [Eq. (3.3)] are currently no longer developed or used as extensively as 5–7 years ago (Lazaridis and Karplus 2000; Meller and Elber 2002; Russ and Ranganathan 2002; Buchete et al. 2003, 2004). The reason for this is that current protein structure prediction is based mostly on the template recognition methodology, rather than on a statistical energy function (Kryshtafovych and Fidelis 2009).

3.5.3 Factor Expansion of the PMF The exact RFE of Eq. (3.2) can be evaluated only numerically, and this task requires, at best, as much effort as all-atom simulations with explicit solvent. Therefore, a way around, to which many investigators resort, is to compose the total PMF [Eq. (3.3)] from contributions corresponding to fragments of a system obtained by all-atom simulations of, e.g., models of pairs of interacting side chains in water or as statistical potentials. However, the connection of such composite potentials to the parent PMF defined by Eq. (3.2) is unclear. Liwo et al. (1998, 2001) developed a formal expansion of the PMF, which is based on Kubo’s (1962) cluster-cumulant approach. First, the total all-atom energy of a system composed of n coarse-grained sites is partitioned into the component energies, ε1 , ε2 , . . . , εN , N = n(n + 1)/2, as given by Eq. (3.4). E(X; Y) =

n k=1 i∈Ik

Eik (X; yk ) +

n k−1 k=1 l=1 i∈Ik j∈Il

Eik;jl (X; yk ; yl ) =

N

εi (X; zi )

i=1

(3.4) where the sets {I1 , I2 , . . . , In } contain the indices of all atoms assigned to interaction site 1, 2,. . ., n, respectively, and zi = yi or zi = (yk , yl ) depending on whether

52

C. Czaplewski et al.

εi is an intrasite or intersite energy. Each component energy is either a sum of all interatomic interactions within a given extended site or between two extended sites. By inserting Eq. (3.2) and splitting the RFE into cluster-cumulant

Eq. (3.4) into functions, εi1 , εi2 , ..., εik f , containing increasing numbers of component energies, Eq. (3.2) becomes Eq. (3.5). F(X) =

εi f +

εi εj

i

i<j

f

εi εj εk

+

f

+... + ε1 ε2 ...εN f

(3.5)

i<j
where the cluster-cumulant functions are expressed by Eq. (3.6).

εi1 , εi2 , ..., εik

f

=

k

(−1)k−l εim1 , εim2 , ..., εiml

(3.6)

l=1 im1
and ⎧ ⎫ ⎪ ⎪ k ⎨ ⎬

1 1 exp −β εil (X; zil ) dVyl εi1 , εi2 , ...εik = − ln ⎪ β ⎪ ⎩ VyI ⎭ l=1

(3.7)

I

where $\beta = 1/RT$, and $V_{y_I}$ is the volume of the subspace spanned by the variables $y_{i_1}, y_{i_2}, \ldots, y_{i_k}$. The quantity $\langle \varepsilon_{i_1}, \varepsilon_{i_2}, \ldots, \varepsilon_{i_k} \rangle$ is the RFE containing only a subset of component interactions. Each of the first-order factors $\langle \varepsilon_i \rangle_f$ is the RFE corresponding to component interaction i. These factors can be identified with the potentials of mean force of the interactions between isolated sites (e.g., interactions between the side chains or “hydrogen-bonding” interactions between the peptide groups) and with the potentials of mean force of the local interactions within isolated sites. The PMF-based approaches that construct the effective energy function of coarse-grained systems by analogy to all-atom force fields include mostly the first-order factors, which turned out to be an insufficient approximation. Nevertheless, the expansion has to be truncated at some point for tractability and transferability of the resulting force field. The fourth-order expansion appears to be sufficient (Ołdziej et al. 2004). Approximate analytical expressions for the factors can be derived by using Kubo’s (1962) generalized cumulant expansion (Liwo et al. 2001). With this expansion, temperature dependence of the respective terms of the force field can be introduced in a straightforward manner (Liwo et al. 2007). The second-order factors $\langle \varepsilon_i \varepsilon_j \rangle_f$ contain RFEs of pairs of component interactions minus the sums of the RFEs of the single-component interactions. They can be regarded as correlation terms pertaining to component interactions i and j, reflecting the coupling between the secondary degrees of freedom pertaining to these interactions. Examples of second-order and third-order correlation terms are virtual-bond torsional and double-torsional potentials, respectively [these terms correspond to the coupling between the local interactions of two or three consecutive amino acid


residues, respectively (Liwo et al. 2001; Ołdziej et al. 2003)] and the terms pertaining to the coupling between local and backbone-electrostatic interactions. The third-order factors corresponding to the coupling of the long-range electrostatic (hydrogen-bonding) interactions of a pair of peptide groups and the local interactions of residues adjacent to them, and the fourth-order factors accounting for the coupling of long-range electrostatic interactions of two pairs of consecutive peptide groups and the local interactions of the residues linking these groups, are indispensable for stabilization of regular secondary structures (α-helices and β-sheets), as was also observed earlier by Koliński, Skolnick, and coworkers (Koliński and Skolnick 1992, 1994a; Koliński et al. 1993) by experimenting with interaction schemes in the statistical potentials. The third-order factors mentioned above have a simple physical interpretation as the interaction of two fictitious dipoles located on the interacting peptide groups, with the magnitude and orientation of their dipole moments dependent on the orientation of the peptide-group axes and virtual-bond dihedral angles. An illustration is presented in Fig. 3.3. The correlation terms are also apparent in

Fig. 3.3 Illustration of the physical meaning of the $U^{(3)}_{\text{loc-el}}$ correlation terms. The arrows represent the vector components of the dipole moments from the cumulant-based expression [Eq. (11) in Liwo et al. (2004b)] for the third-order correlation contribution to the RFE of two interacting peptide groups $p_i$ and $p_j$ considered together with the local interactions within the adjacent amino acid residues. The dipole moments represent the peptide groups between the $C^\alpha_i$ and $C^\alpha_{i+1}$ ($C^\alpha_j$ and $C^\alpha_{j+1}$) atoms; these peptide groups are drawn with light-gray letters and lines. The positions of atoms $C^\alpha_i$, $C^\alpha_{i+1}$, and $C^\alpha_{i+2}$ ($C^\alpha_j$, $C^\alpha_{j+1}$, and $C^\alpha_{j+2}$) are fixed, and so are the vector components μ1i and μ1j of the dipole moments, whereas the virtual bonds $C^\alpha_{i-1} \ldots C^\alpha_i$ ($C^\alpha_{j-1} \ldots C^\alpha_j$) and the components μ2i (μ2j) can rotate about the $C^\alpha_i \ldots C^\alpha_{i+1}$ ($C^\alpha_j \ldots C^\alpha_{j+1}$) axis by the angle λi (λj). (a) Parallel orientation of the planes defined by the $C^\alpha_i$, $C^\alpha_{i+1}$, $C^\alpha_{i+2}$ and the $C^\alpha_j$, $C^\alpha_{j+1}$, $C^\alpha_{j+2}$ atoms, as found in parallel β-sheets. (b) Opposite orientation of these planes, which is not found in protein structures. The dotted rectangles mark the planes containing the $C^\alpha_i \ldots C^\alpha_{i+1}$ ($C^\alpha_j \ldots C^\alpha_{j+1}$) virtual bonds and the vectors μi (μj). Reproduced with permission from figure 12 of Liwo et al. (2004b)


effective interactions between side chains in water, as shown by many simulation studies (Czaplewski et al. 2000, 2003; Shimizu and Chan 2001), although the inclusion of such correlation terms does not seem necessary for the coarse-grained force fields to work. The factors can be determined by direct Boltzmann summation over the energy surfaces of the respective systems (Levitt and Warshel 1975; Levitt 1976; Pincus and Scheraga 1977; Liwo et al. 1993b, 2001, 2004b; Ołdziej et al. 2003; Kozłowska et al. 2007, 2010a,b) or by simulations (Makowski et al. 2007b,c, 2008; Monticelli et al. 2008). Subsequently, approximate analytical expressions can be fitted to the factor surfaces defined by Eqs. (3.6) and (3.7). As an example, in Fig. 3.4, we present the PMFs of a pair of valine side chains from Makowski et al. (2007c), which are modeled by ellipsoids of revolution, with an interaction potential approximated by a sum of the Gay–Berne (1981) potential of non-bonded interactions and a Gaussian-overlap model of solvation (Makowski et al. 2007a). The physics-based potentials of side-chain interactions (Makowski et al. 2007a,b,c, 2008) will replace the respective knowledge-based potentials (Liwo et al. 1997a) used in the present UNRES force field (Liwo et al. 2008a). It should be noted that Eq. (3.5) is not used as is but, instead, the individual factors are implemented as energy terms in the final force field, and each term is

[Figure 3.4: plot of PMF [kcal/mol] (−1.5 to 1.5) versus distance [Å] (4 to 12)]

Fig. 3.4 The PMF curve for a pair of isobutane molecules (to model a pair of valine side chains). The dashed, dotted, and dot-dashed lines correspond to PMFs determined for the side-to-side, edge-to-edge, and side-to-edge orientation, respectively, obtained by MD simulations. The solid lines correspond to the analytical approximation to the PMFs, with coefficients determined by least-squares fitting (Makowski et al. 2007c) of the analytical expression to the PMF determined by MD simulations. Reproduced with permission from figure 4c of Makowski et al. (2007c)


assigned a weight which is determined by force-field optimization (Liwo et al. 2002, 2007, 2008a,b; Ołdziej et al. 2004). Such a procedure enables us to compensate for the errors inherent in the truncation of the factor expansion, approximations to factors, and the accuracy of the method implemented to compute the free-energy surfaces corresponding to factors.
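The factor expansion of Eqs. (3.5)–(3.7) can be illustrated on a toy system. The sketch below is only an illustration under stated assumptions: the three component energies, the grid of secondary variables, and the value of β are hypothetical and are not taken from UNRES or any other force field. It evaluates the RFE of each subset of component energies by direct Boltzmann summation [Eq. (3.7)], builds the factors by the inclusion–exclusion sum of Eq. (3.6), and lets one check that the factors add up to the full RFE as in Eq. (3.5).

```python
import functools
import itertools
import math

BETA = 1.0    # 1/RT in arbitrary energy units (assumed)
N_GRID = 24   # grid points per secondary variable (assumed)
GRID = [2.0 * math.pi * n / N_GRID for n in range(N_GRID)]

# Hypothetical component energies of three secondary variables y = (y0, y1, y2).
COMPONENTS = {
    1: lambda y: math.cos(y[0]),
    2: lambda y: math.cos(y[0]) * math.cos(y[1]),  # shares y0 with component 1
    3: lambda y: 0.8 * math.sin(y[2]),             # independent of components 1 and 2
}


@functools.lru_cache(maxsize=None)
def subset_rfe(subset):
    """Eq. (3.7): RFE of a subset of component energies by Boltzmann summation.

    The average runs over the full grid; variables on which the subset does
    not depend drop out of the normalized average.
    """
    boltzmann_sum = 0.0
    for y in itertools.product(GRID, repeat=3):
        energy = sum(COMPONENTS[i](y) for i in subset)
        boltzmann_sum += math.exp(-BETA * energy)
    return -math.log(boltzmann_sum / N_GRID ** 3) / BETA


def factor(subset):
    """Eq. (3.6): inclusion-exclusion sum over all non-empty sub-subsets."""
    k = len(subset)
    return sum(
        (-1) ** (k - l) * subset_rfe(sub)
        for l in range(1, k + 1)
        for sub in itertools.combinations(subset, l)
    )


def all_factors():
    """Factors of every order; by Eq. (3.5) they sum to the full RFE."""
    labels = sorted(COMPONENTS)
    return {
        sub: factor(sub)
        for k in range(1, len(labels) + 1)
        for sub in itertools.combinations(labels, k)
    }
```

In this toy system the second-order factor of components 1 and 2 is non-zero because both depend on y0, whereas the factor of components 1 and 3 vanishes because they share no secondary variable; this is the sense in which higher-order factors act as correlation terms.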

3.5.4 Force-Matching Method

The multiscale-coarse-graining (MS-CG) method of Voth and coworkers (Izvekov and Voth 2005a,b; Ayton et al. 2007a,b; Noid et al. 2007, 2008; Zhou et al. 2007b; Wang et al. 2009) is also based on the RFE function (or many-body PMF) of a given system [Eq. (3.8)]. The CG sites are defined as the centers of mass of groups of atoms, $M_{R_I}(r)$, I = 1, 2, ..., N, where r denotes the coordinates of atoms within the coarse-grained group I; this quantity is defined by Eq. (3.8).

$$M_{R_I}(r) = \frac{\sum_{i \in \{I\}} m_i r_i}{\sum_{i \in \{I\}} m_i} \qquad (3.8)$$

where {I} is the set of the indices of the atoms contained within site I. The RFE function of a given system (e.g., a polypeptide molecule and the surrounding solvent) is determined from all-atom MD simulations of that system; subsequently, the mean forces acting on the CG sites are obtained by solving the minimization problem given by Eq. (3.9) (Izvekov and Voth 2005a,b; Ayton et al. 2007b; Zhou et al. 2007b).

$$\chi^2 = \frac{1}{3 N_f N_{CG}} \sum_{k=1}^{N_f} \sum_{I=1}^{N_{CG}} \left\| F^{AA}_{kI} - F^{CG}_{kI} \right\|^2 \qquad (3.9)$$

where $F^{AA}_{kI}$ is the average force acting on coarse-grained site I at configuration k computed from all-atom MD simulations, $F^{CG}_{kI}$ is the force computed in the CG approximation, $N_f$ is the number of coarse-grained configurations obtained in the MD simulation, $N_{CG}$ is the number of CG sites, and $\| \cdot \|$ denotes the Euclidean norm of a vector. It should be noted that the definition of coarse-grained degrees of freedom as the Cartesian coordinates of the centers of mass of the extended sites [Eq. (3.8)] is less general than separating the variable space into the coarse-grained degrees of freedom X and the secondary degrees of freedom Y, as defined in Section 3.5.1. The only restriction imposed on Y is that they must belong to a space orthogonal to that spanned by X; therefore, the coordinates of X might contain not only the Cartesian coordinates of the extended sites but also those of their orientation. Conversely,


Eq. (3.8) restricts the treatment to radial-only effective potentials of interaction, although extension to non-radial potentials is possible. Because of the large number of degrees of freedom, it is not feasible to determine $F^{CG}_{kI}$ without additional assumptions, and long-range forces are therefore expressed in terms of pair contributions. In the application to peptide calculations (Zhou et al. 2007b; Thorpe et al. 2008), the pairwise contributions were defined as cubic splines in site–site distances. The spline coefficients were determined by minimization of χ² of Eq. (3.9), which is a linear least-squares problem. Harmonic expressions are used for bonded and 1,3-non-bonded interactions and Fourier expressions for torsional interactions (Zhou et al. 2007b); the respective curves are determined as PMFs from MD simulations and the coefficients of the analytical expressions are determined by least-squares fitting. Because of the pairwise decomposition of the forces, the solvent molecules are treated explicitly. The water molecule is a single spherical site. As pointed out by Noid et al. (2007), the forces determined by minimizing χ² of Eq. (3.9), although pairwise, are not equivalent to pairwise mean forces determined by averaging the forces acting within each pair, which do not minimize χ² of Eq. (3.9). The MS-CG forces incorporate many-body terms implicitly, similar to the Yvon–Born–Green equations in liquid-state theory (Noid et al. 2007). However, explicit multibody terms are not present in the force fields obtained with the present force-matching procedure (Noid et al. 2007). The MS-CG methodology has been applied (Zhou et al. 2007b) to pentadecaalanine and to V5PGV5, which fold into an α-helix and a β-hairpin, respectively, with the Chemistry at HARvard Molecular Mechanics (CHARMM) force field. Two-bead and four-bead models were tried.
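The two ingredients of the force-matching setup, the center-of-mass mapping of Eq. (3.8) and the residual χ² of Eq. (3.9), can be written down directly. The following is a minimal sketch; the function names and the nested-list data layout are assumptions made for illustration and do not reproduce the MS-CG implementation.

```python
def center_of_mass(masses, coords):
    """Eq. (3.8): position of a CG site from the atoms assigned to it.

    masses: list of atomic masses; coords: list of (x, y, z) tuples.
    """
    total_mass = sum(masses)
    return tuple(
        sum(m * r[axis] for m, r in zip(masses, coords)) / total_mass
        for axis in range(3)
    )


def chi_squared(forces_aa, forces_cg):
    """Eq. (3.9): mean squared deviation between all-atom and CG forces.

    forces_xx[k][I] is the 3-component force on CG site I in configuration k.
    """
    n_f = len(forces_aa)        # number of configurations
    n_cg = len(forces_aa[0])    # number of CG sites
    total = 0.0
    for config_aa, config_cg in zip(forces_aa, forces_cg):
        for f_aa, f_cg in zip(config_aa, config_cg):
            total += sum((a - c) ** 2 for a, c in zip(f_aa, f_cg))
    return total / (3 * n_f * n_cg)
```

For example, two atoms of masses 1 and 3 placed at x = 0 and x = 4 map onto a site at x = 3, and χ² vanishes only when the CG forces reproduce the all-atom forces on every site in every configuration.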
The two-bead model consisted of one bead per side chain and one per backbone unit (−NH−CH−CO−), placed at the side-chain center of mass and on the Cα atom, respectively; in the four-bead model, three backbone-atom centers represent the CO, NH, and CH groups, respectively, and one bead represents the side chain. The force field for each peptide and each coarse-grained representation was derived from all-atom MD simulations with the CHARMM force field and explicit water starting from the folded structure; the corresponding peptide never left the folded basin during the simulation. For pentadecaalanine, both the two-bead and the four-bead model were able to keep the folded (α-helical) structure stable, while only the four-bead model was able to maintain the folded structure of V5PGV5. The authors concluded that the failure of the two-bead model for V5PGV5 was caused by asymmetry of interactions, as opposed to pentadecaalanine, and that introduction of the dependence of the forces not only on bead type but also on their positions in the sequence could improve the two-bead model. The four-bead models of both peptides were also used to simulate the folding of these peptides starting from extended structures (Thorpe et al. 2008); both peptides folded to the native structures. This finding is remarkable in view of the fact that only the configurations from the neighborhood of the folded structures were used to derive the force field. The MS-CG approach, though fully general, lacks transferability; a force field derived in this way is, by definition, only good for coarse-grained representation of a


given system. The transferability has recently been addressed (Wang et al. 2009) by developing the effective force-coarse-graining (EF-CG) method, in which pairwise contributions to forces are parameterized explicitly, which reduces the accuracy of the representation but makes the force field transferable. However, both the MS-CG and EF-CG approaches produce force fields with accuracy no better than that of the parent all-atom force fields which, at least in their present form, are far from having the capacity of ab initio protein folding [this feature arises from the fact that the forces implemented in determining the potentials from Eq. (3.9) are computed from all-atom potentials, and the procedure is aimed at reproducing these forces as well as possible]. Force-field optimization for foldability (Section 3.5.5), which can compensate for the errors inherent in force-matching methods (arising both from approximating the site–site forces with a sum of pairwise central forces and the inaccuracy of the parent all-atom potentials used to parameterize them) by including structural and thermodynamic data of selected training proteins, will probably become necessary for application of the MS-CG and EF-CG methodologies in protein-folding simulations. The failure of the two-bead MS-CG model to keep the stable folded structure of V5PGV5 indicates two other weaknesses of the MS-CG methodology in its present formulation, namely, the representation of forces as combinations of terms dependent on site–site distance, but not on site orientation, and the absence of explicit multibody terms. Orientation dependence was shown by Buchete and coworkers to be important in side-chain interaction terms (Buchete et al. 2003, 2004) and by Liwo et al. (1993b) in peptide-group interaction terms, while explicit multibody terms are crucial to stabilize regular secondary structure in coarse-grained force fields (Koliński and Skolnick 1992; Koliński et al. 1993; Liwo et al. 1998, 2001).
The absence of these features does not invalidate the MS-CG or the EF-CG approach; however, it precludes extensive coarse graining (e.g., it has forced the introduction of three sites per backbone unit instead of one to obtain good results for the V5PGV5 peptide) and also forces the introduction of explicit solvent; moreover, the coarse-grained sites must be as spherical as possible (Wang et al. 2009). These deficiencies can probably be eliminated. Spherical harmonic expansion as applied by Buchete et al. (2003, 2004) can be used to introduce orientational dependence. Multibody terms could be introduced by factorization of the mean forces into two-body and many-body terms and use of cumulant-based expansion to derive approximate analytical formulas, as in the derivation of the UNRES force field.
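Because the pair force is linear in the coefficients of its basis functions, minimizing χ² of Eq. (3.9) reduces to solving normal equations. The sketch below illustrates this for a single pair of sites whose scalar force along the inter-site axis depends only on the distance. It is a simplified stand-in, not the published procedure: piecewise-linear "hat" functions replace the cubic splines used by Zhou et al. (2007b), and the function names, knots, and data layout are hypothetical.

```python
def hat_basis(r, knots):
    """Values at distance r of piecewise-linear hat functions centred on the knots."""
    values = [0.0] * len(knots)
    for d in range(len(knots) - 1):
        left, right = knots[d], knots[d + 1]
        if left <= r <= right:
            t = (r - left) / (right - left)
            values[d] = 1.0 - t
            values[d + 1] = t
            break
    return values


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            ratio = a[row][col] / a[col][col]
            for c in range(col, n + 1):
                a[row][c] -= ratio * a[col][c]
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(a[row][c] * solution[c] for c in range(row + 1, n))
        solution[row] = (a[row][n] - tail) / a[row][row]
    return solution


def fit_pair_force(distances, reference_forces, knots):
    """Least-squares coefficients c_d of f(r) = sum_d c_d * hat_d(r).

    Builds and solves the normal equations (A^T A) c = A^T f, where row k of A
    holds the basis values at the k-th sampled distance.
    """
    n = len(knots)
    normal = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for r, f_ref in zip(distances, reference_forces):
        phi = hat_basis(r, knots)
        for p in range(n):
            rhs[p] += phi[p] * f_ref
            for q in range(n):
                normal[p][q] += phi[p] * phi[q]
    return solve_linear(normal, rhs)
```

Given reference forces sampled at a set of site–site distances, `fit_pair_force` returns one coefficient per knot. The fit is well posed only if every knot has sampled distances within its support, which mirrors the need for adequate sampling in the all-atom trajectory.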

3.5.5 Optimization of an Effective Energy Function

Apart from the force-matching method discussed in Section 3.5.4 (which, however, does not lead to transferable potentials), the different terms in an effective coarse-grained energy function have different origins and accuracy (never greater than the accuracy of the parent all-atom energy calculations) and, consequently, are weakly related to each other. Moreover, the factor expansion given by Eqs. (3.5) and (3.7)


has to be truncated to render the force field usable. Consequently, an effective energy function obtained by direct summation of the different terms might not be good for folding simulations. As mentioned in Section 3.5.2, force-field optimization, originated by Crippen and coworkers (Crippen and Viswandhan 1987; Crippen and Snow 1990; Seetharamulu and Crippen 1991), is used to obtain foldable force fields. Wolynes and coworkers (Bryngelson and Wolynes 1987; Hardin et al. 2002; Goldstein et al. 1992a) formulated the energy-landscape theory, according to which the energy landscape of a foldable protein resembles a funnel with the bottom in the native state. This implied that the glass-transition temperature be significantly lower than the folding-transition temperature and, in turn, that the Z-score value, Z, which is defined by Eq. (3.10), be maximized.

$$Z = \frac{\langle E_{\text{native}} \rangle - \langle E_{\text{non-native}} \rangle}{\sqrt{\langle E^2_{\text{non-native}} \rangle - \langle E_{\text{non-native}} \rangle^2}} \qquad (3.10)$$

where $\langle E_{\text{native}} \rangle$, $\langle E_{\text{non-native}} \rangle$, and $\langle E^2_{\text{non-native}} \rangle$ denote the canonical averages of the energy of native-like and non-native conformations and the canonical average of the square of the energy of non-native conformations, respectively. By simulation studies of model heteropolymer sequences discretized to a cubic lattice, Sali et al. (1994a,b) found that the native state is located as low in energy as possible relative to any alternative state. Although further work (Camacho and Thirumalai 1996; Klimov and Thirumalai 1996a,b, 1998; Tiana and Broglia 2001; Liwo et al. 2004a) has shown that the energy-gap criterion is not a sufficient one, and the structure of the energy spectrum of the non-native states is important, the energy-gap criterion remains one of the bases of potential-function optimization. The Z-score and energy-gap optimization approaches were implemented by many investigators to optimize potential energy functions (Hao and Scheraga 1996a,b; Liwo et al. 1997b; Lee et al. 2000, 2002; Pillardy et al. 2001b; Eastwood et al. 2003; Fujitsuka et al. 2004). Scheraga, Liwo, and coworkers (Liwo et al. 2002, 2004a; Ołdziej et al. 2004; Liwo et al. 2007, 2008b; He et al. 2009) extended the Z-score or energy-gap optimization concept further by developing a hierarchical optimization method. In this approach the conformational space of a protein is divided into levels. The lowest level (level 0) contains unfolded structures, the highest level contains folded structures, and the intermediate levels contain partially folded structures. Optimization is aimed at obtaining such ordering of the levels that the relative free energy of the folded level is the lowest below the folding temperature and the highest above the folding temperature, while the relative free energy of the lowest (non-native) level is the highest below the folding temperature and the lowest above the folding temperature, as illustrated in Fig. 3.5.
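Returning to Eq. (3.10), the Z-score can be evaluated from two sets of sampled energies. The sketch below assumes that both sets were drawn from a canonical ensemble, so that plain arithmetic means approximate the canonical averages; the function name and the sample values in the usage note are illustrative only.

```python
import math


def z_score(native_energies, nonnative_energies):
    """Eq. (3.10): energy gap in units of the spread of non-native energies.

    Arithmetic means stand in for canonical averages (an assumption that holds
    only if the samples come from a canonical ensemble).
    """
    mean_native = sum(native_energies) / len(native_energies)
    mean_non = sum(nonnative_energies) / len(nonnative_energies)
    mean_sq_non = sum(e * e for e in nonnative_energies) / len(nonnative_energies)
    spread = math.sqrt(mean_sq_non - mean_non ** 2)
    return (mean_native - mean_non) / spread
```

With native-like energies of −10 and −12 kcal/mol and non-native energies of −2, −4, and −6 kcal/mol, the gap is −7 kcal/mol and the spread is about 1.63 kcal/mol. With the sign convention of Eq. (3.10) as written, a native state lying well below the non-native ensemble yields a large negative Z, so it is the magnitude of Z that grows as the gap widens.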
The relative free energies of the levels and their composition are obtained from thermodynamic and structural data pertaining to the folding of complete proteins or their fragments. The ordering of the free energies as illustrated in Fig. 3.5 leads to the appearance of a heat-capacity peak at the folding-transition temperature (Liwo et al. 2007, 2008b); an example is shown in Fig. 3.6.
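The link between the ordering of level free energies and the heat-capacity peak can be seen in a two-level toy model. The sketch below is an illustration under stated assumptions: each level is reduced to a single energy and a number of conformations, and the values in `LEVELS` are hypothetical rather than data for any protein.

```python
import math

R = 1.9872e-3  # gas constant in kcal/(mol K)

# Hypothetical levels: (energy in kcal/mol, natural log of the number of conformations)
LEVELS = {"unfolded": (0.0, 30.0), "folded": (-20.0, 0.0)}


def level_free_energy(name, temperature):
    """Free energy of a level with one energy and degeneracy g: F = E - RT ln g."""
    energy, log_g = LEVELS[name]
    return energy - R * temperature * log_g


def folding_temperature():
    """Temperature at which the two level free energies are equal."""
    (e_u, s_u), (e_f, s_f) = LEVELS["unfolded"], LEVELS["folded"]
    return (e_u - e_f) / (R * (s_u - s_f))


def heat_capacity(temperature):
    """Cv = (<E^2> - <E>^2) / (R T^2) from the canonical level populations."""
    rt = R * temperature
    log_w = {name: log_g - e / rt for name, (e, log_g) in LEVELS.items()}
    shift = max(log_w.values())  # subtract the maximum to avoid overflow
    weights = {name: math.exp(v - shift) for name, v in log_w.items()}
    z = sum(weights.values())
    mean_e = sum(weights[n] * LEVELS[n][0] for n in weights) / z
    mean_e2 = sum(weights[n] * LEVELS[n][0] ** 2 for n in weights) / z
    return (mean_e2 - mean_e ** 2) / (R * temperature ** 2)
```

In this model the folded level has the lower free energy below the crossing temperature (about 335 K for the assumed values) and the higher free energy above it, and `heat_capacity` is larger at the crossing than at 300 K or 360 K, which is the qualitative behavior described for the optimized force field.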


Fig. 3.5 Schematic illustration of ordering of the energy levels, which is the goal of the algorithm for optimizing the potential function, using the 1EM7 protein, the structure of which consists of two β-hairpins packed against a middle α-helix. Below the folding-transition temperature (Tf), the non-native level (level 0) has the highest free energy, the conformations with only the native C-terminal β-hairpin formed (level 1) have a lower free energy, next are the conformations in which the middle part of the N-terminal β-strand joins the β-hairpin and the middle α-helix starts to form and, finally, the native-like structures with all structural elements formed have the lowest free energy. Above the folding-transition temperature, the free-energy relations are reversed and, at the folding-transition temperature, the free energies should be approximately equal. Reproduced with permission from figure 3 of Liwo et al. (2008a)

[Figure 3.6: plot against temperature [K] (300 to 380) with axes Cv [kcal/(mol×K)] and RMSD [Å]; annotated Tf(exp) = 325 K and Tf(calc) = 330 K]

Fig. 3.6 (a) Experimental (filled circles) and calculated with the optimized force field (solid line) free-energy difference between the folded and unfolded states of the 1ENH protein. (b) Calculated (solid line) and experimental (dashed line) heat-capacity curve of 1ENH. Data from Liwo et al. (2008b); reproduced with permission from figures 6a and b of that reference


3.5.6 “Knowledge-Based” and “Physics-Based” Potentials

The coarse-grained potentials are sometimes collectively termed “knowledge-based” and as such counterposed to the “physics-based” all-atom potentials (Shell et al. 2009). However, the definition of coarse-grained potentials as potentials of mean force presented in Section 3.5.1 [Eq. (3.2)] is entirely physical and is strictly related to the thermodynamic probability of a coarse-grained structure to occur at a given temperature; as mentioned in Section 3.5.1, coarse-grained force fields should include dependence on temperature (Liwo et al. 2007). In Eq. (3.2), the secondary degrees of freedom are averaged out, just as, in the case of all-atom potentials, the electronic degrees of freedom are averaged out to leave only the positions of atomic nuclei. The origin of each averaging is different (the Boltzmann law and the adiabatic approximation, respectively) but the principle is the same. Consequently, it is possible to construct both all-atom and coarse-grained force fields based on solid physical principles. It is often argued that coarse-grained potentials are knowledge-based, because they contain extensive input from experimental data and, moreover, are usually optimized to locate the native structures of training proteins as the lowest-energy or the lowest-free-energy structures. However, unless ab initio molecular quantum mechanics is implemented in protein-folding simulations, no computational approach is free from extensive use of experimental information in parameterization. This is true even for semi-empirical methods of molecular quantum mechanics, which were parameterized based on experimental heats of formation, dipole moments, and ionization potentials of model compounds (Stewart 1990).
The all-atom force fields contain a heavy input from crystal data (equilibrium bond lengths and van der Waals radii), infrared (IR) spectroscopy (force constants), far IR, microwave, and NMR measurements (to refine torsional constants), dielectric measurements (to determine polarizabilities in order to parameterize the Lennard–Jones potential), and association enthalpy measurements (to adjust the charges). Ab initio quantum-mechanical calculations are used to derive charges and other parameters. This approach is identical in spirit to those described in Sections 3.5.3 and 3.5.4 and to some of the statistical approaches described in Section 3.5.1.2. In the factor-expansion approach, the parameters of the factors are derived based on quantum-mechanical calculations (Ołdziej et al. 2003; Liwo et al. 2004b) and on all-atom simulations (Makowski et al. 2007b,c, 2008; Monticelli et al. 2008); some are derived from structural and other databases (Liwo et al. 1997a), just as the parameters of all-atom force fields. In the force-matching approach, the whole force field for a given system is derived from all-atom MD simulations (Izvekov and Voth 2005a,b; Ayton et al. 2007a,b; Noid et al. 2007, 2008; Zhou et al. 2007b; Thorpe et al. 2008). On the other hand, fully knowledge-based all-atom potentials, in which the “raw” atom–atom PMFs determined by Boltzmann inversion of the correlation functions [Eq. (3.3)] are used directly without fitting to physics-based functional forms, have also been developed (Melo and Feytmans 1997). The force-field optimization described in Section 3.5.5 is based on energy-landscape theory, which also has solid physical foundations (Bryngelson and


Wolynes 1987; Hardin et al. 2002; Goldstein et al. 1992a; Sali et al. 1994a,b). Given the error inherent in individual energy terms, force-field optimization is a necessary step to take if a force field is meant to have any predictive power. This remark concerns both coarse-grained and all-atom force fields; it should be noted that a growing number of foldability-optimized all-atom force fields have appeared lately (Schug and Wenzel 2006; Yang et al. 2007; Arnautova and Scheraga 2008; Jagielska et al. 2008; Irbäck et al. 2009; Chapter 5). Consequently, if all coarse-grained force fields are knowledge-based because of using experimental information in parameterization, then all-atom force fields must also be considered as knowledge-based. In view of the above discussion, it seems reasonable to distinguish knowledge-based potentials from physics-based ones in the same manner as empirical equations of mechanical engineering for the relationship between stress and rod length or bend, velocity and shear friction, etc., are distinguished from formulas of theoretical mechanics and hydrodynamics. The empirical formulas mentioned above are set up to solve practical problems such as the construction of buildings, cars, bridges, and hydraulic systems and work only within the scope of their application. Consequently, a physics-based potential (coarse-grained or all-atom) must (i) be based on solid foundations that can be traced back to elementary laws of physics (such as the Schrödinger equation in the case of all-atom potentials and the Boltzmann law for coarse-grained potentials), (ii) apply consistent approximations to derive a force field or its terms from this principle, (iii) be meant to work without any ancillary information from structural databases, and (iv) reproduce structure, dynamics, and thermodynamics of folding.
The statistical potentials derived fully from structural databases both in form and parameters might meet criterion (i) [although various investigators suggest the contrary (Thomas and Dill 1996; Ben Naim 1997)]; however, they certainly do not meet criterion (ii) and most of them work only with explicit database information. It should be noted, though, that, e.g., the CABS statistical potential, originally designed to predict protein structures, also has the capacity to simulate folding pathways (Kmiecik and Koliński 2007, 2008).

3.6 Applications in Protein Structure Prediction

The physical foundation for prediction of protein structure is the Anfinsen thermodynamic hypothesis (Anfinsen 1973), according to which the native structure of a protein corresponds to the global free-energy minimum of this protein plus the surrounding solvent. Consequently, prediction of protein structure can be formulated as a global optimization problem (Scheraga et al. 2004). Optimization of the potential energy instead of the free energy was carried out until very recently (Liwo et al. 2007). Because of the large number of degrees of freedom, the problem is still intractable at the all-atom level except for relatively small proteins (Vila et al. 2003; Schug and Wenzel 2006; Irbäck et al. 2009; Chapter 5) and, consequently, coarse-grained models have been used extensively, including ROSETTA (Simons et al.


1997; Rohl et al. 2004), SICHO (Koliński and Skolnick 1998), CABS (Koliński 2004), C alpha and side chain (CAS) (Zhang et al. 2005), and UNRES (Liwo et al. 2008a). Despite some limitations related to the accuracy of the protein structure description by reduced models, many of them are among the most powerful and successful tools in protein structure prediction. Global optimization of the potential energy function was the method of choice in previous applications of UNRES to protein structure prediction (Lee et al. 1999; Liwo et al. 1999; Pillardy et al. 2001a; Ołdziej et al. 2005). The conformational-space annealing (CSA) method (Lee et al. 1999; Czaplewski et al. 2004a), which is a combination of energy minimization and a genetic algorithm, was implemented in this early work. As an example, the prediction of CASP3 target T0061 obtained with this approach (the best prediction for this target over all groups participating in CASP3) is shown in Fig. 3.7.

Fig. 3.7 Superpositions of the Cα traces of two fragments of the crystal (red) and predicted (yellow) structures of HDEA (CASP3 target T0061); the prediction was obtained with the UNRES force field and the CSA global optimization method. (a) Residues D25–I85 (Cα RMSD = 4.2 Å, over 61 residues). (b) Residues W16–K42 (Cα RMSD = 2.9 Å, over 27 residues). Helices 2, 3, 4, and 5 are indicated as H-2, H-3, H-4, and H-5, respectively. Reproduced with permission from figures 1 and 2 of Liwo et al. (1999)

Other search techniques, such as parallel tempering, were also implemented (Skolnick et al. 2001, 2003; Gront et al. 2005); in these approaches an all-atom energy (computed after converting the coarse-grained conformations to all-atom conformations), which is considered as more accurate than a united-residue energy function, was used as a scoring function. However, selecting the lowest-energy conformation or a number of energy-ranked conformations as candidates for the predicted structure ignores conformational entropy, which is a key component of the free energy. This problem was addressed by a number of investigators.


Rohl et al. (2004) developed the thermodynamic-clustering technique in which low-energy conformations resulting from a conformational search are accumulated, and a cluster analysis of the resulting set is carried out. The largest cluster is selected as the best prediction; this cluster does not necessarily contain the conformation with the lowest potential energy. A similar approach to ensembles obtained from lattice Monte Carlo dynamics parallel-tempering runs was developed by Skolnick et al. (2001, 2003) and Gront et al. (2005). These approaches are at the other extreme because they assume that only conformational entropy (the number of conformations in a cluster) is important in making a decision about the ranking of candidate predictions, while the potential energy is ignored in scoring and is used only in conformational sampling. Such an approach was motivated by the fact that neither the united-residue nor the all-atom energy functions are accurate yet (Kryshtafovych et al. 2005). Liwo et al. (2007) developed a rigorous thermodynamic-clustering approach, in which the results of a replica-exchange molecular dynamics run are processed by the weighted-histogram analysis method (WHAM) (Kumar et al. 1992, 1995) to calculate the probability of each conformation, then the conformations are ranked according to decreasing probabilities and, finally, those whose probabilities sum up to a cut-off value (usually 0.99) are clustered, and the probability of each cluster is calculated from Eq. (3.11).

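$$P_{\{I\}} = \frac{\sum_{i \in \{I\}} \exp\left( \omega_i - \frac{U_i}{RT} \right)}{\sum_{\{I\}} \sum_{i \in \{I\}} \exp\left( \omega_i - \frac{U_i}{RT} \right)} \qquad (3.11)$$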

where {I} denotes the set of conformations belonging to the Ith cluster, $P_{\{I\}}$ is the probability of the Ith cluster, $U_i$ is the energy of the ith conformation, $\omega_i$ is the logarithm of the weight of this conformation determined by the WHAM method (Liwo et al. 2007), R is the universal gas constant, and T is the absolute temperature. The inaccuracy of energy functions as well as the inefficiency of conformational search techniques for larger proteins seriously limits the accuracy and efficiency of database-free predictions of protein structure (Kryshtafovych et al. 2005). To overcome these limitations, knowledge-based information is often introduced into energy-based prediction methods. One method is to impose geometrical restraints derived from knowledge-based information during a conformational search. Such restraints both reduce the conformational space to be sampled and partially bypass the inaccuracies of the energy function by biasing it toward conformations containing native-like elements. Almost all successful protein structure prediction methods use secondary-structure restraints from secondary-structure prediction. Current secondary-structure prediction algorithms are very fast and very accurate (Wood and Hirst 2005) and, therefore, protein structure prediction methods benefit from using this information (Rohl et al. 2004; Koliński and Bujnicki 2005; Ołdziej et al. 2005). Long-range restraints derived from tertiary-contact prediction, which is based on sequence homology, are also included (Skolnick et al. 1997b, 2000, 2001, 2003; Koliński and Bujnicki 2005). Geometrical restraints are also introduced in


scoring the candidate predictions, for example, by checking them against the distances derived from multiple sequence alignment (Paluszewski and Karplus 2009, and references cited therein). Reduction of the conformational space to be sampled is accomplished by sequence threading (fold recognition) and by a fragment approach. In the threading approach, the conformational space is reduced to Cα or Cβ traces derived from the database of known proteins (Bryant and Lawrence 1993; Jones and Thornton 1993; Godzik et al. 1993; Miller et al. 1996; Meller and Elber 2001; Buchete et al. 2003, 2004), which makes the search much easier but restricts predictions to folds already present in the database. Early studies of the fragment technique (Ponnuswamy et al. 1973; Simon et al. 1991) were adopted as a build-up procedure (Vásquez and Scheraga 1985) to compute protein structure. This technique, later used by Baker and coworkers (Simons et al. 1997; Rohl et al. 2004), has so far been the most successful one in protein structure prediction. In the fragment technique used by Ponnuswamy et al. (1973) and by Baker and coworkers (Simons et al. 1997; Rohl et al. 2004), a library of possible conformations of three-, five-, seven-, and nine-residue-long fragments of the target sequence is created first. To construct the library, a multiple sequence alignment of the target sequence to database sequences is carried out, and fragments are selected either by a minimum-energy criterion (Ponnuswamy et al. 1973; Simon et al. 1991), or from the portions of the chains of database protein conformations whose sequences are sufficiently homologous to the respective portions of the target sequences (Simons et al. 1997; Rohl et al. 2004). In the procedure of Baker and coworkers (Simons et al. 1997; Rohl et al. 2004), the conformational search is performed in the conformational space defined by the fragment library. 
The effective energy function is the ROSETTA coarse-grained statistical potential (Simons et al. 1997; Rohl et al. 2004); however, specific filters (e.g., a check for hydrogen-bonded β-strands) are used during the search.

Since 1994, the techniques for protein structure prediction have been evaluated every other year in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments (Moult et al. 2007). Methods based on coarse-grained models of polypeptide chains have been assessed as very successful. The most successful approaches include ROSETTA (Simons et al. 1997; Rohl et al. 2004), threading assembly refinement (TASSER), which uses the reduced CAS model of a polypeptide chain (Zhang et al. 2005), and the CABS model (Koliński 2004). All of these methods consistently appear among the best predictors in the new-fold category (targets with little or no sequence similarity to proteins of known three-dimensional structure) (Jauch et al. 2007).

3.7 Applications to Study Protein Dynamics and Thermodynamics

Proteins are dynamic objects. Computing the static native structure from a knowledge of the amino acid sequence is only the first approach to the protein-folding

problem. The second problem, even more demanding, is to compute structural pathways and rates from the unfolded to the folded form. Protein dynamics plays a fundamental role in biological processes such as enzymatic reactions, signal transduction, immunological processes, and cell motility, and also in pathological processes such as cancer and amyloid formation (Dobson 2003). Both experimental and theoretical techniques are used to study protein dynamics and the mechanisms of protein folding. Experimental studies of protein dynamics are restricted to techniques which provide only indirect and fragmentary information, leaving wide room for interpretation. Recent advances in single-molecule studies (Cecconi et al. 2005) facilitate the detailed experimental investigation of the folding pathways of some proteins. However, the development of new methods of simulation and comparison of theoretical and experimental characteristics of protein folding is of crucial importance for advancing our understanding of biological systems.

Because of the complexity of the systems, all-atom studies of protein dynamics are restricted mostly to simulations of folded globular proteins around their native states or to unfolding simulations, starting from the experimental structure, except for small proteins (Scheraga et al. 2007). Even for small proteins, only a few trajectories can be run, which makes it difficult to compare the results with those of experiments, which provide ensemble-averaged properties. To carry out large-scale simulations one has to resort to coarse-grained models of proteins.

The method of choice to study protein dynamics is molecular dynamics (MD) (Scheraga et al. 2007). In the MD method, Newton’s equations of motion are solved to obtain the coordinates and momenta of all particles:

$$m_i \frac{d^2 \mathbf{r}_i}{dt^2} = -\nabla_{\mathbf{r}_i} U(\mathbf{r}_i), \quad i = 1, 2, \ldots, N \tag{3.12}$$

where $\mathbf{r}_i$ is the vector of the Cartesian coordinates and $m_i$ is the mass of the ith particle, $U$ is the potential energy of the system, $-\nabla_{\mathbf{r}_i} U(\mathbf{r}_i)$ is the potential force acting on particle i, and $N$ is the number of particles. It should be noted that Eq. (3.12) is valid only if the coordinates describing the system are the Cartesian coordinates of the centers of the sites; a non-diagonal tensor of inertia appears instead of point masses on the left-hand side of the system of the equations of motion if other coordinates such as virtual-bond dihedral angles (He and Scheraga 1998) or virtual-bond vectors (Khalili et al. 2005a) are used. In coarse-grained models, Langevin equations of motion are often used in simulations to mimic the non-conservative forces from the solvent:

$$m_i \frac{d^2 \mathbf{r}_i}{dt^2} = -\nabla_{\mathbf{r}_i} U(\mathbf{r}_i) - m_i \gamma_i \frac{d\mathbf{r}_i}{dt} + \mathbf{R}_i(t), \quad i = 1, 2, \ldots, N \tag{3.13}$$

where $\gamma_i$ is the friction coefficient of the ith particle and $\mathbf{R}_i(t)$ is the vector of random forces arising from the collisions of particle i with the molecules of the solvent that are not considered explicitly. $\mathbf{R}_i(t)$ has zero mean, and its values taken at different times are δ-correlated:

$$\langle \mathbf{R}_i(t) \cdot \mathbf{R}_i(t') \rangle = 2 m_i \gamma_i k_B T \, \delta(t - t') \tag{3.14}$$

where $T$ is the absolute temperature of the system, $k_B$ is the Boltzmann constant, and $\delta(x)$ is the Dirac δ-function. In the overdamped limit, the friction terms prevail over the inertial terms [those on the left-hand side of Eq. (3.13)], and the latter can be neglected, resulting in a system of first-order differential equations often referred to as Brownian dynamics:

$$m_i \gamma_i \frac{d\mathbf{r}_i}{dt} = -\nabla_{\mathbf{r}_i} U(\mathbf{r}_i) + \mathbf{R}_i(t) \tag{3.15}$$
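As an illustration of how Eq. (3.15) can be propagated numerically, the following minimal sketch applies a simple Euler-Maruyama step to a single coordinate. It is not taken from any of the cited codes: the function names, the one-dimensional harmonic test potential, and the reduced units are assumptions made only for this example. The width of the random displacement follows from the δ-correlation of the random force in Eq. (3.14).

```python
import math
import random


def brownian_step(x, force, m_gamma, kT, dt, rng):
    """One Euler-Maruyama step of Eq. (3.15) for a single coordinate x.

    m_gamma is the product m_i * gamma_i and kT is k_B * T (assumed units).
    """
    # Deterministic drift: dx = F(x) * dt / (m * gamma)
    drift = force(x) * dt / m_gamma
    # Random displacement with variance 2 * kT * dt / (m * gamma),
    # which follows from the delta-correlated random force
    noise = math.sqrt(2.0 * kT * dt / m_gamma) * rng.gauss(0.0, 1.0)
    return x + drift + noise


def run_brownian(x0, force, m_gamma, kT, dt, n_steps, seed=0):
    """Return the list of positions visited, including the starting point."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n_steps):
        xs.append(brownian_step(xs[-1], force, m_gamma, kT, dt, rng))
    return xs


def harmonic_force(x, k=1.0):
    # Force for the hypothetical test potential U(x) = k * x**2 / 2
    return -k * x
```

Note that no velocity appears anywhere in this scheme; only positions are propagated, which is the practical meaning of neglecting the inertial term.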

Brownian dynamics is frequently used for coarse-grained models of proteins, but it does not provide control over the kinetic energy because of the neglect of the inertial term, and, therefore, it cannot lead to correct values of thermodynamic properties.

Monte Carlo (MC) methods are also used to study protein dynamics, especially in the case of lattice models, for which MD cannot be applied. Lattice representation is most often used in simplistic coarse-grained models, which are not meant to reproduce the detailed features of the protein energy surface, but rather to study general properties of folding. Simulations with simplistic general reduced models have contributed significantly to our understanding of the events that occur in the folding process and of foldability criteria (Veitshans et al. 1996; Shakhnovich 2006). Lattice MC dynamics simulations of the CABS model of polypeptide chains (Koliński 2004) were performed by Koliński and coworkers (Koliński et al. 2003; Kmiecik et al. 2006; Kmiecik and Koliński 2007, 2008) to study protein unfolding and folding dynamics.

Many coarse-grained protein-folding simulations are carried out with the use of Gō-like (Ueda et al. 1978; Cieplak et al. 2002; Chapter 8) or other structure-based potentials (Sorenson and Head-Gordon 2002; Brown et al. 2003; Brown and Head-Gordon 2004), which are biased toward the native structure, or model potentials (Thirumalai and Klimov 1999) which can reproduce only general features of protein folding. Use of structure-based potentials is based on the assumption that the native-state topology largely determines folding, whereas non-native interactions are of secondary importance. Examples of such applications are studies of the kinetics and sequence of folding events (Hoang and Cieplak 2000; Cieplak et al. 2002; Chapter 8) and the thermal (Cieplak and Sulkowska 2005) and mechanical (Cieplak and Szymczak 2006) unfolding of proteins. 
Head-Gordon and coworkers have used a model with a bias toward native secondary structure in the backbone potential and interactions between side chains depending on their hydrophobicities to study the folding kinetics of ubiquitin-like sequences (Sorenson and Head-Gordon 2002), as well as proteins L and G (Brown and Head-Gordon 2004), by Langevin dynamics. He and Scheraga (1998) used a simpler model with a secondary-structure bias to study the folding of model β-sheet sequences. Small-scale motions are also studied by using elastic network models (Atilgan et al. 2001; Ming and Bruschweiler 2006).
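The Langevin scheme of Eqs. (3.13) and (3.14), which underlies several of the studies mentioned above, can be sketched for a single coordinate as follows. This is a deliberately simple first-order integrator written for illustration only; production codes use more careful algorithms, and the function name, argument names, and reduced units are assumptions of this example rather than part of any cited program.

```python
import math
import random


def langevin_trajectory(x0, v0, force, mass, gamma, kT, dt, n_steps, seed=0):
    """Integrate Eq. (3.13) for one coordinate; returns lists of x and v.

    mass, gamma and kT (= k_B * T) are given in arbitrary consistent units.
    """
    rng = random.Random(seed)
    # Standard deviation of the velocity change caused by the random
    # force R(t) over one step, derived from Eq. (3.14)
    kick = math.sqrt(2.0 * gamma * kT * dt / mass)
    x, v = x0, v0
    xs, vs = [x], [v]
    for _ in range(n_steps):
        # Potential force, friction and random force update the velocity
        v += (force(x) / mass - gamma * v) * dt + kick * rng.gauss(0.0, 1.0)
        # The position is then advanced with the updated velocity
        x += v * dt
        xs.append(x)
        vs.append(v)
    return xs, vs
```

With gamma and kT both set to zero the loop reduces to a plain integration of Newton's equation, Eq. (3.12). Unlike the Brownian scheme of Eq. (3.15), the velocities are propagated explicitly here, so the kinetic energy can be monitored during the run.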

Many existing coarse-grained potentials perform well (in particular, they can predict the native structure of a protein) only when used in connection with information extracted from protein databases, which is not acceptable when studying protein dynamics. The reason for this deficiency lies in the derivation of these potentials either by analogy to all-atom force fields or from database statistics; neither method offers a clear, systematic first-principles derivation of the force field. Recently, we implemented the physics-based united-residue UNRES force field for coarse-grained molecular dynamics (MD) (Liwo et al. 2005; Khalili et al. 2005a,b, 2006). Comparison with all-atom MD revealed that UNRES offers a 4,000-fold speed-up relative to all-atom simulations with explicit solvent and more than a 200-fold speed-up compared with simulations with implicit solvent (Khalili et al. 2005b). Consequently, real-time folding simulations with UNRES are readily possible. Initial results from UNRES MD simulations show that we are able to simulate folding events that take place on a microsecond or even a millisecond timescale. An example of a folding trajectory of the 1G6U dimeric protein is presented in Fig. 3.8.

[Figure: trajectory snapshots at t = 0.07 ns, 0.1 ns, 0.18 ns, 0.2 ns, and 0.49 ns, with the C terminus of each chain labeled "C"]

Fig. 3.8 Example of a fast folding trajectory of the two-chain 1G6U obtained with the UNRES force field and Berendsen dynamics. The C terminus of each chain is marked. Reproduced with permission from figure 9 of Rojas et al. (2007)

In our very recent work (Liwo et al. 2010), the capacity of UNRES MD was extended by parallelizing energy and force calculations. A sample scalability graph obtained with IBM BlueGene is presented in Fig. 3.9. It can be seen that an 80-fold speed-up was obtained for a protein of about 800 residues with 128 processors, which enables millisecond-scale simulations of proteins of this size to be accomplished in days and, consequently, will enable us to simulate biologically important events such as domain motion during enzymatic catalysis, ligand binding and release, and G-protein binding and dissociation from G-protein coupled receptors.

Canonical MD simulation can be used for estimating thermodynamic properties as well as for a global search, but in practice it easily becomes trapped, and thus

Fig. 3.9 Strong scaling data of the fine-grained UNRES/MD code obtained with IBM BlueGene at Argonne National Laboratory (intrepid.alcf.anl.gov) for proteins of different size and canonical MD simulations

is not an effective method for searching rough free-energy landscapes of proteins. A single MD trajectory cannot sample the vast thermally accessible configurational space of biomolecules to capture large-scale rearrangements, including reaction pathways, mechanisms, and rates. More efficient conformational sampling algorithms are essential components of methods for studying protein dynamics. One of the most effective sampling methods, the replica-exchange (RE) method, also known as exchange MC (Hukushima and Nemoto 1996) or parallel tempering (Hansmann 1997; Chapter 9), was initially developed to improve sampling in glassy systems in statistical physics. However, following Hansmann’s use of the method in simulations of a simple peptide, Met-enkephalin (Hansmann 1997), and Sugita and Okamoto’s formulation of an MD version of the algorithm (Sugita and Okamoto 1999), the RE method has been applied extensively in biomolecular simulations. The replica-exchange MD (REMD) method combines the ideas of simulated annealing MD and MC methods and is one of the generalized-ensemble algorithms that perform a random walk in energy space due to a free random walk in temperature space. In the REMD method, n replica systems, each in the canonical ensemble and each at a different temperature, are simulated. At given intervals, swaps, or exchanges, of the configurational variables between systems are accepted with the Metropolis criterion. This is equivalent to exchanges of temperatures, because the set of n replica systems can be treated as the set of n continuous MD trajectories of

varying temperatures or the set of n canonical ensembles at particular temperatures with structures from all trajectories sorted by temperature. The replica-exchange method was the subject of a recent review (Earl and Deem 2005) which discussed both the history of the method and its application to various physicochemical simulations. The great potential of the RE method was also recognized in a review of sampling methods for molecular simulation (Lei and Duan 2007). REMD is the method of choice for studies of the thermodynamics of protein folding. Various thermodynamic properties are available as a function of temperature through histogram reweighting techniques (WHAM) (Kumar et al. 1992, 1995). Low free-energy minima are accessible through accelerated relaxation. However, REMD does not provide direct information about kinetics, as opposed to multiple-trajectory canonical MD simulations.

We compared three generalized-ensemble algorithms for molecular simulations, namely, a replica-exchange method (RE), a replica-exchange multicanonical method (REMUCA), and a replica-exchange multicanonical method with replica exchange (REMUCAREM) in both MC and MD versions, to determine the thermodynamic characteristics of the UNRES force field for efficient sampling at various temperatures using three model systems: one peptide (polyalanine of 20 residues length) and two small proteins, an α-helical protein of 46 residues (the B domain of the staphylococcal protein A; 1BDD) and an α + β protein of 48 residues (the Escherichia coli MltD LysM domain; 1E0G) (Nanias et al. 2006). For polyalanine and α-helical protein A, all algorithms performed reasonably well, but for the more complicated α + β protein (1E0G), the REMD method, especially in its multiplexed version (MREMD) introduced by Rhee and Pande (2003), turned out to be the most efficient. 
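The Metropolis test applied to a replica swap, as described above, can be written down compactly. The sketch below is an assumed, minimal illustration and not the UNRES implementation: the function names, the use of the gas constant in kcal/(mol K), and the scheme of alternating even and odd neighbor pairs are choices made for this example. The acceptance probability is the ratio of the joint Boltzmann weights of the two replicas after and before the swap, capped at one.

```python
import math
import random

R_GAS = 0.0019872  # universal gas constant in kcal/(mol K)


def swap_probability(temp_a, energy_a, temp_b, energy_b):
    """Metropolis probability of swapping the conformations held at two temperatures."""
    beta_a = 1.0 / (R_GAS * temp_a)
    beta_b = 1.0 / (R_GAS * temp_b)
    # Logarithm of the ratio of joint Boltzmann weights (after / before swap)
    delta = (beta_a - beta_b) * (energy_a - energy_b)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def exchange_sweep(temps, energies, owner, rng, offset=0):
    """Attempt swaps between neighboring temperatures.

    temps    -- temperatures in increasing order
    energies -- energies[r] is the current potential energy of replica r
    owner    -- owner[k] is the replica currently simulated at temps[k]
    offset   -- 0 or 1, so that even and odd neighbor pairs can alternate
    Returns the number of accepted swaps; owner is updated in place.
    """
    accepted = 0
    for k in range(offset, len(temps) - 1, 2):
        a, b = owner[k], owner[k + 1]
        p = swap_probability(temps[k], energies[a], temps[k + 1], energies[b])
        if rng.random() < p:
            # Swapping owners is equivalent to exchanging temperatures
            owner[k], owner[k + 1] = b, a
            accepted += 1
    return accepted
```

A swap that hands the lower-energy conformation to the lower temperature is always accepted; the opposite swap is accepted only with the Boltzmann-factor probability, which is what lets conformations diffuse along the temperature ladder.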
Replica exchange was also used to test the coarse-grained OPEP force field on six peptides: decaalanine, (AAQAA)3, the C-terminal β-hairpin from protein G, the Trp-cage, the BBA zinc-finger motif, and a dimer of a coiled-coil 7-residue peptide (Chebaro et al. 2009). REMD simulations with OPEP showed the α-helical character of alanine-based peptides and reproduced the experimental structure of the β-hairpin and Trp-cage within 1–2 Å. For all systems studied, the proper thermodynamics was recovered. The experimental melting temperature of the Trp-cage was reproduced within a small deviation of 25 K in contrast to previous all-atom simulations. For the peptide with the BBA fold, a cluster of conformations with the fold close to that determined by NMR (a β-hairpin packed against an α-helix) was obtained; however, this cluster did not have the lowest free energy.

Recently, we compared canonical MD, REMD, and MREMD simulations of protein folding with the UNRES force field in detail, using two model proteins: an α + β protein of 48 residues (the E. coli MltD LysM domain; 1E0G) and an α-protein of 67 residues (the de novo designed protein; 1LQ7) (Czaplewski et al. 2009). To enhance the sampling in MREMD, the replicas are multiplexed with a number of independent molecular dynamics runs at each temperature (Rhee and Pande 2003). Exchanges of conformations between random replicas of neighboring temperatures are tried as in REMD, but there is a larger number of such pairs in MREMD than in REMD. In MREMD, there are several layers of replicas, each of which has all

different temperature levels and is equivalent to a single REMD simulation. Exchanges between replicas in different layers are tried as well as exchanges between replicas in the same layer. It should be noted that multiplexing provides better sampling compared to REMD with the same number of replicas (and, consequently, more temperatures). The reason for sampling enhancement in MREMD is that efficient sampling requires diffusion in temperature replica space. Therefore, adding more temperature replicas in REMD (which, at first thought, would seem to increase efficiency) means that the number of swaps grows quadratically and that either longer simulations are needed or exchanges must be attempted more frequently. However, the system needs time to equilibrate with a new temperature distribution after each exchange, which prevents too-frequent exchanges. The MREMD method, on the other hand, takes advantage of both the multiple temperature aspect of REMD, as well as the large number of independent simulations to enhance sampling considerably. Even with a large number of replicas, the simulation time should be long enough so that each trajectory can cover the entire conformational space as well as the entire temperature space (Czaplewski et al. 2009). Moreover, multiplexing effectively extends the intrinsic parallelism of the RE algorithm. Resolution exchange, a variant of the replica-exchange method, combines the efficiency of coarse-grained simulation and the accuracy of all-atom simulation by swapping conformations between replicas simulated at different resolutions (e.g., coarse-grained and all-atom) (Lyman et al. 2006). The crucial step in this method is the reconstruction of the higher resolution system from models at lower resolutions. Many reconstruction algorithms result in biased samples and make the calculation of unbiased thermodynamic averages very difficult. 
Recently a configurational-bias Monte Carlo procedure was adopted in resolution exchange as a general method to rebuild the missing degrees of freedom (Liu et al. 2008). This approach uses the all-atom Hamiltonian to guide the reconstruction process, and the bias introduced is removed through the acceptance criteria.
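The multiplexed exchange step of MREMD described earlier in this section (several independent replicas at each temperature, with swaps tried between randomly chosen replicas of neighboring temperatures) might be organized as in the following sketch. It is an assumption-laden illustration rather than the published algorithm or the UNRES code: the data layout (`level_of`), the function names, and the even/odd alternation of temperature pairs are hypothetical choices made here.

```python
import math
import random

R_GAS = 0.0019872  # universal gas constant in kcal/(mol K)


def swap_probability(temp_a, energy_a, temp_b, energy_b):
    """Metropolis probability of exchanging two replicas between two temperatures."""
    delta = (1.0 / (R_GAS * temp_a) - 1.0 / (R_GAS * temp_b)) * (energy_a - energy_b)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def mremd_sweep(temps, level_of, energies, rng, offset=0):
    """One multiplexed exchange sweep.

    temps    -- temperatures in increasing order
    level_of -- level_of[r] is the index in temps at which replica r runs
    energies -- energies[r] is the current potential energy of replica r
    offset   -- 0 or 1, selects which neighboring temperature pairs are tried
    Returns the number of accepted exchanges; level_of is updated in place.
    """
    accepted = 0
    for k in range(offset, len(temps) - 1, 2):
        lower = [r for r, lev in enumerate(level_of) if lev == k]
        upper = [r for r, lev in enumerate(level_of) if lev == k + 1]
        # Random pairing of replicas across the two temperature levels
        rng.shuffle(lower)
        rng.shuffle(upper)
        for a, b in zip(lower, upper):
            p = swap_probability(temps[k], energies[a], temps[k + 1], energies[b])
            if rng.random() < p:
                # The replicas keep their conformations and trade temperatures
                level_of[a], level_of[b] = k + 1, k
                accepted += 1
    return accepted
```

Because every accepted exchange trades one replica upward for one replica downward, the number of replicas at each temperature is conserved, and with a multiplicity of one the sweep reduces to ordinary REMD.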

3.8 Conclusions and Outlook

Returning to the statement in Section 3.1 that describes the two aspects of the protein-folding problem, it is clear that, while there has been considerable progress with coarse-grained models, there is a need for further improvement in force fields and in search procedures. Of the two, optimization of the search procedure seems more advanced at present, given the fine-grain parallel processing (Liwo et al. 2010) that has already led to computations for proteins containing up to 1,000 residues with the currently available computer resources. Molecular dynamics with a coarse-grained model, leading to canonical ensembles with MREMD procedures, has already pushed the simulation timescale into the experimentally accessible one. With anticipated enhancements of computer power in the near future, even longer-time simulations can be expected.

The situation is not as bright as far as improvement of force fields is concerned. While current coarse-grained force fields seem to be based on reasonable physics, with a rigorous definition of the coarse-grained effective energy functions as potentials of mean force [Eq. (3.2)], and adequate procedures have been developed to derive and parameterize individual effective energy terms and to determine their relative weights, it is clear that more effort must be expended to improve the physics. At present, the coarse-grained force fields work reliably in protein structure prediction only with a heavy input from knowledge-based information. Better treatment of solvent-mediated side-chain–side-chain interactions (including electrostatic interactions), local interactions, and correlation terms seems to be the primary target for improvement. The importance of local interactions and, possibly, the associated correlation terms, has been shown by the tremendous success of the ROSETTA approach (Simons et al. 1997; Rohl et al. 2004), which is based on the construction of the backbone from a carefully constructed fragment library (cf. Section 3.6). Nevertheless, improvement of force fields would seem to be the barrier that must be surmounted if we are to simulate the dynamics, structure, and thermodynamics involved in folding large proteins on experimentally observed timescales.

Finally, it will ultimately be necessary to re-convert our coarse-grained trajectories to all-atom ones in order to be able to specify the interactions on an atomic scale that control the dynamics during protein folding. A possible alternative would involve a hybrid procedure combining aspects of an all-atom model with a coarse-grained one.

Acknowledgments

We thank Ana Rojas for assistance in preparing Fig. 3.8. This work is supported by grants from the National Institutes of Health (GM-14312), the National Science Foundation (MCB00-03722), and grants N N204 049035 (grant contract number 0490/H03/2008/35) and N N204 152836 from the Polish Ministry of Science and Higher Education. Mariusz Makowski was also supported by a grant from the “Homing” program of the Foundation for Polish Science (FNP) and MF EOG resources. This research is conducted by using the resources of (a) our 800-processor Beowulf cluster at the Baker Laboratory of Chemistry and Chemical Biology, Cornell University, (b) the National Science Foundation Terascale Computing System at the Pittsburgh Supercomputing Center, (c) the John von Neumann Institute for Computing at the Central Institute for Applied Mathematics, Forschungszentrum Juelich, Germany, (d) the Beowulf cluster at the Department of Computer Science, Cornell University, (e) the resources of the Center for Computation and Technology at Louisiana State University, which is supported by funding from the Louisiana legislature, (f) our 45-processor Beowulf cluster at the Faculty of Chemistry, University of Gdańsk, (g) the Informatics Center of the Metropolitan Academic Network (IC MAN) in Gdańsk, and (h) the Interdisciplinary Center of Mathematical and Computer Modeling (ICM) at the University of Warsaw.

References

Ahmed A, Gohlke H (2006) Multiscale modeling of macromolecular conformational changes combining concepts from rigidity and elastic network theory. Proteins: Struct Func Bioinf 63:1038–1051
Anfinsen C (1973) Principles that govern the folding of protein chains. Science 181:223–230
Arnautova YA, Scheraga HA (2008) Use of decoys to optimize an all-atom force field including hydration. Biophys J 95:2434–2449

Atilgan AR, Durell SR, Jernigan RL, Demirel MC, Keskin O, Bahar I (2001) Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys J 80:505–515
Ayton GS, Noid WG, Voth GA (2007a) Multiscale modeling of biomolecular systems: in serial and in parallel. Curr Opin Struct Biol 17:192–198
Ayton GS, Noid WG, Voth GA (2007b) Systematic coarse graining of biomolecular and soft-matter systems. MRS Bull 32:929–934
Bahar I, Atilgan AR, Erman B (1997) Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Fold Des 2:173–181
Ben Naim A (1997) Statistical potentials extracted from protein structures: are these meaningful potentials? J Chem Phys 107:3698–3706
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucl Acids Res 28:235–242
Betancourt MR (2008) Knowledge-based potential for the polypeptide backbone. J Phys Chem B 112:5058–5069
Brown S, Fawzi NJ, Head-Gordon T (2003) Coarse-grained sequences for protein folding and design. Proc Natl Acad Sci USA 100:10712–10717
Brown S, Head-Gordon T (2004) Intermediates and the folding of proteins L and G. Prot Sci 13:958–970
Bryant SH, Lawrence CE (1993) An empirical energy function for threading protein sequence through the folding motif. Proteins: Struct Func Genet 16:92–112
Bryngelson JD, Wolynes PG (1987) Spin glasses and the statistical mechanics of protein folding. Proc Natl Acad Sci USA 84:7524–7528
Buchete NV, Straub JE, Thirumalai D (2003) Anisotropic coarse-grained statistical potentials improve the ability to identify native-like protein structures. J Chem Phys 118:7658–7671
Buchete NV, Straub JE, Thirumalai D (2004) Development of novel statistical potentials for protein fold recognition. Curr Opin Struct Biol 14:225–232
Burgess AW, Scheraga HA (1975) Assessment of some problems associated with prediction of the three-dimensional structure of a protein from its amino-acid sequence. Proc Natl Acad Sci USA 72:1221–1225
Camacho CJ, Thirumalai D (1996) A criterion that determines fast folding of proteins: A model study. Europhys Lett 35:627–632
Casari G, Sippl MJ (1992) Structure-derived hydrophobic potential. Hydrophobic potential derived from X-ray structures of globular proteins is able to identify native folds. J Mol Biol 224:725–732
Cecconi C, Shank EA, Bustamante C, Marqusee S (2005) Direct observation of the three-state folding of a single protein molecule. Science 309:2057–2060
Chan HS, Dill KA (1989) Compact polymers. Macromolecules 22:4559–4573
Chan HS, Dill KA (1990) Origins of structure in globular proteins. Proc Natl Acad Sci USA 16:6388–6392
Chan HS, Dill KA (1991) Polymer principles in protein structure and stability. Annu Rev Biophys Biophys Chem 20:447–490
Chan HS, Dill KA (1994) Transition-states and folding dynamics of proteins and heteropolymers. J Chem Phys 12:9235–9257
Chebaro Y, Dong X, Laghaei R, Derreumaux P, Mousseau N (2009) Replica exchange molecular dynamics simulations of coarse-grained proteins in implicit solvent. J Phys Chem B 113:267–274
Chen WW, Shakhnovich EI (2005) Lessons from the design of a novel atomic potential for protein folding. Protein Sci 14:1741–1752
Chikenji G, Fujitsuka Y, Takada S (2001) A reversible fragment assembly method for de novo protein structure prediction. J Chem Phys 119:6895–6903
Chinchio M, Czaplewski C, Liwo A, Ołdziej S, Scheraga HA (2007) Dynamic formation and breaking of disulfide bonds in molecular dynamics simulations with the UNRES force field. J Chem Theory Comput 3:1236–1248

Chu JW, Voth GA (2007) Coarse-grained free energy functions for studying protein conformational changes: a double-well network model. Biophys J 11:3860–3871
Cieplak M, Hoang TX, Robbins MO (2002) Thermal folding and mechanical unfolding pathways of protein secondary structures. Proteins: Struct Func Genet 49:104–113
Cieplak M, Sulkowska J (2005) Thermal unfolding of proteins. J Chem Phys 123:194908
Cieplak M, Szymczak P (2006) Protein folding in a force clamp. J Chem Phys 124:194901
Clementi C (2008) Coarse-grained models of protein folding: toy models or predictive tools? Curr Opin Struct Biol 18:10–15
Colombo G, Micheletti C (2006) Protein folding simulations: combining coarse-grained models and all-atom molecular dynamics. Theor Chem Acc 116:75–86
Covell DG (1992) Folding protein α-carbon chains into compact forms by Monte Carlo methods. Proteins: Struct Func Genet 14:409–420
Crippen GM, Viswanadhan VN (1984) A potential function for conformational analysis of proteins. Int J Peptide Protein Res 24:279–296
Crippen GM, Viswanadhan VN (1987) Determination of an empirical energy function for protein conformational analysis by energy embedding. J Comput Chem 8:972–981
Crippen GM, Snow ME (1990) A 1.8 Å resolution potential function for protein folding. Biopolymers 29:1479–1489
Czaplewski C, Rodziewicz-Motowidło S, Liwo A, Ripoll DR, Wawak RJ, Scheraga HA (2000) Molecular simulation study of cooperativity in hydrophobic association. Protein Sci 9:1235–1245
Czaplewski C, Rodziewicz-Motowidło S, Dabal M, Liwo A, Ripoll DR, Scheraga HA (2003) Molecular simulation study of cooperativity in hydrophobic association: cluster of four hydrophobic particles. Biophys Chem 105:339–359
Czaplewski C, Liwo A, Pillardy J, Ołdziej S, Scheraga HA (2004a) Improved conformational space annealing method to treat β-structure with the UNRES force-field, to enhance scalability of parallel implementation. Polymer 45:677–686
Czaplewski C, Ołdziej S, Liwo A, Scheraga HA (2004b) Prediction of the structures of proteins with the UNRES force field including dynamic formation and breaking of disulfide bonds. Protein Eng Des Select 17:29–36
Czaplewski C, Kalinowski S, Liwo A, Scheraga HA (2009) Application of multiplexing replica exchange molecular dynamics method to the UNRES force field: tests with alpha and alpha+beta proteins. J Chem Theor Comput 5:627–640
Das P, Matysiak S, Clementi C (2005) Balancing energy and entropy: a minimalist model for the characterization of protein folding landscapes. Proc Natl Acad Sci USA 102:10141–10146
Derreumaux P (1997) Folding a 20 amino acid alpha beta peptide with the diffusion process-controlled Monte Carlo method. J Chem Phys 107:1941–1947
Derreumaux P (1999) From polypeptide sequences to structures using Monte Carlo simulations and an optimized potential. J Chem Phys 111:2301–2310
Derreumaux P, Mousseau N (2007) Coarse-grained protein molecular dynamics simulations. J Chem Phys 126:025101
Dill KA, Bromberg S, Yue KZ, Fiebig KM, Yee DP, Thomas PD, Chan HS (1995) Principles of protein folding – a perspective from simple exact models. Prot Sci 4:561–602
Dobson CM (2003) Protein folding and misfolding. Nature 426:884–890
Earl DJ, Deem MW (2005) Parallel tempering: theory applications and new perspectives. Phys Chem Chem Phys 7:3910–3916
Eastwood MP, Hardin C, Luthey-Schulten Z, Wolynes PG (2002) Statistical mechanical refinement of protein structure prediction schemes: cumulant expansion approach. J Chem Phys 117:4602–4615
Eastwood MP, Hardin C, Luthey-Schulten Z, Wolynes PG (2003) Statistical mechanical refinement of protein structure prediction schemes. II. Mayer cluster expansion approach. J Chem Phys 118:8500–8512

Eskow E, Bader D, Byrd R, Crivelli S, Head-Gordon T, Lamberti V, Schnabel R (2004) An optimization approach to the problem of protein structure prediction. Math Program 101:497–514
Flynn MJ (1999) Basic issues in microprocessor architecture. J System Arch 45:939–948
Fowler RH, Guggenheim EA (1949) Statistical thermodynamics. Cambridge University Press, Cambridge
Friedrichs MS, Goldstein RA, Wolynes PG (2001) Generalized protein tertiary structure recognition using associative memory Hamiltonians. J Mol Biol 222:1013–1034
Fujitsuka CG, Takada S (2004) Protein folding mechanisms and energy landscape of Src SH3 domain studied by a structure prediction toolbox. Chem Phys 307:157–162
Fujitsuka Y, Takada S, Luthey-Schulten ZA, Wolynes PG (2004) Optimizing physical energy functions for protein folding. Proteins: Struct Func Genet 54:88–103
Furuichi E, Koehl P (1998) Influence of protein structure databases on the predictive power of statistical pair potentials. Proteins: Struct Func Genet 31:139–149
Gay JG, Berne BJ (1981) Modification of the overlap potential to mimic a linear site-site potential. J Chem Phys 74:3316–3319
Gerber PR (1992) Peptide mechanics: a force field for peptides and proteins working with entire residues as smallest units. Biopolymers 32:1003–1017
Gilis D, Rooman M (1996) Stability changes upon mutation of solvent accessible residues in proteins evaluated by database-derived potentials. J Mol Biol 257:1112–1126
Gilis D, Rooman M (1997) Predicting protein stability changes upon mutation using database-derived potentials: solvent accessibility determines the importance of local versus non-local interactions along the sequence. J Mol Biol 272:276–290
Godzik A, Koliński A, Skolnick J (1993) De novo and inverse folding predictions of protein structure and dynamics. J Comput Aid Mol Des 7:397–438
Goel NS, Yčas M (1979) On the computation of the tertiary structure of globular proteins II. J Theor Biol 77:253–305
Gohlke H, Klebe G (2001) Statistical potentials and scoring functions applied to protein–ligand binding. Curr Opin Struct Biol 11:231–235
Goldstein RA, Luthey-Schulten ZA, Wolynes PG (1992a) Optimal protein-folding codes from spin-glass theory. Proc Natl Acad Sci USA 89:4918–4922
Goldstein RA, Luthey-Schulten ZA, Wolynes PG (1992b) Protein tertiary structure recognition using optimized Hamiltonians with local interactions. Proc Natl Acad Sci USA 89:9029–9033
Gregoret LM, Cohen FE (1990) Novel method for rapid evaluation of packing in protein structures. J Mol Biol 211:959–974
Gront D, Koliński A, Hansmann UHE (2005) Exploring protein energy landscapes with hierarchical clustering. Int J Quant Chem 105:826–830
Gront D, Latek D, Kurciński M, Koliński A (2009) Template-free predictions of the three-dimensional protein structures: from first principles to knowledge-based potentials. In: Bujnicki J (ed) Prediction of protein structure functions and interactions. Wiley, San Francisco, CA
Hansmann UHE (1997) Parallel tempering algorithm for conformational studies of biological molecules. Chem Phys Lett 281:140–150
Hao M-H, Scheraga HA (1994) Monte Carlo simulation of a first-order transition for protein folding. J Phys Chem 98:4940–4948
Hao M-H, Scheraga HA (1996a) How optimization of potential functions affects protein folding. Proc Natl Acad Sci USA 93:4984–4989
Hao M-H, Scheraga HA (1996b) Optimizing potential functions for protein folding. J Phys Chem 100:14540–14548
Hardin C, Eastwood MP, Prentiss M, Luthey-Schulten Z, Wolynes PG (2002) Folding funnels: the key to robust protein structure prediction. J Comput Chem 23:138–146
He S, Scheraga HA (1998) Brownian dynamics simulations of protein folding. J Chem Phys 108:287–300

3

Protein Coarse-Grained Models

75

He Y, Xiao Y, Liwo A, Scheraga HA (2009) Exploring the parameter space of the coarse-grained UNRES force field by random search: selecting a transferable medium-resolution force field. J Comput Chem 30:2127–2135 Hendlich M, Lackner P, Weitckus S, Floeckner H, Froschauer R, Gottsbacher K, Casari G, Sippl MJ (1990) Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J Mol Biol 216:167–180 Hills RD, Brooks CL (2009) Insights from coarse-grained G¯o models for protein folding and dynamics. Int J Mol Sci 10:889–905 Hinsen K (1998) Analysis of domain motions by approximate normal mode calculations. Proteins: Struct Func Genet 33:417–429 Hoang TX, Cieplak M (2000) Molecular dynamics of folding of secondary structures in G¯o-like models of proteins. J Chem Phys 112:6851–6862 Hoffman D, Knapp EW (1996) Protein dynamics with off-lattice Monte Carlo moves. Phys Rev E 53:4221–4224 Hukushima K, Nemoto K (1996) Exchange Monte Carlo method and application to spin glass simulations. J Phys Soc Jpn 65:1604–1608 Irbäck A, Mitternacht S, Mohanty S (2009) An effective all-atom potential for proteins. PMC Biophys 2:2 Izvekov S, Voth GA (2005a) A multiscale coarse-graining method for biomolecular systems. J Phys Chem B 109:2469–2473 Izvekov S, Voth GA (2005b) Multiscale coarse graining of liquid-state systems. J Chem Phys 123:134105 Jagielska A, Wroblewska L, Skolnick J (2008) Protein model refinement using an optimized physics-based all-atom force field. Proc Natl Acad Sci USA 24:8268–8273 Jauch R, Yeo HC, Kolatkar PR, Clarke DN (2007) Assessment of CASP7 structure predictions for template free targets. Proteins: Struct Funct Bioinf 69(S8):57–67 Jernigan RL, Bahar I (1996) Structure-derived potentials and protein simulations. Curr Opin Struct Biol 6:195–209 Jiang L, Gao Y, Mao F, Liu Z, Lai L (2002) Potential of mean force for protein–protein interaction studies. 
Proteins: Struc Func Gen 46:190–196 Jones D, Thornton J (1993) Protein fold recognition. J Comput-Aided Mol Des 7:439–456 Ka´zmierkiewicz R, Liwo A, Scheraga HA (2002) Energy-based reconstruction of a protein backbone from its alpha-carbon trace by a Monte Carlo method. J Comput Chem 23: 715–723 Ka´zmierkiewicz R, Liwo A, Scheraga HA (2003) Addition of side chains to a known backbone with defined side-chain centroids. Biophys Chem 100:261–280 (Erratum: Biophys Chem 106 91 (2003)) Khalili M, Liwo A, Rakowski F, Grochowski P, Scheraga HA (2005a) Molecular dynamics with the united-residue model of polypeptide chains. I Lagrange equations of motion and tests of numerical stability in the microcanonical model. J Phys Chem B 109:13785–13797 Khalili M, Liwo, Jagielska A, Scheraga H (2005b) Molecular dynamics with the united-residue model of polypeptide chains. II. Langevin and Berendsen-bath dynamics and tests on model alpha-helical systems. J Phys Chem B 109:13798–13810 Khalili M, Liwo A, Scheraga H (2006) Kinetic studies of folding of the B-domain of staphylococcal protein A with molecular dynamics and a united-residue (UNRES) model of polypeptide chains. J Mol Biol 355:536–547 Klimov DK, Thirumalai D (1996a) Criterion that determines the foldability of proteins. Phys Rev Lett 76:4070–4073 Klimov DK, Thirumalai D (1996b) Factors governing the foldability of proteins. Proteins: Struct Func Genet 26:411–441 Klimov DK, Thirumalai D (1998) Linking rates of folding in lattice models of proteins with underlying thermodynamic characteristics. J Chem Phys 109:4119–4125

76

C. Czaplewski et al.

Kmiecik S, Kurci´nski M, Rutkowska A, Gront D, Koli´nski A (2006) Denatured proteins and early folding intermediates simulated in a reduced conformational space. Acta Biochim Pol 53: 131–143 Kmiecik S, Koli´nski A (2007) Characterization of protein-folding pathways by reduced-space modeling. Proc Natl Acad Sci USA 104:12330–12335 Kmiecik S, Koli´nski A (2008) Folding pathway of the B1 domain of protein G explored by multiscale modeling. Biophys J 94:726–736 Koli´nski A, Skolnick J (1992) Discretized model of proteins. I. Monte Carlo study of cooperativity in homopolypeptides. J Chem Phys 97:9412–9426 Koli´nski A, Godzik A, Skolnick J (1993) A general method for the prediction of the threedimensional structure and folding pathway of globular proteins: Application to designed helical proteins. J Chem Phys 98:7420–7433 Koli´nski A, Skolnick J (1994a) Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins Struct Func Genet 18:338–352 Koli´nski A, Skolnick J (1994b) Monte Carlo simulations of protein folding. II. Application to protein A ROP and crambin. Proteins Struct Func Genet 18:353–366 Koli´nski A, Galazka W, Skolnick J (1995) Computer design of idealized beta-motifs. J Chem Phys 23:10286–10297 Koli´nski A, Galazka W, Skolnick J (1996) On the origin of the cooperativity of protein folding: implications from model simulations. Proteins: Struct Func Genet 26:271–297 Koli´nski A, Skolnick J (1997) Determination of secondary structure of polypeptide chains: interplay between short range and burial interactions. J Chem Phys 107:953–964 Koli´nski A, Skolnick J (1998) Assembly of protein structure from sparse experimental data: an efficient Monte Carlo model. Proteins: Struct Func Genet 32:475–494 Koli´nski A, Gront D, Pokarowski P, Skolnick J (2003) A simple lattice model that exhibits a protein-like cooperative all-or-none folding transition. 
Biopolymers 69:399–405 Koli´nski A (2004) Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 51:349–371 Koli´nski A, Skolnick J (2004) Reduced models of proteins and their applications. Polymer 45: 511–524 Koli´nski A, Bujnicki J (2005) Generalized protein structure prediction based on combination of fold-recognition with de novo folding, evaluation of models. Proteins: Struct Funct Bioinf 61:84–90 Kortemme T, Morozov AV, Baker D (2003) An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein–protein complexes. J Mol Biol 326:1239–1259 Kozłowska U, Liwo A, Scheraga HA (2007) Determination of virtual-bond-angle potentials of mean force for coarse-grained simulations of protein structure and folding from ab initio energy surfaces of terminally-blocked glycine, alanine, and proline. J Phys: Cond Matter 19:285203 Kozłowska U, Liwo A, Scheraga HA (2010a) Determination of side-chain-rotamer and side-chain and backbone virtual-bond-stretching potentials of mean force from AM1 energy surfaces of terminally-blocked amino-acid residues, for coarse-grained simulations of protein structure and folding. I. The method. J Comput Chem 31:1143–1153 Kozłowska U, Maisuradze GG, Liwo A, Scheraga HA (2010b) Determination of side-chainrotamer and side-chain and backbone virtual-bond-stretching potentials of mean force from AM1 energy surfaces of terminally-blocked amino-acid residues, for coarse-grained simulations of protein structure and folding. II. Results, comparison with statistical potentials and implementation in the UNRES force field. J Comput Chem. 31:1154–1167 Kryshtafovych A, Venclovas C, Fidelis K, Moult J (2005) Progress over the first decade of CASP experiments. Proteins: Struct Funct Bioinf 61(S7): 225–236 Kryshtafovych A, Fidelis K (2009) Protein structure prediction and model quality assessment. 
Drug Disc Today 14:386–393 Kubo R (1962) Generalized cumulant expansion method. J Phys Soc Jpn 17:1100–1120

3

Protein Coarse-Grained Models

77

Kumar S, Bouzida D, Swendsen RH, Kollman PA, Rosenberg JM (1992) The weighted histogram analysis method for free-energy calculations on biomolecules. 1. The method. J Comput Chem 13:1011–1021 Kumar S, Rosenberg JM, Bouzida D, Swendsen RH, Kollman PA (1995) Multidimensional free-energy calculations using the weighted histogram analysis method. J Comput Chem 16: 1339–1350 Kuntz ID, Crippen GM, Kollman PA, Kimelman D (1976) Calculation of protein tertiary structure. J Mol Biol 106:983–994 Latek D, Ekonomiuk D, Koli´nski A (2007) Protein structure prediction: Combining de novo modeling with sparse experimental data. J Comput Chem 28:1668–1676 Lazaridis T, Karplus M (2000) Effective energy function for protein structure prediction. Curr Opin Struct Biol 10:139–145 Lee J, Liwo A, Scheraga HA (1999) Energy-based de novo protein folding by conformational space annealing and an off-lattice united-residue force field: Application to the 10–55 fragment of staphylococcal protein A and to apo calbindin D9K. Proc Natl Acad Sci USA 96:2025–2030 Lee J, Liwo A, Ripoll DR, Pillardy J, Saunders JA, Gibson KD, Scheraga HA (2000) Hierarchical energy-based approach to protein-structure prediction; blind-test evaluation with CASP3 targets. Int J Quant Chem 77:90–117 Lee J, Park K, Lee J (2002) Full optimization of linear parameters of a united residue protein potential. J Phys Chem B 106:11647–11657 Lee SY, Zhang Y, Skolnick J (2006) TASSER-based refinement of NMR structures. Proteins: Struct Func Bioinf 63:451–456 Lei HX, Duan Y (2007) Improved sampling methods for molecular simulation. Curr Opinion Struct Biol 17:187–191 Levitt M, Warshel A (1975) Computer simulation of protein folding. Nature 253:694–698 Levitt M (1976) A simplified representation of protein conformations for rapid simulation of protein folding. J Mol Biol 104:59–107 Levitt M, Chothia C (1976) Structural patterns in globular proteins. 
Nature 261:552–558 Liu P, Shi Q, Lyman E, Voth GA (2008) Reconstructing atomistic detail for coarse-grained models with resolution exchange. J Chem Phys 129:114103 Liwo A, Pincus MR, Wawak RJ, Rackovsky S, Scheraga HA (1993a) Calculation of protein backbone geometry from alpha-carbon coordinates based on peptide-group dipole alignment. Protein Sci 2:1697–1714 Liwo A, Pincus MR, Wawak RJ, Rackovsky S, Scheraga HA (1993b) Prediction of protein conformation on the basis of a search for compact structures; test on avian pancreatic polypeptide. Protein Sci 2:1715–1731 Liwo A, Ołdziej S, Pincus MR, Wawak RJ, Rackovsky S, Scheraga HA (1997a) A united-residue force field for off-lattice protein-structure simulations. I. Functional forms and parameters of long-range side-chain interaction potentials from protein crystal data. J Comput Chem 18: 849–873 Liwo A, Pincus MR, Wawak RJ, Rackovsky S, Ołdziej S, Scheraga HA (1997b) A united-residue force field for off-lattice protein-structure simulations. II: Parameterization of local interactions and determination of the weights of energy terms by Z-score optimization. J Comput Chem 18:874–887 Liwo A, Ka´zmierkiewicz R, Czaplewski C, Groth M, Ołdziej S, Wawak RJ, Rackovsky S, Pincus MR, Scheraga HA (1998) United-residue force field for off-lattice protein-structure simulations. III: Origin of backbone hydrogen-bonding cooperativity in united-residue potentials. J Comput Chem 19:259–276 Liwo A, Lee J, Ripoll DR, Pillardy J, Scheraga HA (1999) Protein structure prediction by global optimization of a potential energy function. Proc Natl Acad Sci USA 96:5482–5485 Liwo A, Czaplewski C, Pillardy J, Scheraga HA (2001) Cumulant-based expressions for the multibody terms for the correlation between local and electrostatic interactions in the united-residue force field. J Chem Phys 115:2323–2347

78

C. Czaplewski et al.

Liwo A, Arłukowicz P, Czaplewski C, Ołdziej S, Pillardy J, Scheraga HA (2002) A method for optimizing potential-energy functions by a hierarchical design of the potential-energy landscape: application to the UNRES force field. Proc Natl Acad Sci USA 99:1937–1942 Liwo A, Arłukowicz P, Ołdziej S, Czaplewski C, Makowski M, Scheraga HA (2004a) Optimization of the UNRES force field by hierarchical design of the potential-energy landscape. I: Tests of the approach using simple lattice protein models. J Phys Chem B 108:16918–16933 Liwo A, Ołdziej S, Czaplewski C, Kozłowska U, Scheraga HA (2004b) Parameterization of backbone-electrostatic and multibody contributions to the UNRES force field for proteinstructure prediction from ab initio energy surfaces of model systems. J Phys Chem B 108:9421–9438 Liwo A, Khalili M, Scheraga HA (2005) Ab initio simulations of protein-folding pathways by molecular dynamics with the united-residue model of polypeptide chains. Proc Natl Acad Sci USA 102:2362–2367 Liwo A, Khalili M, Czaplewski C, Kalinowski S, Ołdziej S, Wachucik K, Scheraga HA (2007) Modification and optimization of the united-residue (UNRES) potential energy function for canonical simulations I. Temperature dependence of the effective energy function and tests of the optimization method with single training proteins. J Phys Chem B 111:260–285 Liwo A, Czaplewski C, Ołdziej S, Rojas AV, Ka´zmierkiewicz R, Makowski M, Murarka RK, Scheraga HA (2008a) Simulation of protein structure and dynamics with the coarse-grained UNRES force field. In: G Voth (ed) Coarse-graining of condensed phase and biomolecular systems. CRC Press Taylor & Francis, Farmington, CT Liwo A, Czaplewski C, Ołdziej S, Kozłowska U, Makowski M, Kalinowski S, Ka´zmierkiewicz R, Shen H, Maisuradze G, Scheraga HA (2008b) Optimization of the physics-based united-residue force field (UNRES) for protein folding simulations. 
In: Muenster G, Wolf D, Kremer M (eds) NIC Symposium 20–21 February (2008) Juelich Germany NIC Series 39, John von Neumann Institute for Computing (NIC), Germany, pp 63–70 Liwo A, Ołdziej S, Czaplewski C, Kleinerman DS, Blood P, Scheraga HA (2010) Implementation of molecular dynamics and its extensions with the coarse-grained UNRES force field on massively parallel systems: toward millisecond-scale simulations of protein structure, dynamics, and thermodynamics. J Chem Theory Comput 6:890–909 Lu H, Skolnick J (2001) A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 44:223–232 Lu H, Lu L, Skolnick J (2003) Development of unified statistical potentials describing protein– protein interactions. Biophys J 84:1895–1901 Lyman E, Ytreberg F, Zuckerman D (2006) Resolution exchange simulation. Phys Rev Lett 96:028105 Maiorov VN, Crippen GM (1992) Contact potential that recognizes the correct folding of globular proteins. J Mol Biol 227:876–888 Makowski M, Liwo A, Scheraga HA (2007a) Simple physics-based analytical formulas for the potentials of mean force for the interaction of amino acid side chains in water. 1. Approximate expression for the free energy of hydrophobic association based on a Gaussian-overlap model. J Phys Chem B 111:2910–2916 Makowski M, Liwo A, Maksimiak K, Makowska J, Scheraga HA (2007b) Simple physics-based analytical formulas for the potentials of mean force for the interaction of amino acid side chains in water. 2. Tests with simple spherical systems. J Phys Chem B 111:2917–2924 Makowski M, Sobolewski E, Czaplewski C, Liwo A, Ołdziej S, No JH, Scheraga HA (2007c) Simple physics-based analytical formulas for the potentials of mean force for the interaction of amino acid side chains in water. 3. Calculation and parameterization of the potentials of mean force of pairs of identical hydrophobic side chains. 
J Phys Chem B 111:2925–2931 Makowski M, Sobolewski E, Czaplewski C, Ołdziej S, Liwo A, Scheraga HA (2008) Simple physics-based analytical formulas for the potentials of mean force for the interaction of amino acid side chains in water. 4. Pairs of different hydrophobic side chains. J Phys Chem B 112:11385–11395

3

Protein Coarse-Grained Models

79

Maupetit J, Tuffrey P, Derreumaux P (2007) A coarse-grained protein force field for folding and structure prediction. Proteins: Struct Func Genet 69:394–408 Meller J, Elber R (2001) Linear programming optimization and a double statistical filter for protein threading protocols. Proteins: Struct Func Genet 45:241–261 Meller J, Elber R (2002) Protein recognition by sequence-to-structure fitness: Bridging efficiency and capacity of threading models. Adv Chem Phys 120 77–130 Melo F, Feytmans E (1997) Novel knowledge-based mean force potential at atomic level. J Mol Biol 267:207–222 Melo F, Feytmans E (1998) Assessing protein structures using a non-local atomic interaction energy. J Mol Biol 277:1141–1152 Miller RT, Jones DT, Thornton JM (1996) Protein fold recognition by sequence threading: tools and assessment techniques. FASEB J 10:171–178 Ming DM, Bruschweiler R (2006) Reorientational contact-weighted elastic network model for the prediction of protein dynamics: comparison with NMR relaxation. Biophys J 90:3382–3388 Mitchell JBO, Laskowski RA, Alex A, Thornton JM (1999) BLEEP – Potential of mean force describing protein–ligand interactions: I. Generating potential. J Comput Chem 20:1165–1176 Miyazawa S, Jernigan RL (1985) Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18:534–552 Miyazawa S, Jernigan RL (1996) Residue–residue potentials with a favorable contact pair term and unfavorable high-packing density term for simulation and threading. J Mol Biol 256:623–644 Miyazawa S, Jernigan RL (1999) Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins: Struct Func Genet 34:49–68 Miyazawa S, Jernigan RL (2005) How effective for fold recognition is a potential of mean force that includes relative orientations between contacting residues in proteins? 
J Chem Phys 122:024901 Momany FA, McGuire RF, Burgess AW, Scheraga HA (1975) Energy parameters in polypeptides. VII. Geometric parameters partial atomic charges non-bonded interactions hydrogen bond interactions and intrinsic torsional potential for the naturally occurring amino-acids. J Phys Chem 79:2361–2381 Monticelli L, Kandasamy SK, Periole X, Larson RG, Tieleman DP, Marrink S-J (2008) The MARTINI coarse-grained force field: extension to proteins. J Chem Theor Comput 4: 819–834 Moont G, Gabb H, Sternberg M (1999) Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins 35:364–373 Moritsugu K, Smith JC (2008) REACH coarse-grained biomolecular simulation: Transferability between different protein structural classes. Biophys J 95:1639–1648 Moult J (1997) Comparison of database potentials and molecular mechanics force fields. Curr Opin Struct Biol 7:194–199 Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A (2007) Critical assessment of methods of protein structure prediction – Round VII. Proteins: Struct Func Bioinf 69(S8): 3–9 Mousseau N, Derreumaux P (2008) Exploring energy landscapes of protein folding and aggregation. Front Biosci 13:4495–4516 Mukherjee A, Bhimalapuram P, Bagchi B (2005) Orientation-dependent potential of mean force for protein folding. J Chem Phys 123:014901 Nanias M, Czaplewski C, Scheraga H A (2006) Replica exchange and multicanonical algorithms with the coarse-grained united-residue (UNRES) force field. J Chem Theor Comput 2:513–528 Noid WG, Chu J-W, Ayton GS, Voth GA (2007) Multiscale coarse-graining and structural correlations: connections to liquid state theory. J Phys Chem B 111:4116–4127 Noid WG, Chu J-W, Ayton GS, Krishna V, Izvekov S, Voth GA, Das A, Andersen HC (2008) The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models. J Chem Phys 128:244144

80

C. Czaplewski et al.

Obatake M, Crippen GM 1981 Residue–residue potential function for conformational analysis of proteins. J Phys Chem 85:1187–1197 Ołdziej S, Kozłowska U, Liwo A, Scheraga HA (2003) Determination of the potentials of mean force for rotation about C-alpha–C-alpha virtual bonds in polypeptides from the ab initio energy surfaces of terminally-blocked glycine, alanine, and proline. J Phys Chem A 107:8035–8046 Ołdziej S, Łagiewka J, Liwo A, Czaplewski C, Chinchio M, Nanias M, Scheraga HA (2004) Optimization of the UNRES force field by hierarchical design of the potential-energy landscape. 3. Use of many proteins in optimization. J Phys Chem B 108:16950–16959 Ołdziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila J, Khalili M, Arnautova YA, Jagielska A, Makowski M, Schafroth HD, Ka´zmierkiewicz R, Ripoll DR, Pillardy J, Saunders J, Kang Y, Gibson K, Scheraga HA (2005) Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: Assessment in two blind tests. Proc Natl Acad Sci USA 102:7547–7552 Olszewski KA, Koli´nski A, Skolnick J (1996) Folding simulations and computer redesign of protein three-helix bundle motifs. Proteins: Struct Func Genet 25:286–299 Paluszewski M, Karplus K (2009) Model quality assessment using distance constraints from alignments. Proteins: Struct Func Bioinf 75:540–549 Panchenko A, Marchler-Bauer A, Bryant S (2000) Combination of threading potentials and sequence profiles improves fold recognition. J Mol Biol 296:1319–1331 Pillardy J, Czaplewski C, Liwo A, Lee J, Ripoll DR, Ka´zmierkiewicz R, Ołdziej S, Wedemeyer WJ, Gibson KD, Arnautova YA, Saunders J, Ye Y-J, Scheraga HA (2001a) Recent improvements in prediction of protein structure by global optimization of a potential energy function. 
Proc Natl Acad Sci USA 98:2329–2333 Pillardy J, Czaplewski C, Liwo A, Wedemeyer WJ, Lee J, Ripoll DR, Arłukowicz P, Ołdziej S, Arnautova YA, Scheraga HA (2001b) Development of physics-based energy functions that predict medium-resolution structure for proteins of the alpha, beta and alpha/beta structural classes. J Phys Chem B 105:7299–7311 Pincus MR, Scheraga HA (1977) An approximate treatment of long-range interactions in proteins. J Phys Chem 81:1579–1583 Pincus DL, Cho SS, Hyeon HC, Thirumalai D (2008) Minimal models for proteins and RNA: From folding to function. Prog Mol Biol Transl Sci 84:203–250 Ponnuswamy PK, Warme PK, Scheraga HA (1973) Role of medium-range interactions in proteins. Proc Natl Acad Sci USA 70:830–833 Prentiss MC, Hardin C, Eastwood MP, Zong CH, Wolynes PG (2006) Protein structure prediction: The next generation. J Chem Theor Comput 2:705–716 Prentiss MC, Wales DJ, Wolynes PG (2008) Protein structure prediction using basin-hopping. J Chem Phys 128:225106 Reva BA, Finkelstein AV, Sanner MF, Olson AJ (1997) Residue–residue mean-force potentials for protein structure recognition. Prot Eng 10:865–876 Rey A, Skolnick J (1993) Computer modeling and folding of four-helix bundles. Proteins: Struct Func Genet 16:8–28 Rhee YM, Pande VS (2003) Multiplexed-replica exchange molecular dynamics method for protein folding simulation. Biophys J 84:775–786 Rohl CA, Strauss CE, Misura KM, Baker D (2004) Protein structure prediction used ROSETTA. Methods Enzymol 383:66–93 Rojas AV, Liwo A, Scheraga HA (2007) Molecular dynamics with the united-residue force field: Ab initio folding simulations of multichain proteins. J Phys Chem B 111:293–309 Rojnuckarin A, Subramaniam S (1999) Knowledge-based interaction potentials for proteins. Proteins: Struct Func Genet 36:54–67 Rooman MJ, Kocher J-PA, Wodak SJ (1991) Prediction of protein backbone conformation based on seven structure assignments. Influence of local interactions. 
J Mol Biol 211:961–979 Rooman MJ, Kocher J-PA, Wodak SJ (1992) Extracting information on folding from the amino acid sequence: accurate predictions for protein regions with preferred conformation in the absence of tertiary interactions. Biochemistry 31:10226–10238

3

Protein Coarse-Grained Models

81

Russ WP, Ranganathan R (2002) Knowledge-based potential functions in protein design. Curr Opin Struct Biol 12:447–452 Sali A, Shakhnovich EI, Karplus M (1994a) How does a protein fold? Nature 369:248–251 Sali A, Shakhnovich EI, Karplus M (1994b) Kinetics of protein-folding – a lattice model study of the requirements for folding to the native-state. J Mol Biol 235:1614–1638 Samudrala R, Moult J (1998) An all-atom distance dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 275:895–916 Sasai M, Wolynes PG (1990) Molecular theory of associative memory Hamiltonian models of protein folding. Phys Rev Lett 65:2740–2743 Saunders JA, Scheraga HA (2003a) Ab initio structure prediction of two alpha-helical oligomers with a multiple-chain united-residue force field and global search. Biopolymers 68:300–317 Saunders JA, Scheraga HA (2003b) Challenges in structure prediction of oligomeric proteins at the united-residue level: searching the multiple-chain energy landscape with CSA and CFMC procedures. Biopolymers 68:318–332 Scheraga HA (1988) Approaches to the multiple-minima problem. In: Clementi E, Chin S (eds) Biological and artificial intelligence system. ESCOM Science, Leiden Scheraga HA (1996) Recent developments in the theory of protein folding: searching for the global energy-minimum. Biophys Chem 59:329–339 Scheraga HA, Liwo A, Ołdziej S, Czaplewski C, Pillardy J, Ripoll DR, Vila JA, Ka´zmierkiewicz R, Saunders JA, Arnautova YA, Jagielska A, Chinchio M, Nanias M (2004) The protein folding problem: global optimization of force fields. Front Biosci 9:3296–3323 Scheraga HA, Khalili M, Liwo A (2007) Protein-folding dynamics: Overview of molecular simulation techniques. Annu Rev Phys Chem 58:57–83 Schug A, Hyeon C, Onuchic JN (2008) Coarse-grained structure-based simulations of proteins and RNA. In: Voth G (ed) Coarse-graining of condensed phase and biomolecular systems. 
CRC Press Taylor & Francis, Farmington, CT Schug A, Wenzel W (2006) An evolutionary strategy for all-atom folding of the 60-amino-acid bacterial ribosomal protein L20. Biophys J 90:4273–4280 Seetharamulu P, Crippen GM (1991) A potential function for protein folding. J Math Chem 6: 91–110 Shakhnovich EI (1997) Theoretical studies of protein-folding thermodynamics and kinetics. Curr Opinion Struct Biol 7:29–40 Shakhnovich EI (2006) Protein folding thermodynamics and dynamics: where physics chemistry and biology meet. Chem Rev 106:1559–1588 Shell MS, Ozkan SB, Voelz V, Wu GA, Dill KA (2009) Blind test of physics-based prediction of protein structures. Biophys J 96:917–924 Shen M-Y, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15:2507–2524 Shimizu S, Chan HS (2001) Anti-cooperativity in hydrophobic interactions: a simulation study of spatial dependence of three-body effects and beyond. J Chem Phys 115:1414–1421 Sikorski A, Koli´nski A, Skolnick J (1998) Computer simulations of de novo designed helical proteins. Biophys J 75:92–105 Simon I, Glasser L, Scheraga HA (1991) Calculation of protein conformation as an assembly of stable overlapping segments: Application to bovine pancreatic trypsin inhibitor. Proc Natl Acad Sci USA 88:3661–3665 Simons KT, Koopernberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring function. J Mol Biol 268:209–225 Sippl MJ (1990a) Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 213:859–883 Sippl MJ (1990b) Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J Mol Biol 216:167–180

82

C. Czaplewski et al.

Sippl MJ (1993) Boltzmann’s principle knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J Comput-Aid Mol Des 7:473–501 Skolnick J, Jaroszewski L, Koli´nski A, Godzik A (1997a) Derivation and testing of pair potential for protein folding. When is the quasi-chemical approximation correct? Protein Sci 6:676–688 Skolnick J, Koli´nski A, Ortiz AR (1997b) MONSSTER: A method for folding globular proteins with a small number of distance constraints. J Mol Biol 265:217–241 Skolnick J, Koli´nski A, Ortiz A (2000) Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins 38:3–16 Skolnick J, Koli´nski A, Kihara D, Betancourt M, Rotkiewicz P, Boniecki M (2001) Ab initio protein structure prediction via a combination of threading lattice folding clustering and structure refinement. Proteins: Struct Func Genet 45(S 5):149–156 Skolnick J, Zhang Y, Arakaki AK, Koli´nski A, Boniecki M, Szilagyi A, Kihara D (2003) TOUCHSTONE: a unified approach to protein structure prediction. Proteins: Struct Func Genet 53 (S 6):469–479 Snow ME (1992) Powerful simulated-annealing algorithm locates global minimum of proteinfolding potentials from multiple starting points. J Comput Chem 13:579–584 Sorenson JM, Head-Gordon T (2002) Toward minimalist models of larger proteins: a ubiquitin-like protein. Proteins: Struct Func Genet 46:368–379 Stewart JJ (1990) MOPAC – a semiempirical molecular orbital program. J Comput-Aided Mol Des 4:1–105 St-Pierre JF, Mousseau N, Derreumaux P (2008) The complex folding pathways of protein A suggest a multiple-funnelled energy landscape. J Chem Phys 128:045101 Sugita Y, Okamoto Y (1999) Replica-exchange molecular dynamics method for protein folding. Chem Phys Lett 314:141–151 Summa CM, Levitt M, Degrado WF (2005) An atomic environment potential for use in protein structure prediction. 
J Mol Biol 352:986–1001 Sun S (1993) Reduced representation model of protein structure prediction: Statistical potential and genetic algorithms. Protein Sci 2:762–785 Takada S (2001) Protein folding simulation with solvent-induced force field: Folding pathway ensemble of three-helix-bundle proteins. Proteins: Struct Func Genet 42:85–98 Taketomi H, Ueda Y, G¯o N (1975) Studies on protein folding unfolding and fluctuations by computer simulation. 1. Effect of specific amino-acid sequence represented by specific inter-unit interactions. Int J Peptide Protein Res 7:445–459 Tanaka S, Scheraga HA (1976) Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structure of proteins. Macromolecules 9:945–950 Thirumalai D, Klimov DK (1999) Deciphering the timescales and mechanisms of protein folding using minimal off-lattice models. Curr Opin Struc Biol 9:197–207 Thomas PD, Dill KA (1996) Statistical potentials extracted from protein structures: How accurate are they? J Mol Biol 257:457–469 Thorpe IF, Zhou J, Voth GA (2008) Peptide folding using multiscale coarse-grained models. J Phys Chem B 112:13079–13090 Tiana G, Broglia RA (2001) Statistical analysis of native contact formation in the folding of designed model proteins. J Chem Phys 114:2503–2510 Tobi D, Elber R (2000) Distance-dependent pair potential for protein folding: Results from linear optimization. Proteins 41:40–46 Tobi D, Shafran G, Linial N, Elber R (2000) On the design and analysis of protein folding potentials. Proteins 40:71–85 Tobi D, Bahar I (2005) Structural changes involved in protein binding correlate with intrinsic motions of proteins in the unbound state. Proc Natl Acad Sci USA 102:18908–18913 Tozzini V (2005) Coarse-grained models for proteins. Curr Opinion Struct Biol 15:144–150 Ueda Y, Taketomi H, G¯o N (1978) Studies on protein folding unfolding and fluctuations by computer simulation. II. A three-dimensional lattice model of lysozyme. 
Biopolymers 17:1531–1548

3

Protein Coarse-Grained Models

83

Vásquez M, Scheraga HA (1985) Use of buildup and energy-minimization procedures to compute low-energy structures of the backbone of enkephalin. Biopolymers 24:1437–1447 Veitshans T, Klimov D, Thirumalai D (1996) Protein kinetics: timescales pathways and energy landscapes in terms of sequence-dependent properties. Fold Des 2:1–22 Vendruscolo M, Domany E (1998) Pairwise contact potentials are unsuitable for protein folding. J Chem Phys 109:11101–11108 Vendruscolo M, Najmanovich R, Domany E (1999) An optimal derivation of a potential for protein folding. Phys A 262:35–39 Vendruscolo M, Najmanovich R, Domany E (2000) Can a pairwise contact potential stabilize native protein folds against decoys obtained by threading? Proteins 38:134–148 Vieth M, Koli´nski A, Brooks CL, Skolnick J (1994) Prediction of the folding pathways and structure of the gcn4 leucine zipper. J Mol Biol 237:361–367 Vila JA, Ripoll DR, Scheraga HA (2003) Atomistically detailed folding simulations of the B-domain of staphylococcal protein A from random structures. Proc Natl Acad Sci USA 100:14812–14816 Voth G (ed) (2008) Coarse-graining of condensed phase and biomolecular systems. CRC Press Taylor & Francis, Farmington, CT Wako H, Scheraga HA (1982a) Distance-constraint approach to protein folding. I. Statistical analysis of protein conformations in terms of distance between residues. J Prot Chem 1:5–45 Wako H, Scheraga HA (1982b) Distance-constraint approach to protein folding. II. Prediction of three-dimensional structure of bovine pancreatic trypsin inhibitor. J Prot Chem 1:85–117 Wallqvist A, Ullner M (1994) A simplified amino acid potential for use in structure predictions of proteins. Proteins Struct Func Genet 18:267–280 Wang J, Wang W (1999) A computational approach to simplifying the protein folding alphabet. Nat Struct Biol 6:1033–1038 Wang Y, Noid WG, Liu P, Voth GA (2009) Effective force coarse-graining. 
Phys Chem Chem Phys 11:2002–2015 Wei G, Derreumaux P (2002) Exploring the energy landscape of proteins: a characterization of the activation–relaxation technique. J Chem Phys 117:11379–11387 Wei GH, Mousseau N, Derreumaux P (2007) Computational simulations of the early steps of protein aggregation. Prion 1:3–8 Wolynes PG (2005) Energy landscapes and solved protein-folding problems. Philos Transact A Math Phys Eng Sci 363:453–464 Wood J, Hirst JD (2005) Protein secondary structure prediction with dihedral angles. Proteins: Struct Funct Bioinf 53:476–481 Yang JS, Chen WW, Skolnick J, Shakhnovich EI (2007) All-atom ab initio folding of a diverse set of proteins. Structure 75:53–63 Yˇcas M, Goel NS, Jacobsen JW (1978) On the computation of the tertiary structure of globular proteins. J Theor Biol 72:443–457 Zhang C, Liu S, Zhou H, Zhou Y (2004) The dependence of all-atom statistical potentials on structural training database. Biophys J 86:3349–3358 Zhang Y, Arakaki A, Skolnick J (2005) TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Prot Struct Funct Bioinf 69(S7): 91–98 Zhou H, Zhou Y (2004) Single-body knowledge based energy score combined with sequenceprofile and secondary structure information for fold recognition. Proteins 55:1005–1013 Zhou Y, Zhou H, Zhang C, Liu S (2006) What is a desirable statistical energy function for proteins and how can it be obtained. Cell Biochem Biophys 46:165–174 Zhou HY, Pandit SB, Lee SY, Borreguerro J, Chen HL, Wroblewska L, Skolnick J (2007a) Analysis of TASSER-based CASP7 protein structure prediction results. Proteins: Struct Func Bioinf. 60 (S 8):90–97 Zhou J, Thorpe I, Izvekov S, Voth GA (2007b) Coarse-grained peptide modeling using a systematic multiscale approach. Biophys J 92:4289–4303

Chapter 4

Conformational Sampling in Structure Prediction and Refinement with Atomistic and Coarse-Grained Models

Michael Feig, Srinivasa M. Gopal, Kanagasabai Vadivel, and Andrew Stumpff-Kane

Abstract The performance of models at different resolutions is compared in the context of an iterative protein structure refinement protocol. The models considered here consist of an all-atom model described with the CHARMM22 force field in combination with a distance-dependent dielectric implicit solvent approximation, a united-atom model described by the CHARMM19 force field in combination with the effective energy function 1 (EEF1) solvent model, the new intermediate coarse-grained model PRotein Intermediate MOdel (PRIMO), with both Generalized Born and distance-dependent dielectric solvent models, and the lattice-based coarse-grained Side CHain Only (SICHO) model. It is found that the CHARMM19 and SICHO models lead to initially rapid refinement for a set consisting of 11 targets from past Critical Assessment of protein Structure Prediction (CASP) competitions, but they are eventually outperformed by the CHARMM22 and PRIMO models in consistently reaching near-native conformations over the course of 100 refinement cycles.

M. Feig (✉), Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA; Department of Chemistry, Michigan State University, East Lansing, MI, USA; e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_4, © Springer Science+Business Media, LLC 2011

4.1 Introduction

The most direct approach to protein structure prediction involves the simulation of folding from an initially unfolded structure to the native state at the global free energy minimum (Osguthorpe 2000; Bonneau and Baker 2001). Such an ab initio protocol is conceptually very appealing because it makes few or no a priori assumptions about the native state. However, the practical relevance of ab initio structure prediction is diminishing because template-based models can be obtained today for the large majority of structure prediction targets (Moult et al. 2009). This is a consequence of the increased number of experimental structures in the Protein Data
Bank (PDB) (Berman et al. 2000) and the emergence of new methods that assemble protein structures from structural fragments from a variety of known structures (Rohl et al. 2004; Zhang et al. 2005). Fragment-based methods have been especially successful in predicting structures for “difficult” targets for which sequence homology-based templates that cover the entire sequence are not available (Simons et al. 1999; Zhang et al. 2005). In fact, results from recent rounds of CASP suggest that the structures of virtually any target could have been predicted at least approximately using templates from the PDB and that a lack of good predictions for the most difficult targets is largely a consequence of not being able to identify and assemble suitable templates (Zhang and Skolnick 2005). Protein structures are often used as starting structures for detailed mechanistic studies with computational techniques and as targets for rational drug design. Both applications often require structures at experimental or near-experimental accuracy. Structure predictions at near-experimental accuracy are typically only possible if structures from highly homologous proteins are available as templates. Otherwise, structure refinement is needed to improve initial predictions to near-experimental accuracy. Such refinement would not be able to take advantage of known structures anymore and instead rely on ab initio-like sampling protocols that can find the native state solely based on a given energy function. As a consequence, there has been renewed interest in ab initio prediction protocols. This time there is more of a focus on solving the refinement problem than on folding from extended structures. In both, ab initio folding and refinement, successful methods require a combination of effective conformational sampling methods with realistic underlying energy functions. 
The sampling method has to be able to reach the native state within available computational resources while the native state should be located at the global free energy minimum of the energy function. Furthermore, the energy should gradually decrease toward the native basin to reflect funnel-shaped energy landscapes of real proteins and guide the conformational sampling toward the native state. One approach to structure refinement is the application of molecular dynamics simulations with atomistic classical force fields to provide a realistic energetic description of solvated proteins in a theoretically rigorous framework. In practice, such simulations are rarely successful in significantly refining structures because extremely long and prohibitively expensive simulations are required to overcome kinetic barriers (Fan and Mark 2004). At the same time, current force fields still exhibit subtle conformational preferences that may in some cases bias the resulting structures toward non-native conformations even if simulations could be carried out for sufficiently long times (Feig et al. 2003; Hornak et al. 2006; Best et al. 2008; Freddolino et al. 2009). A more practical approach involves simplified protein representations in conjunction with empirical energy functions that are sampled via molecular dynamics or Monte Carlo methods (Skolnick et al. 1997; Chinchio et al. 2006; Gopal et al. 2009). Simplified protein models allow comprehensive conformational sampling with moderate computational resources of a wide variety of conformations, including native-like conformations. However, the energy functions used with such models are usually not sufficiently reliable to consistently converge to the native state at the free energy minimum.


One strategy to overcome difficulties with ab initio methods is to separate conformational sampling and scoring. In this case, a computationally highly efficient method based on an approximate model is used for the sampling stage to quickly and broadly cover conformational space. Representative structures are then extracted from the sampling and evaluated with an accurate scoring function, often at a higher model resolution, to identify the most native-like conformations. Such a protocol is successful if the sampling method is able to generate at least some native-like conformations (not necessarily with the lowest free energy) and if the scoring function can discriminate native-like conformations from other non-native conformations within a given decoy set. A decoupling of sampling and scoring methods then also allows separate assessment and optimization toward the different objectives of the sampling and scoring stages. This chapter will primarily discuss conformational sampling methods and in particular the effect of different model resolutions on sampling efficiency and accuracy. Furthermore, practical relevance dictates that we focus primarily on structure refinement rather than on ab initio prediction from extended structures. In the following, we will first introduce a framework for assessing sampling efficiency within the context of structure refinement and then present concrete examples of different sampling strategies with models at different resolutions.

4.2 Iterative Structure Refinement Framework

Fig. 4.1 Iterative structure refinement scheme

In order to establish a framework for analyzing conformational sampling for structure refinement applications, one may consider an iterative structure refinement scheme (Stumpff-Kane et al. 2008). In such a protocol, depicted in Fig. 4.1, an initial structure, typically obtained as a result of template-based modeling, is subjected to conformational sampling with the goal of generating a large number of diverse decoys. A scoring function is then applied to select one conformation from
the set of decoys to be used as the starting structure for the next round of sampling. The sampling–scoring cycles are repeated until convergence is reached. Such a protocol would lead to successful refinement if more native-like decoys than the starting structure at a given cycle are generated and selected according to the scoring function at least for most of the cycles. In such a scheme, sampling and scoring become most decoupled when the sampling protocol involves only short simulations so that the underlying energy function cannot significantly bias the distribution of decoys. The degree to which sampling and scoring should be decoupled depends on the quality of the energy function used during the sampling phase. A very approximate energy function may favor the sampling of non-native states during longer simulations while a more accurate energy function may actually help direct sampling toward the native state. A typical protocol may involve short molecular dynamics or Monte Carlo simulations started with different random seeds from the same initial structure at a given cycle. The length of the simulations determines the breadth of the structural distribution of the resulting decoys and the number of refinement cycles that are minimally needed to reach the native state. The breadth of the decoy distribution in turn affects the ability of the scoring function to identify the most native-like conformations in a given decoy set. An ideal sampling method would therefore generate a broad distribution of structures where the fraction of structures that are more native-like than the starting structure at a given cycle is maximized. 
Sampling methods can be assessed independently from scoring by assuming an idealized refinement protocol where root mean square deviation (RMSD) from the native state (or another suitable metric of nativeness) is used as the scoring function, i.e., the decoy closest to the native state is always selected as the initial structure for the next cycle (see Fig. 4.1). The scheme can be varied by introducing different levels of noise in the RMSD-based scoring function in order to determine what levels of noise can be tolerated to accomplish structure refinement (Stumpff-Kane et al. 2008). An important question that can be answered with such a framework is whether a given sampling method is in principle able to reach the native state. A given sampling method may not be able to reach the native state if the native state does not correspond to favorable free energies or if there are significant kinetic barriers on the path toward the native state that cannot be overcome within the limited amount of sampling during the generation of decoys. Both, the presence of kinetic barriers and the relative energy of the native state depend on the model that is used for the conformational sampling.
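The idealized protocol described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: a "conformation" is a plain list of numbers, the sampler adds random Gaussian displacements rather than running a real simulation, and the names `sample_decoys` and `refine` are hypothetical. The `noise` parameter perturbs the RMSD-based score, mimicking the noisy scoring variant mentioned in the text.

```python
import math
import random


def rmsd(conf, native):
    """Root mean square deviation between two coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(conf, native)) / len(conf))


def sample_decoys(start, n_decoys, step, rng):
    """Toy sampler: random Gaussian displacements around the start structure."""
    return [[x + rng.gauss(0.0, step) for x in start] for _ in range(n_decoys)]


def refine(start, native, cycles=100, n_decoys=50, step=0.3, noise=0.0, seed=1):
    """Idealized refinement: score each decoy by its (noisy) RMSD to the native."""
    rng = random.Random(seed)
    current = list(start)
    history = [rmsd(current, native)]
    for _ in range(cycles):
        decoys = sample_decoys(current, n_decoys, step, rng)
        # The score is the true RMSD plus optional Gaussian noise.
        scored = [(rmsd(d, native) + rng.gauss(0.0, noise), d) for d in decoys]
        # The best-scoring decoy seeds the next cycle, even if it is worse.
        current = min(scored, key=lambda pair: pair[0])[1]
        history.append(rmsd(current, native))
    return history


native = [0.0] * 30
start = [5.0] * 30  # toy starting structure at an RMSD of 5.0 from the "native"
ideal = refine(start, native, noise=0.0)
noisy = refine(start, native, noise=1.0)
print(f"ideal scoring: {ideal[0]:.2f} -> {ideal[-1]:.2f}")
print(f"noisy scoring: {noisy[0]:.2f} -> {noisy[-1]:.2f}")
```

Comparing the two trajectories gives a rough feel for how much scoring noise such a scheme tolerates; in a real application the sampler and the notion of a conformation would be replaced by the models discussed in this chapter.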

4.2.1 Quantitative Measure of Sampling Efficiency

In order to introduce a quantitative measure of sampling efficiency it is worthwhile to revisit an earlier analysis by Reva and coworkers of randomly generated protein-like conformations (Reva et al. 1998). It was found that the RMSD values of such conformations approximately follow a normal distribution (see Fig. 4.2) so that the probability of finding a conformation with a coordinate RMSD from the native of R or less is given according to:

$$
P_{r \le R} = \frac{1}{\sigma\sqrt{2\pi}} \int_{0}^{R} e^{-\frac{(r-\langle R\rangle)^2}{2\sigma^2}}\,dr, \tag{4.1}
$$

where the average RMSD ⟨R⟩ and the associated standard deviation σ depend on the specific protein that is being considered. ⟨R⟩ may be estimated from the number of residues N as 3.333N^(1/3) Å (Maiorov and Crippen 1995; Reva et al. 1998), while a value of 2.0 Å is a reasonable guess for σ for small and medium-sized proteins (Reva et al. 1998). As an example, one may calculate the probability of randomly generating conformations for a 150-residue protein with an RMSD of 15 Å or less as the gray/blue area under the curve, which corresponds to a value of about 0.09 (Fig. 4.2).

Fig. 4.2 Probability of finding structures at a given RMSD for a protein with 150 residues
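The 150-residue example can be checked numerically. The sketch below evaluates Eq. (4.1) through the normal cumulative distribution function; the function names are hypothetical, and σ = 2.0 Å is the guess quoted in the text.

```python
import math


def mean_random_rmsd(n_residues):
    """Estimate <R> in angstroms as 3.333 * N^(1/3), as quoted in the text."""
    return 3.333 * n_residues ** (1.0 / 3.0)


def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * math.erfc(-(x - mean) / (sigma * math.sqrt(2.0)))


def prob_rmsd_below(r_max, n_residues, sigma=2.0):
    """Eq. (4.1): integral of the normal density from 0 to r_max."""
    mean = mean_random_rmsd(n_residues)
    return normal_cdf(r_max, mean, sigma) - normal_cdf(0.0, mean, sigma)


p = prob_rmsd_below(15.0, 150)
print(f"<R> = {mean_random_rmsd(150):.2f} A, P(r <= 15 A) = {p:.3f}")
```

The result rounds to the value of about 0.09 given in the text.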

The preceding analysis can be extended to analyze the fraction of structures that are more native-like than a given starting structure for the case of random sampling with no specific bias or barriers that would favor or disfavor sampling toward the native state. If one assumes that an ensemble generated from an initial structure with an RMSD R extends maximally to the interval [R−ΔR, R+ΔR], the fraction of improved structures with an RMSD between R−ΔR and R is given according to:

$$
P_{R-\Delta R<r<R} = \frac{A}{A+B}
= \frac{\displaystyle\int_{R-\Delta R}^{R} e^{-\frac{(r-\langle R\rangle)^2}{2\sigma^2}}\,dr}
       {\displaystyle\int_{R-\Delta R}^{R+\Delta R} e^{-\frac{(r-\langle R\rangle)^2}{2\sigma^2}}\,dr} \tag{4.2}
$$

where A and B are the areas under the curve indicated in Fig. 4.2.

In practice, structures generated with a given sampling method will not be distributed in a symmetric interval [R−ΔR, R+ΔR] around the starting structure even if the underlying energy landscape is completely flat, because the number of states decreases as the native state, or any specific state for that matter, is approached. However, the symmetric case may be used as a theoretical reference to introduce a sampling efficiency ratio that compares the fraction of structures that are actually better than the starting structure to $P_{R-\Delta R<r<R}$.
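
A minimal sketch of how the random-sampling reference of Eq. (4.2) and an efficiency ratio might be evaluated is given below. The exact definition of the ratio is not spelled out here, so `efficiency_ratio` (observed improved fraction divided by the reference fraction) is an assumption, as are the made-up decoy RMSD values.

```python
import math


def normal_cdf(x, mean, sigma):
    """Normal CDF written with erfc to stay accurate far out in the tail."""
    return 0.5 * math.erfc(-(x - mean) / (sigma * math.sqrt(2.0)))


def random_improved_fraction(r_start, delta_r, mean, sigma=2.0):
    """Eq. (4.2): A / (A + B) for the symmetric interval around r_start."""
    lower = max(r_start - delta_r, 0.0)  # an RMSD cannot be negative
    area_a = normal_cdf(r_start, mean, sigma) - normal_cdf(lower, mean, sigma)
    area_ab = normal_cdf(r_start + delta_r, mean, sigma) - normal_cdf(lower, mean, sigma)
    return area_a / area_ab


def efficiency_ratio(decoy_rmsds, r_start, delta_r, mean, sigma=2.0):
    """Assumed definition: observed improved fraction over the random reference."""
    observed = sum(1 for r in decoy_rmsds if r < r_start) / len(decoy_rmsds)
    return observed / random_improved_fraction(r_start, delta_r, mean, sigma)


mean_r = 3.333 * 150 ** (1.0 / 3.0)  # <R> for a 150-residue protein
reference = random_improved_fraction(4.0, 1.0, mean_r)
decoys = [3.6, 3.9, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.8, 5.0]  # made-up RMSD values
ratio = efficiency_ratio(decoys, 4.0, 1.0, mean_r)
print(f"random reference fraction: {reference:.4f}")
print(f"efficiency ratio: {ratio:.1f}")
```

Under this assumed definition, a ratio above 1 would indicate that a sampling method produces improved structures more often than unbiased random sampling within the same interval.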

4.3 Protein Models at Different Resolutions

A large variety of models have been introduced to describe the interactions and structure of proteins. At the very minimum, such models need to maintain chain connectivity and proper space filling to generate protein-like structures, as in early models of protein folding. Improved models introduce specific physicochemical characteristics of different amino acid side chains and represent secondary structure elements, while the most detailed models used for representing protein structures include full atomistic detail. Atomistic models with physically motivated interaction potentials provide the most accurate description of protein structures that can be afforded today. An even more detailed model based on quantum mechanics remains beyond the realm of what is practical today. The application of classical atomistic models is limited, however, by still significant computational costs and a resulting energy landscape that tends to be very rough and thereby hinders efficient conformational sampling. Coarse-grained models, on the other hand, generally have much smoother energy landscapes, but in those models energetic accuracy may be compromised to the point where the native state no longer corresponds to a free energy minimum so that sampling of native-like structures is not favorable.

In order to contrast the advantages and disadvantages of models at different resolutions in the context of conformational sampling, we will focus here on a fully atomistic model based on the Chemistry at HARvard Macromolecular Mechanics (CHARMM) force field and two types of coarse-grained models (see Fig. 4.3). These models will be described briefly in the following.

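4.3.1 All-Atom Models of Proteins

All-atom models of biomolecules usually represent each atom, including all hydrogens, as a sphere with a point charge (see Fig. 4.3). The associated interaction potential generally assumes the following classical form (MacKerell 2004):

$$
\begin{aligned}
V(r) ={}& \sum_{i=1}^{N_{\mathrm{bonds}}} k_{\mathrm{bond},i}\,(d_i - d_{i,0})^2
 + \sum_{i=1}^{N_{\mathrm{angles}}} k_{\mathrm{angle},i}\,(\theta_i - \theta_{i,0})^2 \\
&+ \sum_{i=1}^{N_{\mathrm{torsions}}} k_{\mathrm{torsion},i}\,\bigl[1 + \cos(n_i\phi_i - \phi_{i,0})\bigr]
 + \sum_{i=1}^{N_{\mathrm{improper}}} k_{\mathrm{improper},i}\,(\phi_i - \phi_{i,0})^2 \\
&+ \sum_{i=1}^{N_{\mathrm{atoms}}-1} \sum_{j=i+1}^{N_{\mathrm{atoms}}}
 \left\{ 4\varepsilon_{\mathrm{LJ},ij}\left[\left(\frac{\sigma_{\mathrm{LJ},ij}}{r_{ij}}\right)^{12}
 - \left(\frac{\sigma_{\mathrm{LJ},ij}}{r_{ij}}\right)^{6}\right]
 + \frac{1}{4\pi\varepsilon_0}\frac{q_i q_j}{r_{ij}} \right\}
\end{aligned}
\tag{4.3}
$$

Fig. 4.3 Protein models at different resolutions: all-atom resolution (left), PRIMO (center), SICHO (right)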
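The individual terms of Eq. (4.3) can be sketched as small functions. This is an illustrative simplification rather than any force field's actual implementation: a single ε and σ are used for all pairs, no bonded-pair exclusions or 1–4 scaling are applied, and the constant 332.0637 is the usual conversion of q_i q_j / r_ij (elementary charges, Å) to kcal/mol.

```python
import math

COULOMB = 332.0637  # kcal*angstrom/(mol*e^2); converts q_i*q_j/r_ij to kcal/mol


def harmonic(k, x, x0):
    """Harmonic term k*(x - x0)^2, as used for bonds, angles, and impropers."""
    return k * (x - x0) ** 2


def torsion(k, n, phi, phi0):
    """Fourier torsion term k*[1 + cos(n*phi - phi0)], angles in radians."""
    return k * (1.0 + math.cos(n * phi - phi0))


def lennard_jones(eps, sigma, r):
    """Lennard-Jones term 4*eps*[(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)


def nonbonded_energy(coords, charges, eps, sigma):
    """Double sum over all pairs i < j of Lennard-Jones plus Coulomb terms."""
    total = 0.0
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            total += lennard_jones(eps, sigma, r)
            total += COULOMB * charges[i] * charges[j] / r
    return total


# At r = 2^(1/6)*sigma the Lennard-Jones term reaches its minimum of -eps.
r_min = 2.0 ** (1.0 / 6.0) * 3.5
print(f"LJ minimum: {lennard_jones(0.1, 3.5, r_min):.3f} kcal/mol")
```

The double loop makes the quadratic scaling of the non-bonded sum with the number of atoms explicit.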

In Eq. (4.3), bonded terms maintain covalent molecular bonding geometries through harmonic terms for bonds, angles, and improper torsions as well as a Fourier series term for proper torsions. Bonded terms essentially reflect interactions that otherwise would have to be described quantum mechanically, and they are fit either empirically to experimental high-resolution structural data and spectroscopic vibrational frequencies or to optimized geometries and normal mode frequencies from ab initio calculations. Non-bonded interactions are described through a combination of electrostatic interactions between point charges qi located at the atomic centers and the well-known Lennard–Jones potential, which combines attractive van der Waals dispersion interactions with a highly repulsive term to avoid overlap of atomic spheres. The non-bonded terms are typically also fit to ab initio data (MacKerell 2004).

The basic functional form given in Eq. (4.3) is used to describe protein interactions in a number of widely used force fields such as CHARMM (MacKerell et al. 1998), Assisted Model Building with Energy Refinement (AMBER), GROningen MOlecular Simulation (GROMOS) (Oostenbrink et al. 2004), and Optimized Potentials for Liquid Simulations (OPLS) (Jorgensen et al. 1996), with the main differences in the choice of force constants, torsion terms, partial charges, and Lennard–Jones parameters as well as the balance between non-bonded and bonded terms for 1–4 interactions. The CHARMM force field also introduces a 1–3 distance potential (MacKerell et al. 1998), and recently a spline-based torsion cross-correlation term was added to improve the description of protein backbone sampling (MacKerell Jr. et al. 2004a,b).

An interaction potential according to Eq. (4.3) primarily relies on basic physical principles instead of empirical knowledge derived from known protein structures. As a result, such models are applicable for describing the energetics of a wide variety of systems, including unusual, non-canonical structures and disordered peptides. In the context of structure refinement, all-atom models are furthermore expected to be able to correctly distinguish subtle energetic differences between sub-optimal and optimal packing arrangements. However, the all-atom level of detail also results in highly rugged energy landscapes since very small variations in atomic positions can lead to large changes in energy, e.g., when two contacting atomic spheres increase their degree of overlap or when the distance between two bonded atoms is stretched. The ruggedness of the energy landscape hinders efficient conformational sampling and, in particular, the energetic fluctuations during sampling of similar conformations may be large compared to the difference in free energies between mis-folded and correctly folded states.

4.3.1.1 Sampling with All-Atom Force Fields

Sampling with all-atom force fields commonly involves molecular dynamics simulations at a given temperature using forces obtained as the negative gradient of the potential in Eq. (4.3). The calculation of such forces for biological macromolecules is computationally very demanding because the long-range electrostatic interactions need to be calculated for large pairwise distances (Schreiber and Steinhauser 1992; Norberg and Nilsson 2000). Furthermore, simulations are typically required to use 1–2 fs integration time steps in order to adequately sample high-frequency bond vibrations.
Conformational sampling can be improved simply by increasing the temperature or with a variety of more advanced enhanced sampling techniques (Cheng et al. 2004; Hamelberg et al. 2004; Micheletti et al. 2004; Okamoto 2004). An alternative strategy involves the use of normal modes to propagate structures along the lowest-frequency modes (Stumpff-Kane et al. 2008). Such modes focus on overall structural changes that involve large parts of a given structure and avoid the high-frequency fluctuations that give rise to the significant energetic fluctuations with atomistic force fields.
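Why the fastest vibration limits the integration time step can be seen in a toy example. The sketch below integrates a single one-dimensional harmonic oscillator with the velocity Verlet scheme in reduced units (k = mass = 1, so the angular frequency is 1); it is an assumption-laden illustration, not a molecular dynamics code. For this oscillator the scheme is stable only when the time step times the angular frequency stays below 2.

```python
def velocity_verlet(x, v, dt, n_steps, k=1.0, mass=1.0):
    """Integrate a 1D harmonic oscillator; return the total energy after each step."""
    energies = []
    force = -k * x
    for _ in range(n_steps):
        v += 0.5 * dt * force / mass  # half-step velocity update
        x += dt * v                   # full-step position update
        force = -k * x                # force from the new position
        v += 0.5 * dt * force / mass  # second half-step velocity update
        energies.append(0.5 * mass * v * v + 0.5 * k * x * x)
    return energies


small = velocity_verlet(x=1.0, v=0.0, dt=0.1, n_steps=200)
large = velocity_verlet(x=1.0, v=0.0, dt=2.5, n_steps=20)
print(f"dt = 0.1: energy stays within {min(small):.4f} .. {max(small):.4f}")
print(f"dt = 2.5: final energy {large[-1]:.3e}")
```

With the small step the energy stays close to its initial value of 0.5, whereas the large step makes the trajectory diverge. Removing or slowing the fastest degrees of freedom therefore permits longer steps.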

4.3.2 Coarse-Grained Models of Proteins

Coarse-grained models of proteins reduce the model resolution from fully atomistic models. Such models may range from united-atom models where, e.g., aliphatic carbon–hydrogen pairs are combined into a single particle to models where an interaction site represents a significant part of a given protein structure and includes multiple residues, but the most popular models retain one up to a few interaction sites per amino acid residue. The reduced number of interaction sites greatly reduces the computational cost that is incurred during sampling but at the same time compromises energetic accuracy. Furthermore, the degree of coarse-graining affects the ability to retain physically motivated interaction potentials. Instead, energy terms of coarse-grained models often have to rely on empirical and/or system-specific terms that limit transferability and applicability for structure refinement applications.

While a more general overview of coarse-grained models is given in Chapter 11, we focus here only on contrasting two specific coarse-grained models that have very different characteristics. The first one, PRotein Intermediate MOdel (PRIMO), is a new model that represents proteins at an intermediate level of resolution but is able to retain a largely physics-based interaction potential. The second one, Side CHain Only (SICHO), is a lattice-based residue-level model that relies primarily on empirical and statistical terms. Both are described in more detail in the following sections.

4.3.2.1 PRIMO

PRIMO (PRotein Intermediate MOdel) was recently developed as a coarse-grained model that maintains a one-to-one correspondence to fully atomistic models so that physically motivated force field-like interaction potentials can be used that are parameterized by direct comparison with a given all-atom force field (Gopal et al. 2010). Such a model is possible when coarse-grained interaction sites are selected so that the all-atom level of detail can be reconstructed analytically based on standard molecular bonding geometries. As an example, the position of the Cβ atom can always be reconstructed with good accuracy from backbone heavy atoms C, Cα, and N based on the assumption of standard tetrahedral bonding. A coarse-grained model without an explicit Cβ site is therefore still able to maintain close correspondence with a fully atomistic model. The PRIMO model is the result of a systematic reduction of interaction sites to a minimal set that maintains quasi-atomistic resolution (after all-atom reconstruction). The resulting model is shown in Fig. 4.4.
It contains three sites per residue for the backbone and one to five interaction sites for non-glycine side chains. In some cases additional sites were introduced to achieve good space-filling. An example is the explicit Cβ site in alanine that is not needed for maintaining a one-to-one correspondence to the all-atom level but is included to improve packing interactions.

Fig. 4.4 PRIMO interaction sites

Structures that are reduced to the PRIMO level and then reconstructed to atomic detail deviate on average by less than 0.1 Å RMSD from the original all-atom structures (Gopal et al. 2010). In contrast, all-atom reconstructions from other coarse-grained models at lower resolutions, e.g., with a single site at the side-chain center, rarely achieve a reconstruction accuracy better than 0.5 Å (Feig et al. 2000). More typical is an accuracy of 1–2 Å due to uncertainties in the common use of rotamer libraries for determining side-chain orientations (Bower et al. 1997). At the same time, the cost for the analytical reconstruction procedure from PRIMO to all-atom representations is negligible compared to the cost of iterative reconstruction procedures that are used otherwise. This presents opportunities for the development of efficient multi-scale models. It also allows the use of virtual atomistic interaction sites to improve the description of bonded interactions without an additional cost for calculating non-bonded energy terms that would result if those sites were included explicitly as part of the model.

The PRIMO interaction potential follows a force field-like form with additional terms for implicit solvent, an explicit hydrogen bonding interaction potential, and modified bonded terms to maintain correct bond geometries at the coarse-grained level:

$$
\begin{aligned}
V(r) ={}& \sum_{i=1}^{N_{\mathrm{bonds}}} k_{\mathrm{bond},i}\,(d_i - d_{i,0})^2
 + \sum_{i=1}^{N_{\mathrm{bond\text{-}splines}}} s_i^{1D}\bigl(d_i^{1-2}\bigr) \\
&+ \sum_{i=1}^{N_{\mathrm{angles}}} k_{\mathrm{angle},i}\,(\theta_i - \theta_{i,0})^2
 + \sum_{i=1}^{N_{\mathrm{angle\text{-}splines}}} s_i^{1D}\bigl(d_i^{1-3}\bigr) \\
&+ \sum_{i=1}^{N_{\mathrm{torsion}}} \sum_{j=1}^{N_i^{\mathrm{mult}}} k_{\mathrm{torsion},i,j}\,\bigl[1 + \cos(n_{i,j}\phi_i - \phi_{i,j,0})\bigr]
 + \sum_{i=1}^{N_{\mathrm{torsion\text{-}splines}}} s_i^{1D}\bigl(d_i^{1-4}\bigr)
 + \sum_{i=1}^{N_{\mathrm{CMAP}}} s_i^{2D}(\phi_i, \psi_i) \\
&+ \sum_{i=1}^{N_{\mathrm{atoms}}-1} \sum_{j=i+1}^{N_{\mathrm{atoms}}}
 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]
 + \sum_{i=1}^{N_{\mathrm{atoms}}-1} \sum_{j=i+1}^{N_{\mathrm{atoms}}} \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \\
&+ \sum_{i=1}^{N_{\mathrm{HBOND}}} s_i^{2D}\bigl(\cos\alpha, d^{\mathrm{N-CO}}\bigr)
 + \Delta G_{\mathrm{GB}}^{\mathrm{solvation}}
 + \sum_{i=1}^{N_{\mathrm{atoms}}} \gamma_i\,\mathrm{SASA}_i
\end{aligned}
\tag{4.4}
$$


Bonded Terms in PRIMO

Bonded interactions between PRIMO sites correspond to both real covalent bonds, e.g., between Cα and N backbone sites, and virtual bonds, e.g., between N and the combined backbone carbonyl site, CO. For real covalent bonds, standard harmonic terms are used as in standard all-atom force fields to model 1–2 (bonds) and 1–3 (angles) interactions. Otherwise, distance-based spline-interpolated potentials are employed to reproduce non-harmonic functions and in particular the presence of multiple minima (see example in Fig. 4.5). The shape of the effective bond and angle potentials for virtual bond interactions was extracted from the actual sampling in long explicit solvent simulations over 150 ns with the CHARMM22/CMAP force field for each of the amino acid dipeptides. From a strict point of view, the spline-interpolated potentials represent effective potentials that are only applicable for explicit solvent environments. However, because bonded interactions are strong compared to non-bonded interactions with the environment, it is assumed here that the dipeptide-derived effective bonded terms can be transferred to condensed protein environments and different non-aqueous solvent environments.
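One common way to turn sampled distributions into an effective potential is Boltzmann inversion, E(x) = −kT ln p(x), followed by interpolation. The sketch below assumes that route for illustration only; the text does not state the exact procedure used for PRIMO. It also uses piecewise-linear interpolation as a simple stand-in for a spline and synthetic angle samples instead of dipeptide simulation data.

```python
import math
import random

KT = 0.5925  # kcal/mol at about 298 K


def boltzmann_inversion(samples, lo, hi, width):
    """Histogram the samples and convert populations to energies -kT*ln(p)."""
    n_bins = int(round((hi - lo) / width))
    counts = [0] * n_bins
    for s in samples:
        idx = math.floor((s - lo) / width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    top = max(counts)
    centers, energies = [], []
    for i, c in enumerate(counts):
        if c > 0:  # skip empty bins, where the energy is undefined
            centers.append(lo + (i + 0.5) * width)
            energies.append(-KT * math.log(c / top))
    return centers, energies


def interpolate(centers, energies, x):
    """Piecewise-linear lookup; a simple stand-in for a spline."""
    if x <= centers[0]:
        return energies[0]
    for k in range(1, len(centers)):
        if x <= centers[k]:
            t = (x - centers[k - 1]) / (centers[k] - centers[k - 1])
            return energies[k - 1] + t * (energies[k] - energies[k - 1])
    return energies[-1]


rng = random.Random(0)
# Synthetic angle samples (degrees) drawn from two populations.
angles = [rng.gauss(100.0, 5.0) if rng.random() < 0.7 else rng.gauss(140.0, 5.0)
          for _ in range(20000)]
centers, energies = boltzmann_inversion(angles, 60.0, 180.0, 2.0)
e100 = interpolate(centers, energies, 100.0)
e140 = interpolate(centers, energies, 140.0)
print(f"E(100) = {e100:.2f} kcal/mol, E(140) = {e140:.2f} kcal/mol")
```

The resulting curve has two minima, which a single harmonic term could not reproduce.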

Fig. 4.5 N–CA(Cα)–SC1(Cβ+Cγ) effective angle interaction in leucine from leucine dipeptide simulation (points) and spline interpolation used in PRIMO (line)

Bonded terms based on PRIMO interaction sites are augmented by bonded terms involving virtual atomic sites that are reconstructed on-the-fly from the PRIMO model. Such terms are necessary to maintain correct bonding geometries at the PRIMO level. For example, in the case of leucine, the N–CA(Cα)–SC1(Cβ+Cγ) angle term along with other bonded terms involving SC1 is not sufficient to restrict SC1 to a circle around the Cα–Cβ bond as required for sp3-hybridized molecular bonding to Cβ (see Fig. 4.5). This problem is alleviated by reconstructing the Cβ position from the backbone atoms and introducing a harmonic CA–CB*–SC1 angle potential where CB* is the reconstructed Cβ position. It may appear easier to simply re-introduce CB as an additional site in PRIMO. However, the cost of an additional particle that is propagated along with the other particles and that has to be included in the calculation of non-bonded interactions far outweighs the negligible cost for reconstructing CB on-the-fly as a virtual atom.

In all-atom force fields, 1–4 interactions are typically modeled as a combination of Fourier-series torsional terms and scaled electrostatic and Lennard–Jones interactions. In PRIMO, 1–4 interactions are modeled as a combination of Fourier series terms, distance-based spline-interpolated terms, and reduced Lennard–Jones interactions to avoid hard-sphere overlap. However, 1–4 electrostatic interactions are omitted because the reduced charges do not provide sufficiently accurate local electrostatic interactions. Instead, the spline-based 1–4 interaction potentials represent the effective interactions in explicit solvent that are extracted from the dipeptide simulations. This approach is therefore also only strictly valid for small peptides in aqueous solvent.

In addition to the one-dimensional 1–4 terms, a spline-interpolated two-dimensional cross-correlation term based on the CMAP methodology (MacKerell Jr. et al. 2004a) is used in PRIMO to couple the sampling of CO–N–CA–CO and N–CA–CO–N torsions and thereby control the sampling of φ/ψ backbone torsions. Using the CMAP methodology it is possible to nearly exactly reproduce any given target φ/ψ-map (MacKerell Jr. et al. 2004a).
The CMAP term in PRIMO is initially parameterized to reproduce the effective φ/ψ distribution in short alanine-based peptides with the CHARMM22/CMAP force field, but can be easily reparameterized to reflect improved φ/ψ distributions that are more consistent with experimental observations such as nuclear magnetic resonance (NMR) J-coupling data (Graf et al. 2007; Best et al. 2008).
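The on-the-fly reconstruction of a virtual Cβ site from backbone atoms, as described above, can be sketched with a fixed linear combination of backbone vectors. The three coefficients are commonly used idealized-geometry values and are an assumption here, not PRIMO parameters; the function name `virtual_cb` is hypothetical.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def virtual_cb(n, ca, c):
    """Place a virtual C-beta from backbone N, CA, and C coordinates.

    The coefficients encode an idealized tetrahedral geometry; they are
    an assumption for illustration, not parameters taken from PRIMO.
    """
    b = sub(ca, n)
    cvec = sub(c, ca)
    a = cross(b, cvec)
    return tuple(-0.58273431 * a[k] + 0.56802827 * b[k] - 0.54067466 * cvec[k] + ca[k]
                 for k in range(3))


# Idealized backbone fragment: N-CA = 1.458 A, CA-C = 1.525 A, N-CA-C = 111 degrees.
ca = (0.0, 0.0, 0.0)
n = (1.458, 0.0, 0.0)
angle = math.radians(111.0)
c = (1.525 * math.cos(angle), 1.525 * math.sin(angle), 0.0)
cb = virtual_cb(n, ca, c)
print(f"CA-CB distance: {math.dist(ca, cb):.2f} A")
```

Because the construction needs only a cross product and a few multiplications, its cost is negligible next to the non-bonded sums, in line with the argument for treating CB* as a virtual atom.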

Non-bonded Terms in PRIMO

Non-bonded terms in PRIMO consist of Lennard–Jones terms, electrostatic interactions, and an explicit hydrogen bonding term. The parameters in the Lennard–Jones term (σi, εi), including reduced values for 1–4 interactions, are optimized to reproduce CHARMM all-atom Lennard–Jones interaction energies for a series of peptide test sets. Electrostatic interactions rely on partial charges for each PRIMO site. Because of the coarse-grained nature of PRIMO, the partial charges are generally reduced relative to all-atom models although formal charges are maintained for acidic and basic amino acids. With the reduced charges (and in the absence of higher-order multipoles) all-atom electrostatic interactions cannot be accounted for completely so that the PRIMO charges cannot be optimized directly by simply fitting to all-atom electrostatic energies.

The reduced partial charges are a particular problem for the description of hydrogen bonds, which is accomplished implicitly through electrostatics in modern all-atom force fields. In PRIMO, an explicit hydrogen bonding potential is therefore reintroduced but with a functional form that differs from earlier implementations. The PRIMO potential relies on a spline-interpolated two-dimensional potential of mean force (PMF) as a function of both hydrogen bonding distance and angle. The PMF was obtained from a statistical analysis of protein structures in the PDB. In order to apply such a PMF in PRIMO, it is necessary (and straightforward) to reconstruct hydrogen atoms involved in hydrogen bonding on-the-fly from the PRIMO sites. The hydrogen bonding potential is the only term in PRIMO that directly encodes information extracted from known protein structures. Partial charges and a scaling factor for the PMF-based hydrogen bonding potential were optimized by fitting total PRIMO internal energies (including bonded and non-bonded terms) to total CHARMM all-atom energies for a series of test peptides.

Implicit Solvent Terms in PRIMO

Interactions with the environment are represented in PRIMO via implicit solvent. Because PRIMO sites have partial charges, the generalized Born (GB) formalism given in Eq. (4.5) can be used to describe the electrostatic contribution to the solvation free energy as an approximation to a continuum dielectric electrostatic model according to Poisson theory (Still et al. 1990; Feig and Brooks 2004):

GGB

1 1 = −166 − εp εw

i

j

rij2

qi qj + αi αj exp − rij2 /Fαi αj

(4.5)

where $\varepsilon_p$ is the solute cavity dielectric constant, $\varepsilon_w$ is the solvent dielectric constant, $q_i$ is the atomic charge of the ith PRIMO site, $r_{ij}$ is the interatomic distance between the ith and the jth sites, and $\alpha_i$ is the so-called effective Born radius of the ith site. The key part of any GB implementation is the calculation of the Born radii (Onufriev et al. 2002). In the case of PRIMO, the generalized Born with molecular volume (GBMV) formalism (Lee et al. 2003) is used because it results in electrostatic solvation free energies that are based on a molecular surface description and closely match Poisson theory (Feig et al. 2004b). The non-polar contribution to the solvation free energy is modeled simply as a linear function of the solvent-accessible surface area (Sitkoff et al. 1994). However, instead of using a constant surface tension coefficient for the entire surface, different scaling factors are introduced for different atom types. Such a term is known as an atomic solvation potential (ASP) (Wesson and Eisenberg 1992). The ASP term is included because the GB term of Eq. (4.5) does not fully account for the electrostatic solvation free energy, as it is based on reduced partial charges. The ASP coefficients, which may be positive or negative, were fit by comparing the combined GB and ASP energies to all-atom implicit solvent free energies. It should be noted that the separation of internal energies and implicit solvent terms in PRIMO allows an extension of PRIMO to non-aqueous environments. For

example, membrane environments can be modeled by replacing the standard GB formalism with an implicit membrane model based on a heterogeneous dielectric environment and ASP parameters that are scaled as a function of z, the distance from the membrane center (Tanizaki and Feig 2005).
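As an illustration of Eq. (4.5), the following minimal Python sketch evaluates the generalized Born term for a set of charged sites. The function name `gb_energy`, the default dielectric constants, and the default F = 4 (the value used in the original formulation of Still et al. 1990) are assumptions made for this example; the text does not state the values used in PRIMO. The units (kcal/mol, Å, elementary charges) are those implied by the prefactor 166. In practice the Born radii would come from the GBMV calculation rather than being supplied by hand.

```python
import math

def gb_energy(charges, coords, born_radii, eps_p=1.0, eps_w=80.0, F=4.0):
    """Generalized Born term of Eq. (4.5).

    charges in e, coordinates and Born radii in Å, result in kcal/mol.
    eps_p, eps_w and F are illustrative defaults, not PRIMO's settings.
    """
    prefactor = -166.0 * (1.0 / eps_p - 1.0 / eps_w)
    total = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(n):  # full double sum; i == j gives the Born self term
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            aa = born_radii[i] * born_radii[j]
            f_gb = math.sqrt(r2 + aa * math.exp(-r2 / (F * aa)))
            total += charges[i] * charges[j] / f_gb
    return prefactor * total
```

For a single site with unit charge and a Born radius of 2 Å, the double sum reduces to the self term and the function returns the Born energy of an ion, about −82 kcal/mol with the defaults above.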

Sampling with PRIMO

PRIMO is designed as a continuous-space model with a force field-like energy function that can be explored with molecular dynamics sampling techniques, although Monte Carlo sampling is also possible. Because of the coarse-grained nature of the model, and in particular the lack of explicit hydrogen atoms, a longer integration time step can be employed. In initial tests, PRIMO has been stable with time steps of up to 7 fs. Based on experience with all-atom implicit solvent simulations (Chocholousova and Feig 2006), longer time steps are likely possible if the GBMV implicit solvent model is exchanged for a variant that models a smoother molecular surface.

Transferability For the most part, PRIMO relies on a physically motivated interaction potential that is parameterized by direct comparison with an all-atom energy function. New empirical terms are introduced to model hydrogen bonding and atom-based solvation properties, but in contrast to other similar models no bias toward known secondary structures or other structural constraints is necessary to model a given protein system. In fact, PRIMO is designed to be used like an all-atom energy function with the ability to carry out stable native-state molecular dynamics simulations of any protein and allow peptide folding from extended chains to native-like structures. Initial validation suggests that PRIMO can fulfill that promise, but future more extensive tests of PRIMO will reveal where the coarse-grained nature of PRIMO places limits on its energetic accuracy compared to fully atomistic models.

4.3.2.2 SICHO The SICHO (Side CHain Only) model consists, as the name suggests, of a single interaction site per amino acid residue, located at the side-chain center (Kolinski and Skolnick 1994). Such a model emphasizes side-chain packing over the detailed conformational sampling of backbone conformations. Sampling with the SICHO model is computationally very efficient because of the small number of interaction sites, but it is further accelerated by projecting the SICHO particles onto a cubic lattice with a 1.45 Å grid spacing and allowing only moves between lattice points (see Fig. 4.3) (Kolinski and Skolnick 1994; Skolnick et al. 1997). The SICHO interaction potential consists of a number of statistical and empirical terms. It is summarized in Eq. (4.6) (Kolinski and Skolnick 1994):

$$
\begin{aligned}
V_{\mathrm{SICHO}} ={}& \alpha_{\mathrm{stiffness}} \sum_i V^{\mathrm{stiffness}}_{i,\ldots,i+6} + \alpha_{\mathrm{short}} \sum_i V^{\mathrm{short}}_{i,\ldots,i+8} \\
&+ \alpha_{\mathrm{pair}} \sum_{i,j} V^{\mathrm{pair}}_{i,j} + \alpha_{3\text{-body}} \sum_{i,j,k} V^{3\text{-body}}_{i,j,k} + \alpha_{\mathrm{multibody}} \sum_{i,j,k,\ldots} V^{\mathrm{multibody}}_{i,j,k,\ldots} \\
&+ \alpha_{\mathrm{burial}} \sum_i V^{\mathrm{burial}}_i + \alpha_{\mathrm{contact}} \sum_i V^{\mathrm{contact}}_i + \alpha_{\mathrm{centrosymmetric}} V_{\mathrm{centrosymmetric}}
\end{aligned}
\tag{4.6}
$$

The stiffness potential enhances the persistence length along the SICHO chain to result in protein-like conformations with α-helices and β-strands. The statistical short-range potential further favors sampling of specific secondary structure elements. This potential depends on input about known or predicted secondary structure for a given protein. In the context of refinement, this means that the SICHO potential largely prevents the sampling of alternative secondary structures and is best suited for improving the packing of existing secondary structure elements. The statistical pairwise potential implements known preferences for side-chain interactions in soluble proteins. Because of the low resolution of SICHO, directional dependence of side-chain–side-chain interactions cannot be implemented with just a pairwise term. This deficiency is addressed in part with 3-body and multibody terms. Pairwise side-chain interactions are further modeled with a side-chain contact term that only applies to side chains in direct contact and a side-chain burial term that is based on the empirical Kyte–Doolittle hydrophobicity scale (Kyte and Doolittle 1982). In addition, excluded volume of SICHO particles is considered implicitly by not allowing SICHO particles to move onto lattice sites within the excluded volume radius of any other site. Finally, based on empirical knowledge about the typical radius of gyration of a folded protein (Crippen and Snow 1990), a centrosymmetric biasing term is introduced to maintain the radius of gyration near that average value. Solvent is not considered explicitly in SICHO since it is included indirectly in the statistical potentials that are derived from soluble proteins. A particular advantage of the lattice projection is that simple lookup tables can be used for all interaction potentials. Many terms of the SICHO potential overlap in representing the underlying physical interactions. 
As a result, full scaling of all terms would lead to overcounting of certain interactions. Therefore, scaling factors are applied for each potential term. These scaling factors are optimized based on given test sets and to some extent depend on the type of application. For example, ab initio protein-folding applications might benefit from a larger weight for the centrosymmetric potential in order to favor compact, globular structures while such a potential is less helpful in the detailed refinement of a structure that is already near the native state. Sampling with SICHO Sampling with the SICHO model is essentially limited to Monte Carlo sampling since molecular dynamics simulations cannot be used with lattice models (Skolnick

et al. 1997). In the current implementation of SICHO within the modeling of new structures from secondary and tertiary restraints (MONSSTER) program, the move set consists of two-bond (involving one site) and three-bond (involving two sites) moves as well as infrequent long-range moves involving a larger structural fragment in order to mimic secondary structure rigid body motions. Compared to all-atom models or even the PRIMO model, sampling with SICHO is extremely fast due to the combination of low-model resolution and lattice projection. In the context of structure refinement, it is necessary to convert the low-resolution SICHO models to all-atom detail. The reconstruction of all-atom models from SICHO is possible with an accuracy of about 1 Å through the use of a rotamer library but the optimization step that is part of such reconstruction algorithms adds additional computational costs (Feig et al. 2000).
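The lattice projection used by SICHO can be illustrated with a short Python sketch that snaps a side-chain center to the nearest point of a cubic lattice with 1.45 Å spacing and applies a simple excluded-volume check of the kind described above. The function names and the exclusion threshold (a squared distance of 4 in lattice units) are hypothetical choices for this example and are not taken from MONSSTER.

```python
import math

GRID_SPACING = 1.45  # Å, lattice spacing quoted in the text

def to_lattice(point, spacing=GRID_SPACING):
    """Integer indices of the lattice point nearest to a Cartesian point."""
    return tuple(math.floor(x / spacing + 0.5) for x in point)

def from_lattice(index, spacing=GRID_SPACING):
    """Cartesian coordinates (Å) of a lattice point."""
    return tuple(i * spacing for i in index)

def violates_excluded_volume(candidate, occupied, min_sq_dist=4):
    """True if `candidate` is closer than sqrt(min_sq_dist) lattice units
    to any occupied site (threshold is an assumed, illustrative value)."""
    return any(
        sum((a - b) ** 2 for a, b in zip(candidate, site)) < min_sq_dist
        for site in occupied
    )
```

Because every coordinate is rounded to the nearest lattice plane, a projected site is displaced by at most half the cube diagonal, 1.45 × √3/2 ≈ 1.26 Å. Working with integer indices is also what makes the lookup tables mentioned above practical, since squared lattice distances take only a small set of integer values.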

4.4 Iterative Refinement with Different Protein Models

We will now describe the application of the models introduced above within the iterative refinement protocol of Section 4.2. Test cases consist of predictions from previous CASP competitions; statistics about the starting structures are given in Table 4.1.

Table 4.1 Initial structures from template-based predictions for CASP targets used as test set for the iterative structure refinement protocol

| Target | Number of residues | PDB code of experimental structure | Residues in PDB | Starting Cα RMSD from native (Å) |
|---|---|---|---|---|
| T0132 | 154 | 1YLI | 11–152 | 5.10 |
| T0138 | 135 | 1M2F | 1–135 | 5.05 |
| T0157 | 138 | 1NMN | 2–101; 119–138 | 6.36 |
| T0167 | 185 | 1M3S | 1–135; 141–185 | 5.52 |
| T0169 | 156 | 1MK4 | 1–156 | 6.82 |
| T0188 | 124 | 1O13 | 1–107 | 3.37 |
| T0190 | 114 | 2G2N | 4–114 | 2.96 |
| T0266 | 150 | 1WDV | 3–152 | 1.61 |
| T0274 | 159 | 1WGB | 1–159 | 4.10 |
| T0275 | 137 | 1WJG | 2–136 | 2.88 |
| T0277 | 119 | 1WTY | 2–117 | 1.55 |

4.4.1 Sampling Protocol Starting from each structure an iterative refinement protocol was followed as described above. At each round 1,000 decoys were generated and the structure with the lowest RMSD was used as the starting structure for the next round. This protocol allows the comparison of different sampling methods in the best-case scenario where RMSD is used as the scoring function. A total of 100 refinement cycles were

completed for each target with the simulation protocols described in more detail in the following sections.

4.4.1.1 All-Atom Molecular Dynamics Simulations

At each round, decoys were generated from the final structures of very short molecular dynamics simulations that were started from the given starting structure with different initial velocities. Each simulation was 1 ps long (500 steps with a 2-fs time step). An elevated temperature of 400 K was used to accelerate the exploration of phase space. Because an entire refinement run consisting of 100 rounds × 1,000 decoys × 1 ps corresponds to a total simulation time of 100 ns, only computationally inexpensive solvation models were used instead of explicit solvent, so that an entire refinement run can be completed in about a day on moderate computational resources. More specifically, the following two force field/solvent model combinations were considered: (1) A full all-atom model with interactions described by the CHARMM22 force field was used in combination with distance-dependent dielectric screening of electrostatic interactions as a crude but very inexpensive approximation of solvation effects (Schaefer et al. 1999). (2) A united-atom model without aliphatic hydrogens as described by the CHARMM19 force field (Neria et al. 1996) was used in conjunction with the EEF1 solvation model (Lazaridis and Karplus 1999). Both sets of simulations were run with the CHARMM program (Brooks et al. 2009), version c36a1.
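The idealized protocol of Section 4.4.1 (generate decoys, continue from the one with the lowest RMSD, repeat) can be sketched in a few lines of Python. In this toy version the decoy generator is a random Gaussian perturbation standing in for the short molecular dynamics runs, and RMSD is computed without superposition. Both simplifications, and all function names, are assumptions made for illustration only.

```python
import math
import random

def rmsd(a, b):
    """Coordinate RMSD between two equal-length lists of (x, y, z) points.
    No superposition is performed (a simplification for this sketch)."""
    sq = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(sq / len(a))

def perturb(structure, rng, step=0.1):
    """Hypothetical decoy generator: Gaussian noise in place of a 1-ps MD run."""
    return [tuple(x + rng.gauss(0.0, step) for x in p) for p in structure]

def refine(start, native, n_cycles=100, n_decoys=1000, seed=0):
    """Each cycle: generate decoys, continue from the one with the lowest RMSD."""
    rng = random.Random(seed)
    current = start
    history = [rmsd(current, native)]
    for _ in range(n_cycles):
        decoys = [perturb(current, rng) for _ in range(n_decoys)]
        current = min(decoys, key=lambda d: rmsd(d, native))
        history.append(rmsd(current, native))
    return current, history

def total_md_time_ns(n_cycles=100, n_decoys=1000, ps_per_decoy=1.0):
    """Aggregate simulation time of one refinement run, in nanoseconds."""
    return n_cycles * n_decoys * ps_per_decoy / 1000.0
```

With the defaults, `total_md_time_ns()` reproduces the 100 ns quoted in the text (100 rounds × 1,000 decoys × 1 ps). Note that the next starting structure is chosen among the new decoys only, so the RMSD is not guaranteed to decrease at every cycle.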

4.4.1.2 PRIMO Molecular Dynamics Simulations With the PRIMO model, decoys were also generated from short molecular dynamics simulations over 1 ps each (250 steps with a 4-fs time step) at an elevated temperature of 400 K. Two combinations of PRIMO were run: (1) The PRIMO force field was combined with the generalized Born implicit solvent and atomic solvation parameters as described above. (2) The PRIMO force field was also combined with distance-dependent dielectric screening without atomic solvation parameters. All of the PRIMO simulations were run with a locally modified version of CHARMM based on c36a1 that fully implements all of the PRIMO force field features.

4.4.1.3 SICHO Lattice Monte Carlo Sampling Simulations with the SICHO model were run with lattice Monte Carlo simulations using the program MONSSTER (Skolnick et al. 1997) through the MMTSB Tool

Set (Feig et al. 2004a). These simulations were run at a constant temperature of 1.333 in reduced units of kT (at 300 K), corresponding to 400 K. In each simulation, 100 Monte Carlo moves were attempted. The default potential parameters were employed, except for the centrosymmetric potential which was not applied.
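To make the temperature setting concrete, the sketch below converts the reduced temperature to Kelvin and applies a Metropolis acceptance test to a trial move. The Metropolis criterion is assumed here as the standard choice for lattice Monte Carlo; the text does not spell out the acceptance rule used in MONSSTER, and the function names are hypothetical.

```python
import math
import random

REFERENCE_TEMPERATURE_K = 300.0  # the reduced unit kT is defined at 300 K

def reduced_to_kelvin(t_reduced):
    """Convert a temperature in units of kT(300 K) to Kelvin."""
    return t_reduced * REFERENCE_TEMPERATURE_K

def acceptance_probability(delta_e, t_reduced):
    """Metropolis probability of accepting a move with energy change delta_e
    (delta_e expressed in the same reduced units as t_reduced)."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / t_reduced)

def metropolis_accept(delta_e, t_reduced, rng=random):
    """Draw a uniform random number and accept or reject the trial move."""
    return rng.random() < acceptance_probability(delta_e, t_reduced)
```

A reduced temperature of 1.333 corresponds to 1.333 × 300 K ≈ 400 K, matching the elevated temperature used in the molecular dynamics protocols. At this temperature an uphill move of one reduced energy unit is accepted with probability exp(−1/1.333) ≈ 0.47.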

4.4.2 Refinement Toward the Native State The progress toward the native state for each of the 11 test targets with the 5 different sampling protocols is shown in Fig. 4.6. It can be seen that refinement is possible with all sampling methods but there are significant differences in how quickly structures are improved and how close the final structures are to the native structures. SICHO leads to rapid initial refinement in cases where the initial models are at 4 Å or more, but initial structures that are closer to the native are either not significantly refined or actually become worse. Final structures with SICHO are generally within 2.5 and 3.5 Å. In contrast, sampling with the other force fields

Fig. 4.6 Iterative refinement with different sampling methods from targets given in Table 4.1 using RMSD from the native as an idealized scoring function

requires more cycles to converge, but most of the final models are between 1 and 3 Å with CHARMM22/rdie and CHARMM19/EEF1, between 1.5 and 3.5 Å with PRIMO/GB, and between 1.5 and 2.5 Å with PRIMO/rdie. The extent to which structures close to the native state (<2 Å) can be reached depends on the model resolution to be able to distinguish 1.5 Å structures from 2 Å structures, on a sufficiently accurate energy function to allow the sampling of near-native states, and on the absence of significant barriers on the path toward the native state. Overall, PRIMO/rdie achieves the best performance in reaching near-native states, but the best structures are closest to the respective native states with CHARMM22/rdie and CHARMM19/EEF1. The best structures are slightly further from the native state with PRIMO and remain at a significant distance from the native state for all but one target when the SICHO model is used for sampling. In order to assess the sampling efficiency ratio defined above, we first examined the range of sampling at each cycle with the different sampling methods. The results are shown in Fig. 4.7. It can be seen that CHARMM22/rdie, CHARMM19/EEF1, and both PRIMO models have a relatively narrow range of sampling of 0.04–0.1 Å, while the sampling range is much broader for SICHO (0.2–0.5 Å). Broader sampling

Fig. 4.7 Range of sampling (standard deviation) as a function of the average RMSD during refinement of the test targets given in Table 4.1

generally facilitates the identification of the most native-like structures with a given real scoring function. The number of structures that are better than the initial structure at each cycle is the other determinant for how likely it is that a given real scoring function will be able to select more native-like structures. The average number of decoys that are better than the initial structure after 1, 5, 10, and 20 cycles is given in Table 4.2. It can be seen that the number of structures better than starting structure is initially high but drops quickly to about 1% at 10 cycles and to about 0.1–0.3% after 20 cycles which means that only 1–3 structures out of 1,000 are better than the starting structure. The fraction of structures that are better than the starting structure is highest with CHARMM19/EEF1 and CHARMM22/rdie and lowest with SICHO sampling. While this quantity is an important characteristic of the different sampling methods, it does not reflect how much better the structures are. For example, refinement during the initial cycles is more rapid with SICHO compared to PRIMO. On the other hand, the ability to refine to more native structures with PRIMO over the course of 100 refinement cycles is a result of a small number of structures that are consistently more native-like during later refinement cycles and allow slow convergence to the 1–2 Å range while such structures are not generated with SICHO. Table 4.2 Average fraction of structures better than the starting structure at different refinement cycles

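| Sampling method | Cycle 1 | Cycle 5 | Cycle 10 | Cycle 20 |
|---|---|---|---|---|
| CHARMM22/rdie | 0.242 | 0.047 | 0.008 | 0.003 |
| PRIMO/GB | 0.138 | 0.026 | 0.013 | 0.003 |
| PRIMO/rdie | 0.166 | 0.032 | 0.011 | 0.001 |
| CHARMM19/EEF1 | 0.321 | 0.059 | 0.008 | 0.008 |
| SICHO | 0.046 | 0.002 | 0.002 | 0.000 |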
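The two per-cycle quantities discussed here, the fraction of decoys that are better than the starting structure (Table 4.2) and the range of sampling (the standard deviation of the decoy RMSD values, Fig. 4.7), are straightforward to compute from a list of decoy RMSDs. A minimal sketch follows; the function names and the use of the population standard deviation are assumptions made for this example.

```python
import statistics

def fraction_better(decoy_rmsds, start_rmsd):
    """Fraction of decoys with a lower RMSD from native than the starting structure."""
    better = sum(1 for r in decoy_rmsds if r < start_rmsd)
    return better / len(decoy_rmsds)

def sampling_range(decoy_rmsds):
    """Spread of sampling, taken here as the population standard deviation (Å)."""
    return statistics.pstdev(decoy_rmsds)

def expected_better_count(fraction, n_decoys=1000):
    """Number of improved decoys implied by a fraction from Table 4.2."""
    return round(fraction * n_decoys)
```

For example, the cycle-20 fraction of 0.003 reported for CHARMM22/rdie corresponds to about 3 improved structures among the 1,000 decoys of a round.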

The number of structures that are better than the initial structure can be related to the theoretical probability PR−R

Fig. 4.8 Sampling efficiency ratio (fraction of structures better than initial model vs. PR−R
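| Sampling method | Cycle 1 | Cycle 5 | Cycle 10 | Cycle 20 |
|---|---|---|---|---|
| CHARMM22/rdie | 0.56 | 0.11 | 0.02 | 0.01 |
| PRIMO/GB | 0.32 | 0.06 | 0.03 | 0.01 |
| PRIMO/rdie | 0.39 | 0.08 | 0.03 | 0.00 |
| CHARMM19/EEF1 | 0.71 | 0.13 | 0.02 | 0.02 |
| SICHO | 0.22 | 0.01 | 0.01 | 0.00 |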

function that is used during the sampling to direct sampling toward the native state. Therefore, it is not surprising that the coarse-grained SICHO model performs worst.

4.5 Summary and Outlook Protein structure refinement remains a challenging problem. The idea promoted here is that successful refinement can be accomplished with an iterative framework through a favorable combination of sampling and scoring methods. The focus in

this chapter is on an assessment of sampling methods with models at different resolutions, ranging from all-atom models to the intermediate coarse-grained model PRIMO and the residue-level SICHO model. It is found that the choice of the model affects the ability to generate near-native conformations, with CHARMM22/rdie and CHARMM19/EEF1 performing very well compared to the coarse-grained SICHO model. It may be possible to generate even more native-like conformations by combining the CHARMM22 all-atom force field with generalized Born implicit solvent or even explicit solvent, but the associated costs make such an approach impractical within the iterative protocol tested here. Interestingly, the PRIMO model also largely matches and to some extent exceeds the performance of the CHARMM22/rdie and CHARMM19/EEF1 models. It does so at a much reduced cost and therefore appears to be the best practical choice among the models tested here. The results shown here rely on an idealized refinement protocol where RMSD is used as the scoring function. In practical applications of structure refinement, where the native structure is not known, a real scoring function has to be employed instead. The next direction is therefore the exploration of different scoring functions within the protocol described here to examine whether scoring functions are capable of reliably selecting an improved structure at each cycle. This task becomes especially challenging after 10 cycles, when fewer than 1% of the structures are better than the starting model. In reality, the problem might be easier than it seems, because a given scoring function would be expected to guide the conformational sampling along a low-free-energy path where RMSD may not always decrease continuously but sampling may be facilitated by avoiding the crossing of large barriers. 
We hope that the iterative protocol and sampling efficiency metrics introduced here will establish a framework for assessing sampling methods in the context of structure refinement that ultimately will lead to successful structure refinement protocols.

Acknowledgments This work was financially supported by NIH grant GM 084953, NSF grant MCB 0447799, and an Alfred P. Sloan fellowship (to MF). Access to computer resources at the High-Performance Computer Center at Michigan State University and use of TeraGrid computing facilities under grant number TG-MCB090003 are acknowledged.

References

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28:235–242
Best RB, Buchete NV, Hummer G (2008) Are current molecular dynamics force fields too helical? Biophys J 95:L7–L9
Bonneau R, Baker D (2001) Ab initio protein structure prediction: progress and prospects. Ann Rev Biophys Biomol Struct 30:173–189
Bower MJ, Cohen FE, Dunbrack RL (1997) Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. J Mol Biol 267:1268–1282

Brooks BR, Brooks CL III, MacKerell AD Jr, Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W, York DM, Karplus M (2009) CHARMM: the biomolecular simulation program. J Comput Chem 30:1545–1614
Cheng XL, Hornak V, Simmerling C (2004) Improved conformational sampling through an efficient combination of mean-field simulation approaches. J Phys Chem B 108:426–437
Chinchio M, Czaplewski C, Oldziej S, Scheraga HA (2006) A hierarchical multiscale approach to protein structure prediction: production of low-resolution packing arrangements of helices and refinement of the best models with a united-residue force field. Multiscale Model Sim 5:1175–1195
Chocholousova J, Feig M (2006) Balancing an accurate representation of the molecular surface in generalized Born formalisms with integrator stability in molecular dynamics simulations. J Comput Chem 27:719–729
Crippen GM, Snow ME (1990) A 1.8 Å resolution potential function for protein folding. Biopolymers 29:1479–1489
Fan H, Mark AE (2004) Refinement of homology-based protein structures by molecular dynamics simulation techniques. Protein Sci 13:211–220
Feig M, Brooks CL III (2004) Recent advances in the development and application of implicit solvent models in biomolecular simulations. Curr Opin Struct Biol 14:217–224
Feig M, Karanicolas J, Brooks CL III (2004a) MMTSB tool set: enhanced sampling and multiscale modeling methods for applications in structural biology. J Mol Graph Model 22:377–395
Feig M, MacKerell AD, Brooks CL (2003) Force field influence on the observation of π-helical protein structures in molecular dynamics simulations. J Phys Chem B 107:2831–2836
Feig M, Onufriev A, Lee MS, Im W, Case DA, Brooks CL III (2004b) Performance comparison of generalized Born and Poisson methods in the calculation of electrostatic solvation energies for protein structures. J Comput Chem 25:265–284
Feig M, Rotkiewicz P, Kolinski A, Skolnick J, Brooks CL III (2000) Accurate reconstruction of all-atom protein representations from side-chain-based low-resolution models. Proteins 41:86–97
Freddolino PL, Park S, Roux B, Schulten K (2009) Force field bias in protein folding simulations. Biophys J 96:3772–3780
Gopal SM, Klenin K, Wenzel W (2009) Template-free protein structure prediction and quality assessment with an all-atom free-energy model. Proteins-Struct Funct Bioinform 77:330–341
Gopal SM, Mukherjee S, Cheng Y-M, Feig M (2010) Proteins-Struct Funct Bioinform 78:1266–1281
Graf J, Nguyen PH, Stock G, Schwalbe H (2007) Structure and dynamics of the homologous series of alanine peptides: a joint molecular dynamics/NMR study. J Am Chem Soc 129:1179–1189
Hamelberg D, Mongan J, McCammon JA (2004) Accelerated molecular dynamics: a promising and efficient simulation method for biomolecules. J Chem Phys 120:11919–11929
Hornak V, Abel R, Okur A, Strockbine B, Roitberg A, Simmerling C (2006) Comparison of multiple amber force fields and development of improved protein backbone parameters. Proteins-Struct Funct Bioinform 65:712–725
Jorgensen WL, Maxwell DS, Tirado-Rives J (1996) Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J Am Chem Soc 118:11225–11236
Kolinski A, Skolnick J (1994) Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins 18:338–352
Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of protein. J Mol Biol 157:105–132
Lazaridis T, Karplus M (1999) Effective energy function for proteins in solution. Proteins 35:133–152

Lee MS, Feig M, Salsbury FR Jr, Brooks CL III (2003) New analytical approximation to the standard molecular volume definition and its application to generalized Born calculations. J Comput Chem 24:1348–1356
MacKerell AD (2004) Empirical force fields for biological macromolecules: overview and issues. J Comput Chem 25:1584–1604
MacKerell AD Jr, Bashford D, Bellott M, Dunbrack RL Jr, Evanseck JD, Field MJ, Fischer S, Gao J, Guo H, Ha S, Joseph-McCarthy D, Kuchnir L, Kuczera K, Lau FTK, Mattos C, Michnick S, Ngo T, Nguyen DT, Prodhom B, Reiher WE, Roux B, Schlenkrich M, Smith JC, Stote R, Straub J, Watanabe M, Wiorkiewicz-Kuczera J, Yin D, Karplus M (1998) All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B 102:3586–3616
MacKerell AD Jr, Feig M, Brooks CL III (2004a) Extending the treatment of backbone energetics in protein force fields: limitations of gas-phase quantum mechanics in reproducing protein conformational distributions in molecular dynamics simulations. J Comput Chem 25:1400–1415
MacKerell AD Jr, Feig M, Brooks CL III (2004b) Improved treatment of the protein backbone in empirical force fields. J Am Chem Soc 126:698–699
Maiorov VN, Crippen GM (1995) Size-independent comparison of protein 3-dimensional structures. Proteins-Struct Funct Genet 22:273–283
Micheletti C, Laio A, Parrinello M (2004) Reconstructing the density of states by history-dependent metadynamics. Phys Rev Lett 92:170601
Moult J, Fidelis K, Kryshtafovych A, Rost B, Tramontano A (2009) Critical assessment of methods of protein structure prediction – Round VII. Proteins-Struct Funct Bioinform 77:1–4
Neria E, Fischer S, Karplus M (1996) Simulation of activation free energies in molecular systems. J Chem Phys 105:1902–1921
Norberg J, Nilsson L (2000) On the truncation of long-range electrostatic interactions in DNA. Biophys J 79:1537–1553
Okamoto Y (2004) Generalized-ensemble algorithms: enhanced sampling techniques for Monte Carlo and molecular dynamics simulations. J Mol Graph Model 22:425–439
Onufriev A, Case DA, Bashford D (2002) Effective Born radii in the generalized Born approximation: the importance of being perfect. J Comput Chem 23:1297–1304
Oostenbrink C, Villa A, Mark AE, van Gunsteren WF (2004) A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6. J Comput Chem 25:1656–1676
Osguthorpe DJ (2000) Ab initio protein folding. Curr Opin Struct Biol 10:146–152
Reva BA, Finkelstein AV, Skolnick J (1998) What is the probability of a chance prediction of a protein structure with an RMSD of 6 Ångstrom? Fold Des 3:141–147
Rohl CA, Strauss CE, Misura KM, Baker D (2004) Protein structure prediction using Rosetta. Methods Enzymol 383:66–93
Schaefer M, Bartels C, Karplus M (1999) Solution conformations of structured peptides: continuum electrostatics versus distance-dependent dielectric functions. Theor Chem Acc 101:194–204
Schreiber H, Steinhauser O (1992) Cutoff size does strongly influence molecular dynamics results on solvated polypeptides. Biochemistry 31:5856–5860
Simons KT, Bonneau R, Ruczinski I, Baker D (1999) Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 37(Suppl 3):171–176
Sitkoff D, Sharp KA, Honig B (1994) Accurate calculation of hydration free-energies using macroscopic solvent models. J Phys Chem 98:1978–1988
Skolnick J, Kolinski A, Ortiz AR (1997) MONSSTER: a method for folding globular proteins with a small number of distance restraints. J Mol Biol 265:217–241
Still WC, Tempczyk A, Hawley RC, Hendrickson T (1990) Semianalytical treatment of solvation for molecular mechanics and dynamics. J Am Chem Soc 112:6127–6129
Stumpff-Kane AW, Maksimiak K, Lee MS, Feig M (2008) Sampling of near-native protein conformations during protein structure refinement using a coarse-grained model, normal modes, and molecular dynamics simulations. Proteins 70:1345–1356

Tanizaki S, Feig M (2005) A generalized Born formalism for heterogeneous dielectric environments: application to the implicit modeling of biological membranes. J Chem Phys 122:124706
Wesson L, Eisenberg D (1992) Atomic solvation parameters applied to molecular dynamics of proteins in solution. Protein Sci 1:227–235
Zhang Y, Arakaki AK, Skolnick JR (2005) TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins-Struct Funct Bioinform 61:91–98
Zhang Y, Skolnick J (2005) The protein structure prediction problem could be solved using the current PDB library. Proc Nat Acad Sci USA 102:1029–1034

Chapter 5

Effective All-Atom Potentials for Proteins Anders Irbäck and Sandipan Mohanty

Abstract Experimental developments are steadily improving our knowledge of the dynamics and interactions of proteins, knowledge that is important for understanding living systems at the molecular level. Computer simulations provide a complementary tool in this endeavor and offer unique opportunities to elucidate salient features that remain experimentally inaccessible. Simulations are being used to study many different aspects of protein dynamics of varying computational complexity, which makes it necessary to choose models for the calculations with care, depending on the problem at hand. The models in use today range from reduced models with one or a few sites per amino acid to all-atom models with explicit solvent molecules. All-atom simulations have traditionally often been based on quite elaborate potentials. This level of detail may be needed in many applications, like structure refinement, but is not an obvious choice when studying processes like folding and aggregation. Some recent (implicit solvent) models use simpler and computationally more convenient potentials, while retaining an all-atom description of the protein chains. This article discusses some potentials of this type. The usefulness of the approach is illustrated by briefly describing studies, based on one of the potentials, of folding thermodynamics, mechanical unfolding and aggregation.

5.1 Introduction A molecular understanding of living systems requires knowledge of protein dynamics (Smock and Gierasch 2009). The relevant dynamics of a protein may consist of small fluctuations about its native structure or reorientations of its ordered parts relative to each other. In either case, a tiny fraction of the conformational space is explored. For flexible proteins, perhaps with large intrinsically disordered parts

A. Irbäck (B) Computational Biology & Biological Physics, Department of Theoretical Physics, Lund University, Lund, Sweden e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_5, © Springer Science+Business Media, LLC 2011

(Dunker et al. 2002; Uversky 2002; Dyson and Wright 2005), the situation is different. When studying such proteins or conformational conversion processes like folding or amyloid aggregation, the competition between different minima on the free-energy landscape inevitably comes into focus. Studying these systems by computer simulation is a challenge, because proper sampling of all relevant free-energy minima must be ensured. This goal is very hard to achieve if explicit solvent molecules are included in the simulations. Many studies are therefore carried out using implicit representations of the solvent. Additional speed can be gained if the degrees of freedom are further reduced, by turning to a coarse-grained protein representation (see Chapters 2, 3, 4, 6, 7, 8, and 12). This step entails a loss of detail in the protein description that may or may not be critical, depending on the question at hand. In this chapter, we will briefly introduce a few intermediate-resolution descriptions, which retain an all-atom representation of the protein chains while treating the solvent implicitly. Incorporating all protein atoms into the computational model has the advantage of simplifying the description of many interactions, like excluded-volume effects. The price to be paid is the increased computational complexity, as simulating protein folding at the atomic level is a notoriously difficult problem. However, the actual computational cost depends not only on the geometric representation, but also very strongly on the form of the force field, for two important reasons. The first involves the cost of a single energy evaluation or a single simulation step, while the second involves the required length, or number of steps, of the simulation. For a protein with $N_{\mathrm{at}}$ atoms, a potential with pairwise additive interactions consists of a sum of $\sim N_{\mathrm{at}}^2$ terms. 
The number of elementary computational operations required for calculating this sum depends critically on whether or not the interactions can be effectively treated as short-ranged. If so, the use of cell list techniques can reduce the required number of operations from $\sim N_{\mathrm{at}}^2$ to $\sim N_{\mathrm{at}}$ (Frenkel and Smit 1996) and thereby reduce the cost of each step in a simulation.

The form of the force field also dictates the shape of the free-energy surface. This strongly affects the number of steps required for proper sampling of the conformational space. The use of a force field with unphysical local free-energy minima, in which the system might get trapped, risks making the simulations unnecessarily slow.

The traditional approach to all-atom protein simulations is to use the molecular dynamics (MD) method and force fields like Amber, CHARMM, OPLS, or GROMOS; for a review, see Ponder and Case (2003). These potentials are based on explicit and detailed, although still approximate, representations of the physical forces involved. These details may be important to include, for instance, when modeling ligand binding, in order to accurately describe small perturbations of a folded protein about its native state. However, the number of parameters involved is large, and it is extremely difficult to ascertain whether the parameter values apply for every possible conformation of a protein. This consideration becomes important in the study of large conformational changes, as in protein folding. Consider a peptide with a helical native state. High precision of a force field near this helical state does

5

Effective All-Atom Potentials for Proteins


not guarantee a correct ranking of it relative to completely different states, such as a β-sheet state. In simulations of protein folding from a random unfolded state, the properties of the force field over the whole conformation space are relevant. In addition, the volume of the conformation space grows exponentially with the number of residues. This makes it desirable to develop alternative force fields in which the native states are more easily identified as free-energy minima, even at the expense of some details.

In this chapter, we discuss some all-atom effective force fields with this goal. They are designed with a fast exploration of the global conformation space in mind, for which they rely on effective descriptions of the forces involved. These effective potentials are furthermore used together with simulation techniques which perform the conformational search faster than does conventional MD.

In the following, we briefly describe four recently developed effective potentials for implicit solvent all-atom studies of proteins: the μ-potential of the Shakhnovich group (Hubner et al. 2005), the PFF potential of the Wenzel group (Verma and Wenzel 2009), the discrete molecular dynamics (DMD)-adapted potential of the Dokholyan group (Ding et al. 2008), and a potential developed by us (Irbäck et al. 2009) which we will refer to as the Lund potential. Our purpose is to introduce the reader to some of the prevailing ideas behind this class of potentials rather than to compile an exhaustive list of all potentials that fall in this category. It has, however, been demonstrated that each of the above-mentioned potentials is capable of folding small proteins with different secondary structure. There is a wide range of possible applications of these methods, and they have indeed been used to study problems other than that of predicting folded structures.
To illustrate this, we briefly discuss some studies of folding thermodynamics, mechanical unfolding, and aggregation that we performed using the Lund potential. The four discussed potentials are largely or entirely sequence-based, meaning that their parameters are determined by the amino acid sequence alone. We do not discuss native-centric G¯o-type potentials here (G¯o 1983), which require prior knowledge of the native structure. G¯o-type potentials have been frequently used in recent years, primarily to characterize the folding pathways of natively folded proteins. In a generalized form, the same approach has also been used to study dimerization processes involving domain swapping (Ding et al. 2002; Yang et al. 2004). Using entirely sequence-based potentials instead is much more challenging, but has obvious advantages, for example, when studying proteins with large disordered parts.

5.2 Effective Potentials

In this section, we briefly describe the main features of the four above-mentioned implicit solvent all-atom effective force fields. Here, the term “all-atom” indicates that all heavy (non-hydrogen) protein atoms are explicitly represented; to what extent hydrogen atoms are represented individually varies among the four discussed potentials.


Although we categorize them here in the same class, these four potentials are quite different from one another, both in algebraic form and in the procedures followed to derive their parameters. Furthermore, they have been used together with different simulation algorithms, and for partly different purposes.

The μ-potential (Hubner et al. 2005) of the Shakhnovich group is composed of two terms, an explicit backbone hydrogen bond potential and a contact potential governing pairwise interactions between hard-sphere atoms. The strength of the contact interaction is determined by the types of the interacting atoms. For this purpose, the backbone N, Cα, C′, and carbonyl O atoms are regarded as defining their own types irrespective of their residue. Except in cases of symmetry, each heavy side chain atom is assigned its own type. Hydrogen atoms are not represented individually. This gives a total of 84 atom types and 3570 interaction strength parameters. These parameters are determined through a systematic statistical procedure, using Protein Data Bank (PDB) structures. The backbone hydrogen bond potential also has a simple square well form depending on four pairwise atom–atom distances. The μ-potential has been used together with Monte Carlo (MC) methods. First, it was applied to small helical proteins (Hubner et al. 2005). Recently, a local backbone term was added to the μ-potential (Yang et al. 2007). This term is based on PDB-derived information about conformational preferences of triplets of amino acids and is amino acid specific. With a modulation of the hydrogen bond term based on PSIPRED (McGuffin et al. 2000), this extended potential was found to reproduce well the native structures of a diverse set of 16 polypeptides with 25–77 residues, including α and β as well as mixed α/β proteins (Yang et al. 2007).
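The parameter count quoted above for the μ-potential matches one strength per unordered pair of atom types, like pairs included: 84 × 85 / 2 = 3570. The Python sketch below checks this arithmetic and shows how a type-dependent contact energy of square-well form might be summed. The single contact distance `r_contact`, the dictionary layout of `strength`, and the toy numbers are assumptions made for illustration only. The actual functional form, distances, and parameter values of the μ-potential are given in Hubner et al. (2005), not here, and the hard-sphere repulsion is omitted.

```python
import math


def n_pair_parameters(n_types):
    """Number of unordered pairs of atom types, including like-like pairs."""
    return n_types * (n_types + 1) // 2


def contact_energy(coords, types, strength, r_contact):
    """Sum type-dependent contact energies over all atom pairs in contact.

    `strength[(a, b)]` with a <= b holds the (hypothetical) contact energy
    between atom types a and b; two atoms are 'in contact' if they are
    closer than `r_contact`.
    """
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < r_contact:
                key = (min(types[i], types[j]), max(types[i], types[j]))
                energy += strength[key]
    return energy


print(n_pair_parameters(84))  # 3570, the number of strength parameters quoted above

# Toy example with two atom types and made-up strengths.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
toy_types = [0, 1, 0]
toy_strength = {(0, 0): -1.0, (0, 1): 0.5, (1, 1): -0.2}
print(contact_energy(toy_coords, toy_types, toy_strength, r_contact=2.0))
```

In the toy example, only the first two atoms are in contact, so the energy is the single (0, 1) strength.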
The PFF potential of the Wenzel group, in its original form (Herges and Wenzel 2004), contains Lennard–Jones interactions, charge–charge interactions, an explicit backbone hydrogen bond potential, and an implicit solvent model based on exposed surface area. Non-polar CHn groups are represented by united atoms. The parameters of the solvent model are optimized to stabilize the native structure of a single protein, the villin headpiece, against competing low-energy decoys. Other proteins are studied using the same parameters. This version of the PFF potential was applied to five small helical proteins (Herges and Wenzel 2004). Recently, a local backbone potential, consisting of a torsion angle potential and residue-specific electrostatic interactions (Avbelj and Moult 1995), was added to the PFF potential (Verma and Wenzel 2009). Using the new potential, native-like minimum energy conformations were observed for 20 polypeptides with 12–56 amino acids and varying secondary structure (Verma and Wenzel 2009). The calculations were performed using basin-hopping techniques for energy minimization (Verma et al. 2006). The potential of the Dokholyan group (Ding et al. 2008) is adapted to the DMD method, which is an accelerated form of MD (Rapaport 1997). The speed up is achieved by approximating the terms in the interaction potential with step functions. Under this approximation, the dynamics is driven by simple collisions rather than by gradients of the potential. The all-atom DMD potential of this group is composed of a local potential and three terms representing Lennard–Jones interactions, solvation effects, and hydrogen bonding. Only polar hydrogen atoms are represented individually. Solvation effects are modeled in terms of effective pairwise interactions, using


the potential of Lazaridis and Karplus (1999). DMD simulations with this approach were performed for six polypeptides with 20–60 residues and different secondary structure (Ding et al. 2008). Folded structures similar to the experimental structures were observed for all six sequences. For the three smallest ones, the folding thermodynamics were studied as well.

The Lund potential, like the μ-potential, has been developed and parameterized from scratch (Irbäck et al. 2003; Irbäck and Mohanty 2005; Irbäck et al. 2009). The calibration procedure is, however, very different from that followed for the μ-potential; the calibration of the Lund potential is based on full-scale thermodynamic simulations of peptides, rather than on statistical analyses of PDB structures. The Lund potential, in which all hydrogens are represented individually, consists of four major terms: a local potential, an excluded-volume term, hydrogen bonds, and a residue-specific interaction term between pairs of side chains. The main contribution to the fourth term approximates implicit solvent effects, in particular the hydrophobic effect. The side chain–side chain potential also contains charge–charge interactions, which are assumed to be screened by the (implicit) solvent and therefore effectively short-ranged. A full description of the different terms of this potential, on which the calculations discussed below are based, can be found elsewhere (Irbäck et al. 2009). The simulations with the Lund potential have been carried out using PROFASI, which is an open-source MC package for folding and aggregation simulations, written in C++ (Irbäck and Mohanty 2006). The potential was shown to capture both structural and thermodynamic properties of a set of 17 peptides with 10–37 residues and diverse secondary structure. In addition, it was applied to three 49–67-residue systems (a heterodimeric coiled-coil system, a mixed α/β protein, and a three-helix-bundle protein), with very good results (Irbäck et al.
2009). These four potentials share the ability to fold proteins with different secondary structure, using a single parameter set, which is a fundamental and non-trivial requirement (Yoda et al. 2004; Shell et al. 2008). What makes these four potentials successful in this respect? A precise answer to this question is difficult to give, because the terms of a potential are interdependent and dependent on the choice of geometric representation. It is clear, however, that local backbone energetics influence secondary-structure preferences in a very direct manner and therefore play a key role. Indeed, the above-mentioned revisions of the μ- and PFF potentials, which were carried out to be able to describe β-sheet proteins, focused on local interactions. It is also worth noting that all the four potentials contain an explicit hydrogen bond term. Moreover, the hydrogen bond energy has a directional dependence in all four cases, although its precise form varies. By contrast, there are large differences among the potentials in how solvation effects are modeled. Although this might change for larger proteins, it thus seems that the exact form of these effects is not extremely important for small proteins. These computational models have been studied using DMD, MC methods and energy minimization techniques like basin-hopping. MC methods have been applied to systems with torsion angles as their degrees of freedom. A simple but useful MC move for such systems consists of rotating single, randomly selected torsion


angles individually. For a backbone angle, this move can lead to a highly non-local deformation of the polypeptide. For efficient sampling of compact states, it is useful to also include a local or semi-local backbone move. An example of such a move is Biased Gaussian Steps, which simultaneously updates up to eight consecutive backbone angles in a manner that keeps the ends of the segment approximately fixed (Favrin et al. 2001). The DMD and MC approaches can be much more efficient than conventional MD. To further speed up the exploration of the conformational space, generalized-ensemble techniques are now routinely used (see Chapter 9).
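The single-angle move and its non-local character can be illustrated with a deliberately simplified Python sketch. It uses a toy two-dimensional chain with unit bond lengths, in which turn angles stand in for the backbone torsion angles of a real polypeptide, together with the standard Metropolis acceptance rule. The chain model, the function names, and the step size are assumptions made for illustration. The sketch does not implement Biased Gaussian Steps or the move set of any package mentioned in this chapter.

```python
import math
import random


def build_chain(angles):
    """Return bead positions of a 2D chain with unit bonds and given turn angles."""
    x, y, heading = 0.0, 0.0, 0.0
    beads = [(x, y)]
    for a in angles:
        heading += a
        x += math.cos(heading)
        y += math.sin(heading)
        beads.append((x, y))
    return beads


def metropolis_accept(delta_e, kT, rng):
    """Metropolis rule: accept downhill moves, uphill ones with exp(-dE/kT)."""
    return delta_e <= 0.0 or rng.random() < math.exp(-delta_e / kT)


def single_angle_move(angles, energy_fn, kT, rng, max_step=0.5):
    """Propose a change of one randomly chosen angle; return (angles, accepted)."""
    k = rng.randrange(len(angles))
    trial = list(angles)
    trial[k] += rng.uniform(-max_step, max_step)
    delta_e = energy_fn(trial) - energy_fn(angles)
    if metropolis_accept(delta_e, kT, rng):
        return trial, True
    return list(angles), False


# Non-locality: changing one angle in a straight 10-bond chain leaves the
# beads before the pivot in place but swings the whole downstream part.
straight = [0.0] * 10
bent = list(straight)
bent[3] += 0.2
before, after = build_chain(straight), build_chain(bent)
shifts = [math.dist(p, q) for p, q in zip(before, after)]
print([round(s, 3) for s in shifts])
```

The printed displacements are zero up to the pivot bead and then grow linearly with the distance from it. In a compact state, such large downstream displacements tend to produce steric clashes, which is why a local or semi-local move is a useful complement.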

5.3 Applications

The ability of the potentials discussed above to reproduce folded structures is an important benchmark. The potentials can also be used to study a range of other problems, where sometimes the experimental information available is very limited. In this section, we briefly describe some studies of folding thermodynamics, mechanical unfolding, and aggregation based on the Lund potential.

5.3.1 Folding Thermodynamics

For many proteins, a proper structural characterization requires more than specifying a single conformation. This is often the case for small polypeptides, which tend to be unstable. Furthermore, recent findings indicate that proteins with large unstructured regions are more common than previously thought (Dunker et al. 2002; Uversky 2002; Dyson and Wright 2005). When studying these fully or partially disordered systems, where fluctuations play a central role, thermodynamic equilibrium methods are more relevant than energy minimization techniques. However, performing proper thermodynamic simulations at the atomic level has long been a challenge even for small peptides. With potentials like those above and with today’s computers, the situation is different. Statistically reliable results can now be easily obtained for peptides with, say, 20 residues, as illustrated by a recent study by us (Irbäck et al. 2009).

Here we studied the folding thermodynamics for 17 peptides with 10–37 residues (see Fig. 5.1). For each peptide, a set of 10 independent runs was performed, starting from different random initial conformations. Each run was long enough to contain about a hundred independent folding/unfolding events. This required 1 day per run on a 2.0-GHz AMD Opteron processor. These calculations showed that our potential not only folds these sequences to structures similar to their experimental structures, but also captures many experimentally observed stability differences among the peptides.

As an example, let us consider the β-hairpin GB1p, the second β-hairpin from the B1 domain of protein G (residues 41–56), and two designed variants of GB1p with enhanced stability, called GB1m2 and GB1m3 (Fesinmeyer et al. 2004). The isolated GB1p fragment


Fig. 5.1 Schematic illustration of native geometries studied using the Lund potential. (a) the Trp-cage (1L2Y), (b) an α-helix, (c) a β-hairpin, (d) a three-stranded β-sheet, (e) an α-helix dimer (1U2U), (f) a three-helix bundle (1LQ7), and (g) a mixed α/β protein (2GJH). Figure adapted from Irbäck et al. (2009)

has been shown to form a marginally stable β-hairpin (Blanco et al. 1994). By circular dichroism (CD) and nuclear magnetic resonance (NMR) experiments, the folded population of GB1p was found to be ∼0.30 at 298 K (Fesinmeyer et al. 2004). At the same temperature, GB1m2 and GB1m3 have estimated folded populations of 0.74 ± 0.05 and 0.86 ± 0.03, respectively (Fesinmeyer et al. 2004). Using tryptophan fluorescence instead of CD and NMR, a somewhat higher apparent folded population was obtained for GB1p (Muñoz et al. 1997), indicating deviations from a simple two-state behavior. We studied the melting properties of these peptides using two different observables, namely, the hydrophobicity energy Ehp and a hydrogen bond-based nativeness measure called qhb (qhb = 1 if at most two native backbone hydrogen bonds are missing and qhb = 0 otherwise). Figure 5.2 shows the calculated temperature dependence of these observables. Two things are worth noting. First, the apparent folded populations based on Ehp are higher than those based on qhb . This probe-dependence is consistent with the experimental findings for GB1p. Second, the calculated stability order among the three peptides is the same independent of which of the two observables we study. The order is GB1p < GB1m2 < GB1m3, which agrees with experimental data. We also applied our potential to three larger systems with 49–67 residues (Irbäck et al. 2009), which were more challenging to simulate. In fact, for a mixed α/β protein with 49 residues, we were unable to obtain statistically accurate thermodynamic results, although the global free-energy minimum could be identified with confidence. On the other hand, the 67-residue three-helix bundle protein GS-α3W (Johansson et al. 1998; Dai et al. 2002) turned out to be quite easy to characterize thermodynamically. Our simulations for these two proteins illustrate that the

[Fig. 5.2, plot panels: (a) ⟨Ehp⟩ (kcal/mol) against T (K); (b) ⟨qhb⟩ against T (K); curves for GB1p, GB1m2, and GB1m3; temperature axis 280–380 K]

Fig. 5.2 Calculated melting behavior of the β-hairpins GB1p, GB1m2, and GB1m3. The lines are fits of a simple two-state model to data. (a) Hydrophobicity energy Ehp against temperature. (b) The hydrogen bond-based nativeness parameter qhb against temperature. Figure reprinted from Irbäck et al. (2009)

computational complexity depends not only on protein size, but also very strongly on the shape of the folding free-energy landscape. Our GS-α3W simulations took about 10 days of computing time on 64 AMD Opteron processors running at 2.0 GHz and resulted in about 800 independent folding events to the native state. On average, we thus obtained more than one folding event per day of single-processor time, which is very good for a protein of this size. A study of GS-α3W using the same geometric representation but the ECEPP/3 force field was recently reported (Meinke and Hansmann 2009). Compared to that study, our simulations found about a hundred times the number of independent folding events, while consuming far fewer computing resources. Figure 5.3a shows how the probabilities for structures with different backbone root mean square deviation (bRMSD) vary with temperature in our GS-α3W

Fig. 5.3 Simulation results for GS-α3W. (a) Variation of histogram of backbone RMSD (bRMSD) with temperature. At high temperatures, there is a broad distribution of bRMSD with values > 10 Å. At lower temperatures there are three clearly separated clusters. Representative structures from these clusters are also shown aligned with the native structure. (b) Temperature dependence of specific heat, Cv , and the ratio hr of the observed helix content and the helix content of the native structure. Figure adapted from Irbäck et al. (2009)


simulations. Clearly, the protein makes a transition from a rather continuous distribution of bRMSD at high temperatures to a distribution dominated by three well-separated clusters. Analysis of the structures at the lower temperatures shows that all three free-energy minima consist almost exclusively of structures with all three helices of GS-α3W formed. The plot of the ratio of the observed helix content and the helix content of the native state, shown in Fig. 5.3b, further supports this idea. The average value of this ratio approaches 1 as the temperature decreases below 300 K. From CD measurements, the stability of GS-α3W was estimated to be 4.6 kcal/mol at 298 K (Johansson et al. 1998). The specific heat curve, also shown in Fig. 5.3b, indicates that the formation of these structures correlates with the steepest change in energy. The cluster with a center at bRMSD ∼3 Å dominates at the lowest temperatures. The structures contributing to the cluster with ∼8–9 Å bRMSD superficially look like well-folded three-helix bundles. However, the arrangement of the helices is topologically distinct from the native arrangement. The cluster seen at larger bRMSD values is broader and consists of a host of structures in which two of the helices make a helical hairpin, but the third helix is not bound to it. The unbound helix could be at either side of the chain. According to our model therefore, the population at the lowest temperatures consists of ∼80% genuinely native structures, ∼10% three-helix bundles with wrong topology, and ∼10% other structures with as much helix content as the native state. In order to experimentally determine the true folded population of the protein, the experimental probe must be able to distinguish the native fold from the other helix-rich structures described here.
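The folded populations and the stability quoted in this section can be related to each other under a strict two-state assumption. The Python sketch below converts between an unfolding free energy and a folded population at a given temperature, and also encodes the hydrogen bond-based nativeness measure qhb as defined above. Reading the experimental "stability" as an unfolding free energy, and applying a two-state picture to it, are assumptions made here for illustration. As discussed above, the simulations indicate that several helix-rich states may be populated, so these numbers should not be over-interpreted.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def folded_population(delta_g_unfold, temperature):
    """Two-state folded fraction for an unfolding free energy dG = G_U - G_F."""
    return 1.0 / (1.0 + math.exp(-delta_g_unfold / (R_KCAL * temperature)))


def unfolding_free_energy(p_folded, temperature):
    """Inverse relation: dG = RT ln(p_F / (1 - p_F)), in kcal/mol."""
    return R_KCAL * temperature * math.log(p_folded / (1.0 - p_folded))


def q_hb(n_native_hbonds_total, n_native_hbonds_present):
    """Nativeness measure: 1 if at most two native backbone H bonds are missing."""
    return 1 if n_native_hbonds_total - n_native_hbonds_present <= 2 else 0


# Experimental folded populations at 298 K quoted in the text.
for name, p in [("GB1p", 0.30), ("GB1m2", 0.74), ("GB1m3", 0.86)]:
    print(name, round(unfolding_free_energy(p, 298.0), 2), "kcal/mol")

# A stability of 4.6 kcal/mol at 298 K, read as a two-state unfolding free energy.
print(folded_population(4.6, 298.0))
```

For GB1p, the quoted population of about 0.30 corresponds to a slightly negative unfolding free energy of roughly −0.5 kcal/mol, in line with its description as marginally stable. A two-state stability of 4.6 kcal/mol would correspond to a folded population above 0.999 at 298 K.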

5.3.2 Mechanical Unfolding

Single-molecule manipulation by atomic force microscopy (AFM) or optical tweezers has become an important and widely used method for examining mechanical properties of proteins (Forman and Clarke 2007). While the monitoring of individual molecules removes ambiguities of bulk experiments, there are still important issues to be addressed in order to extract as much information as possible from these experiments. How does one, for example, extract properties of the equilibrium free-energy landscape at zero force from observed unfolding times at non-zero force? Further, the structural information provided by the experiments is limited, because the chain extension is typically the only conformational variable monitored.

We have used the Lund potential to investigate the mechanical unfolding of two proteins, ubiquitin and the fibronectin domain FnIII-10. AFM experiments had shown that both ubiquitin (Schlierf et al. 2004) and FnIII-10 (Li et al. 2005) sometimes unfold via intermediate states when stretched by an external force. The main purpose of our studies was to characterize the mechanical unfolding pathways and intermediates of these proteins.

Following the experiments (Schlierf et al. 2004), we studied the unfolding behavior of the 76-residue protein ubiquitin under constant pulling forces in the range of 100–200 pN (Irbäck et al. 2005). As in the experiments, both one-step unfolding and


unfolding through intermediate states were observed. Moreover, the intermediates seen in our simulations had extensions consistent with experimental data. Having verified this, we characterized the unfolding pathways by examining the order of breaking of secondary-structure elements. We found that there was a statistically preferred unfolding order, which was the same at the different forces studied. Furthermore, the same unfolding order was observed in events both with and without intermediate states. The mechanical unfolding order we found does not agree with thermal unfolding data from experiments at zero force (Cordier and Grzesiek 2002; Chung et al. 2005). Therefore, we also performed a study of the thermal unfolding of ubiquitin (Irbäck and Mitternacht 2006). It turned out that our computational model actually predicts a different thermal unfolding order, consistent with the thermal unfolding experiments. It is also worth noting that several subsequent computational studies of the mechanical unfolding of ubiquitin, based on completely different potentials, have obtained results similar to ours (Li et al. 2007; Kleiner and Shakhnovich 2007; Imparato and Pelizzola 2008). Fibronectin is a giant multimodular cell adhesion protein. The most common type of module is called FnIII and has ∼90 residues. Of particular interest is the tenth FnIII module, FnIII-10, which contains an Arg-Gly-Asp motif that is critical for the cell–fibronectin interaction. The mechanical properties of FnIII-10 have been extensively studied by both experimental and computational methods. The occurrence of unfolding intermediates was predicted by an early simulation study (Paci and Karplus 1999) and verified by AFM experiments (Li et al. 2005). However, computer simulations have also been performed by many other groups (Klimov and Thirumalai 2000; Craig et al. 
2004; Sułkowska and Cieplak 2007; Li 2007), and as yet there is no consensus on the precise nature of the unfolding pathways and intermediates. Recently, we therefore performed an extensive study of how the unfolding behavior of FnIII-10 depends on the pulling conditions (Mitternacht et al. 2009). Six constant pulling forces and four constant pulling velocities were studied, and 100 unfolding trajectories were generated in each of these ten cases. We observed both apparent two-state unfolding and several unfolding pathways involving one of three major, mutually exclusive intermediate states. Unlike for ubiquitin, we thus found more than one major unfolding pathway. The relative weights of the different unfolding pathways were further found to depend strongly on the pulling conditions. For weak pulling (low force or velocity), all three major intermediates occurred with a significant frequency. For strong pulling (high force or velocity), one of them dominated over the other two. All three of our major intermediates have extensions consistent with existing experimental data. With higher experimental resolution, it should become possible to discriminate between these three intermediates and thereby test the picture emerging from our study.

For properties that can be studied experimentally, like the extension of intermediates and the unfolding force, we obtained results in quite good agreement with experimental data for both ubiquitin and FnIII-10. This agreement is encouraging, especially since both studies could be carried out without having to adjust any parameter of the potential.
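This chapter does not spell out how a constant pulling force enters the simulations. A common choice, assumed in the Python sketch below, is to add a term −F·L to the potential energy, where L is the end-to-end extension along the pulling direction. The sketch also carries out the unit conversion needed to compare forces in pN and lengths in Å with energies in kcal/mol. All function names are hypothetical.

```python
import math

# 1 pN x 1 Angstrom = 1e-22 J; multiply by Avogadro's number and divide
# by 4184 J/kcal to obtain kcal/mol.
PN_ANGSTROM_TO_KCAL_PER_MOL = 1.0e-22 * 6.02214076e23 / 4184.0


def extension_along_force(first_atom, last_atom, force_direction):
    """Project the end-to-end vector (in Angstrom) on the pulling direction."""
    norm = math.sqrt(sum(c * c for c in force_direction))
    return sum(
        (b - a) * c / norm
        for a, b, c in zip(first_atom, last_atom, force_direction)
    )


def energy_with_constant_force(e_potential, force_pn, extension_angstrom):
    """Total energy E - F*L in kcal/mol for a constant stretching force F."""
    work = force_pn * extension_angstrom * PN_ANGSTROM_TO_KCAL_PER_MOL
    return e_potential - work


# Work term for a 100 pN force acting over an extension of 10 Angstrom.
ext = extension_along_force((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print(energy_with_constant_force(0.0, 100.0, ext))
```

With this convention, a force of 100 pN lowers the energy by about 14.4 kcal/mol for every 10 Å of extension, which gives a sense of scale for forces in the 100–200 pN range studied for ubiquitin.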


5.3.3 Aggregation

The ability to aggregate into β-sheet-rich fibrils, so-called amyloid fibrils, is a property shared by many polypeptides, including unstructured polypeptides as well as natively folded proteins from varying structural classes (Selkoe 2003; Chiti and Dobson 2009). The mechanisms of amyloid aggregation are currently being intensely studied, primarily because protein aggregates of this type have been implicated in several diseases, such as Alzheimer’s and Parkinson’s diseases. Of particular interest are the initial stages of the aggregation process, because a growing body of evidence indicates that the cytotoxic components in these diseases are small oligomers formed early in the process rather than the full fibrils. Unfortunately, these small aggregates are notoriously hard to characterize. In the quest to do so, computer simulations can play an important role by providing complementary information not accessible experimentally.

Simulating protein aggregation is a challenge that has been approached in several different ways (Ma and Nussinov 2006). One approach has been to study the stability of preformed structures or the addition of polypeptides to such a structure. Another simplification that has been made is to use coarse-grained protein descriptions. All-atom studies of spontaneous aggregation, starting from random initial conformations, have typically been limited to systems consisting of a small number of short polypeptides.

The Lund potential has been used to study spontaneous aggregation for fibril-forming fragments of the Aβ peptide and protein tau, both associated with Alzheimer’s disease. Aβ and protein tau are the main components of amyloid plaques and neurofibrillary tangles, respectively. One of the best-studied fibril-forming peptides is the 7-residue Aβ(16–22) fragment, which contains residues 17–21, a segment commonly referred to as the central hydrophobic core of Aβ.
In six-chain simulations for Aβ(16–22), we observed the spontaneous formation of a β-barrel (Irbäck and Mitternacht 2008). In line with previous Aβ(16–22) simulations (Hwang et al. 2004; Klimov et al. 2004; Santini et al. 2004; Röhrig et al. 2006; Gnanakaran et al. 2006; Meinke and Hansmann 2007; Krone et al. 2008), a wide spectrum of metastable aggregates was observed in our study. The β-barrel stood out as by far the most long-lived aggregate, indicating that it corresponds to a relatively deep free-energy minimum.

The aggregation properties of the hydrophobic Aβ(16–22) fragment were also compared to those of the less hydrophobic Aβ(25–35) fragment, through simulations with up to 20 chains based on the same potential (Cheon et al. 2007). Aβ(16–22) was found to rapidly form disordered oligomers, which were subsequently converted into ordered oligomers through a reorganization process. By contrast, Aβ(25–35) formed ordered oligomers directly, in a one-step process. In particular, this study suggests that the character of the aggregation process critically depends on the balance between hydrophobicity and hydrogen bonding forces.

Recently, we studied spontaneous aggregation for relatively large systems of PHF6 peptides (Li et al. 2008), with up to 36 chains. The PHF6 peptide (VQIVYK) is a six-residue fragment of protein tau and is known from experiments to make amyloid fibrils with a parallel β-strand organization (Goux et al. 2004). Based on


X-ray experiments on microcrystals, it has further been proposed that the β-sheets in PHF6 fibrils, as well as in fibrils of many other peptides, are arranged in tightly packed pairs with a “dry steric zipper” organization at the sheet–sheet interface (Sawaya et al. 2007).

In our PHF6 simulations, we saw, as for Aβ(16–22), a multitude of small aggregates. A significant increase in stability was observed for aggregates above a certain critical size, indicating that the formation of stable oligomers occurs through a nucleation process. The oligomers formed in the nucleation step were β-sheet-rich, but they were not necessarily growth competent. Instead, further growth required conformational reorganization. Interestingly, growth competence was found to correlate with the alignment of the strands in the β-sheets. Figure 5.4 shows how the fractions of parallel and antiparallel β-sheet structure varied with aggregate size in our simulations. While, as mentioned, PHF6 fibrils have β-sheets with a parallel strand organization, small aggregates from the simulations show a clear statistical preference for the antiparallel organization. However, the fraction of parallel structure increased steadily with aggregate size, and in the simulated aggregates with >18 chains the parallel organization was more common than the antiparallel one.

In marked contrast to the plethora of different small aggregates, all larger simulated aggregates with 20 chains shared a common form. Invariably, these large aggregates were composed of two stacked β-sheets, as in Fig. 5.5. This form bears a suggestive resemblance to the dry steric zipper structure proposed for PHF6 fibrils (Sawaya et al. 2007). It would be very interesting to extend these calculations to even larger systems, which we hope to be able to do soon.

These studies dealt with short fragments of Aβ and protein tau. We are currently using the same methods to examine monomer and dimer properties of full-length 42-residue Aβ, for the wild-type sequence and three variants with different aggregation properties.
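The distinction between parallel and antiparallel strand organization, used above to characterize the simulated aggregates, can be made operational in several ways. The criterion actually used in Li et al. (2008) is not given in this chapter. The Python sketch below assumes one simple possibility: each strand is assigned a direction vector from its first to its last Cα atom, and two paired strands are called parallel if the dot product of their directions is positive. The pairing information is taken as given, and all names are hypothetical.

```python
def strand_direction(first_ca, last_ca):
    """Vector from the first to the last C-alpha atom of a strand."""
    return tuple(b - a for a, b in zip(first_ca, last_ca))


def is_parallel(dir_a, dir_b):
    """Paired strands are called parallel if their directions point the same way."""
    return sum(a * b for a, b in zip(dir_a, dir_b)) > 0.0


def parallel_fraction(strand_dirs, paired):
    """Fraction of paired strands that are parallel; None if there are no pairs.

    `paired` is a list of index pairs (i, j) of strands that are
    hydrogen bonded to each other in a beta-sheet.
    """
    if not paired:
        return None
    n_parallel = sum(is_parallel(strand_dirs[i], strand_dirs[j]) for i, j in paired)
    return n_parallel / len(paired)


# Toy aggregate: strands 0 and 1 run the same way, strand 2 runs the other way.
dirs = [
    strand_direction((0.0, 0.0, 0.0), (17.0, 0.0, 0.0)),
    strand_direction((0.5, 4.8, 0.0), (17.5, 4.8, 0.0)),
    strand_direction((17.0, 9.6, 0.0), (0.0, 9.6, 0.0)),
]
print(parallel_fraction(dirs, [(0, 1), (1, 2)]))
```

In the toy aggregate, one of the two strand pairs is parallel and the other antiparallel. Collecting this fraction separately for each aggregate size gives curves of the kind shown in Fig. 5.4.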

Fig. 5.4 The fractions of parallel and antiparallel β-sheet structure against aggregate size, as obtained from simulations for 24 PHF6 peptides. Figure reprinted from Li et al. (2008)


Fig. 5.5 Snapshots of two large aggregates from simulations with 24 PHF6 peptides. V1, I3, and K5 side chains are shown in gray and V4 side chains in dark, whereas Q2 and K6 side chains, for clarity, are omitted. Figure adapted from Li et al. (2008)

5.4 Summary

The computer time required to explore the conformational space of a protein depends not only on the geometric representation of the system and the simulation algorithm, but also on the form of the potential. Here, we have discussed some novel effective potentials for implicit solvent all-atom simulations. The use of effective potentials together with efficient algorithms can give a dramatic speedup of the conformational search and thereby make it possible to address questions that would otherwise be computationally out of reach. It is also likely that the discussed potentials, which are relatively new, can be refined to further increase their applicability.

References

Avbelj F, Moult J (1995) Role of electrostatic screening in determining protein main chain conformational preferences. Biochemistry 34:755–764
Blanco F, Rivas G, Serrano L (1994) A short linear peptide that folds into a native stable β-hairpin in aqueous solution. Nat Struct Biol 1:584–590

Cheon M, Chang I, Mohanty S, Luheshi LM, Dobson CM, Vendruscolo M, Favrin G (2007) Structural reorganisation and potential toxicity of oligomeric species formed during the assembly of amyloid fibrils. PLoS Comput Biol 3:e173
Chiti F, Dobson CM (2009) Amyloid formation by globular proteins under native conditions. Nat Chem Biol 5:15–22
Chung HS, Khalil M, Smith AW, Ganim Z, Tokmakoff A (2005) Conformational changes during the nanosecond-to-millisecond unfolding of ubiquitin. Proc Natl Acad Sci USA 102:612–617
Cordier F, Grzesiek S (2002) Temperature-dependence of protein hydrogen bond properties as studied by high-resolution NMR. J Mol Biol 317:739–752
Craig D, Gao M, Schulten K, Vogel V (2004) Tuning the mechanical stability of fibronectin type III modules through sequence variations. Structure 12:21–30
Dai QH, Thomas C, Fuentes EJ, Blomberg MRA, Dutton PL, Wand AJ (2002) Structure of a de novo designed protein model of radical enzymes. J Am Chem Soc 124:10952–10953
Ding F, Dokholyan NV, Buldyrev SV, Stanley HE, Shakhnovich EI (2002) Molecular dynamics simulation of the SH3 domain aggregation suggests a generic amyloidogenesis mechanism. J Mol Biol 324:851–857
Ding F, Tsao D, Nie H, Dokholyan NV (2008) Ab initio folding of proteins with all-atom discrete molecular dynamics. Structure 16:1010–1018
Dunker A, Brown C, Lawson J, Iakoucheva L (2002) Intrinsic disorder and protein function. Biochemistry 41:6573–6582
Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 6:197–208
Favrin G, Irbäck A, Sjunnesson F (2001) Monte Carlo update for chain molecules: biased Gaussian steps in torsional space. J Chem Phys 114:8154–8158
Fesinmeyer RM, Hudson FM, Andersen NH (2004) Enhanced hairpin stability through loop design: the case of the protein G B1 domain hairpin. J Am Chem Soc 126:7238–7243
Forman JR, Clarke J (2007) Mechanical unfolding of proteins: insights into biology, structure and folding. Curr Opin Struct Biol 17:58–66
Frenkel D, Smit B (1996) Understanding molecular simulations: from algorithms to applications. Academic, San Diego, CA
Gnanakaran S, Nussinov R, Garcia AE (2006) Atomic-level description of amyloid β-dimer formation. J Am Chem Soc 128:2158–2159
Gō N (1983) Theoretical studies of protein folding. Annu Rev Biophys Bioeng 12:183–210
Goux WJ, Kopplin L, Nguyen AD, Leak K, Rutkofsky M, Shanmuganandam VD, Sharma D, Inouye H, Kirschner DA (2004) The formation of straight and twisted filaments from short tau peptides. J Biol Chem 279:26868–26875
Herges T, Wenzel W (2004) An all-atom force field for tertiary structure prediction of helical proteins. Biophys J 87:3100–3109
Hubner IA, Deeds EJ, Shakhnovich EI (2005) High-resolution protein folding with a transferable potential. Proc Natl Acad Sci USA 102:18914–18919
Hwang W, Zhang S, Kamm RD, Karplus M (2004) Kinetic control of dimer structure formation in amyloid fibrillogenesis. Proc Natl Acad Sci USA 101:12916–12921
Imparato A, Pelizzola A (2008) Mechanical unfolding and refolding pathways of ubiquitin. Phys Rev Lett 100:158104
Irbäck A, Mitternacht S (2006) Thermal versus mechanical unfolding of ubiquitin. Proteins 65:759–766
Irbäck A, Mitternacht S (2008) Spontaneous β-barrel formation: an all-atom Monte Carlo study of Aβ16–22 oligomerization. Proteins 71:207–214
Irbäck A, Mitternacht S, Mohanty S (2005) Dissecting the mechanical unfolding of ubiquitin. Proc Natl Acad Sci USA 102:13427–13432
Irbäck A, Mitternacht S, Mohanty S (2009) An effective all-atom potential for proteins. PMC Biophys 2:2

Irbäck A, Mohanty S (2005) Folding thermodynamics of peptides. Biophys J 88:1560–1569
Irbäck A, Mohanty S (2006) PROFASI: a Monte Carlo simulation package for protein folding and aggregation. J Comput Chem 27:1548–1555
Irbäck A, Samuelsson B, Sjunnesson F, Wallin S (2003) Thermodynamics of α- and β-structure formation in proteins. Biophys J 85:1466–1473
Johansson JS, Gibney BR, Skalicky JJ, Wand AJ, Dutton PL (1998) A nativelike three-α-helix bundle protein from structure-based redesign: a novel maquette scaffold. J Am Chem Soc 120:3881–3886
Kleiner A, Shakhnovich E (2007) The mechanical unfolding of ubiquitin through all-atom Monte Carlo simulation with a Gō-type potential. Biophys J 92:2054–2061
Klimov DK, Straub JE, Thirumalai D (2004) Aqueous urea solution destabilizes Aβ16–22 oligomers. Proc Natl Acad Sci USA 101:14760–14765
Klimov DK, Thirumalai D (2000) Native topology determines force-induced unfolding pathways in globular proteins. Proc Natl Acad Sci USA 97:7254–7259
Krone M, Hua L, Soto P, Zhou R, Berne B, Shea JE (2008) Role of water in mediating the assembly of Alzheimer amyloid-β Aβ16–22 protofilaments. J Am Chem Soc 130:11066–11072
Lazaridis T, Karplus M (1999) Effective energy function for proteins in solution. Proteins 35:133–152
Li D, Mohanty S, Irbäck A, Huo S (2008) Formation and growth of oligomers: a Monte Carlo study of an amyloid tau fragment. PLoS Comput Biol 4:e1000238
Li L, Huang HHL, Badilla CL, Fernandez JM (2005) Mechanical unfolding intermediates observed by single-molecule force spectroscopy in a fibronectin type III module. J Mol Biol 345:817–826
Li MS (2007) Secondary structure, mechanical stability, and location of transition state of proteins. Biophys J 93:2644–2654
Li MS, Kouza M, Hu CK (2007) Refolding upon force quench and pathways of mechanical and thermal unfolding of ubiquitin. Biophys J 92:547–561
Ma B, Nussinov R (2006) Simulations as analytical tools to understand protein aggregation and predict amyloid conformation. Curr Opin Chem Biol 10:445–452
McGuffin LJ, Bryson K, Jones DT (2000) The PSIPRED protein structure prediction server. Bioinformatics 16:404–405
Meinke J, Hansmann UHE (2009) Free-energy driven folding and thermodynamics of the 67-residue protein GSαW – a large-scale Monte Carlo study. J Comput Chem 30:1642–1648
Meinke JH, Hansmann UHE (2007) Aggregation of β-amyloid fragments. J Chem Phys 126:014706
Mitternacht S, Luccioli S, Torcini A, Imparato A, Irbäck A (2009) Changing the mechanical unfolding pathway of FnIII-10 by tuning the pulling strength. Biophys J 96:429–441
Munoz V, Thompson PA, Hofrichter J, Eaton WA (1997) Folding dynamics and mechanism of β-hairpin formation. Nature 390:196–199
Paci E, Karplus M (1999) Forced unfolding of fibronectin type 3 modules: an analysis by biased molecular dynamics simulations. J Mol Biol 288:441–459
Ponder JW, Case DA (2003) Force fields for protein simulations. Adv Protein Chem 66:27–85
Rapaport DC (1997) The art of molecular dynamics simulations. Cambridge University Press, Cambridge
Röhrig UF, Laio A, Tantalo N, Parrinello M, Petronzio R (2006) Stability and structure of oligomers of the Alzheimer peptide Aβ16–22: from the dimer to the 32-mer. Biophys J 91:3217–3229
Santini S, Mousseau N, Derreumaux P (2004) In silico assembly of Alzheimer’s Aβ16–22 peptide into β-sheets. J Am Chem Soc 126:11509–11516
Sawaya MR, Sambashivan S, Nelson R, Ivanova MI, Sievers SA, Apostol MI, Thompson MJ, Balbirnie M, Wiltzius JJW, McFarlane HT, Madsen AØ, Riekel C, Eisenberg D (2007) Atomic structures of amyloid cross-β spines reveal varied steric zippers. Nature 447:453–457
Schlierf M, Li H, Fernandez JM (2004) The unfolding kinetics of ubiquitin captured with single-molecule force-clamp techniques. Proc Natl Acad Sci USA 101:7299–7304

Selkoe DJ (2003) Folding proteins in fatal ways. Nature 426:900–904
Shell MS, Ritterson R, Dill KA (2008) A test on peptide stability of amber force field with implicit solvation. J Phys Chem B 112:6878–6886
Smock RG, Gierasch LM (2009) Sending signals dynamically. Science 324:198–203
Sułkowska J, Cieplak M (2007) Mechanical stretching of proteins – a theoretical survey of the protein data bank. J Phys Condens Matter 19:283201
Uversky VN (2002) Natively unfolded proteins: a point where biology waits for physics. Protein Sci 11:739–756
Verma A, Schug A, Lee KH, Wenzel W (2006) Basin hopping simulations for all-atom protein folding. J Chem Phys 124:044515
Verma A, Wenzel W (2009) A free-energy approach for all-atom protein simulation. Biophys J 96:3483–3494
Yang JS, Chen WW, Skolnick J, Shakhnovich EI (2007) All-atom ab initio folding of a diverse set of proteins. Structure 15:53–63
Yang S, Cho SS, Levy Y, Cheung MS, Levine H, Wolynes PG, Onuchic JN (2004) Domain swapping is a consequence of minimal frustration. Proc Natl Acad Sci USA 101:13786–13791
Yoda T, Sugita Y, Okamoto Y (2004) Secondary-structure preferences of force fields for proteins evaluated by generalized-ensemble simulations. Chem Phys 307:269–283

Chapter 6

Statistical Contact Potentials in Protein Coarse-Grained Modeling: From Pair to Multi-body Potentials

Sumudu P. Leelananda, Yaping Feng, Pawel Gniewek, Andrzej Kloczkowski, and Robert L. Jernigan

Abstract The basic concepts of coarse-graining protein structures led to the introduction of empirical statistical potentials in protein computations. We review the history of the development of statistical contact potentials in computational biology and discuss the common features and differences between various pair contact potentials. Potentials derived from the statistics of non-bonded contacts in protein structures from the Protein Data Bank (PDB) are compared with potentials developed for threading purposes based on the optimization of the selection of the native structures among decoys. The energy of transfer of amino acids from water to a protein environment is discussed in detail. We suggest that a next generation of statistical contact potentials should include the effects of residue packing in proteins to improve predictions of protein native three-dimensional structures. We review existing multi-body potentials that have been proposed in the literature, including our own recent four-body potentials. We show how these are related to amino acid substitution matrices.

R.L. Jernigan (✉) Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_6, © Springer Science+Business Media, LLC 2011

6.1 Introduction

Statistical contact potential functions are knowledge-based potentials, since they are based on information extracted from sets of protein structures. These knowledge-based potentials are useful for describing, in approximate ways, the interactions among residues or atoms in proteins. They are useful mainly because it is extremely difficult, if not impossible, to formulate these interactions directly using laws of physics or to determine them experimentally. Statistical potential functions have been derived by analyzing known protein structures in the Protein Data Bank (PDB) and have been used extensively in computational studies of proteins. Usually these are derived for coarse-grained cases, as averages for individual amino acids or groups of atoms.

In this chapter we discuss the theory and methods used to derive knowledge-based potentials and review the history of the development of statistical contact potentials. Based on the approach used to develop them, knowledge-based potentials can generally be classified into two groups: those derived from statistical analysis of a protein structure database (Tanaka and Scheraga 1976; Miyazawa and Jernigan 1985; Sippl 1990) and potentials based on optimization of recognition of the native structure among a set of misfolded conformations, so-called decoys (Goldstein et al. 1992; Maiorov and Crippen 1992; Thomas and Dill 1996a). We discuss these two classes of knowledge-based potentials, giving emphasis to statistical potentials because they are the more prominent and have proven important in protein structure prediction, in studies of protein–protein interactions, and in designing protein sequences. There are three types of statistical potentials: distance-independent statistical potentials, distance-dependent ones, and geometric potentials. We will discuss in some detail distance-independent potential functions, with special reference to the Miyazawa–Jernigan contact statistical potentials, which are widely used. We will also review distance-dependent statistical potentials and geometric statistical potentials. Several multi-body potentials that were proposed in the literature (Munson and Singh 1997; Krishnamoorthy and Tropsha 2003), and recently by us (Feng et al. 2007), will be described. We compare and discuss common features and differences among the various pair contact potentials.
Potentials derived from the statistics of non-bonded contacts in protein structures in the Protein Data Bank are compared with potentials developed for threading purposes based on the optimization of the selection of the native structures among decoys. The potential server developed recently by Feng et al. (2010) for energy estimation of coarse-grained models of proteins with knowledge-based potentials is also introduced. Further, we briefly discuss the following applications of knowledge-based potentials: (i) discrimination of the native structure from decoys (Park and Levitt 1996; Zhou and Zhou 2002; Gilis 2004), (ii) protein docking (Li and Liang 2005a), (iii) design of new proteins (Hu et al. 2004), and (iv) prediction of protein stability and binding affinity (Gilis and Rooman 1996, 1997; Guerois et al. 2002; Zhou and Zhou 2002; Bordner and Abagyan 2004; Hoppe and Schomburg 2005). Finally, ways to further improve these potentials are discussed. To develop a knowledge-based potential function, three main steps are necessary: (i) a description of the sequence and structure of the native protein (the protein descriptor) in a form suitable for computations; (ii) determination of the functional form of the potential function; (iii) choice of the extraction method used to derive the potential function parameters.

6.2 History of Development of Knowledge-Based Potentials

Experimental observations led to the remarkable discovery that all information required to specify the three-dimensional structure of a protein is contained in its amino acid sequence (Anfinsen 1973; Anfinsen et al. 1961). These experiments led to the formulation of the thermodynamic hypothesis of proteins, which states that the native state of a protein corresponds to its global free energy minimum under physiological conditions and provides the basis for development of potential functions that are used to calculate the effective energy of a protein. Although this is substantially true, we now know that there are exceptions where the protein structure changes in response to local forces or local conditions. Initially, atomic potential functions were obtained using principles of physics. Here, thermodynamic properties of a system are measured experimentally and also calculated using quantum mechanics applied to small molecules. By utilizing the appropriate experimental and theoretical values, the parameters in the atomic potential are obtained. These potentials are often referred to as physics-based, physical or semi-empirical effective potential functions, or force fields (Wolynes et al. 1995; Karplus and Petsko 1990; Yang et al. 2006). Although these atomic physics-based potentials have been found promising for protein folding studies, the approach is computationally costly when applied to large proteins because of the large number of atoms involved. As an alternative, knowledge-based potentials have been developed. Here, known protein structures that have been deposited in the Protein Data Bank (PDB) are used to generate the parameters in the potential functions (Sippl 1993). These potential functions implicitly take into account many different types of physical interactions such as electrostatic, cation–π, and van der Waals interactions. 
The number of experimentally determined three-dimensional protein structures in the PDB database is increasing rapidly, and it now becomes possible to derive a variety of improved, more specific potentials. As a result knowledge-based potential functions are becoming extremely popular nowadays in computational studies such as protein structure prediction, analyzing protein–protein interactions, and designing new proteins. The knowledge-based potential functions can usually be divided into two types: atomic-level potentials (Samudrala and Moult 1998; Gatchell et al. 2000; Lu and Skolnick 2001; Zhou and Zhou 2002; McConkey et al. 2003; Hubner et al. 2005; Qiu and Elber 2005; Shen and Sali 2006) and coarse-grained potentials (Tanaka and Scheraga 1976; Miyazawa and Jernigan 1985; Hendlich et al. 1990; Sippl 1990; Hinds and Levitt 1992; Miyazawa and Jernigan 1996; Bahar and Jernigan 1997; Eisenberg et al. 1997; Simons et al. 1999; Tobi and Elber 2000; Zhang et al. 2006; Dehouck et al. 2006; Dong et al. 2006). The latter have been demonstrated to be highly effective for reducing the computational cost in modeling native protein structures, although they are sometimes considered not to be sufficiently rigorous to reflect the entire landscape of a potential energy surface (Thomas and Dill 1996b; Skolnick 2006). The performance and applicability of coarse-grained potential functions are largely modulated by the choice of the coarse-graining scheme. In

many applications, the ability to accurately calculate the potential energy based solely on Cα positions would certainly be useful. Typical examples are seen in recent studies on modeling protein chain topologies based on low-resolution density maps (Wu et al. 2005a) and on coarse-grained folding simulations based on Cα models (Wu et al. 2005b).

Knowledge-based potentials are categorized into two main groups according to how they are extracted from known protein structures. The more important of the two are the so-called statistical potentials. These potential functions are derived by a statistical analysis of protein structure databases (Tanaka and Scheraga 1976; Miyazawa and Jernigan 1985; Sippl 1990). The other group of knowledge-based potential functions is known as optimized potential functions. This type of potential function is even more empirical. To obtain these potential functions, a selected criterion is optimized, for example, the maximization of the energy gap between the known native structure and a set of alternative or decoy conformations (Dobbs et al. 2002; Hu et al. 2004; Dombkowski and Crippen 2000). Some three-body (Munson and Singh 1997) and four-body contact potentials (Krishnamoorthy and Tropsha 2003; Feng et al. 2007) have also been introduced and showed improved results.

6.2.1 Inverse Boltzmann Relationship

In an ensemble of protein molecules at thermodynamic equilibrium at temperature T, the relative probability P(s) of a microstate s with energy E(s) is given by the Boltzmann distribution

$$P(s) = \frac{\exp[-E(s)/kT]}{Z(q)} \tag{6.1}$$

where k is the Boltzmann constant and Z(q) is the partition function for the protein sequence q, given by

$$Z(q) \equiv \sum_{s} \exp[-E(s)/kT] \tag{6.2}$$

where the sum is over all microstates of the system. By inverting Eq. (6.1), E(s), which is also sometimes termed a potential of mean force, can be obtained directly in the following form:

$$E(s) = -kT \ln P(s) - kT \ln Z(q) \tag{6.3}$$

For a given temperature, Z(q) depends only on the protein sequence. Although Z(q) cannot be obtained experimentally, when the sequence q and temperature T are fixed, Z(q) is also fixed. The partition function Z(q) also depends on how the protein is described once its structure and sequence are set, e.g., the level of coarse graining.
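The round trip between Eqs. (6.1) and (6.3) can be checked numerically. The following is a minimal sketch, not part of the original derivation: the three microstate energies and the value of kT are made up for illustration, and the function names are hypothetical.

```python
import math

K_T = 0.593  # assumed value of kT (kcal/mol, roughly room temperature)


def boltzmann_probabilities(energies, kT=K_T):
    """Return (P, Z) for a list of microstate energies, Eqs. (6.1)-(6.2)."""
    factors = [math.exp(-e / kT) for e in energies]
    z = sum(factors)  # partition function Z(q)
    return [f / z for f in factors], z


def inverse_boltzmann(probabilities, z, kT=K_T):
    """Recover E(s) = -kT ln P(s) - kT ln Z(q), Eq. (6.3)."""
    return [-kT * math.log(p) - kT * math.log(z) for p in probabilities]


# Hypothetical energies (kcal/mol) of three microstates of one sequence.
energies = [-2.0, -1.0, 0.5]
probs, z = boltzmann_probabilities(energies)
recovered = inverse_boltzmann(probs, z)

# Without Z(q), -kT ln P(s) differs from E(s) by the same additive constant
# for every state, so energy differences are still recovered.
shifted = [-K_T * math.log(p) for p in probs]
```

Because Z(q) enters only as a state-independent constant, dropping it shifts all energies equally, which is why a fixed Z(q) is harmless when comparing conformations of one sequence.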

The descriptor s does not affect the probability of occupancy of different energy states. However, obtaining a real energy value from a protein descriptor s depends on the form of the potential function E(s). The general goal in statistical mechanics is to calculate the partition function and probabilities from a given energy function. Then the macroscopic thermodynamic quantities of the system can be derived. This approach in statistical mechanics requires (i) the design of an energy function which models the system reasonably well and (ii) the calculation of the partition function by analytical or numerical techniques. Potential functions can take many forms, but by far the most widely used form for coarse-grained proteins is the weighted linear sum over pairwise contacts (Tanaka and Scheraga 1976; Miyazawa and Jernigan 1985; Tobi et al. 2000; Vendruscolo and Domany 1998). In this linear approximation E(s) is given by the dot product of vectors w and s:

$$E(s) = \sum_i w_i s_i = \mathbf{w} \cdot \mathbf{s} \tag{6.4}$$

where $s_i$ is the number of occurrences of the descriptor type i and $w_i$ is the corresponding weight. The potential function is fully defined when the weight vector w is specified. For optimized knowledge-based linear potential functions, optimization procedures are used to obtain the weight vector w. For statistical knowledge-based potential functions, w is derived using a database of diverse experimentally determined protein structures and computing the frequency distribution of the structural descriptors. If the probability distribution P(s) is measured accurately, the knowledge-based potential function E(s) can be determined by Eq. (6.3). We must specify reference state interaction energies $E'(s)$ in order to obtain an effective potential energy function $\Delta E(s)$. By subtracting the reference energy from the energy of the mean force we remove all energies corresponding to a starting state, whether it be exposure to water or burial inside a protein. By subtracting $E'(s)$, the energy of the reference state (Sippl 1990), the effective potential energy function can be written as:

$$\Delta E(s) = E(s) - E'(s) = -kT \ln\frac{P(s)}{P'(s)} - kT \ln\frac{Z(q)}{Z'(q)} \tag{6.5}$$

where $P'(s)$ and $Z'(q)$ are, respectively, the reference state probability of a sequence q corresponding to the descriptor vector s and the partition function of that state. The P's can also be counts, and an example of how these counts are made is shown in Fig. 6.1. The second term of the above relationship is a constant for a particular temperature because the partition function is constant for a given sequence and does not depend on the descriptor vector s. Following Sippl (1990), we can assume that $Z(q)$ and $Z'(q)$ are approximately equal; thus the second term of Eq. (6.5) becomes zero and the effective, or net, potential energy is given by the first term in Eq. (6.5) only:

$$\Delta E(s) = -kT \ln\frac{P(s)}{P'(s)} \tag{6.6}$$

Fig. 6.1 Counting scheme for pairwise contact potential extraction from structures. The circles represent amino acids and the color the types of amino acids, with the dashed yellow circle representing solvent. Each amino acid is taken as the center of a sphere and the number of pairs with it is counted. In the example a blue circle is the center of the green dashed large sphere, and within this sphere there are contacts between two of each class of interaction as shown on the right side

In order to determine $P(s)/P'(s)$, we make further assumptions. By assuming that the probability distribution of each descriptor $s_i$ is independent, we can write

$$\frac{P(s)}{P'(s)} = \prod_i \frac{P(s_i)}{P'(s_i)} \tag{6.7}$$

If the probability of occurrence is independent of the descriptor we have

$$\prod_i \frac{P(s_i)}{P'(s_i)} = \prod_i \left(\frac{P_i}{P'_i}\right)^{s_i} \tag{6.8}$$

where $P_i$ is the probability of occurrence of the ith type structural feature in the native protein state and $P'_i$ is that in the reference state, and $s_i$ is the number of microstates having the structural feature of the type i. This leads to the relationship

$$\ln\frac{P(s)}{P'(s)} = \sum_i s_i \ln\frac{P_i}{P'_i} \tag{6.9}$$

Eq. (6.6) can now be written as

$$\Delta E(s) = -kT \sum_i s_i \ln\frac{P_i}{P'_i} \tag{6.10}$$

For a linear potential function, the effective potential energy $\Delta E(s)$ of the system is written as a sum of various energetic terms,

$$\Delta E(s) = \sum_i \Delta E(s_i) = \sum_i s_i w_i \tag{6.11}$$

In the native protein structures, if the $s_i$ distributions are all linearly independent, then by comparing Eqs. (6.10) and (6.11) we get

$$w_i = -kT \ln\frac{P_i}{P'_i} \tag{6.12}$$

This shows that in native protein structures the probability of each structural feature obeys independently the Boltzmann distribution. In nearly all statistical potential functions this Boltzmann distribution is assumed. Finkelstein et al. (1995) noticed that many structural features observed in proteins, such as distribution of residues between the surface and interior of globular proteins, distribution of dihedral angles, and ion pairs correlate well with the Boltzmann formula. Probability Pi is estimated by finding the frequency of occurrence of the ith structural feature in a database of protein structures. From a database of solved protein structures Pi can be easily determined.
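As an illustration of how Eq. (6.12) turns database frequencies into weights, and how those weights are then used in the linear sum of Eqs. (6.4) and (6.11), here is a minimal sketch. The contact counts are invented for the example (they are not PDB statistics), energies are expressed in units of kT, and the function names are hypothetical. Features with zero counts would need special handling, which this sketch omits.

```python
import math

K_T = 1.0  # energies reported in units of kT (an assumption of this sketch)


def statistical_weights(observed_counts, reference_counts, kT=K_T):
    """w_i = -kT ln(P_i / P'_i), Eq. (6.12), with P estimated as frequencies."""
    n_obs = sum(observed_counts.values())
    n_ref = sum(reference_counts.values())
    weights = {}
    for feature, count in observed_counts.items():
        p_obs = count / n_obs                      # P_i in native structures
        p_ref = reference_counts[feature] / n_ref  # P'_i in the reference state
        weights[feature] = -kT * math.log(p_obs / p_ref)
    return weights


def effective_energy(descriptor, weights):
    """Linear potential: sum_i s_i w_i, Eqs. (6.4) and (6.11)."""
    return sum(n * weights[feature] for feature, n in descriptor.items())


# Made-up contact counts for three contact types.
observed = {"LEU-LEU": 300, "LEU-LYS": 100, "LYS-LYS": 100}
reference = {"LEU-LEU": 200, "LEU-LYS": 200, "LYS-LYS": 100}

weights = statistical_weights(observed, reference)
# Descriptor of a hypothetical structure: four LEU-LEU and one LEU-LYS contact.
energy = effective_energy({"LEU-LEU": 4, "LEU-LYS": 1}, weights)
```

A contact type seen more often than in the reference state gets a negative (favorable) weight, one seen less often gets a positive weight, and one seen equally often contributes nothing.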

6.2.2 Quasi-chemical Approximation

Reference states are critical for the application of empirical potentials, just as in thermodynamics. In the quasi-chemical approximation, random mixing is used as the reference state: the number of contacts between a particular pair of species is taken to be directly proportional to their relative frequencies. Three reference states are most common. For the first, the preference of an i-type residue for a j-type residue is compared to that of their self-interactions, expressed as

$$i \bullet i + j \bullet j \rightarrow 2\, i \bullet j \tag{6.13}$$

The corresponding effective contact energy $e_{ij}$, also referred to as self-contact energy, is found from

$$2\, e_{ij}(R_C) = 2 E_{ij}(R_C) - [E_{ii}(R_C) + E_{jj}(R_C)] \tag{6.14}$$

Another reference state involves the desolvation of residues i and j prior to their association, as

$$i \bullet 0 + j \bullet 0 \rightarrow i \bullet j + 0 \bullet 0 \tag{6.15}$$

where “0” indicates solvent molecules. The corresponding solvent-mediated (indicated by superscript 0) effective contact energy is given by

$$e^{0}_{ij}(R_C) = E_{ij}(R_C) + E_{00}(R_C) - E_{i0}(R_C) - E_{j0}(R_C) \tag{6.16}$$

The solvent-residue potentials, $E_{i0}(R_C)$ and $E_{j0}(R_C)$, are determined from the number of effective solvent “0” molecules coordinating residue types i and j, and the solvent–solvent interaction potential $E_{00}(R_C)$ contributes a constant amount to each of the contact energies $e^{0}_{ij}(R_C)$.
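The two reference states can be compared on a toy example. The sketch below assumes a small table of pair potentials $E_{ij}(R_C)$ with invented values, uses "0" for solvent as in the text, and introduces hypothetical helper names; it is meant only to show how Eqs. (6.14) and (6.16) are evaluated.

```python
def key(a, b):
    """Order-independent key so that E(i, j) and E(j, i) share one entry."""
    return tuple(sorted((a, b)))


def exchange_energy(E, i, j):
    """e_ij from 2 e_ij = 2 E_ij - (E_ii + E_jj), Eq. (6.14)."""
    return E[key(i, j)] - 0.5 * (E[key(i, i)] + E[key(j, j)])


def solvent_mediated_energy(E, i, j, solvent="0"):
    """e0_ij = E_ij + E_00 - E_i0 - E_j0, Eq. (6.16)."""
    return (E[key(i, j)] + E[key(solvent, solvent)]
            - E[key(i, solvent)] - E[key(j, solvent)])


# Hypothetical pair potentials in arbitrary units; "0" denotes solvent.
E = {
    key("L", "L"): -3.0,
    key("K", "K"): -0.5,
    key("L", "K"): -1.0,
    key("L", "0"): -0.5,
    key("K", "0"): -1.0,
    key("0", "0"): -1.0,
}

e_LK = exchange_energy(E, "L", "K")           # reference: L-L and K-K pairs
e0_LK = solvent_mediated_energy(E, "L", "K")  # reference: solvated residues
e0_LL = solvent_mediated_energy(E, "L", "L")
e0_KK = solvent_mediated_energy(E, "K", "K")
```

Substituting Eq. (6.16) into the combination $e^{0}_{ij} - \tfrac{1}{2}(e^{0}_{ii} + e^{0}_{jj})$ cancels every solvent term and returns the exchange energy of Eq. (6.14), so the two reference states differ only in how desolvation is accounted for.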

6.3 Distance-Independent Potential Functions

Miyazawa–Jernigan (Miyazawa and Jernigan 1985, 1996, 1999a,b) contact potentials are historically significant and still widely used today. These potentials implicitly take solvent effects into account. In the Miyazawa–Jernigan approach, the ith residue is represented by a sphere located at the center of its side chain, $R_i$, except for glycine (Gly), which does not have a side chain, so the position of the Gly Cα atom is taken as $R_i$. A residue pair (i, j) is considered to be in contact if their separation distance is less than a cutoff distance $r_c$ = 6.5 Å. The closest residues i and j along the amino acid sequence (|i − j| = 1) are excluded in computations of the statistics of contacts. A contact matrix for all pairs of residues (i, j) is defined as

$$C_{ij} = \begin{cases} 1 & \text{if } |R_i - R_j| \le r_c \text{ and } |i - j| > 1 \\ 0 & \text{otherwise} \end{cases} \tag{6.17}$$
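A contact matrix of this kind is straightforward to compute from a list of side-chain centers. The following minimal sketch implements Eq. (6.17) directly; the five coordinates are hypothetical and the function name is an assumption of the example.

```python
import math

R_C = 6.5  # cutoff distance in angstroms, as quoted in the text


def contact_matrix(centers, r_c=R_C):
    """C_ij = 1 if |R_i - R_j| <= r_c and |i - j| > 1, else 0 (Eq. 6.17)."""
    n = len(centers)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        # Starting at i + 2 skips i == j and nearest sequence neighbours.
        for j in range(i + 2, n):
            if math.dist(centers[i], centers[j]) <= r_c:
                C[i][j] = C[j][i] = 1
    return C


# Hypothetical side-chain centres (angstroms) for a five-residue chain.
centers = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (5.0, 3.6, 0.0),
    (2.0, 5.5, 0.0),
    (12.0, 9.0, 0.0),
]
C = contact_matrix(centers)
n_contacts = sum(map(sum, C)) // 2  # each contact appears twice in C
```

Residues 0 and 1 are only 3.8 Å apart but are not counted, because sequence neighbours are excluded from the contact statistics.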

Effective inter-residue contact energies for all pairs of amino acids are evaluated, under the basic assumption that the average characteristics of residue–residue contacts observed in a large number of crystal structures of globular proteins represent the actual intrinsic inter-residue interactions. This empirical energy function includes solvent effects and provides an estimate of the long-range component of conformational energies without atomic details. To obtain this energy function, the quasi-chemical approximation with an approximate treatment of the effects of chain connectivity was employed with the number of residue–residue close contacts computed from a dataset of known protein crystal structures (Miyazawa and Jernigan 1985). The Bethe approximation is a well-known second-order approximation to the mean-field approximation used to describe a system consisting of a mixture of multiple molecular species interacting with each other (Hill 1960). In this approximation, the effects of interactions are taken into account to estimate the average numbers of contacts. Thus, the Bethe approximation is the lowest-order approximation that can provide an estimate of a set of contact energies between species from a given set of the average numbers of contacts between them. Two-body contact energies are estimated using the Bethe approximation along with the basic assumption for residue–residue contacts in protein structures that they are the same as those in mixtures of unconnected amino acids and solvent molecules. Both these approximations are usually used to calculate a partition function for such a system from a given set of interaction energies between molecules. In the mean field approximation, contacts between species are approximated to be random, and a

partition function of the system is computed. Therefore, if residue–residue contacts in protein structures can be reliably represented to be the same as those in mixtures of unconnected amino acids and solvent molecules, the Bethe approximation will give us a reasonable estimate of actual interaction energies between amino acids. There are many homologous proteins in the Protein Data Bank that may bias the computed statistics. In order to correctly use all structural data, an unbiased sampling of protein structures from the PDB is required in the calculation of contact frequencies. A sampling weight for each protein is devised based upon a sequence homology matrix giving the extent of sequence identity of all pairs of sequences.

6.3.1 Sample Weighting

Miyazawa and Jernigan used 1,168 protein structures from the PDB determined at a resolution better than 2.5 Å. Each of the 1,661 sequences in those structures is sampled with a weight determined on the basis of the sequence identity matrix. The average contact energy for the pth residue is given by

$$\langle E_p^c \rangle = \frac{1}{2}\, e_{i_p}\, n_p^c \tag{6.18}$$

where $e_{i_p}$ is the average contact energy of the amino acid type $i_p$ and $n_p^c$ is the number of contacts with the pth residue. The ordinate corresponds to the sum of the average attractive contact energy and the repulsive packing energy $e_p^r$. Repulsive packing energies reflect only packing density and do not depend strongly on the type of amino acid at the center. The coordination number also does not depend on the type of amino acid. The long-range energy is defined by

$$E^{long} = \sum_p \left( E_p^c + E_p^r \right) \tag{6.19}$$

where $E_p^c$ is the short-range attractive term that becomes effective only when two residues are in close proximity, and $E_p^r$ is the repulsive term that results from the overlap of residues at high packing densities. The contact energy for the pth residue is

$$E_p^c(e_{ij}) = \frac{1}{2} \sum_j e_{i_p j}\, n_{i_p j}^c \tag{6.20}$$

where $e_{i_p j}$ is the contact energy of the pth residue, whose amino acid type is i, in contact with a residue of type j, and $n_{i_p j}^c$ is the number of i, j contacts made by the pth residue. The total contact energy for the pth residue is thus the sum of the contact energies of all residue pairs involving that residue, according to the above equation.
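The per-residue contact energy of Eq. (6.20) can be sketched in a few lines. This is a minimal illustration under stated assumptions: residues are represented by single interaction centers, a contact is any pair within the 6.5 Å cutoff quoted in Sect. 6.4, immediate chain neighbors are skipped, and the two-letter energy table holds made-up values rather than published contact energies.

```python
import math

CUTOFF = 6.5  # angstroms; contact cutoff quoted for the Miyazawa-Jernigan potential


def residue_contact_energy(p, coords, types, e):
    """E_p^c = 1/2 * sum_j e[i_p][j] * n^c_{i_p j}, as in Eq. (6.20)."""
    counts = {}  # n^c_{i_p j}: contacts of residue p, grouped by partner type j
    for q, xyz in enumerate(coords):
        if abs(q - p) < 2:
            continue  # assumption: skip the residue itself and its chain neighbors
        if math.dist(coords[p], xyz) <= CUTOFF:
            counts[types[q]] = counts.get(types[q], 0) + 1
    return 0.5 * sum(e[types[p]][j] * n for j, n in counts.items())


def total_contact_energy(coords, types, e):
    """Sum of E_p^c over all residues (the attractive part of Eq. 6.19)."""
    return sum(residue_contact_energy(p, coords, types, e)
               for p in range(len(coords)))


# Hypothetical two-type energy table (H = hydrophobic, P = polar), in RT units.
e = {"H": {"H": -1.0, "P": -0.2}, "P": {"H": -0.2, "P": 0.0}}
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
types = ["H", "P", "H", "P"]
energy = total_contact_energy(coords, types, e)
```

The factor of one half ensures that each contact, which is seen once from each of its two residues, contributes its pair energy exactly once to the total.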


S.P. Leelananda et al.

Miyazawa and Jernigan (1985) also introduced two types of energies, $e_{ij}$ and $e'_{ij}$. Here $e'_{ij}$ is the energy difference detailing the residue interaction specificity that accompanies the formation of a contact pair i−j from contact pairs i−i and j−j (that is, in the buried protein environment, without taking solvent effects into account), while $e_{ij}$ is the energy difference accompanying the formation of contacts between i and j types of amino acids from those amino acids exposed to solvent. The methodology for estimating the contact energies $e_{ij}$ and $e'_{ij}$ is explained in more detail in Miyazawa and Jernigan (1985). Repulsive energies consist of two parts and are estimated for each residue according to

$$E_p^r = e_p^{hc} + e_p^r \tag{6.21}$$

The first term on the right-hand side of the above equation, the hard core repulsion term, is considered to be zero for proteins in the PDB and is not included, because known protein structures are usually refined to remove such close contacts. For off-lattice protein models, a repulsive potential is necessary for evaluating the conformational packing energy of a protein in order to remove possible steric clashes. The total long-range energy is estimated by summing the contact energies and repulsive energies of all residues in a protein. Assessment of protein stability based on the total number of contacts in a protein can lead to an incorrect result. Thus, in order to measure the stability of any protein structure for a given sequence, it might be better to use an energy that does not include the homogeneous energy for protein collapse but consists only of the remaining energy for aligning residues with the contacts assumed to be present within the protein structure. The alignment energy of residues within a protein structure is therefore defined as

$$E_p^c(e_{ij} - e_{rr}) = E_p^c(e_{ij}) - E_p^c(e_{rr}) \tag{6.22}$$

where $E_p^c(e_{rr})$ is the average collapse energy. This energy is, however, still inappropriate for determining the stabilities of a given fold for different protein sequences. Miyazawa and Jernigan addressed this issue using an empirical energy potential with a reference state that can be used not only for protein fold recognition but also for the recognition of sequences (Miyazawa and Jernigan 1999c). Miyazawa and Jernigan (1996) chose the total energy expected for a typical protein with a given amino acid composition and chain length to be the reference energy of the native structure for the given protein sequence. The following difference in energy is considered:


$$\begin{aligned}
\Delta E^{long} &\equiv E^{long} - (E^{long}\ \text{of a typical native structure for a given sequence}) \\
&\simeq E^{long} - \sum_p f_{i_p} \cdot (\text{the average number of contacts per residue in a typical native structure})
\end{aligned} \tag{6.23}$$

where the second term is the sum of the average contact energies per residue of residue type $i_p$ over all positions in the protein. Here the average number of contacts per residue in a typical native structure $f_i$ is a function of $e_{ij}$ and is estimated by averaging over all proteins. Lastly, a linear dependence on chain length is removed by comparison on a “per residue” basis, which is appropriate for assessing the stability of one protein structure among other folds.

The threading of sequences into other folds (Hendlich et al. 1990; Jones et al. 1992), or inverse folding (Bowie et al. 1991), is a good method to evaluate how well a given energy scale can discriminate the native structures as the lowest-energy conformations among other folds. A basic assumption underlying the Miyazawa and Jernigan (1996) estimation of contact energies is that, for a large enough sample, the effects of specific amino acid sequences will average out, so that the numbers of residue–residue contacts observed in a large number of protein crystals directly reflect the actual intrinsic inter-residue interaction strengths.

The hydrophobic effect is a dominant force in stabilizing the native structure of a globular protein. However, there is a lack of agreement as to its precise magnitude. The hydrophobicity of a small molecule can be measured by transfer energies, for example, of amino acids from a non-polar solvent to water. However, it is not immediately evident that the free energy change accompanying the burial of residues during protein folding is the same as that accompanying the transfer of amino acids from water to a non-aqueous solvent, because the two processes are obviously different (Lee 1993) and denatured proteins do not fully expose all residues to solvent.

6.4 Distance-Dependent Potential Functions

In the Miyazawa–Jernigan potential function a cutoff value of $r_c$ = 6.5 Å is used for the distance between a pair of residues to define the occurrence of a contact. This is done by assuming that the interactions between amino acids are short range. In distance-dependent potentials, on the other hand, the interactions depend on the distance between the amino acids or atoms. Here the distance of interactions is divided


into a number of small distance intervals (bins), and potential functions are derived by applying the weight equation for each bin. Many distance-dependent potential functions have been developed (Hendlich et al. 1990; Sippl 1990; Jones et al. 1992; Samudrala and Moult 1998; Lu and Skolnick 2001). One problem is that these potentials need to approach zero beyond a certain distance. This problem was discussed, and one solution presented, by Bahar and Jernigan (1997). For distance-dependent potentials, the effective potential energy for pairwise interactions between two residue types i and j at a distance d, E(i, j; d), can be written in several different forms. The knowledge-based potential (KBP) function developed by Lu and Skolnick (2001) and the residue-specific all-atom probability discriminatory function (RAPDF) developed by Samudrala and Moult (1998) are two examples. All of these approaches lead to the same basic formula, which can be written as

$$E(i, j; d) = -\ln \frac{N_{\mathrm{obs}}(i, j; d)}{N_{\mathrm{exp}}(i, j; d)} \tag{6.24}$$

where $N_{\mathrm{obs}}(i, j; d)$ and $N_{\mathrm{exp}}(i, j; d)$ are, respectively, the observed and expected total numbers of pairwise interactions for (i, j) pairs within a distance d. The method of obtaining $N_{\mathrm{obs}}(i, j; d)$ is the same for many distance-dependent energy functions. However, $N_{\mathrm{exp}}(i, j; d)$ depends on how it is counted. The model of the reference state used to compute $N_{\mathrm{exp}}(i, j; d)$ is therefore critical for distance-dependent energy functions.

For example, in the model proposed by Sippl (1990) a uniform density reference state was assumed in order to derive the residue-based, distance-dependent potential. In this reference state the total number of pairs in a given distance shell is equal to that for folded proteins. This means that the distance dependence of the pair probability distribution is the distribution averaged over all atomic or residue pairs. Lu and Skolnick (2001) used this reference state to calculate the expected number of (i, j) interactions at distance d. Samudrala and Moult (1998) used another type of reference state, in which the probability of the distance d between a pair of residues (i, j) is taken to be independent of the contact types (i, j). The problem with the uniform density model is that very different volume distributions of (i, j) pairs in specific regions of the protein may produce the same density if the particular residue pair is along a line.

To overcome this problem Zhou and Zhou (2002) developed a new reference state called DFIRE (Distance-scaled, Finite Ideal-gas REference state). It is an extension of the ideal gas reference state, in which residues are modeled as non-interacting points, like molecules in an ideal gas. The interacting pair distribution is uniform not only along any line but also in the whole volume of the protein. Potentials generated using this reference state are termed DFIRE-based potentials. These potentials were tested using decoy and mutation databases. The results showed that the DFIRE-based all-atom potential consistently performed better than previous all-atom knowledge-based potentials.
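The role of the reference state in Eq. (6.24) is easier to see in code. The sketch below bins observed pair distances and takes the expected count from a reference state in which the distance distribution does not depend on the residue types, in the spirit of the type-independent reference state described above. The function name, the bin width, and the toy observations are assumptions made for illustration, and no pseudocounts are applied, so pair and bin combinations that were never observed simply receive no energy.

```python
import math
from collections import Counter


def distance_potential(observations, bin_width=1.0):
    """Return {(pair, bin): E} with E = -ln(N_obs / N_exp), as in Eq. (6.24).

    observations: iterable of (type_i, type_j, distance) tuples.
    Reference state (assumption): N_exp(i, j; d) = N(i, j) * N(d) / N_total,
    i.e. the distance distribution is independent of the residue types.
    """
    n_pair_bin, n_pair, n_bin = Counter(), Counter(), Counter()
    for a, b, d in observations:
        pair = tuple(sorted((a, b)))   # treat (i, j) and (j, i) as the same pair
        k = int(d // bin_width)        # index of the distance bin
        n_pair_bin[pair, k] += 1
        n_pair[pair] += 1
        n_bin[k] += 1
    total = sum(n_bin.values())
    energies = {}
    for (pair, k), n_obs in n_pair_bin.items():
        n_exp = n_pair[pair] * n_bin[k] / total
        energies[pair, k] = -math.log(n_obs / n_exp)
    return energies


# Toy observations: (type, type, distance in angstroms).
observations = [("A", "A", 4.5), ("A", "A", 4.2), ("A", "B", 4.8), ("B", "A", 7.5)]
energies = distance_potential(observations)
```

Changing only the expression for `n_exp` switches the reference state, which is why potentials that share the same observed counts can still differ substantially.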


6.5 Geometric Potential Functions

The third class of knowledge-based statistical potentials is based on the shape of the protein molecule. Here various geometric constructs that accurately reflect the shape of the molecules are computed. There are several such geometric constructs, and potentials derived from them have shown significant success. The potential function developed by McConkey et al. (2003), based on the Voronoi diagram of the atomic structures of proteins, is one of the best-performing atomic-level potential functions used in decoy discrimination. McConkey et al. used the Voronoi tessellation procedure to divide the protein into cells, and the cell volume was restricted to a sphere, which defined the atom-solvent accessible area. Atom–atom contacts were calculated from the projection of the cell faces onto the surface of the sphere. Their contact-based scoring function was able to distinguish native proteins from corresponding decoy structures with a high degree of accuracy.

The Delaunay triangulation (Singh et al. 1996; Zheng et al. 1997; Carter et al. 2001; Krishnamoorthy and Tropsha 2003) is another type of geometric construct. Krishnamoorthy and Tropsha represented proteins by the side-chain centroids of amino acids. Delaunay tessellation of this representation defines all sets of nearest-neighbor quadruplets of amino acids. A four-body contact scoring function (log likelihoods of residue quadruplet composition) was derived by analysis of a diverse set of proteins with known structures. A test protein was characterized by the total score calculated as the sum of the individual log likelihoods of amino acid quadruplets. Their results showed that the scoring function distinguishes native from partially unfolded or deliberately misfolded structures. The alpha shape of protein molecules has also been used to define geometric constructs (Li et al. 2003; Li and Liang 2005a,b).

6.6 Multi-body Potentials

All non-geometric statistical potentials discussed so far are two-body potentials, and they have shown varying levels of success in discriminating native structures from decoys. However, two-body potentials are not capable of recognizing all native folds against large datasets of decoy structures (Vendruscolo et al. 2000). They also cannot properly represent three-dimensional interactions and cooperativity and can only represent lower-order packing arrangements (Carter et al. 2001). Czaplewski et al. (2000) found that two-body potentials for methane in water were insufficient for capturing the cooperativity and that a three-body term was required. The Scheraga group has subsequently explored six-body potentials and found that a four-body term is sufficient to capture the cooperativity present in proteins. The lack of any significant “excess” contributions to the pairwise potentials, which cannot be approximated by one-body components, strongly suggests that more effective structure-specific knowledge-based potentials are required


(Pokarowski et al. 2005). Betancourt and Thirumalai examined the similarities and differences between the two widely used pairwise potentials proposed by Miyazawa and Jernigan (1996) and Skolnick et al. (1997) and suggested that pairwise potentials are not sufficient for reliable prediction of protein structures (Betancourt and Thirumalai 1999). Munson et al. (1997) showed small gains in threading by using three-body potentials instead of two-body potentials. Krishnamoorthy and Tropsha (2003) showed that four-body potentials obtained using Delaunay tessellation algorithms, which are popular in the study of protein structure, can discern correct sequences or structures and generate better z-scores than two-body statistical potentials. However, the four-body contact potentials derived by Delaunay tessellation and most two-body potentials (Miyazawa and Jernigan 1985, 1996; Vajda et al. 1997; Maiorov and Crippen 1992; Skolnick et al. 1997) neglect the information on the sequence location of interacting residues. Feng et al. (2007) developed a new scheme for the derivation of four-body potentials that considers the interactions between backbones and side chains in more detail and includes some of the sequential information of the protein. They tested this new scheme by threading against the same decoy databases used with the Delaunay tessellation four-body potentials (Krishnamoorthy and Tropsha 2003) and concluded that the overall rankings of their potentials are significantly better than those of the Delaunay tessellation-based potentials.

6.6.1 Four-Body Contact Potentials

The four-body contact potentials developed by Feng et al. (2007) give a more cooperative representation of protein interaction energies. They found that these four-body contact potentials can discriminate well between native structures and partially unfolded or deliberately misfolded structures. Four-body contact potentials derived using Delaunay tessellation by Krishnamoorthy and Tropsha (2003) and most two-body potentials neglect the sequence information of proteins. In the four-body contact potentials derived by Feng et al. (2007), however, the interactions between the backbones and side chains are considered in more detail and some of the sequential information for the protein is included.

6.6.1.1 Construction of Four-Body Contacts

Residues are all represented by the geometric centers of the side-chain heavy atoms, except for glycine, where the alpha carbon atom is used (Fig. 6.2). Yellow points are the side-chain geometric centers of four sequential residues i, i + 1, i + 2, and i + 3. The red point is the geometric center of the four yellow points and is chosen as the center of the interacting group. The six cyan planes, defined by all combinations of pairs of yellow points and the central red point, fully subdivide the space surrounding the red point into four tetrahedra. Blue points represent other residues in close proximity to the red point; the interaction range is defined as being within 8.0 Å of the red point.
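The geometric construction described above can be sketched as follows. The sketch interprets each of the four "tetrahedra" as the cone spanned by three of the four vectors pointing from the red center to the yellow points, and tests membership by solving a 3×3 linear system with Cramer's rule. This interpretation, the function names, and the toy coordinates are assumptions for illustration and are not taken from Feng et al. (2007).

```python
import math
from itertools import combinations


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def det3(a, b, c):
    """Determinant of the 3x3 matrix with rows a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def assign_tetrahedron(yellow, x, max_dist=8.0):
    """Return the indices (i, j, k) of the yellow points whose cone contains x.

    Returns None if x lies farther than max_dist from the center or if the
    geometry is degenerate. x is inside the cone of v_i, v_j, v_k when
    x - center = a*v_i + b*v_j + c*v_k with a, b, c all non-negative.
    """
    center = centroid(yellow)
    if math.dist(x, center) > max_dist:
        return None
    r = sub(x, center)
    v = [sub(y, center) for y in yellow]
    for i, j, k in combinations(range(4), 3):
        d = det3(v[i], v[j], v[k])
        if abs(d) < 1e-12:
            continue  # the three vectors are coplanar
        a = det3(r, v[j], v[k]) / d      # Cramer's rule
        b = det3(v[i], r, v[k]) / d
        c = det3(v[i], v[j], r) / d
        if a >= 0 and b >= 0 and c >= 0:
            return (i, j, k)
    return None


# Toy geometry: four "yellow" points at the corners of a regular tetrahedron.
yellow = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
region = assign_tetrahedron(yellow, (1.5, 0.5, -0.5))
```

Because the four vectors from the centroid sum to zero, a point in general position falls into exactly one of the four cones, so each neighboring residue is counted once.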


Fig. 6.2 Identification of residue points for use in the four-body contacts

Table 6.1 Combination of residue types chosen to reduce the sequential amino acids to eight classes

A = {GLU, ASP} (acidic)
B = {ARG, LYS, HIS} (basic)
C = {CYS} (cysteine)
H = {TRP, TYR, PHE, MET, LEU, ILE, VAL} (hydrophobic)
N = {GLN, ASN} (amide)
O = {SER, THR} (hydroxyl)
P = {PRO} (proline)
S = {ALA, GLY} (small)

An example of the four contacting bodies used for the potential is shown by the four residues in boxes. Among these, the three yellow residues form a sequence triplet, whose residue types are reduced to eight classes (Table 6.1) in order to ensure sufficient data for potential extraction. The single blue point in a box within the quadruplet is not close in sequence and is taken to be one of the 20 amino acids. The four bodies therefore always consist of three sequential points and one non-sequential point in the quartet of interacting residues. In accumulating the information to construct the potential, Feng et al. (2007) ignored the specific sequence order of the three residues within each backbone triplet, so instead of 8^3 = 512 ordered triplets there are only 120 different unordered triplets. Data are collected by including all specific types of residues (20 types) for the fourth point within a distance of 8 Å from the coordinate center (the red point in Fig. 6.2) and assigning each of them to one of the four tetrahedra defined by the vectors originating from the red point to the yellow points. The residue is then counted in that specific tetrahedron, and the procedure is repeated for the entire set of proteins and for all quartets defining closely interacting residues. In this way four-body conformational sets were derived, each composed of a sequential residue triplet and a single non-sequential nearby residue. Each of the residues can be exposed, when it is on the surface of the protein, or buried, when it is located inside the protein core. These different situations were considered separately. The triplets were separated into three groups by their relative solvent accessible surface area (RSA) in order to remove differences in the chain connectivity effect and in residue packing geometry between surface area and buried


region. These three groups correspond to buried (with all three residues in the triplet having RSA < 20%, denoted as Bu), exposed (with all three having RSA > 20%, denoted as E), and intermediate (some of the residues in the triplet being exposed and some being buried, denoted as I). Better results were obtained in discriminating native structures from a large number of decoys by using these four-body potentials categorized by RSA.
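The reduction to eight classes and the count of 120 unordered triplets can be checked directly. The sketch below builds the class map from Table 6.1, enumerates the unordered triplets as multisets of three classes, and classifies a triplet by burial. The helper names are hypothetical, and the treatment of a residue with RSA exactly at the 20% threshold (counted here as exposed) is an assumption.

```python
from itertools import combinations_with_replacement

# Residue classes from Table 6.1.
CLASSES = {
    "A": ["GLU", "ASP"],
    "B": ["ARG", "LYS", "HIS"],
    "C": ["CYS"],
    "H": ["TRP", "TYR", "PHE", "MET", "LEU", "ILE", "VAL"],
    "N": ["GLN", "ASN"],
    "O": ["SER", "THR"],
    "P": ["PRO"],
    "S": ["ALA", "GLY"],
}
CLASS_OF = {res: label for label, members in CLASSES.items() for res in members}

# Unordered triplets are multisets of size 3 drawn from the 8 classes.
TRIPLETS = list(combinations_with_replacement(sorted(CLASSES), 3))


def triplet_key(res1, res2, res3):
    """Order-independent key of a sequential triplet in the reduced alphabet."""
    return tuple(sorted(CLASS_OF[r] for r in (res1, res2, res3)))


def burial_state(rsa_values, threshold=0.20):
    """Classify a triplet as buried (Bu), exposed (E), or intermediate (I)."""
    if all(r < threshold for r in rsa_values):
        return "Bu"
    if all(r >= threshold for r in rsa_values):  # assumption: boundary counts as exposed
        return "E"
    return "I"
```

For example, `triplet_key("LEU", "ASP", "GLY")` and `triplet_key("GLY", "LEU", "ASP")` give the same key, which is what reduces the 512 ordered triplets to 120.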

6.6.2 Four-Body Contact Potential Energy Function

A four-body contact potential energy is calculated according to the inverse Boltzmann principle. First the probabilities $P_{4|X}$, $P_{3|X}$, and $P_A$, which are, respectively, the observed frequencies of quadruplets and triplets in each of the sets specified by X = Bu, E, or I and of amino acid type singlets (A) in the protein datasets, are calculated using the following equations:

$$\begin{gathered}
P_{4|X} = \frac{\text{number of the specific quadruplets given Bu, E, or I in the data set}}{\text{total number of all types of quadruplets given Bu, E, or I in the data set}} \\[4pt]
P_{3|X} = \frac{\text{number of the specific triplets given Bu, E, or I in the data set}}{\text{total number of all triplets given Bu, E, or I in the data set}} \\[4pt]
P_A = \frac{\text{number of the specific type of amino acids in the data set}}{\text{total number of all amino acids in the data set}}
\end{gathered} \tag{6.25}$$

Then the four-body contact potential energy is obtained using the Boltzmann formula,

$$E_{4|X} = -RT \ln \frac{P_{4|X}}{P_{3|X}\, P_A} \tag{6.26}$$

The total free energy for a protein is obtained by summing the four-body contact potential energies over all quadruplets $n_q$:

$$E_{total} = \sum_{n_q} E_{4|X} \tag{6.27}$$

This equation is used to estimate the free energy of native structures and their decoys. The results are shown in Fig. 6.3 where the relative values of these four-body contact potentials are shown in color, as the heat map. The figure contains three parts: the left vertical part shows potentials for the buried triplets, the middle section is for exposed, and the third for intermediate burial. The y-axis represents the indices of the 120 types of the possible triplets.


Fig. 6.3 Relative values of four-body contact potentials shown in color for three types of triplets separately (buried, exposed, and intermediate). Blue is the most favorable energy or most frequently occurring

The abscissa shows the singlet within the sequence-based tetrahedra in contact with the specific triplets indexed on the ordinate. The first 20 characters on the x-axis represent the 20 types of amino acids for triplets in the buried state, the next 20 characters the triplets in the exposed state, and the last 20 characters the triplets in the intermediate state. The values of the potential are colored spectrally from blue to red: negative values representing favorable contacts and positive values unfavorable contacts. Energy values are in units of RT.
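The calculation behind Eqs. (6.25)–(6.27) can be sketched from raw counts. The function below returns energies in units of RT, matching the units used for Fig. 6.3. The data layout (tuples of burial state, reduced triplet, and residue type), the function names, and the toy counts are assumptions for illustration, and the sketch assumes that every triplet and residue type appearing in a quadruplet also appears in the triplet and singlet data.

```python
import math
from collections import Counter


def four_body_energies(quadruplets, triplets, residues):
    """Return {(state, triplet, residue): E_4|X / RT} following Eq. (6.26).

    quadruplets: list of (state, triplet, residue) observations
    triplets:    list of (state, triplet) observations
    residues:    list of residue types (singlets)
    state is one of "Bu", "E", "I".
    """
    n4, n3, n1 = Counter(quadruplets), Counter(triplets), Counter(residues)
    tot4 = Counter(state for state, _, _ in quadruplets)  # totals per burial state
    tot3 = Counter(state for state, _ in triplets)
    tot1 = len(residues)
    energies = {}
    for (state, triplet, res), n in n4.items():
        p4 = n / tot4[state]                      # P_4|X
        p3 = n3[state, triplet] / tot3[state]     # P_3|X
        pa = n1[res] / tot1                       # P_A
        energies[state, triplet, res] = -math.log(p4 / (p3 * pa))
    return energies


def total_energy(energies, observed_quadruplets):
    """Eq. (6.27): sum of E_4|X over the quadruplets found in one structure."""
    return sum(energies[q] for q in observed_quadruplets)


# Toy counts (hypothetical).
hhs = ("H", "H", "S")
quadruplets = [("Bu", hhs, "LEU")] * 3 + [("Bu", hhs, "LYS")]
triplets = [("Bu", hhs)] * 2 + [("Bu", ("A", "B", "S"))] * 2
residues = ["LEU", "LEU", "LYS", "LYS"]
energies = four_body_energies(quadruplets, triplets, residues)
```

In this toy example leucine occurs next to the buried H-H-S triplet three times more often than expected from the independent triplet and singlet frequencies, so its energy is negative (favorable), while lysine occurs exactly as often as expected and receives an energy of zero.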

6.7 Optimization Method

The assumption of Boltzmann statistics and the non-inclusion of chain connectivity in the reference state are among the drawbacks associated with statistical knowledge-based potentials. Knowledge-based potentials derived by optimization are able to take these issues into account. In the optimization method, the same functional form is used as for statistical potentials, but the weight coefficients are derived by optimization. Here, the weight coefficients are the parameters of the potential functions for individual types


of interactions. These parameters are optimized for discriminating the native structure from a large set of decoys. This is in contrast to statistical potentials, where the weight coefficients are derived from statistical analysis of protein databases. In this type of potential function certain inequalities are optimized. First, from Anfinsen’s thermodynamic hypothesis (Anfinsen et al. 1961; Anfinsen 1973), the native amino acid sequence threaded on the native structure has the best fitness score compared to a set of sequence decoys taken from proteins with different folds. Secondly, the native sequence has the highest probability of fitting into the specified native structure (Shakhnovich and Gutin 1993; Deutsch and Kurosky 1996; Li et al. 1996). Thirdly, native protein sequences, when mounted onto their native structure, have the lowest energy compared to a set of possible decoys. As in the case of statistical potentials, choosing a functional form for the optimized potential and generating a large set of decoys are the starting points. A set of parameters is then optimized to discriminate the native structure among the decoys. Usually the gap between the native energy and the average energy of the decoys, the energy gap between the native structure and the decoys with the lowest score, or the z-score of the native protein is maximized (Goldstein et al. 1992; Maiorov and Crippen 1992; Thomas and Dill 1996a; Koretke et al. 1996, 1998; Hao and Scheraga 1996; Mirny and Shakhnovich 1996; Vendruscolo and Domany 1998).
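A minimal sketch of such an optimization is given below. It uses a perceptron-style update, chosen here only because it is simple, to find weights for which the native structure has a lower energy than every decoy by at least a fixed margin. The linear energy form, the feature vectors (counts of three hypothetical contact types), and all parameter values are assumptions for illustration and do not reproduce any of the cited methods.

```python
def energy(weights, features):
    """Linear energy: weighted sum of contact-type counts."""
    return sum(w * f for w, f in zip(weights, features))


def train_perceptron(native, decoys, epochs=100, rate=0.1, margin=1.0):
    """Adjust weights until E(decoy) - E(native) >= margin for every decoy."""
    weights = [0.0] * len(native)
    for _ in range(epochs):
        updated = False
        for decoy in decoys:
            if energy(weights, decoy) - energy(weights, native) < margin:
                # Violated inequality: raise the decoy energy relative to the native.
                weights = [w + rate * (d - n)
                           for w, d, n in zip(weights, decoy, native)]
                updated = True
        if not updated:
            break  # all inequalities are satisfied
    return weights


# Toy feature vectors: counts of (H-H, H-P, P-P) contacts, hypothetical values.
native = [5, 2, 1]
decoys = [[2, 5, 1], [1, 3, 4], [3, 3, 2]]
weights = train_perceptron(native, decoys)
gap = min(energy(weights, d) for d in decoys) - energy(weights, native)
```

For data that can be separated in this way, the loop stops as soon as all inequalities hold. The learned weights then play the role of the interaction parameters, and `gap` is the energy gap between the native structure and the lowest-energy decoy.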

6.8 Comparative Analysis of Statistical Protein Contact Potentials to Infer Ideal Amino Acid Interaction Forms

Pokarowski et al. (2005) analyzed 29 different published matrices of protein pairwise contact potentials between amino acids derived from different sets of proteins. The structures were taken either from the Protein Data Bank (PDB) or generated computationally (decoys). Each of the contact potentials is similar to one of the two matrices derived in the work of Miyazawa and Jernigan (1999b). All the known pairwise matrices of contact potentials can be well approximated by simple functions of individual residue properties, such as hydrophobicity and electrostatic properties (Georgescu et al. 2002; Mehler et al. 2002; Sandberg and Edholm 1999; Tollinger et al. 2003; Laurents et al. 2003), for each pair of amino acids. Such an approximation of the contact potential matrices is termed a one-body approximation. The electrostatic properties of an amino acid are represented by its isoelectric point and measured in pH units. The contact potential matrices of the first class can be approximated by the formula

$$e_{ij} = h_i + h_j, \quad 1 \le i, j \le 20 \tag{6.28}$$

Residue-type-dependent factor h is highly correlated with the frequency of occurrence of a given amino acid-type inside proteins. Electrostatic interactions for the potentials of this class are almost negligible. In the potentials belonging to this class,


the major contribution to the potentials is the one-body transfer energy of the amino acid from water to the protein environment. Potentials belonging to the second class can be approximated by the formula

$$e_{ij} = c_0 - h_i h_j + q_i q_j \tag{6.29}$$

where $c_0$ is a constant and h is highly correlated with the Kyte–Doolittle hydrophobicity scale. The less dominant, residue-type-dependent factor q correlates with the amino acid isoelectric points pI. The electrostatic interactions are quite important for the second class of potentials but are almost negligible for the first class. Inclusion of electrostatic interactions significantly improves the approximation for the second class of potentials. The high correlation between potentials of the first class and the hydrophobic transfer energies is well known. It was found that the one-body approximation works well for the second class of potentials as well. The potentials of this class are interpreted as representing energies of contact of amino acid pairs within an average protein environment. Hydrophobicity represents the dominant factor in protein potentials. The energy of demixing of amino acids in a protein environment and electrostatic interactions are less dominant factors. Pokarowski et al. (2005) have shown that the one-body approximation works significantly better for potentials derived from the quasi-chemical principle than for potentials obtained by optimizing the prediction of native structures among decoys. The frequencies of contacts between different amino acids can also be successfully approximated with this method. Therefore

Fig. 6.4 Graphical illustration of correlations among different protein potentials (Reproduced from Pokarowski et al. 2005 with the permission from the “publisher”). The extent of correlation is colored according to the key. The identifications of the individual potentials are given in Pokarowski et al. (2005)


hydrophobicity, demixing, and electrostatics are fundamental properties defining potentials from the simple statistics of inter-residue pair contacts for proteins. The one-body approximation can be used to separate contact potentials into two classes. Potentials belonging to the first class are dominated by the one-body energies of transfer of amino acids from water to a protein environment. Potentials belonging to the second class represent energies of contacts of amino acids in a protein environment. Figure 6.4 gives a graphical illustration of correlations among different protein potentials. The clustering of the contact potentials into two groups is readily apparent. The first cluster is centered on MJ3h and SJKG and the second one around MJ3 as shown in Fig. 6.4, corresponding to the two major types of reference states. This scheme makes it possible to analyze and compare various statistical potentials for proteins.
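The two one-body forms of Eqs. (6.28) and (6.29) can be illustrated with a short sketch. For the additive form, the least-squares estimate of h for a symmetric matrix has a closed form: with row sums $r_i$, grand total T, and n residue types, $h_i = r_i/n - T/(2n^2)$. The fitting procedure, the function names, and the toy values are assumptions for illustration and are not claimed to be the procedure of Pokarowski et al. (2005).

```python
from statistics import mean


def fit_additive(e):
    """Least-squares h for e[i][j] ~ h[i] + h[j] (Eq. 6.28), e symmetric."""
    n = len(e)
    row_sums = [sum(row) for row in e]
    total = sum(row_sums)
    return [r / n - total / (2 * n * n) for r in row_sums]


def second_class(c0, h, q):
    """Build the matrix e[i][j] = c0 - h[i]*h[j] + q[i]*q[j] (Eq. 6.29)."""
    n = len(h)
    return [[c0 - h[i] * h[j] + q[i] * q[j] for j in range(n)] for i in range(n)]


def upper_triangle(m):
    """Flatten the upper triangle (including the diagonal) of a square matrix."""
    return [m[i][j] for i in range(len(m)) for j in range(i, len(m))]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Toy check with three residue types and hypothetical h values.
h_true = [-1.0, -0.5, 0.25]
e = [[a + b for b in h_true] for a in h_true]
h_fit = fit_additive(e)
approx = [[a + b for b in h_fit] for a in h_fit]
r = pearson(upper_triangle(e), upper_triangle(approx))
```

Applied to a real contact potential matrix, the correlation `r` between the matrix and its one-body reconstruction measures how well that potential is described by the first-class form.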

6.9 Statistical Force Fields for Coarse-Grained Protein Models

In traditional statistical potentials usually only distances between residues are considered. To improve the prediction of protein tertiary structure from amino acid sequence, more sophisticated statistical force fields have recently been proposed in which, in addition to the distances between residues, several other features of amino acids in protein structures are incorporated, such as their solvent exposure, the orientation of side chains, or their localization in specific secondary structure elements (Zhang and Kim 2000). Examples of such statistical force fields are OPUS-Ca developed by Ma (Wu et al. 2007), the CABS force field developed by Kolinski and coworkers (Kolinski 2004), and the United Residue (UNRES) model developed by Scheraga and coworkers (Liwo et al. 1997a,b, 1998, 2001; Pillardy et al. 2001). The OPUS-Ca (Wu et al. 2007) potentials require only alpha carbon positions as input. The contributions from other atomic positions are established from pseudopositions artificially built from a Cα trace for auxiliary purposes. The potential function is formed based on seven major representative molecular interactions in proteins: a distance-dependent pairwise energy with orientational preference, a hydrogen bonding energy, a short-range energy, a packing energy, a tri-peptide packing energy, a three-body energy, and a solvation energy. Tests of decoy recognition on a number of commonly used decoy sets showed that the new potential function outperforms all known Cα-based potentials and most other coarse-grained ones that require more information than Cα positions. The CABS model is based on a lattice representation of the proteins’ Cα positions with 800 possible orientations of the virtual Cα−Cα bonds. Only the positions of the Cα and Cβ atoms and the center of the side chain are considered.
Knowledge-based potentials of the force field for the CABS model include generic protein-like conformational biases, statistical potentials for the short-range conformational propensities, a model of the main-chain hydrogen bonds and context-dependent statistical potentials describing the side-group interactions. The model is more accurate than the


previously designed lattice models, and in many applications it is complementary to, and competitive with, all-atom techniques. The UNRES force field of Scheraga and collaborators (Liwo et al. 1997a,b, 1998, 2001; Pillardy et al. 2001) represents proteins as a sequence of Cα atoms linked by virtual bonds, with attached united side chains and united peptide groups. Only these united peptide groups and united side chains interact; the alpha-carbon atoms serve only for linking purposes. The total energy in the UNRES model is a sum of terms representing hydrophobic/hydrophilic and electrostatic interactions, excluded volume, virtual dihedral angle torsions, double torsions, virtual-bond angle bending, energies of side-chain rotamers, and multi-body correlations from a cumulant expansion. Because of this, the UNRES force field appears similar to a classical atomic force field used in molecular dynamics simulations.

6.10 Applications of Knowledge-Based Potential Functions

Knowledge-based potential functions are used in protein structure prediction, protein design, protein docking, and protein folding. In protein structure prediction, the conformational space is sampled and the near-native structure is recognized from the resulting ensemble of decoy conformations. The potential function is used to discriminate the near-native structure from the set of decoys. Several decoy sets have been developed to test whether a particular knowledge-based potential function is successful in identifying the native structure among decoys. One example is the 4-state-reduced decoy set developed by Park and Levitt (1996). This set contains the native and near-native conformations of 7 sequences along with about 650 decoys for each sequence. Decoys‘R’Us provides a number of decoy sets and acts as a resource of test sets for evaluating the success of potential functions and for obtaining improved scoring functions (Samudrala and Levitt 2000). CASP (Critical Assessment of Techniques for Protein Structure Prediction) decoys are also often used in testing knowledge-based potential functions. Studies have been carried out in which the performance of different knowledge-based potentials has been compared (Park and Levitt 1996; Zhou and Zhou 2002; Gilis 2004). The evaluation is based on the success in ranking the native structure as the lowest in energy and in obtaining a large z-score for the native structure. Our (Feng et al. 2007) four-body sequence-based contact potentials (4CP-seq) have been successful in recognizing the native structure among misfolded decoy datasets from Decoys‘R’Us. The sequential information built into this potential enables better gapless threading results than Delaunay tessellation. However, these potentials fail to recognize the native structures of some proteins, such as ferredoxin (1fca) (Feng et al. 2007). 
To improve the performance of four-body potentials we have recently derived a second generation of these potentials. These newer four-body potentials include the spatial orientation information by using four


non-sequential neighbors as their basis and are named 4CP-non-seq. We further improved our four-body contact potentials by combining 4CP-seq and 4CP-non-seq. Because both 4CP-seq and 4CP-non-seq are long-range potentials, we also add in our short-range potentials (Bahar et al. 1997). The best threading results are obtained by combining these three sets of potentials (see Table 6.2, where the Seq + 0.5 nonSeq + 0.1 SR column is best).

Table 6.2 Threading results: significantly improved performance of our second generation of four-body potentials that combines 4CP-seq (Seq), 4CP-non-seq (nonSeq), and short-range (SR) interactions. Results shown are for (a) the four-state reduced decoy and (b) lattice_ssfit sets from Decoys‘R’Us. The best case is the three-term combination.

Protein     Rank    Rank      Rank   Rank            Rank w/Seq +           z score
            w/Seq   w/nonSeq  w/SR   w/Seq + nonSeq  0.5 nonSeq + 0.1 SR
(a) Four-state reduced
1ctf-a        6        1       25        2                 2                −2.5
1r69-a        1        1       65        1                 1                −3.3
1sn3-a        1       46       11        7                 2                −2.6
2cro-a        1       26       56        1                 1                −2.6
3icb-a        1        9       59        3                 1                −2.2
4pti-a        7       13        6        2                 1                −2.9
4rxn-a        7        9        8        3                 1                −2.5
(b) lattice_ssfit
1beo-b        1        1        1        1                 1                −5.1
1ctf-b        2        1       87        1                 1                −4.5
1dkt-A-b     13       47        1       14                 1                −2.6
1fca-b      249        1        8       12                 1                −2.7
1nkl-b        1        1        4        1                 1                −5.4
1pgb-b       19        3        3        1                 1                −3.3
1trl-A-b      1        7        3        1                 1                −3.8
4icb-b        1        1        1        1                 1                −5.0
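As a rough illustration of how such a weighted combination can be used for ranking, the sketch below adds the three energy terms with the weights that appear in the heading of Table 6.2 (1 for Seq, 0.5 for nonSeq, 0.1 for SR) and ranks the native structure by total energy. The function names and all energy values are hypothetical; computing the individual terms from a structure is not shown.

```python
def combined_energy(e_seq, e_nonseq, e_sr, w_nonseq=0.5, w_sr=0.1):
    """Weighted sum Seq + 0.5 nonSeq + 0.1 SR (weights from the Table 6.2 heading)."""
    return e_seq + w_nonseq * e_nonseq + w_sr * e_sr

def rank_of_native(structures, native_name):
    """Rank (1 = lowest combined energy) of the native among all structures.

    `structures` maps a name to a tuple (e_seq, e_nonseq, e_sr).
    """
    totals = {name: combined_energy(*terms) for name, terms in structures.items()}
    ordered = sorted(totals, key=totals.get)
    return ordered.index(native_name) + 1

# Invented energies in arbitrary units, chosen so that the Seq term alone
# would rank decoy1 ahead of the native structure.
structures = {
    "native": (-5.0, -8.0, -10.0),
    "decoy1": (-6.0, -2.0, -5.0),
    "decoy2": (-4.0, -6.0, -20.0),
}

print(rank_of_native(structures, "native"))  # 1
```

In this toy case the native structure is second by the Seq term alone but first once the other two terms are added, which is the kind of behavior the combined potential is meant to produce.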

Our new combined potentials identify all the native structures in the lattice_ssfit set and 5 out of 7 in the 4state_reduced set. For the other two targets in the 4state_reduced set, the native structure is ranked nearly best. These threading tests were performed on widely used decoys from Decoys‘R’Us, but we are now generating our own sets of decoys. In some cases, such as 4icb from the Decoys‘R’Us lattice_ssfit dataset, the decoys are clustered at root mean square deviations (RMSDs) of 5–12 Å from the native structure. Our new decoy sets are made more challenging by including decoys with RMSDs in the range of 2–5 Å. Our potentials can still identify the native structure even for these more difficult decoys. Knowledge-based potential functions are used not only to identify the native conformation in a set of decoys but also to guide the generation of conformations when sampling the conformational space (Jernigan and Bahar 1996; Hao and Scheraga 1999). Knowledge-based potentials are also used in protein–protein docking predictions. Li and Liang (2005a) used knowledge-based potentials to predict the protein–protein binding surfaces of several antibody or antibody-related proteins.
The protein–protein complexes they used were taken from the 21 CAPRI (Critical Assessment of PRedicted Interactions; Méndez et al. 2005) target proteins.

Protein design is another active research area in which knowledge-based potentials are used extensively. Protein design means identifying sequences that are compatible with a given fold but incompatible with alternative folds (Koehl and Levitt 1999a,b). This is important in designing novel proteins that may not exist in nature but have enhanced or novel biological functions, and it is especially important in the pharmaceutical industry, where the development of new drugs is essential. Designing proteins is extremely complex because, even for a protein with a small number of residues, the number of possible sequences is huge. It is difficult to design proteins by searching these large sequence spaces with computational and experimental techniques alone; protein design techniques therefore rely heavily on potential functions to carry out the search. Hu et al. (2004) used optimized non-linear design potential functions to distinguish the native sequence that folds to a given protein structure from a set of 440 decoy sequences. Machine learning algorithms were used to train the potential function. To test the design potential and check whether the native sequence can be distinguished from the decoy sequences, the predicted sequences with the lowest scores were compared with the native sequences. All 440 native sequences were successfully distinguished from the decoy sequences. Non-linear design potential functions are capable of discriminating native sequences from decoy sequences, whereas linear potential functions are not.

Knowledge-based potential functions are also used in predicting protein stability.
It is reasonable to use statistical potentials to estimate protein stability upon mutation because a mutation does not appreciably change the partition function of the protein sequence. In studies using physics-based empirical potentials and statistical potentials, predicted and experimentally measured stabilities correlate well (Miyazawa and Jernigan 1994; Gilis and Rooman 1996, 1997; Guerois et al. 2002; Zhou and Zhou 2002; Bordner and Abagyan 2004; Hoppe and Schomburg 2005). Furthermore, statistical potentials have been used to predict quantitative binding free energies of protein interactions, such as protein–protein or protein–ligand interactions (DeWitte and Shakhnovich 1996; Mitchell et al. 1999; Muegge and Martin 1999; Liu et al. 2004; Zhang et al. 2005). Changes in binding free energy upon mutation can also be predicted using knowledge-based potentials (Kortemme and Baker 2002; Kortemme et al. 2004).
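The idea of estimating a stability change from a contact potential can be sketched in a few lines. This is a simplified illustration, not the procedure of any cited study: it assumes the mutated residue keeps the same set of contacting neighbors, it uses a pairwise contact potential only, and every energy value in `toy_potential` is invented. The function and variable names are hypothetical.

```python
def pair_energy(potential, a, b):
    """Symmetric lookup of a residue-residue contact energy."""
    return potential[(a, b)] if (a, b) in potential else potential[(b, a)]

def contact_energy_change(wild_type, mutant, neighbors, potential):
    """Energy change when `wild_type` is replaced by `mutant`, contacts held fixed.

    Under the assumed convention, a positive value means the mutant makes
    less favorable contacts than the wild type (a destabilizing mutation).
    """
    e_wt = sum(pair_energy(potential, wild_type, n) for n in neighbors)
    e_mut = sum(pair_energy(potential, mutant, n) for n in neighbors)
    return e_mut - e_wt

# Invented contact energies in arbitrary units (not from any published table).
toy_potential = {
    ("LEU", "LEU"): -2.0,
    ("LEU", "ILE"): -2.0,
    ("ALA", "LEU"): -1.0,
    ("ALA", "ILE"): -1.0,
}

# A buried leucine in contact with three hydrophobic neighbors, mutated to alanine.
dde = contact_energy_change("LEU", "ALA", ["LEU", "ILE", "LEU"], toy_potential)
print(dde)  # 3.0
```

Comparing such predicted energy changes with measured stability changes over many mutations is what the correlation studies cited above report.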

6.11 Future Developments

Although statistical coarse-grained potentials have been quite successful in predicting protein tertiary structures from the amino acid sequence and in protein design, mutation stability, and docking studies, there is still significant room for improvement. One aspect that is, in our opinion, especially important is relating
inter-residue potentials to the dense packing within protein cores. These dense states are not easy to represent. Polyhedra and lattices are two of the simplest ways to achieve high packing densities. The face-centered cubic (fcc) lattice packs equal spheres at the highest possible density. Recently, there has been renewed interest in packing problems in computer science and mathematics, for example a paper on the packing of tetrahedra (Haji-Akbari et al. 2009). There are also interesting alternative polyhedra that can be used to represent the clusters of densely packed residues within a protein. Our preliminary data show that an icosahedron fits the orientational packing of residues significantly better than the fcc lattice (Feng et al. 2008). The fcc lattice and the icosahedron are comparable, since both have 12 directions between the central point and its nearest nodes. By analyzing coordination clusters in our dataset we have found, however, that some clusters have as many as 14 nearest neighbors. For this reason another polyhedron, the tetrakis hexahedron, which has 14 vertices and 24 faces, might also be a good model. The tetrakis hexahedron is slightly less regular than the icosahedron: the icosahedron is one of the Platonic solids, the most regular polyhedra, while the tetrakis hexahedron is a non-regular polyhedron belonging to the Catalan solids (duals of the Archimedean solids). Because our four-body potentials provide significantly more cooperative representations than pairwise potentials, and because they have shown important gains in threading calculations, we expect that potentials derived by using icosahedral or tetrakis hexahedron packings might further improve the coarse-grained representation of proteins.
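The three reference geometries compared above can be written down explicitly. The sketch below is only a geometric illustration under stated assumptions: each model is represented by unit vectors from the central point to its nodes, and `nearest_direction` (a hypothetical helper, not the fitting procedure of Feng et al. 2008) assigns a neighbor vector to the closest model direction. For the tetrakis hexahedron only the 14 vertex directions are used (6 along the axes, 8 along the cube diagonals), not the vertex distances.

```python
import math
from itertools import product

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def fcc_directions():
    """12 nearest-neighbor directions of the fcc lattice: permutations of (+-1, +-1, 0)."""
    dirs = []
    for a, b in product((1, -1), repeat=2):
        dirs += [(a, b, 0), (a, 0, b), (0, a, b)]
    return [normalize(d) for d in dirs]

def icosahedron_directions():
    """12 vertices of a regular icosahedron: cyclic permutations of (0, +-1, +-phi)."""
    phi = (1 + math.sqrt(5)) / 2
    dirs = []
    for a, b in product((1, -1), repeat=2):
        dirs += [(0, a, b * phi), (a, b * phi, 0), (b * phi, 0, a)]
    return [normalize(d) for d in dirs]

def tetrakis_hexahedron_directions():
    """14 vertex directions: 6 along the axes and 8 along the cube diagonals."""
    axes = [(s, 0, 0) for s in (1, -1)] + [(0, s, 0) for s in (1, -1)] \
         + [(0, 0, s) for s in (1, -1)]
    return [normalize(d) for d in axes + list(product((1, -1), repeat=3))]

def min_angle_deg(directions):
    """Smallest angle (degrees) between any two distinct directions."""
    best = 180.0
    for i in range(len(directions)):
        for j in range(i + 1, len(directions)):
            dot = sum(a * b for a, b in zip(directions[i], directions[j]))
            best = min(best, math.degrees(math.acos(max(-1.0, min(1.0, dot)))))
    return best

def nearest_direction(v, directions):
    """Index of the model direction closest in angle to the vector v."""
    u = normalize(v)
    return max(range(len(directions)),
               key=lambda i: sum(a * b for a, b in zip(u, directions[i])))

for name, dirs in [("fcc", fcc_directions()),
                   ("icosahedron", icosahedron_directions()),
                   ("tetrakis hexahedron", tetrakis_hexahedron_directions())]:
    print(name, len(dirs), round(min_angle_deg(dirs), 2))
```

The printed counts (12, 12, 14) match the numbers of directions quoted in the text. The minimum angular separations show one way in which the models differ: the 12 icosahedral directions are spread slightly more evenly than the 12 fcc directions.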
To compare different coarse-grained potentials, we have recently developed the Potentials‘R’Us database and server for testing and comparing statistical potentials and for estimating the energies of coarse-grained protein models with a variety of knowledge-based potentials (Feng et al. 2010). The Potentials‘R’Us database contains two types of four-body potentials, short-range potentials, and 23 different two-body potentials (Feng et al. 2010). The server is an easily accessible, freely available tool with a web interface that collects existing and future coarse-grained protein potentials and computes the energies of multiple structural models. It allows the energies of different protein folds to be evaluated and significantly improves access to a wide variety of knowledge-based potentials. The server accepts multiple structure files in the PDB format (including hundreds or even thousands of decoys), and the results are sent promptly to the e-mail address supplied by the user. The address of the server is http://gor.bb.iastate.edu/potential.

Twenty-one two-body potential functions from the Potentials‘R’Us web server (Feng et al. 2010) were tested, using structural models from the CASP8 competition as decoys, to examine how well they recognize the target structure in each corresponding set of decoys (Gniewek et al. 2010, unpublished data). The CASP8 targets were divided into two subsets according to the method used to predict the structural models (decoys) for each target. One set comprised targets for which the structural models were predicted by homology modeling (easy targets, 153 cases), and the other comprised targets for which the decoy sets were obtained by template-free (non-homology) structure prediction methods (hard targets, 12 cases). All incomplete decoys were removed from the sets.
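For a single target, this kind of test comes down to asking how well a potential's energy tracks the RMSD of each decoy from the native structure. The sketch below is a plain-Python illustration of the three correlation coefficients used for that purpose (Eqs. 6.30–6.32); it is not the authors' code. The function names and the toy numbers are assumptions, the Spearman routine uses the simplified no-ties formula, and the Kendall routine counts pairs directly, which is quadratic in the number of decoys.

```python
import math

def pearson(x, y):
    """Covariance of x and y normalized by their standard deviations (Eq. 6.30)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def ranks(values):
    """Ranks 1..n of the values, assuming there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_no_ties(x, y):
    """Spearman rank correlation from rank differences (Eq. 6.31, no tied ranks)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """(concordant - discordant) pairs over the total number of pairs (Eq. 6.32)."""
    n = len(x)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                nc += 1
            elif s < 0:
                nd += 1
    return (nc - nd) / (0.5 * n * (n - 1))

# Invented energies (arbitrary units) and RMSDs (in angstroms) for five decoys.
energies = [-12.0, -10.5, -9.0, -9.5, -7.0]
rmsd = [1.2, 2.5, 3.1, 4.0, 6.3]

print(round(pearson(energies, rmsd), 3))
print(round(spearman_no_ties(energies, rmsd), 3))  # 0.9
print(round(kendall_tau(energies, rmsd), 3))       # 0.8
```

In a benchmark of the kind described here, these per-target values would then be averaged over all targets in a category.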


Z-scores were calculated for decoys. The RMSD between the native structure and the best-fit decoy and the RMSD between the native and the average of the five best-fit decoys for each decoy set were also calculated. The Spearman, Pearson, and Kendall correlation coefficients were calculated for all the target–decoy pairs using potential energies and the RMSD values from the native structure. Pearson's correlation coefficient is the covariance of two variables normalized by their standard deviations:

$$\rho_P = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{6.30}$$

Spearman's rank correlation coefficient is a measure of statistical dependence between two ranked variables. When tied ranks exist, $\rho_S$ is computed from the same formula as $\rho_P$, applied to the ranks. When there are no tied ranks, the Spearman correlation coefficient can be computed from a simpler formula:

$$\rho_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \tag{6.31}$$

with $d_i = x_i - y_i$ being the difference between the ranks of the two variables. The Kendall $\tau$ coefficient is a measure of rank correlation, i.e., the similarity of the ordering of the data when ranked by different quantities, and is defined as:

$$\tau = \frac{n_c - n_d}{\frac{1}{2} n(n - 1)} \tag{6.32}$$

where $n_c$ is the number of concordant pairs, $n_d$ is the number of discordant pairs, and the denominator is the total number of pairs. The values obtained for each target were then averaged over all the decoy sets in each of the two categories to obtain averaged values for each potential function. The results are shown in Table 6.3a, b.

Table 6.3 Threading results for two-body potentials for decoys obtained by (a) homology modeling and (b) free modeling

Potential   Spearman ρ   Pearson ρ   Kendall τ   z-score   RMSD top1 (Å)   RMSD top5 (Å)
(a) Homology modeling
Qm            0.3874      0.3696      0.2693      1.251       5.039           5.648
Qp            0.4062      0.3942      0.2933      1.217       6.453           5.898
HLPL          0.3943      0.3795      0.2846      1.178       6.678           6.013
SKOb          0.4307      0.4412      0.3054      1.479       4.808           5.217
SKOa          0.4154      0.4042      0.2909      1.420       5.158           5.616
SKJG          0.4374      0.4309      0.3068      1.410       4.614           5.091
MJPL          0.2998      0.2609      0.2172      0.752       9.308           7.847
MJ3h          0.4590      0.4830      0.3310      1.395       4.939           5.096
MJ2h          0.3247      0.2980      0.2349      0.812       8.049           7.571
TS            0.2751      0.2350      0.1990      0.656       9.445           8.364
BT            0.4619      0.4881      0.3319      1.496       4.087           4.891
BFKV          0.4546      0.4799      0.3269      1.453       4.919           5.264
TD            0.4404      0.4503      0.3180      1.268       5.389           5.427
TEl           0.4306      0.4620      0.3068      1.412       4.733           4.860
TEs           0.4224      0.4509      0.2988      1.388       6.109           5.267
RO            0.2397      0.2627      0.1642      0.455       5.848           5.678
MS            0.3837      0.3978      0.2663      1.249       5.210           5.043
MJ3           0.3968      0.4013      0.2756      1.291       4.612           5.168
GKS           0.3043      0.3120      0.2102      1.162       6.259           5.592
VD            0.4081      0.4338      0.2890      1.400       4.576           5.017
MSBM          0.0688      0.0453      0.0458      0.019       8.578           8.286
(b) Template-free modeling
Qm            0.1887      0.1427      0.1251      1.701       9.334          11.173
Qp            0.1586      0.0407      0.1310      1.430       9.722          10.859
HLPL          0.1640      0.0346      0.1342      1.317      10.309          10.676
SKOb          0.1892      0.1348      0.1348      1.944      11.35           10.814
SKOa          0.2127      0.1541      0.1397      2.010      10.439          10.291
SKJG          0.2034      0.1550      0.1327      1.880      10.783          10.486
MJPL          0.1520     −0.0184      0.1244      0.880      11.35           11.975
MJ3h          0.1463      0.1154      0.1131      2.020       8.364           9.944
MJ2h          0.1594     −0.0026      0.1302      0.996      11.797          11.863
TS            0.1467     −0.0246      0.1203      0.809      11.572          12.324
BT            0.1861      0.1583      0.1366      2.137       7.733          10.202
BFKV          0.1694      0.1260      0.1332      1.977       9.249          10.454
TD            0.1613      0.1026      0.1287      1.779       9.876          10.291
TEl           0.1501      0.1160      0.1075      1.704      10.625          10.691
TEs           0.1488      0.1093      0.1025      1.585      10.881          10.775
RO            0.1203      0.1255      0.0794      0.457      12.264           9.865
MS            0.2173      0.1805      0.1437      1.555      10.251          10.290
MJ3           0.2242      0.1786      0.1468      1.661       9.621           9.984
GKS           0.1638      0.1160      0.1105      1.333      11.102          11.292
VD            0.1627      0.1376      0.1207      1.579      10.921          11.680
MSBM          0.0999      0.0555      0.0654      1.052      10.754          10.693

The tested potentials are all knowledge-based coarse-grained potentials, which usually capture statistics based on the coordinates of Cα (sometimes Cβ) atoms and therefore do not take the atomic details of proteins into account. We observed that for template-based targets the BT potential (Betancourt and Thirumalai 1999) performs best (in terms of correlation coefficients, average z-score, and average RMSD). The best RMSD values are in the range of 4–5 Å. A few other potentials show similar performance. The targets in the homology set have been modeled with high accuracy, in many cases with an RMSD of about 1–5 Å, whereas the free-modeled targets in most cases have higher RMSD values. For free-modeled targets, performance (in terms of both correlation coefficients and averaged values of z-score and cRMSD) is worse than for the homology-based case. Potentials that perform well for homology-based targets also perform well for free-modeled targets, although the results are not as good as for the former. This is because the submitted models usually deviate more from the native protein structures in the template-free cases than in the homology-based ones. The potentials are not able to recognize proteins' native structures with 100% accuracy regardless of the modeling method used to generate the structures. This could be because knowledge-based potentials are themselves used, along with sophisticated optimization methods, to generate the models. Therefore, to obtain a better quality assessment it is reasonable to produce decoys using one potential and assess their quality using other scoring functions. An example can be found in McGuffin (2007).

Acknowledgments We acknowledge the financial support provided by NIH grants 1R01GM073095-3, 1R01GM072014-5, and 1R01GM081680-2.

References

Anfinsen C, Haber E, Sela M, White F (1961) The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc Natl Acad Sci USA 47:1309–1314 Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230 Bahar I, Kaplan M, Jernigan RL (1997) Short-range conformational energies, secondary structure propensities, and recognition of correct sequence-structure matches. Proteins: Struct Funct Genet 29:292–308 Bahar I, Jernigan RL (1997) Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J Mol Biol 266:195–214 Betancourt MR, Thirumalai D (1999) Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci 8:361–369 Bordner AJ, Abagyan RA (2004) Large-scale prediction of protein geometry and stability changes for arbitrary single point mutations. Proteins 57:400–413 Bowie JU, Luthy R, Eisenberg D (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science 253:164–170 Carter C Jr, LeFebvre B, Cammer S, Tropsha A, Edgell M (2001) Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. J Mol Biol 311:625–638 Czaplewski C, Rodziewicz-Motowidlo S, Liwo A, Ripoll DR, Wawak RJ, Scheraga HA (2000) Molecular simulation study of cooperativity in hydrophobic association. Protein Sci 9:1235–45 Dehouck Y, Gilis D, Rooman M (2006) A new generation of statistical potentials for proteins. Biophys J 90:4010–4017 Deutsch JM, Kurosky T (1996) New algorithm for protein design. Phys Rev Lett 76:323–326 DeWitte RS, Shakhnovich EI (1996) SMoG: de novo design method based on simple, fast and accurate free energy estimates. 1. Methodology and supporting evidence.
J Am Chem Soc 118:11733–11744 Dobbs H, Orlandini E, Bonaccini R, Seno F (2002) Optimal potentials for predicting inter-helical packing in transmembrane proteins. Proteins 49:342–349 Dombkowski AA, Crippen GM (2000) Disulfide recognition in an optimized threading potential. Protein Eng 13:679–689 Dong Q, Wang X, Lin L (2006) Novel knowledge-based mean force potential at the profile level. BMC Bioinformatics 7:324 Eisenberg D, Luthy R, Bowie JU (1997) VERIFY3D: Assessment of protein models with threedimensional profiles. Methods Enzymol 277:396–404 Feng Y, Kloczkowski A, Jernigan RL (2007) Four-body contact potentials derived from two protein datasets to discriminate native structures from decoys. Proteins 68:57–66


Feng Y, Jernigan RL, Kloczkowski A (2008) Orientational distributions of contact clusters in proteins closely resemble those of an icosahedron. Proteins Struct Funct Bioinf 73:730–741 Feng Y, Kloczkowski A, Jernigan RL (2010) Potentials ‘R’Us web-server for protein energy estimations with coarse-grained knowledge-based potentials. BMC Bioinformatics 11:92 Finkelstein AV, Badretdinov AY, Gutin AM (1995) Why do protein architectures have Boltzmannlike statistics? Proteins 23:142–150 Goldstein R, Luthey-Schulten ZA, Wolynes PG (1992) Protein tertiary structure recognition using optimized Hamiltonians with local interactions. Proc Natl Acad Sci USA 89:9029–9033 Gatchell DW, Dennis S, Vajda S (2000) Discrimination of near-native protein structures from misfolded models by empirical free energy functions. Proteins 41:518–534 Georgescu RE, Alexov EG, Gunner MR (2002) Combining conformational flexibility and continuum electrostatics for calculating pK(a)s in proteins. Biophys J 83:1731–1748 Gilis D, Rooman M (1996) Stability changes upon mutation of solvent accessible residues in proteins evaluated by database-derived potentials. J Mol Biol 257:1112–1126 Gilis D, Rooman M (1997) Predicting protein stability changes upon mutation using databasederived potentials: Solvent accessibility determines the importance of local versus non-local interactions along the sequence. J Mol Biol 272:276–290 Gilis D (2004) Protein decoy sets for evaluating energy functions. J Biomol Struct Dyn 21:725–736 Guerois R, Nielsen JE, Serrano L (2002) Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol 320:369–387 Haji-Akbari A, Engel M, Keys AS, Zheng X, Petschek RG, Palffy-Muhoray P, Glotzer SC (2009) Disordered, quasicrystalline and crystalline phases of densely packed tetrahedra. Nature 462:773–7 Hao MH, Scheraga HA (1996) How optimization of potential functions affects protein folding. 
Proc Natl Acad Sci USA 93:4984–4989 Hao MH, Scheraga HA (1999) Designing potential energy functions for protein folding. Curr Opin Struct Biol 9:184–188 Hendlich M, Lackner P, Weitckus S, Floechner H, Froschauer R, Gottsbachner K, Casari G, Sippl MJ (1990) Identification of native protein folds amongst a large number of incorrect models: the calculation of low energy conformations from potentials of mean force. J Mol Biol 216: 167–180 Hill TL (1960) Statistical mechanics. Addison-Wesley, Reading, MA Hinds DA, Levitt M (1992) A lattice model for protein structure prediction at low resolution. Proc Natl Acad Sci 89:2536–2540 Hoppe C, Schomburg D (2005) Prediction of protein thermostability with a direction- and distancedependent knowledge-based potential. Protein Sci 14:2682–2692 Hu C, Li X, Liang J (2004) Developing optimal non-linear scoring function for protein design. Bioinformatics 20:3080–3098 Hubner IA, Deeds EJ, Shakhnovich EI (2005) High-resolution protein folding with a transferable potential. Proc Natl Acad Sci 102:18914–18919 Jernigan RL, Bahar I (1996) Structure-derived potentials and protein simulations. Curr Opin Struct Biol 6:195–209 Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature 358:86–89 Karplus M, Petsko GA (1990) Molecular dynamics simulations in biology. Nature 347:631–639 Koehl P, Levitt M (1999a) De novo protein design. I. In search of stability and specificity. J Mol Biol 293:1161–1181 Koehl P, Levitt M (1999b) De novo protein design. II. Plasticity of protein sequence. J Mol Biol 293:1183–1193 Kolinski A (2004) Protein modeling and structure prediction with a reduced representation. Acta Biochimica Polonica 51:349–371 Koretke KK, Luthey-Schulten Z, Wolynes PG (1996) Self-consistently optimized statistical mechanical energy functions for sequence structure alignment. Protein Sci 5:1043–1059


Koretke KK, Luthey-Schulten Z, Wolynes PG (1998) Self-consistently optimized energy functions for protein structure prediction by molecular dynamics. Proc Natl Acad Sci USA 95:2932–2937 Kortemme T, Baker D (2002) A simple physical model for binding energy hot spots in protein– protein complexes. Proc Natl Acad Sci USA 99:14116–14121 Kortemme T, Kim DE, Baker D (2004) Computational alanine scanning of protein–protein interfaces. Sci STKE 2004:pl2 Krishnamoorthy B, Tropsha A (2003) Development of a four-body statistical pseudo-potential to discriminate native from nonnative protein conformations. Bioinformatics 19:1540–1548 Laurents DV, Huyghes-Despointes BMP, Bruix M, Thurlkill RL, Schell D, Newsom S, Grimsley GR, Shaw KL, Trevi S, Rico M, Briggs JM, Antosiewicz JM, Scholtz JM, Pace CN (2003) Charge–charge interactions are key determinants of the pK values of ionizable groups in ribonuclease Sa (pI = 3.5) and a basic variant (pI = 10.2). J Mol Biol 325:1077–1092 Lee B (1993) Estimation of the maximum change in stability of globular proteins upon mutation of a hydrophobic residue to another of smaller size. Protein Sci 2:733–738 Li H, Helling R, Tang C, Wingreen N (1996) Emergence of preferred structures in a simple model of protein folding. Science 273:666–669 Li X, Hu C, Liang J (2003) Simplicial edge representation of protein structures and alpha contact potential with confidence measure. Proteins 53:792–805 Li X, Liang J (2005a) Computational design of combinatorial peptide library for modulating protein–protein interactions. Pacific Symposium of Biocomputing 10:28–39 Li X, Liang J (2005b) Geometric cooperativity and anti-cooperativity of three-body interactions in native proteins. Proteins 60:46–65 Li X, Liang J (2007) Knowledge-based energy functions for computational studies of proteins. In: Xu Y, Xu D, Liang J (eds) Computational methods for protein structure prediction and modeling, 1st edn. 
Springer, New York, NY, pp 71–123 Liu S, Zhang C, Zhou H, Zhou Y (2004) A physical reference state unifies the structure-derived potential of mean force for protein folding and binding. Proteins 56:93–101 Liwo A, Czaplewski C, Pillardy J, Scheraga HA (2001) Cumulant-based expressions for the multibody terms for the correlation between local and electrostatic interactions in the united-residue force field. J Chem Phys 115:2323–2347 Liwo A, Kazmierkiewicz R, Czaplewski C, Groth M, Oldziej S, Wawak RJ, Rackovsky S, Pincus MR, Scheraga HA (1998) United-residue force field for off-lattice protein-structure simulations: III. Origin of backbone hydrogen-bonding cooperativity in united-residue potentials. J Com Chem 19:259–276 Liwo A, Oldziej S, Pincus MR, Wawak RJ, Rackovsky S, Scheraga HA (1997a) A united-residue force field for off-lattice protein-structure simulations. 1. Functional forms and parameters of long-range side-chain interaction potentials from protein crystal data. J Com Chem 18:849–873 Liwo A, Pincus MR, Wawak RJ, Rackovsky S, Oldziej S, Scheraga HA (1997b) A united-residue force field for off-lattice protein-structure simulations. 2. Parameterization of short-range interactions and determination of weights of energy terms by Z-score optimization. J Com Chem 18:874–887 Lu H, Skolnick J (2001) A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 44:223–232 Maiorov VN, Crippen GM (1992) Contact potential that recognizes the correct folding of globular proteins. J Mol Biol 227:876–888 McConkey BJ, Sobolev V, Edelman M (2003) Discrimination of native protein structures using atom–atom contact scoring. Proc Natl Acad Sci USA 100:3215–3220 McGuffin LJ (2007) Benchmarking consensus model quality assessment for protein fold recognition. BMC Bioinformatics 8:345 Mehler EL, Fuxreiter M, Simon I, Garcia-Moreno EB (2002) The role of hydrophobic microenvironments in modulating pKa shifts in proteins. 
Proteins 48:283–292 Méndez R, Leplae R, Lensink MF, Wodak SJ (2005) Assessment of Capri predictions in rounds 3–5 shows progress in docking procedures. Proteins 60:150–169


Mirny LA, Shakhnovich EI (1996) How to derive a protein folding potential? A new approach to an old problem. J Mol Biol 264:1164–1179 Mitchell BO, Laskowski RA, Alex A, Thornton JM (1999) BLEEP: Potential of mean force describing protein–ligand interactions: II. Calculation of binding energies and comparison with experimental data. J Comp Chem 20:1177–1185 Miyazawa S, Jernigan RL (1985) Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules 18:534–552 Miyazawa S, Jernigan RL (1994) Protein stability changes for single substitution mutants and the extent of local compactness in the denatured state. Prot Eng 7:1209–1220 Miyazawa S, Jernigan RL (1996) Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Bio 256:623–644 Miyazawa S, Jernigan RL (1999a) Evaluation of short-range interactions as secondary structure energies for protein fold and sequence recognition. Proteins 36:347–356 Miyazawa S, Jernigan RL (1999b) Self-consistent estimation of interresidue protein contact energies based on an equilibrium mixture approximation of residues. Proteins 34:49–68 Miyazawa S, Jernigan RL (1999c) An empirical energy potential with a reference state for protein fold and sequence recognition. Proteins Struct Funct Genet 36:357–369 Muegge I, Martin YC (1999) A general and fast scoring function for protein–ligand interactions: A simplified potential approach. J Med Chem 42:791–804 Munson PJ, Singh RK (1997) Statistical significance of hierarchical multi-body potentials based on Delaunay tessellation and their application in sequence structure alignment. Protein Science 6:1467–1481 Park BH, Levitt M (1996) Energy functions that discriminate X-ray and near native folds from well-constructed decoys. 
J Mol Biol 258:367–392 Pillardy J, Czaplewski C, Liwo A, Lee J, Ripoll DR, Kazmierkiewicz R, Oldziej S, Wedemeyer WJ, Gibson KD, Arnautova YA, Saunders J, Ye YJ, Scheraga HA (2001) Recent improvements in prediction of protein structure by global optimization of a potential energy function. Proc Nat Acad Sci USA 98:2329–2333 Pokarowski P, Kloczkowski A, Jernigan RL, Kothari NS, Pokarowska M, Kolinski A (2005) Inferring ideal amino acid interaction forms from statistical protein contact potentials. Proteins: Struct Func Bioinf 59:49–57 Qiu, J, Elber R (2005) Atomically detailed potentials to recognize native and approximate protein structures. Proteins 61:44–55 Samudrala R, Levitt M (2000) Decoys ‘R’ Us: A database of incorrect conformations to improve protein structure prediction. Protein Sci 9:1399–1401 Samudrala R, Moult J (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 275:895–916 Sandberg L, Edholm O (1999) A fast and simple method to calculate protonation states in proteins. Proteins 36:474–483 Shakhnovich EI, Gutin AM (1993) Engineering of stable and fast-folding sequences of model proteins. Proc Natl Acad Sci USA 90:7195–7199 Shen MY, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15:2507–2524 Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequenceindependent features of proteins. Proteins 34:82–95 Singh RK, Tropsha A, Vaisman II (1996) Delaunay tessellation of proteins: four body nearestneighbor propensities of amino acid residues. J Comp Biol 3:213–221 Sippl MJ (1990) Calculation of conformational ensembles from potentials of the main force. J Mol Biol 213:167–180 Sippl MJ (1993) Boltzmann’s principle, knowledge-based mean fields and protein folding. 
An approach to the computational determination of protein structures. J Comp Aided Mol Des 7:473–501


Skolnick J (2006) In quest of an empirical potential for protein structure prediction. Curr Opin Struct Biol 16:166–171 Skolnick J, Jaroszewski L, Kolinski A, Godzik A (1997) Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Protein Sci 6:676–688 Tanaka S, Scheraga HA (1976) Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9:945–950 Thomas PD, Dill KA (1996a) An iterative method for extracting energy-like quantities from protein structures. Proc Natl Acad Sci USA 93:11628–11633 Thomas PD, Dill KA (1996b) Statistical potentials extracted from protein structures: How accurate are they? J Mol Biol 257:457–469 Tobi D, Elber R (2000) Distance-dependent, pair potential for protein folding: Results from linear optimization. Proteins 41:40–46 Tobi D, Shafran G, Linial N, Elber R (2000) On the design and analysis of protein folding potentials. Proteins 40:71–85 Tollinger M, Crowhurst KA, Kay LE, Forman-Kay JD (2003) Site specific contributions to the pH dependence of protein stability. Proc Natl Acad Sci USA 100:4545–4550 Vajda S, Sippl M, Novotny J (1997) Empirical potentials and functions for protein folding and binding. Curr Opin Struc Biol 7:222–228 Vendruscolo M, Domanyi E (1998) Pairwise contact potentials are unsuitable for protein folding. J Chem Phys 109:11101–11108 Vendruscolo M, Najmanovich R, Domany E (2000) Can a pairwise contact potential stabilize native protein folds against decoys obtained by threading? Proteins-Struc Funct Genet 38:134–148 Wolynes PG, Onuchic JN, Thirumalai D (1995) Navigating the folding routes. Science 267: 1619–1620 Wu Y, Chen M, Lu M, Wang Q, Ma J (2005a) Determining protein topology from skeletons of secondary structures. J Mol Biol 350:571–586 Wu YH, Lu MY, Chen MZ, Li JL, Ma JP (2007) OPUS-Ca: a knowledge-based potential function requiring only C alpha positions. 
Protein Sci 16:1449–1463 Wu Y, Tian X, Lu M, Chen M, Wang Q, Ma J (2005b) Folding of small helical proteins assisted by small-angle X-ray scattering profiles. Structure 13:1587–1597 Yang L, Tan CH, Hsieh MJ, Wang J, Duan Y, Cieplak P, Caldwell J, Kollman PA, Luo R (2006) New-generation amber united-atom force field. J Phys Chem B 110:13166–76 Zhang C, Kim SH (2000) Environment-dependent residue contact energies for proteins. Proc Nat Acad Sci 97:2550–2555 Zhang C, Liu S, Zhu Q, Zhou Y (2005) A knowledge-based energy function for protein–ligand, protein–protein, and protein–DNA complexes. J Med Chem 48:2325–2335 Zhang J, Chen R, Liang J (2006) Empirical potential function for simplified protein models: Combining contact and local sequence-structure descriptors. Proteins 63:949–960 Zheng W, Cho SJ, Vaisman II, Tropsha A (1997) A new approach to protein fold recognition based on Delaunay tessellation of protein structure. Pac Symp Biocomp 1997:486–497 Zhou H, Zhou Y (2002) Distance-scaled, finite ideal-gas reference state improves structurederived potentials of mean force for structure selection and stability prediction. Protein Sci 11: 2714–2726

Chapter 7

Bridging the Atomic and Coarse-Grained Descriptions of Collective Motions in Proteins Vincenzo Carnevale, Cristian Micheletti, Francesco Pontiggia, and Raffaello Potestio

Abstract In proteins and enzymes, the requirement that the native state be thermodynamically stable must be balanced against the capability of the structure to sustain conformational changes and to interconvert efficiently among different functionally relevant conformers. This subtle equilibrium is reflected in the complexity of the free-energy landscape, which is endowed with a variety of local minima of varying depth and breadth corresponding to the salient structural states of the molecule. In this chapter we present some concepts and computational algorithms that can be used to characterize the internal dynamics of proteins and relate it to their “functional mechanics.” We apply these concepts to the analysis of a molecular dynamics simulation of adenylate kinase, a protein whose structural rearrangement is known to be crucial for its biological function. We show that, despite the structural heterogeneity of the explored conformational ensemble, the generalized directions accounting for conformational fluctuations within and across the visited conformational substates are robust and can be described by a limited set of collective coordinates. Finally, for comparison, we show that in the case of HIV-1 Trans-Activator of Transcription (TAT), a naturally unstructured protein, the lack of any hierarchical organization of the free-energy minima results in a poor consistency of the essential dynamical spaces sampled during the dynamical evolution of the system.

C. Micheletti (✉) Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy; Democritos CNR-IOM and Italian Institute of Technology (SISSA Unit), Trieste, Italy e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_7, © Springer Science+Business Media, LLC 2011

7.1 Introduction

Proteins have arguably evolved under the pressure of several concurrent selection mechanisms. Among the key selection factors, the balance between thermodynamic stability and conformational elasticity deserves a particular mention. In fact, on one hand the thermodynamic stability of a particular conformational state – the native state – is required to attain a well-defined three-dimensional structure under physiological conditions (temperature, pH, etc.). At the same time, the functionality of proteins and enzymes often relies on their capability to sustain appreciable conformational changes, and this aspect favors conformational heterogeneity.

The above-mentioned interplay is best discussed following the development of the key concepts that have progressively shaped the current understanding of enzymatic functionality. The fact that proteins are characterized by a well-defined native state was first suggested by H.E. Fischer who, at the end of the nineteenth century, proposed the lock-and-key paradigm for enzymatic activity. According to this view, both the enzyme and the substrate to be processed are described as rigid molecules that can bind and interact only when they have complementary matching shapes. Interestingly, the notion of proteins as static rigid objects was challenged by the very first crystallographic determinations of protein structures (see references in Pontiggia 2008). In particular, the inspection of the X-ray-resolved structures of myoglobin and hemoglobin indicated that (i) the apo-form and holo-form of these heme proteins presented noticeable structural differences and that (ii) the apo conformers were too compact to possibly allow oxygen molecules to diffuse from the solvent toward the heme pocket inside the proteins. Both observations prompted the conclusion that proteins can exist in structurally different substates, corresponding to various minima of the free-energy landscape separated by free-energy barriers of various heights (Frauenfelder et al. 1991).
According to the induced-fit theory, the interconversion between the different biologically relevant substates is triggered by the binding/release of the correct ligand(s) which modifies the free-energy landscape favoring the bound/unbound state. The conformational changes involved by such interconversions can range from localized differences, such as dihedral angle isomers, to much larger deformations. The latter may result from the concerted displacement of several secondary elements, possibly also involving changes in the overall tertiary organization. The induced-fit mechanism is extremely valuable for rationalizing many functionally oriented protein structural changes including the selective recognition, binding, and processing of other molecules, the catalysis of biochemical reactions, the mechanical transduction of regulatory signals, and the activation or inactivation of metabolic processes that constitute the life cycle of the cell. Though the scope of applicability of the induced-fit mechanism is fairly large, the advent of a new generation of single-molecule experiments has clarified that the substate interconversion dynamics of several proteins defies this paradigm. For a growing number of enzymes, it is now known that, in thermal equilibrium and in the absence of any ligand, the free biomolecule can appreciably populate not only the inactive (rest) form but also the active one. Thus, the emerging view of catalysis involves the notion of conformational selection: rather than triggering a structural transition, ligand binding occurs only for those molecules already “resembling” the bound state. Hence, ligand binding does not necessarily cause conformational changes, but can shift the equilibrium

between the populations of the various species already present in solution, favoring the dominant substate. Consequently, a great deal of the proteins’ proficiency relies on their intrinsic flexibility. The unexpected capability of proteins to spontaneously interconvert between different biologically relevant states points to the fact that the internal dynamics of the free enzymes is “innately” predisposed to assist these interconversions (Henzler-Wildman and Kern 2007). Starting from these observations it is appealing to suggest that proteins have evolved not only under the pressure to fold rapidly and reversibly into a well-defined overall shape but also to be capable of harnessing thermal energy toward realizing specific structural fluctuations that are functionally oriented. From the brief account given above it is clear that the phenomenology of functionally oriented structural fluctuations in proteins and enzymes is too rich and complex to be accounted for within a single theoretical paradigm. Indeed, the richness of the observed behavior parallels the complexity of the free-energy landscape, which is endowed with a variety of local minima of varying depth and breadth corresponding to the salient structural states of the molecules. Depending on the specific protein of interest, local minima associated with the biologically relevant conformational substates may pre-exist the binding of ligands or can be created by the latter. In either case, it can be argued that a fruitful strategy to study the occurrence of structural fluctuations in proteins is through the characterization of how the potential energy landscape is “navigated” under the effect of thermal fluctuations. An extensive investigation of the near-native free-energy landscape is here presented for adenylate kinase. This enzyme is of high biological interest because it regulates the energy charge of the cell.
In recent years it has been the object of an increasing number of studies; the overview presented here is based on the specific theoretical and computational investigations carried out within our group over the past few years. Our analysis will be largely based on atomistic molecular dynamics simulations of the free form of adenylate kinase carried out at constant temperature. A number of methods are used to monitor and characterize the internal dynamics of this enzyme, which is known to be capable of converting spontaneously between two markedly different states. We shall discuss in detail the connection between the small-scale structural changes experienced by the molecule over the subnanosecond time scale, and the much larger collective changes observed over tens of ns (which represents the current limit of computational approaches). The analysis indicates the existence of preferential, “innate,” generalized directions along which both small-scale and large-scale structural rearrangements occur. The robustness of these preferential directions is highlighted by the high degree of consistency found between the essential dynamical spaces obtained by molecular dynamics and those predicted on the basis of coarse-grained elastic network models. The collective character of the innate movements of adenylate kinase is finally established by using a variational scheme to decompose the enzyme in dynamical domains. It is found that virtually all the visited conformational spaces can be described in terms of the relative motion of a limited number of quasi-rigid protein regions. Besides offering a transparent insight into the high level of consistency of the molecule internal dynamics observed over very different time scales, the

decomposition into quasi-rigid domains provides a clear indication of the feasibility of introducing coarse-grained approaches to probe the breadth of conformational space explored by enzymes having an “innate,” functionally oriented, internal dynamics. For comparison, the same analysis is carried out for extensive molecular dynamics simulations of Trans-Activator of Transcription (TAT), an intrinsically disordered protein. The conformational space visited during the molecular dynamics evolution does not lend itself easily to a subdivision into a small number of clearly identifiable substates. The lack of a well-defined overall structure is accompanied by a poor consistency of the salient aspects of TAT internal dynamics at various stages of the simulation. The two systems considered here, adenylate kinase and TAT, aptly illustrate the extent to which the free-energy landscape organization can differ across proteins. For TAT, the landscape appears to be sufficiently flat to allow a substantial diffusion in conformational space. The lack of pronounced free-energy minima is reflected not only in the absence of substates with a definite structural organization but also in an inconsistent character of the essential dynamical spaces. Conversely, the conformational space of adenylate kinase is well partitioned into structural substates which share a common internal dynamics with a manifest functionally oriented character (as it can bridge the rest and catalytically potent forms of the enzyme).

7.2 Protein Internal Dynamics Observed over Different Timescales: Methods

The free-energy landscape of proteins is typically regarded as being highly corrugated and organized in a hierarchy of minima which result from the complex interactions of amino acids in the protein and with the surrounding solvent. At the smallest scale, nearby minima reflect minute structural changes such as distinct rotameric states of sidechains. The protein motion within these minima is well described by a normal mode analysis, which can be straightforwardly accomplished by calculating the mass-weighted Hessian at the local energy minimum. The barriers separating these energy minima are small enough that, over a typical time scale of ∼10 fs, the system is capable of diffusing to other minima. The effective dimensionality of the available phase space is large enough that the probability of return to a previously visited energy minimum is negligible (Brooks et al. 1995; Janezic and Brooks 1995; Janezic et al. 1995). Yet, over timescales of the order of ∼1 ns, proteins appear to be trapped in specific basins of the landscape, as they mostly fluctuate around a well-defined average conformation. Over this timescale, protein motion is conveniently described as a stochastic diffusion process in an approximately harmonic effective potential (representing the envelope of the above-mentioned local energy minima). Already for time spans of about 1 ns it is possible to detect the presence of large-scale concerted displacements of groups of several amino acids. In HIV-1 protease, for example, various regions consisting of tens of amino acids can be displaced by several angstroms over 10–20 ns (Janezic et al. 1995).

7.2.1 Low-Energy Collective Excitations

Without any assumption on the harmonicity of the potential underlying the dynamical evolution of the system, the modes corresponding to the independent collective protein fluctuations can be obtained through a principal component analysis of the dynamical trajectories (Garcia 1992). Specifically, the modes are identified as the eigenvectors associated with the largest eigenvalues of the covariance matrix, C, whose generic entry is defined as

$$C_{ij,\mu\nu} = \left\langle \left(r_{i,\mu} - \langle r_{i,\mu}\rangle\right)\left(r_{j,\nu} - \langle r_{j,\nu}\rangle\right)\right\rangle \tag{7.1}$$

where $r_{i,\mu}$ indicates the μth Cartesian coordinate of the ith atom and $\langle\cdot\rangle$ represents the time average over the configurations visited during the simulation. Prior to the calculation of the average structure, the overall roto-translational motion of the protein is removed. Typically, this is achieved by optimally superimposing each “frame” of the trajectory onto a reference one. Initially, the first frame is taken as the reference, and the procedure is next iterated taking the average structure of the previously aligned frames as the new reference, and so on. The principal component analysis can be performed both at a detailed atomistic level, taking into account all degrees of freedom of the system, and within a coarse-grained description, which is justified in view of the collective character of the structural fluctuations (modes) associated with the principal eigenvectors of the C matrix. This point of view will be adopted hereafter where, as customary, only the Cα atom will be used to represent each amino acid (Amadei et al. 1993; Kitao and Go 1999).
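The procedure described above (iterative superposition followed by diagonalization of the covariance matrix of Eq. 7.1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used for the results of this chapter: it assumes that the Cα coordinates are available as a NumPy array of shape (frames, residues, 3), and the function names `kabsch_superpose`, `align_trajectory`, and `covariance_modes`, as well as the fixed number of alignment iterations, are hypothetical choices.

```python
import numpy as np

def kabsch_superpose(mobile, reference):
    """Optimally superimpose `mobile` (N x 3) onto `reference` (N x 3)."""
    mob_c = mobile - mobile.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    # SVD of the 3 x 3 correlation matrix yields the optimal rotation
    u, _, vt = np.linalg.svd(mob_c.T @ ref_c)
    d = np.sign(np.linalg.det(u @ vt))  # guard against improper rotations
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return mob_c @ rot + reference.mean(axis=0)

def align_trajectory(traj, n_iter=5):
    """Remove roto-translations: first frame as reference, then the running average."""
    ref = traj[0]
    for _ in range(n_iter):  # n_iter is an assumed, illustrative value
        aligned = np.array([kabsch_superpose(frame, ref) for frame in traj])
        ref = aligned.mean(axis=0)
    return aligned

def covariance_modes(aligned):
    """Eigenvalues (descending) and eigenvectors (columns) of the covariance matrix."""
    n_frames = aligned.shape[0]
    x = aligned.reshape(n_frames, -1)
    dx = x - x.mean(axis=0)            # displacements from the average structure
    cov = dx.T @ dx / n_frames         # entries C_{ij,mu nu} of Eq. (7.1)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]
```

The leading columns returned by `covariance_modes` play the role of the principal (low-energy) modes discussed in the text; mass weighting and convergence checks on the alignment are deliberately left out of this sketch.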

7.2.2 Structural Substates

Over time spans of the order of 10 ns, proteins are capable of overcoming the larger barriers (estimated to be of the order of 1 kcal/mol) separating the above-mentioned structurally homogeneous substates. Identifying the substates by analyzing the free-energy landscape may present difficulties because it is not always straightforward to identify a priori the appropriate order parameters for the system of interest. In these cases, a natural approach is to rely on the intuitive notion that substates should be structurally homogeneous. Accordingly, a suitable measure of structural similarity/dissimilarity, such as the root mean square distance (RMSD), can be used to detect indirectly the presence of local free-energy minima where the system dwells for appreciable time spans. The main substates can, in fact, be detected by a structural clustering of the conformers visited during the trajectory. Notice that the system should dwell in a substate over an uninterrupted time interval (except when substate transitions occur). The time continuity of structures assigned to the same substate should thus be enforced by the clustering procedure. This goal is conveniently formulated following, e.g., the method introduced in Pontiggia et al. (2008), which is based on the K-medoids-partitioning scheme (Kaufman and Rousseeuw 2005). In the following we shall indicate with $d_{ij}$ the RMSD after optimal superposition (Kabsch 1978) of two structures visited at times i and j. The sought optimal clustering into K substates is found by identifying the substate-transition times, $t_1, t_2, \ldots, t_{K-1}$, and the representatives (again labeled by their time index, $r_1, r_2, \ldots, r_K$) which minimize the following dissimilarity cost function:

$$d = \sum_{k=0}^{K-1} \sum_{i=t_k}^{t_{k+1}} d_{i,r_{k+1}},$$

where $t_0$ and $t_K$ are, respectively, the initial and final time steps of the simulation, and $r_{k+1}$ is the representative of the substate spanning the times from $t_k$ to $t_{k+1}$. The minimization of the dissimilarity score is accomplished using an iterative scheme (Pontiggia et al. 2008). An initial subdivision in substates is performed by randomly choosing the K − 1 transition times. Each substate representative is identified as the structure with the smallest average distance from the other substate members. A new clustering is next proposed by randomly re-assigning one or more of the K − 1 time subdivisions, ensuring that no two subdivisions coincide (because empty clusters would result), and the representatives are re-identified. The new subdivision is retained if it leads to a smaller dissimilarity score; otherwise the original subdivision is kept. This procedure is iterated until convergence. Notice that the desired number of clusters, K, is left as input to the user and the choice of the optimal K value requires the inspection of various clustering properties, such as the dissimilarity score as a function of K.
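The iterative scheme outlined above can be illustrated with the following minimal Python sketch. It is a simplified stand-in for the published method rather than a reproduction of it: it assumes a precomputed RMSD matrix `dist`, moves a single transition time per iteration, treats each substate as a half-open interval of frames (so that boundary frames are not counted twice), and runs for a fixed number of steps instead of testing convergence. All function and parameter names are hypothetical.

```python
import numpy as np

def segment_cost(dist, start, stop):
    """Cost and medoid (time index) of frames start..stop-1 taken as one substate."""
    block = dist[start:stop, start:stop]
    totals = block.sum(axis=0)          # summed distance of each frame to the others
    best = int(np.argmin(totals))
    return totals[best], start + best

def total_cost(dist, cuts):
    """Dissimilarity score and representatives for a given set of transition times."""
    bounds = [0, *cuts, dist.shape[0]]
    parts = [segment_cost(dist, a, b) for a, b in zip(bounds[:-1], bounds[1:])]
    return sum(c for c, _ in parts), [m for _, m in parts]

def cluster_contiguous(dist, k, n_steps=5000, seed=0):
    """Stochastic search for K time-contiguous substates (requires k >= 2)."""
    if k < 2:
        raise ValueError("this sketch needs at least two substates")
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    # random initial choice of the K - 1 distinct transition times
    cuts = np.sort(rng.choice(np.arange(1, n), size=k - 1, replace=False))
    cost, medoids = total_cost(dist, cuts)
    for _ in range(n_steps):
        trial = cuts.copy()
        trial[rng.integers(k - 1)] = rng.integers(1, n)   # re-assign one subdivision
        trial = np.sort(trial)
        if len(np.unique(trial)) < k - 1:
            continue                                      # coinciding cuts: empty cluster
        new_cost, new_medoids = total_cost(dist, trial)
        if new_cost < cost:                               # keep only improvements
            cuts, cost, medoids = trial, new_cost, new_medoids
    return cuts, medoids, cost
```

Running the search for several values of `k` and plotting the returned score against `k` gives the kind of diagnostic mentioned in the text for choosing the number of substates.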

7.2.3 Inter-substate and Intra-substate Fluctuations

The conformational rearrangements implied by the inter-substate transitions (hereafter simply termed “jumps”) can be significantly large. On such a large scale, both the criteria to be used to remove the roto-translational degrees of freedom and the related concept of average structure, on which the definition of the relaxation modes depends, might leave room for ambiguities. Notwithstanding these ambiguities, the principal component analysis of the trajectory can still be a valuable tool for identifying the salient collective dynamical degrees of freedom. The latter should capture not only the fluctuations within the substates, but also the jumps that bridge them. Following Kitao et al. (1998), the covariance matrix may be decomposed into intra-minima and inter-minima contributions

$$C_{ij,\mu\nu} = \sum_{l} \omega_l\, C^{l}_{ij,\mu\nu} + \sum_{l} \omega_l \left(\langle r_{i,\mu}\rangle_l - \langle r_{i,\mu}\rangle\right)\left(\langle r_{j,\nu}\rangle_l - \langle r_{j,\nu}\rangle\right) \tag{7.2}$$

where $\langle\cdot\rangle_l$ represents an average restricted to the sole configurations in the lth substate, $C^{l}$ is the covariance matrix computed over the lth substate with respect to its own average structure, and $\omega_l$ is the fraction of simulation time covered by substate l. The relative importance of intra-substate and inter-substate fluctuations is expected to depend on the number of visited substates. This prompts the interest in examining how the weight of the two terms in Eq. (7.2) depends on the duration of the simulated dynamical trajectory. The connection between these two tiers of conformational fluctuations can also be investigated at a deeper and more interesting level than motion amplitudes. In fact, a key point is to understand what relationship, if any, exists between the set of lowest-energy modes of an extensive trajectory and those calculated separately for each visited substate. This question is aimed at understanding how the free-energy landscape is actually organized. A high consistency of the intra-substate modes and inter-substate jumps would indicate that the effective dimensionality of the free-energy landscape is limited, as a limited number of generalized coordinates would suffice to describe the system fluctuations over a wide temporal range. Conversely, the lack of a significant overlap between the modes/jumps would indicate the necessity to use a large number of degrees of freedom to capture the salient features of the free-energy landscape.
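The decomposition of Eq. (7.2) is an exact identity once the substate assignment is fixed, which can be checked numerically. The sketch below assumes that the aligned coordinates are flattened into an array of shape (frames, coordinates) and that `labels` holds the substate index of each frame; the function name and both argument names are illustrative.

```python
import numpy as np

def decompose_covariance(x, labels):
    """Split the covariance of x (frames x coords) into intra- and inter-substate parts."""
    mean_all = x.mean(axis=0)
    n_coords = x.shape[1]
    intra = np.zeros((n_coords, n_coords))
    inter = np.zeros((n_coords, n_coords))
    for l in np.unique(labels):
        xl = x[labels == l]
        weight = xl.shape[0] / x.shape[0]         # omega_l: fraction of time in substate l
        mean_l = xl.mean(axis=0)
        dl = xl - mean_l
        intra += weight * (dl.T @ dl) / xl.shape[0]   # omega_l * C^l
        shift = mean_l - mean_all
        inter += weight * np.outer(shift, shift)      # contribution of the jumps
    return intra, inter
```

Comparing the traces of the two returned matrices for trajectory excerpts of increasing length is one simple way to follow how the relative weight of the two terms changes with the duration of the simulation.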

7.2.4 Comparison of Structural Fluctuations in Different Substates

The degree of overlap of two sets of modes can be established using various quantitative criteria (Amadei et al. 1993; Hess 2000; Pontiggia et al. 2008). Here we shall adopt an intuitive one, based on the calculation of the root mean square inner product (RMSIP):

$$\mathrm{RMSIP} = \sqrt{\frac{1}{n} \sum_{i,j=1}^{n} \left(\vec{p}_i \cdot \vec{q}_j\right)^2} \tag{7.3}$$

where $\{p\} \equiv \{\vec{p}_1, \ldots, \vec{p}_n\}$ and $\{q\} \equiv \{\vec{q}_1, \ldots, \vec{q}_n\}$ are the two (orthonormal) sets of modes. The RMSIP value ranges from 0, for complete orthogonality of the {p} and {q} spaces, to 1 in case of their perfect coincidence. Various schemes can be used to establish the statistical significance of an RMSIP value in dependence on the size of the protein and on the number of considered modes, n. For adenylate kinase, which consists of 214 amino acids, the customary choice of n = 10 will be used and RMSIP values equal to 0.7 or more will be considered significant (Amadei et al. 1993).
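Equation (7.3) translates directly into code. The short sketch below assumes that each set of modes is stored as the columns of a NumPy array and that both sets are orthonormal; the function name `rmsip` is an illustrative choice.

```python
import numpy as np

def rmsip(p, q):
    """Root mean square inner product of two sets of orthonormal modes (columns)."""
    n = p.shape[1]
    overlaps = p.T @ q                  # matrix of inner products p_i . q_j
    return float(np.sqrt(np.sum(overlaps ** 2) / n))
```

For two identical mode sets the function returns 1, and for two mutually orthogonal sets it returns 0, in line with the range stated above.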

7.2.5 Coarse-Grained Description and Modeling of Protein Internal Dynamics

As anticipated, a distinctive feature of the structural excitations entailed by the low-energy modes is their collective character. This observation has two main implications. First, it opens the possibility that the low-energy modes can be captured using simplified coarse-grained approaches which, though inaccurate for an atomistic description of the system, may still appropriately account for the non-local low-energy excitations. Second, the collective character of the low-energy structural fluctuations suggests that some groups of amino acids of the protein may move so coherently as to be regarded as approximately rigid blocks. Both approaches will be considered here to gain insight into the functional dynamics of adenylate kinase.

7.2.5.1 Elastic Network Models

In a seminal study, Tirion (1996) showed that the low-energy excitations of globular proteins calculated with an atomistic force field could be accurately reproduced by a much simpler potential energy function, where all pairs of spatially close heavy atoms were connected by springs penalizing their displacement from the reference positions. It was regarded as highly remarkable that the low-energy system excitations could be well captured by a model with a single parameter (the spring constant, which was equal for all pairs of interacting heavy atoms). The success of the model was explained a posteriori in terms of the large-scale, collective character of the low-energy fluctuations. The salient features of these excitations, which entail structural modulations at scales much larger than a few atoms, can therefore be captured with minimalistic interatomic potentials.
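A single-parameter spring network of the kind described above can be sketched as follows. The code is a generic illustration and not the specific model used for the results of this chapter: it connects all pairs of points closer than an assumed cutoff with identical springs, builds the second-derivative matrix of the resulting harmonic energy around the reference structure, and extracts the softest non-trivial modes. The cutoff value, the spring constant, and the function names are hypothetical.

```python
import numpy as np

def spring_network_hessian(coords, cutoff=7.0, k_spring=1.0):
    """Second-derivative matrix (3N x 3N) of a network of identical springs."""
    n = coords.shape[0]
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            dvec = coords[j] - coords[i]
            dist = np.linalg.norm(dvec)
            if dist > cutoff:
                continue                      # only spatially close pairs interact
            block = -k_spring * np.outer(dvec, dvec) / dist ** 2
            hess[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hess[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            hess[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hess[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    return hess

def low_energy_modes(hess, n_modes=10, tol=1e-8):
    """Eigenpairs with the smallest non-zero eigenvalues (rigid-body modes discarded)."""
    evals, evecs = np.linalg.eigh(hess)
    keep = evals > tol                        # drops the zero roto-translational modes
    return evals[keep][:n_modes], evecs[:, keep][:, :n_modes]
```

For a well-connected network the six zero eigenvalues correspond to global translations and rotations; the columns returned by `low_energy_modes` can then be compared with modes obtained from a trajectory.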
In the past decade, this observation provided the motivation to model the internal dynamics of proteins using not only harmonic interatomic potentials (as done by Tirion) but also a very coarse-grained structural description where each amino acid is replaced by one or two interaction centers. These models are generally referred to as “elastic network models” because of the quadratic character of the model potential energy, which is defined as:

$$F = \frac{1}{2} \sum_{ij,\alpha\beta} \delta r_i^{\alpha}\, M_{ij}^{\alpha\beta}\, \delta r_j^{\beta} \tag{7.4}$$

where $\delta r_i^{\alpha}$ indicates the αth Cartesian component of the displacement of the ith interaction center (usually taken to coincide with the ith Cα atom), and M is a symmetric interaction matrix which is straightforwardly computed from the reference (native) protein structure (Hinsen 1998; Atilgan et al. 2001; Delarue and Sanejouand 2002; Micheletti et al. 2004). The low-energy modes of the model system are simply given by the eigenvectors of M associated with the smallest non-zero eigenvalues. Despite the quadratic character of the potential energy of Eq. (7.4), the model low-energy modes have been shown to be in remarkable accord with the principal directions of motion calculated from extensive molecular dynamics trajectories (Atilgan et al. 2001; Micheletti et al. 2004). We shall accordingly complement the comparison of the intra-substates’ essential dynamical spaces with the low-energy modes predicted using the elastic network model of Micheletti et al. (2004).

7.2.5.2 Identifying Protein Dynamical Domains

We finally consider the possibility of detecting the existence of approximately rigid domains in proteins based on the analysis of the collective excitations. Rigid subunits, or dynamical domains, correspond to groups of amino acids whose pairwise distances are not significantly changed (in modulus) in the course of thermal fluctuations of the protein. For a given tentative grouping of amino acids into Q domains it is possible to calculate a scoring function which measures the extent to which the protein internal dynamics is distorted after the suppression of the internal structural fluctuations of each domain. The starting point of this analysis is the projection of the original set of lowest-energy modes, $\{\vec{p}_1, \ldots, \vec{p}_n\}$, on a space which allows only for rigid-body movements (rotations and translations) of the putative rigid subunits. This procedure, which entails standard linear-algebra operations (see Hinsen and Kneller 2000), leads to the identification of the rigid-body fits of the original modes, $\{\vec{p}^{\,rb}_1, \ldots, \vec{p}^{\,rb}_n\}$. The fraction of the molecule’s low-energy fluctuation which is captured by the given decomposition is conveyed via the following quantity, f:

$$f = \frac{\sum_{l} \lambda_l \left|\vec{p}^{\,rb}_l\right|^2}{\sum_{l} \lambda_l} \tag{7.5}$$

with $\lambda_l$ indicating the covariance matrix eigenvalue corresponding to the lth mode. The best decomposition of the protein is the one that maximizes f over the possible distinct partitions of amino acids into Q groups. As the space of possible amino acid groupings is too large to allow for a thorough exploration, the optimization of f can be accomplished with approximate stochastic methods (the one discussed here follows the variational scheme of Potestio et al. (2009)).
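For a fixed tentative partition, the quantity f of Eq. (7.5) can be evaluated with a simple least-squares projection, as in the sketch below. This is an assumption-laden illustration and not the variational scheme cited above: it uses an unweighted projection onto the infinitesimal translations and rotations of each domain, takes the domains to be disjoint lists of residue indices, and does not attempt the stochastic search over partitions. The function names are hypothetical.

```python
import numpy as np

def rigid_body_basis(coords, members):
    """Orthonormal basis of the rigid-body displacements of the residues in `members`."""
    n = coords.shape[0]
    center = coords[members].mean(axis=0)
    basis = np.zeros((3 * n, 6))
    for a in range(3):
        axis = np.zeros(3)
        axis[a] = 1.0
        for i in members:
            basis[3 * i:3 * i + 3, a] = axis                                # translation
            basis[3 * i:3 * i + 3, 3 + a] = np.cross(axis, coords[i] - center)  # rotation
    # SVD orthonormalizes and drops null directions (e.g. very small or collinear domains)
    u, s, _ = np.linalg.svd(basis, full_matrices=False)
    return u[:, s > 1e-10 * s.max()]

def rigid_fraction(modes, lambdas, coords, domains):
    """Fraction f of the fluctuation captured by rigid-body motions of the domains."""
    bases = [rigid_body_basis(coords, members) for members in domains]
    fits = sum(b @ (b.T @ modes) for b in bases)   # rigid-body fits, one column per mode
    captured = np.sum(fits ** 2, axis=0)           # squared norm of each fitted mode
    return float(np.dot(lambdas, captured) / np.sum(lambdas))
```

A value of f close to 1 indicates that the chosen domains move almost as rigid blocks along the considered modes; evaluating `rigid_fraction` inside a stochastic search over partitions would be one way to look for the best subdivision for a given Q.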

7.3 Protein Internal Dynamics Observed Over Different Timescales: The Case of Adenylate Kinase

Adenylate kinase (AKE) is a small (around 23 kDa) single-chain enzyme whose three-dimensional structure can be subdivided into three main subdomains: a central, extremely stable Core flanked by two mobile domains, the Lid and AMP-binding domains. The primary biological function of this protein is to catalyze the reversible reaction Mg ATP + AMP ↔ Mg ADP + ADP. The binding of the two nucleotides is accompanied by the closure of the domains toward the Core which

assists and controls the phosphotransfer reaction (Mueller and Schulz 1992; Mueller et al. 1996). The correct movement of the subdomains with respect to the core is, therefore, crucial for function. Because of the large, and rapidly growing, number of experimental and theoretical studies on adenylate kinase (Müller and Schulz 1992; Müller et al. 1996; Sinev et al. 1996; Shapiro et al. 2000, 2002; Han et al. 2002; Wolf-Watz et al. 2004; Kern et al. 2005; Shapiro and Meirovitch 2006; Henzler-Wildman et al. 2007b; Hanson et al. 2007), it appears natural to choose this system to investigate the organization of the near-native free-energy landscape and understand how it reverberates on the protein dynamics observed over various time scales. The conformational rearrangement that takes place upon binding of the substrate can be appreciated in Fig. 7.1 where cartoon representations of the free and bound forms can be compared.

Fig. 7.1 Crystallographic structures of the open free form (a) and of the closed bound form (b) of Escherichia coli adenylate kinase (Mueller and Schulz 1992; Mueller et al. 1996). The flexible Lid (amino acids 114–164) and AMP-binding (amino acids 31–60) domains are colored in black and gray, respectively. The rigid Core domain is colored in white. Reproduced with permission from Pontiggia et al. 2008. Copyright Elsevier Inc

In the closed form the two binding domains are displaced by more than 7 Å with respect to the open structure. The conformational rearrangement of the two domains has been the subject of a large number of experimental and computational studies. Interestingly, nuclear magnetic resonance (NMR) relaxation experiments have identified the structural rearrangement as the limiting step in the reaction turnover (Kern et al. 2005). Moreover, NMR and fluorescence resonance energy transfer (FRET) experiments have also shown that the closed conformation is appreciably populated even in the absence of the ligands (Henzler-Wildman et al. 2007b; Hanson et al. 2007). In particular, the binding of the ligands appears to act by shifting the equilibrium between the closed and open populations, stabilizing the closed structure. The correct and efficient functioning of the enzyme relies on the fact that thermal fluctuations are sufficient to excite “intrinsic” functionally oriented movements of the enzyme, where the two mobile subdomains fluctuate between open and closed arrangements over a timescale of about 1 ms. This duration is beyond the reach of present-day atomistic molecular dynamics simulations. Yet, by analyzing the enzyme internal dynamics over time spans of ∼10 ns, it is possible to establish several relevant facts about the organization of the free energy of adenylate kinase and their influence on the open/closed interconversion dynamics.

The findings discussed hereafter pertain to a 50-ns long simulation carried out in explicit solvent starting from the open crystal structure. The system, solvated by 18,000 simple point charge (SPC) water molecules in a cubic box, was parameterized with the OPLS-AA/L force field (Jorgensen et al. 1996; Kaminski et al. 2001), energy minimized, and gradually heated up to 300 K. The temperature was next adjusted, along with the system density, in a 500-ps long MD simulation at constant temperature (300 K) (Nosé 1984; Hoover 1985) and pressure (1 bar) (Berendsen et al. 1984). The dynamical evolution of the system in the NVT ensemble, within a cubic simulation box of side l = 8.35 nm, was simulated with the GROMACS software (version 3.3.1) (Van Der Spoel et al. 2005) with an integration time step of 1 fs. The LINCS algorithm (Hess et al. 1997) was used to constrain all bond lengths, while water internal degrees of freedom were controlled with SETTLE (Miyamoto and Kollman 1992). Long-range electrostatic interactions were treated with the particle mesh Ewald (PME) method (Darden et al. 1993; Essmann et al. 1995).

The internal dynamics analysis was first undertaken by identifying the system lowest-energy modes from trajectory excerpts of longer and longer duration, from 500 ps to 50 ns. The plots in Fig. 7.2 present the distribution of the amplitude of the protein motion projected on the first low-energy mode of the entire trajectory. For “converged trajectories” occurring in a quadratic free-energy landscape, the above-mentioned distributions are expected to have a Gaussian character. As visible in Fig. 7.2, this property is maintained up to durations of ∼1 ns, though it should be noted that the width of the distribution increases with duration (as pointed out for other systems in Pontiggia et al. 2007).

Fig. 7.2 Normalized histograms of Cα displacements along trajectories of increasing duration projected onto the principal direction of the Cα fluctuations

On time intervals of the order of tens of nanoseconds, the system jumps between states that are separated by appreciable barriers of a few kcal/mol, and the multimodal character of the projected amplitudes becomes very apparent. The existence of pronounced multiple local minima in the free-energy landscape naturally prompts a comparative investigation of their low-energy modes and of their possible influence on how the system can efficiently navigate the pleated landscape. To this purpose, the ensemble of conformations visited during the 50-ns long dynamical evolution was first processed to identify the salient substates (intuitively associated with the free-energy minima). The existence of well-populated ensembles of distinct conformers is suggested by the block character of the matrix of root mean square distances (RMSDs) of all pairs of visited structures (see Fig. 7.3). By using the clustering algorithm described in the previous section, it was found that the space of visited conformers could be viably partitioned into eight substates, each associated with a dwelling time of about 5–10 ns. The typical RMSD of structures belonging to the same cluster was 2 Å, which is indicative of good structural homogeneity, while the RMSD between any two representatives of different clusters was about 4.5 Å on average.

Fig. 7.3 (a) Density matrix of the pairwise root mean square distances of the Cα atoms of configurations sampled in the 50-ns MD trajectory of AKE (Pontiggia et al. 2008). The block character of the matrix suggests the presence of substates that can be quantitatively identified with the K-medoids algorithm. The representatives of the eight identified substates are shown in panel (b). Reproduced with permission from Pontiggia et al. 2008. Copyright Elsevier Inc
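The partitioning of the visited conformers into substates can be sketched as follows. This is a minimal alternating K-medoids scheme operating on a precomputed matrix of pairwise distances (here standing in for the RMSD matrix). It is a simplified variant written for illustration, not necessarily the implementation used by the authors, and the function name and toy data are assumptions.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Simple alternating K-medoids on a symmetric distance matrix.

    dist: (n, n) array of pairwise distances (e.g. pairwise RMSDs).
    Returns the indices of the medoids (cluster representatives) and the
    cluster label of every item.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each item joins its closest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue                     # keep the old medoid
            # Update step: the medoid minimizes the summed distance
            # to all other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy example: six points on a line forming two obvious groups.
points = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
dist = np.abs(points[:, None] - points[None, :])
medoids, labels = k_medoids(dist, k=2)
print(sorted(medoids.tolist()), labels)
```

Because medoids are actual members of the data set, each cluster is automatically represented by a sampled conformation, which is convenient when the representatives are later used as reference structures.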

By using Eq. (7.2), the total fluctuation experienced by the protein can be expressed as the sum of two contributions: one pertaining to the intra-substate fluctuations and the other to the inter-substate jumps. It is of interest to examine the interplay of these two contributions for increasing duration of the trajectory. As shown in Fig. 7.4, the intra-substate contribution is approximately constant in the course of the simulation. This indicates that the various free-energy minima (substates) have approximately the same breadth. On the other hand, the progressive enlargement of the visited phase space is revealed by the rapid increase of the contribution from inter-substate jumps.

Fig. 7.4 (a) Time evolution of the total mean square fluctuation of Cα atoms and of two separated contributions due to the intra-substate fluctuations and due to the inter-substate transitions. (b) Evolution of the absolute value of the scalar product between the first intra-substate and inter-substate essential modes and the principal mode of the entire trajectory. Reproduced with permission from Pontiggia et al. 2008. Copyright Elsevier Inc

Besides the comparison of the breadth of phase space occupied by the various substates, it is important to examine what relationship, if any, exists between the essential dynamical spaces which capture the most prominent structural fluctuations of each substate. Because these can be assimilated to the lowest-energy excitations of the corresponding free-energy minima, a further relevant issue is to investigate to what extent they are consistent with the structural deformation bridging the open and closed states of the molecule. For each pair of the eight visited substates, the consistency of the top ten essential dynamical modes (obtained from the intra-substate structural covariance analysis) was measured. Such pairwise comparisons resulted in typical RMSIP values of 0.83, which is indicative of a very significant correspondence of the low-energy mode spaces. Equally remarkable is the fact that the difference vectors bridging any pair of substate representatives (not just those visited consecutively in time) are also well described by the same low-dimensional space of the above-mentioned low-energy modes. This is ascertained by calculating what fraction of the difference vector between each pair of substate representatives is captured by the subspace of the ten lowest-energy modes of each substate. The distribution of these projections is narrowly peaked around the value 0.92 with a spread of only 0.04.

The consistency of the linear spaces describing the low-energy modes within substates and of the difference vectors that connect their representatives is suggestive of the fact that the essential dynamical spaces of adenylate kinase ought to be very consistent across all timescales covered by the atomistic simulation. This fact was confirmed by direct observation, as shown in Fig. 7.4, panel (b). The data reflect the degree of accord of the principal directions of motion obtained by the covariance analysis of trajectory excerpts of increasing duration, from a few hundred picoseconds to several tens of nanoseconds. The data in the figure indicate a remarkable degree of consistency. This has important implications regarding the scope of molecular dynamics simulations, because the analysis of the essential dynamical spaces over a limited time duration may provide a very clear indication of the structural deformations that the system may undergo over much longer timescales. An illustration of this concept is the fact that the top ten modes within the most populated substate encountered during the MD trajectory (lifetime of 9.5 ns) are sufficient to account for as much as 85% of the system fluctuations observed over a 50-ns long time span.
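The two "captured fraction" measures used in this discussion can be sketched in a few lines. The sketch assumes orthonormal modes stored as columns and flattened, pre-aligned coordinates. Whether the percentages quoted in the text refer to the ratio of norms or to its square is not stated in this section, so the first function simply returns the ratio of norms. All names are hypothetical.

```python
import numpy as np

def norm_fraction(diff, modes):
    """Fraction of the norm of a difference vector captured by a subspace.

    diff:  flattened difference vector between two structures, shape (n_dof,).
    modes: (n_dof, n_modes) array with orthonormal columns.
    Returns |P diff| / |diff|, where P projects onto the span of the modes.
    """
    coeffs = modes.T @ diff
    return np.linalg.norm(coeffs) / np.linalg.norm(diff)

def fluctuation_fraction(x, modes):
    """Fraction of the total mean square fluctuation captured by a subspace.

    x: (n_frames, n_dof) flattened, pre-aligned coordinates.
    """
    dx = x - x.mean(axis=0)
    total = (dx ** 2).sum()
    captured = ((dx @ modes) ** 2).sum()
    return captured / total

# Toy usage with a trivial orthonormal basis (assumption: 6 degrees of freedom).
basis = np.eye(6)
diff = np.array([1.0, 2.0, 0.5, 0.0, 0.0, 0.0])
print(norm_fraction(diff, basis[:, :2]))
```

In practice `modes` would hold the top ten covariance eigenvectors of one substate, `diff` the difference between two substate representatives (or between the open and closed structures), and `x` the frames of the full trajectory.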
In turn, the functionally oriented character of the robust “innate” internal rearrangements of adenylate kinase is apparent a posteriori from the fact that 96% of the norm of the difference vector bridging the open and closed states is captured by the top ten modes of the complete trajectory. As recently suggested by Henzler-Wildman et al. (2007a), the robustness of the essential dynamical spaces of adenylate kinase appears to be encoded in the overall structural organization of the biomolecule. This point can be aptly illustrated by resorting to a coarse-grained description of the structure and internal dynamics of the biomolecule. In particular, it is instructive to consider the application of an elastic network model to determine the lowest-energy excitations of one particular substate by using, as input, only the Cα trace of the representative. We recall here that elastic network models adopt a minimalistic modeling of the free-energy landscape in the neighborhood of the given reference structure. In particular, no sequence-dependent information is directly introduced (the same strength is used for interactions between any pair of amino acids that are in contact in the reference structure). The essential dynamical spaces of the elastic network models, therefore, reflect properties that are readily connected to the overall secondary and tertiary organization of the biomolecule (Micheletti et al. 2002). By applying the β-Gaussian network model (Micheletti et al. 2004) to the structure of the representative conformer of the most populated substate, it was found that the top ten modes have an RMSIP of 0.82 against the essential dynamical space of the entire 50-ns long MD trajectory.
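To make the idea of a sequence-independent elastic network concrete, the sketch below builds the Hessian of a plain anisotropic network model (in the spirit of Atilgan et al. 2001) from a Cα trace and extracts its lowest-energy modes. Note that this is a swapped-in, simpler model and not the β-Gaussian network model used in the chapter. The cutoff, the uniform spring constant, and the function name are illustrative assumptions.

```python
import numpy as np

def anm_modes(ca, cutoff=10.0, k_spring=1.0, n_modes=10):
    """Lowest-energy modes of a simple anisotropic elastic network.

    ca: (n_res, 3) array of C-alpha coordinates (assumed to be in angstrom).
    Every pair closer than `cutoff` is joined by a spring of identical
    strength, so no sequence-dependent information enters the model.
    Returns the n_modes smallest nonzero eigenvalues and their eigenvectors.
    """
    n = ca.shape[0]
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = ca[j] - ca[i]
            r2 = d @ d
            if r2 > cutoff ** 2:
                continue
            block = -k_spring * np.outer(d, d) / r2
            hess[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hess[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            hess[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hess[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    evals, evecs = np.linalg.eigh(hess)      # ascending eigenvalues
    # The first six eigenvalues are (numerically) zero: rigid translations
    # and rotations. This assumes a well-connected, rigid network.
    return evals[6:6 + n_modes], evecs[:, 6:6 + n_modes]

# Synthetic stand-in for a C-alpha trace (assumption: 30 random points).
rng = np.random.default_rng(0)
ca_trace = rng.uniform(0.0, 15.0, size=(30, 3))
evals, modes = anm_modes(ca_trace, cutoff=12.0)
print(evals[:3])
```

The returned eigenvectors can be compared with the covariance eigenvectors of an MD trajectory through the RMSIP, which is the type of comparison reported in the text.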

This striking degree of accord confirms the notion that the salient properties of the lowest-energy modes of adenylate kinase are encoded in the overall conformational organization of the biomolecule. This observation finds a natural interpretative framework in the context of dynamical domains. As will be shown in the subsequent analysis, the consistency of the lowest-energy modes over several timescales appears to reflect the presence of a limited number of amino acids which act as hinge regions for the internal dynamics. In particular, there appear to exist a limited number of protein subdomains which are concertedly displaced as nearly rigid bodies in the course of fluctuations. The quasi-rigid character of these fairly large regions is arguably one of the reasons for the observed consistency of the essential dynamical spaces.

We shall hereafter describe the results of the decomposition of AKE into quasi-rigid units following the application of the optimal subdivision scheme described in the previous section. Subdivisions into Q = 2, ..., 10 domains were considered. As shown in Fig. 7.5, a very limited number of domains is enough to account for most of the essential mobility of the protein. The subdivision into three quasi-rigid blocks captures 77% of the considered essential dynamics. The three dynamical domains consist of five sequence intervals: one for each of the two mobile domains and three for the nearly fixed core. The correspondence of the three dynamical blocks with the customary tripartite subdivision of AKE into Lid, Core, and AMP-binding subdomains is manifest.

Fig. 7.5 Subdivisions of AKE in Q = 2, Q = 3, and Q = 4 rigid subunits. The rigid subdomains, identified by different colors, are shown in panels a, b, and c, respectively. The decomposition was performed taking into account the ten lowest-energy modes. The fraction of essential dynamical motion captured by the subdivision into Q = 2, ..., 10 rigid domains is shown in panel d. Reproduced with permission from Potestio et al. 2009. Copyright Elsevier Inc

It is evident from the analysis presented so far that the dynamical properties of a protein, functionally oriented ones in particular, can be inherent features of the native structure. The interconversion between the open and closed forms of adenylate kinase is not necessarily triggered by the binding of the substrates, but can be induced by mere thermal agitation. The multiminima character of the free-energy landscape prevents a description of the latter in terms of simple quadratic potentials if dynamics over nanoseconds is considered; nevertheless, the basins visited by the system share a striking similarity of the essential directions of motion, which are moreover oriented so as to assist the biological function. The importance of large-scale conformational changes of the protein structure is made more evident by the dynamical domain decomposition, which shows a clear match between structural and functional modularity. The mobility of the protein is by far concentrated in the hinge-bending motion of the two substrate-binding domains, while a negligible amount of fluctuation is retained by the intra-domain displacements.
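The notion of "fraction of motion captured by rigid domains" can be illustrated with a minimal sketch. For a given assignment of residues to domains, the displacement of each domain along a mode is fitted by a rigid-body motion (a translation plus an infinitesimal rotation), and the captured fraction is the squared norm of the fitted displacement divided by that of the full mode. This is only a sketch under stated assumptions: the optimal search over partitions described in the chapter is not reproduced, a single mode is treated at a time, and all names are hypothetical.

```python
import numpy as np

def rigid_fit(pos, disp):
    """Least-squares rigid-body fit of a displacement field.

    pos, disp: (n, 3) arrays of positions and displacements of one domain.
    The model is disp_i ~ t + omega x (pos_i - centroid), linear in (t, omega).
    """
    r = pos - pos.mean(axis=0)
    n = r.shape[0]
    a = np.zeros((3 * n, 6))
    for i in range(n):
        x, y, z = r[i]
        a[3 * i:3 * i + 3, :3] = np.eye(3)                 # translation part
        a[3 * i:3 * i + 3, 3:] = np.array([[0.0, z, -y],   # omega x r written
                                           [-z, 0.0, x],   # as a matrix that
                                           [y, -x, 0.0]])  # acts on omega
    params, *_ = np.linalg.lstsq(a, disp.ravel(), rcond=None)
    return (a @ params).reshape(n, 3)

def captured_fraction(pos, mode, domains):
    """Fraction of the squared norm of `mode` described by rigid domains.

    pos, mode: (n_res, 3) arrays; domains: (n_res,) integer labels.
    """
    fitted = np.zeros_like(mode, dtype=float)
    for d in np.unique(domains):
        idx = domains == d
        fitted[idx] = rigid_fit(pos[idx], mode[idx])
    return (fitted ** 2).sum() / (mode ** 2).sum()

# Toy usage (assumption: 12 residues split into two domains of six).
rng = np.random.default_rng(0)
pos = rng.normal(size=(12, 3))
domains = np.array([0] * 6 + [1] * 6)
mode = rng.normal(size=(12, 3))
print(captured_fraction(pos, mode, domains))
```

A value close to one means that the mode is essentially a relative motion of the chosen blocks, while the remainder measures the intra-domain displacements mentioned in the text.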

7.3.1 Conformational Fluctuations in the Presence of a Nearly Flat Free-Energy Landscape: The Case of TAT

The results obtained for adenylate kinase demonstrate how dynamical interconversion among distinct conformational states can be an integral part of protein biological functionality. The observed conformational variability of AKE does not compromise the overall organization of the molecular fold. In fact, the large conformational space of the molecule can be described in terms of the relative movements of a limited number of quasi-rigid domains. Nevertheless, it should be noted that several types of proteins exist for which the intrinsic structural plasticity is so high that it is impossible to introduce a definite notion of overall fold (Dunker et al. 2000, 2002; Tompa 2005). These intrinsically disordered proteins are interesting in that their behavior cannot possibly be rationalized in terms of the lock-and-key paradigm, and they are relevant for the present discussion because they provide a concrete example of biomolecules lacking nearly rigid domains. Defining the essential space of these proteins, i.e., a small set of collective atomic displacements accounting for most of the fluctuations, appears challenging given the absence of well-defined conformational substates and the nearly flat nature of the free-energy surface. Therefore, disordered proteins are expected to constitute a limiting case for most of the concepts discussed above, and their analysis provides the opportunity to discuss the range of applicability of the methodologies introduced in this chapter.

A relevant example of an intrinsically disordered protein is the HIV-1 Trans-Activator of Transcription (TAT) (Shojania and O’Neil 2006). TAT is a viral protein of about 100 amino acids that is essential for an efficient transcription of the viral genome (Athanassiou et al. 2007). The analysis of the 15-ns long trajectory of the MAL isolate of TAT (Alizon et al. 1986) reveals a significant conformational heterogeneity (Fig. 7.6, right panel). However, rather than resulting from a dynamic interconversion among distinct well-defined substates, such a broad distribution appears to result from free diffusion across a nearly flat free-energy landscape. Indeed, the set of pairwise structural distances (RMSD) does not show any clear structure and no recognizable basins are detected.

Fig. 7.6 Left panel: The MD trajectories of the TAT-MAL isolate and of AKE were subdivided into 1-ns long intervals. The top ten eigenvectors of the covariance matrix of each interval were calculated. The histogram shows the distribution of the root mean square inner product (RMSIP) of the essential spaces for any pair of distinct intervals. The RMSIP distributions of the TAT-MAL isolate and AKE are shown with dashed and continuous lines, respectively. Right panel: Distribution of the root mean square distances of any pair of configurations sampled by the MD trajectory of the TAT-MAL isolate (dashed line) and of AKE (solid line)

The lack of any hierarchical organization of the free-energy minima is expected to result in a poor consistency of the essential spaces calculated along the trajectory. Indeed, the root mean square inner products (RMSIP) between the top ten eigenvectors of the covariance matrices calculated for pairs of 1-ns long intervals of the trajectory are significantly lower than for adenylate kinase (Fig. 7.6, left panel). This observation therefore suggests that the conformational fluctuations of TAT cannot be easily rationalized as a superposition of simpler motions of groups of residues undergoing rigid-body rotations about hinges and/or axes.
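The interval-by-interval comparison of essential spaces can be sketched as follows. The code uses the usual definition of the RMSIP between two sets of n orthonormal vectors, namely the square root of the sum of their squared scalar products divided by n. The window length, the flattened pre-aligned input, and the function names are assumptions made for illustration.

```python
import numpy as np

def top_modes(x, n_modes=10):
    """Top covariance eigenvectors of flattened, pre-aligned coordinates."""
    dx = x - x.mean(axis=0)
    cov = dx.T @ dx / x.shape[0]
    evals, evecs = np.linalg.eigh(cov)       # ascending eigenvalues
    return evecs[:, ::-1][:, :n_modes]       # largest-variance modes first

def rmsip(u, v):
    """Root mean square inner product of two sets of orthonormal columns."""
    overlap = u.T @ v
    return np.sqrt((overlap ** 2).sum() / u.shape[1])

def window_rmsip(x, window, n_modes=10):
    """RMSIP for every pair of distinct, non-overlapping trajectory windows.

    x: (n_frames, n_dof) array; window: number of frames per interval.
    """
    n_win = x.shape[0] // window
    spaces = [top_modes(x[k * window:(k + 1) * window], n_modes)
              for k in range(n_win)]
    return np.array([rmsip(spaces[a], spaces[b])
                     for a in range(n_win) for b in range(a + 1, n_win)])

# Synthetic stand-in for a trajectory (assumption: 400 frames, 30 coordinates).
rng = np.random.default_rng(0)
values = window_rmsip(rng.normal(size=(400, 30)), window=100)
print(values.mean())
```

A histogram of the returned values plays the role of the distributions in the left panel of Fig. 7.6: values close to one indicate essential spaces that persist from one interval to the next.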

7.4 Concluding Remarks

For a large number of enzymes it has been shown that the biological function emerges from a plurality of distinct conformational substates (Henzler-Wildman and Kern 2007). The interconversions among the substates are the ultimate determinants of the effective reaction kinetics of the enzymatic processes. Based on these observations, it can be stated that key functionally oriented features are encoded in the free-energy landscape, both in terms of local minima (substates) and in terms of the accessible pathways that connect them.

In this chapter we have discussed a number of theoretical concepts and computational algorithms that can be used to characterize the internal dynamics of proteins and enzymes and relate it to their “functional mechanics.” The presented techniques, which are general and transferable, have been applied to the case study of adenylate kinase (AKE). As a term of comparison, the case of a protein showing a very different organization of the free-energy landscape, the HIV-1 Trans-Activator of Transcription (TAT), was also briefly discussed.

For adenylate kinase it was shown that the dynamical evolution (followed with constant-temperature, explicit-solvent molecular dynamics) visits several different conformational substates. Each substate appears structurally homogeneous, though the inter-substate conformational variability is substantial. The multiscale analysis of the intra-substate and inter-substate fluctuation dynamics of the system reveals that the heterogeneity of the conformational ensemble explored over tens of nanoseconds is accompanied by a striking robustness of the generalized degrees of freedom identifying the lowest-energy excitations of the biomolecule. The latter mostly correspond to the relative motion of a limited number of “dynamical domains” that behave as nearly rigid protein subunits and whose displacement can bridge the inactive (open) and active (closed) forms of the enzyme. The substate-independent location of the hinge regions between these units hints at the fact that the salient, functionally oriented aspects of the internal dynamics of adenylate kinase are encoded in the fundamental structural aspects of the enzyme (such as its secondary and tertiary organization).

References

Alizon M, Wain-Hobson S, Montagnier L, Sonigo P (1986) Genetic variability of the AIDS virus: Nucleotide sequence analysis of two isolates from African patients. Cell 46:63–74
Amadei A, Linssen ABM, Berendsen HJC (1993) Essential dynamics of proteins. Proteins 17:412–425
Athanassiou Z, Patora K, Dias RL, Moehle K, Robinson JA, Varani G (2007) Structure-guided peptidomimetic design leads to nanomolar beta-hairpin inhibitors of the TAT-TAR interaction of bovine immunodeficiency virus. Biochemistry 46:741–751
Atilgan AR, Durell SR, Jernigan RL, Demirel MC, Keskin O, Bahar I (2001) Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys J 80:505–515
Berendsen HJC, Postma JPM, van Gunsteren WF, DiNola A, Haak JR (1984) Molecular dynamics with coupling to an external bath. J Chem Phys 81:3684–3690
Brooks BR, Janezic D, Karplus M (1995) Harmonic analysis of large systems. I. Methodology. J Comp Chem 16:1522–1542
Darden T, York D, Pedersen L (1993) Particle mesh Ewald: an Nlog(N) method for Ewald sums in large systems. J Chem Phys 98:10089–10092
Delarue M, Sanejouand YH (2002) Simplified normal mode analysis of conformational transitions in DNA-dependent polymerases: the elastic network model. J Mol Biol 320:1011–1024
Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z (2002) Intrinsic disorder and protein function. Biochemistry 41:6573–6582
Dunker AK, Obradovic Z, Romero P, Garner EC, Brown CJ (2000) Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform 11:161–171
Essmann U, Perera L, Berkowitz ML, Darden T, Lee H, Pedersen LG (1995) A smooth particle mesh Ewald method. J Chem Phys 103:8577–8593
Frauenfelder H, Sligar H, Wolynes P (1991) The energy landscape and motions of proteins. Science 254:1598
Garcia A (1992) Large-amplitude nonlinear motions in proteins. Phys Rev Lett 68:2696–2699
Han Y, Li X, Pan X (2002) Native states of adenylate kinase are two active sub-ensembles. FEBS Lett 528:161–165
Hanson JA, Duderstadt K, Watkins LP, Bhattacharyya S, Brokaw J, Chu JW, Yang H (2007) Illuminating the mechanistic roles of enzyme conformational dynamics. Proc Natl Acad Sci USA 104:18055–18060
Henzler-Wildman KA, Kern D (2007) Dynamic personalities of proteins. Nature 450:964–972
Henzler-Wildman KA, Lei M, Thai V, Kerns SJ, Karplus M, Kern D (2007a) A hierarchy of timescales in protein dynamics is linked to enzyme catalysis. Nature 450:913–916
Henzler-Wildman KA, Thai V, Lei M, Ott M, Wolf-Watz M, Fenn T, Pozharski E, Wilson MA, Petsko GA, Karplus M, Hubner CG, Kern D (2007b) Intrinsic motions along an enzymatic reaction trajectory. Nature 450:838–844
Hess B (2000) Similarities between principal components of protein dynamics and random diffusion. Phys Rev E 62:8438–8448
Hess B, Bekker H, Berendsen HJC, Fraaije JGEM (1997) A linear constraint solver for molecular simulations. J Comp Chem 18:1463–1472
Hinsen K (1998) Analysis of domain motions by approximate normal mode calculations. Proteins 33:417–429
Hinsen K, Kneller GR (2000) Projection methods for the analysis of complex motions in macromolecules. Mol Simulat 23:275–292
Hoover WG (1985) Canonical dynamics: equilibrium phase-space distributions. Phys Rev A 31:1695–1697
Janezic D, Brooks BR (1995) Harmonic analysis of large systems. II. Comparison of different protein models. J Comp Chem 16:1543–1553
Janezic D, Venable R, Brooks BR (1995) Harmonic analysis of large systems. III. Comparison with molecular dynamics. J Comp Chem 16:1544–1556
Jorgensen W, Maxwell D, Tirado-Rives J (1996) Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J Am Chem Soc 118:11225–11236
Kabsch W (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 34:827–828
Kaminski G, Friesner R, Tirado-Rives J, Jorgensen W (2001) Evaluation and reparametrization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations on peptides. J Phys Chem B 105:6474–6487
Kaufman L, Rousseeuw PJ (2005) Finding groups in data: an introduction to cluster analysis. Wiley Series in Probability and Statistics. Wiley, New York, NY
Kern D, Eisenmesser EZ, Wolf-Watz M (2005) Enzyme dynamics during catalysis measured by NMR spectroscopy. Methods Enzymol 394:507–524
Kitao A, Go N (1999) Investigating protein dynamics in collective coordinate space. Curr Opin Struct Biol 9:164–169
Kitao A, Hayward S, Go N (1998) Energy landscape of a native protein: jumping-among-minima model. Proteins 33:496–517
Micheletti C, Carloni P, Maritan A (2004) Accurate and efficient description of protein vibrational dynamics: comparing molecular dynamics and Gaussian models. Proteins 55:635–645
Micheletti C, Lattanzi G, Maritan A (2002) Elastic properties of proteins: insight on the folding process and evolutionary selection of native structures. J Mol Biol 321:909–921
Miyamoto S, Kollman PA (1992) Settle: an analytical version of the shake and rattle algorithm for rigid water models. J Comp Chem 13:952–962
Müller CW, Schlauderer GJ, Reinstein J, Schulz GE (1996) Adenylate kinase motions during catalysis: an energetic counterweight balancing substrate binding. Structure 4:147–156
Müller CW, Schulz GE (1992) Structure of the complex between adenylate kinase from Escherichia coli and the inhibitor AP5A refined at 1.9 Å resolution. A model for a catalytic transition state. J Mol Biol 224:159–177
Nosé S (1984) A molecular dynamics method for simulations in the canonical ensemble. Mol Phys 52:255–268
Pontiggia F (2008) Protein structure and functionally-oriented dynamics: from atomistic to coarse-grained models. PhD in Biophysics, Scuola Internazionale Superiore di Studi Avanzati (SISSA/ISAS), via Beirut 2–4, 34151 Trieste, Italy
Pontiggia F, Colombo G, Micheletti C, Orland H (2007) Anharmonicity and self-similarity of the free energy landscape of protein G. Phys Rev Lett 98:048102
Pontiggia F, Zen A, Micheletti C (2008) Small- and large-scale conformational changes of adenylate kinase: a molecular dynamics study of the subdomain motion and mechanics. Biophys J 95:5901–5912
Potestio R, Pontiggia F, Micheletti C (2009) Coarse-grained description of proteins’ internal dynamics: an optimal strategy for decomposing proteins in rigid subunits. Biophys J 96:4993–5002
Shapiro YE, Kahana E, Tugarinov V, Liang Z, Freed JH, Meirovitch E (2002) Domain flexibility in ligand-free and inhibitor-bound Escherichia coli adenylate kinase based on a mode-coupling analysis of 15N spin relaxation. Biochemistry 41:6271–6281
Shapiro YE, Meirovitch E (2006) Activation energy of catalysis-related domain motion in E. coli adenylate kinase. J Phys Chem B 110:11519–11524
Shapiro YE, Sinev MA, Sineva EV, Tugarinov V, Meirovitch E (2000) Backbone dynamics of Escherichia coli adenylate kinase at the extreme stages of the catalytic cycle studied by 15N NMR relaxation. Biochemistry 39:6634–6644
Shojania S, O’Neil JD (2006) HIV-1 TAT is a natively unfolded protein: the solution conformation and dynamics of reduced HIV-1 TAT-(1–72) by NMR spectroscopy. J Biol Chem 281:8347–8356
Sinev MA, Sineva EV, Ittah V, Haas E (1996) Towards a mechanism of AMP-substrate inhibition in adenylate kinase from Escherichia coli. FEBS Lett 397:273–276
Tirion MM (1996) Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys Rev Lett 77:1905–1908
Tompa P (2005) The interplay between structure and function in intrinsically unstructured proteins. FEBS Lett 579:3346–3354
Van Der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJ (2005) GROMACS: fast, flexible, and free. J Comput Chem 26:1701–1718
Wolf-Watz M, Thai V, Henzler-Wildman K, Hadjipavlou G, Eisenmesser EZ, Kern D (2004) Linkage between dynamics and catalysis in a thermophilic–mesophilic enzyme pair. Nat Struct Mol Biol 11:945–949

Chapter 8

Structure-Based Models of Biomolecules: Stretching of Proteins, Dynamics of Knots, Hydrodynamic Effects, and Indentation of Virus Capsids

Marek Cieplak and Joanna I. Sułkowska

Abstract Coarse-grained models of biomolecules are developed to provide ways of simulating situations which involve system sizes and time scales that are hard to study by all-atom approaches. These situations are usually associated with the occurrence of large conformational transformations. This review discusses a subclass of the coarse-grained descriptions: the structure-based models. These models are defined in a phenomenological way and make use of the knowledge of the native structure that is determined experimentally. We discuss the cases of the DNA molecule and dendrimers, but the focus of the review is on proteins. For proteins, the reduction of the number of degrees of freedom is achieved by representing amino acids by single beads located at the Cα atoms. The beads are tethered together into chains and effective attractive contact interactions are introduced so that the lowest energy state corresponds to the native conformation. There are many variants of such models and each variant comes with its own set of properties. Optimal variants can be selected by making comparisons to experimental data on single-molecule stretching. We discuss the best-performing variants of such models. Among them, there is a model with the Lennard–Jones potential in the native contacts and with a uniform (sequence-independent) energy parameter. We apply this model to several problems: a theoretical survey of the mechanical resistance to stretching of 17,134 proteins comprising not more than 250 amino acids, the dynamics of proteins with knots, pulling proteins out of membranes, the role of hydrodynamic interactions, and the nanoindentation of virus capsids.

M. Cieplak (✉) Institute of Physics, Polish Academy of Sciences, Warsaw, Poland e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_8, © Springer Science+Business Media, LLC 2011

8.1 Introduction

Equilibrium dynamical behavior of biomolecules can usually be characterized through studies of fluctuations around an optimal conformation. All-atom simulations are well suited to this task, particularly when the system comprises not more than a thousand or so amino acids or nucleotides. Many interesting biomolecular phenomena, however, involve large conformational transformations that usually last orders of magnitude longer than the typical 10 ns duration of current advanced all-atom calculations. The list of such phenomena includes protein folding; chemical, thermal, and pressure-induced denaturation; stretching and twisting by various nanotechnological devices such as the atomic force microscope; stretching by fluid flows; and refolding in a mechanically constraining force clamp. An even larger level of complexity in the conformational changes arises when considering the dynamics of biomolecular assemblies that are functional at the cellular level. Examples of such processes include replication, transcription, translation, protein degradation, organelle transport, cell adhesion, muscle elasticity, cell crawling, membrane fusion, functioning of virus capsids, and many others. Still another category of dynamic large-scale phenomena involves the effects of molecular crowding: how the behavior of one biomolecule is influenced by the presence of many other biomolecules that are nearby.

Even though the all-atom simulation methods have occasionally reached the microsecond time scale (Duan and Kollman 1998; Snow et al. 2002), such studies count more toward breaking records than as established tools allowing for multiparameter and multi-trajectory statistical studies. They are also restricted to small system sizes. Undoubtedly, the growing availability of petaflop computing will shift the computational feasibility barriers. Even then, however, it will still be worthwhile to first assess the overall properties of a system within a simpler model, as a numerical equivalent of a “back-of-the-envelope calculation,” and only then focus on selected aspects in a more realistic and more detailed way.
Another application of the simplified models is the comparison of many biomolecules in order to select those that are desired for a particular purpose, such as those with a large resistance to stretching. Still another is the study of assemblies or aggregates of many molecules, such as virus capsids. Finally, the simplified models can often be used as toy systems for which some basic questions may get qualitative answers. One example of an enquiry of this kind is what relations, if any, exist between various processes that involve conformational changes.

A natural way to develop simplified models starts by coarse graining, i.e., by thinning the number of degrees of freedom. A crucial ingredient of coarse graining is the introduction of an overdamped “implicit solvent” that accounts for the immersion in water. The implicit solvent brings in thermostatting through random forces acting on the biomolecule. It also introduces damping forces that are proportional to the velocity of the structural units of the biomolecule. Both kinds of forces are added to the relevant equations of motion for the system. The equations of motion are either of the second or of the first order. In the former case one implements molecular dynamics of the Newtonian type (Allen and Tildesley 1987) and in the latter dynamics of the Brownian type (Ermak and McCammon 1978). In this chapter, we do not discuss the vast kingdom of lattice models in which the dynamics are intrinsically discretized and defined through a Monte Carlo process.

The coarse-graining schemes depend on the biomolecule studied. For a protein, the simplest approach is to represent it by a chain of harmonically tethered beads located at the sites of the Cα atoms. One can also consider models with the Cα and Cβ atoms as the building units, as studied, e.g., in Sulkowska and Cieplak (2007). For DNA, each nucleotide in a strand can be represented by two to five beads, depending on the specific model (Knotts et al. 2007; Niewieczerzal and Cieplak 2009), if the description is to be resolved at the single-nucleotide level. Any coarse-graining procedure comes with the replacement of microscopic couplings (such as Coulombic interactions between partial charges) by new effective couplings between the decimated degrees of freedom. There is no unique prescription for determining such effective couplings. Here we outline a phenomenological approach that is often found successful, in which the couplings are determined based on the knowledge of the native structure. This philosophy was proposed by Go and his collaborators (Abe and Go 1981) in the context of folding. Its first molecular dynamics implementation appears to be contained in the paper by Veitshans et al. (1997).

When applying structure-based models one makes the assumption that the process studied is largely determined by the native-state topology and that non-native interactions are of secondary importance. This assumption should be valid in most stretching processes since unfolding commences in the native state. However, it may or may not be valid in the folding process, depending on the nature of the free energy landscape. Small globular proteins are expected to show little structural frustration (Bryngelson et al. 1995; Goldstein et al. 1992; Clementi et al. 2000) and are believed to be compatible with the assumption. It is possible to include non-native contacts in the model whenever they are deemed essential, as for example in Cieplak (2004) and Wallin et al. (2007).

There are many possible molecular dynamics implementations of the structure-based approach, since a definition through structure is not sufficiently unique.
In fact, various structure-based models of one protein may lead to a wide range of possible behaviors (Sulkowska and Cieplak 2008). We have enumerated more than 500 Go-like off-lattice models and proposed (Sulkowska and Cieplak 2008) that one may select the optimal classes by considering single-protein stretching (Rief et al. 1997; Fowler et al. 2002; Yang et al. 2000; Carrion-Vazquez et al. 2000, 2009; Galera-Prat et al. 2010) since this particular process starts in the native state, in the vicinity of which the structure-based models are expected to be the most reliable. A comparison of the theoretical results to experimental data on stretching allows us to select several models which, in addition, yield reasonable folding properties. Among the winning models, there is one in which, besides other relevant attributes to be explained later, the effective attractive Cα–Cα interaction in a so-called native contact is described by the Lennard–Jones potential

$$V^{6-12} = 4E_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]. \qquad (8.1)$$

Here $r_{ij}$ is the distance between the Cα atoms in amino acids i and j, and $\sigma_{ij}$ is determined pair-by-pair so that the minimum is located at the experimentally established native distance $r_{ij}^{n}$, i.e., $\sigma_{ij} = 2^{-1/6}\, r_{ij}^{n}$. The simplest choice of the energy parameter $E_{ij}$ is to take it to be uniform, $E_{ij} = \varepsilon$, where ε should be of order 1.5 kcal/mol. It should be noted that, even in the case of the uniform $E_{ij}$, the structure-based model still carries information about the sequence, albeit in an implicit way. It comes in two ways. The first is through the structure which affects the parameter $\sigma_{ij}$, and the structure is formed (in water) for a particular sequence. The second is through the contact map which specifies which interactions correspond to the native contacts. For a given distance between the Cα atoms in the structure, the native contact may or may not form, depending on the specificity. Note that the potential defined in Eq. (8.1) is very weak for $r_{ij}$ exceeding several $\sigma_{ij}$, which allows for “breaking” the contact for sufficiently large distances as a result of stretching or thermal denaturation. This feature is qualitatively distinct from the situation encountered with the purely confining harmonic potential which is used in the elastic network models (Bahar et al. 1997, 1999; Micheletti et al. 2002). Such elastic models form a special class of the Go-like Hamiltonians. Their application to stretching requires a prior determination of a distorted non-native structure (Sulkowska et al. 2008a).

The construction of most structure-based models is based on a distinction between native and non-native contacts in a biomolecule. This leads to a natural characterization of folding or unfolding processes in terms of whether particular contacts are functionally established or broken. For instance, for the proteinic model with the Lennard–Jones native potentials, a contact can be declared broken if the distance $r_{ij}$ exceeds 1.5 $\sigma_{ij}$. It has turned out (Cieplak and Hoang 2003; Cieplak 2004) that when studying folding, it is interesting to determine when a given contact is established for the first time.
If all such contacts are established for the first time simultaneously then one declares folding to be accomplished. On the other hand, when studying unraveling, it is interesting to record the last times at which a native contact still holds (Cieplak et al. 2002, 2004). A convenient way to represent processes with large conformational changes visually is to provide scenario diagrams in which the characteristic times for a particular native contact are displayed against a contact label, such as the sequential distance |j-i|. The advantage of the scenario diagrams over representing processes by the snapshots of conformations is that the scenario diagrams are derived for a statistical ensemble of trajectories whereas the snapshot’s role is more that of providing an illustration. Representing results of all-atom simulations through the scenario diagrams may also prove useful. Even though the all-atom models do not make use of the division into native and nonnative contacts to define the dynamics, monitoring the fate of the native interactions as a function of time may provide important insights. In this review, we outline the phenomenological methods of construction of the structure-based models of proteins, the DNA, and dendrimers. The review is focused on the proteins, however, and the examples of applications discussed here deal with the proteins. These are mechanical stretching of proteins, dynamics of proteins with knots, pulling of proteins out of membranes, the role of hydrodynamic interactions, and nanoindentation of virus capsids.
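As a rough illustration of Eq. (8.1) and of the contact bookkeeping behind a scenario diagram, the following Python sketch may be helpful. It assumes reduced units (ε = 1), takes σ_ij = 2^(-1/6) r_ij^n so that the minimum sits at the native distance, and uses the 1.5 σ_ij criterion quoted above. The function names and the frame format (one list of Cα coordinates per recorded time) are hypothetical choices made for this example and are not taken from any published code.

```python
import math

def sigma_from_native(r_native):
    """sigma_ij that puts the 6-12 minimum at the native distance."""
    return r_native / 2.0 ** (1.0 / 6.0)

def v_lj(r, r_native, eps=1.0):
    """Lennard-Jones native-contact potential of Eq. (8.1)."""
    sr6 = (sigma_from_native(r_native) / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def contact_present(r, r_native, cutoff=1.5):
    """A contact counts as established while r <= 1.5 sigma_ij."""
    return r <= cutoff * sigma_from_native(r_native)

def scenario(times, frames, native_contacts):
    """Collect scenario-diagram data from one trajectory.

    native_contacts: dict {(i, j): native distance} (hypothetical format).
    Returns the first time each contact is established, the first time
    all contacts are present simultaneously (the folding time), and
    (|j - i|, first time) points for plotting.
    """
    first_time = {pair: None for pair in native_contacts}
    folding_time = None
    for t, coords in zip(times, frames):
        n_present = 0
        for (i, j), r_native in native_contacts.items():
            r = math.dist(coords[i], coords[j])
            if contact_present(r, r_native):
                n_present += 1
                if first_time[(i, j)] is None:
                    first_time[(i, j)] = t
        if folding_time is None and n_present == len(native_contacts):
            folding_time = t
    points = sorted((j - i, t) for (i, j), t in first_time.items()
                    if t is not None)
    return first_time, folding_time, points
```

For a statistical scenario diagram one would average the first-contact times over many trajectories; the sketch handles a single trajectory only.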


8.2 The Structure-Based Models of Proteins

In the case of the proteins, the simplest coarse-graining scheme represents amino acids by single beads located at the Cα atoms. The beads are tethered together into chains by harmonic interactions. The minima of these interactions correspond to distances between the consecutive Cα atoms as read off the Protein Data Bank (PDB; Berman et al. 2000) structure file. Usually, these distances cluster around 3.8 Å with a spread of 0.15 Å. However, in cases involving prolines, the distances may vary between 2.8 and 3.85 Å as prolines can accommodate the cis-conformation. The beads generate a soft excluded volume. In our implementation (Cieplak et al. 2002), this volume corresponds to a radius of order 4 Å. This excluded volume prevents interpenetration of the chain in non-native conformations. Further steps in the construction of the model (Cieplak and Hoang 2003; Cieplak et al. 2004) involve introduction of a local backbone stiffness, determination of the contact map, i.e., the list of bead pairs endowed with attractive interactions, and then selection of an appropriate potential for the attraction. The structure-based approach (Abe and Go 1981; Takada 1999) postulates that the potentials used should be minimized for the native structure that is determined experimentally. There are many ways to find such potentials. In Sulkowska and Cieplak (2008) we have analyzed in detail 62 variants of the resulting models. The corresponding Hamiltonians are characterized by attributes described in a short-hand notation by

$$\text{model} = \{V^{\mathrm{NAT}},\ S,\ M,\ E\} \qquad (8.2)$$

where the first term denotes selection of the potential, the second identifies the way to account for the local backbone stiffness, the third defines the contact map, and the fourth determines the choice of the energy parameters. In addition, one may also take the Cβ atoms into account, but we find that this step usually does not amount to an improvement. Physical properties are found to vary significantly across the variants of the model and one can select the best variants by making comparisons to the experimental data on stretching. The specific parameter used in the comparisons is the maximum value, Fmax , of the force of resistance to stretching at constant speed. Fmax is determined before reaching a fully extended conformation (where the covalent bonds get strained but not ruptured) and it characterizes the elastic structure of the protein. Large values of Fmax are associated with simultaneous shearing of many attractive contacts together. In order to select the optimal variants of the models, we have performed statistical tests of correlations to the experimental values of Fmax (Sulkowska and Cieplak 2008). Out of the 62 variants considered, 15 were found to perform well and their list is provided in table 5 of Sulkowska and Cieplak (2008). When no adjustments for the variations in the experimental pulling speeds are made, the model LJ3 = {6 − 12, C, M3, E0 }


is singled out as the best performer. Here, 6–12 denotes the Lennard–Jones potential defined in Eq. (8.1) with the uniform, i.e., amino acid-independent, energy parameter. Symbol C signifies that the backbone stiffness is described by a potential term, $V_C$, which favors the native sense of the local chirality (Cieplak et al. 2002; Kwiecinska and Cieplak 2005). The chirality-based stiffness is approximately equivalent to favoring of the native values of the dihedral angles (Sulkowska and Cieplak 2008). An alternative is to have an angular stiffness, denoted as A, with a potential $V_A$, in which also the bond angles tend to adopt the native values as in Veitshans et al. (1997) and Clementi et al. (2000). $V_A$ is more constraining than $V_C$, but is less efficient numerically. Our choice of $V_C$ is given by

$$V_C = \frac{1}{2}\,\kappa\left(C_i - C_i^{n}\right)^{2}, \qquad C_i = \frac{(\mathbf{w}_{i-1}\times\mathbf{w}_i)\cdot\mathbf{w}_{i+1}}{d_0^{3}},$$

where Ci n is the chirality of residue i in the native conformation and d0 = |wi | is the distance between subsequent Cα atoms. Here, wi = ri+1 − ri . We take κ equal to 1 as discussed in Kwiecinska and Cieplak (2005). We now discuss the most crucial third descriptor in Eq. (8.2): the contact map. One way to define the contact map is to introduce a cutoff length, e.g., equal to 7.5 Å, and declare all pairs ij of amino acids as making a native contact if rij is less than the cutoff. This way, however, does not take the variable sizes of the amino acids into account and thus introduces spurious contacts while eliminating important longer-ranged contacts (even corresponding to distances of order 12 Å (Cieplak and Hoang 2003)). A more physical way to determine the contact map is through the following procedure. First, amino acids are represented by clusters of spheres associated with the heavy atoms located at their native positions. The heavy atoms are assigned van der Waals radii, as in Tsai et al. (1999) and Settanni et al. (2002), which are then multiplied by 1.24 to account for attraction. If the resulting spheres assigned to one amino acid overlap with the spheres assigned to another in the native state then the corresponding pair of amino acids is declared as generating a native contact. Otherwise, the interactions are declared to be non-native and then they are represented by the soft cores. This procedure is illustrated in Fig. 8.1. The atomic-overlap-based criterion generally agrees with a chemical-based assessment (Sobolev et al. 1999) in which one pays attention to the geometry of a bond. By employing the related CSU software (Sobolev et al. 1999) we find, however, that the overlap criterion often errs when it comes to the sequentially close pairs of the type i,i+2. 
Such contacts are detected as valid whereas they should be discarded as being usually of the dispersive origin (van der Waals interactions) and thus much weaker than hydrogen bonds or ionic bridges. In the M3 map, such contacts are excluded – as opposed to M2 when they are kept. The choice of the model for the local backbone stiffness has a bearing on the choice of the contact map. Even though the function of VA is similar to that of VC , VA is more encompassing and more sensitive to the local shape of a conformation. Thus when using VA , one does not need to worry about the short-range contacts i,i+3 and deal only with the


Fig. 8.1 An illustration of a construction of a structure-based model of a protein for the Trp-cage miniprotein, with the PDB code 1l2y, which consists of 20 amino acids. The backbone is shown in its native conformation. The conformation is akin to a β-hairpin in which one of the “arms” is shaped into an α-helix. Heavy atoms are shown for two amino acids. The two large spheres are located on the Cδ atom in leucine-2 and the Cγ atom in proline-19. Their radii are equal to the van der Waals radii enhanced by 1.24. These two atoms yield the biggest overlap between residues 2 and 19. All contacts connected to residue 2 are marked by black lines
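The overlap criterion illustrated in Fig. 8.1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: each residue is passed in as a list of heavy atoms `(x, y, z, radius)`, the van der Waals radii are supplied by the caller (the text takes them from Tsai et al. 1999; the numbers in the usage note are placeholders), and the parameter `min_seq_sep` is a hypothetical name used here to switch between the M2, M3, and M4 maps.

```python
import math
from itertools import combinations

def overlap_contact_map(residues, min_seq_sep=3, scale=1.24):
    """Native contacts from overlaps of enlarged heavy-atom spheres.

    residues: list (one entry per amino acid) of lists of heavy atoms,
              each atom given as (x, y, z, van_der_waals_radius).
    min_seq_sep: smallest |j - i| that is kept; 2, 3, and 4 correspond
                 to the M2, M3, and M4 maps described in the text.
    scale: factor multiplying the radii (1.24 in the text).
    """
    contacts = []
    for i, j in combinations(range(len(residues)), 2):
        if j - i < min_seq_sep:
            continue
        # Two enlarged spheres overlap when the distance between their
        # centers is smaller than the sum of the enlarged radii.
        if any(math.dist(a[:3], b[:3]) < scale * (a[3] + b[3])
               for a in residues[i] for b in residues[j]):
            contacts.append((i, j))
    return contacts
```

With one placeholder atom per residue, for example radii of 2.2 Å for residues 0 and 2 and 1.9 Å for the others, placed at (0,0,0), (3.8,0,0), (3.8,3.8,0), (0,3.8,0), and (0,7.6,0), the M2 map contains the pairs (0,2) and (0,3), whereas M3 retains only (0,3).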

M4 map in which one keeps i,i+4 and longer-ranged contacts as done in Clementi et al. (2000). A special kind of contacts is formed by the disulfide bridges. Since these are covalent in nature, they cannot be broken by means of the current nanomanipulation techniques (though they can be dissolved chemically). One can represent them by the Lennard–Jones potential in which the energy parameter is enhanced by an order of magnitude as done in Sulkowska and Cieplak (2007). A more reliable way is to represent them by stiff harmonic bonds as in Sikora et al. (2009). All the potential terms discussed above give rise to forces $\mathbf{F}_i$ acting on particular amino acids. Newton’s equations of motion for the ith amino acid involve not only $\mathbf{F}_i$ but they also include Langevin terms mimicking the implicit solvent:

$$m\,\ddot{\mathbf{r}}_i = -\gamma\,\dot{\mathbf{r}}_i + \mathbf{F}_i + \boldsymbol{\Gamma}_i,$$

where γ denotes the damping constant and the last term corresponds to random forces. The dispersion of the random forces is equal to $\sqrt{2\gamma k_B T}$, where T denotes temperature and $k_B$ the Boltzmann constant. In standard molecular dynamics schemes that pertain to atoms, as opposed to effective atoms in an implicit solvent of our models, the Langevin terms are sometimes used (Grest and Kremer


1986; Smith et al. 1996; Cieplak et al. 2001) as a minor perturbation to provide thermostating. In such cases the random forces are applied just in one Cartesian direction that is assumed to be irrelevant for the problem under consideration. In the proteinic case, however, the Langevin terms are applied in all directions and affect the dynamics in a major way in order to account for the viscous features of the water solvent. We select γ = 2m/τ as we find that this choice provides sufficient overdamping. Higher values of γ would imitate water better (Veitshans et al. 1997; Klimov and Thirumalai 1997). However, enhancing γ beyond our selected value no longer results in noticeable changes in the force–displacement curves in stretching studies (Cieplak et al. 2004) and makes the folding times start depending on γ in a linear fashion (see figure 15 in Hoang and Cieplak (2000)) that allows for simple scaling estimates. One may consider models in which there is specificity in the parameters associated with individual amino acids. Variations in values of the particle masses do not affect folding times (Cieplak et al. 2004) which are consistent with the overdamped character of the model. Variations in the values of the damping constant, reflecting placement on various rings in the hydrophobicity scale, have been found to affect folding and stretching primarily through the mean of these values (Szymczak and Cieplak 2007a). A similar situation is expected to happen when considering differentiations in the hydrodynamic radii when the model includes hydrodynamic interactions that are discussed in Section 4.4. In standard molecular dynamics simulations (Allen and Tildesley 1987), the characteristic time scale τ is of order 1 ps since the motion of the model atoms is mostly ballistic and thermalization takes place primarily through collisions with other atoms. 
In the coarse-grained model, however, we deal with effective atoms that represent whole amino acids and their motion is dominated by diffusion. A characteristic time needed to cover typical molecular distance like 5 Å through diffusion is of order 1 ns as argued in Veitshans et al. (1997) and Szymczak and Cieplak (2006). The equations can be solved by various algorithms. Many of our results have been obtained by using a fifth-order Gear predictor–corrector scheme (Gear, 1971) with the time step of 0.005 τ . We have found this algorithm to be very stable and allowing for a relatively large integration step (ten times larger than in more popular schemes). This algorithm yields results similar to those obtained within the Brownian dynamics. The model presented here can also be used for studies of protein stretching by fluid flows (Szymczak and Cieplak 2006) by adding drag forces (due to a hydrodynamic flow) to the damping term. One may consider this as a side advantage of the implicit solvent approach. The models described above can be employed, for instance, to study folding. In such studies, one starts from fully extended conformations and asks when all of the native contacts are established for the first time. The median folding time, determined by studying many trajectories, generally has a U-shaped behavior as a function of T (see, e.g., Cieplak and Hoang 2003). For the LJ3 model the U-shape has a (usually broad) minimum centered around kB T/ε = 0.3. In practice, this optimal temperature for folding plays the role of the room temperature in the model. When studying thermal denaturation (Cieplak and Sulkowska 2005), one starts in the native conformation and then asks when


certain classes of contacts, for instance, all non-local contacts (|j − i| > 4), get ruptured simultaneously. The simplification of the structure-based models has allowed us to demonstrate that, on average, the order of contact breaking by thermal fluctuations is opposite to the order in which contacts are established during folding, which is implicitly assumed in many all-atom simulations performed in the context of the transition state studies (see, e.g., Li and Daggett, 1994). Pressure-induced denaturation is found (Wojciechowski and Cieplak 2007) to lead to unraveling in a still different way. When studying stretching, we attach two elastic springs to two selected amino acids, typically the two terminal amino acids. One end of one of the springs is anchored to an immovable substrate. One end of the other spring is attached to a point that moves along the direction connecting the selected amino acids. In our constant speed simulations, the speed of pulling, $v_p$, is often chosen to be equal to 0.005 Å/τ as in Sulkowska and Cieplak (2007) and Sikora et al. (2009). If τ is of order 1 ns then this $v_p$ corresponds to about 500,000 nm/s, which is three orders of magnitude larger than the typical experimental speeds of 600–700 nm/s. Simulations at near experimental speeds take time but can be accomplished in the structure-based models. On the other hand, the “steered” all-atom molecular dynamics computations (Lu and Schulten 1999; Paci and Karplus 2000) usually incorporate pulling speeds which are seven orders of magnitude bigger. This leads to large values of Fmax that well exceed velocity-dependent logarithmic extrapolations of the experimental data. These values can be reduced by performing stretching in a quasistatic way (Pabon and Amzel 2006).
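The Langevin equation and the constant-speed pulling protocol described above can be tried out on a toy system. The Python sketch below is a deliberately reduced, one-dimensional example with two beads joined by a single Lennard–Jones native contact. It assumes reduced units (m = τ = ε = 1, lengths in Å), uses a simple Euler-type velocity update with a Gaussian random kick instead of the fifth-order Gear predictor–corrector scheme mentioned in the text, and takes the spring constant `k_spring` and the native distance of 6 Å as hypothetical illustrative values.

```python
import math
import random

def lj_pair_force(r, r_native, eps=1.0):
    """Radial force of the 6-12 native contact; positive is repulsive."""
    sigma = r_native / 2.0 ** (1.0 / 6.0)
    sr6 = (sigma / r) ** 6
    return 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r

def pull_two_beads(n_steps, v_pull=0.005, kT=0.0, k_spring=1.0,
                   r_native=6.0, mass=1.0, tau=1.0, dt=0.005, seed=0):
    """Stretch a two-bead 'protein' at constant speed (1D toy model).

    Bead 0 is tied by a spring to a fixed anchor at x = 0; bead 1 is
    tied by a spring to a point moving with speed v_pull. Returns the
    final positions and the force recorded in the pulling spring.
    """
    gamma = 2.0 * mass / tau                     # damping chosen in the text
    kick = math.sqrt(2.0 * gamma * kT * dt)      # random impulse per step
    rng = random.Random(seed)
    x = [0.0, r_native]
    v = [0.0, 0.0]
    pulling_force = []
    for step in range(n_steps):
        x_pull = r_native + v_pull * step * dt
        f_pair = lj_pair_force(x[1] - x[0], r_native)
        f_pull = k_spring * (x_pull - x[1])
        forces = (-k_spring * x[0] - f_pair, f_pull + f_pair)
        for i in range(2):
            impulse = (forces[i] - gamma * v[i]) * dt
            v[i] += (impulse + kick * rng.gauss(0.0, 1.0)) / mass
            x[i] += v[i] * dt
        pulling_force.append(f_pull)
    return x, pulling_force

def speed_nm_per_s(v_pull, tau_ns=1.0):
    """Convert a pulling speed in Angstrom/tau to nm/s for a given tau."""
    return v_pull * 0.1 / (tau_ns * 1e-9)
```

At zero temperature the largest recorded force in this toy system is close to the maximum slope of the contact potential, about 2.4 ε/σ_ij, and `speed_nm_per_s(0.005)` reproduces the 500,000 nm/s quoted above for τ = 1 ns. A realistic calculation would of course involve the full three-dimensional chain with all terms of the Hamiltonian.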
When adjustments accounting for the variations in the experimental speeds are made by a corresponding linear proportioning of the theoretical speeds, the ranking of the variants of the models is changed somewhat (Sulkowska and Cieplak 2008; Cieplak and Sulkowska 2009). In particular, the LJ3 model gets dethroned to the second best, whereas

$$\{10\text{–}12,\ A,\ M4,\ E_{HB,MJ}\}$$

is found to be the true winner. It involves the contact potential of the 10–12 form,

$$V^{10-12} = E_{ij}\left[5\left(\frac{r_{ij}^{n}}{r_{ij}}\right)^{12} - 6\left(\frac{r_{ij}^{n}}{r_{ij}}\right)^{10}\right],$$

and $E_{HB,MJ}$ assigns the amplitudes in this potential in a non-uniform way. Specifically, an energy of ε is assigned to hydrogen-bonded amino acids (the bond between the N and O atoms belonging to the backbone and the bond between the N atom from the backbone and the Cβ atom from a side chain) whereas an energy $\varepsilon_{ij}^{MJ}$ proportional to the Miyazawa–Jernigan couplings (Miyazawa and Jernigan, 1996), $\gamma_{ij}^{MJ}$, is assigned to contacts arising through non-hydrogen-bond side chain–side chain interactions. All other contacts get the energy ε. The energy $\varepsilon_{ij}^{MJ}$ is given by $\varepsilon\,\gamma_{ij}^{MJ}/\langle\gamma_{ij}^{MJ}\rangle$, where the couplings in the denominator are averaged over 210


different possible pairs of amino acids. This very best model does not appear to have been used in the proteinic literature. However, it is very closely related to the third best-performing model $\{6\text{–}10\text{–}12,\ A,\ M4,\ E_{HB,MJ}\}$ that has been introduced by Karanicolas and Brooks (2002). This model uses the potential

$$V^{6-10-12} = 4E_{ij}\left[13\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - 18\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{10} + 4\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right].$$

The statistical tests have been run at a fixed temperature (close to the optimal folding) and the differences in performance between the three models are found to be slight (Sulkowska and Cieplak 2008). Therefore, it seems sensible to use the simplest of them all: the LJ3 model with the uniform couplings. This is despite the fact that using Miyazawa–Jernigan-like couplings would enhance the specificity content of a model. Our latest comparisons of the LJ3-derived values of Fmax to the corresponding experimental findings have involved determining Fmax at several values of $v_p$, fitting them to a logarithmic $v_p$-dependence and then extrapolating to the experimental speeds (Sikora et al. 2009). The best slope in the cross-correlation plot yields the effective value of the parameter ε. We find it to be (110 ± 30) pN Å whereas without the extrapolation we were getting about 71 pN Å (Sulkowska and Cieplak 2008). The former value corresponds to about 1.6 kcal/mol. There are several other implementations of the Go-like coarse-graining that we did not test. Among them, there is the model of Tozzini et al. (2007) which uses the Morse potentials and couplings determined in the spirit of Miyazawa and Jernigan, i.e., through the frequencies of occurrence of various amino acid pairs. However, for other sets of couplings that we did test (Sulkowska and Cieplak 2008) the Morse potentials were not found to perform in any distinguished way. It should be noted though that the model of Tozzini et al. contains additional terms in the Hamiltonian.
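The extrapolation step mentioned above, fitting Fmax to a logarithmic dependence on the pulling speed and evaluating the fit at an experimental speed, might be sketched as follows. The functional form Fmax = a + b ln(v_p) and the plain least-squares fit are assumptions made for this illustration; the function names are hypothetical, and no simulation data are reproduced here.

```python
import math

def fit_log_speed(speeds, forces):
    """Least-squares fit of forces to a + b * ln(speed)."""
    xs = [math.log(v) for v in speeds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(forces) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("need at least two distinct pulling speeds")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, forces))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def extrapolate(a, b, speed):
    """Evaluate the fitted logarithmic law at another pulling speed."""
    return a + b * math.log(speed)
```

With τ of order 1 ns, an experimental speed of 600 nm/s corresponds to 6 × 10⁻⁶ Å/τ, so one would call `extrapolate(a, b, 6e-6)` after fitting to values of Fmax obtained at simulation speeds such as 0.001 to 0.01 Å/τ.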

8.3 The Structure-Based Models of the DNA and Dendrimers

Construction of coarse-grained models of the double-stranded DNA (dsDNA) is similar in spirit to that of the proteins. In the absence of stress or torsion, the native structure corresponds to the double helix as exemplified by structure 119D deposited in the Protein Data Bank for the sequence 5′-D(CGTAGATCTACGTAGATCTACG)-3′. A pair of nucleotides from two strands can bind either by two or three hydrogen bonds, depending on whether this is the case of the A−T pair or the G−C pair, respectively. Recently, we have constructed several coarse-grained models of the dsDNA (Niewieczerzal and Cieplak 2009) that are resolved at a single nucleotide level. The least coarse grained of them (model I) will be described here in more detail. In this model, the A and T nucleotides are


Fig. 8.2 A schematic representation of the bead model I of the dsDNA after (Niewieczerzal and Cieplak 2009). An A–T pair of the nucleotides is shown; each of the nucleotides in the pair is represented by four beads. The G and C nucleotides are represented by five beads

represented by four beads (Fig. 8.2) and the G and C nucleotides by five beads. One of the beads represents the phosphate group (bead p), another the sugar group (bead b), and the remaining beads (bead h) participate in formation of the appropriate number of the hydrogen bonds. As in proteins, the hydrogen bonds between the coarse-grained entities are represented by effective Lennard–Jones potentials of strength ε. A further reduced version of this model, denoted as model II, is shown at the top of Fig. 8.3. It comprises no h-beads. Instead, the effective Lennard–Jones couplings of strength of either 2 or 3 ε couple the b-beads directly. In between, there is a similarly defined model III (the bottom panel of Fig. 8.3) which involves three beads. Model III has been proposed and studied by Knotts et al. (2007). In models I and II the effective backbone is meant to connect (harmonically) the consecutive p-beads in each strand. In model III, on the other hand, the backbone zigzags

Fig. 8.3 A schematic representation of models II and III of the dsDNA


between the s-beads (for sugar) and p-beads to render the true chemical connectivity better. Nevertheless, we find (Niewieczerzal and Cieplak 2009) that the three models perform quite similar at least in the context of mechanical manipulations such as unzipping, shearing, and twisting. They also show sensitivity to the sequence. For instance, an all-CG sequence has a bigger mechanical stability than an all-AT sequence of the same length. We illustrate the coarse-graining process for model I. Each strand is first represented by a chain of p-beads which are 5.8 Å apart on average. The p-bead is placed at the position of the C4∗ atom in the molecule of ribose. The h-beads represent the “head” atoms that may form the hydrogen bonds. In the C nucleotides, the h-beads are located on the O2, N3, and N4 atoms of the bases. In the G nucleotides, on the O6, N1, and N2 atoms. Finally, in the A and T nucleotides on the N6, N1, and N3, O4 atoms, respectively. The h-beads are linked to their “bases,” i.e., to the supporting b-beads. In the native state, the b-beads are located half-way between the p-bead and the center of mass of the h-beads at each nucleotide. The excluded volume diameters of the h-beads, b-beads, and p-beads are taken to be equal to 2.0, 3.4, and 6.0 Å, respectively. The local stiffness of the phosphate chain is represented by harmonic potentials that favor native values of the bond and dihedral angles. All of the interbead interactions within one nucleotide are described by stiff harmonic springs. The stacking interactions between consecutive b-beads in each chain are described by the Lennard–Jones interactions with the minima placed at 4.43 Å apart (on average). As the average value of energy for hydrogen bond interaction in dsDNA in model I we chose 0.6 kcal/mol. In a related single-bead model of the RNA Hyeon and Thirumalai (2005) take ε of about 0.5–0.7 kcal/mol. On average, 2.5 hydrogen bonds are created between the bases in dsDNA. 
Hence the total average energy of interaction between paired bases in the DNA helix is about 1.5 kcal/mol in our models. This choice is consistent with $k_B T/\varepsilon = 0.4$ corresponding to T = 300 K. The corresponding unit of the force, ε/Å, should then be of order 100 pN. The structure-based models should find more extensive applications in the context of many experiments on the DNA that involve manipulation (Cluzel et al. 1996; Bustamante et al. 2003; Koster et al. 2005; Bockelmann et al. 1997, 2002; Bryant et al. 2003; Oroszi et al. 2006; Allemand et al. 2007). As a “warm-up,” we have studied various ways to distort short chains (20 nucleotides in one strand) of the dsDNA within the three models proposed (Niewieczerzal and Cieplak 2009). In the unzipping process, the hydrogen bonds are broken one at a time, by starting from the end at which the unzipping forces are applied. The resulting force, F, undulates at a single nucleotide resolution as the system unravels. This pattern depends on the nucleotide sequence. Experimentally, the average force, $\langle F\rangle$, is of order 14 pN and the amplitude of undulations is of order 1 pN (Bockelmann et al. 1997, 2002). The force pattern has also been found to depend weakly on the pulling speed (Bockelmann et al. 1997). Our simulations point to a logarithmic dependence on the speed (Niewieczerzal and Cieplak 2009). The force is also a monotonic function of the temperature, since the higher the temperature, the bigger the role of thermal


fluctuations in unraveling the structure. When one pulls the two strands in opposite ways, then there is a simultaneous shearing of all hydrogen bonds. The resulting force pattern consists of one large force peak, the strength of which depends on the length of the strand (Niewieczerzal and Cieplak 2009). Finally, when one stretches the two strands in the same direction and couples this action with application of a torque one may transform the usual right-handed B form of the dsDNA either to a left-handed form, or to a stretched form, or to a Pauling form. Our coarse-grained simulations of these transitions have been found to be qualitatively consistent with all-atom simulation (Wereszczynski and Andricioaei 2006).

The structure-based modeling of molecules is not restricted to biomolecules. Recently, we have developed a coarse-grained model to study interaction of the poly(propylene imine) (PPI) and poly(amidoamine) (PAMAM) dendrimer ink molecules with the self-assembled monolayers of β-cyclodextrins (Thompson 2007; Cieplak and Thompson 2008). Such monolayers are new kinds of molecular printboards used in soft lithography (Schonher et al. 2000; Huskens 2006). An about threefold reduction in the number of the degrees of freedom (as measured by the number of the effective atoms involved) allows for studies of larger areas over which the ink molecules may diffuse and for studies of a set of many molecules. The details of the modeling and further discussion can be found in Cieplak and Thompson (2008). Among the results, we find that anchoring of the ink molecules to the monolayer is of a multi-valent nature and the average valency is found to be controlled by the temperature.
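Returning to model I of the dsDNA described earlier in this section, the placement of the b-bead and the energy scale of a base pair can be checked with a short Python sketch. It assumes equal weights for the h-beads when computing their center of mass, uses the 0.6 kcal/mol per hydrogen bond quoted above, and introduces hypothetical function names; the value of the Boltzmann constant in kcal/(mol K) is a standard constant rather than a number taken from this chapter.

```python
K_B_KCAL_PER_MOL_K = 0.0019872041   # Boltzmann constant, kcal/(mol K)
H_BONDS = {"AT": 2, "GC": 3}        # hydrogen bonds per base pair

def place_b_bead(p_bead, h_beads):
    """b-bead half-way between the p-bead and the h-bead centroid."""
    n = len(h_beads)
    centroid = [sum(h[k] for h in h_beads) / n for k in range(3)]
    return tuple(0.5 * (p + c) for p, c in zip(p_bead, centroid))

def pair_energy(pair, eps_hb=0.6):
    """Total hydrogen-bond energy of a base pair in kcal/mol."""
    return H_BONDS[pair] * eps_hb

def reduced_temperature(temperature_k, eps_kcal):
    """Dimensionless k_B T / epsilon."""
    return K_B_KCAL_PER_MOL_K * temperature_k / eps_kcal

# Average over equal numbers of A-T and G-C pairs: 2.5 bonds, 1.5 kcal/mol.
mean_pair_energy = 0.5 * (pair_energy("AT") + pair_energy("GC"))
```

With these numbers `reduced_temperature(300.0, mean_pair_energy)` evaluates to approximately 0.4, in line with the estimate given in the text.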

8.4 Examples of Applications of the Structure-Based Models of Proteins

8.4.1 Mechanical Strength of 17,134 Proteins

We now illustrate one advantage of the structure-based models: a possibility of performing a rapid comparative analysis of dynamical properties for thousands of proteins. Specifically, we consider resistance to stretching at constant speed. Fewer than a hundred proteins have been stretched experimentally (see, e.g., Carrion-Vazquez et al. 2009; Galera-Prat et al. 2010), which is just a tiny fraction of structures deposited in the Protein Data Bank. It thus seems worthwhile to explore possible elastic behaviors of all available structures to get an overview and to detect features that may be interesting. In particular, it is useful to predict which of the proteins would be expected to yield large values of Fmax and to explain why they are arising. We have addressed these tasks in several publications. In particular, we have used variant {6−12, C, M2, E0} in Sulkowska and Cieplak (2007) to make a survey of elastic properties of 7,510 proteins of not more than 150 amino acids (and 239 proteins with larger sizes, up to 851 amino acids). The latest survey (Sikora et al. 2009) involved 17,134 proteins comprising no more than 250 amino acids. It used


model LJ3 which we find to be more reliable. Structures with gaps in the structure assignment and complexes with the nucleic acids have not been taken into account. The primary objective of the surveys has been to determine theoretical values of Fmax for the terminus-to-terminus stretching. The full rank ordered list of proteins is available at www.ifpan.edu.pl/BSDB/. Another objective has been to correlate the values of Fmax with structural classification schemes such as class, architecture, topology, homology (CATH) (Orengo et al. 1997; Pearl et al. 2005) and structure classification of proteins (SCOP) (Murzin et al. 1995; Andreeva et al. 2008). Even though both schemes are hierarchical in nature they are governed by different principles. CATH uses algorithmic methods and it divides proteins into four classes and then into architectures, topologies, and homologies. SCOP is based on visual inspection and it groups protein species into seven classes (and three quasiclasses) and then into folds, superfamilies, families, and proteins. Both schemes involve sequential information at the lowest stages. We find that large values of Fmax should not arise in α proteins (provided they are single-domained). The architectures that are likely to have member proteins that are particularly resistant to stretching include ribbons, β-barrels, β-sandwiches (like titin), β-rolls (like ubiquitin), and three-layer (aba) sandwiches. The SCOP-based folds which are likely to yield large forces are listed in Sikora et al. (2009). It should be noted that mechanical unfolding is usually unrelated to thermal unfolding (and other kinds of unraveling) and mechanical strength need not go together with thermal stability (Cieplak and Sulkowska 2005). The region of a protein that acts as the most potent source of resistance to pulling is known as a mechanical clamp. 
It can be identified by inspecting the scenario diagrams that provide information about rupturing events that lead to large force peaks and then by removing sets of corresponding native contacts to see the impact on the value of Fmax (Sulkowska and Cieplak 2007; Sikora et al. 2009). The most common mechanical clamps that are strong involve shearing of many native contacts simultaneously, especially in long parallel β-strands that get additional stabilization from surrounding strands, as illustrated in Fig. 8.4. The top two strongest proteins endowed with this kind of a mechanical clamp have been determined to belong to the PDB codes 1c4p and 1qqr. The corresponding theoretical values of Fmax are 5.1 and 5.0 ε/Å, i.e., they should be within the range of about 550 pN. Among the very strong proteins identified by us there is also scaffoldin c7A (the PDB code 1aoh) with Fmax of 4.3 ε/Å. The experimental value of Fmax for this protein has been recently measured to be 480 pN (Valbuena et al. 2009) which testifies to the predictive value of our approach. In proteins discussed so far, the disulfide bonds, if present, do not play any important dynamical roles. However, we have identified tens of proteins in which such bonds are vital and may even yield Fmax within 1,000 pN range. In fact, the top strongest 13 proteins (Sikora et al. 2009), such as 1bmp (bone morphogenic protein) and 1vpf (vascular endothelial growth factor), all contain a motif corresponding to a cysteine slipknot (Fig. 8.5). One of them forms a knot-loop. This loop is fairly stiff as it is closed by two disulfide bonds. Another loop, the slip-loop, pierces through the knot-loop in the native state. On stretching, the slip-loop may be retracted out

Structure-Based Models of Biomolecules

Fig. 8.4 Two kinds of proteinic mechanical clamps that yield large resistance to stretching. The bottom panels indicate the relevant mechanisms in a schematic way. The top panels show the corresponding cartoon representations of proteins in which these mechanisms are operational. The two left panels, labeled "beta strands (1c4p)," illustrate the shearing mechanism of two parallel β-strands (shown by two central arrows in the bottom left panel) as exemplified by streptokinase 1c4p. The two right panels, labeled "cys slipknot (1vpf)," illustrate the slipknot mechanism as exemplified by the vascular endothelial growth factor 1vpf; the knot-loop and the slip-loop are marked in these panels

Fig. 8.5 The slipknot conformation in the thymidine kinase 1p6x. The line representation defines the characteristic locations k1, k2, and k3

of the knot-loop. We find that this process may often require large forces, of order 10 ε/Å, which are still a factor of 2–4 smaller than those required to rupture covalent bonds (Grandbois et al. 1999). We hope that our theoretical investigation of mechanics of the cysteine slipknot proteins will motivate experimental studies.
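The reduced forces quoted above can be tied to piconewtons with a back-of-the-envelope conversion. The sketch below assumes a single conversion factor, inferred only from the quoted pair 5.1 ε/Å ≈ 550 pN; the names PN_PER_EPS_PER_ANGSTROM and to_piconewtons are illustrative and not part of any published code, and the factor itself should be regarded as approximate.

```python
# Assumed conversion factor, inferred from the pair quoted in the text:
# a theoretical Fmax of 5.1 eps/A is stated to be about 550 pN.
PN_PER_EPS_PER_ANGSTROM = 550.0 / 5.1   # roughly 108 pN per eps/A


def to_piconewtons(force_reduced):
    """Convert a force in eps/A to pN using the assumed single factor."""
    return force_reduced * PN_PER_EPS_PER_ANGSTROM


# Theoretical force values quoted in the text, in eps/A.
quoted = {"1c4p": 5.1, "1qqr": 5.0, "1aoh": 4.3, "slip-loop retraction": 10.0}
estimates = {name: to_piconewtons(f) for name, f in quoted.items()}

for name, f_pn in estimates.items():
    print(f"{name}: about {f_pn:.0f} pN")
```

With this assumed factor, 1aoh comes out near 460 pN, close to the measured 480 pN, and a force of order 10 ε/Å corresponds to roughly 1.1 nN.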

M. Cieplak and J.I. Sułkowska

8.4.2 Dynamics of Knots

Knot-related topologies, such as cysteine slipknots and cysteine knots, in the native conformations of proteins may be related to the presence of disulfide bonds (Craik et al. 1999, 2001), as discussed in the previous section, but usually they are not. One may refer to such non-cysteinic cases as genuine knots or slipknots. The knots are of the so-called open kind, as the mathematically proper definition of a knot requires considering closed lines (loops). In practice, however, there are no difficulties with considering knots formed on open lines when the knots are "deep," that is, when the entangled region is not close to any of the termini. In fact, this is a situation that is frequently encountered for knotted proteins. Such complex topologies are found in only several hundred protein structures, which is less than 1% of the deposits contained in the PDB. These structures represent nine distinct folds and come with different biological functions (Mansfield 1994; Taylor 2000; Taylor and Lin 2003; Virnau et al. 2006; Bolinger et al. 2010). Knots are characterized mathematically by various types of invariants, such as the Alexander or Jones polynomials (Livingston 1993). The simplest, albeit important, example of such an invariant is the crossing number, defined as the minimal number of crossings in a planar projection that can be obtained by smooth transformations of the main chain that do not involve cutting. Most knotted proteins contain the simple trefoil knot, for which the crossing number is equal to 3. The most complicated protein knot found so far has a crossing number of 6. Genuine slipknots, as discussed by King et al. (2007) and Taylor (2007), are another example of geometrically involved structures. A slipknot contains a loop that is partly threaded through another loop. It therefore resembles a knot before it is untied, so that upon pulling by both termini the system would eventually acquire the conformation of a straight line.
Slipknots are then topologically trivial. The role of knots in the kinetics, thermodynamics, and function of proteins is currently being investigated. Some pioneering experiments include the works of Mallam and Jackson (2007), Mallam et al. (2008), Andersson et al. (2009), and Wagner et al. (2005). Nevertheless, the complexity of the structure makes the experimental results difficult to interpret since there are no direct techniques for distinguishing between knotted and unknotted structures. The difficulties compound when trying to assess properties of non-native conformations. These circumstances offer a window of opportunity for theoretical simulations, especially for coarse-grained models, since proteins with knots and slipknots are often large. In order to detect the presence of a knot in a given protein, we apply the KMT algorithm (Koniaris and Muthukumar 1991; Taylor 2000). It involves removing the Cα atoms, one at a time, for as long as the backbone does not intersect the triangle set by the atom under consideration and its two immediate sequential neighbors. If this procedure reduces the chain to two nodes, in a series of steps, then the structure is not knotted. Otherwise the structure contains a knot. This algorithm also allows for the determination of the two end points, k1 and k2, of the knot along the sequence (Sulkowska et al. 2008b). They are determined by cutting away, one by one, consecutive amino acids from the N terminus and the C terminus, respectively. The

Fig. 8.6 Knot dynamics in protein 1j85 during stretching. The data are based on the study by Sulkowska et al. (2008). The bottom panel shows the multi-peak dependence of the resistive force on the displacement, d, of the pulling element. The middle panel shows locations of the ends of the trefoil knot. The upper panels illustrate the backbone conformation, or its fragments, at three stages of the stretching process. The stages are indicated by the arrows

knot end points correspond to the sequential positions of the amino acids whose removal makes the knot undetectable for the first time. The sequential extension, k2 − k1, of a knot is a useful characteristic to determine in studies of the dynamics of knotted proteins. Describing the localization of a slipknot requires considering three sequential positions. One of them, denoted by k3, is determined by eliminating amino acids consecutively from one terminus until the knotted configuration is reached. The slipknot conformations are not symmetric and k3 can arise either from the N or from the C terminus side. Location k3 is where the slip-loop begins. It ends at k2. The end of the knot-loop is, simultaneously, the beginning of the slip-loop that pierces through the plane set by the knot-loop. The knot-loop ends at k1. Locations k1 and k2 are determined by applying the KMT algorithm anew with k3 acting as an effective new terminus in the procedure. Cutting off the amino acid located at k3 reduces a slipknot to a knot. Thus these three points (k1, k2, and k3) divide the slipknot into two loops: the knot-loop and the slip-loop (Sulkowska et al. 2009b). The KMT algorithm can be applied not only in the native state but also in any other conformation that may arise as a result of mechanical manipulation or during folding. This allows for monitoring the evolution of the locations of the ends of a knot or of a slipknot during dynamical processes. A natural way of studying the physical properties of a knotted protein is offered by mechanical stretching. Stretching by the protein termini results in knot tightening, as studied in Sulkowska et al. (2008b) within our LJ3 model. These studies have been accomplished for 18 proteins including the monomeric methyltransferase (PDB code 1j85), which is used, as an illustration, in Fig. 8.6. We have found that the

tightening process results in sudden jumps of the knot ends along the sequence (the middle panel of Fig. 8.6). The knot ends jump to selected characteristic sites before the fully tightened conformation is reached. These sites are typically endowed with a large curvature. These results are in contrast to the well-studied case of knots in homopolymers, which tend to diffuse smoothly along the chain (Metzler et al. 2006). In a recent experiment on phytochrome B (Bornschlogl et al. 2009), the tightening process reduced the final end-to-end distance by 12 amino acids compared to the corresponding distance measured along the backbone. We also note that pulling by some other amino acids may result in untying the knot, as in the experiment described by Alam et al. (2002). Further information about the behavior of knotted proteins can be obtained from a simulation in which the protein is first stretched and then released. We found that knots typically do not return to their native locations in such a process (Sulkowska et al. 2008b). Such simulations most likely probe different routes on the free energy landscape compared to chemical or thermal refolding. The force–displacement plot of Fig. 8.6 in itself does not shed light on the microscopic interpretation of the rupture process. Information on which contact is broken at a given displacement is provided by the scenario diagram. The scenario diagram for 1j85, shown in Fig. 8.7, indicates that each force peak comes with an accumulation of rupturing events. The numerical labels used in Fig. 8.7 correspond

Fig. 8.7 The scenario diagram for stretching of 1j85 at a constant velocity of 0.005 Å/τ at kBT/ε = 0.3. The y-axis indicates sequential separations of the native contacts. The x-axis indicates pulling distances at which particular contacts get ruptured (rij exceeds 1.5 σij for the last time). The data symbols denoted by faint asterisks correspond to contacts which do not involve any secondary structures. The remaining symbols are diversified and they correspond to contacts involving the secondary structures as identified by the numerical labels. The plot carries the annotation "knot tightening"

to the secondary structures as counted from the N terminus to the C terminus. Thus 1, 3, and 5 are the β-strands which form a β-sheet. A second β-sheet is formed of strands 7, 8, and 10. This β-sheet forms the core of the knot. The symbols 2, 4, 6, 9, and 11 denote helices. An inspection of the scenario diagram suggests that the unfolding process starts by breaking contacts between the two terminal helices 2 and 11. Afterward, stretching affects contacts in the N-terminal section, which involves no knot: contacts 1–8 rupture first, followed by contacts 3–5 (without generating a force peak), and then 1–3. These rupturing events are accompanied by the breaking of contacts which do not involve any secondary structures (denoted by the asterisks). The largest force peak is seen to be associated with shearing of the 7–8 and 7–10 contacts, which also involves a simultaneous rupture of the hydrogen bonds inside the core of the knot. This rupture makes the knot's ends jump without affecting Fmax in a significant manner. We have also studied the influence of knots on other physical properties of proteins (Sulkowska et al. 2009a). Specifically, we considered two proteins with similar structures but different topologies. One protein, 1yh1, contains a knot confined to a location far away from the termini. Another protein, 1c9y, is not knotted. Additionally, two synthetic constructs were made. The first is based on 1yh1, where the knot was removed by changing the way one crossing is made. The second is based on 1c9y, where a cysteine knot was created. This set of structures allows us to show that the presence of the knot affects some kinetic properties. In particular, the knotted structure is found to unravel through a different pathway than the corresponding unknotted structure during a stretching simulation. We have also found that the presence of a knot in our model systems enhances their mechanical and thermal stabilities.
Furthermore, the unfolding rate for the knotted protein has been found to be much lower. In addition, thermal fluctuations may untie the knotted protein. One of the possible ways of unknotting the protein proceeds through the slipknot conformation (Sulkowska et al. 2009a). As generally expected, and demonstrated explicitly by Cieplak and Sulkowska (2005) (see, however, a word of caution by Finkelstein (1997)), the process of thermal unfolding at a high temperature is statistically the reverse of folding under optimal folding conditions. These results motivate us to use the structure-based model to study the folding mechanism of knotted proteins. When we consider knots of the simplest type, in principle we can distinguish three possible mechanisms leading to the creation of such a knot. The most straightforward one would require just two steps: creating a loop and threading one end of the protein through it. The second mechanism is more complicated as it involves an intermediate step with a slipknot. The third possibility would involve the creation of an ensemble of loose random knots in the first stage, which may turn into deeper knots as a result of a longer lasting process. In Sulkowska et al. (2009a) we consider methyltransferase (PDB code 1od6) and show that a knotted protein can fold without involving any of the attractive non-native contacts that have been considered by Wallin et al. (2007). A similar result has been obtained within a simple and short lattice model with a shallow trefoil knot (Faisca et al. 2010).

We have found (Sulkowska et al. 2009a) that the topological bottlenecks may be overcome by forming slipknots as necessary intermediate conformations. It is worth noticing that a substantial number of loose randomly knotted structures are observed in our folding studies. Such behavior agrees with the results of simulations by Virnau et al. (2005) and with the well-known experimental result that flexible polymers or strings (Raymer and Smith 2007) can easily become knotted in a spontaneous way. In most cases, however, random knots observed in folding simulations of proteins do not lead to deep and tighter knots. We should remark that knots in a protein chain at room temperature do not necessarily behave in the same way as in a polymer chain. For instance, in our stretching simulations (Sulkowska et al. 2008c), knots tied on homopolymers have usually been found to diffuse off the chain. Although our findings suggest that the presence of the non-native attractive contacts (Wallin et al. 2007) is not necessary for the formation of a knot, these contacts could reduce topological traps such as the wrong sense of making a twist or forming a left-handed conformation instead of the corresponding mirror image. The folding mechanisms of knotted proteins require further experimental and theoretical studies. We note here that these folding processes appear not to involve any external intervention such as that provided by, e.g., chaperones (Mallam and Jackson 2007; Andersson et al. 2009).
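The KMT reduction described in this section can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the code used in the cited studies: the function names (segment_hits_triangle, kmt_reduce, is_knotted, open_trefoil) are hypothetical, the triangle test is the standard Möller–Trumbore construction, the coordinates are assumed to be in generic position (no exactly coplanar segment and triangle), and the two test chains are synthetic curves rather than protein backbones.

```python
import numpy as np


def segment_hits_triangle(a, b, v0, v1, v2, eps=1e-12):
    """Moller-Trumbore test: does segment a-b cross triangle v0-v1-v2?"""
    d = b - a
    e1, e2 = v1 - v0, v2 - v0
    h = np.cross(d, e2)
    det = np.dot(e1, h)
    if abs(det) < eps:              # segment parallel to the triangle plane
        return False
    s = a - v0
    u = np.dot(s, h) / det
    if u < 0.0 or u > 1.0:
        return False
    q = np.cross(s, e1)
    v = np.dot(d, q) / det
    if v < 0.0 or u + v > 1.0:
        return False
    t = np.dot(e2, q) / det         # position of the hit along a -> b
    return 0.0 <= t <= 1.0


def kmt_reduce(points):
    """Remove interior nodes while their triangles are not pierced."""
    chain = [np.asarray(p, dtype=float) for p in points]
    removed = True
    while removed and len(chain) > 2:
        removed = False
        i = 1
        while i < len(chain) - 1:
            tri = (chain[i - 1], chain[i], chain[i + 1])
            # Test only segments that share no node with the triangle.
            blocked = any(
                segment_hits_triangle(chain[j], chain[j + 1], *tri)
                for j in range(len(chain) - 1)
                if j + 1 < i - 1 or j > i + 1
            )
            if blocked:
                i += 1
            else:
                del chain[i]        # node i is redundant; retry same index
                removed = True
    return chain


def is_knotted(points):
    """A chain that cannot be reduced to its two termini is knotted."""
    return len(kmt_reduce(points)) > 2


def open_trefoil(n=60, cut=0.2, stretch=5.0):
    """Synthetic test chain: a trefoil curve cut open at an outermost
    point, with both termini pulled radially outward (a 'deep' knot)."""
    t = np.linspace(np.pi / 3 + cut, np.pi / 3 + 2 * np.pi - cut, n)
    core = np.column_stack([np.sin(t) + 2 * np.sin(2 * t),
                            np.cos(t) - 2 * np.cos(2 * t),
                            -np.sin(3 * t)])
    return np.vstack([stretch * core[0], core, stretch * core[-1]])


# Synthetic unknotted chain: a helix that rises monotonically along z.
s = np.linspace(0.0, 4.0 * np.pi, 40)
helix = np.column_stack([np.cos(s), np.sin(s), 0.3 * s])

print("helix knotted:", is_knotted(helix))
print("open trefoil knotted:", is_knotted(open_trefoil()))
```

Under the same assumptions, the knot end points k1 and k2 could be located by calling is_knotted on chains truncated one residue at a time from either terminus and recording the first truncation for which the knot is no longer detected.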

8.4.3 Proteins in Membranes

One needs a structure as an input to structure-based modeling. The structure need not be derived experimentally; it can also be determined through all-atom simulations. As an example, we consider the case of pulling bacteriorhodopsin out of a membrane (Janovjak et al. 2003). Bacteriorhodopsin is a protein whose crystal structure is known and thus its native contacts can be determined. However, it also interacts with its phospholipidic environment, which holds it in place. The molecular structure of the membrane is not readily available. In order to find effective protein–membrane contacts, we have simulated (Cieplak et al. 2006) the whole system near its equilibrium to determine the average atomic positions. The system was then coarse grained. In the case of the membrane, the structure was represented by its carbon atoms. We then applied the criterion based on the van der Waals volumes to determine which of these carbon atoms were making contacts with the protein. At this stage, the system is represented by the Cα atoms of the protein and by a cage of frozen C atoms that are within the interaction range from the protein. When one pulls one end of the protein out of the membrane, the "walls" would be expected to cave in. Yet this appears to be a dynamically minor effect because we find (Cieplak et al. 2006) the resulting multi-peak force pattern to be remarkably similar to the shape obtained experimentally (though there are problems with the scale of the force). We can also demonstrate the dependence of the pattern on the selection of the terminal amino acid that is being pulled.

8.4.4 Hydrodynamic Interactions

The water solvent is the source of the hydrophobic effect that keeps proteins globular, and it also mediates a drag force on one amino acid due to the flow induced by the motion of another amino acid. The latter effect goes under the name of hydrodynamic interactions (HI) and it enhances cooperative features in the dynamics of proteins, DNA, and simple polymers (see, e.g., Hinczewski et al. 2009). The HI may manifest themselves in a variety of ways, primarily through the emergence of dragging and screening phenomena. The HI are implicit in all-atom simulations involving water molecules, but their role is minuscule in near-equilibrium simulations. The structure-based models come in handy when assessing the role of HI because of the larger time scales available and because implicit solvent models can capture the HI approximately by incorporating the diffusion tensor, Dij, in the description of the system. In the Rotne–Prager form (Rotne and Prager 1969; Yamakawa 1970) of this tensor, the beads of finite hydrodynamic radii are coupled in a way that depends on their separation, r (as 1/r to the leading order; strictly speaking, the approach is valid at large distances). When dealing with the HI, one can use the equations of Newtonian dynamics (Miller et al. 2008), but it is more straightforward to work within Brownian dynamics (Ermak and McCammon 1978). The equations of Brownian dynamics read

\[
\mathbf{r}_i - \mathbf{r}_i^0 = \sum_j \nabla_j \cdot \mathbf{D}_{ij}^0 \, \Delta t + \frac{1}{k_B T} \sum_j \mathbf{D}_{ij}^0 \cdot \mathbf{F}_j^0 \, \Delta t + \mathbf{B}_i ,
\]

where the left-hand side represents the displacement of bead i in the time step Δt, the index 0 denotes the values of the respective quantities at the beginning of the time step, and Fj is the force exerted on particle j by other particles. B is a random displacement given by a Gaussian distribution with an average value of zero and covariance obeying

\[
\langle \mathbf{B}_i \mathbf{B}_j \rangle = 2 \mathbf{D}_{ij}^0 \, \Delta t .
\]

We take the diffusion tensor as given by

\[
\mathbf{D}_{ii} = \frac{k_B T}{\gamma} \, \mathbf{I}
\]

and

\[
\mathbf{D}_{ij} = \frac{k_B T}{\gamma} \, \frac{3a}{4 r_{ij}}
\begin{cases}
\left( 1 + \dfrac{2a^2}{3 r_{ij}^2} \right) \mathbf{I} + \left( 1 - \dfrac{2a^2}{r_{ij}^2} \right) \hat{\mathbf{r}}_{ij} \hat{\mathbf{r}}_{ij} , & r_{ij} \ge 2a \\[2ex]
\dfrac{r_{ij}}{2a} \left[ \left( \dfrac{8}{3} - \dfrac{3 r_{ij}}{4a} \right) \mathbf{I} + \dfrac{r_{ij}}{4a} \, \hat{\mathbf{r}}_{ij} \hat{\mathbf{r}}_{ij} \right] , & r_{ij} < 2a
\end{cases}
\]

where \(\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i\), \(r_{ij} = |\mathbf{r}_{ij}|\), \(\hat{\mathbf{r}}_{ij} = \mathbf{r}_{ij}/r_{ij}\), and a is the hydrodynamic radius. By Stokes' law, γ = 6πaη, where η is the viscosity of the fluid. The non-diagonal parts of Dij are responsible for the HI effects. The selection of the value of the hydrodynamic radius is not obvious. When one thinks of amino acids as represented by spherical beads

then one would be inclined to consider a value of a that does not exceed half of the distance between consecutive Cα atoms. On the other hand, the residues extend laterally, which may give rise to larger drag forces. It has been argued (Garcia de la Torre and Bloomfield 1981; Antosiewicz and Porschke 1989) that a characteristic a of an amino acid could be even as big as 4.2 Å, whereas a characteristic van der Waals radius is about 3.0 Å (Zamyatin 1972). The van der Waals volume does not include hydration layers whereas the value of 4.2 Å effectively does so. Such big values would mean the existence of an overlap between spheres in a chain, as considered in Banavar et al. (2009), but their use takes into account the non-spherical properties of the amino acids in an approximate way. However, considering a of 1.5 Å has a numerical advantage because the time unit is governed by the damping constant and is thus proportional to a. Hence, in most of our HI-related calculations, when we are after qualitative effects, this is the value that we use. It should be noted that characteristic times, like those of folding, scale primarily linearly with a for systems both with and without HI. Therefore comparisons can be made for whatever value of a. We have used this formalism to study several proteins, such as ubiquitin, to establish the following results. The HI lead to (a) reduction in peak unfolding forces when stretching at high constant velocities (Szymczak and Cieplak 2007b; the effect of dragging), (b) reduction in unfolding times when stretching at constant force (Szymczak and Cieplak 2007b; dragging), (c) hindering of unraveling imposed through uniform (Szymczak and Cieplak 2006) and shear (Szymczak and Cieplak 2007c) fluid flows (screening), and (d) acceleration of folding times by a protein-dependent factor (Cieplak and Niewieczerzal 2009; dragging) and to faster flap closing dynamics in HIV-1 protease (Cieplak and Niewieczerzal 2009; dragging).
Item (d) above is qualitatively consistent with the results obtained for self-attracting homopolymers (Kikuchi et al. 2005) and several proteins (Ryder 2005) within the stochastic rotation dynamics approach (Malevanets and Kapral 1999) that mimics the solvent in the spirit of cellular automata. The acceleration of folding is observed to be confined to the collapse stage of the process, and we find it to be by a protein-dependent factor of order 2 (Cieplak and Niewieczerzal 2009). The value of the factor depends on the nature of the initial unfolded state (in our case it is the full extension), whereas Ryder (2005) estimates it to be smaller. Our results on folding times are quite similar to those obtained recently by Frembgen-Kesner and Elcock (2009) on 11 proteins studied within the structure-based model developed by Clementi et al. (2000).
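The Brownian dynamics scheme of this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code behind the cited results: the function names rpy_tensor and bd_step are hypothetical, reduced units with kB T = γ = 1 are assumed by default, beads are assumed not to coincide, the random displacements are drawn through a Cholesky factor of 2 D Δt, and the divergence term is dropped because the Rotne–Prager tensor is divergence-free.

```python
import numpy as np


def rpy_tensor(pos, a, kT=1.0, gamma=1.0):
    """Build the 3N x 3N diffusion tensor for beads of radius a."""
    n = len(pos)
    eye = np.eye(3)
    D = np.zeros((3 * n, 3 * n))
    for i in range(n):
        D[3*i:3*i+3, 3*i:3*i+3] = (kT / gamma) * eye      # D_ii
        for j in range(i + 1, n):
            rij = pos[j] - pos[i]
            r = np.linalg.norm(rij)                       # assumed r > 0
            rr = np.outer(rij, rij) / r**2                # r_hat r_hat
            if r >= 2 * a:
                block = ((1 + 2 * a**2 / (3 * r**2)) * eye
                         + (1 - 2 * a**2 / r**2) * rr)
            else:                                         # overlapping beads
                block = (r / (2 * a)) * ((8 / 3 - 3 * r / (4 * a)) * eye
                                         + (r / (4 * a)) * rr)
            block = (kT / gamma) * (3 * a / (4 * r)) * block
            D[3*i:3*i+3, 3*j:3*j+3] = block               # D_ij
            D[3*j:3*j+3, 3*i:3*i+3] = block               # D_ji (symmetric)
    return D


def bd_step(pos, forces, dt, a, kT=1.0, gamma=1.0, rng=None):
    """One Brownian dynamics step with hydrodynamic interactions."""
    rng = np.random.default_rng() if rng is None else rng
    D = rpy_tensor(pos, a, kT, gamma)
    drift = D @ forces.ravel() * dt / kT                  # (1/kT) D.F dt
    chol = np.linalg.cholesky(2.0 * dt * D)               # <B B> = 2 D dt
    noise = chol @ rng.standard_normal(D.shape[0])
    return pos + (drift + noise).reshape(pos.shape)


# Illustrative use: three beads in a row, a = 1.5 as in the text; the 3.8
# spacing (a typical C-alpha separation in Angstrom) is an assumption here.
beads = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
new_beads = bd_step(beads, np.zeros_like(beads), dt=1e-3, a=1.5,
                    rng=np.random.default_rng(0))
print(new_beads.shape)
```

Building the full tensor and factorizing it at every step scales poorly with the number of beads, so this form is only meant to make the structure of the update explicit.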

8.4.5 Nanoindentation of Virus Capsids

Capsids are conglomerates of proteins that protect viral genomes. The RNA-enveloping capsids are quite resilient mechanically (though not as much as those which envelop the DNA material), as probed recently through nanoindentation measurements (Michel et al. 2006). One especially well-studied RNA-related capsid belongs to the cowpea chlorotic mottle virus (CCMV). The capsids can be made

devoid of their genetic content, which may remove some of the resiliency. In particular, it has been found (Michel et al. 2006) that an emptied CCMV capsid displays an initial Hookean reversible regime corresponding to the spring constant of 0.14 N/m that persists for up to 20% reduction in the diameter, where the maximum force of resistance reaches about 500 pN (Michel et al. 2006; Klug et al. 2006). The elastic regime ends in a collapse that involves buckling. The same CCMV capsid with the RNA in it would reach about 1,000 pN before collapsing. These forces are noticeably larger than those needed for a mechanical unraveling of proteins. As an example, unraveling titin requires a force of about 204 pN (Rief et al. 1997; Carrion-Vazquez et al. 1999). The sturdiness of the capsids comes from the mechanical properties of the individual proteins and from reinforcements provided by the geometry of their assembly, in which any single protein is stabilized by its neighbors. CCMV capsids consist of 180 sequentially identical proteins comprising 190 amino acids each in a β-barrel fold. They have the fullerene-like truncated icosahedral symmetry with 12 pentagonal and 20 hexagonal faces. This regular nature of the geometry has suggested modeling the capsid as a homogeneous shell and using the finite element method to determine its mechanical properties by studying the resulting curved elastic network (Vliegenthart and Gompper 2006; Buenemann and Lenz 2007; Gibbons and Klug 2008). The plots of the force of resistance versus the degree of nanoindentation obtained in this way agree qualitatively with the experimental findings. It should be noted, however, that the thickness of a capsid is non-uniform since the "faces" themselves actually dissect three-dimensional molecular complexes which are far from being homogeneous. In the CCMV case the Cα atoms extend between ∼97 and 143 Å from the capsid center.
Most of the theoretical studies of molecular level aspects of the capsid mechanics have been restricted to the properties of the native state. Such studies include an all-atom-based analysis of the low frequency vibrations in the satellite tobacco necrosis virus (Dykeman and Sankey 2008), an elastic network model study of eigenmodes in the HK97 virus (Kim et al. 2003), and a similar model for CCMV (Tama and Brooks 2002, 2004). However, there are few molecular level studies of the nanoindentation processes. One exception is the recent all-atom study by Zink and Grubmueller (2009) on the southern bean mosaic virus which, however, used extreme nanoindentation rates and emulated "nanoindentation" by pressing one amino acid into the center of the virus capsid. We have recently performed a structure-based molecular dynamics study of the CCMV capsid (Cieplak and Robbins 2010) and demonstrated the existence of deviations from the elastic homogeneous shell model. This indicates the relevance of the molecular level effects for the mechanostability of the capsid. We have also shown that past the elastic regime, the nanoindentation process is strongly history dependent and hysteretic. The locations of some amino acids have not been determined through X-ray crystallography (Speir et al. 1995; Carrillo-Tripp et al. 2009) and our model makes use only of the known parts of the structure. Thus the coarse-grained model of the whole capsid involves 28,620 Cα atoms and 62,426 native contacts. (It also involves of order 5 × 10⁸ repulsive non-native contacts, which makes it necessary to deal with

their effects by determining a Verlet list (Allen and Tildesley 1987) periodically.) Nanoindentation is implemented by placing the capsid between two parallel plates that generate h⁻¹⁰ repulsive potentials, where h is the distance away from the plate. In the initial state, the plates do not touch the capsid and then they both move toward each other with a relative speed of (in most cases) 0.005 Å/τ. Figure 8.8 shows an example of a nanoindentation trajectory for the CCMV capsid. As the separation, s, decreases, the dependence of F on s is basically linear for the first ∼35% of the process as measured by the values of s relative to the separation corresponding to the onset of the force of resistance. In this linear regime the nanoindentation process is nearly reversible: an inversion of the direction of motion of the plates results in a force which drops linearly in a way that tracks its previous growth. At the end of this Hookean regime the force acquires a value Fm ∼ 5.5 ε/Å, which should be of order 600 pN. The resulting spring constant is close to 0.05 ε/Å², which should correspond to about 5.5 pN/Å (0.055 N/m). This is softer than the experimental value due to the usage of the Lennard–Jones potential, which is fairly broad. At this stage, a collapse follows resulting in the force dropping to

Fig. 8.8 Nanoindentation of CCMV capsid at kB T/ε = 0.3 after (Cieplak and Robbins 2010). The solid line in the main figure shows the force of resistance to indentation against the separation between the compressing planes. The force is averaged over time intervals and between the planes. The dotted line shows the force when the planes are moving back once s reaches a value sc (here sc is 100 Å as an illustration) which is past the Hookean regime. This dotted line is seen not to retrace the compression curve. However, there is a reversibility if the motion back is confined to the linear elastic region (s > 180 Å). The figures at the top show the schematic representation of the shapes of the capsid at the three values of s as indicated. The shapes of the n-gons and the locations of the vertices are obtained through studies of the Cα mass distribution for proteins belonging to the particular n-gons. At s = 284 Å, the capsid is in its native state

about 3 ε/Å. The collapse results primarily from the breakage of the interproteinic contacts. Finally, F starts to rise indefinitely due to compactification of the hard cores representing the amino acids. After departing from the linear regime no process is reversible. Further discussion together with a detailed comparison to the elastic shell model is presented in Cieplak and Robbins (2010). We just note here that the biggest deviations are observed in the strain field, i.e., in the shape of the deformed capsid. In summary, the structure-based models appear to perform well in many relevant applications, especially when one seeks qualitative and comparative goals instead of chemically precise values. The models have been developed to tackle problems of large conformational changes, large time scales, and large sizes. In this last case, they can also be of interest when studying equilibrium fluctuations. For instance, our results on the single-bead and two-bead fluctuation characteristics of the large human topoisomerase I protein (Szklarczyk et al. 2009) are consistent with all-atom simulations (Chillemi et al. 2007) but have been obtained with a much smaller computational effort.

Acknowledgments This work summarizes a decade of research performed with a group of collaborators, in particular with J.R. Banavar, M. Carrion-Vazquez, S. Filipek, T.X. Hoang, H. Janovjak, A. Maritan, P. Marszałek, S. Niewieczerzał, A. Pastore, M.O. Robbins, M. Sikora, K. Staroń, P. Sułkowski, O. Szklarczyk, P. Szymczak, D. Thompson, and M. Wojciechowski. We appreciate S. Niewieczerzał's help in preparation of the first three figures displayed here. This work has been supported by the grant N N202 0852 33 from the Ministry of Science and Higher Education in Poland, by the EC FUNMOL project under FP7-NMP-2007-SMALL-1, and by the European Union within European Regional Development Fund, through grant Innovative Economy (POIG.01.01.02-00-008/08).

References Abe H, Go N (1981) Noninteracting local-structure model of folding and unfolding transition in globular proteins. II. Application to two-dimensional lattice proteins. Biopolymers 20: 1013–1031 Alam MT, Yamada T, Carlsson U, Ikai A (2002) Importance of being knotted: effects of C-terminal knot structure on enzymatic and mechanical properties of bovine carbonic anhydrase II. FEBS Lett 519:35–42 Allemand JF, Bensimon D, Lavery R, Croquette V (2007) Stretched and overwound DNA forms a Pauling-like structure with exposed bases. Proc Natl Acad Sci USA 95:14152–14157 Allen MP, Tildesley DJ (1987) Computer simulation of liquids. Oxford University Press, New York, NY. Andersson FI, Pina DG, Mallam AL, Blaser G, Jackson SE (2009) Untangling the folding mechanism of the 52 -knotted protein UCH-L3. FEBS J 276:2625–2635 Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG (2008) Data growth and its impact on the SCOP database: new developments. Nucl Acid Res 36: D419–D425 Antosiewicz J, Porschke D (1989) Volume correction for bead model simulations of rotational friction coefficients of macromolecules. J Phys Chem 93:5301–5305 Bahar I, Atilgan AR, Erman B (1997) Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Fold Des 2:173–181

Bahar I, Erman B, Jernigan RL, Atilgan AR, Covell DG (1999) Collective motions in HIV-1 reverse transcriptase: examination of flexibility and enzyme function. J Mol Biol 285:1023–1037 Banavar JR, Cieplak M, Hoang TX, Maritan A (2009) First-principles design of nanomachines. Proc Natl Acad Sci USA 106:6900–6903 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN et al (2000) The Protein Data Bank. Nucl Acids Res 28:235–242 Bockelmann U, Essevaz-Roulet B, Heslot F (1997) Molecular stick-slip motion revealed by opening DNA with piconewton forces. Phys Rev Lett 79:4489–4492 Bockelmann U, Thomen Ph, Essevaz-Roulet B, Viasnoff V, Heslot F (2002) Unzipping DNA with optical tweezers: High sequence sensitivity and force flips. Biophys J 82:1537–1553 Bolinger D, Sulkowska JI, Hsu H-P, Mirny LA, Kardar M, Onuchic JN, Virnau P (2010) A Stevedore’s protein knot. PLoS Comput Biol 6:e1000731 Bornschlogl T, Anstrom DM, Mey E, Dzubiella J, Rief M, Forest KT (2009) Tightening the knot in phytochrome by single-molecule atomic force microscopy. Biophys J 96:1508–1514 Bryant Z, Stone MD, Gore J, Smith SB, Cozzarelli NR, Bustamante C (2003) Structural transitions and elasticity from torque measurements on DNA. Nature 424:338–341 Bryngelson JD, Onuchic JN, Socci ND, Wolynes PG (1995) Funnels, pathways and the energy landscape of protein folding: a synthesis. Proteins Struct Funct Genet 21:167–195 Bustamante C, Bryant Z, Smith SB (2003) Ten years of tension: single-molecule DNA mechanics. Nature 421:423–426 Buenemann M, Lenz P (2007) Mechanical limits of viral capsids. Proc Natl Acad Sci USA 104:9925–9930 Carrillo-Tripp M, Shepherd C, Borelli IA, Venkataraman S, Lander G, Natarajan P, Johnson JE, Brooks III CL, Reddy V (2009) VIPERdb2: an enhanced and web API enabled relational database for structural virology. Nucl Acids Res 37:D436–D442. http://viperdb.scripps.edu/. 
Accessed date 2008 Carrion-Vazquez M, Oberhauser AF, Fowler SB, Marszalek PE, Broedel PE et al (1999) Mechanical and chemical unfolding of a single protein: a comparison. Proc Natl Acad Sci USA 96:3694–3699 Carrion-Vazquez M, Oberhauser AF, Fisher TE, Marszalek PE, Li H, Fernandez JM (2000) Mechanical design of proteins studied by single-molecule force spectroscopy and protein engineering, Prog Biophys Mol Biol 74:63–91 Carrion-Vazquez M, Cieplak M, Oberhauser AF (2009) Protein mechanics at the single-molecule level. In: Meyers RA (eds) Encyclopedia of complexity and systems science. Springer, New York, NY, pp 577–603. ISBN:978-0-387-75888-6 Chillemi G, Bruselles A, Fiorani P, Bueno S, Desideri A (2007) The open state of human topoisomerase I as probed by molecular dynamics simulation. Nucl Acids Res 35:3032–3038 Cieplak M, Koplik J, Banavar JR (2001) Boundary conditions at a fluid–solid interface. Phys Rev Lett 86:803–806 Cieplak M, Hoang TX, Robbins MO (2002) Folding and stretching in a Go-like model of titin Proteins. Struct Funct Bioinform 49:114–124 Cieplak M, Hoang TX (2003) Universality classes in folding times of proteins. Biophys J 84: 475–488 Cieplak M (2004) Cooperativity and contact order in protein folding. Phys Rev E 69:031907 Cieplak M, Hoang TX, Robbins MO (2004) Thermal effects in stretching of Go-like models of titin and secondary structures. Proteins Struct Funct Bioinform 56:285–297 Cieplak M, Sulkowska JI (2005) Thermal unfolding of proteins. J Chem Phys 123:194908 Cieplak M, Filipek S, Janovjak H, Krzysko KA (2006) Pulling single bacteriorhodopsin out of a membrane: comparison of simulation and experiment. BBA – Biomembranes 1758:537–544 Cieplak M, Thompson D (2008) Coarse-grained molecular dynamics simulations of nanopatterning with multivalent inks. J Chem Phys 128:234906 Cieplak M, Robbins MO (2010) Nanoindentation of virus capsids in a molecular model. J Chem Phys 132:015101

8

Structure-Based Models of Biomolecules

205

Cieplak M, Niewieczerzal S (2009) Hydrodynamic interactions in protein folding. J Chem Phys 130:124905 Cieplak M, Sulkowska JI (2009) Tests of the structure-based models of proteins. Acta Phys Polonica 115:441–445 Clementi C, Nymeyer H, Onuchic JN (2000) Topological and energetic factors: What determines the structural details of the transition state ensemble and “en-route” intermediates for protein folding? An investigation for small globular proteins. J Mol Biol 298:937–953 Cluzel P, Lebrun A, Heller C, Lavery R, Viovy J-L, Chatenay D, Caron F (1996) DNA: an extensible molecule. Science 271:792–794 Craik DJ, Daly NL, Bond TJ, Waine C (1999) Plant cyclotides: a unique family of cyclic and knotted proteins that defines the cyclic cystine knot structural motif. J Mol Biol 294:1327–1336 Craik DJ, Dally NL, Waine C (2001) The cysteine knot motif in toxins and implications for drug design. Toxicon 39:43–60 Duan Y, Kollman PA (1998) Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 282:740–744 Dykeman EC, Sankey OF (2008) Low frequency mechanical models of viral capsids: an atomistic approach. Phys Rev Lett 100:028101 Ermak DL, McCammon JA (1978.) Brownian dynamics with hydrodynamic interactions. J Chem Phys 69:1352–1360 Faisca PFN, Travasso RDM, Charters T, Nunes A, Cieplak M (2010) The folding of knotted proteins: insights from lattice simulations. Phys Biol 7:016009 Finkelstein AV (1997) Can protein unfolding simulate protein folding. Prot Eng 10:843–845 Fowler SB, Best RB, Toca Herrera JL, Rutherford TJ, Steward A, Paci E, Karplus M, Clarke J (2002) Mechanical Unfolding of a Titin Ig Domain: structure of unfolding Intermediate revealed by combining AFM, molecular dynamics simulations, NMR and protein engineering. J Mol Biol 322:841–849 Frembgen-Kesner T, Elcock AH (2009) Striking effects of hydrodynamic interactions on the simulated diffusion and folding of proteins. 
J Chem Theory Comp 5:242–256 Galera-Prat A, Gomez-Sicilia A, Oberhauser AF, Cieplak M, Carrion-Vazquez M (2010) Understanding biology by stretching proteins: recent progress. Curr Op Struct Biol 20:63–69 Garcia de la Torre J, Bloomfield VA (1981) Hydrodynamic properties of complex, rigid, biological macromolecules: theory and applications. Quarter Rev Biophys 14:81–139 Gear WC (1971) Numerical initial value problems in ordinary differential equations. Prentice-Hall, New York, NY Gibbons MM, Klug WS (2008) Influence of nonuniform geometry on nanoindentation of viral capsids. Biophys J 95:3640–3649 Goldstein RA, Luthey-Schulten ZA, Wolynes PG (1992) Optimal protein-folding codes from spinglass theory. Proc Natl Acad Sci USA 89:4918–4922 Grandbois M, Beyer M, Rief M, Clausen-Schaumann H, Gaub H (1999) How Strong Is a Covalent Bond? Science 283:1727–1730 Grest GS, Kremer K (1986) Molecular-dynamics simulation for polymers in the presence of a heat bath. Phys Rev A 33:3628–3631 Hinczewski M, Schlagberger X, Rubinstein M, Krichevsky O, Netz RR (2009) End-monomer dynamics in semiflexible polymers. Macromolecules 42:860–875 Hoang TX, Cieplak M (2000) Molecular dynamics of folding of secondary structures in Go-type models of proteins. J Chem Phys 112:6851–6862 Huskens J (2006) Multivalent interactions at surfaces. Curr Opin Chem Biol 10:537–543 Hyeon C, Thirumalai D (2005) Mechanical unfolding of RNA hairpins. Proc Natl Acad Sci USA 102:6789–6794 Janovjak H, Kessler M, Oesterhelt D, Gaub HE, Mueller DJ (2003) Unfolding pathways of native bacteriorhodopsin depend on temperature. EMBO J 22:5220–5229 Karanicolas J, Brooks III CL (2002) The origins of asymmetry in the folding transition states of protein L and protein G. Protein Sci 11:2351–2361

206

M. Cieplak and J.I. Sułkowska

King NP, Yeates EO, Yeates TO (2007) Identification of rare slipknots in proteins and their implications for stability and folding. J Mol Biol 373:153–166 Kikuchi N, Ryder JF, Pooley CM, Yeomans JM (2005) Kinetics of the polymer collapse transition: the role of hydrodynamics. Phys Rev E 71:061804 Kim MK, Jernigan RL, GS Chirikjian GS (2003) An elastic network model of HK97 capsid maturation. J Struct Biol 143:107–117 Klimov DK, Thirumalai D (1997) Viscosity dependence of the folding rates of proteins. Phys Rev Lett 79:317–320 Knotts IV TA, Rathore N, Schwartz DC, de Pablo JJ (2007) A coarse grain model for DNA. J Chem Phys 126:84901 Koniaris K, Muthukumar M (1991) Knottedness in ring polymers. Phys Rev Lett 66:2211–2214 Koster DA, Croquette V, Shuman S, Dekker NH (2005) Friction and torque govern the relaxation of DNA supercoils by eukaryotic topoisomerase IB. Nature 34:671–674 Kwiecinska JI, Cieplak M (2005) Chirality and protein folding. J Phys Cond Matter 17: S1565–S1580 Klug WS, Bruinsma RF, Michel J-P, Knobler CM, Ivanovska IL, Schmidt CF, Wuite GJL (2006) Failure of viral shells. Phys Rev Lett 97:228101 Li A, Daggett V (1994) Characterization of the transition state of protein unfolding by use of molecular dynamics: chymotrypsin inhibitor 2. Proc Natl Acad Sci USA 91: 10430–10434 Livingston C (1993) Knot theory. Mathematical Association of America, Washington, DC. Lu H, Schulten K (1999) Steered molecular dynamics simulation of conformational changes of immunoglobulin domain I27 interprete atomic force microscopy observations. Chem Phys 247:141–153 Mallam AL, Jackson SE (2007) Comparison of the folding of two knotted proteins: YbeA and Yibk. J Mol Biol 366:650–665 Mallam AL, Onuoha SC, Grossmann JG, Jackson SE (2008) Knotted fusion proteins reveal unexpected possibilities in protein folding. Mol Cell 30:642–648 Malevanets A, Kapral R (1999) Mesoscopic model for solvent dynamics. J Chem Phys 110: 8605–8613 Mansfield ML (1994) Are there knots in proteins? 
Nat Struct Biol 1:213–214 Metzler R, Reisner W, Riehn R, Austin R, Tegenfeldt JO, Sokolov IM (2006) Diffusion mechanisms of localised knots along a polymer. Europhys Lett 76:696–702 Micheletti C, Latanzi G, Maritan A (2002) Elastic properties of proteins: insight on the folding process and evolutionary selection of native structures. J Mol Biol 321:909–921 Michel JP, Ivanovska IL, Gibbons MM, Klug WS, Knobler CM, Wuite GJL, Schmidt CF (2006) Nanoindentation studies of full and empty viral capsids and the effects of capsid protein mutations on elasticity and strength. Proc Natl Acad Sci USA 103:6184–6189 Miller BT, Zheng W, Venable RM, Pastor RW, Brooks BR (2008) Langevin network model of myosin. J Phys Chem B 112:6274–6281 Miyazawa S, Jernigan RL (1996) Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 256: 623–644 Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–40 Niewieczerzal S, Cieplak M (2009) Stretching and twisting of the DNA duplexes in coarse-grained dynamical models. J Phys Cond Matter 21:474221 Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB et al (1997) CATH – A hierarchical classification of protein domain structures. Structure 5:1093–108 Oroszi L, Gajda P, Kirei H, Bottka S, Ormos P (2006) Direct measurement of torque in an optical trap and its applications to double-strand DNA. Phys Rev Lett 97:058301 Paci E, Karplus M (2000) Unfolding proteins by external forces and temperature: the importance of topology and energetics. Proc Natl Acad Sci USA 97:6521–6526

8

Structure-Based Models of Biomolecules

207

Pabon G, Amzel LM (2006) Mechanism of titin unfolding by force: Insight from quasi-equilibrium molecular dynamics calculations. Biophys J 91:467–472 Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O et al (2005) The CATH domain structure database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucl Acid Res 33:D247–51 Raymer DM, Smith DE (2007) Spontaneous knotting of an agitated string. Proc Natl Acad Sci USA 104:16432–16437 Rief M, Gautel M, Oesterhelt F, Fernandez JM, Gaub HE (1997) Reversible unfolding of individual titin immunoglobulin domains by AFM. Science 276:1109–1112 Rotne J, Prager S (1969) Variational treatment of hydrodynamic interaction on polymers. J Chem Phys 50:4831–4837 Ryder JF (2005) Mesoscopic simulations of complex fluids, Ph.D. thesis, the University of Oxford. Schonher H, Beulen MWJ, Bugler J, Huskens J, van Veggel FCJM, Reinhoudt DN, Vancso GJ (2000) Individual supramolecular host–guest interactions studied by dynamic single molecule spectroscopy. J Am Chem Soc 122:4963–4967 Settanni G, Hoang TX, Micheletti C and Maritan A (2002) Folding pathways of prion and doppel. Biophys J 83:3533–3541 Sikora M, Sulkowska JI, Cieplak M (2009) Mechanical strength of 17 134 model proteins and cysteine knots. PLoS Comput Biol 5:e1000547 Smith ED, Robbins MO, Cieplak M (1996) Friction on adsorbed monolayers. Phys. Rev B 54:8252–8260 Snow CD, Nguyen H, Pande V, Gruebele M (2002) Absolute comparison of simulated and experimental protein folding dynamics. Nature 420:102–106 Sobolev V, Sorokine A, Prilusky J, Abola EE, Edelman M (1999) Automated analysis of interatomic contacts in proteins. Bioinformatics 15:327–332 Speir JA, Munshi S, Wang G, Baker TS, Johnson JE (1995) Structures of the native and swollen forms of cowpea chlorotic mottle virus determined by X-ray crystallography and cryo-electron microscopy. 
Structure 3:63–78 Sulkowska JI, Cieplak M (2007) Mechanical stretching of proteins – a theoretical survey of the Protein Data Bank. J Phys Cond Matter 19:283201 Sulkowska JI, Cieplak M (2008) Selection of optimal variants of Go-like models of proteins through studies of stretching. Biophys J 95:3174–3191 Sulkowska JI, Kloczkowski A, Sen TZ, Cieplak M, Jernigan RL (2008a) Predicting the order in which contacts are broken during single molecule protein stretching experiments. Proteins Struct Funct Bioinform 71:45–60 Sulkowska JI, Sulkowski P, Szymczak, P, Cieplak M (2008b) Tightening of knots in proteins. Phys Rev Lett 100:058106 Sulkowska JI, Sulkowski P, Szymczak P, Cieplak M (2008c) Stabilizing effect of knots on proteins. Proc Natl Acad Sci USA 105:19714–19719 Sulkowska JI, Sulkowski P, Onuchic JN (2009a) Dodging the crisis of folding proteins with knots. Proc Natl Acad Sci USA 106:3119–3124 Sulkowska JI, Sulkowski P, Onuchic JN ( 2009b) Jamming proteins with slipknots and their free energy landscape. Phys Rev Lett 103:268103 Szklarczyk O, Staron K, Cieplak M (2009) Native state dynamics and mechanical properties of human topoisomerase I within a structure-based coarse-grained model. Proteins Struct Funct Bioinform 77:420–431 Szymczak P, Cieplak M (2006) Stretching of proteins in a uniform flow. J Chem Phys 125:164903 Szymczak P, Cieplak M (2007a) The slip-length effects in molecular dynamics of beadlike models of proteins. In: Hansmann UHE, Meinke J, Mohanty S, Zimmerman O (eds) Forschungszentrum Juelich Proceedings, NIC workshop 2007 from computational biophysics to systems biology 2007, vol 36, NIC Series, pp 1–7. http://www.fz-juelich.de/nic-series/ volume36/nic-series-volume36.pdf

208

M. Cieplak and J.I. Sułkowska

Szymczak P, Cieplak M (2007b) Influence of hydrodynamic interactions on mechanical unfolding of proteins. J Phys: Condens Matter 19:285224 Szymczak P, Cieplak M (2007c) Proteins in a shear flow. J Chem Phys 127:155106 Takada S (1999) Go-ing for the prediction of protein folding mechanism. Proc Natl Acad Sci USA 96:11698–11700 Tama F, Brooks III CL (2002) The mechanism and pathway of pH induced swelling in cowpea chlorotic mottle virus. J Mol Biol 318:733–747 Tama F, Brooks III CL (2004) Diversity and identity of mechanical properties of icosahedral viral capsids studied with elastic network normal mode analysis. J Mol Biol 345: 299–314 Taylor WR (2000) A deeply knotted protein structure and how it might fold. Nature 406:916–919 Taylor WR, Lin K (2003) Protein knots – a tangled problem. Nature 421:25–25 Taylor WR (2007) Protein knots and fold complexity: some new twists. Comp Biol and Chem 31:151–162 Thompson D (2007) Free energy balance predicates dendrimer binding multivalency at molecular printboards. Langmuir 23:8441–8451 Tozzini V, Trylska J, Chang C, McCammon JA (2007) Flap opening dynamics in HIV-1 protease explored with a coarse-grained model. J Struct Biol 157:606–615 Tsai J, Taylor R, Chothia, C Gerstein M (1999) The packing density in proteins: standard radii and volumes. J Mol Biol 290:253–266 Valbuena A, Oroz J, Hervas R, Vera AM, Rodriguez D. Mendez M, Sulkowska JI, Cieplak M, Carrion-Vazquez M (2009) On the remarkable mechanostability of scaffoldings and the mechanical clamp motif. Proc Natl Acad Sci USA 106:13791–13796 Veitshans T, Klimov D, Thirumalai D (1997) Protein folding kinetics: time scales, pathways and energy landscapes in terms of sequence-dependent properties. Fold Des 2:1–22 Virnau P, Kantor Y, Kardar M (2005) Knots in globule and Coll phases of a model polyethylene. J Am Chem Soc 127:15102 Virnau P, Mirny LA, Kardar M (2006) Intricate knots in proteins: function and evolution. 
PLoS Comp Biol 2:1074–1079 Vliegenthart GA, Gompper G (2006) Mechanical deformation of spherical viruses with icosahedral symmetry. Biophys J 91:834–839 Wagner JR, Brunzelle JS, Forest KT, Vierstra RD (2005) A light-sensing knot revealed by the structure of the chromophore-binding domain of phytochrome. Nature 438:325–331 Wallin S, Zeldovich KB, Shakhnovich EI (2007) The folding mechanics of a knotted protein. J Mol Biol 368:884–893 Wereszczynski J, Andricioaei I (2006) On structural transitions, thermodynamic equilibrium, and the phase diagram of DNA and RNA duplexes under torque and tension, Proc Natl Acad Sci USA 103:16200–16205 Wojciechowski M, Cieplak M (2007) Coarse-grained modelling of pressure related effects in staphylococcal nuclease and ubiquitin. J Phys Cond Matter 19:285218 Yamakawa H (1970) Transport properties of polymer chains in dilute solutions. Hydrodynamic interaction. J Chem Phys 53:436–443 Yang G, Cecconi C, Baase, WA, Vetter IR, Breyer WA, Haack JA, Matthews BW, Dahlquist FW, Bustamante C (2000) Solid-state synthesis and mechanical unfolding of polymers of T4 lysozyme. Proc Natl Acad Sci USA 97:139–144 Zamyatin AA (1972) Protein volume in solution. Prog Biophys Mol Biol 24:107–123 Zink M, Grubmueller H (2009) Mechanical properties of the icosahedral shell of southern bean mosaic virus: a molecular dynamics study. Biophys J 96:1350–1363

Chapter 9

Sampling Protein Energy Landscapes – The Quest for Efficient Algorithms

Ulrich H. E. Hansmann

Abstract Computer simulations aim to become virtual microscopes that can probe the working of cells on a molecular level. One of the remaining obstacles is still poor sampling. This chapter reviews strategies for faster sampling and discusses their limitations. Recent applications to protein folding document the utility of the described techniques.

U.H.E. Hansmann, Department of Physics, Michigan Technological University, Houghton, MI, USA; e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, © Springer Science+Business Media, LLC 2011, DOI 10.1007/978-1-4419-6889-0_9

9.1 Introduction

The goal of computational biophysics is to use molecular dynamics or Monte Carlo simulations as a “virtual microscope” to study processes or molecules in a cell that are not directly accessible in experiments. Even for single, non-interacting globular proteins the numerical difficulties are enormous. The complex form of the intramolecular forces and of the interaction with the surrounding environment, both of which contain repulsive and attractive terms, leads to a rough energy landscape with a huge number of local minima. Simple canonical Monte Carlo or molecular dynamics simulations get trapped in local minima and, when relying on an all-atom model, normally do not thermalize within the available CPU time. Even simulations of “mini-proteins” with fewer than 50 residues are computationally challenging. A common estimate for a typical single-domain protein such as the 153-residue myoglobin is that a single folding trajectory of ≈ $10^{-4}$ s would take about 3 years on a supercomputer capable of trillions of floating point operations per second (Allen et al. 2001). This is because the computational effort needed to calculate physical quantities accurately increases exponentially with the number of residues.

Note that the problem is not necessarily solved by using coarse-grained models. While current all-atom energy functions may introduce additional local minima in the energy landscape, leading to an additional decrease in sampling speed, roughness is an intrinsic characteristic of protein free-energy landscapes. A valid coarse-grained protein model has to lead to a rough energy landscape, and will therefore still suffer from an exponential increase of computational effort with the size of the molecule. The numerical advantage of a coarse-grained model is a smaller prefactor (a much faster evaluation of the energy), and for sufficiently large proteins the problem of poor sampling and slow convergence of the simulation will reappear.

Hence, independent of the energy function used and of the level of detail with which interactions within a protein, and between a protein and its surrounding environment, are described, there is a need for techniques that speed up the sampling of low-energy configurations. The task is to find local minima while avoiding getting trapped in them, so that the simulation can continue to explore the protein energy landscape. Any such approach will change the dynamics, i.e., in most cases one can no longer follow directly the trajectories along which a protein evolves. However, in many cases it is possible to derive enhanced sampling techniques that guarantee correct thermodynamics. In the following we focus on techniques where this is the case, i.e., where at least in principle it is possible to calculate correct thermal averages. Because of their correct sampling properties, these methods also allow one to reconstruct the free-energy landscape (or its projection on a set of reaction coordinates) and, as a consequence, to study the relevant transitions and mechanisms of folding, assembly, and aggregation of a protein.

This chapter is organized as follows: we start with a short review of the basic simulation techniques and common global optimization methods. We then introduce a number of advanced simulation techniques before discussing shortcomings and open problems. A summary and outlook conclude this short review.

9.2 Basic Simulation Techniques

9.2.1 Molecular Dynamics

In principle, one can follow the time evolution of a protein, its folding, assembly, and interaction with other molecules, by numerically solving Newton’s equation of motion for each atom $i$:

$$F_i(x_1(t), \ldots, x_n(t), t) = m_i \frac{d^2 x_i}{dt^2}. \tag{9.1}$$

Here, $x_i(t)$ denotes the coordinates of atom $i$ at time $t$ (i.e., it is a vector), and $v_i(t)$ is the associated velocity of atom $i$ at time $t$. Various schemes exist to integrate these equations of motion numerically. Among the simplest are the Verlet algorithm,

$$x_i(t + \Delta t) = 2 x_i(t) - x_i(t - \Delta t) + \frac{F_i(t)}{m_i} \Delta t^2, \tag{9.2}$$

$$v_i(t) = \frac{x_i(t + \Delta t) - x_i(t - \Delta t)}{2 \Delta t}, \tag{9.3}$$

and the leapfrog algorithm,

$$x_i(t + \Delta t) = x_i(t) + \Delta t \, v_i(t + \Delta t/2), \tag{9.4}$$

$$v_i(t + \Delta t/2) = v_i(t - \Delta t/2) + \Delta t \, \frac{F_i(t)}{m_i}. \tag{9.5}$$
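The two integration schemes of Eqs. (9.2)–(9.5) can be sketched for a single one-dimensional particle as follows. This is only an illustrative sketch: the function names, the harmonic test force, and the way the first step is bootstrapped (a Taylor expansion for $x(-\Delta t)$ and a backward half step for $v(-\Delta t/2)$) are assumptions made here and are not prescribed by the text.

```python
import math


def verlet(force, mass, x0, v0, dt, n_steps):
    """Position Verlet, Eq. (9.2). Returns [x(0), x(dt), ..., x(n_steps*dt)]."""
    # Assumption: the missing start value x(-dt) comes from a Taylor expansion.
    x_prev = x0 - v0 * dt + 0.5 * force(x0) / mass * dt ** 2
    x = x0
    xs = [x0]
    for _ in range(n_steps):
        x_next = 2.0 * x - x_prev + force(x) / mass * dt ** 2  # Eq. (9.2)
        x_prev, x = x, x_next
        xs.append(x)
    return xs


def verlet_velocity(xs, k, dt):
    """Central-difference velocity at step k, Eq. (9.3); needs 0 < k < len(xs) - 1."""
    return (xs[k + 1] - xs[k - 1]) / (2.0 * dt)


def leapfrog(force, mass, x0, v0, dt, n_steps):
    """Leapfrog, Eqs. (9.4)-(9.5). Returns [x(0), x(dt), ..., x(n_steps*dt)]."""
    # Assumption: v(-dt/2) is obtained from a backward half step.
    v_half = v0 - 0.5 * dt * force(x0) / mass
    x = x0
    xs = [x0]
    for _ in range(n_steps):
        v_half = v_half + dt * force(x) / mass  # Eq. (9.5): v(t + dt/2)
        x = x + dt * v_half                     # Eq. (9.4): x(t + dt)
        xs.append(x)
    return xs


if __name__ == "__main__":
    # Hypothetical test system: harmonic oscillator with F = -x and m = 1,
    # whose exact solution for x(0) = 1, v(0) = 0 is x(t) = cos(t).
    harmonic_force = lambda x: -x
    xs_verlet = verlet(harmonic_force, 1.0, 1.0, 0.0, 0.01, 1000)
    xs_leapfrog = leapfrog(harmonic_force, 1.0, 1.0, 0.0, 0.01, 1000)
    print(xs_verlet[-1], xs_leapfrog[-1], math.cos(10.0))
```

With these start-up choices the two schemes generate the same positions up to round-off, which reflects the fact that leapfrog is an algebraic rearrangement of the Verlet recursion; they differ in which velocities are available (on-step for Verlet via Eq. (9.3), half-step for leapfrog).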

Besides directly probing the kinetics of proteins, such molecular dynamics simulations also allow one to calculate equilibrium properties by computing time averages. Solving the equations of motion requires knowledge (i.e., calculation) of the forces $F_i$ at time $t$. In most cases, these forces do not depend explicitly on time, so that the total energy

$$E = E_{pot}(x_i) + E_{kin}(v_i) \quad \text{with} \quad E_{kin} = \sum_i \frac{1}{2} m_i v_i^2 \tag{9.6}$$

is conserved and the forces can be written as

$$F_i(x_1(t), \ldots, x_n(t)) = -\frac{\partial}{\partial x_i} E_{pot}(x_1, \ldots, x_n). \tag{9.7}$$

Hence, molecular dynamics requires knowledge of both the potentials and their derivatives. As the total energy $E$ is constant, a molecular dynamics simulation of the above kind realizes a microcanonical ensemble. However, most experimental settings realize a canonical ensemble (i.e., not the energy $E$ but the temperature $T$ is held fixed). This requires coupling the system to a thermostat, as described, for instance, in Frenkel and Smit (2001). Note that the temperature and the kinetic energy of the system are related by

$$\langle E_{kin}(t) \rangle = \left\langle \sum_i \frac{1}{2} m_i v_i^2(t) \right\rangle = \frac{N}{2} k_B T, \tag{9.8}$$

where $\langle \cdot \rangle$ denotes a time average, $k_B$ is the Boltzmann constant, $N$ is the number of degrees of freedom, and $T$ is the temperature of the system.
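Equation (9.8) is what a simulation code uses in practice to report a temperature. The sketch below, in reduced units with $k_B = 1$ by default, estimates an instantaneous kinetic temperature from a set of velocities and draws Maxwell-Boltzmann velocities to check the estimate. The function names are hypothetical, and taking $N = 3 \times$ (number of atoms) ignores constraints and the removal of centre-of-mass motion, which is an assumption made for simplicity.

```python
import math
import random


def kinetic_energy(masses, velocities):
    """E_kin = sum_i (1/2) m_i v_i^2 for velocities given as 3-component tuples."""
    return sum(0.5 * m * sum(c * c for c in v) for m, v in zip(masses, velocities))


def instantaneous_temperature(masses, velocities, k_b=1.0, n_dof=None):
    """Invert Eq. (9.8) for a single snapshot: T = 2 E_kin / (N k_B)."""
    if n_dof is None:
        n_dof = 3 * len(masses)  # assumption: no constraints, no removed modes
    return 2.0 * kinetic_energy(masses, velocities) / (n_dof * k_b)


def maxwell_boltzmann_velocities(masses, temperature, rng, k_b=1.0):
    """Each Cartesian component is Gaussian with variance k_B T / m_i."""
    velocities = []
    for m in masses:
        sigma = math.sqrt(k_b * temperature / m)
        velocities.append(tuple(rng.gauss(0.0, sigma) for _ in range(3)))
    return velocities


if __name__ == "__main__":
    rng = random.Random(0)
    masses = [1.0] * 20000
    velocities = maxwell_boltzmann_velocities(masses, 2.0, rng)
    # The estimate fluctuates around the target temperature of 2.0.
    print(instantaneous_temperature(masses, velocities))
```

In a production run the time average in Eq. (9.8) would be taken over many such snapshots; a single snapshot fluctuates around $T$ with a relative spread of order $\sqrt{2/N}$.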

9.2.2 Monte Carlo

A canonical ensemble at a temperature $T$ is more easily realized in a Monte Carlo simulation than with molecular dynamics. This is because in Monte Carlo, trial moves are generated randomly and accepted or rejected according to the Boltzmann weight $\exp(-E/k_B T)$. For instance, in the Metropolis algorithm one calculates

$$w(c \to c') = \begin{cases} 1 & \Delta E < 0 \\ e^{-\Delta E/k_B T} & \text{otherwise} \end{cases} \tag{9.9}$$

with $\Delta E = E(c') - E(c)$, where $E(c)$ is the energy of the current configuration $c$ and $E(c')$ that of the trial configuration $c'$. The trial configuration replaces the current one if $w(c \to c') \geq R$, where the random number $R$ takes values between 0 and 1. As the Metropolis algorithm ensures detailed balance and each configuration can be reached in a finite number of steps (ergodicity), the resulting Markov process converges to the distribution of protein configurations that corresponds to the canonical ensemble at temperature $T$. Thermodynamic quantities $\langle O \rangle$ are then calculated by computing averages over the sampled conformations:

$$\langle O \rangle = \frac{1}{M} \sum_{k=1}^{M} O_k, \tag{9.10}$$

where $O_k$ is the value measured for the quantity $O$ in configuration $k$, and $M$ is the number of measurements. This average approximates the ensemble average

$$\langle O \rangle = \frac{\int dx_i \, dv_i \, O(x_i) \, e^{-E(x_i, v_i)/k_B T}}{\int dx_i \, dv_i \, e^{-E(x_i, v_i)/k_B T}}. \tag{9.11}$$

As $E = E_{pot}(x_i) + E_{kin}(v_i)$ and $E_{kin} = \frac{1}{2} \sum_i m_i v_i^2$, it is possible to integrate out the velocities, and

$$\langle O \rangle = \frac{\int dx_i \, O(x_i) \, e^{-E_{pot}(x_i)/k_B T}}{\int dx_i \, e^{-E_{pot}(x_i)/k_B T}}. \tag{9.12}$$

As a consequence, for the generation of configurations by way of the Metropolis algorithm (Eq. (9.9)) one needs to calculate only the difference of the potential energies $E_{pot}$. For this reason, we will in most cases simply write $E$ when only the potential energy $E_{pot}$ is relevant. Note also that Monte Carlo does not require the calculation of derivatives, which reduces the numerical workload. As the configurations are drawn randomly in Monte Carlo, it is not possible to follow the trajectory of a protein, and therefore Monte Carlo, unlike molecular dynamics, is not suitable for probing the kinetics of folding. On the other hand, Monte Carlo allows one to sample the configurational space much faster by utilizing artificial but fast move sets. These are often necessary because in the canonical ensemble the crossing of an energy barrier of height $\Delta E$ is suppressed by a factor $\propto \exp(-\Delta E/k_B T)$. This is the reason for the multiple minima problem and the resulting slowing down of protein simulations discussed in the introduction.
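The Metropolis rule of Eq. (9.9) and the sample average of Eq. (9.10) can be illustrated with a minimal sketch for a single coordinate. The uniform displacement move, the step size, and the harmonic test potential are assumptions chosen for illustration only; a protein simulation would use a force field and a move set in dihedral or Cartesian space.

```python
import math
import random


def metropolis(energy, x0, kT, step, n_steps, rng):
    """Sample a 1D coordinate from exp(-E/kT) with the rule of Eq. (9.9)."""
    x = x0
    e = energy(x)
    samples = []
    for _ in range(n_steps):
        x_trial = x + rng.uniform(-step, step)   # symmetric trial move (assumption)
        e_trial = energy(x_trial)
        delta_e = e_trial - e                    # only the energy difference enters
        w = 1.0 if delta_e < 0.0 else math.exp(-delta_e / kT)
        if w >= rng.random():                    # accept if w >= R, with R in [0, 1)
            x, e = x_trial, e_trial
        samples.append(x)                        # rejected moves recount the old state
    return samples


def thermal_average(observable, samples):
    """Eq. (9.10): arithmetic mean of the observable over the sampled states."""
    return sum(observable(x) for x in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical test case: E(x) = x^2 / 2 at kT = 1, for which <x^2> = 1 exactly.
    samples = metropolis(lambda x: 0.5 * x * x, 0.0, 1.0, 1.5, 200000, rng)
    print(thermal_average(lambda x: x * x, samples))
```

Note that the current configuration is appended to the sample list again whenever a trial move is rejected; dropping rejected steps from the average would bias the estimate away from the canonical distribution.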


9.2.3 Optimization Techniques

Most proteins are thermodynamically stable at room temperature (Anfinsen 1973). This implies that the biologically active configuration is the global minimum in free energy at $T \approx 270$–$300$ K. For many proteins, this state is unique up to oscillations around a fixed structure. For this reason, one can identify the global minimum in free energy with that in potential energy, which reduces the prediction of protein structures to a global optimization problem.

While deterministic methods (for instance, the αBB algorithm (Androulakis et al. 1997)) have many conceptual advantages, stochastic algorithms are often faster and easier to implement. Take as an example simulated annealing (Kirkpatrick et al. 1983), which is inspired by the crystal growth process and realized by gradually decreasing the temperature in a Monte Carlo or molecular dynamics program. While only a logarithmic annealing schedule ensures that the simulation finds the global minimum (Geman and Geman 1984), limitations in available computer resources require faster annealing schedules for which success is no longer guaranteed. Still, because of its simplicity, simulated annealing is often the first choice in protein optimization problems.

Genetic algorithms (Holland 1975) and Monte Carlo minimization (Li and Scheraga 1987) are two other commonly used stochastic optimization techniques. Like simulated annealing, they try to avoid entrapment in local minima and continue to search for further solutions. This is a general characteristic of successful optimization techniques. For instance, in tabu search (Cvijovic and Klinowski 1995) the system is guided away from previously explored areas. This can result in slow convergence, as the method does not distinguish between important and unimportant regions of the landscape. A somewhat opposite approach (Besold et al. 1999; Wenzel and Hamacher 1999) aims at transforming the original energy landscape into a funnel landscape, where convergence toward the global minimum is fast. However, many landscape-deformation methods are hampered either by the required fine tuning or a priori information, or by difficulties with connecting back to the original landscape. Often, minima on the deformed surface are displaced or merged.

The latter problem is avoided in energy landscape paving (ELP) (Hansmann and Wille 2002), which merges ideas from tabu search with energy landscape deformation. In ELP, low-temperature Monte Carlo simulations utilize an effective energy:

$$w(\tilde{E}) = e^{-\tilde{E}/k_B T} \quad \text{with} \quad \tilde{E} = E + f(H(q, t)). \tag{9.13}$$

Here, $T$ is a (low) temperature and $f(H(q, t))$ is a function of the histogram $H(q, t)$ in a pre-chosen “order parameter” or “reaction coordinate” $q$. The weight of a local minimum state decreases with the time the system stays in that state, i.e., ELP deforms the energy landscape locally until the local minimum is no longer favored and the system explores higher energies. It will then either fall into a new local minimum or walk through this high-energy region until the corresponding histogram entries all have similar frequencies and the system again has a bias toward low energies. Since the weight factor is time dependent, ELP violates detailed balance. Hence, the method cannot be used to calculate thermodynamic averages. Note, however, that for $f(H(q, t)) = f(H(q))$ detailed balance is fulfilled, and ELP reduces to the generalized-ensemble methods (Hansmann and Okamoto 1998) discussed in the following section. We have evaluated the efficiency of ELP in simulations of the 20-residue trp-cage protein, whose structure we could “predict” within a root-mean-square deviation (rmsd) of 1 Å (Schug et al. 2005).

Energy landscape paving also allows for zero-temperature simulations (Schug et al. 2005). For $T \to 0$ only moves with $\Delta \tilde{E} \leq 0$ are accepted. If one chooses $\tilde{E} = E + cH(q, t)$, the acceptance criterion is given by

$$\Delta E + c \Delta H(q, t) \leq 0 \;\Leftrightarrow\; c \Delta H(q, t) \leq -\Delta E, \tag{9.14}$$

where $E$ is the “physical” energy. Hence, energy landscape paving can overcome any energy barrier even at $T = 0$. The waiting time for such a move is proportional to the height of the barrier that needs to be crossed. The factor $c$ sets the timescale, and in this sense the $T = 0$ form of ELP is parameter-free.
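The paving idea of Eq. (9.13) can be sketched on a toy landscape. The example below is a simplified illustration under several assumptions that are not taken from the text: the order parameter $q$ is the energy itself (collected in bins of fixed width), the bias is the linear choice $f(H) = cH$, and the landscape is a hypothetical tilted double well whose local minimum near $x \approx +1$ is separated by a barrier from the global minimum near $x \approx -1$.

```python
import math
import random


def tilted_double_well(x):
    """Hypothetical 1D landscape: local minimum near x = +1, global near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def elp_search(energy, x0, kT, step, n_steps, c, bin_width, rng):
    """Low-temperature Monte Carlo on E~ = E + c*H(E, t), cf. Eq. (9.13)."""
    histogram = {}                                   # H(q, t) with q = energy bin

    def energy_bin(e):
        return math.floor(e / bin_width)

    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_trial = x + rng.uniform(-step, step)
        e_trial = energy(x_trial)
        # Effective energies include the time-dependent histogram penalty.
        eff_old = e + c * histogram.get(energy_bin(e), 0)
        eff_new = e_trial + c * histogram.get(energy_bin(e_trial), 0)
        delta = eff_new - eff_old
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            x, e = x_trial, e_trial
        # Pave: every step raises the penalty of the bin the walker occupies.
        b = energy_bin(e)
        histogram[b] = histogram.get(b, 0) + 1
        if e < best_e:                               # track the lowest physical energy
            best_x, best_e = x, e
    return best_x, best_e


if __name__ == "__main__":
    rng = random.Random(2)
    # Start in the local minimum; kT is small compared with the barrier height.
    print(elp_search(tilted_double_well, 1.0, 0.05, 0.2, 50000, 0.02, 0.05, rng))
```

Because the histogram grows with simulation time, the acceptance rule is not stationary, so the visited configurations must not be used for thermal averages; only the lowest physical energy encountered is recorded, in line with the use of ELP as an optimization method.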

9.3 Advanced Simulation Techniques

Determining the structure of proteins through global optimization assumes the existence of a cost function whose global minimum describes the native structure. In most cases this is an energy that describes the physical interactions within a protein and between the protein and the surrounding environment, usually water. Since neither the available force fields nor the inclusion of solvation effects are perfect, it is not certain that the folded structure (as determined by X-ray or NMR experiments) corresponds to the global minimum conformation. Hence, the accuracy of the force fields sets a limit on any global optimization approach to structure prediction of proteins. Global optimization techniques are also not suitable for investigations of the folding mechanism, the change in shape when interacting with other molecules, or the appearance of misfolded structures. As with structure prediction, it is necessary to go beyond global optimization techniques and to measure thermodynamic quantities, i.e., to sample a set of configurations from a canonical ensemble and take an average of the chosen quantity over this ensemble. In principle, this is possible with molecular dynamics and Monte Carlo simulations; however, as argued earlier in this review, it requires strategies that lead to a faster sampling of low-energy configurations.

9.3.1 Unfolding Simulations

The poor sampling of protein configurations at physiologically relevant temperatures results from their rough energy landscape, where the crossing of barriers of height $\Delta E$ is suppressed by a factor $e^{-\Delta E/k_B T}$. Hence, increasing the temperature $T$ makes it easier for a protein to cross energy barriers. This can be used to induce the thermal unfolding of a protein. Such unfolding simulations at high temperature are sometimes interpreted as reversed-in-time folding (Daggett and Fersht 2003; Daggett 2002). This approach has been used in the past with some success (Daggett and Fersht 2003; Daggett 2002), but it is not clear whether it is justified in general in protein simulations. We have recently demonstrated that the C-fragment of Top 7, which we call CFr, folds by a non-trivial pathway that involves caching of an N-terminal segment in an adjunct helix. Only when all other parts of the protein are folded and in place does the N-terminal segment unfold and re-fold to a strand that completes the final structure in a three-stranded sheet. We found that this folding mechanism cannot be inferred from unfolding simulations at high temperatures. In fact, the interpretation of unfolding data in Mohanty and Hansmann (2008) as folding in reversed time would miss the caching mechanism that governs folding of this protein. Likely, such an interpretation is restricted to simple two-state folders and associated with a nucleation mechanism, as observed, for instance, for CI2 (Daggett and Fersht 2003; Daggett 2002).
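The temperature dependence of the barrier-crossing factor quoted at the start of this section can be made concrete with a short calculation. The barrier height of 5 kcal/mol and the two temperatures used below are purely illustrative assumptions; only the functional form $e^{-\Delta E/k_B T}$ is taken from the text.

```python
import math

# Boltzmann constant per mole (gas constant) in kcal/(mol K).
K_B = 1.987204e-3


def boltzmann_factor(barrier, temperature, k_b=K_B):
    """Suppression factor exp(-dE / (k_B T)) for a barrier in kcal/mol."""
    return math.exp(-barrier / (k_b * temperature))


if __name__ == "__main__":
    barrier = 5.0  # hypothetical barrier height in kcal/mol
    for temperature in (300.0, 500.0):
        print(temperature, boltzmann_factor(barrier, temperature))
    # Ratio of crossing factors between the high and the low temperature.
    print(boltzmann_factor(barrier, 500.0) / boltzmann_factor(barrier, 300.0))
```

For this hypothetical barrier, heating from 300 K to 500 K raises the crossing factor by more than an order of magnitude, which is why high-temperature runs unfold a protein on accessible timescales, while the relative weights of competing pathways change at the same time.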

9.3.2 Advanced Updates

A possible strategy to increase sampling of relevant protein configurations is the use of improved updates. Within the context of molecular dynamics these are techniques that guide the simulation and/or allow for larger time steps in the integrator. In the context of Monte Carlo these are usually collective moves that lead to a larger change in configurations. Examples are the re-bridging scheme (Gō and Scheraga 1970; Wu and Deem 1999) and the biased Gaussian step method (Favrin et al. 2001). In hybrid Monte Carlo (Duane et al. 1987; Brass et al. 1993) a short molecular dynamics run is used as a collective move to provide a trial configuration, which is then accepted or rejected according to the Metropolis criterion. This allows one to follow a trajectory over a long time with a large step size, because the Metropolis step corrects for the discretization errors in the molecular dynamics run. A general problem with all improved updates is that they depend strongly on the chosen model and are often not known a priori. A collective move that avoids this pitfall has been recently proposed by Berg (2003) under the name Rugged Metropolis (RM). The idea is to bias a Monte Carlo simulation by using information from a simulation at a higher temperature. Assume a range of temperatures

$$T_1 > T_2 > \dots > T_r > \dots > T_{f-1} > T_f . \tag{9.15}$$

The simulation at the highest temperature, $T_1$, is performed with the usual Metropolis algorithm, and the results are used to construct an estimator $\bar\rho(x_1,\dots,x_n;T_1)$ of the probability density function that biases the simulation at $T_2$. In turn, this simulation provides a bias for the one at $T_3$, and the procedure is continued iteratively down to $T_f$. For this purpose, Berg assumes the approximation

$$\bar\rho(x_1,\dots,x_n;T_r) = \prod_{i=1}^{n} \bar\rho_i^1(x_i;T_r), \tag{9.16}$$

where $\bar\rho_i^1(x_i;T_r)$ are estimators of the reduced one-variable probability densities

$$\rho_i^1(x_i;T) = \int \prod_{j\neq i} dx_j\, \rho(x_1,\dots,x_n;T). \tag{9.17}$$

Recursively, the estimated probability density function $\bar\rho(x_1,\dots,x_n;T_{r-1})$ is generated as an approximation of $\rho(x_1,\dots,x_n;T_r)$. The acceptance step in the (biased) Metropolis procedure at temperature $T_r$ is now given by

$$P_{RM} = \min\left[1, \frac{\exp(-\beta E')\,\bar\rho(x_1,\dots,x_n;T_{r-1})}{\exp(-\beta E)\,\bar\rho(x'_1,\dots,x'_n;T_{r-1})}\right]. \tag{9.18}$$

Here, primed quantities refer to the proposed configuration.

Rugged Metropolis has been tested successfully in simulations of small peptides. However, as with other improved updates, the gain in efficiency is by itself not enough to make folding simulations of protein domains (usually consisting of 50–200 residues) feasible. On the other hand, improved updates are very useful when combined with the other techniques that we describe in the following subsections.
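
As an illustration of the Rugged Metropolis acceptance step of Eq. (9.18), the following minimal Python sketch combines the product approximation of Eq. (9.16) with the biased Metropolis test. It assumes that trial configurations are drawn from the estimated density obtained at the next-higher temperature; the function names and the representation of the one-variable estimators as callables are hypothetical choices made for this example.

```python
import math


def product_density(x, marginals):
    """Product approximation of Eq. (9.16).

    `marginals` is a list of callables (an assumption of this sketch), one
    estimator of a one-variable density per coordinate.
    """
    value = 1.0
    for x_i, rho_i in zip(x, marginals):
        value *= rho_i(x_i)
    return value


def rm_acceptance(beta, e_old, e_new, rho_old, rho_new):
    """Acceptance probability of Eq. (9.18).

    rho_old and rho_new are the estimated densities (taken from the
    simulation at T_{r-1}) of the current and the proposed configuration.
    """
    ratio = math.exp(-beta * (e_new - e_old)) * rho_old / rho_new
    return min(1.0, ratio)


# Usage with two hypothetical, flat one-variable estimators
marginals = [lambda v: 2.0, lambda v: 3.0]
rho_current = product_density([0.1, 0.2], marginals)
rho_trial = product_density([0.3, 0.4], marginals)
print(rm_acceptance(1.0, 0.0, 1.0, rho_current, rho_trial))
```

With flat estimators the density ratio is one and the expression reduces to the ordinary Metropolis criterion, which is a convenient sanity check for an implementation.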

9.3.3 Generalized-Ensemble Techniques

A very successful approach for improving the sampling of low-energy protein configurations is the generalized-ensemble approach. Its underlying idea is not to sample the canonical ensemble directly but rather an artificial ensemble tailored to enable an efficient search for local minima while at the same time avoiding entrapment. These generalized ensembles are defined in such a way that re-weighting techniques allow one to connect back to the canonical (i.e., physical) ensemble and to calculate thermodynamic averages at temperatures of interest (Hansmann 2003). A great number of such ensembles have been proposed, and while not all of them can be discussed in this review, we can classify them in principle according to whether they are generated by a random walk through order parameter space (for instance, energy), control parameter space (temperature), or model space (i.e., different energy functions).

9.3.3.1 Random Walks in Order Parameter Space

In generalized ensembles that are defined by random walks in order parameter space, one requires that a Monte Carlo or molecular dynamics simulation leads to a broad distribution of a pre-chosen physical quantity. This allows one to sample both low- and high-energy states with sufficient probability. For simplicity we will consider only ensembles that lead to flat distributions in one variable. Extensions to higher dimensional generalized ensembles are straightforward (Kumar et al. 1996). Probably the earliest realization of this idea is umbrella sampling (Torrie and Valleau 1977), but multicanonical sampling (Berg and Neuhaus 1991) is now more common. The first application of these techniques to protein simulations can be found in Hansmann and Okamoto (1993), where a Monte Carlo technique was used. Later, the approach was also adapted to molecular dynamics (Hansmann et al. 1996). The idea is to assign configurations with energy $E$ a weight $w_{mu}(E)$ such that the distribution of energies is flat,

$$P_{mu}(E) \propto n(E)\,w_{mu}(E) = \text{const}, \tag{9.19}$$

where $n(E)$ is the spectral density. Since all energies appear with equal probability, a free random walk in energy space is enforced: the simulation can overcome any energy barrier and will not get trapped in one of the many local minima. Since a large range of energies is sampled, it is now possible to obtain a canonical distribution for a wide range of temperatures by re-weighting techniques (Ferrenberg and Swendsen 1988):

$$P_B(T,E) \propto P_{mu}(E)\,w_{mu}^{-1}(E)\,e^{-\beta E}. \tag{9.20}$$

This allows one to calculate the expectation value of any physical quantity $O$ at temperature $T$ by

$$\langle O\rangle_T = \frac{\int dE\; O(E)\,P_B(T,E)}{\int dE\; P_B(T,E)}. \tag{9.21}$$
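
The re-weighting of Eqs. (9.20) and (9.21) can be sketched in a few lines of Python. The example below is only a sketch under stated assumptions: the energy axis is discretized into bins, the multicanonical histogram and the logarithms of the weights are given as lists, and the toy density of states used in the usage example is hypothetical.

```python
import math


def reweight_average(energies, histogram, log_weights, observable, beta):
    """Canonical average of O(E) at inverse temperature beta, Eqs. (9.20)-(9.21).

    energies, histogram, log_weights and observable are lists over energy
    bins; log_weights holds ln w_mu(E). Empty bins are skipped.
    """
    # Logarithm of the unnormalised canonical distribution P_B(T, E) per bin
    log_p = [math.log(h) - lw - beta * e
             for e, h, lw in zip(energies, histogram, log_weights) if h > 0]
    values = [o for o, h in zip(observable, histogram) if h > 0]
    shift = max(log_p)  # subtract the maximum to avoid overflow
    p = [math.exp(lp - shift) for lp in log_p]
    return sum(o * w for o, w in zip(values, p)) / sum(p)


# Toy system (assumption): energies 0, 1, 2 with density of states 1, 2, 1.
energies = [0.0, 1.0, 2.0]
log_weights = [0.0, -math.log(2.0), 0.0]  # ideal weights w_mu = 1 / n(E)
histogram = [100, 100, 100]               # flat multicanonical histogram
print(reweight_average(energies, histogram, log_weights, energies, beta=0.7))
```

For this toy density of states the exact canonical mean energy is $2/(e^{\beta}+1)$, so the sketch can be checked against a closed-form result.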

The price for the resulting improved sampling is that (unlike in the canonical ensemble) the weights $w_{mu}(E) \propto n^{-1}(E)$ are not known a priori (in fact, knowledge of the exact weights is equivalent to obtaining the density of states $n(E)$, i.e., solving the system), and one needs estimates of them for a numerical simulation. Calculation of the weights is usually done by an iterative procedure (Berg 2004; Hansmann and Okamoto 1993, 1994). Another efficient recursion is the so-called Wang–Landau sampling (Wang and Landau 2001), where one performs updates with estimators $\bar n(E)$ of the density of states,

$$p(E_1 \to E_2) = \min\left[\frac{\bar n(E_1)}{\bar n(E_2)}, 1\right]. \tag{9.22}$$

Each time an energy level is visited, the estimator is updated according to

$$\bar n(E) \to \bar n(E)\, f, \tag{9.23}$$

where, initially, $\bar n(E) = 1$ and $f = f_0 = e^1$. Once the desired energy range is covered, the factor $f$ is refined,

$$f_1 = \sqrt{f_0}, \qquad f_{n+1} = \sqrt{f_n}, \tag{9.24}$$

until $f$ is sufficiently close to 1, i.e., until $\ln f$ has reached some small value.

In multicanonical simulations the computational effort increases with the number of residues $N$ like $\approx N^4$ (when measured in Metropolis updates) (Hansmann and Okamoto 1999b). In general, the computational effort in simulations increases like $\approx X^2$, where $X$ is the variable in which one wants a flat distribution. This is because generalized-ensemble simulations realize, by construction of the ensemble, a 1D random walk in the chosen quantity $X$. In the multicanonical algorithm the reaction coordinate $X$ is the potential energy, $X = E$. Since $E \propto N^2$, the above scaling relation $\approx N^4$ for the computational effort is recovered. Hence, multicanonical sampling is not always the optimal generalized-ensemble algorithm in protein simulations. A better scaling of the computer time with the size of the molecule may be obtained by choosing a more appropriate reaction coordinate for the ensemble than the energy. This is the motivation behind the various other realizations of the generalized-ensemble approach that exist. All aim at sampling a broad range of energies. In this way the simulation will overcome energy barriers and escape from local minima. For instance, in Hansmann and Okamoto (1999a) it was proposed that configurations are updated according to a special choice of the Tsallis generalized mechanics formalism (Curado and Tsallis 1994) (the Tsallis parameter $q$ is chosen as $q = 1 + 1/n_F$):

$$w(E) = \left(1 + \frac{\beta(E - E_0)}{n_F}\right)^{-n_F}. \tag{9.25}$$

Here $E_0$ is an estimator for the ground-state energy and $n_F$ is the number of degrees of freedom of the system. The weight reduces in the low-energy region to the canonical Boltzmann weight $\exp(-\beta E)$. This is because $E - E_0 \to 0$ for $\beta \to 0$ leading to $\beta(E - E_0)/n_F \ll 1$. On the other hand, high-energy regions are no longer exponentially suppressed but only according to a power law, which enhances excursions to high-energy regions. In stochastic tunneling (Wenzel and Hamacher 1999), conformations enter with a weight $w(E) = \exp(f(E)/k_B T)$. Here, $f(E)$ is a non-linear transformation of the potential energy onto the interval [0,1] and $T$ is a low temperature. The physical idea behind such an approach is to allow the system to “tunnel” through energy barriers in the potential energy surface (Wenzel and Hamacher 1999). Such a transformation can be realized by

$$f(E) = e^{-(E - E_0)/n_F}, \tag{9.26}$$

where $E_0$ is again an estimate for the ground state and $n_F$ the number of degrees of freedom of the system. Note that the location of all minima is preserved. Hence, at a given low temperature $T$, the simulation can pass through energy barriers of arbitrary height, while the low-energy region is still well resolved. The efficiency of this algorithm for protein-folding simulations was demonstrated in Hansmann (1999). In both ensembles a broad range of energies is sampled. Hence, one can again use re-weighting techniques (Ferrenberg and Swendsen 1988) to calculate thermodynamic quantities over a large range of temperatures. In contrast to other generalized-ensemble techniques, the weights are given explicitly for both new ensembles. One needs only to find an estimator for the ground-state energy $E_0$, which is easier than the determination of weights for other generalized ensembles.

In the context of molecular dynamics the generalized-ensemble idea is utilized in the metadynamics method, where Gaussian-shaped repulsive potentials

$$U_{bias}(s,t) = \sum_{t_i} h \exp\left(-\frac{|s - s(t_i)|^2}{2w^2}\right)$$

are added iteratively to the energy function. These potentials are centered at updated points $s(t_i)$ of the reaction coordinates in order to discourage the system from revisiting configurations (Laio and Parrinello 2002). The overall contribution from these auxiliary potentials flattens the underlying curvature of the free-energy wells, therefore leading to a random walk. The original free-energy potentials are recovered by $-U_{bias}(s,t)$.

Another variant is simulated scaling, where one assumes a system with potential $U_0 = U_s + U_e$. The potential $U_s$ represents the energy terms determining local conformations in a region of interest and $U_e$ the remaining environmental energy terms. In order to accelerate sampling, one builds an expanded ensemble with the scaled potential $U = \lambda_m U_s + U_e$ and re-writes the scaled energy function in the dual-topology hybrid potential form that is usually utilized in free-energy simulations,

$$U = (1-\lambda_m)\,U_s^A(x) + \lambda_m\,U_s^B(x') + U_e. \tag{9.27}$$

Here, $U_s^A(x)$ and $U_s^B(x')$ represent the unique portions of the energy terms for the two end-point chemical species A and B. When the $\lambda_m$ histogram is flattened, the free energy difference between any two $\lambda_m$ states is calculated according to

$$\Delta A(\lambda_0 \to \lambda_1) = -RT\,[a(\lambda_1) - a(\lambda_0)] = -RT \ln\frac{f(\lambda_1)}{f(\lambda_0)}. \tag{9.28}$$

Here, $a(\lambda_m)$ and $f(\lambda_m)$ represent the weight function and biasing function values.
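
The Gaussian bias potential used in metadynamics, as described above, can be illustrated with a short Python sketch. Note the assumptions: the text describes metadynamics in the context of molecular dynamics, whereas this example uses a simple Metropolis walk in one reaction coordinate as a stand-in, and the double-well surface, hill height, hill width, and deposition stride are hypothetical values chosen only for illustration.

```python
import math
import random


def bias_potential(s, centers, height, width):
    """U_bias(s, t): sum of Gaussian hills of height h and width w centred at s(t_i)."""
    return sum(height * math.exp(-(s - c) ** 2 / (2.0 * width ** 2)) for c in centers)


def run_metadynamics(energy, n_steps, stride, height, width, beta, step=0.2, seed=0):
    """Metropolis walk on energy(s) + U_bias(s, t); a hill is deposited every `stride` steps."""
    rng = random.Random(seed)
    s = -1.0  # start in the left well of the toy surface
    centers = []
    for t in range(n_steps):
        trial = s + rng.uniform(-step, step)
        delta = (energy(trial) + bias_potential(trial, centers, height, width)
                 - energy(s) - bias_potential(s, centers, height, width))
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            s = trial
        if (t + 1) % stride == 0:
            centers.append(s)  # discourage revisiting the current point
    return centers


def double_well(s):
    """Hypothetical toy free-energy surface with minima at s = -1 and s = +1."""
    return 5.0 * (s * s - 1.0) ** 2


hills = run_metadynamics(double_well, n_steps=2000, stride=20,
                         height=0.2, width=0.2, beta=4.0)
# Estimate of the free-energy profile (up to a constant) as the negative bias
profile = [(-1.5 + 0.5 * k, -bias_potential(-1.5 + 0.5 * k, hills, 0.2, 0.2))
           for k in range(7)]
```

The list `profile` holds the negative of the accumulated bias on a coarse grid, which in this picture serves as the estimate of the underlying free-energy surface.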

Metadynamics-based methods are designed to enhance the crossing of energy barriers by flattening the energy surface. This has the undesired side effect of enlarging the conformation space that needs to be searched in a diffusive motion. As a consequence, the discovery of low-energy configurations becomes a rare event. This diffusion problem decreases sampling efficiency, and it grows rapidly with the size of the molecule. It also impedes variants of the generalized-ensemble approach that aim at flat distributions in two or more dimensions.

9.3.3.2 Random Walks in Control Parameter Space

The most appropriate generalized ensemble for a protein simulation is not necessarily the one generated by a random walk in a certain order parameter or reaction coordinate. Instead, it is often more efficient to enforce a random walk in a control parameter, for instance, temperature. One often-used example is simulated tempering (Lyubartsev et al. 1992; Marinari and Parisi 1992), where the temperature itself becomes a dynamic variable and is sampled uniformly. Temperature and configuration are both updated with a weight

$$w_{ST}(T,E) = e^{-E/k_B T - g(T)}. \tag{9.29}$$

Here, the function $g(T)$ is chosen so that the probability distribution of temperature is given by

$$P_{ST}(T) = \int dE\; n(E)\,e^{-E/k_B T - g(T)} = \text{const}. \tag{9.30}$$

Physical quantities have to be sampled for each temperature point separately, and expectation values at intermediate temperatures are calculated by re-weighting techniques (Ferrenberg and Swendsen 1988). As is common in generalized-ensemble simulations, the weight $w_{ST}(T,E)$ is not known a priori (since it requires knowledge of the parameters $g(T)$), and an estimator has to be calculated. It can again be obtained by an iterative procedure. In the simplest version the improved estimator $g^{(i)}(T)$ for the $i$th iteration is calculated from the histogram of the temperature distribution $H_{ST}^{(i-1)}(T)$ of the preceding simulation as follows:

$$g^{(i)}(T) = g^{(i-1)}(T) + \log H_{ST}^{(i-1)}(T). \tag{9.31}$$

In this procedure one uses that the histogram of the $i$th iteration is given by

$$H_{ST}(T) = e^{-g_{i-1}(T)}\,Z_i(T), \tag{9.32}$$

where $Z_i(T) = \int dE\; n(E)\exp(-E/k_B T)$ is an estimate for the canonical partition function at temperature $T$. Setting $\exp(g_i(T)) = Z_i(T)$ leads to the iterative relationship of Eq. (9.31).
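
The recursion of Eq. (9.31) can be checked on a toy system for which the temperature distribution of Eq. (9.30) is known exactly. The following Python sketch is written under simplifying assumptions: the density of states is a hypothetical three-level example, units are chosen such that $k_B = 1$, and the "histogram" is the exact, infinite-statistics temperature distribution rather than one measured in a finite simulation.

```python
import math


def partition_function(energies, degeneracies, temperature):
    """Z(T) = sum_E n(E) exp(-E / T) with k_B = 1 (assumption of this sketch)."""
    return sum(n * math.exp(-e / temperature)
               for e, n in zip(energies, degeneracies))


def temperature_histogram(energies, degeneracies, temperatures, g):
    """Normalised temperature distribution of Eq. (9.30) for given parameters g(T)."""
    p = [partition_function(energies, degeneracies, t) * math.exp(-g_t)
         for t, g_t in zip(temperatures, g)]
    total = sum(p)
    return [x / total for x in p]


def update_g(g, histogram):
    """One iteration of Eq. (9.31): g_new(T) = g_old(T) + log H(T)."""
    return [g_t + math.log(h) for g_t, h in zip(g, histogram)]


# Hypothetical toy system and temperature set
energies = [0.0, 1.0, 2.0]
degeneracies = [1, 2, 1]
temperatures = [0.5, 1.0, 2.0]

g0 = [0.0, 0.0, 0.0]
h0 = temperature_histogram(energies, degeneracies, temperatures, g0)
g1 = update_g(g0, h0)
h1 = temperature_histogram(energies, degeneracies, temperatures, g1)
print(h0)  # non-uniform: high temperatures dominate
print(h1)  # uniform after one exact update
```

With exact histograms a single update already yields a flat temperature distribution; in a real simulation the histograms are noisy and several iterations are needed.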

It is easy to see that the factor $g(T)$ drops out once one considers more than one copy of the system. This is the idea behind the replica-exchange method (Hukushima and Nemoto 1996; Geyer and Thompson 1995), also known as parallel tempering, which was first applied in protein science in Hansmann (1997). Assume we have $N$ non-interacting replicas of the molecule, each at a different temperature $T_i$. In addition to standard Monte Carlo or molecular dynamics moves that affect only one copy, parallel tempering has as an additional update (Hukushima and Nemoto 1996) the exchange of conformations between two copies $i$ and $j = i+1$ ($i \ge 1$ and $j \le N$). This replica-exchange move is accepted or rejected according to the Metropolis criterion with probability

$$w(C_{old} \to C_{new}) = \min\left(1, \exp\left(-\beta_i E(C_j) - \beta_j E(C_i) + \beta_i E(C_i) + \beta_j E(C_j)\right)\right) \tag{9.33}$$

$$= \min\left(1, \exp(\Delta\beta\,\Delta E)\right), \tag{9.34}$$

with $\Delta\beta = \beta_j - \beta_i$ and $\Delta E = E(C_j) - E(C_i)$, where the factors $g(T_i)$ and $g(T_j)$ have been omitted as they do not depend on the replica (configuration) and therefore cancel. This exchange of conformations leads to a faster convergence of the Markov chain than in regular canonical simulations, since the resulting random walk in temperature allows the configurations to move out of local minima and cross energy barriers. Expectation values of a physical quantity $A$ are calculated as usual according to

$$\langle A\rangle_{T_i} = \frac{1}{MES}\sum_{k}^{MES} A(C_i(k)), \tag{9.35}$$

where MES is the number of measurements taken for the $i$th temperature. Values for intermediate temperatures are calculated using multihistogram re-weighting techniques (Ferrenberg and Swendsen 1988, 1989). The temperature distribution should be chosen such that any relevant energy barrier can be crossed at the highest temperature. Note that parallel tempering does not require Boltzmann weights. The method can be combined easily with generalized-ensemble techniques (Hansmann 1997). Obviously, the method is also not restricted to temperature but can be used with any control parameter, for instance, pH or pressure.

9.3.3.3 Random Walks in Model Space

Finally, one can also enhance sampling of low-energy configurations by performing a random walk through an ensemble of systems with slightly altered energy functions. In that way, information is exchanged between varying stages of coarse graining or different local environments. This is the idea behind “model hopping” (Kwak and Hansmann 2005), the “Hamilton-exchange method,” and related approaches. Consider, for instance, that the energy function can be separated into two terms: $E = E_A + aE_B$. As in parallel tempering, “model hopping” considers $N$ non-interacting copies of the molecule, but adjacent copies are now exchanged with probability

$$w(C_{old}\to C_{new}) = \min\left(1, \exp\left\{-\beta\left[E_A(C_j) + a_i E_B(C_j) + E_A(C_i) + a_j E_B(C_i) - E_A(C_i) - a_i E_B(C_i) - E_A(C_j) - a_j E_B(C_j)\right]\right\}\right) = \min\left(1, \exp\{\beta\,\Delta a\,\Delta E_B\}\right). \tag{9.36}$$

Here, $\Delta a = a_j - a_i$ and $\Delta E_B = E_B(C_j) - E_B(C_i)$. Configurations perform a random walk on a ladder of models with $a_1 = 1 > a_2 > a_3 > \dots > a_N$ that differ in the relative contribution of $E_B$ to the total energy $E$ of the molecule. Take as an example the barriers in the energy landscape of proteins that arise from van der Waals repulsion between atoms that come too close. Assuming that such barriers are a main reason for slow sampling in protein simulations, we have considered a version of “model hopping” where the contributions from the van der Waals energy become successively smaller. While the “physical” system is on one side of the ladder (at $a_1 = 1$), the (non-physical) model on the other end of the ladder (at $a_N \ll 1$) allows, in the extreme, atoms to share the same position in space. As the protein “tunnels” through van der Waals energy barriers, sampling of low-energy configurations is enhanced in the “physical” model (at $a_1 = 1$). With this realization of “model hopping” we have “predicted” the structure of the 46-residue protein A in an all-atom simulation to within a root mean square deviation (rmsd) of 3.2 Å (Kwak and Hansmann 2005). Model hopping also allows guiding a simulation by information obtained from homologous structures (Gront et al. 2005). Usually, such spatial constraints introduce an additional roughness into the energy landscape, which often leads to extremely slow convergence of the simulation. This problem is circumvented in our approach through a random walk in an ensemble of replicas that differ in the strength with which the constraints are coupled to the system. We have demonstrated the usefulness of this approach on some examples of the CASP6 competition (Gront et al. 2005).
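
The exchange criteria of Eqs. (9.34) and (9.36) have the same structure and can be sketched together. The following Python functions are a minimal illustration; the function names, the list-based bookkeeping of configurations and energies, and the sweep over adjacent pairs are assumptions of this sketch rather than a description of a particular program.

```python
import math
import random


def tempering_swap_probability(beta_i, beta_j, e_i, e_j):
    """Eq. (9.34): exchange of configurations C_i (at beta_i) and C_j (at beta_j)."""
    delta = (beta_j - beta_i) * (e_j - e_i)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def model_hopping_swap_probability(beta, a_i, a_j, eb_i, eb_j):
    """Eq. (9.36): exchange between models a_i and a_j; eb_i = E_B(C_i), eb_j = E_B(C_j)."""
    delta = beta * (a_j - a_i) * (eb_j - eb_i)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def swap_sweep(betas, energies, configs, rng):
    """Attempt one exchange for every pair of adjacent temperatures (in place)."""
    for i in range(len(betas) - 1):
        p = tempering_swap_probability(betas[i], betas[i + 1],
                                       energies[i], energies[i + 1])
        if rng.random() < p:
            configs[i], configs[i + 1] = configs[i + 1], configs[i]
            energies[i], energies[i + 1] = energies[i + 1], energies[i]


# Usage: the colder replica (beta = 1.0) holds the higher energy, so the swap is certain
betas = [1.0, 0.5]
energies = [-5.0, -10.0]
configs = ["a", "b"]
swap_sweep(betas, energies, configs, random.Random(0))
print(configs, energies)
```

Writing the criterion in terms of the product of the two differences makes the physical content visible: an exchange that moves the lower-energy configuration to the lower temperature is always accepted.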

9.3.3.4 Optimizing the Efficiency of Generalized-Ensemble Sampling

The computational efficiency of generalized-ensemble and replica-exchange techniques is often worse than their theoretical optimum. Bottlenecks and barriers can lead to slow relaxation. In parallel tempering, convergence is given by the relaxation at the lowest temperature, which can be gauged by the frequency of statistically independent visits at this temperature. A lower bound for this number is the rate of round trips $n_{rt}$ between the lowest and highest temperature, $T_1$ and $T_N$. We define $n_{up}(i)$ ($n_{dn}(i)$) as the number of replicas at temperature $T_i$ that came from $T_1$ ($T_N$). The fraction of replicas moving up,

$$f_{up}(i) = \frac{n_{up}(i)}{n_{up}(i) + n_{dn}(i)}, \tag{9.37}$$

describes a stationary distribution of probability flow between temperatures $T_1$ and $T_N$. Maximizing the number of round trips $n_{rt}$ results in a constant transition probability between neighboring nodes and a linear flow distribution among the nodes (Nadler and Hansmann 2007):

$$f_{up}^{opt}(i) = i/N. \tag{9.38}$$
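
The flow fraction of Eq. (9.37) can be measured from the temperature history of each replica. The Python sketch below shows one possible bookkeeping, under the assumption that every replica trajectory is stored as a list of temperature indices (0 for the lowest, `n_temps - 1` for the highest temperature); the function name and the data layout are hypothetical.

```python
def flow_fraction(trajectories, n_temps):
    """Measure f_up(i) of Eq. (9.37) from replica trajectories.

    Each trajectory is a list of temperature indices visited by one replica.
    A replica is labelled "up" after visiting the lowest temperature and
    "down" after visiting the highest; visits before the first contact with
    either end are not counted.
    """
    n_up = [0] * n_temps
    n_dn = [0] * n_temps
    for trajectory in trajectories:
        label = None
        for index in trajectory:
            if index == 0:
                label = "up"
            elif index == n_temps - 1:
                label = "down"
            if label == "up":
                n_up[index] += 1
            elif label == "down":
                n_dn[index] += 1
    fractions = []
    for up, dn in zip(n_up, n_dn):
        total = up + dn
        fractions.append(up / total if total else None)
    return fractions


# Usage: one replica making a round trip on a ladder of four temperatures
print(flow_fraction([[0, 1, 2, 3, 2, 1, 0, 1]], n_temps=4))
```

In this labelling the fraction is 1 at the lowest and 0 at the highest temperature by construction; only the shape of the curve in between carries information about bottlenecks.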

Note that in protein simulations relying on explicit solvent the system is dominated by water. In this case the heat capacity $C$ is constant, and the system can be approximated by a $D = 2C$ dimensional harmonic oscillator. Within this approximation, one finds that the optimal temperature distribution is one of

$$N^{opt} \approx 1 + 0.594\,\sqrt{C}\,\ln(T_{max}/T_{min}) \tag{9.39}$$

replicas, with temperatures distributed according to

$$T_i^{opt} = T_{min}\left(\frac{T_{max}}{T_{min}}\right)^{\frac{i-1}{N-1}}. \tag{9.40}$$

Here, $T_{max}$ is the highest temperature, $T_{min}$ is the lowest temperature, and both quantities have to be chosen in advance (Nadler and Hansmann 2008). If the relaxation at a particular temperature is slower than hopping in temperature, the state space partitions into disjoint free-energy basins connected only via neighboring nodes and forming a tree-like hierarchical network. In this case of broken ergodicity an optimized temperature distribution can still be found iteratively (Trebst et al. 2006) by requiring for temperature $T_j^k$ in the $k$th iteration that

$$\int_{T_1}^{T_j^k} \eta^{(opt)}(T)\,dT = j/N, \tag{9.41}$$

where $1 < j < N$, the two extremal temperatures $T_1$ and $T_N$ remain fixed, and

$$\eta^{(opt)}(T) = C\sqrt{\frac{1}{\Delta T}\frac{df}{dT}}, \tag{9.42}$$

with the normalization constant $C$ chosen so that

$$\int_{T_1}^{T_N} \eta^{(opt)}(T)\,dT = 1. \tag{9.43}$$
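
For the constant-heat-capacity case, the estimates of Eqs. (9.39) and (9.40) translate directly into a few lines of Python. This is a sketch under stated assumptions: the heat capacity is taken as a dimensionless number, rounding the replica count up to the next integer is a choice made here, and the numerical values in the usage example are hypothetical.

```python
import math


def optimal_replica_count(heat_capacity, t_min, t_max):
    """Eq. (9.39), rounded up to an integer (the rounding is an assumption)."""
    estimate = 1.0 + 0.594 * math.sqrt(heat_capacity) * math.log(t_max / t_min)
    return math.ceil(estimate)


def geometric_ladder(t_min, t_max, n_replicas):
    """Eq. (9.40): temperatures T_i for i = 1..N; requires n_replicas >= 2."""
    ratio = t_max / t_min
    return [t_min * ratio ** ((i - 1) / (n_replicas - 1))
            for i in range(1, n_replicas + 1)]


# Usage with hypothetical values: C = 100, T between 300 K and 600 K
n_replicas = optimal_replica_count(100.0, 300.0, 600.0)
print(n_replicas)
print(geometric_ladder(300.0, 600.0, n_replicas))
```

The ladder is geometric, i.e., the ratio of neighboring temperatures is constant, which is the familiar choice for systems with a temperature-independent heat capacity.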

This will again lead to a linear flow distribution, but the acceptance probabilities are no longer constant. Similarly, one can show that in the case of broken ergodicity weights optimizing the flow through order parameter space (for instance, energy) do not lead to a flat distribution (Trebst et al. 2006; Nadler and Hansmann 2007).

A direct measurement of the flow distribution is computationally costly: individual replicas have to cross the full ladder of nodes many times in order to ensure sufficient statistics. Such “tunneling” events are especially rare in early stages of the control parameter optimization, when round trip times are largest. For this reason, we have proposed to estimate the flow distribution from measurements of mean first passage times, which allows one to approximate the global flow using also replicas that cross only part of the ladder. The number of first passage events $\tau(0\to n)$ in a simulation decreases with $n$, while the error increases. At the same time, the error for $\tau(N\to n)$ increases with decreasing $n$. Obviously, there exists a node $n^*$ where the mean first passage times $\tau(0\to n)$, $n = 1,\dots,n^*$, and $\tau(N\to n)$, $n = n^*+1,\dots,N-1$, will be the most reliable ones. Hence, while in the optimization schemes of Trebst et al. (2006) and Nadler and Hansmann (2007) the limiting factor is the number of tunneling events across the full ladder of temperatures, here the statistics is limited only by the number of first passage events from either boundary to $n^*$. Defining sums over adjacent inverse transition probabilities,

$$h(0\to n) = \sum_{j=1}^{n} \frac{1}{W(\beta_{j-1},\beta_j)} = \sum_{j=1}^{n-1} \frac{\tau(0\to j)}{j(j+1)} + \frac{\tau(0\to n)}{n}, \tag{9.44}$$

and

$$h(N\to n) = \sum_{j=n+1}^{N} \frac{1}{W(\beta_j,\beta_{j-1})} = \sum_{j=1}^{N-n-1} \frac{\tau(N\to N-j)}{j(j+1)} + \frac{\tau(N\to n)}{N-n}, \tag{9.45}$$

one can derive the following expressions for the flow probabilities:

$$P_{up}^{(MFPT)}(n) = \begin{cases} 1 - \dfrac{h(0\to n)}{h(0\to n^*) + h(N\to n^*)} & : n \le n^* \\[2ex] \dfrac{h(N\to n)}{h(0\to n^*) + h(N\to n^*)} & : n > n^* \end{cases} \tag{9.46}$$

with a similar relation for $P_{down}^{(MFPT)}$. Starting from a flow distribution $P_{up}^{(MFPT)}$ reconstructed from mean first passage time analysis, one can now use existing iteration schemes that exhibit fast convergence to the optimal temperature values (Trebst et al. 2006; Nadler and Hansmann 2007). In our simulations, flow distributions derived from mean first passage times lead to temperature sets that are more stable upon iteration than those from flows measured directly (Nadler et al. 2008).
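
The reconstruction of the flow distribution from mean first passage times, Eqs. (9.44)–(9.46), can be sketched as follows. The Python functions below assume that the mean first passage times are stored in lists indexed by the target node, with `tau_up[j]` standing for $\tau(0\to j)$ and `tau_down[m]` for $\tau(N\to m)$; these names and the test case of a ladder with uniform transition probabilities are assumptions made for this illustration.

```python
from fractions import Fraction


def h_from_bottom(tau_up, n):
    """Eq. (9.44) for 1 <= n: tau_up[j] = tau(0 -> j)."""
    return sum(tau_up[j] / (j * (j + 1)) for j in range(1, n)) + tau_up[n] / n


def h_from_top(tau_down, n, n_nodes):
    """Eq. (9.45) for n <= N - 1: tau_down[m] = tau(N -> m), n_nodes = N."""
    partial = sum(tau_down[n_nodes - j] / (j * (j + 1))
                  for j in range(1, n_nodes - n))
    return partial + tau_down[n] / (n_nodes - n)


def p_up(tau_up, tau_down, n, n_star, n_nodes):
    """Eq. (9.46): flow probability at node n, matched at node n_star."""
    total = h_from_bottom(tau_up, n_star) + h_from_top(tau_down, n_star, n_nodes)
    if n <= n_star:
        return 1 - h_from_bottom(tau_up, n) / total
    return h_from_top(tau_down, n, n_nodes) / total


# Test case (assumption): uniform transition probabilities W = 1 on a ladder
# with N = 6, for which tau(0 -> n) = n (n + 1) / 2 and, by symmetry,
# tau(N -> n) = (N - n) (N - n + 1) / 2.
N = 6
tau_up = [Fraction(j * (j + 1), 2) for j in range(N + 1)]
tau_down = [Fraction((N - m) * (N - m + 1), 2) for m in range(N + 1)]
print([p_up(tau_up, tau_down, n, 3, N) for n in range(1, N)])
```

For uniform transition probabilities the sums reduce to $h(0\to n) = n$ and $h(N\to n) = N - n$, so the reconstructed flow is linear in $n$, as expected for an optimal ladder.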

9.4 Recent Applications

An example of the application of the above-described techniques is our recent simulation (Nadler et al. 2008) of GS-α3W, a three-helix bundle with a single tryptophan buried in the interior of the protein, which was designed to serve as a simple model for the function of redox proteins. With 67 residues the protein is of a size that far exceeds what we could study previously in all-atom simulations. In Fig. 9.1 we show a low-energy structure found in an implicit-solvent replica-exchange Monte Carlo simulation relying on temperatures optimized by mean first passage times. The figure shows the overlap of the minimal structure with the experimentally determined one (Protein Data Bank identifier 1LQ7). The root mean square deviation between both structures is only 3.3 Å. A detailed analysis of our data is published in Meinke and Hansmann (2009) and confirms that replica exchange with flow-optimized temperatures allows thermal-folding simulations of such large proteins.

Fig. 9.1 Overlap of computational and experimentally determined configuration of a 67-residue protein. Figure taken from Nadler et al. (2008)

Our results for the 49-residue C-terminal fragment CFr of the artificially designed Top7 (Kuhlman et al. 2003; Dantas et al. 2006) may be the most impressive example of the power of generalized-ensemble simulations and the progress made over the last few years. Our simulations of the monomer led to a lowest-energy structure with an α-helix packed against a three-stranded anti-parallel β-sheet, shown in Fig. 9.2, that has an rmsd of only 1.7 Å (Mohanty et al. 2008) to the experimentally determined structure (PDB code 2GJH). It is also the free-energy minimum at T = 300 K. This result is all the more remarkable as proteins with end-to-end β-sheets are particularly challenging. The N-terminal β-strand is synthesized early on, but it cannot bind to the C terminus before the chain is fully synthesized. During this time there is a danger that the β-strand at the N terminus interacts with nearby molecules, leading to potentially harmful aggregates of incompletely folded proteins. For slow-folding proteins, one can conjecture a “backtracking” mechanism (Gosavi et al. 2006) where folding succeeds only after breaking of prematurely formed β-contacts.

Fig. 9.2 Overlay of the free-energy configuration of CFr with the experimentally determined structure. Figure taken from Mohanty and Hansmann (2008)

However, the CFr monomer is a fast-folding protein. The risk of mis-folding and aggregation is avoided by another mechanism in our simulations. During much of the folding process the N-terminal region (Glu 2–Thr 12) forms an extension of the neighboring helix, residues Lys 13 through Gly 31 (picture 2 in Fig. 9.3). This extended helix forms early but frequently unfolds and refolds. After formation and proper arrangement of the β-hairpin (Tyr 32 through Leu 50) (pictures 3–4 in Fig. 9.3), the arrangement of helix and hairpin is stabilized through hydrophobic contacts between the two secondary structure elements. At this point, the helix unfolds partially and releases the N-terminal residues (picture 5 in Fig. 9.3), which now attach to the hairpin as the third strand of a β-sheet and complete the native structure (picture 6 in Fig. 9.3).

Fig. 9.3 Six snapshots of the folding pathway. Starting from random initial states (1), the molecule first forms a helix (2) that includes the N-terminal residues. The C-terminal hairpin is formed next (3), aligning with the helix (4). The helix partially unfolds (5) and the released residues complete the native structure as third strand of the β-sheet (6). The figure is taken from Mohanty et al. (2008)

The caching of an N-terminal strand in a helix prevents premature formation of contacts with parts of the molecule having strong β-strand propensities and also with similar parts in other molecules. Thus it acts both as a facilitator of folding and as an inhibitor of aggregation. This caching mechanism may be common in proteins with long-distance β-contacts. It requires that one of the strands exhibits chameleon behavior (Minor and Kim 1996), i.e., either extends an adjacent helix or forms a β-strand when provided with a template for a β-sheet. This is important not only because this mechanism may be common in the large class of proteins with β-sheets where the strands have a long separation in sequence, but also because it demonstrates that protein design cannot be restricted to the final fold but needs to include the folding path.

9.5 Conclusion

Progress in the development of algorithms over the last two decades has dramatically extended the size of peptides and proteins that are accessible in all-atom simulations. While in the early 1990s such simulations were restricted to pentapeptides such as Met-enkephalin, small proteins with 50–60 residues are now accessible. We can expect that further progress in algorithms and hardware will extend this range to even larger molecules. Research over the last years has also allowed us to pinpoint the difficulties that need to be overcome in order to obtain further progress. Probably the most important open problem is that novel and advanced techniques require a tuning of parameters to give optimal sampling and that there are no simple and universal rules for this tuning. Research in the next years will need to focus on the development of procedures that allow for an optimal choice of algorithm and its tuning to the problem at hand. Such research requires the establishment of a set of benchmark systems that enable a fair and decisive comparison of simulation techniques and allow one to measure the scaling of their efficiency with size and complexity of the molecules. Such benchmark systems will also allow one to test and improve the software in which these algorithms are implemented. As the described techniques can only reduce the sampling difficulties from an exponential scaling to a power law, it is necessary to have software that is highly adapted to massively parallel computers and modern architectures such as GPUs and cell processors. Writing and adapting this software will be another major challenge. It is hoped that further advancements in hardware and algorithms will overcome the above-described problems and further establish the use of computer simulations as a “microscope” to a point where not only single proteins are studied in atomistic resolution, but also their interactions, and finally whole cells, can be explored in silico.
Acknowledgments Support by the National Science Foundation (research grants CHE-998174, 0313618, 0809002) and the National Institutes of Health (GM62838) is acknowledged.

References

Androulakis IP, Maranas CD, Floudas CA (1997) Prediction of oligopeptide conformations via deterministic global optimization. J Global Optim 11:1–34
Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230
Allen F, Almasi G, Andreoni W, Beece D, Berne BJ, Bright A, Brunheroto J, Cascaval C, Castanos J, Coteus P, Crumley P, Curioni A, Denneau M, Donath W, Eleftheriou W, Fitch B, Fleischer B, Georgiou CJ, Germain R, Giampapa M, Gresh D, Gupta M, Haring R, Ho H, Hochschild P, Hummel S, Jonas T, Lieber D, Martyna G, Maturu K, Moreira J, Newns D, Newton M, Philhower R, Picunko T, Pitera J, Pitman M, Rand R, Royyuru A, Salapura V, Sanomiya A, Shah R, Sham Y, Singh S, Snir M, Suits F, Swetz R, Swope RC, Vishnumurthy B, Ward TJC, Warren H, Zhou R (2001) Blue gene: a vision for protein science using a petaflop supercomputer. IBM Syst J 40:310–327
Berg BA, Neuhaus T (1991) Multicanonical algorithms for first order phase transitions. Phys Lett B 267:249–253
Berg BA (2003) Metropolis importance sampling for rugged dynamical variables. Phys Rev Lett 90:180601
Berg BA (2004) Markov chain Monte Carlo simulations and their statistical analysis. World Scientific, Singapore
Besold G, Risbo J, Mouritsen OG (1999) Efficient Monte Carlo sampling by direct flattening of free energy barriers. Comp Mater Sci 15:311–340
Brass A, Pendleton BJ, Chen Y, Robson B (1993) Hybrid Monte Carlo simulation theory and initial comparison with molecular dynamics. Biopolymers 33:1307–1315
Curado EMF, Tsallis C (1994) Possible generalization of Boltzmann–Gibbs statistics. J Phys A-Math Gen 27:3663
Cvijovic D, Klinowski J (1995) Taboo search: an approach to the multiple minima problem. Science 267:664–666
Daggett V, Fersht AR (2003) Is there a unifying mechanism for protein folding? Trends Biochem Sci 28:18–25
Daggett V (2002) Molecular dynamics simulations of the protein unfolding/folding reaction. Acc Chem Res 35:422–429
Dantas G, Watters AL, Lunde BM, Eletr ZM, Isern NG, Roseman T, Lipfert J, Doniach S, Tompa M, Kuhlman B, Stoddard BL, Varani G, Baker D (2006) Mis-translation of a computationally designed protein yields an exceptionally stable homodimer: implications for protein engineering and evolution. J Mol Biol 362:1004–1024
Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987) Hybrid Monte Carlo. Phys Lett B 195:216–221
Favrin G, Irback A, Sjunnesson F (2001) Monte Carlo update for chain molecules: Biased Gaussian steps in torsional space. J Chem Phys 114:8154–8158
Ferrenberg AM, Swendsen RH (1988) New Monte Carlo technique for studying phase transitions. Phys Rev Lett 61:2635–2638
Ferrenberg AM, Swendsen RH (1989) Optimized Monte Carlo data analysis. Phys Rev Lett 63:1195–1198
Frenkel D, Smit B (2001) Understanding molecular simulation. From algorithms to applications. In: Computational science series, vol 1, 2nd edn. Academic, New York, NY
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE T Pattern Anal 6:721–741
Geyer CJ, Thompson EA (1995) Annealing Markov Chain Monte Carlo with applications to ancestral inference. J Am Stat Assoc 90:909–920
Gosavi S, Chavez LL, Jennings PA, Onuchic JN (2006) Topological frustration and the folding of interleukin-1 beta. J Mol Biol 357:986–996
Gront D, Kolinski A, Hansmann UHE (2005) Exploring protein energy landscape with hierarchical clustering. Int J Quant Chem 105:826
Gō N, Scheraga HA (1970) Ring closure and local conformational deformations of chain molecules. Macromolecules 3:178–187
Hansmann UHE, Okamoto Y (1993) Prediction of peptide conformation by multicanonical algorithm: a new approach to the multiple-minima problem. J Comp Chem 14:1333–1338
Hansmann UHE, Okamoto Y (1994) Comparative study of multicanonical and simulated annealing algorithms in the protein folding problem. Physica A 212:415–437
Hansmann UHE, Okamoto Y, Eisenmenger F (1996) Molecular dynamics, Langevin and hybrid Monte Carlo simulations in a multicanonical ensemble. Chem Phys Lett 259:321–330
Hansmann UHE (1997) Parallel tempering algorithm for conformational studies of biological molecules. Chem Phys Lett 281:140–150

9

Sampling Protein Energy Landscapes – The Quest for Efficient Algorithms

229

Hansmann UHE, Okamoto Y (1998) The generalized-ensemble approach for protein folding simulations. In: Stauffer D (ed) Annual reviews in computational physics, vol. VI. World Scientific, Singapore Hansmann UHE, Okamoto Y (1999a) New Monte Carlo algorithms for protein folding. Curr Opin Struc Biol 9:177–184 Hansmann UHE (1999) Protein folding simulations in a deformed energy landscape. Eur Phys J B 12:607–612 Hansmann UHE, Okamoto Y (1999b) Finite-size scaling of helix–coil transitions in poly-alanine studied by multicanonical simulations. J Chem Phys 110:1267–1276 Hansmann UHE, Wille L (2002) Global optimization by energy landscape paving. Phys Rev Lett 88:068105 Hansmann UHE (2003) Protein folding in silico – an overview. Comput Sci Eng 5:64–69 Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MI Hukushima K, Nemoto K (1996) Exchange Monte Carlo method and applications to spin glass simulations. J Phys Soc (Japan) 65:1604–1608 Kirkpatrick S, Gelatt CP, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671–680 Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D (2003) Design of a novel globular protein fold with atomic level accuracy. Science 302:1364–1368 Kumar S, Payne PW, Vásquez M (1996) Method for free-energy calculations using iterative techniques. J Comp Chem 17:1269–1275 Kwak W, Hansmann UHE (2005) Efficient sampling of protein structures by model hopping. Phys Rev Lett 95:138102 Laio A, Parrinello M (2002) Escaping free-energy minima. Proc Natl Acad Sci USA 99: 12562–12566 Li Z, Scheraga HA (1987) Monte Carlo-minimization approach to the multiple-minima problem in protein folding. Proc Natl Acad Sci USA 84:6611–6615 Lyubartsev AP, Martinovski AA, Shevkunov SV, Vorontsov-Velyaminov PN (1992) New approach to Monte Carlo calculations of the free energy: method of expanded ensembles. 
J Chem Phys 96:1776–1783 Marinari E, Parisi G (1992) Simulated tempering: a new Monte Carlo scheme. Europhys Lett 19:451–458 Meinke JH, Hansmann UHE (2009) Thermodynamics and free-energy driven folding of the 67-residue protein GSα W – A large-scale Monte Carlo study. J Comp Chem 30:1642–1648 Minor DL Jr, Kim PS (1996) Context-dependent secondary structure formation of a designed protein sequence. Nature 380:730–734 Mohanty S, Meinke JH, Zimmermann O, Hansmann UHE (2008) Simulation of Top7-CFr: a transient helix extension guides folding. Proc Natl Acad Sci USA 105:8004–8007 Mohanty S, Hansmann UHE (2008) Caching of a Chameleon segment facilitates folding of a protein with end-to-end β-sheet. J Phys Chem B 112:15134 Nadler W, Hansmann UHE (2007) Generalized ensemble and tempering simulations: a unified view. Phys Rev E 75:026109 Nadler W, Hansmann UHE (2008) Optimized explicit-solvent replica-exchange molecular dynamics from scratch. J Phys Chem B 112:10386 Nadler W, Meinke JA, Hansmann UHE (2008) Folding proteins by first-passage-times optimized replica exchange. Phys Rev E 78:061905 Schug A, Wenzel W, Hansmann UHE (2005) Energy landscape paving simulations of the trp-cage protein. J Chem Phys 122:194711 Trebst S, Troyer M, Hansmann UHE (2006) Optimized parallel tempering simulations of proteins. J Chem Phys 124:174903 Torrie GM, Valleau JP (1977) Nonphysical sampling distributions in Monte Carlo free-energy estimation: umbrella sampling. J Comp Phys 23:187–199

230

U.H.E. Hansmann

Wang FG, Landau DP (2001) Efficient, multiple-range random walk algorithm to calculate the density of states. Phys Rev Lett 86:2050–2053 Wenzel W, Hamacher K (1999) Stochastic tunneling approach for global minimization of complex potential energy landscapes. Phys Rev Lett 82:3003 Wu MG, Deem MW (1999) Analytical rebridging Monte Carlo: application to cis/trans isomerization in proline-containing cyclic peptides. J Chem Phys 111:6625–6632

Chapter 10

Protein Structure Prediction: From Recognition of Matches with Known Structures to Recombination of Fragments

Michal J. Gajda, Marcin Pawlowski, and Janusz M. Bujnicki

Abstract The field of protein structure prediction has been revolutionized by the application of “mix-and-match” methods, both in template-based homology modeling and in template-free, “de novo” folding. Automated generation of models that are closer to the native structure of the target protein than the structure of its closest homolog is currently possible by recombination of fragments copied from known protein structures or extracted from alternative starting models. It is also the most successful approach in cases where the target protein exhibits a novel three-dimensional fold. This chapter is an update of a review article published earlier [Bujnicki JM, ChemBioChem 7(1):19–27, 2006 Jan 9, Copyright Wiley-VCH Verlag GmbH & Co. KGaA. Reproduced with permission]. It summarizes the recent developments in both template-based and template-free protein structure modeling and compares the available methods for protein structure prediction by recombination of fragments.

J.M. Bujnicki (B) Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, Poland; Laboratory of Bioinformatics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_10, © Springer Science+Business Media, LLC 2011

10.1 Introduction

The high-resolution three-dimensional structure of a protein is the key to understanding and manipulating its biochemical and cellular function. However, the rate of protein structure determination by X-ray crystallography lags behind the rate of determination of new protein sequences. As of January 2010, the National Center for Biotechnology Information’s Non-Redundant RefSeq database (Pruitt et al. 2007) contained 9,662,677 sequences, while the Protein Data Bank (Berman et al. 2000) contained only 36,043 protein structures with non-redundant sequences (50,495 structures total). Currently, the size of the sequence database doubles approximately every 5 years, while the structure database doubles every 8 years. Thus, the gap between the number of known structures and known sequences will continue to widen in the foreseeable future and it is unlikely that it will ever be closed, i.e., the structures will never be solved experimentally for all proteins.

Almost 50 years ago Anfinsen demonstrated that all of the information necessary for RNase A to fold to the native structure is contained in its amino acid sequence (Anfinsen et al. 1961). This finding has been generalized to most globular proteins, suggesting that a protein’s structure could be calculated (modeled) based on the knowledge of its sequence and our understanding of the sequence–structure relationships. Thus, the current structural genomics initiative aims to solve experimentally the structures of representative proteins, in the hope that the others can be modeled computationally (Baker and Sali 2001). The theoretical prediction of the native structure of a protein from its amino acid sequence remains, however, one of the most challenging problems in contemporary life sciences.
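The consequence of the two doubling times quoted above can be checked with simple arithmetic. The following sketch assumes idealized exponential growth starting from the January 2010 non-redundant counts (an assumption made here purely for illustration); the function names are hypothetical.

```python
def projected_count(initial: float, doubling_years: float, years: float) -> float:
    """Database size after `years`, assuming steady exponential doubling."""
    return initial * 2.0 ** (years / doubling_years)


def sequence_to_structure_ratio(years: float) -> float:
    """Ratio of known sequences to known structures `years` after January 2010."""
    sequences = projected_count(9_662_677, 5.0, years)   # doubles every 5 years
    structures = projected_count(36_043, 8.0, years)     # doubles every 8 years
    return sequences / structures


if __name__ == "__main__":
    for t in (0, 10, 20, 40):
        print(t, round(sequence_to_structure_ratio(t)))
```

Under these assumptions the ratio grows as 2^(t/5 - t/8) = 2^(3t/40), so the number of sequences per solved structure itself doubles roughly every 13 years (40/3) and the gap widens rather than closes.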

10.2 Protein Structure Prediction Methods: Classification and Critical Evaluation

Efforts to solve the protein-folding problem have been traditionally rooted in two schools of thought (Fig. 10.1). One is based on the principles of physics, in particular on the thermodynamic hypothesis formulated by Anfinsen, according to which the native structure of a protein corresponds to the global minimum of its free energy (Anfinsen 1973). Accordingly, physics-based methods model the process of protein folding by simulating the conformational changes and searching for the free-energy minimum. The other school of thought is based on the principles of evolution. After experimental determination of the first handful of protein structures it became clear that evolutionarily related (homologous) proteins usually retain the same three-dimensional fold (i.e., the arrangement and connectivity of elements of secondary structure) despite the accumulation of divergent mutations (Chothia and Lesk 1986). It was also found that structural divergence is much slower than sequence divergence, although these two features are strongly correlated. Thus, methods have been developed to map the sequence of one protein (a target) to the structure of another protein (a template), model the overall fold of the target based on that of the template, and infer how the target structure will change due to substitutions, insertions, and deletions, as compared with the template (reviews: Cohen-Gonsaud et al. 2004; Krieger et al. 2003).

Fig. 10.1 The “evolutionary” and “physical” approaches for protein structure prediction. Given the amino acid sequence, a simulation of either protein evolution or protein folding is carried out, according to quantitative models of either divergence of sequences and structures or physical interactions within the molecule and between the molecule and the solvent. (Bujnicki, 2006, Copyright Wiley-VCH Verlag GmbH & Co. KGaA. Reproduced with permission)

Table 10.1 summarizes the key features of the methods discussed in this chapter. In line with the two schools of thought, methods for protein structure prediction have been divided into two classes: “de novo” modeling, in principle applicable to all types of proteins, including those for which no appropriate templates are available, and “comparative (homology) modeling” (CM), in which the target sequence must be aligned to an evolutionarily related, experimentally solved template structure. In this context it is worth recalling that protein structure can be described by a hierarchical system, with levels corresponding to primary sequence (covalent bonding of amino acids), secondary structure (segments of recurring arrangement of amino acids consecutive in the sequence), tertiary structure (mutual arrangement of secondary structures in a protein domain), and quaternary structure (mutual arrangement of domains within a multi-domain protein or different subunits in a multi-protein complex). Knowledge-based methods are usually effective on all levels of this hierarchy, because evolutionary forces tend to preserve all aspects of protein structure; if homology is present, it often manifests itself at all levels of the hierarchy.

Table 10.1 Summary of key features of methods discussed in this chapter. This table lists all prediction methods discussed in this chapter, along with their distinguishing features. It is intended as a reference and guide for contrasting approaches

Method | Type | Search strategy | Evaluation/selection of models | Input and/or fragment library
SWISS-MODEL | CM | Superposition of templates | NA | CM templates, loops from PDB
PCONS5 (PROQ) | FR/CM meta-selector | Superposition of models | Statistical potential (PROQ) | FR models
3D-JURY | FR/CM meta-selector | Superposition of models | NA | FR models
3D-SHOTGUN | FR/CM fragment splicer | Superposition and recombination of models | NA | FR models
FRANKENSTEIN3D | FR/CM fragment splicer | Superposition and recombination of models | Statistical meta-potential (MetaMQAP) | FR models
GS-KudlatyPred | FR/CM/de novo fragment splicer | Recombination of supersecondary structure fragments from any kind of initial models | Statistical meta-potential (MetaMQAPcons) | FR/CM/de novo models
In silico protein recombination | CM fragment splicer | Superposition and recombination of models, local realignment | Statistical potential | Comparative models with similar folds
GENETIC ALGORITHM | CM alignment splicer | Recombination of alignments, local realignment | Statistical potential | Alternative target-template alignments
ROSETTA | De novo fragment splicer | Monte Carlo-simulated annealing | Physical energy function with elements of a statistical potential, clustering | 3 and 9 aa fragments from PDB
GINZU/ROBETTA | FR/CM/de novo fragment splicer | Merging of domain models, Monte Carlo-simulated annealing | FR score and statistical potential | 3 and 9 aa fragments from PDB, FR models
SIMFOLD | De novo fragment splicer | Multi-canonical ensemble Monte Carlo | Physical energy function | 4–9 aa fragments from PDB
PROFESY | De novo fragment splicer | Conformational space annealing | Physical energy function | 15 aa fragments from PDB
FRAGFOLD | De novo fragment splicer | Simulated annealing or genetic algorithm | Statistical potential | Supersecondary structures and 3–5 aa fragments
UNDERTAKER | De novo fragment splicer | Genetic algorithm | Statistical potential | Fragments of FR models, and 1–4 aa and 9–12 aa fragments from PDB
ABLE | De novo fragment splicer | Monte Carlo-simulated annealing, iterated with restraints from previous rounds | Physical energy function with elements of a statistical potential | Individual residues
TASSER | FR/CM/de novo fragment splicer | Replica Exchange Monte Carlo (on a lattice) | Statistical potential, clustering | FR models
I-TASSER | FR/CM/de novo fragment splicer | Replica Exchange Monte Carlo (on a lattice), iterative enrichment of fragments with templates analogous to ab initio models from previous rounds | Statistical potential, clustering | FR models, SCOP folds most similar to ab initio models
FRANK/CABS | FR/CM/de novo fragment splicer | Replica Exchange Monte Carlo (on a lattice) | Statistical potential | FR models
ZAM | De novo fragment splicer | Zipping and assembly from smaller fragments | Physical energy function with elements of a statistical potential | Individual residues
SALAMI | De novo fragment splicer | Bayesian fragment picking, torsion angle dynamics | Steric clashes, clustering | Alphabet of 300 fragments each 6 residues long
Structural descriptors | FR | NA | NA | Descriptors (groups of >2 fragments; >3 residues long)

Further, in homology-based modeling, errors at different levels of the structural hierarchy are largely independent of each other (e.g., it is not difficult to correctly predict a protein fold without getting all secondary structures right and without any consideration of the quaternary structure). This has been demonstrated in the course of the Critical Assessment of Techniques for Protein Structure Prediction (CASP), as many modelers have generated models with correct folds while completely disregarding the quaternary structure, and with significant errors in the alignment of secondary structure elements (Moult et al. 2005, 2007, 2009). On the other hand, “de novo” methods typically require accurate prediction at low levels of the hierarchy in order to correctly predict higher levels. Their advantage, however, is that they do not depend on modeling at the level of the primary sequence (they do not attempt to align the target sequence to any other sequence).

The “de novo” approach can be further subdivided into “ab initio” methods, i.e., those based exclusively on the physics of the interactions within the polypeptide chain and between the polypeptide and the solvent (Scheraga 1996), and “knowledge-based” methods that utilize statistical potentials based on the analysis of recurrent patterns in the known protein structures and sequences (Kolinski 2004). De novo methods may utilize different representations of the protein chain, frequently employing either coarse-grained models (see the chapter by Liwo et al. in this volume) or fragments of experimentally solved structures (from parts of the side chain or the backbone, to individual residues, to groups of residues up to the size of a few elements of secondary structure).

The CM approach can also be subdivided into two main trends. One is to model the structure by copying the coordinates of the template (both the backbone and the side chains) in the aligned core regions, which can also include “averaging” over coordinates of multiple templates. The variable regions are modeled by taking fragments with similar sequence from a database of previously observed loops, followed by replacing the mutated side chains with rotamers that satisfy the stereochemical criteria, and (optionally) limited energy optimization, as implemented in SWISS-MODEL (Peitsch 1995). The other possibility is to use the torsion angles and interatomic distances from the aligned regions of the template(s) as modeling restraints, which permits the use of information from multiple, possibly conflicting structures. This approach also requires idealizing the geometry and packing of the entire chain by satisfying stereochemical constraints derived from the database of protein structures, as implemented in MODELLER (Sali and Blundell 1993).

The CM approach has also been extended to “fold-recognition” (FR), where one attempts to identify a template with a similar fold that does not need to exhibit significant sequence similarity to the target (i.e., the target and the template may or may not be homologous, but they need to share a common fold) (Godzik et al. 1992; Jones et al. 1992). While the early FR methods relied mostly on the “threading” approach, i.e., evaluation of protein energy as the sum of pairwise residue–residue interactions based on physical or statistical potentials, nearly all contemporary FR methods are based mostly (or exclusively) on sequence comparisons and are tuned to detect distantly related homologs rather than unrelated structural analogs (reviews: Cymerman et al. 2004; Ginalski et al. 2005).
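The threading idea mentioned above, scoring a sequence–structure fit as a sum of pairwise residue–residue terms, can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the target is threaded onto the template without gaps, and the contact list, the small potential table, and the function name are invented for illustration and do not correspond to any published potential.

```python
from typing import Dict, FrozenSet, List, Tuple


def threading_energy(
    sequence: str,
    contacts: List[Tuple[int, int]],
    potential: Dict[FrozenSet[str], float],
) -> float:
    """Sum pairwise contact energies for a sequence placed on a template.

    `contacts` lists pairs of template positions (0-based) that are in
    contact; position k is assumed to be occupied by sequence[k].
    Residue pairs missing from `potential` contribute zero.
    """
    return sum(
        potential.get(frozenset((sequence[i], sequence[j])), 0.0)
        for i, j in contacts
    )


# Hypothetical two-letter potential: L-L contacts favourable, L-K unfavourable.
toy_potential = {frozenset("L"): -1.0, frozenset("LK"): 0.5}
toy_energy = threading_energy("LKL", [(0, 2), (0, 1)], toy_potential)
```

Using `frozenset` keys makes the table symmetric, so the L–K and K–L contacts share one entry.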

Another way of subdividing the “comparative” approach is into orthodox (traditional) methods that use entire proteins (or domains) as templates and methods that use different (not necessarily related) structures or their fragments to model different parts of the target sequence (which are discussed in this review). With the decreasing size of fragments, the latter type of comparative methods blends with de novo methods that represent protein structures with fragments of known structures. Recently, a new generation of protein structure prediction methods has been developed that combine comparative modeling with de novo modeling. Typically, an initial model or a significant part of it is built by the comparative approach, based on the general prediction of the three-dimensional fold, and then the entire structure or a part of it is “refined” by de novo methods, often in connection with evaluation of local quality by Model Quality Assessment Programs (MQAPs) (Kryshtafovych and Fidelis 2009). This review describes examples of all the above-mentioned approaches.

In order to objectively assess the abilities (and inabilities) of different methods for protein structure prediction, Moult and coworkers organized the biennial Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP) to compare computationally predicted protein structures with the “gold standard” of experimentally determined ones. The first assessment experiment (CASP1) was held in 1994 and revealed that computational methods for protein structure prediction perform quite poorly, those based on physics and evolution alike (Moult et al. 1995). Since then, the progress in the field of protein structure prediction has been significant, especially in the “template/knowledge-based” category (e.g., CM and FR), in part due to the improvement in methodology but mostly because of the rapid growth of databases and accumulation of new potential template structures as well as numerous new sequences that can serve as convenient “evolutionary intermediates” in the homology searches. Nonetheless, it appears that in recent years there has been little progress, if any, in the performance of both comparative and de novo methods (Moult et al. 2009).

10.3 “Meta” Approaches to Template-Based Prediction

The series of CASP experiments has shown that the combined use of human expertise and automated methods can often result in successful predictions. This has become especially clear in the cases of very remote homology, where most FR methods return predictions with scores indicating the lack of statistical significance and correct models are “buried” among a number of incorrect models. A group of four human predictors, Daniel Fischer, Leszek Rychlewski, Arne Elofsson, and Janusz M. Bujnicki, pioneered the idea of “meta-prediction” in CASP4 by comparing the models generated by FR servers participating in the satellite experiment CAFASP-2 and submitting manually selected “consensus” predictions as the “CAFASP-CONSENSUS” group. This group performed better than any of the individual servers and ranked seventh among all predictors of CASP4 (Fischer et al. 2001). Thus, it was demonstrated that the recurrence of a particular protein fold within the sets of top ten models returned by different servers (and not necessarily at the first position of their ranking) increases the likelihood of a correct prediction and that, on average, no single FR method is better than the combination of a few top methods. Since then, meta-prediction based on FR (Fig. 10.2) has become the most successful approach for template-based modeling and has been applied by a large number of human predictors, including the best performers in CASP5 and CASP6.

Fig. 10.2 The meta-server approach for protein structure prediction. The meta-server is used as a gateway to send the target sequence to various “primary” fold-recognition servers, collect the results (target-template alignments), build the corresponding models, compare them with each other, and either select the most representative structure or construct a hybrid model from the most frequently represented fragments (Bujnicki, 2006, Copyright Wiley-VCH Verlag GmbH & Co. KGaA. Reproduced with permission)

Following the proven success of manual “meta-predictors,” several groups have implemented fully automated “meta-servers” (Bujnicki et al. 2001b; Douguet and Labesse 2001; Kurowski and Bujnicki 2003; Wu and Zhang 2007a). One of the earliest meta-predictors was the neural network-based PCONS developed by Elofsson and coworkers (Lundstrom et al. 2001), which collects a set of top models generated by different FR servers and selects the models that are most similar to other models in the set. The second edition of the independent assessment experiment LiveBench (Bujnicki et al. 2001a), organized shortly after CASP4, revealed that PCONS2 (a version trained for a few specific “primary” FR servers) exhibited a sensitivity comparable to that of the most sensitive primary method and a specificity higher than that of any primary method. The newest version of PCONS is reinforced by methods for protein model evaluation (Wallner and Elofsson 2003), exhibits even higher specificity, and is able to use as input models generated by any set of methods. PCONS is available as a part of various meta-servers, as well as a standalone server Pcons.net (http://pcons.net/) (Wallner et al. 2007).

3D-JURY, developed by Rychlewski and coworkers (Ginalski et al. 2003), is another automated meta-predictor that simply selects models from those produced by other servers. It takes as input any set of models, compares all against all, and selects the one that appears to contain the largest subset of commonly superimposable coordinates. The most important feature contributing to the success of 3D-JURY and its popularity among users is its scoring system, which allows models with a correctly predicted fold to be identified with confidence, even though it does not necessarily recognize the absolutely best model among similar “top solutions.” 3D-JURY is available as an integral part of the Bioinfo.pl metaserver (http://meta.bioinfo.pl/). Based on this algorithm, a very similar method, 3D-Consens, has been implemented as an optional post-processing tool in the GeneSilico meta-server (https://genesilico.pl/meta2/).
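The selection principle behind such consensus meta-predictors can be sketched in a few lines. The example below is a simplified illustration and not the published 3D-JURY scoring scheme: it assumes that a symmetric matrix of pairwise structural similarities (for instance, numbers of superimposable residues) has already been computed, and the `min_score` cutoff and the function name are hypothetical.

```python
from typing import List, Sequence


def consensus_select(similarity: Sequence[Sequence[float]], min_score: float = 0.0) -> int:
    """Return the index of the model that is most similar to all the others.

    similarity[i][j] is a precomputed structural similarity between models
    i and j. Pairs scoring below `min_score` are ignored, so that unrelated
    models do not contribute noise to the consensus.
    """
    n = len(similarity)
    totals: List[float] = []
    for i in range(n):
        totals.append(
            sum(
                similarity[i][j]
                for j in range(n)
                if j != i and similarity[i][j] >= min_score
            )
        )
    # Ties are resolved in favour of the model listed first.
    return max(range(n), key=totals.__getitem__)


# Toy example: models 0-2 share a fold, model 3 is an outlier.
toy_similarity = [
    [0, 50, 45, 5],
    [50, 0, 60, 4],
    [45, 60, 0, 6],
    [5, 4, 6, 0],
]
best_model = consensus_select(toy_similarity, min_score=10)
```

In this toy matrix the second model (index 1) is selected, because it agrees best with the other members of the majority cluster, regardless of how its own server ranked it.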

10.4 From Multiple Template-Based Models to Hybrids

Apart from meta-predictors that simply select models from the input set, another breed of meta-predictors has been developed that uses the unrefined models generated by primary servers as a “structural scrap-yard” from which to obtain “spare parts” to generate new models. 3D-SHOTGUN, developed by Fischer (Fischer 2003), was the first fully automated meta-predictor designed to assemble hybrid models from fragments of models obtained from independent FR methods (e.g., from different components of the BIOINBGU server (Fischer 2000)). In the first step, regions of structural similarity are identified for all initial models by pairwise superposition. Subsequently, for each residue in each model, the number of its occurrences in the superimposed regions of other models is counted and a hybrid model is assembled by taking the coordinates of each residue from a model having the highest count. Thus, for each initial model a hybrid model is constructed that contains the most common structural features of all models, often including more residues than any of the initial models. In the second step, the assembled models are assigned scores based on the combination of the original scores of their parent models (normalized to a similar scale) and the scores describing the structural similarity of the assembled model to other models, as determined by MAXSUB (Siew et al. 2000). For each cluster of highly similar assembled models only one representative model with the highest score is reported.

The rationale of the 3D-SHOTGUN strategy is the same as in the consensus methods (selectors) acting on complete models, namely, that recurring structural features observed in models obtained from different FR methods are more likely to be correct. The initial version of 3D-SHOTGUN generated models that comprised only Cα atoms and commonly exhibited stereochemical problems, such as implausible distances and angles and steric clashes between fragments taken from different parent models. In terms of coverage and root mean square deviation (RMSD) between the model and the native structure, however, the approach of hybrid model construction is superior to selection of one of the stereochemically more acceptable input models, as the hybrids are on average more complete and superimpose better on the native structure than the initial models. The method is sensitive to initial alignment errors: if none of the initial alignments is correct for a given region, it is unlikely that this region will be modeled correctly in the final structure. A new automated version, SHGUM, includes a crude refinement step, using MODELLER (Sali and Blundell 1993), to generate full-atom models with idealized stereochemistry, without gaps and collisions, and even with a slight improvement in the overall RMSD (Sasson and Fischer 2003). The method is available via the INUB server at http://inub.cse.buffalo.edu/query.html.
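The per-residue counting and assembly step described above can be sketched as follows. This is a minimal illustration rather than the 3D-SHOTGUN implementation: it assumes that all models have already been superimposed into a common frame, it treats a residue as “recurring” when its Cα atom lies within a hypothetical distance cutoff of the same residue in another model, and it builds a single hybrid instead of one hybrid per initial model. All names are invented for the example.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
Model = Dict[int, Coord]  # residue number -> C-alpha coordinate


def support_counts(models: List[Model], cutoff: float = 3.5) -> List[Dict[int, int]]:
    """For every residue of every model, count the other models that agree with it."""
    counts: List[Dict[int, int]] = []
    for i, model in enumerate(models):
        per_residue: Dict[int, int] = {}
        for res, xyz in model.items():
            per_residue[res] = sum(
                1
                for j, other in enumerate(models)
                if j != i and res in other and math.dist(xyz, other[res]) <= cutoff
            )
        counts.append(per_residue)
    return counts


def assemble_hybrid(models: List[Model], cutoff: float = 3.5) -> Model:
    """Take each residue from the model in which it has the highest support."""
    counts = support_counts(models, cutoff)
    hybrid: Model = {}
    for res in sorted({r for m in models for r in m}):
        candidates = [i for i, m in enumerate(models) if res in m]
        best = max(candidates, key=lambda i: counts[i][res])  # first model wins ties
        hybrid[res] = models[best][res]
    return hybrid


# Toy input: three partially overlapping models of a three-residue target.
model_a = {1: (0.0, 0.0, 0.0), 2: (1.0, 0.0, 0.0)}
model_b = {1: (0.5, 0.0, 0.0), 2: (1.2, 0.0, 0.0), 3: (5.0, 0.0, 0.0)}
model_c = {1: (10.0, 0.0, 0.0), 2: (1.1, 0.0, 0.0), 3: (5.5, 0.0, 0.0)}
hybrid = assemble_hybrid([model_a, model_b, model_c])
```

The toy hybrid covers all three residues although `model_a` covers only two, which mirrors the observation that hybrids often include more residues than any single parent. It also shows why stereochemical problems arise: neighbouring residues may come from different parents.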
FRankenstein’s Monster is another approach to meta-prediction by consensus and recombination of fragments, developed in the authors’ laboratory (Kosinski et al. 2003, 2005). It is similar to 3D-SHOTGUN, but goes beyond the identification of geometrical consensus by including evaluation of the models by statistical potentials and features an additional step of local realignment of uncertain regions. This helps not only to overcome the problem of selection of the optimal template, but also to correct initial alignment errors. Briefly, the GeneSilico meta-server (Kurowski and Bujnicki 2003) is used as a gateway to run diverse FR methods and to generate preliminary full-atom models from initial pairwise target-template alignments. The local quality of sequence–structure fit in these models is evaluated using the local MetaMQAP score as a fitness function (Pawlowski et al. 2008). The most probable folds are identified by clustering. For each fold, a hybrid model is assembled from fragments that are structurally similar in >40% of all preliminary models, while the remaining non-consensus fragments are selected based on the MetaMQAP local score. The initial hybrid model (the “FRankenstein’s monster”) typically exhibits stereochemical problems similar to those found in models generated by the 3D-SHOTGUN method. However, in the FRankenstein strategy the hybrid model is not directly refined, but instead it is superimposed onto the structures of the templates used, yielding a new target-multiple template sequence alignment, which is used to generate a new, stereochemically acceptable model by an “orthodox” CM procedure. The sequence–structure fit in the new model is re-evaluated with MetaMQAP, and regions of low local score are selected for further refinement. For each poorly scored non-consensus region, a set of new alignments is generated by progressively shifting the target sequence with a step of one residue in the direction of either terminus, within the region of overlap between the secondary structure elements found in the template structure and those predicted for the target. All resulting alignments are used to generate a new family of intermediate models, which are again evaluated and recombined to produce a hybrid model. The procedure is iterated until all regions in the protein core obtain an acceptable score or the score cannot be improved further.

The FRANKENSTEIN3D method generates models that retain the fragments confidently predicted by consensus (regardless of their fitness according to statistical potentials) and attempts to refine the alignment in the uncertain regions to maximize the fitness score. As demonstrated in CASP5 (Kosinski et al. 2003) and CASP6, where the groups from the authors’ laboratory ranked very high in the CM and FR categories (Kolinski and Bujnicki 2005; Kosinski et al. 2005), the application of this approach leads to very accurate target-template alignments, often more accurate than any of the initial alignments, provided that a template with a correct fold is identified by at least one of the FR servers used. The method automatically clusters all templates available from FR servers and allows multiple alternative models to be built automatically before alignment optimization. The current version of the method is available as the FRANKENSTEIN3D server at http://genesilico.pl/frankenstein/.

Another approach to overcome the problem of template selection and correction of alignment errors by recombination of alternative models was developed by Bates and coworkers (Contreras-Moreira et al. 2003a,b).
The in silico Protein Recombination method starts with a population of models built from alternative alignments to one or more templates sharing the same fold and uses a genetic algorithm with two mutually exclusive genetic operators – “recombination” of parent models with crossover points outside the regions of secondary structure and “mutation” by averaging the coordinates of two parent models. The fitness function acting as a selection agent is a free-energy estimate based on protein contact pair-potentials and side-chain solvation energies, estimated from their solvent accessible area. The method was shown to be able to improve alignments by recombining well-aligned regions from the initial models and to produce recombinant models that are comparable to the best initial model. However, the quality of the initial models is the upper limit for the quality of the final model (e.g., unlike the FRANKENSTEIN method, it does not produce new, potentially better alignments). It is also critically dependent on the confident identification of a correct fold. The in silico Protein Recombination method is available as a web server at http://www.bmm.icnet.uk/servers/3djigsaw/recomb/. Another method that implements a genetic algorithm for comparative modeling was developed by Sali and coworkers (John and Sali 2003). It is similar to the FRANKENSTEIN3D approach in that it continuously refers to the target-template alignments, modifies them locally, and assesses the result of these changes by evaluation of the corresponding models, generated by MODELLER (Sali and Blundell 1993). The genetic operators include recombination of the parent alignments (one
and two-point crossovers) and gap insertions/deletions/shifts that actually generate local changes in the parent alignment. The fitness function is based on a score that combines the evaluation of the model by a statistical potential (Melo et al. 2002), target-template sequence identity, and a measure of structural compactness. The method was shown to increase the average quality of the target-template alignment and the corresponding models, but is dependent on the initial choice of the templates; in addition, the inaccurate statistical potential is generally unable to choose the best model (John and Sali 2003). A recent addition to the repertoire of “recombinators” that act on the level of target-template alignments is the MULTICOM-cluster method (which is also a Model Quality Assessment protocol). It performs greedy merging of the top-scoring FR alignment with other alternative alignments that can fill gaps in the structure, generates models with MODELLER, and clusters the resulting models. In the first iteration, the clustering is used to find references among all single-template models, to obtain a global quality score for each model. In the second iteration, models are compared to references to obtain local quality scores, based on average superposition accuracy to the best reference. The models with best global quality are identified as solutions (Cheng et al. 2009).
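
The alignment-level genetic operators described above (crossover of parent alignments and local shifts) can be illustrated with a minimal sketch. It is not the implementation of John and Sali (2003) or of any other published method: the list-based alignment representation and the names `is_valid`, `one_point_crossover`, and `shift_region` are assumptions made here for illustration, and the fitness evaluation (model building and scoring) is deliberately left out.

```python
from typing import List, Optional

# Hypothetical representation: aln[i] is the template residue index aligned
# to target residue i, or None if target residue i is unaligned (a gap).
Alignment = List[Optional[int]]


def is_valid(aln: Alignment) -> bool:
    """Aligned template indices must strictly increase along the target."""
    aligned = [t for t in aln if t is not None]
    return all(a < b for a, b in zip(aligned, aligned[1:]))


def one_point_crossover(a: Alignment, b: Alignment, point: int) -> Optional[Alignment]:
    """Take target positions [0, point) from parent a and the rest from parent b.

    Returns None when the child would align residues out of order.
    """
    child = a[:point] + b[point:]
    return child if is_valid(child) else None


def shift_region(aln: Alignment, start: int, end: int, offset: int,
                 template_length: int) -> Optional[Alignment]:
    """Shift the template indices of target positions [start, end) by offset.

    Returns None when the shift leaves the template or breaks residue order.
    """
    child = list(aln)
    for i in range(start, end):
        if child[i] is not None:
            child[i] += offset
    if any(t is not None and not 0 <= t < template_length for t in child):
        return None
    return child if is_valid(child) else None
```

In a full procedure each valid child alignment would be turned into a three-dimensional model and scored; children that the operators reject (returned as `None`) are simply discarded.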

10.5 Fragment Assembly: A New Trend in De Novo Protein Structure Prediction

Modeling protein structure “de novo”, without a template, is very difficult, because the conformational space to be searched is so vast that it is practically impossible to simulate the folding of a model that includes all atoms of the polypeptide chain and the surrounding solvent molecules. Currently available methods and resources allow the simulation of up to about 1 μs of folding of full-atom representations of only small proteins (<100 residues), while most proteins are larger and fold on the timescale of milliseconds or even seconds. Therefore, the solvent is usually treated implicitly and various simplified models are used that have fewer degrees of freedom and exploit the repetitive nature of protein structure (see the chapter by Liwo and coworkers in this volume). These simplified models typically retain only certain atoms, such as Cα or Cβ, or “united atoms” in which several atoms are grouped together, such as the centers of mass of the side chains (Liwo et al. 2005; Sun 1995). The protein structure may be represented using a number of simplified schemes such as lattices or bond angles with discrete values (Geetha and Munson 1996; Kolinski 2004). Despite the considerable reduction of dimensionality of the structure space in simplified models, the polypeptide main chain remains highly flexible and requires many variables per residue to model the protein conformation accurately (Hunter and Subramaniam 2002). Significant progress in the field of “de novo” protein structure prediction has been prompted by the observation that the structure of the protein backbone can be represented quite accurately by using short fragments taken from other proteins (Claessens et al. 1989; Jones and Thirup 1986). Fragments up to ten residues long
provided an efficient method for interpreting electron density maps in protein crystallography and in building protein models from nuclear magnetic resonance (NMR) data (Kraulis and Jones 1987). Classification of protein loops has proven useful in comparative modeling, where the incomplete framework of a protein core has to be amended by “de novo” insertion of polypeptide segments (Donate et al. 1996; Oliva et al. 1997; Tramontano et al. 1989) (see also above). Several groups classified peptide backbone units with fixed or variable lengths into collections of fragments (Bystroff and Baker 1997; Camproux et al. 1999; Kolodny et al. 2002; Micheletti et al. 2000; Unger et al. 1989). Analysis of such recurring fragments identified local sequence–structure correlations in proteins (Bystroff et al. 1996; Han and Baker 1996) and suggested a new method for “de novo” protein structure prediction.

10.5.1 De Novo Modeling by Fragment Assembly (and Subsequent Refinement)

ROSETTA developed by Baker and coworkers (Simons et al. 1997) implements a model of folding in which short fragments of the protein chain alternate between different local conformations copied from segments of known, not necessarily homologous, protein structures. The probability of assuming a particular conformation is based on the similarity of the local sequence and predicted secondary structure of the target to sequence and structure from the template library, as in the “traditional” template-based methods for protein modeling. The early version of ROSETTA used the I-SITES library of fragments 7–19 residues in length that corresponded to one of 82 patterns of sequence/structure motifs commonly present in all known structures (Bystroff and Baker 1998). The current version uses a library of fragments 3–9 residues in length extracted from known structures, which are assembled by using a Monte Carlo (MC)-simulated annealing (SA) search strategy, in which fragments are randomly inserted into the protein chain by replacing the backbone torsion angles with those in the fragment. The resulting “decoy” conformation is then evaluated according to a database-derived pseudoenergy function that rewards native protein-like properties. Additionally, a number of heuristic filters can be used to discriminate “protein-like” decoys by virtue of contact order, topology of β-sheets, etc. In the standard protocol, ROSETTA uses a reduced representation with backbone heavy atoms and Cβ atoms explicitly included and the side chains represented by single centroids. ROSETTA is also capable of refinement of models using full-atom representation, special conformation modification operators, and a refined (more “physical”) energy function. 
During the simulation, a large set of decoys are generated (1,000 to many thousands or even millions, depending on the protein length and computing power used), which are then clustered to identify the largest populations of similar global conformations which correspond to the broadest free-energy minima. Full-atom models with explicit side-chain rotamers can be rebuilt before or after clustering (see the recent review of ROSETTA by Das and Baker (2008)).
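
The fragment-insertion search described above can be sketched as a small Monte Carlo simulated annealing loop over backbone torsion angles. This is a toy illustration under stated assumptions, not ROSETTA code: the function names, the geometric cooling schedule, and the placeholder energy `helix_bias_energy` (which merely rewards helix-like torsions) are all hypothetical, and a real scoring function would evaluate the three-dimensional decoy.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Torsions = List[Tuple[float, float]]         # (phi, psi) per residue, in degrees
FragmentLibrary = Dict[int, List[Torsions]]  # start position -> candidate fragments


def insert_fragment(chain: Torsions, start: int, fragment: Torsions) -> Torsions:
    """Return a copy of chain with the torsions at start.. replaced by the fragment's."""
    if start < 0 or start + len(fragment) > len(chain):
        raise ValueError("fragment does not fit into the chain")
    new_chain = list(chain)
    new_chain[start:start + len(fragment)] = fragment
    return new_chain


def helix_bias_energy(chain: Torsions) -> float:
    """Toy placeholder energy: squared deviation from helix-like torsions."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in chain)


def fragment_assembly(chain: Torsions, library: FragmentLibrary,
                      energy: Callable[[Torsions], float], steps: int = 2000,
                      t_start: float = 2.0, t_end: float = 0.1,
                      seed: int = 0) -> Tuple[Torsions, float]:
    """Random fragment insertions accepted by the Metropolis criterion while cooling."""
    rng = random.Random(seed)
    current, e_current = list(chain), energy(chain)
    best, e_best = current, e_current
    positions = sorted(library)
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        start = rng.choice(positions)
        trial = insert_fragment(current, start, rng.choice(library[start]))
        e_trial = energy(trial)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_trial <= e_current or rng.random() < math.exp((e_current - e_trial) / temperature):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best
```

Running `fragment_assembly` many times with different seeds would yield a set of decoys; the clustering step mentioned in the text is not part of this sketch.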

The difference between ROSETTA and most template-based methods for fragment recombination lies in the stochastic and iterative character of this process and in the utilization of multiple small fragments of different, unrelated proteins (template-based methods use the whole structure of one protein or a few related templates). Thus, ROSETTA can generate a “de novo” model by allowing the full-length polypeptide chain to explore conformational space via the fragment insertion search method, even if no homologous or analogous template structure is available. Nonetheless, if a template structure is available, the conserved parts of the target can be built as in traditional CM, while the variable parts are allowed to explore the conformational space with fragments in a fashion similar to the “de novo” protocol, but in the context of the template (Rohl 2005). As demonstrated already in CASP5 (Bradley et al. 2003), ROSETTA is capable of generating native-like protein models either de novo (i.e., without any template structure) or by adding long insertions and N- and C-terminal extensions to a template that matches only a part of the modeled protein. Among the “winners” of CASP, ROSETTA continues to be the only “de novo” method that has been made available to the academic community both as the source code of the standalone program and as a web server. It is available in two versions: as a part of the GINZU/ROBETTA meta-server developed by the Baker group (Kim et al. 2004) (http://robetta.bakerlab.org/), which uses a hybrid strategy of template-based modeling if a template is available and template-free modeling if not, and in conjunction with the alternative fragment library I-SITES and the fragment assignment method HMMSTR (http://www.bioinfo.rpi.edu/~bystrc/hmmstr/), developed and maintained by the Bystroff group (Bystroff and Shao 2002).
SIMFOLD is another fragment-assembly method developed by Shoji Takada and coworkers (Chikenji et al. 2003). The original method, which performed quite well in CASP6 as “ROKKO” and “ROKKY”, is similar to ROSETTA. It uses 4–9 residue fragments and its energy function consists of various interactions that are based on physical considerations (Fujitsuka et al. 2004). Yet, SIMFOLD exhibits an important difference: it introduces reversible fragment insertion, i.e., when a new fragment is inserted at a junction between two fragments, the replaced “old” conformation comprising elements of two different fragments is added to the library of fragments, so it can be re-inserted. The latter operation is not possible in ROSETTA, which uses only fragments from the original database. This modification satisfies the detailed balance condition, providing the basis for the application of a multicanonical ensemble Monte Carlo method (MEMC), as used by the human team ROKKO in CASP6. MEMC is more effective in finding low-energy conformations than the conventional SA method (Chikenji et al. 2003). SIMFOLD was available as a server, but unfortunately it is no longer publicly available.
PROFESY developed by Lee and coworkers (Lee et al. 2004) is similar to SIMFOLD in that it attempts to improve the poor sampling efficiency of the traditional SA method and uses a physics-based energy function rather than a statistical potential. The global minimization of the energy function is thus performed by the conformational space annealing (CSA) method (Liwo et al. 1999). The fragment library is constructed using the secondary structure prediction method
PREDICT and comprises a collection of 15-residue long backbone structures. To our knowledge, this method is not yet publicly available. FRAGFOLD developed by Jones uses supersecondary structural fragments (comprising 2 or 3 sequential secondary structures) from a library of high-resolution protein structures as well as small (3, 4, and 5 residues) fragments (Jones and McGuffin 2003). Possible supersecondary fragments are assigned to the target sequence by a threading procedure similar to that in the GenTHREADER FR algorithm (Jones 1999). The global structure is assembled by a genetic algorithm or a simulated annealing method, in which half of random moves correspond to the insertion of a pre-selected supersecondary structure fragment and the other half involve a completely free choice of one of the small fragments. Conformations that lack steric clashes and pass the checks for protein-like compactness and hydrogen bonding are clustered to identify representatives of the most probable folds. FRAGFOLD is not available as a server, but a standalone version that runs on GPU exists (http://bioinfadmin.cs.ucl.ac.uk/downloads/gpufragfold/). UNDERTAKER is a method developed by Karplus and coworkers that assembles the target structure using fragments of known structures obtained from three sources: a generic library of very short segments (1–4 residues) that must exactly match the target sequence, medium-length segments (9–12 residues) that are assigned by the FRAGFINDER program from the SAM suite, and variable-length segments assigned by FR analysis (Karchin et al. 2003). In addition to fragment replacement, UNDERTAKER implements an alignment replacement operation in which a complete FR match is imported into the model, allowing the replacement of several segments at once in the same orientation as they occur in the template structure. 
UNDERTAKER uses a genetic algorithm for the stochastic search and includes a crossover operation that allows recombining different conformations. The cost function used to assess the decoys includes many tunable parameters, among which the most important one, as the name of the method implies, is the burial. To our knowledge, UNDERTAKER is not yet publicly available. ABLE developed by Shimizu and coworkers (Ishida et al. 2003) is also based on fragment assembly, but it assigns main-chain dihedral angles individually to each residue. The energy function is similar to that used in ROSETTA. ABLE method has two interesting features that help to avoid problems if the initial distribution of decoys is too broad and no clusters can be identified based on the RMSD as a measure of the distance between the conformations. First, it uses the unit-vector root mean square distance (URMS) (Kedem et al. 1999) as a measure of structural similarity. Second, if not enough clusters with sufficient size and density are obtained, the fragment assembly search is reiterated, but with additional spatial restraints obtained from the consensus substructures in the models generated by the previous minimization procedure. To our knowledge, ABLE is not yet publicly available. The SALAMI method developed by Andrew Torda and coworkers uses a small alphabet of 300 fragment templates derived from clusters of 6-residue fragments in a non-redundant version of the Protein Data Bank (PDB) database (Margraf et al. 2009). Each fragment template includes information about average geometry of residues and distribution of deviations from average values. These distributions
are used to derive energies that, along with steric clash potentials, are used to find the lowest-energy conformation. This conformation results from simulated annealing in torsion angle space for each randomly chosen sequence of fragments. Results for different sequences of fragments are treated as a set of decoys and clustered to obtain the final solution. This approach gives more complete coverage of fragment space than many other methods, because it generalizes the single, average geometries used, for example, in ROSETTA to fragment template distributions, defined as histograms of torsion angles. SALAMI is available as a web server at http://www.zbh.uni-hamburg.de/salami.
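
To make the idea of deriving energies from torsion-angle histograms more concrete, the sketch below bins observed angles and converts bin probabilities into pseudo-energies. The text does not state the functional form used by SALAMI; the negative-logarithm (Boltzmann-like) conversion, the bin count, the pseudocount, and the function names are assumptions chosen only for illustration.

```python
import math
from typing import List, Sequence


def _bin_index(angle_deg: float, n_bins: int) -> int:
    """Map an angle in degrees (any range) to a bin covering [-180, 180)."""
    width = 360.0 / n_bins
    wrapped = (angle_deg + 180.0) % 360.0
    return int(wrapped // width) % n_bins


def torsion_histogram(angles_deg: Sequence[float], n_bins: int = 36,
                      pseudocount: float = 1.0) -> List[float]:
    """Normalized histogram of torsion angles; the pseudocount avoids empty bins."""
    counts = [pseudocount] * n_bins
    for angle in angles_deg:
        counts[_bin_index(angle, n_bins)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def pseudo_energy(probabilities: Sequence[float], angle_deg: float) -> float:
    """Assumed conversion: frequently observed angles get low pseudo-energy."""
    return -math.log(probabilities[_bin_index(angle_deg, len(probabilities))])
```

With such a table per torsion angle of a fragment template, a conformation close to the frequently observed geometry scores lower (better) than one in a rarely populated region.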

10.5.2 Hybrid Methods Involving Fragment Assembly and Folding Simulations

ZAM is an interesting method developed by the group of Kenneth Dill, which uses a unique combination of assembly and ab initio folding simulation (Ozkan et al. 2007). The target protein sequence is divided into small pieces, which are folded by a physics-based ab initio method and only then assembled. It will be interesting to see if computational methods such as this one, which explore the conformation of fragments by ab initio calculations, will match the performance of de novo methods that simply derive the fragments from a database of known structures. To our knowledge, the ZAM method is unfortunately not available publicly.
An alternative approach to fragment assembly, and one with a long history, is that of the lattice representation, in which residues are restricted to points on a regular three-dimensional lattice (Hinds and Levitt 1992; Skolnick and Kolinski 1991). These methods allow very fast sampling of the conformational space, but their ability to represent atomic details and to use physics-based energy functions is limited. Following the success of fragment-assembly methods, several hybrid methods arose, which combine the strengths of both approaches.
TASSER developed by Skolnick and coworkers (Zhang and Skolnick 2004a) starts with an FR analysis based on the PROSPECTOR threading method (Skolnick et al. 2004) that identifies either a single consensus fold or a set of templates with globally distinct folds. Based on the FR alignments, the protein chain is divided into contiguous aligned regions of at least 5 residues (20.7 residues on average, according to the authors’ own benchmark) and gapped unaligned regions. The conformation of aligned regions is copied from the templates and remains unchanged during the assembly, while the unaligned regions are represented on an underlying cubic lattice as in the earlier models developed by Skolnick, Kolinski, and coworkers (Kihara et al. 2001). A series of initial models is generated and submitted for assembly and refinement to the parallel hyperbolic Replica Exchange Monte Carlo (REMC) sampling method. Structures generated in the lowest-temperature replicas are subjected to iterative clustering using SPICKER (Zhang and Skolnick 2004b) to identify the final models based on the cluster density. I-TASSER is a newer variant of this method, which substitutes fragments modeled de novo for
analogous fragments found in known structures in the PDB database and uses them to provide restraints to guide the folding simulation during the following iteration. TASSER and its variants have been very successful in CASP since CASP6. I-TASSER is available as a web server at http://zhanglab.ccmb.med.umich.edu/I-TASSER/ (Wu et al. 2007). TASSERLite, a version that allows for refinement of comparative models but not full de novo modeling (Pandit et al. 2006), is available from the Skolnick group website (http://cssb.biology.gatech.edu/skolnick/webservice/tasserlite/index.html).
Another hybrid method, involving the recombination of fragments and lattice-based modeling, was developed by one of the authors of this chapter (J.M.B.) in cooperation with Andrzej Kolinski, by combining the FRankenstein’s Monster method (Kosinski et al. 2003) (see also above) for generation of initial models with the reduced lattice model CABS (Kolinski 2004). Briefly, preliminary hybrid models are generated with the template-based recombination method and scored with MQAP software to identify well-folded fragments, as described earlier. These fragments are not considered directly, but are used as a source of spatial restraints to guide the REMC folding simulation using the CABS model. The resulting decoys are clustered using the HCPM method (Gront and Kolinski 2005) to identify the final models. This method performed very well in the CASP6 evaluation (Kolinski and Bujnicki 2005), but has never been implemented as fully integrated and automated software.
One of the authors of this chapter (M.P.) has recently developed a new fragment-based method, which participated as “GS-KudlatyPred” in CASP8 (Pawlowski 2009). This method is to some extent related to the “FRankenstein’s Monster” approach, but operates essentially only on the three-dimensional fragments, without toggling between the three-dimensional and alignment representations. As input, this method takes a set of models built with any method (comparative or de novo) and extracts fragments comprising 1–4 sequentially occurring secondary structures. Each of the initial models is scored by MetaMQAPcons (a derivative of the method described in the article by Pawlowski et al. (2008)). Five models with the best global score are selected as the reference models. All possible combinations of supersecondary structure fragments are generated and ranked based on the sum of three components: (1) the MetaMQAPcons score of all fragments in the combination; (2) a GDT_TS score describing the fit of each fragment onto the closest “root model”; and (3) the degree of structural similarity in the area of overlap between neighboring fragments. In the last step, hybrid models are built for the 200 top-scored combinations of fragments, using MODELLER (Fiser and Sali 2003) in a multi-template mode, and ranked by the MetaMQAPcons method.
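
The ranking of fragment combinations by a sum of three components can be sketched as follows. This is only an illustration of the enumeration-and-ranking idea, not the GS-KudlatyPred code: the `Fragment` fields are stand-ins for the MetaMQAPcons and GDT_TS terms, and `overlap_similarity` is a hypothetical callback supplied by the caller.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Fragment:
    source_model: str        # model the fragment was extracted from
    quality: float           # stand-in for a model-quality score of the fragment
    fit_to_reference: float  # stand-in for the fit onto the closest reference model


def rank_combinations(
    candidates: Sequence[Sequence[Fragment]],
    overlap_similarity: Callable[[Fragment, Fragment], float],
    top_n: int = 200,
) -> List[Tuple[float, Tuple[Fragment, ...]]]:
    """Score every combination (one fragment per segment) and keep the best top_n."""
    scored = []
    for combo in product(*candidates):
        quality = sum(f.quality for f in combo)
        fit = sum(f.fit_to_reference for f in combo)
        overlap = sum(overlap_similarity(a, b) for a, b in zip(combo, combo[1:]))
        scored.append((quality + fit + overlap, combo))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]
```

Exhaustive enumeration grows multiplicatively with the number of candidates per segment, so this sketch is practical only for small candidate sets; the selected combinations would then be passed to a model-building program.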

10.5.3 Other Methods Based on Fragment Prediction

All the aforementioned methods for protein structure prediction by recombination use contiguous fragments of protein backbone. Another approach to protein
structure prediction is based on the concept of three-dimensional structural descriptors developed by Kryshtafovych and Fidelis (unpublished analysis cited in Hvidsten et al. (2003)), i.e., substructures that encompass a set of non-contiguous protein backbone fragments residing within a spatial neighborhood of a specific residue. The calculation of descriptors for all known protein structures, followed by clustering similar descriptors into groups, revealed certain sequence preferences that can be interpreted as the propensity of particular residues to be accommodated within particular substructures (Hvidsten et al. 2003), similar to the observation made for single contiguous fragments, e.g., in the I-SITES library (Bystroff et al. 2000). Based on these correlations it is possible to identify descriptors matching the target sequence and to predict a three-dimensional fold that is most compatible with these descriptors, without building an explicit three-dimensional model of the target structure (Hvidsten et al. 2003). In principle, it may be possible to assemble the tertiary structure of the target from the descriptors that contain multiple backbone fragments, but to the authors’ knowledge such a method has not yet been developed. It remains to be seen if the three-dimensional descriptor approach will ever lead to a practically useful method for tertiary structure prediction that would become available publicly.

10.6 Why Are the Fragments-Assembly Methods So Successful?

Template-based methods, especially FR meta-servers, have been found to produce exceptionally good predictions and are now widely used for protein structure prediction. In particular, their relatively low computational cost makes them very useful for large-scale analyses, e.g., for building models for proteins encoded in whole genomes. However, all template-based methods suffer from the fundamental limitation of being able to recognize only the folds that have already been observed. The results of structural genomics initiatives reveal that the majority of proteins belong to the previously characterized folds, but the percentage of structures with “new” folds or variations of “old” folds that cannot be accurately predicted by FR methods remains significant. On the other hand, physics-based methods for “ab initio” folding are extremely costly in terms of computing power, even if they use a reduced representation, and do not yet successfully fold large proteins. However, even when a novel fold is discovered, it usually turns out to be composed of common structural motifs, often at the level of supersecondary or even larger structures. Levitt and coworkers (Kolodny et al. 2002) have demonstrated that all proteins in the PDB can be modeled accurately from rigid fragments of unrelated proteins that are concatenated without any degrees of freedom. Skolnick and coworkers (Kihara and Skolnick 2003; Zhang et al. 2005) have shown that most proteins in the PDB have significant structural alignments to other proteins in a different secondary structure and fold class. Thus, modeling of new folds can be greatly facilitated by assembling them from fragments of known structures identified by “local fold recognition,” rather than attempting to model the whole process of protein folding based on first principles.

The success of methods based on fragment assembly cannot be attributed solely to the restriction of the conformational space, because such a restriction can also be achieved by other reduced models (e.g., “pure” lattice models) that are less successful. As emphasized by Takada and coworkers (Chikenji et al. 2003), one of the problems of the contemporary energy functions, those based on physics and statistics alike, is the limited ability to capture the subtleties of interactions between the neighboring residues (side-chain/main-chain hydrogen bonding, side-chain configurational entropy loss, etc.), which govern the local torsional propensities. Computing the local interaction energies “ab initio” may lead to accumulation of inaccuracies and greatly decrease the chances of obtaining a globally correct model. Methods that utilize fragments avoid this problem by sampling local conformations that exist in native protein structures, which provides an implicit, yet accurate representation of local interactions. Thus, a single-fragment substitution corresponds to instantly transporting the modeled protein from one local energy minimum to another, without the necessity of overcoming local energetic barriers. This enormously speeds up the search for the global energy minimum and allows shifting the focus to the generation of non-local conformational changes and identification of globally native-like structures.
The conservation of local structure may make not only physical, but also evolutionary sense. Lupas et al. (2001) proposed a scenario in which modern proteins evolved from ancient short-peptide ancestors, called antecedent domain segments (ADSs). They suggested that the ancestors of contemporary (sub)domains arose by spontaneous non-covalent association of peptides with native-like and/or tertiary-like structural features, and since such assemblies provided a functional advantage (e.g., due to improved stability of the individual fragments or their increased effective concentration), the fusion of primitive genes encoding these fragments was preferentially selected by evolution. It is noteworthy that attempts to form folded and functional proteins by recombination revealed that successfully recombined fragments called “schemas” often correspond to known supersecondary structural elements (Voigt et al. 2002). This hypothetical “mix-and-join” scenario convincingly explains not only the structure of repetitive proteins, such as propellers, TIM-barrels, and helical bundles, but may also be invoked to explain the origin of more complicated and asymmetrical domains (Soding and Lupas 2003).

10.7 Conclusions and Outlook

In the earlier version of this chapter the main author (J.M.B.) predicted that in the “near future” with respect to the year 2006 (i.e., the near past with respect to the year 2010) we would see more integration of the most successful approaches, that is, meta-prediction and assembly of fragments, and further convergence of the evolutionary and physical schools. In the opinion of the authors of this chapter this has indeed happened. Essentially all methods that currently score best in CASP rely on some sort of meta-prediction, either with the use of external servers or ones constructed in house. In parallel, emphasis increases on the use of physics-based methods for the refinement of models that are close to the native structure, but could
be even closer. So what is our prediction for the nearest future, i.e., a few years following the publication of this volume? The results of recent CASP experiments demonstrate that the progress in protein structure prediction is negligible and comes more from the area of “information technology” than “science.” In our opinion this indicates that the field of protein structure prediction has grown old and the only alternative to retirement is a major breakthrough rather than just recombination of what is already available. While we keep our fingers crossed for such a breakthrough on any line of the stalled protein front, we predict that an increased number of researchers will turn away from the field of protein structure prediction to move to other fields. One of the interesting directions is the testing of the applicability of techniques developed for protein three-dimensional structure prediction in the emerging field of RNA three-dimensional structure prediction, which on the other hand offers many interesting solutions that could be used to refresh the aging field of protein bioinformatics. The Baker group has already developed a version of “ROSETTA for RNA”, dubbed FARNA (Fragment Assembly of RNA) (Das and Baker 2007). So is the RNA structure the New World for conquistadors from the Protein Continent? It remains to be seen. See you there!

Acknowledgments Our recent research in the field of structural bioinformatics has been funded by the Polish Ministry of Scientific Research and Higher Education (grant numbers: POIG.02.03.00-00-003/09, 188/N-DFG/2008/0, N301 106 32/3600, and PBZ-MNiI2/1/2005), by the NIH (grant numbers R01GM081680 and R03TW007163-01), by the European Commission (grant numbers LSHG-CT-2003-503238, LSHG-CT-2005-518238, MRTN-CT-2005-019566, 229676, and RIDS 011934), and by HFSP program RGP 55/2006.

References

Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230
Anfinsen CB, Haber E, Sela M, White FH Jr (1961) The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc Natl Acad Sci USA 47:1309–1314
Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294:93–96
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) GenBank. Nucleic Acids Res 33:D34–38
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28:235–242
Bradley P, Chivian D, Meiler J, Misura KM, Rohl CA, Schief WR, Wedemeyer WJ, Schueler-Furman O, Murphy P, Schonbrun J, Strauss CE, Baker D (2003) ROSETTA predictions in CASP5: successes, failures, and prospects for complete automation. Proteins 53 Suppl 6:457–468
Bujnicki JM (2006) Protein structure prediction by recombination of fragments. ChemBioChem 7(1):19–27
Bujnicki JM, Elofsson A, Fischer D, Rychlewski L (2001a) LiveBench-2: large-scale automated evaluation of protein structure prediction servers. Proteins Suppl 5:184–191
Bujnicki JM, Elofsson A, Fischer D, Rychlewski L (2001b) Structure prediction meta server. Bioinformatics 17:750–751
Bystroff C, Baker D (1997) Blind predictions of local protein structure in CASP2 targets using the I-sites library. Proteins Suppl 1:167–171
Bystroff C, Baker D (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281:565–577
Bystroff C, Shao Y (2002) Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA. Bioinformatics 18(Suppl 1):S54–S61
Bystroff C, Simons KT, Han KF, Baker D (1996) Local sequence-structure correlations in proteins. Curr Opin Biotechnol 7:417–421
Bystroff C, Thorsson V, Baker D (2000) HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J Mol Biol 301:173–190
Camproux AC, Tuffery P, Chevrolat JP, Boisvieux JF, Hazout S (1999) Hidden Markov model approach for identifying the modular framework of the protein backbone. Protein Eng 12:1063–1073
Cheng J, Wang Z, Tegge AN, Eickholt J (2009) Prediction of global and local quality of CASP8 models by MULTICOM series. Proteins 77 Suppl 9:181–184
Chikenji G, Fujitsuka Y, Takada S (2003) A reversible fragment assembly method for de novo protein structure prediction. J Chem Phys 119:6895–6903
Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826
Claessens M, Van Cutsem E, Lasters I, Wodak S (1989) Modelling the polypeptide backbone with ‘spare parts’ from known protein structures. Protein Eng 2:335–345
Cohen-Gonsaud M, Catherinot V, Labesse G, Douguet D (2004) From molecular modeling to drug design. In: Bujnicki JM (ed) Practical bioinformatics, vol 15. Springer, Berlin, pp 35–71
Contreras-Moreira B, Fitzjohn PW, Bates PA (2003a) In silico protein recombination: enhancing template and sequence alignment selection for comparative protein modelling. J Mol Biol 328:593–608
Contreras-Moreira B, Fitzjohn PW, Offman M, Smith GR, Bates PA (2003b) Novel use of a genetic algorithm for protein structure prediction: searching template and sequence alignment space. Proteins 53 Suppl 6:424–429
Cymerman IA, Feder M, Pawlowski M, Kurowski MA, Bujnicki JM (2004) Computational methods for protein structure prediction and fold recognition. In: Bujnicki JM (ed) Practical bioinformatics, vol 15. Springer, Berlin, pp 1–21
Das R, Baker D (2007) Automated de novo prediction of native-like RNA tertiary structures. Proc Natl Acad Sci USA 104:14664–14669
Das R, Baker D (2008) Macromolecular modeling with ROSETTA. Annu Rev Biochem 77:363–382
Donate LE, Rufino SD, Canard LH, Blundell TL (1996) Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci 5:2600–2616
Douguet D, Labesse G (2001) Easier threading through web-based comparisons and cross-validations. Bioinformatics 17:752–753
Fischer D (2000) Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pacific Symp Biocomp 5:119–130
Fischer D (2003) 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins 51:434–441
Fischer D, Elofsson A, Rychlewski L, Pazos F, Valencia A, Rost B, Ortiz AR, Dunbrack RL Jr (2001) CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins Suppl 5:171–183
Fiser A, Sali A (2003) Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol 374:461–491
Fujitsuka Y, Takada S, Luthey-Schulten ZA, Wolynes PG (2004) Optimizing physical energy functions for protein folding. Proteins 54:88–103
Geetha V, Munson PJ (1996) Simplified representation of proteins. J Biomol Struct Dyn 13:781–793


Ginalski K, Elofsson A, Fischer D, Rychlewski L (2003) 3D-jury: a simple approach to improve protein structure predictions. Bioinformatics 19:1015–1018
Ginalski K, Grishin NV, Godzik A, Rychlewski L (2005) Practical lessons from protein structure prediction. Nucleic Acids Res 33:1874–1891
Godzik A, Kolinski A, Skolnick J (1992) Topology fingerprint approach to the inverse protein folding problem. J Mol Biol 227:227–238
Gront D, Kolinski A (2005) HCPM – program for hierarchical clustering of protein models. Bioinformatics 21:3179–3180
Han KF, Baker D (1996) Global properties of the mapping between local amino acid sequence and local structure in proteins. Proc Natl Acad Sci USA 93:5814–5818
Hinds DA, Levitt M (1992) A lattice model for protein structure prediction at low resolution. Proc Natl Acad Sci USA 89:2536–2540
Hunter CG, Subramaniam S (2002) Natural coordinate representation for the protein backbone structure. Proteins 49:206–215
Hvidsten TR, Kryshtafovych A, Komorowski J, Fidelis K (2003) A novel approach to fold recognition using sequence-derived properties from sets of structurally similar local fragments of proteins. Bioinformatics 19(Suppl 2):II81–II91
Ishida T, Nishimura T, Nozaki M, Inoue T, Terada T, Nakamura S, Shimizu K (2003) Development of an ab initio protein structure prediction system ABLE. Genome Inform Ser Workshop Genome Inform 14:228–237
John B, Sali A (2003) Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res 31:3982–3992
Jones DT (1999) GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287:797–815
Jones DT, McGuffin LJ (2003) Assembling novel protein folds from super-secondary structural fragments. Proteins 53(Suppl 6):480–485
Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature 358:86–89
Jones TA, Thirup S (1986) Using known substructures in protein model building and crystallography. EMBO J 5:819–822
Karchin R, Cline M, Mandel-Gutfreund Y, Karplus K (2003) Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins 51:504–514
Kedem K, Chew LP, Elber R (1999) Unit-vector RMS (URMS) as a tool to analyze molecular dynamics trajectories. Proteins 37:554–564
Kihara D, Lu H, Kolinski A, Skolnick J (2001) TOUCHSTONE: an ab initio protein structure prediction method that uses threading-based tertiary restraints. Proc Natl Acad Sci USA 98:10125–10130
Kihara D, Skolnick J (2003) The PDB is a covering set of small protein structures. J Mol Biol 334:793–802
Kim DE, Chivian D, Baker D (2004) Protein structure prediction and analysis using the ROBETTA server. Nucleic Acids Res 32:W526–531
Kolinski A (2004) Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 51:349–371
Kolinski A, Bujnicki JM (2005) Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins 61(Suppl 7):84–90
Kolodny R, Koehl P, Guibas L, Levitt M (2002) Small libraries of protein fragments model native protein structures accurately. J Mol Biol 323:297–307
Kosinski J, Cymerman IA, Feder M, Kurowski MA, Sasin JM, Bujnicki JM (2003) A “FRankenstein’s monster” approach to comparative modeling: merging the finest fragments of fold-recognition models and iterative model refinement aided by 3D structure evaluation. Proteins 53(Suppl 6):369–379


Kosinski J, Gajda MJ, Cymerman IA, Kurowski MA, Pawlowski M, Boniecki M, Obarska A, Papaj G, Sroczynska-Obuchowicz P, Tkaczuk KL, Sniezynska P, Sasin JM, Augustyn A, Bujnicki JM, Feder M (2005) FRankenstein becomes a cyborg: the automatic recombination and realignment of fold recognition models in CASP6. Proteins 61(Suppl 7):106–113
Kraulis PJ, Jones TA (1987) Determination of three-dimensional protein structures from nuclear magnetic resonance data using fragments of known structures. Proteins 2:188–201
Krieger E, Nabuurs SB, Vriend G (2003) Homology modeling. Methods Biochem Anal 44:509–523
Kryshtafovych A, Fidelis K (2009) Protein structure prediction and model quality assessment. Drug Discov Today 14:386–393
Kurowski MA, Bujnicki JM (2003) GeneSilico protein structure prediction meta-server. Nucleic Acids Res 31:3305–3307
Lee J, Kim SY, Joo K, Kim I (2004) Prediction of protein tertiary structure using PROFESY, a novel method based on fragment assembly and conformational space annealing. Proteins 56:704–714
Liwo A, Khalili M, Scheraga HA (2005) Ab initio simulations of protein-folding pathways by molecular dynamics with the united-residue model of polypeptide chains. Proc Natl Acad Sci USA 102:2362–2367
Liwo A, Lee J, Ripoll DR, Pillardy J, Scheraga HA (1999) Protein structure prediction by global optimization of a potential energy function. Proc Natl Acad Sci USA 96:5482–5485
Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A (2001) Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci 10:2354–2362
Lupas AN, Ponting CP, Russell RB (2001) On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134:191–203
Margraf T, Schenk G, Torda AE (2009) The SALAMI protein structure search server. Nucleic Acids Res 37:W480–484
Melo F, Sanchez R, Sali A (2002) Statistical potentials for fold assessment. Protein Sci 11:430–448
Micheletti C, Seno F, Maritan A (2000) Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies. Proteins 40:662–674
Moult J, Pedersen JT, Judson R, Fidelis K (1995) A large-scale experiment to assess protein structure prediction methods. Proteins 23:ii–v
Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A (2005) Critical assessment of methods of protein structure prediction (CASP)-round 6. Proteins 61(Suppl 7):3–7
Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A (2007) Critical assessment of methods of protein structure prediction – Round VII. Proteins 69(Suppl 8):3–9
Moult J, Fidelis K, Kryshtafovych A, Rost B, Tramontano A (2009) Critical assessment of methods of protein structure prediction – Round VIII. Proteins 77(Suppl 9):1–4
Oliva B, Bates PA, Querol E, Aviles FX, Sternberg MJ (1997) An automated classification of the structure of protein loops. J Mol Biol 266:814–830
Ozkan SB, Wu GA, Chodera JD, Dill KA (2007) Protein folding by zipping and assembly. Proc Natl Acad Sci USA 104:11987–11992
Pandit SB, Zhang Y, Skolnick J (2006) TASSER-Lite: an automated tool for protein comparative modeling. Biophys J 91:4180–4190
Pawlowski M (2009) Rozwój metod udokładniania i oceny poprawności teoretycznych modeli struktur białek i zastosowanie ich w technologii Molecular Replacement (MR). Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, p. 111 (in Polish, PhD)
Pawlowski M, Gajda MJ, Matlak R, Bujnicki JM (2008) MetaMQAP: a meta-server for the quality assessment of protein models. BMC Bioinformatics 9:403
Peitsch MC (1995) Protein modelling by e-mail. Bio/Technology 13:658–660
Pruitt KD, Tatusova T, Maglott DR (2007) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35:D61–5


Rohl CA (2005) Protein structure estimation from minimal restraints using ROSETTA. Methods Enzymol 394:244–260
Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815
Sasson I, Fischer D (2003) Modeling three-dimensional protein structures for CASP5 using the 3D-SHOTGUN meta-predictors. Proteins 53(Suppl 6):389–394
Scheraga HA (1996) Recent developments in the theory of protein folding: searching for the global energy minimum. Biophys Chem 59:329–339
Siew N, Elofsson A, Rychlewski L, Fischer D (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 16:776–785
Simons KT, Kooperberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209–225
Skolnick J, Kihara D, Zhang Y (2004) Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins 56:502–518
Skolnick J, Kolinski A (1991) Dynamic Monte Carlo simulations of a new lattice model of globular protein folding, structure and dynamics. J Mol Biol 221:499–531
Soding J, Lupas AN (2003) More than the sum of their parts: on the evolution of proteins from peptides. Bioessays 25:837–846
Sun S (1995) Reduced representation approach to protein tertiary structure prediction: statistical potential and simulated annealing. J Theor Biol 172:13–32
Tramontano A, Chothia C, Lesk AM (1989) Structural determinants of the conformations of medium-sized loops in proteins. Proteins 6:382–394
Unger R, Harel D, Wherland S, Sussman JL (1989) A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins 5:355–373
Voigt CA, Martinez C, Wang ZG, Mayo SL, Arnold FH (2002) Protein building blocks preserved by recombination. Nat Struct Biol 9:553–558
Wallner B, Elofsson A (2003) Can correct protein models be identified? Protein Sci 12:1073–1086
Wallner B, Larsson P, Elofsson A (2007) Pcons.net: protein structure prediction meta server. Nucleic Acids Res 35:W369–374
Wu S, Skolnick J, Zhang Y (2007) Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol 5:17
Wu S, Zhang Y (2007a) LOMETS: a local meta-threading-server for protein structure prediction. Nucleic Acids Res 35:3375–3382
Zhang Y, Arakaki AK, Skolnick J (2005) TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins 61(Suppl 7):91–98
Zhang Y, Skolnick J (2004a) Automated structure prediction of weakly homologous proteins on a genomic scale. Proc Natl Acad Sci USA 101:7594–7599
Zhang Y, Skolnick J (2004b) SPICKER: a clustering approach to identify near-native protein folds. J Comput Chem 25:865–871

Chapter 11

Genome-Wide Protein Structure Prediction

Srayanta Mukherjee, Andras Szilagyi, Ambrish Roy, and Yang Zhang

Abstract The post-genomic era has witnessed an explosion of protein sequences in the public databases, but this has not been complemented by the availability of genome-wide structure and function information, owing to the technical difficulties and labor expenses incurred by existing experimental techniques. The rapid advancements in computer-based protein structure prediction methods have enabled automated yet reliable generation of three-dimensional (3D) structural models of proteins. Genome-scale structure prediction experiments have been conducted by a number of groups, starting as early as 1997, and some noteworthy efforts have been made using the MODELLER and ROSETTA methods. Along another line, TOUCHSTONE was used to predict the structures of all 85 small proteins in the Mycoplasma genitalium genome, which established template-refinement-based structure prediction as a practical approach for genome-scale experiments. This was followed by the development of the Threading ASSEmbly Refinement (TASSER) and Iterative Threading ASSEmbly Refinement (I-TASSER) algorithms, which combine threading, fragment assembly, ab initio loop modeling, and structural refinement to predict the structures. A successful structural prediction for all medium-sized open reading frames (ORFs) in the Escherichia coli genome was demonstrated by this method, achieving high-accuracy models for 920 out of 1,360 proteins. G protein-coupled receptors (GPCRs) are an extremely important class of membrane proteins for which only very few structures are available in the Protein Data Bank (PDB). TASSER was used to predict the structures of all 907 putative GPCRs in the human genome, and the high accuracy confirmed by newly solved GPCR structures and recent blind tests has demonstrated the usefulness and robustness of the TASSER/I-TASSER models for the functional annotation of GPCRs.
Recently, the I-TASSER protein structure prediction method has been used as a basis for functional annotation of protein sequences. The increasing popularity of, and need for, such automated structure and function prediction algorithms can be judged by the fact that the I-TASSER server has generated structure predictions for 35,000 proteins submitted by more than 8,000 users from 86 countries in the last 24 months. The success of these modeling experiments demonstrates significant new progress in high-throughput and genome-wide protein structure prediction.

Y. Zhang (✉) Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_11, © Springer Science+Business Media, LLC 2011

11.1 Introduction

The post-genomic era has witnessed an explosion of sequence-level information for proteins which, however, has not been complemented by the availability of structural information, mainly due to the inherent limitations of current experimental techniques for determining protein structure. The increasing gap between the sequence and structure space (shown in Fig. 11.1), along with the awareness that the three-dimensional (3D) structure of a protein is closely linked to its biological function (Lopez et al. 2007), has prompted the structural genomics (SG) project to increase the throughput of experimental structure determination (Baker and Sali 2001; Gerstein et al. 2003; Chandonia and Brenner 2006) and to provide a framework for inferring the biological function (Skolnick et al. 2000; Aloy et al. 2001) of proteins. While SG aims to structurally characterize the protein universe by an optimized combination of experimental structure determination and comparative modeling (CM), 3D structures of at least 16,000 optimally selected proteins would be required in order for CM to cover approximately 90% of protein domain families (Vitkup et al. 2001), and at the current rate it appears that this goal can only be achieved in ∼10 years (Zhang 2009b). This underscores the need for, and the feasibility of, genome-wide protein structure prediction by CM and other computational methods, so that 3D structural models can be built and provide insight into molecular mechanisms, thereby promoting better understanding of physiological processes and biological systems (Wiley 1998).

Fig. 11.1 Plot showing the rise in the number of protein sequences in databases compared to the rise in the number of structures in the PDB (Berman et al. 2000) by year

Rapid strides have been taken in the field of protein structure prediction from amino acid sequence using computational methods (Zhang 2008b). The obvious advantage of computational methods is their speed and low cost, making genome-scale structure prediction and functional annotation a reality. Protein structure prediction methods can be divided into three main categories based on the approach that is adopted (Zhang 2008b): (1) comparative or homology modeling (Sali and Blundell 1993; Fiser et al. 2000; Marti-Renom et al. 2000), (2) threading or fold recognition (Bowie et al. 1991; Jones et al. 1992; Xu and Xu 2000; Skolnick et al. 2004b; Wu and Zhang 2007b), and (3) ab initio or de novo methods (Kolinski and Skolnick 1994; Simons et al. 1997; Liwo et al. 1999; Kihara et al. 2001; Zhang et al. 2003; Bradley et al. 2005; Klepeis et al. 2005; Oldziej et al. 2005; Wu et al. 2007a).

In comparative modeling (CM), the protein structure is constructed by matching the sequence of the protein of interest (query protein) to an evolutionarily related protein with a known structure (template protein) in the Protein Data Bank (PDB). Thus, a prerequisite for the comparative modeling technique is the presence of a homologous protein in the PDB (Berman et al. 2000) library. For proteins with >50% sequence identity to their templates, models built by CM techniques can have up to 1 Å root mean square deviation (RMSD) from the native structure for the backbone atoms.
For proteins which have 30–50% sequence identity with their template, the models often have ∼85% of their core regions within an RMSD of 3.5 Å from the native structure, with errors mainly in loop regions. When the sequence identity drops below 30% (the “twilight zone”; Rost 1999), modeling accuracy sharply decreases because of substantial alignment errors and lack of significant template hits. Also, by definition, models built by CM usually have a strong bias and are closer to the template structure than the native structure of the target protein (Tramontano and Morea 2003; Read and Chavali 2007).

Threading or fold recognition is similar to CM in the sense that it also searches a structure library to identify a known structure which would “best fit” a given query sequence; however, an evolutionary relationship (homology) between the query and the template is not a prerequisite in this case. These “sequence to structure” alignment approaches usually employ a wide range of scoring functions to find the best alignment and may rely on distance-dependent potentials (Sippl and Weitckus 1992), predicted secondary structure (McGuffin and Jones 2003), solvent accessibility (Zhang et al. 1997; Chen and Zhou 2005a), and other predicted structural features. Most of the successful threading approaches use scores combining sequence features and predicted structural information (Skolnick et al. 2004b; Zhou and Zhou 2005; Wu and Zhang 2008b), using either dynamic programming (Needleman and Wunsch 1970; Smith and Waterman 1981) or a hidden Markov model (Karplus et al. 1998; Soding 2005) as the search engine for remote homology detection and fold recognition.
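The dynamic programming search mentioned above can be illustrated with a minimal Needleman–Wunsch global alignment. This is only a sketch: the scoring parameters (`match`, `mismatch`, `gap`) are illustrative assumptions, and real threading programs replace the simple identity score with profile and structure-derived terms.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (optimal global score, aligned seq_a, aligned seq_b).

    Scoring values are illustrative assumptions, not those of any
    threading program discussed in the text.
    """
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in seq_b
                              score[i][j - 1] + gap)      # gap in seq_a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


best_score, row_a, row_b = needleman_wunsch("GATTACA", "GCATGCU")
print(best_score)
print(row_a)
print(row_b)
```

The table costs time and memory proportional to the product of the two sequence lengths, which is one reason large template libraries are usually filtered before full alignment.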


Ab initio or de novo methods originally referred to approaches based purely on physicochemical properties; however, some of the contemporary algorithms in this category do use evolutionary and knowledge-based information to collect spatial restraints or to detect structural fragments to assist structural assembly. By definition, ab initio methods do not depend on the presence of known structures which are sequentially or structurally similar to a given query sequence. The guiding principle of this approach is the Anfinsen hypothesis (Anfinsen 1973), which states that the native structure of the protein lies at the global energy minimum of the configurational space. Therefore, ab initio approaches try to fold a given protein based on various force fields via conformational search. Though some notable developments have been made in this field (Kolinski and Skolnick 1994; Simons et al. 1997; Liwo et al. 1999; Kihara et al. 2001; Zhang et al. 2003; Bradley et al. 2005; Klepeis et al. 2005; Oldziej et al. 2005; Wu et al. 2007a), predicting the three-dimensional structure of proteins longer than 150 amino acids is still an unsolved problem due to the inaccuracy of available force fields and the bottlenecks arising from insufficient conformational search.

Significant progress has been achieved in developing composite structure prediction methods which combine various approaches to comparative modeling, threading, and ab initio folding. The Threading ASSEmbly Refinement (TASSER) (Skolnick et al. 2004a) and Iterative Threading ASSEmbly Refinement (I-TASSER) (Wu et al. 2007a; Zhang 2007, 2008a) methods are notable examples in this category.

In what follows, we give an overview of the field with a focus on genome-wide automated protein structure prediction. We start with a discussion of the early attempts at large-scale structure prediction. Then, we provide an introduction to the TASSER and I-TASSER algorithms, followed by a review of the genome-scale structure prediction experiments conducted using these composite methods. Lastly, we conclude the chapter with comments on the usefulness of genome-wide structure prediction and current challenges in the field. Owing to the space limit of this chapter, we do not aim to provide an exhaustive list of the efforts made in this important field.
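The idea of searching conformational space for the global energy minimum, which underlies the ab initio methods described earlier in this section, can be sketched with a plain Metropolis Monte Carlo loop. Everything specific here is an assumption made for illustration: `toy_energy` is a hypothetical one-dimensional landscape standing in for a force field, `toy_move` is a hypothetical move set, and the sampler is ordinary Metropolis rather than the scheme used by any program named in this chapter.

```python
import math
import random


def metropolis_search(energy, propose, start, steps=5000, temperature=1.0, seed=0):
    """Plain Metropolis Monte Carlo; returns the lowest-energy state visited."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for _ in range(steps):
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        delta = e_candidate - e_current
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


def toy_energy(x):
    """Hypothetical rugged 1-D landscape standing in for a force field."""
    return 0.1 * x * x + math.sin(3.0 * x)


def toy_move(x, rng):
    """Hypothetical move set: a small random perturbation of the coordinate."""
    return x + rng.uniform(-0.5, 0.5)


best_x, best_e = metropolis_search(toy_energy, toy_move, start=4.0)
print(f"lowest energy found: {best_e:.3f} at x = {best_x:.3f}")
```

Even in this toy setting the two bottlenecks named in the text are visible: the result is only as good as the energy function, and a run that is too short or too cold can remain trapped in a local minimum.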

11.2 Pioneering Efforts in Genome-Scale Structure Predictions

One of the earliest attempts at structure prediction on a genomic scale was carried out by Fischer and Eisenberg (1997) on the Mycoplasma genitalium genome (Fraser et al. 1995). The primary goal of the experiment was to assign a fold to each of the 486 putative proteins in the M. genitalium genome. The method used in this experiment was protein fold recognition using threading. Thus, each target sequence was threaded onto structures in a library of representative protein structures obtained from the PDB (Berman et al. 2000), using both sequence–profile and profile–profile alignment to find a template protein in the database of known structures that presumably has a similar structure to the target protein or at least shares a structural motif with it. Using this procedure, a fold could be assigned with high confidence to 22% of the proteins in the genome. At the time of the experiment, the threading template library included only 1,632 entries at a 50% pair-wise sequence identity cutoff.

A genome-scale structure prediction of proteins in the Saccharomyces cerevisiae genome was undertaken by Sanchez and Sali (1998), using comparative protein structure modeling. The program MODELLER (Sali and Blundell 1993; Sanchez and Sali 1997), which models an unknown protein structure based on the satisfaction of spatial restraints derived from homologous proteins of known structure, was used for all three steps necessary to perform comparative modeling, namely, sequence–structure alignment, building a model based on the restraints from templates, and evaluation of the quality of the model. Structure modeling was carried out on 6,218 open reading frames (ORFs), using a template library consisting of 2,045 proteins at a 95% pair-wise sequence identity cutoff. Models were generated for substantial segments of 1,071 ORFs (17.2%) from the complete genome. In contrast, only 40 proteins had been solved experimentally at that time.

Taking it one step further, Sanchez et al. carried out a “multi-genomic”-scale comparative structure modeling for approximately 17,000 proteins from 10 complete genomes and all sequences from Arabidopsis thaliana and Homo sapiens (Sanchez et al. 2000) available at that time. The models were generated using the MODPIPE pipeline software (Sanchez and Sali 1998), an integration of the position-specific iterated basic local alignment search tool (PSI-BLAST) (Altschul et al. 1997) for threading with MODELLER (Sali and Blundell 1993), and were deposited in the MODBASE database. The MODPIPE pipeline thus established a state-of-the-art automated procedure capable of performing large-scale protein structure modeling that could be used for various biological applications.

TOUCHSTONE (Kolinski and Skolnick 1994; Kihara et al. 2001; Zhang et al. 2003), a Monte Carlo simulation-based method built on a reduced knowledge-based force field combined with secondary structure prediction and threading-based tertiary structure restraints, was used for the genome-scale prediction of all small proteins (those shorter than 150 amino acids) in the M. genitalium genome (Kihara et al. 2002), demonstrating the feasibility of large-scale prediction experiments using ab initio-based modeling methods. Since none of the 85 small proteins in the genome had an experimentally solved structure at that time, it was not possible to compare the models with the native structures. However, as judged from TOUCHSTONE benchmarking results on a 65-protein test set, the results were promising. Out of the 85 proteins, the threading program protein structure predictor employing combined threading to optimize results (PROSPECTOR) (Skolnick and Kihara 2001) was able to produce significant threading hits for 34 proteins, all of which were then used to produce reliable full-length models. For 29 out of the remaining 51 proteins without any significant threading hits, the Monte Carlo simulations converged to 5 or fewer clusters. Based on a simple application of the statistics obtained from the benchmarking study, it was concluded that the models were reliable for 24 of these 29 proteins. Thus, the total number of proteins with reliable models was estimated to be 58, or 68% of all the target proteins in the study.
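The tally behind the 68% estimate for the M. genitalium small proteins can be reproduced in a few lines. The numbers are those quoted in the text; the variable names are ours.

```python
# Counts quoted in the text for the TOUCHSTONE M. genitalium experiment.
total_small_proteins = 85
with_threading_hits = 34       # significant PROSPECTOR hits, all modeled reliably
converged_without_hits = 29    # simulations converged to 5 or fewer clusters
reliable_converged = 24        # of those 29, judged reliable from benchmark statistics

without_hits = total_small_proteins - with_threading_hits
reliable_total = with_threading_hits + reliable_converged
reliable_fraction = reliable_total / total_small_proteins

print(f"proteins without significant threading hits: {without_hits}")
print(f"reliable models: {reliable_total} of {total_small_proteins} "
      f"({reliable_fraction:.0%})")
```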


Simons et al. conducted a large-scale test of the ROSETTA structure prediction method (Simons et al. 1997; Bradley et al. 2005) by predicting the structures of 150 proteins with sizes up to 150 amino acids (Simons et al. 2001). The protein set included 30 small (<50 residues), 127 medium-sized (50–100 residues), and 3 relatively large proteins (>100 residues). Models with an RMSD ≤ 5 Å were produced for 80% of the small proteins and 73% of the medium-sized proteins. For the rest of the proteins, including the three large ones, the models had an RMSD between 5 and 7 Å. In a more recent study, Malmstrom et al. carried out a superfamily assignment and protein structure prediction experiment on the 6,238 ORFs in the S. cerevisiae genome (Malmstrom et al. 2007). The sequences were parsed into 14,934 structural domains, 47% of which showed detectable similarity to homologs or analogs of known structure. From the remaining 53% of the domains, the ones shorter than 150 residues and lacking predicted transmembrane helices were selected for ab initio structure prediction using ROSETTA. For each of the selected 3,338 domains, 10,000 models were generated by ROSETTA and then condensed to 30 representative models by clustering. This large-scale computational study was an expensive effort as it required 1,350 CPU years. The resulting structural data were integrated with existing experimental data on the function, process, and localization of the domains in order to assign them to structural classification of proteins (SCOP) superfamilies; an assignment was made for 581 domains. In addition, structural annotations were assigned to 7,094 domains with structures predicted using fold recognition or homology modeling. The genome-wide predictions and superfamily assignments produced by this ground-breaking study can serve as a basis for the generation of experimentally testable hypotheses about the structure–function relationships of a large number of yeast proteins.

11.3 TASSER Methods

TASSER is a composite structure prediction method developed in the Skolnick lab (Skolnick et al. 2004a; Zhang and Skolnick 2004c) involving a hierarchical combination of template search by threading, followed by the assembly and rearrangement of continuous fragments excised from the templates. The protein conformation is specified in an on-and-off-lattice system with an energy function integrating a number of structural restraints which are predicted from the threading templates. The on-and-off-lattice-based conformational search is used to generate thousands of conformations which are then subjected to iterative structural clustering for the selection of the final models (Zhang and Skolnick 2004b).

The TASSER predictions begin by taking the amino acid sequence as input, which is then subjected to “sequence–structure alignment” or threading by protein structure predictor employing combined threading to optimize results 3 (PROSPECTOR3) (Skolnick et al. 2004b) against a comprehensive threading library. The threading process utilizes close and distant sequence profiles and predicted secondary structure information from PSIPRED (Jones 1999) to find the best match. The alignment is performed using the Needleman–Wunsch dynamic programming algorithm (Needleman and Wunsch 1970), and the raw alignment score and the alignment length are used to obtain the statistical significance (Z-score) of the alignment. The alignments on different templates are ranked by the Z-score, which is also used to classify the query protein as an “easy,” “medium,” or “hard” target. The “hard” category basically means that no good threading template is identified in the library, and the structure will have to be largely predicted by an “ab initio” method.

The templates found by the threading process are divided into continuously aligned (>5 residues) and gapped regions and placed onto the CAS (C-Alpha and Side-chain center of mass) on-and-off-lattice model. The local structure of the aligned regions remains unchanged during the simulation; their Cα atoms are excised from the template and placed off-lattice in order to keep the fidelity of the structures. In the gapped or ab initio regions, Cα atoms are placed on the lattice points with a grid of 0.87 Å. The side-chain centers of mass are off-lattice for all regions. The gapped regions are first filled up using a random walk of Cα–Cα bond vectors to generate a full-length model which is subsequently subjected to the parallel hyperbolic Monte Carlo sampling (Zhang et al. 2002). Once again the CAS model differentiates between the on-lattice and off-lattice atoms with regard to the movements they are subjected to. The off-lattice atoms are subjected to rigid-body translation and rotation. Care is taken to ensure that the acceptance probability of a movement is approximately the same for different fragment lengths, which is implemented by normalizing the amplitude of movement by the length of the fragment. On the other hand, on-lattice atoms are subjected to 2–6-bond movements and sequence shifts of multiple bonds. A pictorial representation of the CAS model is shown in Fig. 11.2.
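The threading step at the start of this procedure ranks templates by Z-score and uses the best Z-score to label the target. A minimal sketch of that logic follows. It assumes a generic Z-score (raw score relative to the mean and standard deviation of scores over the library), which is not necessarily how PROSPECTOR3 combines the raw score with the alignment length, and the cutoffs `easy_cutoff` and `medium_cutoff` are hypothetical values chosen only for illustration.

```python
from statistics import mean, pstdev


def z_score(raw_score, library_scores):
    """Generic Z-score of one alignment score against scores over the library."""
    mu = mean(library_scores)
    sigma = pstdev(library_scores)
    if sigma == 0:
        raise ValueError("library scores have zero variance")
    return (raw_score - mu) / sigma


def classify_target(best_z, easy_cutoff=15.0, medium_cutoff=7.0):
    """Label a target from its best template Z-score (hypothetical cutoffs)."""
    if best_z >= easy_cutoff:
        return "easy"
    if best_z >= medium_cutoff:
        return "medium"
    return "hard"  # no good template: mostly ab initio modeling


def rank_templates(template_scores):
    """Sort (template_id, raw_score) pairs by Z-score, best first."""
    raw = [score for _, score in template_scores]
    scored = [(name, z_score(score, raw)) for name, score in template_scores]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical raw scores for four templates.
ranking = rank_templates([("tmplA", 12.0), ("tmplB", 55.0), ("tmplC", 9.0), ("tmplD", 14.0)])
print(ranking[0][0], classify_target(ranking[0][1]))
```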
The TASSER energy function integrates three different classes of energy terms. The first class consists of a number of knowledge-based statistical potentials derived from the PDB (Berman et al. 2000), including long-range side-chain pair interactions, hydrogen bond potential terms, hydrophobic interaction, and local Cα correlations. The second class includes the propensity of an amino acid to assume a particular secondary structure as predicted by PSIPRED (Jones 1999), while the third class includes protein-specific tertiary structure contact restraints and a distance map calculated by PROSPECTOR_3 from the generated threading templates. The decoys generated from the TASSER sampling are finally subjected to iterative structural clustering by SPICKER (Zhang and Skolnick 2004b) to rank the decoys and extract near-native final models.
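The clustering step can be pictured with a simple greedy scheme: pick the decoy with the most neighbors within a distance cutoff, remove that cluster, and repeat. The sketch below is an assumption-laden illustration rather than SPICKER itself; the cutoff is fixed instead of tuned iteratively, `distance` is supplied by the caller (pairwise RMSD in practice), and the one-dimensional "decoys" are placeholders.

```python
def greedy_cluster(items, distance, cutoff, max_clusters=5):
    """Greedy neighbor-count clustering.

    Returns a list of (center_index, member_indices) tuples, largest
    remaining cluster first. Ties are broken by the lowest index.
    """
    remaining = list(range(len(items)))
    clusters = []
    while remaining and len(clusters) < max_clusters:
        # For every remaining item, list the remaining items within the cutoff
        # (each item counts as its own neighbor).
        neighbours = {
            i: [j for j in remaining if distance(items[i], items[j]) <= cutoff]
            for i in remaining
        }
        center = max(remaining, key=lambda i: len(neighbours[i]))
        members = neighbours[center]
        clusters.append((center, members))
        taken = set(members)
        remaining = [i for i in remaining if i not in taken]
    return clusters


# Placeholder "decoys": points on a line, with absolute difference as distance.
decoys = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
clusters = greedy_cluster(decoys, lambda a, b: abs(a - b), cutoff=0.15)
for center, members in clusters:
    print(f"center {center}: members {members}")
```

Ranking models by cluster size reflects the assumption that near-native conformations are sampled more often than any single wrong fold; recomputing every pairwise distance on each pass, as done here for brevity, would be too slow for thousands of decoys.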

11.4 I-TASSER Methods

I-TASSER is an extension of the TASSER methodology, which is implemented by running repeated iterations of the TASSER Monte Carlo sampling (Wu et al. 2007a). A schematic overview of the I-TASSER methodology is shown in Fig. 11.3. The main new developments in I-TASSER compared to TASSER are (a) LOMETS

S. Mukherjee et al.

Fig. 11.2 A schematic representation of the CAS on-lattice and off-lattice models for a fragment of a polypeptide chain, with each residue being represented by the Cα atom and side-chain center of mass. The Cα atoms of the unaligned residues (white) are placed on-lattice and subjected to 2–6-bond movements and multi-bond shifts. The Cα atoms of the aligned regions are subjected only to rigid-body rotations and translations. All side-chain centers of mass are off-lattice

is used to extract spatial restraints from multiple threading algorithms (Wu and Zhang 2007b); (b) sequence-based contact predictions from SVMSEQ guide the ab initio simulations (Wu and Zhang 2008a, 2009); (c) REMO is used to refine the hydrogen-bonding network of reduced models (Li and Zhang 2009); (d) iterative TASSER reassembly (Wu et al. 2007a); and (e) integration of structure-based functional annotations. The starting templates in I-TASSER are collected by LOMETS (Wu and Zhang 2007b), a meta-threading server combining nine state-of-the-art threading algorithms: FUGUE (Shi et al. 2001), HMM-HMM search (HHsearch) (Soding 2005), multi-source threader (MUSTER) (Wu and Zhang 2008b), protein structure prediction and evaluation computer toolkit (PROSPECT) (Xu and Xu 2000), PROSPECTOR3 (Skolnick et al. 2004b), sequence alignment and modeling-T02 (SAM-T02) (Karplus et al. 1998), sequence, secondary structure profiles and residue-level knowledge-based energy score 2 (SPARKS2) (Zhou and Zhou 2004), SP3 (Zhou and Zhou 2005), and profile-profile alignment (PPA) (Wu and Zhang 2007b). On average, as tested on a benchmark set of 620 non-homologous proteins, the threading alignments found by LOMETS outperform those of the best individual threading programs, with a template modeling score (TM-score) increase of at least 8%.
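Because the TM-score is used throughout the rest of the chapter, a short sketch of how it is evaluated may help. The formula is not given in this chapter; the version below is the commonly quoted definition, in which each aligned residue pair contributes 1/(1 + (d/d0)²) and d0 depends on the target length. The function names are hypothetical, and the sketch assumes the distances come from an already optimized superposition (the search for that superposition is not shown).

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in Å) of the TM-score."""
    if target_length < 19:
        # Below this length the expression is not positive; real programs
        # apply a floor, which this sketch does not attempt to reproduce.
        raise ValueError("target_length must be at least 19 for this sketch")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score from per-residue distances (Å) of aligned residue pairs.

    `distances` holds one value per aligned pair after superposition;
    unaligned target residues contribute nothing, so the sum is divided
    by the full target length.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

A perfect model covering every residue scores 1.0, while a perfect model covering only half of the target scores 0.5, which is why the measure rewards both accuracy and coverage.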

11

Genome-Wide Protein Structure Prediction

Fig. 11.3 A schematic diagram of the I-TASSER (Wu et al. 2007a; Zhang 2007, 2008a, 2009a) structure and function prediction protocol. Templates for the query protein are identified by local meta server (LOMETS) (Wu and Zhang 2007b), which provides template fragments and spatial restraints. The template fragments are then assembled by parallel hyperbolic Monte Carlo simulations (Zhang et al. 2002). The conformations generated during the simulation are clustered using SPICKER (Zhang and Skolnick 2004b), in order to find the structure with the lowest free energy. As an iterative strategy, the cluster centroids are then subjected to a second round of simulation with the purpose of refining the structure and removing clashes. The final all-atom models are generated by refined models (REMO) (Li and Zhang 2009) through the optimization of hydrogen-bonding networks. Functional homologs (PDB structures that have an associated EC number/GO term/known binding site) of the final models are identified using both global structural search (Zhang and Skolnick 2005b) and local structure alignment programs which aim at finding matches between the binding/active sites of the predicted structure and templates with known function

The new potential terms that have been incorporated in I-TASSER include the predicted accessible surface area (ASA) (Chen and Zhou 2005a; Wu et al. 2007a) and sequence-based contact predictions (Wu and Zhang 2008a). Both energy terms have been derived and optimized using machine learning methods. The overall correlation between the actual exposed area as calculated by structure identification (STRIDE) (Frishman and Argos 1995) and that predicted by a neural network is 0.71, based on a test on 2,234 non-homologous proteins. In the latest version of I-TASSER (Zhang 2009a), the sequence-based pair-wise residue contact information from SVMSEQ (Wu and Zhang 2008a), SVMCON (Cheng and Baldi 2007), and BETACON (Cheng and Baldi 2005) is used to constrain the simulation search and improve the funnel around the global minimum of the energy landscape. The trajectories of the low-temperature replicas of the first-round TASSER simulations are clustered by SPICKER (Zhang and Skolnick 2004b). The cluster centroids are obtained by averaging all the clustered structures after superposition and are ranked based on the structure density of the cluster. Cluster centroids generally have a number of non-physical steric clashes between Cα atoms and can be over-compressed. Starting from the selected SPICKER cluster centroids, the

TASSER Monte Carlo simulation is performed again (see Fig. 11.3). While the inherent I-TASSER potential remains unchanged in the second run, external constraints are added, which are derived by pooling the initial high-confidence restraints from threading alignments, the distance and contact restraints from the combination of the centroid structures, and the PDB structures identified by the structure alignment program TM-align (Zhang and Skolnick 2005b) using the cluster centroids as query structures. The conformation with the lowest energy in the second round is selected as the final model. The main purpose of this iterative strategy is to remove the steric clashes of the cluster centroids. To increase the biological usefulness of protein models, all-atom models are generated by REMO (Li and Zhang 2009) simulations, which include three general steps: (1) removal of steric clashes by moving around each of the Cα atoms that clash with other residues; (2) backbone reconstruction by scanning a backbone isomer library collected from the solved high-resolution structures in the PDB library; (3) hydrogen-bonding network optimization based on predicted secondary structure from PSIPRED. Finally, Scwrl3.0 (Canutescu et al. 2003) is used to add the side-chain rotamers. Recently, I-TASSER was extended by an additional component to predict the biological function of the query proteins. The procedure involves matching the I-TASSER-generated structural models against representative libraries of proteins with known function using both global and local structure alignment-based methods in order to find the best functional homologs in the PDB library. Based on a large-scale benchmark test set of 218 non-homologous proteins, it was found that even when the structures are predicted after removing all the homologous templates from the template library, the correct function (EC number and GO terms) could be predicted for 72% of the test proteins with a precision of 74% (Roy et al. 
2010).
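The clustering step that precedes the second simulation round can be pictured with a small greedy procedure: repeatedly pick the structure with the most neighbors within an RMSD cutoff, take it and its neighbors as one cluster, and continue with the remainder. The sketch below is only loosely modeled on that idea and is not the SPICKER implementation; the fixed cutoff, the precomputed pairwise RMSD matrix, and the function name are all assumptions.

```python
def greedy_cluster(rmsd, cutoff):
    """Greedy neighbor-count clustering on a pairwise RMSD matrix.

    rmsd   : symmetric list of lists, rmsd[i][j] in Å (rmsd[i][i] == 0)
    cutoff : structures within this RMSD of a center join its cluster
    Returns a list of (center_index, member_indices), largest cluster first
    among the structures still unassigned at each step.
    """
    remaining = set(range(len(rmsd)))
    clusters = []
    while remaining:
        # The center is the unassigned structure with the most unassigned
        # neighbors; ties go to the lowest index.
        center = max(
            sorted(remaining),
            key=lambda i: sum(1 for j in remaining if rmsd[i][j] <= cutoff),
        )
        members = sorted(j for j in remaining if rmsd[center][j] <= cutoff)
        clusters.append((center, members))
        remaining -= set(members)
    return clusters
```

In this picture the first cluster returned is the most populated one, which corresponds to ranking clusters by structure density; averaging the member coordinates to form a centroid is omitted here.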

11.5 TASSER/I-TASSER Structure Prediction on Large-Scale Benchmarks

For a comprehensive test of the methodology, we collected a representative set of 2,234 single-domain proteins in the PDB whose size ranged from 41 to 300 residues (Skolnick et al. 2004a; Zhang and Skolnick 2004c). For each protein, homologous templates with sequence identity >30% to the target are excluded from the threading template library (Skolnick et al. 2004a; Zhang and Skolnick 2004c). In Fig. 11.4a, we present a chart showing the fractions of I-TASSER-generated models having RMSDs (from the native structure) below various thresholds. About 2/3 of targets (1,470/2,234) have an acceptable topology (RMSD from native <6.5 Å); 46% of targets (1,026/2,234) have an RMSD from the native structure <4 Å. As the RMSD threshold decreases, the fraction of models below the threshold (especially those <2 Å) sharply drops, which is partially due to the limitations of the TASSER potential with regard to high-resolution modeling. Because there is no template alignment available, loop modeling is a difficult unsolved problem in protein structure prediction. In the 2,234-protein benchmark

Fig. 11.4 The success rate of TASSER/I-TASSER on three benchmark sets versus the RMSD threshold defining success; (a) 2,234 proteins with non-homologous templates from the threading program (Skolnick et al. 2004a; Zhang and Skolnick 2004c), (b) 56 small proteins in the ab initio limit (Wu et al. 2007a), and (c) 1,489 proteins with non-homologous templates from structure alignment by TM-align (Zhang and Skolnick 2005a,b)

set, there are overall 3,565 unaligned regions (ranging from 4 to 170 residues long, mainly in loops and tails). If we assess loop modeling accuracy by calculating the RMSD between the predicted and the native loop conformations based on a superposition of the stem residues (Zhang and Skolnick 2004c), the average RMSD for all 3,565 loops is 6.1 Å, a low value in comparison with the RMSD of 14.2 Å obtained when the same loops are built by MODELLER (Sali and Blundell 1993). If we use an RMSD cutoff of 4 Å to define success, MODELLER succeeds in 14% (499 of 3,565) of the cases, whereas TASSER ab initio modeling is successful in 44% (1,567 of 3,565) of the cases. In Fig. 11.4b, we show the RMSD distribution of I-TASSER ab initio models for 56 small (<120 residues) single-domain proteins (Wu et al. 2007a). Any meaningful template with a sequence identity >20% to the target or having a PSI-BLAST E-value <0.5 was excluded. In this limit, about 90% of the final models have a correct fold, with an RMSD <6.5 Å from the native structure. The average RMSD of the I-TASSER models is 3.8 Å, compared to 5.9 Å by TOUCHSTONE (Zhang et al. 2003) for the same set of proteins. Since the template exclusion used here is much stricter than that in TOUCHSTONE, which used a sequence identity cutoff of 30%, these data demonstrate significant progress by I-TASSER over TOUCHSTONE in ab initio modeling. For the 16 proteins which were also tested by Bradley et al. (2005), the overall result is comparable with that from all-atomic ROSETTA simulations (both have an average RMSD of 3.8 Å), but the CPU time required by I-TASSER was much shorter (5 CPU hours vs. 150 CPU days). In Fig. 11.4c, we show the success rates of a procedure that uses the best templates identified by structural alignments. 
First, a representative set of 1,489 proteins from the PDB with lengths between 41 and 200 residues was taken as targets; then the native structure of each target was superimposed onto structures in the PDB to identify the best template, while homologous templates with sequence identity >25% to the target were excluded (Zhang and Skolnick 2005a). The purpose

of this experiment was to examine whether the current PDB structure library is complete (Zhang and Skolnick 2005a) and if so, how well TASSER structure prediction can perform when starting from the best-possible non-homologous templates. The data show that starting from structural alignments, TASSER generates “foldable” models with an RMSD < 6 Å for almost all targets and “good” models with an RMSD <4 Å for 97% of the targets. Although the fraction of high-resolution models (<2 Å) is still relatively low, these striking data suggest that when the goal is to build a model with correct topology (RMSD <6 Å (Zhang and Skolnick 2005a)), the structure prediction problem for single-domain proteins could in principle be solved using the current PDB by efficient fold recognition algorithms that would be able to recover the structural alignments (Zhang and Skolnick 2005a). Indeed, all single-domain folds in the PDB are represented in an artificially generated library of compact, hydrogen-bonded, sticky homo-polypeptide structures (Zhang et al. 2006b).
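The success rates quoted in this section are simply the fraction of models whose RMSD falls below a chosen threshold, evaluated at several thresholds as in Fig. 11.4. A minimal sketch, with hypothetical function names and made-up RMSD values in the test, shows the bookkeeping:

```python
def success_rate(rmsds, threshold):
    """Fraction of models with RMSD (Å) strictly below the threshold."""
    if not rmsds:
        raise ValueError("at least one RMSD value is required")
    return sum(1 for r in rmsds if r < threshold) / len(rmsds)


def success_curve(rmsds, thresholds):
    """Success rate at each threshold, as plotted against RMSD cutoff."""
    return {t: success_rate(rmsds, t) for t in thresholds}
```

With the counts reported above, 1,470 of 2,234 targets below 6.5 Å corresponds to a rate of about 0.66, and 1,026 of 2,234 below 4 Å to about 0.46.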

11.6 Prediction of All Medium-Sized ORFs in the E. coli Genome

Inspired by the success of the benchmark test, a genome-scale structure prediction experiment (Zhang and Skolnick 2004a) was carried out for all 1,360 medium-sized ORFs (<201 residues) in the E. coli genome (Blattner et al. 1997). The PROSPECTOR_3 threading algorithm assigns 829 proteins to the easy set, 521 to the medium set, and only 10 to the hard set; this target distribution is quite similar to that of the benchmark set. Based on the benchmarking study described above, a confidence score (or C-score) was defined to assess the quality of a model, which is a combination of the Z-score of the threading template and the degree of convergence of the conformations generated by the CAS refinement simulations. The confidence score is defined by

\[
\text{C-score} = \ln\left( \frac{M}{M_{\mathrm{tot}}\,\langle \mathrm{rmsd} \rangle} \times Z \right) \qquad (11.1)
\]

where M is the multiplicity of structures in a given SPICKER cluster, ⟨rmsd⟩ is the average RMSD of the structures in the cluster from the cluster centroid, Mtot is the total number of conformations used as input to SPICKER, and Z is the Z-score of the starting template. Having observed a good correlation of C-score with RMSD for the benchmark set (a C-score > −1.5 is roughly equivalent to a TM-score > 0.5 which indicates a similar fold), this C-score could be used to assess the quality of the models generated for the E. coli ORFs. The E. coli ORFs were found to have a C-score distribution similar to the one observed for the PDB benchmark set. If the correlation between C-score and RMSD is assumed to be the same for the E. coli set as the benchmark set, ∼920 or 68% of the models generated can be considered to be reliable. The percentage of correct structures is slightly higher than for the PDB benchmark set (see Fig. 11.4), partly because homologous proteins were not excluded during the threading process for the E. coli ORFs. A histogram showing

Fig. 11.5 A histogram showing the C-score (defined in Eq. (11.1)) distribution of models for the E. coli genome (solid line) and the PDB benchmark set (bars). The different colors in the bars indicate the fraction of targets below and above an RMSD cutoff of 6.5 Å for the PDB benchmark set

the distribution of C-scores for the E. coli ORFs and the PDB benchmark set is shown in Fig. 11.5. Transmembrane proteins are particularly difficult to crystallize, with difficulties ranging from expression of membrane proteins in microbial host cells to purification of the protein to the crystallization process itself, due to the amphipathic nature of their environment (Ostermeier and Michel 1997; Caffrey 2003). Hence, accurate prediction of membrane proteins is of special importance. According to MEMSAT (Jones et al. 1994), 309 of the 1,360 E. coli ORFs belong to the membrane protein class. The TASSER models generated for these ORFs share good consistency with the MEMSAT predictions in having at least one long, putative transmembrane helix. If the C-score, defined above, is used to map the models, 174 of the 309 proteins or 56% have a probability >60% to have an overall RMSD < 6.5 Å and about 146 or 47% have a chance >80% to have an RMSD less than 6.5 Å.
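The C-score of Eq. (11.1) is straightforward to evaluate once the SPICKER cluster statistics and the template Z-score are known. The sketch below reads the equation as ln(M / (Mtot · ⟨rmsd⟩) · Z); the function and argument names are hypothetical, and the numbers in the usage note are invented for illustration rather than taken from any target in the study.

```python
import math


def c_score(multiplicity, total_conformations, mean_rmsd, template_z):
    """Confidence score read from Eq. (11.1): ln(M / (Mtot * <rmsd>) * Z).

    multiplicity        : M, number of structures in the SPICKER cluster
    total_conformations : Mtot, number of conformations given to SPICKER
    mean_rmsd           : <rmsd>, average RMSD (Å) to the cluster centroid
    template_z          : Z, Z-score of the starting threading template
    """
    if min(multiplicity, total_conformations, mean_rmsd, template_z) <= 0:
        raise ValueError("all inputs must be positive for the logarithm")
    density = multiplicity / (total_conformations * mean_rmsd)
    return math.log(density * template_z)
```

For example, a cluster holding half of all conformations with a mean RMSD of 2 Å and a template Z-score of 4 gives a C-score of 0, which lies above the −1.5 level that the text associates with a similar fold.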

11.7 Structural Modeling of All 907 Putative GPCRs in the Human Genome

G protein-coupled receptors (GPCRs) comprise the largest family of integral membrane proteins and act as cell surface receptors responsible for the transduction of an endogenous signal into a cellular response (Watson and Arkinstall 1994; Flower

1999). Many diseases involve their malfunction, making them the most important class of drug targets (Flower 1999; Drews 2000; Lundstrom 2005; Hubbard 2006). However, structure-based drug design has been hampered by the lack of atomic-level protein structure information for GPCRs. Until now, only four GPCR structures, bovine rhodopsin (Palczewski et al. 2000), turkey β1-adrenergic receptor (Warne et al. 2008), human β2-adrenergic (Cherezov et al. 2007; Rasmussen et al. 2007; Rosenbaum et al. 2007), and A2A-adenosine receptors (Jaakola et al. 2008), have been solved. We collected 907 human GPCRs from the registered entries at http://www.cmbi.kun.nl/7tm/htmls/entries.html and http://www.expasy.org/cgi-bin/lists?7tmrlist.txt. TASSER was employed to model all the GPCRs (Zhang et al. 2006a). The resulting models are publicly downloadable from http://cssb.biology.gatech.edu/skolnick/files/gpcr/gpcr.html. Because there was no restraint on the global topology, it is of interest to examine how often the models adopt a typical TM-helix bundle architecture. Using an automatic TM-helix identification program, we found that 862 of the 907 GPCRs have the 7-helix bundle topology, although only 744 targets started from a TM-helix-like template. Among the other 45 cases, 16 are incomplete or alternatively spliced transcripts, most missing the majority of their TM regions; three (Q8TDU0, Q8TDV3, and Q96HT6) do not appear to be GPCRs based on sequence analysis (Marchler-Bauer et al. 2005); and two (Q9HC23 and P06850) are wrongly annotated as GPCRs (Pisarska et al. 2001; Chen et al. 2005b). The remainder may represent incorrect TASSER predictions, since the C-score of these targets is low. Although at the time of the study, there was no solved X-ray or nuclear magnetic resonance (NMR) structure for any human GPCR and a direct comparison of models with experimental structures was not possible, two criteria were found to be useful for the assessment of our models.
First, we use the model’s C-score (see Eq. (11.1)). Based on the 2,234-protein benchmark set, the correlation coefficient between C-score and RMSD is 0.85 (Skolnick et al. 2004a), with a similar correlation also obtained for a benchmark set of 38 membrane proteins (Zhang et al. 2006a). Due to the uniform 7-TM topology and the robust sequence profiles (Skolnick et al. 2004b), a much higher fraction of the GPCR models have a high C-score than the models generated for the PDB benchmark set (Fig. 11.6). Assuming that the GPCR models have the same correlation between C-score and RMSD as those of the PDB benchmark proteins, we estimate that 819 GPCR models have a correct fold with an RMSD below 6.5 Å. Second, we evaluated the GPCR models by considering the affinity-labeling and site-directed mutagenesis experiments designed to identify critical residues and motifs that participate in ligand binding (Schwartz 1994; Flower 1999; Shi and Javitch 2002). These data provide useful clues about the spatial arrangements of binding site residues, and we can examine if our models are consistent with these. We checked all TASSER models with C-score >1.3 with available site-directed mutagenesis data collected from 64 papers. These included angiotensin receptor 1, chemokine receptors, opioid receptors, thromboxane A2 receptor, neuromedin B receptor, melatonin 2 receptor, gonadotropin-releasing hormone receptor, and neuropeptide Y receptors. Excluding N- and C-terminal tails, the TASSER-predicted

Fig. 11.6 Histogram showing the distribution of C-scores (defined in Eq. (11.1)) for the PDB benchmark set (bars) and the GPCR models (solid line). The different colors in the bars indicate the fraction of models in the PDB benchmark set with an RMSD below (dark gray) and above (light gray) 6.5 Å, respectively

models were consistent with the experiment (Zhang et al. 2006a). Fig. 11.7 shows the human Y1 receptor. Consistent with the mutagenesis studies (Zhou et al. 1994; Hwa et al. 1995; Sautel et al. 1996; Du et al. 1997), the ligand-binding residues are well grouped in the model. Based on an all-against-all comparison of the predicted structures, GPCRs in the same functional family were found to be more conserved in structure space than in sequence space. This finding establishes the possibility of functional annotation of orphan proteins based on topology-level comparisons of predicted structures. One such instance is the RDC1 receptor, which was considered an orphan receptor for 15 years; its closest but weak relative is the adrenomedullin receptor (AMDR) based on phylogenetic studies (Ladoux and Frelin 2000). The TASSER structural predictions placed the RDC1 receptor in the family of chemokine receptors because the predicted RDC1 structure is closest to the predicted structure of the CXCR4 chemokine receptor (Zhang et al. 2006a). This finding was later confirmed by binding experiments (Miao et al. 2007). After the modeling had been done (Zhang et al. 2006a), the structures of two human GPCRs, the β2-adrenergic and A2A-adenosine receptors, were solved by two laboratories at Stanford University and The Scripps Research Institute (Cherezov et al. 2007; Rasmussen et al. 2007; Jaakola et al. 2008). These structures provide a

Fig. 11.7 The first model of neuropeptide Y Y1 receptor predicted by TASSER, having a C-score of 1.93 (Zhang et al. 2006a). Secondary structure elements are displayed as open ribbons. (a) Three pairs of highlighted residues are in contact as verified by the reciprocal mutagenesis experiments. (b) Highlighted residues represent the important residues identified in NPY agonist-binding mutagenesis experiments. (c) Highlighted residues are the critical residues identified by BIBP3226 antagonist-binding mutagenesis experiments (Sautel et al. 1996; Du et al. 1997)

unique opportunity to objectively examine the quality of the TASSER models. β2-AR is a class-A receptor that is 413 residues long. It is found in human smooth muscle and mediates the catecholamine-induced activation of adenylate cyclase through the action of G proteins. Efforts to crystallize wild-type β2-AR had been unsuccessful because of the inherent conformational plasticity mainly induced by the C-terminal tail and the third unstructured intracellular loop (ICL3) (Granier et al. 2007; Rosenbaum et al. 2007). To increase crystal contacts, Rasmussen et al. (2007) removed the C terminus and bound a monoclonal antibody (Mab5) to ICL3. Using high-brilliance microcrystallography, the structure of a 216-residue portion was determined at a resolution of 3.4 Å (PDB ID: 2r4rA). Cherezov et al. (2007) replaced ICL3 with T4 lysozyme (T4L) to increase the TM conformational stability. This led to a high-resolution structure of 282 residues with a resolution of 2.4 Å (PDB ID: 2rh1A). The missing parts are mainly from the N and C termini and the ICL3 region. This is the first solved human GPCR structure. Because it is longer and has a higher resolution than 2r4rA, we compared our models to 2rh1A in our analysis. In our modeling of β2-AR, PROSPECTOR3 identified bovine rhodopsin (1f88A) as the template with a high significance score (Z-score = 23.1). The RMSD of the

253 aligned residues from the template to the native structure is 4.94 Å with a TM-score = 0.71. In the 7 TM-helix regions, i.e., TM1 (29–60), TM2 (67–96), TM3 (103–136), TM4 (147–171), TM5 (197–229), TM6 (267–298), and TM7 (305–328), the RMSD for the rhodopsin template is 3.7 Å. TASSER takes the restraints from the template and reassembles the fragments with loops built by ab initio modeling. As a result, the structure of the first model has an RMSD of 4.37 Å in the threading aligned regions; for the 7 TM-helix regions, the RMSD of the first TASSER model is 2.28 Å (Fig. 11.8, left panel). For the full-length model, the RMSD to native is 4.88 Å with a TM-score = 0.82.

Fig. 11.8 Side and top views of the first TASSER model (gray) superposed on the crystal structure (dark) of β2-AR and ADORA2A over the seven transmembrane regions with an RMSD of 2.28 and 2.87 Å, respectively

ADORA2A is a class-A purinergic receptor with a length of 412 residues. Stevens and coworkers exploited a similar T4L fusion strategy to crystallize the receptor, resulting in a structure of 2.6 Å resolution (PDB ID: 3eml) (Jaakola et al. 2008). PROSPECTOR3 again identified bovine rhodopsin as a template with an RMSD of 5.13 Å in 262 aligned residues; the RMSD of the template in the TM-regions is 3.23 Å. After TASSER reassembly, the RMSD of the first model in the threading aligned regions is 4.20 Å while in the 7 TM-helix regions, the RMSD of the model is reduced to 2.84 Å (Fig. 11.8, right panel). The overall RMSD of the full-chain model is 4.76 Å with a TM-score = 0.80. It should be mentioned that the modeling here was made using rhodopsin as template. When using the newly solved adrenergic receptors, which are structurally closer to ADORA2A, the models could be further improved, e.g., the RMSD in TM-helices of our model by I-TASSER, which was recently submitted to the community-wide GPCR docking experiment (Michino et al. 2009), was 2.04 Å (model ID: 1800_2.pdb, picture not shown). Both blind-test examples (β2-AR and ADORA2A) show that the TASSER/I-TASSER fragment assembly procedure can draw the template significantly closer to the native structure (i.e., by 1.4 Å/0.6 Å and 0.4 Å/0.9 Å in TM-helix/aligned regions for β2-AR and ADORA2A, respectively). This ability to refine structures is particularly important for modeling those GPCRs that do not have close templates in the PDB. Currently, efforts are under way to extend the I-TASSER methodology for predicting the structure of all classes of integral membrane proteins. In an initial benchmarking study, 88 integral membrane proteins (66 α-barrel and 24 β-barrel

proteins) belonging to 24 different families were selected from the PDB and modeled using the current I-TASSER protocol, excluding any templates having >30% sequence identity with the target. Overall, 61 proteins were classified as easy targets, 24 as medium, and 3 as hard targets, based on the LOMETS threading alignment. For 37 proteins, the best-identified template was itself a membrane protein, and 43 templates had a TM-score >0.6 with the native target structure, showing that good templates exist in the current PDB library for ∼45% of the membrane proteins. After generating full-length models by the I-TASSER procedure, 37 proteins in the benchmark set were modeled with an average RMSD of 4.203 Å, and 43 proteins had an average TM-score of 0.7726 for the full-length model. Much effort is being made to develop a membrane protein-specific version of I-TASSER, which would be capable of taking into consideration the uniqueness of the membrane environment and predicting integral membrane protein structures even when no good template is identified in the template library, with an equivalent precision and accuracy to that for globular proteins.

11.8 Application of I-TASSER to the Chlamydia trachomatis Genome

Bacteria from the Chlamydia genus are implicated in a large number of human diseases, including glaucoma and ectopic pregnancy among many others. The lack of a gene transfer system for these bacteria makes them difficult to study ex vivo and has greatly hampered our understanding of their biology. Although the genome sequences of many Chlamydia species are freely available in genome databases, the functional annotations of ORFs in these genomes, based on sequence comparisons, have been limited due to the lack of reliable sequence similarity with proteins of known function. As residues located far apart in the sequence may be very close in 3D space, and only a few spatially conserved residues are generally responsible for a protein’s function (Wallace et al. 1996; Kleywegt 1999), predicted 3D structures for proteins from such organisms can provide meaningful insights into the key component(s) of their functionality. The I-TASSER methodology for protein structure and function prediction was recently applied to 100 ORFs with no functional annotation in the C. trachomatis genome. Fig. 11.9a shows the distribution of the confidence score (C-score) of the first I-TASSER models for all 100 proteins. Based on the correlation data of C-score with RMSD and TM-score (Zhang 2008a), it can be expected that 66 of these 100 proteins could be correctly folded (predicted TM-score > 0.6) and could provide meaningful insight into the function of these proteins. Moreover, by using a local and global structure alignment-based method, a highly confident function prediction (based on a benchmark test of 218 proteins) could be made for 12 enzymatic and 38 non-enzymatic proteins, i.e., altogether exactly 50% of all target proteins. Fig. 11.9b shows an illustrative example, the protein CT780. The structure of an ortholog of this protein, in Chlamydia pneumoniae, had already

Fig. 11.9 (a) Distribution of C-scores of predicted structures for 100 proteins in Chlamydia trachomatis genome using I-TASSER. (b) Comparison of the modeled structure of CT780 (dark gray and stick) with the crystal structure of thioredoxin disulfide isomerase (dark gray and cartoon) from Chlamydia pneumoniae

been solved (PDB: 2ju5). For testing purposes, the structure of CT780 was modeled by excluding this template and all other proteins having a sequence identity >40% with the target. The first model generated by I-TASSER had a TM-score of 0.84 in the core region (when compared to 2ju5, the C. pneumoniae ortholog), reflecting that the structure was predicted correctly. Based on this predicted model, TM-align identified a correct functional homolog, the third thioredoxin domain of protein disulfide isomerase A4 from mouse, with EC: 5.3.4.1 (2dj3A). Primary sequence comparison supports the annotation of the protein as a thioredoxin disulfide isomerase, DsbH. Functional studies on DsbH demonstrated that it exhibits many of the enzymatic properties of thioredoxin from E. coli (Mac et al. 2008). This identified homolog shares a sequence identity of 27.1% with the query protein, showing that even when only remotely homologous templates are available, the modeled structure can provide meaningful insight into the molecular function and can make genomic-scale functional annotation a reality.
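The template exclusion used in this test (and in the benchmarks earlier in the chapter) amounts to computing the sequence identity between the query and each candidate template and discarding those above a cutoff. The sketch below assumes pairwise alignments are already available as gapped strings and normalizes identity by the number of aligned (non-gap) columns; that normalization, the data layout, and the function names are assumptions, since the chapter does not specify them.

```python
def sequence_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns of two gapped sequences.

    Columns where either sequence has a gap ('-') are ignored. Other
    normalizations (for example by query length) are equally common.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def filter_templates(alignments, cutoff=40.0):
    """Keep template names whose identity to the query is at most `cutoff`.

    alignments : dict mapping template name -> (aligned_query, aligned_template)
    """
    return [
        name
        for name, (query, template) in alignments.items()
        if sequence_identity(query, template) <= cutoff
    ]
```

Under this sketch a homolog at 27.1% identity, like the one reported above, would pass a 40% cutoff and remain available as a template.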

11.9 Concluding Remarks

Genome-wide structure predictions have been carried out by state-of-the-art methods for a number of organisms, with representative examples including the predictions for S. cerevisiae by MODELLER (Sanchez and Sali 1998), the yeast proteome by ROSETTA (Malmstrom et al. 2007), and the E. coli proteome (Skolnick et al. 2004a), all human GPCRs (Zhang et al. 2006a), and the C. trachomatis proteome by TASSER/I-TASSER (Skolnick et al. 2004a; Wu et al. 2007a). A large percentage of the proteins in proteomes (e.g., 47% of yeast proteins) can be classified as a comparative modeling or fold recognition target, for which reliable structures can be built by current template-based methods. These predictions are immediately useful

for function prediction and for the design and interpretation of wet-lab experiments (Zhang 2009b). For the proteins with no recognizable relationship to known structures, ab initio methods have to be developed for structure prediction. However, there are serious limitations to the application of the ab initio methods, which hamper their use in genome-wide prediction. In the absence of recognizable similarity to proteins with known domains, splitting long sequences into domains can be done with only limited accuracy. Even when domain parsing is successful, ab initio methods can hardly be applied to domains >150 residues (Zhang 2008b). Membrane proteins are another group which is often excluded from the prediction attempts except for special classes where homologous or analogous templates are available. Therefore, the efforts of genome-wide prediction based on ab initio approaches (Kihara et al. 2002; Skolnick et al. 2004a; Malmstrom et al. 2007) and those aimed at membrane proteins (Zhang et al. 2006a) are exceptionally important. Although their results so far are encouraging, when all structure prediction approaches are combined, a significant fraction (∼1/3) of the proteome remains that is inaccessible to current methods (Zhang 2008b). Although the ultimate purpose of structure prediction is to help design and interpret experiments, the accuracy of the final model determines its possible use. Only high-resolution models can be used for reliable docking or drug design; the lower-resolution structures can be useful for superfamily assignment or putative functional annotation (Zhang 2009b). The refinement of low-resolution models to achieve higher resolution is therefore of key importance but remains a challenge (Kopp et al. 2007; Read and Chavali 2007). In the context of genome-wide protein structure prediction, the “sequence-to-structure-to-function” paradigm does not necessarily have to be conceived as a one-way path.
Obviously, functional annotation can be based on predicted structures; but this relationship also works the other way: existing functional information can help select the most likely structure when several different candidate structures are available. The study of Malmstrom et al. (2007) represents a prime example for this logic: the assignment of SCOP superfamilies to ab initio-predicted domain structures was augmented by the available functional data. In the future, the integration of computational and experimental findings will be essential to enhance our understanding of biological processes. Acknowledgments The project is supported in part by the Alfred P. Sloan Foundation, NSF Career Award (DBI 0746198), and the National Institute of General Medical Sciences (R01GM083107, R01GM084222).

References

Aloy P, Querol E, Aviles F, Sternberg J (2001) Automated structure based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J Mol Biol 311:395–408 Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402


Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230 Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294:93–96 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242 Blattner F, Plunkett G III, Bloch C, Perna N, Burland V, Riley M, Collado-Vides J, Glasner J, Rode C, Mayhew G, others (1997) The complete genome sequence of E. coli K-12. Science 277:1453–1474 Bowie JU, Luthy R, Eisenberg D (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science 253(5016):164–170 Bradley P, Misura K, Baker D (2005) Towards high-resolution de novo structure prediction for small proteins. Science 309:1868–1871 Caffrey M (2003) Membrane protein crystallization. J Struct Biol 142:108–132 Canutescu AA, Shelenkov AA, Dunbrack RL Jr (2003) A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 12:2001–2014 Chandonia J, Brenner S (2006) The impact of structural genomics: expectations and outcomes. Science 311:347–351 Chen H, Zhou HX (2005a) Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucleic Acids Res 33(10):3193–3199 Chen J, Kuei C, Sutton S, Wilson S, Yu J, Kamme F, Mazur C, Lovenberg T, Liu C (2005b) Identification and pharmacological characterization of prokineticin 2beta as a selective ligand for prokineticin receptor 1. Mol Pharmacol 67:2070–2076 Cheng J, Baldi P (2005) Three-stage prediction of protein beta-sheets by neural networks, alignments and graph algorithms. Bioinformatics 21(Suppl 1):i75–84 Cheng J, Baldi P (2007) Improved residue contact prediction using support vector machines and a large feature set.
BMC Bioinformatics 8:113 Cherezov V, Rosenbaum DM, Hanson MA, Rasmussen SG, Thian FS, Kobilka TS, Choi HJ, Kuhn P, Weis WI, Kobilka BK, others (2007) High-resolution crystal structure of an engineered human beta2-adrenergic G protein-coupled receptor. Science 318(5854):1258–1265 Drews J (2000) Drug discovery: a historical perspective. Science 287(5460):1960–1964 Du P, Salon JA, Tamm JA, Hou C, Cui W, Walker MW, Adham N, Dhanoa DS, Islam I, Vaysse PJ, others (1997) Modeling the G-protein-coupled neuropeptide Y Y1 receptor agonist and antagonist binding sites. Protein Eng 10:109–117 Fischer D, Eisenberg D (1997) Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium. Proc Natl Acad Sci 94:11929–11934 Fiser A, Do RK, Sali A (2000) Modeling of loops in protein structures. Protein Sci 9:1753–1773 Flower DR (1999) Modelling G-protein-coupled receptors for drug design. Biochim Biophys Acta 1422:207–234 Fraser C, Gocayne J, White O, Adams M, Clayton R, Fleischmann R, Bult C, Kerlavage A, Sutton G, Kelley J, others (1995) The minimal gene complement of Mycoplasma genitalium. Science 270:397–403 Frishman D, Argos P (1995) Knowledge-based protein secondary structure assignment. Proteins 23:566–579 Gerstein M, Edwards A, Arrowsmith C, Montelione G (2003) Structural genomics: Current progress. Science 299(5613):1663 Granier S, Kim S, Shafer AM, Ratnala VR, Fung JJ, Zare RN, Kobilka B (2007) Structure and conformational changes in the C-terminal domain of the beta2-adrenoceptor: insights from fluorescence resonance energy transfer studies. J Biol Chem 282:13895–13905 Hubbard R ed (2006) Structure based drug discovery, Royal Society of Chemistry. Hwa J, Graham RM, Perez DM (1995) Identification of critical determinants of alpha 1-adrenergic receptor subtype selective agonist binding. 
J Biol Chem 270:23189–23195 Jaakola VP, Griffith MT, Hanson MA, Cherezov V, Chien EY, Lane JR, Ijzerman AP, Stevens RC (2008) The 2.6 angstrom crystal structure of a human A2A adenosine receptor bound to an antagonist. Science 322(5905):1211–1217


Jones D (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202 Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature 358(6381):86–89 Jones DT, Taylor WR, Thornton JM (1994) A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33(10):3038–3049 Karplus K, Barrett C, Hughey R (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14(10):846–856 Kihara D, Lu H, Kolinski A, Skolnick J (2001) TOUCHSTONE: an ab initio protein structure prediction method that uses threading based tertiary restraints. Proc Natl Acad Sci 98:10125–10130 Kihara D, Zhang Y, Lu H, Kolinski A, Skolnick J (2002) Ab initio protein structure prediction on a genomic scale: application to Mycoplasma genitalium genome. Proc Natl Acad Sci 99:5993–5998 Klepeis JL, Wei Y, Hecht MH, Floudas CA (2005) Ab initio prediction of the three-dimensional structure of a de novo designed protein: a double-blind case study. Proteins 58:560–570 Kleywegt GJ (1999) Recognition of spatial motifs in protein structures. J Mol Biol 285:1887–1897 Kolinski A, Skolnick J (1994) Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins 18:338–352 Kopp J, Bordoli L, Battey JN, Kiefer F, Schwede T (2007) Assessment of CASP7 predictions for template-based modeling targets. Proteins 69(Suppl 8):38–56 Ladoux A, Frelin C (2000) Coordinated up-regulation by hypoxia of adrenomedullin and one of its putative receptors (RDC-1) in cells of the rat blood–brain barrier. J Biol Chem 275:39914–39919 Li Y, Zhang Y (2009) REMO: a new protocol to refine full atomic protein models from C-alpha traces by optimizing hydrogen-bonding networks. Proteins 76(3):665–676 Liwo A, Lee J, Ripoll DR, Pillardy J, Scheraga HA (1999) Protein structure prediction by global optimization of a potential energy function.
Proc Natl Acad Sci USA 96(10):5482–5485 Lopez G, Rojas A, Tress M, Valencia A (2007) Assessment of predictions submitted for the CASP7 function prediction category. Proteins 69(Suppl 8):165–174 Lundstrom K (2005) Structural biology of G protein-coupled receptors. Bioorg Med Chem Lett 15:3654–3657 Mac TT, von Hacht A, Hung KC, Dutton RJ, Boyd D, Bardwell JC, Ulmer TS (2008) Insight into disulfide bond catalysis in Chlamydia from the structure and function of DsbH, a novel oxidoreductase. J Biol Chem 283:824–832 Malmstrom L, Riffle M, Strauss CE, Chivian D, Davis TN, Bonneau R, Baker D (2007) Superfamily assignments for the yeast proteome through integration of structure prediction with the gene ontology. PLoS Biol 5:e76 Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, others (2005) CDD: a conserved domain database for protein classification. Nucleic Acids Res 33(Database issue):D192–196 Marti-Renom M, Stuart A, Fiser A, Sanchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Ann Rev Biophys Biomol Struct 29: 291–325 McGuffin L, Jones D (2003) Improvement of GenTHREADER method for genomic fold recognition. Bioinformatics 19:874–881 Miao Z, Luker KE, Summers BC, Berahovich R, Bhojani MS, Rehemtulla A, Kleer CG, Essner JJ, Nasevicius A, Luker GD, others (2007) CXCR7 (RDC1) promotes breast and lung tumor growth in vivo and is expressed on tumor-associated vasculature. Proc Natl Acad Sci USA 104(40):15735–15740 Michino M, Abola E, et al. (2009) Community-wide assessment of GPCR structure modelling and ligand docking: GPCR Dock 2008. Nat Rev Drug Discov 8(6):455–463


Needleman S, Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453 Oldziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA, Jagielska A, Makowski M, others (2005) Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proc Natl Acad Sci USA 102:7547–7552 Ostermeier C, Michel H (1997) Crystallization of membrane proteins. Curr Opin Struct Biol 7:697–701 Palczewski K, Kumasaka T, Hori T, Behnke CA, Motoshima H, Fox BA, Le Trong I, Teller DC, Okada T, Stenkamp RE, others (2000) Crystal structure of rhodopsin: a G protein-coupled receptor. Science 289(5480):739–745 Pisarska M, Mulchahey JJ, Sheriff S, Geracioti TD, Kasckow JW (2001) Regulation of corticotropin-releasing hormone in vitro. Peptides 22:705–712 Rasmussen SG, Choi HJ, Rosenbaum DM, Kobilka TS, Thian FS, Edwards PC, Burghammer M, Ratnala VR, Sanishvili R, Fischetti RF, others (2007) Crystal structure of the human beta2 adrenergic G-protein-coupled receptor. Nature 450(7168):383–387 Read RJ, Chavali G (2007) Assessment of CASP7 predictions in the high accuracy template-based modeling category. Proteins 69(Suppl 8):27–37 Rosenbaum DM, Cherezov V, Hanson MA, Rasmussen SG, Thian FS, Kobilka TS, Choi HJ, Yao XJ, Weis WI, Stevens RC, others (2007) GPCR engineering yields high-resolution structural insights into beta2-adrenergic receptor function. Science 318(5854):1266–1273 Rost B (1999) Twilight zone of protein sequence alignments. Protein Eng 12:85–94 Roy A, Kucukural A, Mukherjee S, Hefty PS, Zhang Y (2010) Large scale benchmarking of protein function prediction using modeled protein structures. J Mol Biol (Submitted) Sali A, Blundell T (1993) Comparative protein modelling by satisfaction of spatial restraints. 
J Mol Biol 234:779–815 Sanchez R, Pieper U, Mirkovic N, Bakker Pd, Wittenstein E, Sali A (2000) MODBASE, a database of annotated comparative protein structure models. Nucleic Acids Res 28:250–253 Sanchez R, Sali A (1997) Evaluation of comparative protein structure modelling by MODELLER3. Proteins Suppl 1:50–58 Sanchez R, Sali A (1998) Large scale structure modelling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci 95:13597–13602 Sautel M, Rudolf K, Wittneben H, Herzog H, Martinez R, Munoz M, Eberlein W, Engel W, Walker P, Beck-Sickinger AG (1996) Neuropeptide Y and the nonpeptide antagonist BIBP 3226 share an overlapping binding site at the human Y1 receptor. Mol Pharmacol 50:285–292 Schwartz TW (1994) Locating ligand-binding sites in 7TM receptors by protein engineering. Curr Opin Biotechnol 5:434–444 Shi J, Blundell TL, Mizuguchi K (2001) FUGUE: sequence–structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310:243–257 Shi L, Javitch JA (2002) The binding site of aminergic G protein-coupled receptors: the transmembrane segments and second extracellular loop. Annu Rev Pharmacol Toxicol 42:437–467 Simons KT, Kooperberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209–225 Simons KT, Strauss C, Baker D (2001) Prospects for ab initio protein structural genomics. J Mol Biol 306:1191–1199 Sippl M, Weitckus S (1992) Detection of native like models for amino acid sequences of unknown three-dimensional structure in a database of known protein conformations. Proteins 13:258–271 Skolnick J, Fetrow JS, Kolinski A (2000) Structural genomics and its importance for gene function analysis. Nat Biotechnol 18:283–287 Skolnick J, Kihara D (2001) Defrosting the frozen approximation: PROSPECTOR – a new approach to threading. Proteins: Struct Funct Genet 42:319–331


Skolnick J, Kihara D, Zhang Y (2004a) Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins 56:502–518 Skolnick J, Kihara D, Zhang Y (2004b) Development and large scale benchmark testing of the Prospector_3 threading algorithm. Proteins 56:502–518 Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197 Soding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21: 951–960 Tramontano A, Morea V (2003) Assesment of homology based predictions in CASP 5. Proteins 53(Suppl 6):352–368 Vitkup D, Melamud E, Moult J, Sander C (2001) Completeness in structural genomics. Nat Struct Biol 8:559–566 Wallace AC, Laskowski RA, Thornton JM (1996) Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci 5:1001–1013 Warne T, Serrano-Vega MJ, Baker JG, Moukhametzianov R, Edwards PC, Henderson R, Leslie AG, Tate CG, Schertler GF (2008) Structure of a beta1-adrenergic G-protein-coupled receptor. Nature 454(7203):486–491 Watson S, Arkinstall S. (1994) The G protein linked receptors factbook. Academic, New York, NY Wiley SR (1998) Genomics in the real world. Curr Pharm Des 4:417–422 Wu S, Skolnick J, Zhang Y (2007a) Ab initio modelling of small proteins by iterative TASSER simulations. BMC Biol 5:17 Wu S, Zhang Y (2007b) LOMETS: a local meta-threading-server for protein structure prediction. Nucleic Acids Res 35(10):3375–3382 Wu S, Zhang Y (2008a) A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics 24:924–931 Wu S, Zhang Y (2008b) MUSTER: improving protein sequence profile–profile alignments by using multiple sources of structure information. Proteins 72:547–556 Wu S, Zhang Y (2009) Improving protein tertiary structure assembly by sequence based contact predictions. 
Submitted Xu Y, Xu D (2000) Protein threading using PROSPECT: design and evaluation. Proteins 40:343–354 Zhang B, Jaroszewski L, Rychlewski L, Godzik A (1997) Similarities and differences between nonhomologous proteins with similar folds: evaluation of threading strategies. Fold Des 2:307–317 Zhang Y (2007) Template-based modeling and free modeling by I-TASSER in CASP7. Proteins 69(Suppl 8):108–117 Zhang Y (2008a) I-TASSER server for protein 3D structure prediction. BMC Bioinformatics 9:40 Zhang Y (2008b) Progress and challenges in protein structure prediction. Curr Opin Struct Biol 18:342–348 Zhang Y (2009a) I-TASSER: fully automated protein structure prediction in CASP8. Proteins: In press Zhang Y (2009b) Protein structure prediction: when is it useful? Curr Opin Struct Biol 19:145–155 Zhang Y, Devries ME, Skolnick J (2006a) Structure modeling of all identified G protein-coupled receptors in the human genome. PLoS Comput Biol 2:e13 Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J (2006b) On the origin and highly likely completeness of single-domain protein structures. Proc Natl Acad Sci USA 103:2605–2610 Zhang Y, Kihara D, Skolnick J (2002) Local energy landscape flattening: Parallel hyperbolic Monte-Carlo sampling of protein folding. Proteins 48:192–201 Zhang Y, Kolinski A, Skolnick J (2003) TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophys J 85:1145–1164 Zhang Y, Skolnick J (2004a) Automated structure prediction of weakly homologous proteins on a genomic scale. Proc Natl Acad Sci 101:7594–7599


Zhang Y, Skolnick J (2004b) Spicker: approach to clustering protein structures for near native model selection. J Comp Chem 25:865–871 Zhang Y, Skolnick J (2004c) Tertiary structure predictions on a comprehensive benchmark of medium to large size proteins. Biophys J 87:2647–2655 Zhang Y, Skolnick J (2005a) The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA 102:1029–1034 Zhang Y, Skolnick J (2005b) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33:2302–2309 Zhou H, Zhou Y (2004) Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins 55:1005–1013 Zhou H, Zhou Y (2005) Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins 58:321–328 Zhou W, Flanagan C, Ballesteros JA, Konvicka K, Davidson JS, Weinstein H, Millar RP, Sealfon SC (1994) A reciprocal mutation supports helix 2 and helix 7 proximity in the gonadotropin-releasing hormone receptor. Mol Pharmacol 45:165–170

Chapter 12

Multiscale Approach to Protein Folding Dynamics

Sebastian Kmiecik, Michał Jamroz, and Andrzej Kolinski

Abstract The dynamic behavior of proteins is key to understanding the functions of a living cell. Describing the conformational transitions of proteins remains extremely difficult for both computational simulation and experimental techniques. No single technique can span the extremely short dynamic events together with the long-timescale processes in which the most interesting transitions occur. New methods for simulation and for the use of all accessible experimental data are therefore needed. Advances in the development of hybrid models, which attempt to combine the efficiency of simplified modeling with the accuracy of atomic resolution, should provide new opportunities for using computer simulation to integrate different kinds of data and to study folding dynamics at relevant timescales. This review outlines the advances in the description of protein dynamics and discusses recent applications of the CABS reduced modeling tool to studies of protein folding dynamics.

A. Kolinski (✉), Faculty of Chemistry, University of Warsaw, Warsaw, Poland, e-mail: [email protected]
A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_12, © Springer Science+Business Media, LLC 2011

12.1 Introduction

Protein folding and unfolding are among the essential processes in a living cell. Attempts to simulate protein dynamics have recently become very popular for three reasons: the fundamental role of protein flexibility in the functions of living organisms, the growing threat of protein misfolding and aggregation diseases (e.g., Alzheimer’s and Parkinson’s), and recent advances in experimental and simulation techniques. The complexity of the problem was highlighted 40 years ago by Levinthal, who pointed out that folding a protein by a random conformational search would take an impossibly long time (Levinthal 1968). Despite significant technological development since then, we are still able neither to reliably predict protein structures from their sequences alone nor to monitor protein structure dynamics on relevant timescales. Recent studies show that high-resolution structure prediction is possible for small proteins, but it requires huge computational resources and success is not guaranteed (Bradley et al. 2005b). A detailed understanding of the folding process is expected to lead to significant improvement of structure prediction algorithms. This requires characterizing all alternative protein conformations that emerge along the folding pathway, including the unfolded state and partially folded intermediates.

Protein folding includes both extremely short-timescale subprocesses and long-timescale rearrangements of thousands of atoms, so it remains extremely difficult not only to simulate but also to study experimentally. Recent reports show progress in the development of experimental techniques that allow increasingly detailed descriptions of very fast processes and short-lived transient conformations. Simulation techniques are progressing as well, and the number of opportunities for combining experimental results with simulation is therefore growing. General or sparse experimental observations can usually be interpreted by a simulation or used to guide it. Theoretical studies have already led to a better understanding of experimental results by providing easy-to-interpret structural models (Schaeffer et al. 2008). The role of computational techniques is thus to deliver all-atom structural models describing the whole process, either by utilizing experimental data or, if possible, by ab initio prediction. Importantly, combining simulation and experiment allows constant validation of the simulation techniques.

This article begins with a short characterization of the experimental techniques that provide input for folding dynamics simulation. Applications of all-atom Molecular Dynamics (MD) and simplified protein models are then briefly discussed. Next, the outcome of coarse-grained, ab initio, long-timescale folding simulations of three protein model systems is described, with a focus on the comparison of experimental and simulation data. Finally, the perspectives of folding dynamics methods and future development needs are summarized.
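Levinthal's argument from the Introduction can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, three accessible conformations per residue and a sampling rate of 10^13 conformations per second; neither number comes from this chapter, and the function name is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def levinthal_search_time_years(n_residues, states_per_residue=3,
                                samples_per_second=1e13):
    """Years needed to visit every chain conformation once (illustrative)."""
    # Work in log10 so the conformation count itself is never formed.
    log10_conformations = n_residues * math.log10(states_per_residue)
    log10_seconds = log10_conformations - math.log10(samples_per_second)
    return 10 ** (log10_seconds - math.log10(SECONDS_PER_YEAR))

# A 100-residue chain under the assumed parameters.
years = levinthal_search_time_years(100)
print(f"{years:.1e} years")
```

Under these assumptions the exhaustive search for a 100-residue chain takes on the order of 10^27 years, which is the sense in which a purely random search is "impossibly long".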

12.2 Structural Dynamics from Combination of Experiment and Simulation

No single technique, computational or experimental, is able to cover all relevant events of protein folding dynamics (Fig. 12.1). A comprehensive description of protein dynamics requires the integration of different kinds of static and dynamic protein characterizations at different resolutions (Russel et al. 2009). This can be achieved at the atomic level via computational approaches.

Fig. 12.1 Timescale resolution of various experimental techniques used in studying the protein folding dynamics (above the axis) and timescales of protein folding events (gray bars below the axis) compared to the time frame accessible to all-atom MD

Determination of the folded structure is a priority for the complete biochemical characterization of a protein. A large number of folded protein structures have been determined at atomic resolution by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. Since protein flexibility is a key determinant of a protein’s biological function, the derived static structures are frequently insufficient to describe functional mechanisms at the molecular level. Crystal or NMR structures provide a good starting point for the simulation of native-fold dynamics under conditions close to physiological ones (Rueda et al. 2007). Computational techniques also make it possible to combine data of different resolutions in order to interpret low-resolution data. Namely, X-ray or NMR atomic structure data can be combined with low-resolution but valuable results from small-angle X-ray scattering (Forster et al. 2008; Krukenberg et al. 2008) or electron microscopy (DiMaio et al. 2009; Jonic and Venien-Bryan 2009).

Experimental evidence on completely folded conformations is far more abundant than data on the key folding states (for definitions, see Table 12.1). Besides static structure determination, NMR spectroscopy is a powerful experimental technique for characterizing protein dynamics: it spans many timescales and yields sparse, site-specific spatial information that can be used in simulation (Mittermaier and Kay 2006). Folding intermediates have been characterized for only a few proteins, using protein engineering (Matouschek et al. 1992) and NMR techniques (Udgaonkar and Baldwin 1988; Bycroft et al. 1991). Protein engineering (phi-value analysis) remains the only experimental technique for probing Transition State structures at the level of individual residues (Serrano et al. 1992; Otzen et al. 1994).

Table 12.1 Terms used in the field of protein folding structural dynamics. Some definitions were taken and adapted to protein folding from the review by Russel et al. on structural dynamics of macromolecular processes

State: A state is described by a three-dimensional structure of an assembly at some resolution. The structure may be flexible and its description may be incomplete.
Key state: The set of key states and transitions between them captures the essence of the process. Key states need not be stable and can correspond to Transition States.
Transition/Transition State: A transition occurs between a pair of key states that can interconvert directly without passing through other key states. If not indicated otherwise, the term Transition State usually refers to the state of the highest energy along the folding reaction coordinate. This state is thought to consist of a large number of extremely short-lived, partially folded Transition State structures with equal probability of completing the folding process or unfolding again.
Restraint: A restraint restricts geometric and/or temporal properties of an assembly, such as the distance between two components, the overall shape of the complex, or the time interval between two key states. A restraint is a scalar function that quantifies the agreement between a restrained feature and the data.
Protein engineering (Phi-value analysis): The method relies on the quantity Phi: Phi = 0 suggests the absence of an interaction in the Transition State, whereas Phi = 1 marks an interaction similar to that in the native state (Matouschek et al. 1989). Phi values result from comparing the conformational folding stability and folding kinetics of the wild-type protein with those of one or more point mutants.
Protein folding pathway: A pathway is represented as a set of key states connected by transitions with associated trajectory and rate information.
Molten globule: A molten globule is a stable, collapsed state with partial order that proteins can exhibit under certain conditions (Kuwajima 1989; Ptitsyn 1995). Compared to the native structure, a molten globule possesses less tightly packed native-like secondary and tertiary structure.

Much less is known about very early folding events, although it is clearly important to understand how protein folding is initiated and how the native structure is formed. The denatured state, a highly heterogeneous ensemble of partially folded conformations, is very difficult to study, although there have been recent reports of NMR studies of residual structure in denatured proteins. Such structures, along with hydrophobic clusters, were discovered even under highly denaturing conditions (Kazmirski et al. 2001; Klein-Seetharaman et al. 2002). Moreover, denatured proteins can exhibit long-range ordering with a native-like topology (Shortle and Ackerman 2001). The folding process can therefore be directed from the very beginning when it starts from a specific structure (Dobson 1994; Blanco et al. 1998). It has become evident that the denatured state plays a crucial role in all aspects of protein stability and folding mechanisms (Shortle 1996).

MD is a well-established and powerful method for studying the dynamics of complex molecular systems. However, there is a gap between the timescales of classical MD simulation and the timescales of protein folding (Fig. 12.1). Recent advances in algorithm scalability and computer hardware have made microsecond-timescale simulations with tens of thousands of atoms practical (Klepeis et al. 2009), but an average protein folds slower by orders of magnitude. Simulations of real-size proteins are thus limited to high-temperature unfolding simulations (whose results are questionable because of the drastic, highly non-physiological conditions) or to the dynamics of the experimental structure. For larger proteins, all-atom simulations of the entire folding process, from random coil to native state, are so far possible only for Go models. There have been a number of Go potential-based studies of protein G using a simplified model (Prieto et al. 2005), an all-atom model (Shimada and Shakhnovich 2002), or a weak Go-like contribution to the applied force field (Lee et al. 2004). In Go models, only native interactions are taken into account; consequently, the native conformation is guaranteed to have the lowest energy. The obvious weak point of this approach is that knowledge of the native structure is needed to construct the Go potential. A significant shortcoming also comes from neglecting non-native interactions and thereby ignoring their sometimes important role in folding mechanisms (Rothwarf and Scheraga 1996; Blanco et al. 1997).
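The defining property of a Go model, that only native contacts contribute and the native conformation therefore has the lowest energy, can be sketched in a few lines. This is a minimal illustration on one interaction center per residue, not the potential used in any of the cited studies; the 8 Å contact cutoff, the minimum sequence separation of 3, the uniform contact energy, the toy coordinates, and all function names are assumptions.

```python
import math

def contact_set(coords, cutoff=8.0, min_separation=3):
    """Residue pairs (i, j), j >= i + min_separation, closer than cutoff."""
    n = len(coords)
    return {(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if math.dist(coords[i], coords[j]) < cutoff}

def go_energy(coords, native_contacts, epsilon=1.0, cutoff=8.0):
    """Go-type energy: -epsilon per native contact that is formed.

    Contacts absent from the native structure contribute nothing.
    """
    formed = contact_set(coords, cutoff) & native_contacts
    return -epsilon * len(formed)

# Toy 8-residue hairpin taken as the "native" structure (coordinates in Å),
# compared with a fully extended chain of the same length.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
          (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
extended = [(3.8 * i, 0.0, 0.0) for i in range(8)]

native_contacts = contact_set(native)
e_native = go_energy(native, native_contacts)
e_extended = go_energy(extended, native_contacts)
```

By construction no conformation can score below `e_native`, which also illustrates the weak point noted above: the contact list cannot be built without already knowing the native structure.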

12.3 Protein Dynamics by High-Resolution Reduced Modeling

Because of the timescale limitations of all-atom molecular mechanics, reduced models offer the most promising means of studying large-scale protein rearrangements during the folding process (Kolinski and Skolnick 2004). The simplest models of protein-like systems, with a highly idealized chain representation and interaction scheme, led to an understanding of the basic rules governing protein folding (Chan and Dill 1990). More specific problems, such as some aspects of folding kinetics, can be addressed with a more complex interaction scheme and a still very simple chain representation (Thirumalai and Klimov 1999). The most advanced reduced models (high-resolution models employing complex interactions) enable folding studies of real proteins. Such simulations can be performed successfully with various sampling and interaction schemes, such as a Monte Carlo search with knowledge-based statistical potentials (CABS) (Kmiecik and Kolinski 2007a, 2008) or Langevin dynamics with the physics-based united-residue force field (UNRES) (Liwo et al. 2005).
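Monte Carlo sampling of the kind mentioned above commonly relies on the Metropolis acceptance rule. The sketch below applies that rule to a one-dimensional toy energy function; it is a generic illustration and not the CABS sampling scheme or force field. The energy function, step size, and all names are hypothetical, and temperature is given in reduced units (Boltzmann constant set to 1).

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng):
    """Accept downhill moves always, uphill moves with exp(-dE/T)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

def run_monte_carlo(energy, x0, temperature, n_steps=10000, step=0.5, seed=0):
    """Sample a 1D coordinate with random local moves; return the trajectory."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    trajectory = [x]
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)   # trial move
        e_new = energy(x_new)
        if metropolis_accept(e_new - e, temperature, rng):
            x, e = x_new, e_new                # move accepted
        trajectory.append(x)                   # rejected moves repeat x
    return trajectory

def double_well(x):
    """Toy energy surface with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

traj = run_monte_carlo(double_well, x0=2.0, temperature=0.2)
mean_energy = sum(double_well(x) for x in traj) / len(traj)
```

In a reduced protein model the single coordinate would be replaced by a chain conformation and the trial move by a local chain rearrangement, but the acceptance step keeps the same form.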

12.3.1 Paradigm Systems of Protein Folding Studies by High-Resolution De Novo Modeling

Numerous experimental and simulation studies have established three small proteins as model systems for folding studies: barnase (experimental structure PDB code: 1BNR), chymotrypsin inhibitor (2CI2), and the B1 domain of protein G (2GB1). Since the early 1990s, as a result of extensive and pioneering protein engineering analyses, barnase and chymotrypsin inhibitor have been presented as complementary variants of the protein folding mechanism (for general observations from experiments, see Table 12.2).

Table 12.2 General experimental findings on the model system structure dynamics

Proteins: Barnase; 2GB1; 2CI2
Number of aa: Barnase, 110; 2GB1, 56; 2CI2, 64
Number of domains and hydrophobic cores in native: Barnase, 2 domains, 3 cores; 2GB1 and 2CI2, single domain, single core
Folding kinetics: Barnase, 3-state (through at least one intermediate) (Fersht 1993); 2GB1, 2-state (Jackson 1998; Krantz et al. 2002) or 3-state (Park et al. 1999; Roder et al. 2006), presence of an intermediate is under debate; 2CI2, 2-state (no intermediates detected) (Jackson and Fersht 1991)
Folding mechanism: Variants of nucleation–condensation mechanism (Daggett and Fersht 2003a,b). Barnase: the greater the tendency for stable secondary structure in the denatured state, the more hierarchical the folding, and the TS is assembled from pre-formed elements of secondary structure. 2GB1 and 2CI2: as the propensity for stable secondary structure decreases, consolidation of the secondary and tertiary structure is less separated and occurs simultaneously during condensation from the extended nucleus formed in the TS.
Denatured state: Barnase, considerable amount of residual structure (Arcus et al. 1995; Freund et al. 1996); 2GB1, different from the fully unfolded random coil state: restricted motions in native helical and second β-hairpin areas (Kuszewski et al. 1994; Frank et al. 1995); 2CI2, highly unstructured: very slight tendency for the native helical structure and a minor hydrophobic clustering near the center of the chain (de Prat Gay et al. 1995)

Folding of the model proteins from the denatured to the native state occurs on a millisecond timescale and therefore remains inaccessible to all-atom MD. Our research group has recently attempted to characterize the full folding process by performing unbiased, ab initio simulations in a high-resolution, reduced representation space (Kmiecik and Kolinski 2007a, 2008), using the CABS model (Kolinski 2004). The use of the reduced representation of the polypeptide chain led to a significant reduction
of the conformational space. Thus, simulating the evolution of the system from a highly denatured to a near-native state was possible on a reasonable timescale. Compared with the experimental results, we obtained a similar sequence of folding events and identified the interactions critical for the folding process. For the simulation results and a comparison with other findings, see Table 12.3. It is particularly interesting to compare GB1 with CI2, which have similar single-domain characteristics (Table 12.3). The CABS simulation observations (Table 12.3, Fig. 12.2) are in slight disagreement with the interpretation of the experimental data summarized by Daggett and Fersht (2003a,b). According to them, CI2 folds via nucleation collapse around an extended nucleus, similar to what has been observed for GB1. Indeed, in the case of GB1, all nucleus residues take part in the nucleation event at very early stages of folding. In contrast, CI2 folds via the assembly of distinct cooperative subunits (Kmiecik and Kolinski 2007a). At the folding transition, the only native tertiary interactions observed are those between the two central strands, β3–β4. Consolidation of the α-helix and β3–β4 takes place at lower temperatures. Interestingly, a comparison of the CABS energy (or radius of gyration) as a function of temperature shows a very similar exponential thermal dependence for all three proteins, which means that the stepwise formation of cooperative subunits in CI2 does not affect the characteristics of these observables (Kmiecik and Kolinski 2007a, 2008). Importantly, the differences in the folding pathways observed in the simulations of GB1 and CI2 agree with the available experimental data (in particular, the important long-range interactions revealed are consistent with the phi-value analysis). The simulations described here provide detailed insight into the folding mechanism at the level of individual residues.
Since procedures for reconstructing the protein chain to an all-atom representation exist (Gront et al. 2007), a smooth and fully automated transition to atomic resolution is feasible (Kmiecik et al. 2007b). Such a hierarchical methodology was successfully applied to protein structure prediction during a community-wide testing experiment of prediction methods (Kolinski and Bujnicki 2005). The approach ranked second overall as well as in the new fold category (the critical test for ab initio methods), after ROSETTA, which recombines short fragments extracted from known protein structures (Bradley et al. 2005a). The presented approach goes far beyond simple analytical or Go models, or all-atom MD, in that it enables the study of complete unfolding/folding pathways. The physically realistic folding mechanisms observed in the CABS simulations imply that the interactions in the denatured state are very similar to those in the native structures. Consequently, the knowledge-based potentials derived from native structures are a good approximation of the interactions in the denatured state. Moreover, the proposed Monte Carlo dynamics and sampling scheme mimic the qualitative features of the continuous long-time dynamics of proteins. Therefore, the suggested model may be a useful tool for qualitative studies of entire folding pathways of large proteins and macromolecular assemblies.

Table 12.3 General observations on the model system structure dynamics from the computer simulation

Proteins: Barnase, 2GB1, 2CI2

Insights from all-atom simulations
  Barnase: MD simulations of the denatured state (Bond et al. 1997; Wong et al. 2000), intermediate state (Li and Daggett 1998), and Transition State with restraints from Phi-values (Salvatella et al. 2005) provided the models and enabled the interpretation of NMR and Phi-value data.
  2GB1: MC simulations with Go potential (Shimada and Shakhnovich 2002) and with restraints from Phi values (Hubner et al. 2004) identified six residues forming the folding nucleus.
  2CI2: MC simulations with Go potential combined with Phi-value analyses revealed that β3–β4 should be the last element to unfold (Li and Shakhnovich 2001). MD unfolding showed sequential or parallel unfolding of substructures; a preference of β3–β4 as the last and β1–β5 as the first to unfold could be noted (Lazaridis and Karplus 1997; Ferrara et al. 2000a,b; Reich and Weikl 2006).

High-resolution reduced model simulation results (de novo by CABS) (Kmiecik and Kolinski 2007a, 2008)
  All three proteins: Near-perfect correspondence between the most persistent long-range contacts in the simulated denatured state and the folding nuclei described in protein engineering studies of barnase (Matouschek et al. 1992), 2GB1 (McCallister et al. 2000), and CI2 (Jackson et al. 1993).
  Barnase and 2GB1: Nucleation–condensation mechanism consistent with the experimental observations and the simulations mentioned above. Heat capacity peak involved with completion of nucleus formation and transition to molten globule. Nucleus consists of long-range native-like hydrophobic interactions. Nucleus initiation sites are early formed portions of secondary structure: a single β-hairpin and an α-helix, adjacent to each other in the nucleus core.
  2CI2: Sequential assembly of cooperative subunits, identical to those obtained in MD studies (see above). Heat capacity peak involved with transition from a highly unstructured state to hydrophobic clustering around β3–β4.

Fig. 12.2 Folding pathways explored by CABS: a comparison of key states of barnase, the B1 domain of protein G, and chymotrypsin inhibitor 2 in a highly denatured state, just before Tt, and below Tt (Kmiecik and Kolinski 2007a, 2008). The transition temperature (Tt) is identified by the steepest drop of the energy and the peak of the heat capacity. Tt cannot be strictly identified with the major Transition State. Sometimes, as for CI2, conformations observed at Tt may be relatively unstructured, with some features of a molten globule state
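The identification of Tt from the heat capacity peak can be sketched as follows. This is a minimal illustration under stated assumptions: the heat capacity is estimated from energy fluctuations, Cv = (<E^2> - <E>^2) / T^2, in reduced units with k_B = 1, and Tt is taken as the sampled temperature with the largest Cv. The function names and the toy energies are hypothetical and are not taken from the CABS studies.

```python
from statistics import pvariance


def heat_capacity(energies, temperature):
    """Fluctuation estimate of Cv in reduced units (k_B = 1, an assumption)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # pvariance gives <E^2> - <E>^2 over the sampled energies.
    return pvariance(energies) / temperature ** 2


def transition_temperature(samples):
    """samples maps each temperature to a list of sampled energies.

    Returns (Tt, Cv at Tt), where Tt is the sampled temperature whose
    estimated heat capacity is the largest.
    """
    cv = {t: heat_capacity(e, t) for t, e in samples.items()}
    tt = max(cv, key=cv.get)
    return tt, cv[tt]


# Made-up energies: narrow distributions far from the transition,
# a broad (two-population) distribution near it.
toy_samples = {
    1.0: [-100.0, -101.0, -99.0, -100.0],
    2.0: [-90.0, -60.0, -85.0, -55.0],
    3.0: [-40.0, -42.0, -38.0, -41.0],
}
tt, cv_peak = transition_temperature(toy_samples)
```

With a fine temperature grid, the same Tt should also coincide with the steepest drop of the mean energy, which offers a simple consistency check.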

12.4 Summary

Molecular-level characterization of protein folding, and of folding in the context of protein interactions, is crucial for understanding the basic mechanisms of life. Most existing models of transient protein folding structures and their complexes have been obtained by custom-designed methodologies for the integration of experimental results, using tools that are not suited to handling large-scale objects and that rely on different, incompatible force fields. Achieving higher efficiency and precision would require new methods for simulation and for the utilization of all accessible experimental data (Russel et al. 2009). Presently, all-atom MD is a well-established technique for the simulation of protein dynamics, reaching timescales longer than one microsecond for tens of thousands of atoms when the most advanced software and hardware infrastructures are used. Over the last 3 years, the maximum simulation speed recorded for an all-atom MD simulation has increased by roughly an order of magnitude, largely due to
more efficient parallelization over large numbers of multicore processing nodes. During the same time period, the capacity of individual high-end processor cores has increased by only about 50% (Klepeis et al. 2009). In practice, all-atom MD with explicit solvent allows folding pathway simulations of peptides or very small, fast-folding proteins. The throughput accessible to MD will certainly continue to increase. However, the gap between what is needed and what is achievable is still huge and will remain so in the years ahead. Access to longer timescales and larger system sizes would require combining all-atom MD with reduced modeling techniques that employ a reduced geometrical representation of the modeled systems and/or simplified models of motion, reduced interaction schemes, and implicit solvent models. In parallel with efficiency, present efforts concentrate on the accuracy of the tools. Currently, improving the force fields of both all-atom and reduced models, as well as the water potentials, is a priority because of precision issues (Scheraga et al. 2007). The development of hybrid models that attempt to combine the efficiency of reduced modeling with high accuracy at atomic resolution, together with experimental results, will be critical. The reduced modeling tool developed in our lab provides a high-throughput element of such systems, and the unique combination of solutions applied there can serve as an inspiration for the design of novel, more efficient tools.

References

Arcus VL, Vuilleumier S et al (1995) A comparison of the pH, urea, and temperature-denatured states of barnase by heteronuclear NMR: implications for the initiation of protein folding. J Mol Biol 254:305–321
Blanco FJ, Ortiz AR et al (1997) Role of a nonnative interaction in the folding of the protein G B1 domain as inferred from the conformational analysis of the alpha-helix fragment. Fold Des 2:123–133
Blanco FJ, Serrano L et al (1998) High populations of non-native structures in the denatured state are compatible with the formation of the native folded state. J Mol Biol 284:1153–1164
Bond CJ, Wong KB et al (1997) Characterization of residual structure in the thermally denatured state of barnase by simulation and experiment: description of the folding pathway. Proc Natl Acad Sci USA 94:13409–13413
Bradley P, Malmstrom L et al (2005a) Free modeling with ROSETTA in CASP6. Proteins 61(S7):128–134
Bradley P, Misura KM et al (2005b) Toward high-resolution de novo structure prediction for small proteins. Science 309:1868–1871
Bycroft M, Ludvigsen S et al (1991) Determination of the three-dimensional solution structure of barnase using nuclear magnetic resonance spectroscopy. Biochemistry 30:8697–8701
Chan HS, Dill KA (1990) Origins of structure in globular proteins. Proc Natl Acad Sci USA 87:6388–6392
Daggett V, Fersht AR (2003a) The present view of the mechanism of protein folding. Nat Rev Mol Cell Biol 4:497–502
Daggett V, Fersht AR (2003b) Is there a unifying mechanism for protein folding? Trends Biochem Sci 28:18–25
de Prat Gay G, Ruiz-Sanz J et al (1995) Conformational pathway of the polypeptide chain of chymotrypsin inhibitor-2 growing from its N terminus in vitro. Parallels with the protein folding pathway. J Mol Biol 254:968–979
DiMaio F, Tyka MD et al (2009) Refinement of protein structures into low-resolution density maps using ROSETTA. J Mol Biol 392:181–190
Dobson CM (1994) Protein folding. Solid evidence for molten globules. Curr Biol 4:636–640
Ferrara P, Apostolakis J et al (2000a) Computer simulations of protein folding by targeted molecular dynamics. Proteins 39:252–260
Ferrara P, Apostolakis J et al (2000b) Targeted molecular dynamics simulations of protein unfolding. J Phys Chem B 104:4511–4518
Fersht AR (1993) The sixth Datta Lecture. Protein folding and stability: the pathway of folding of barnase. FEBS Lett 325(1–2):5–16
Forster F, Webb B et al (2008) Integration of small-angle X-ray scattering data into structural modeling of proteins and their assemblies. J Mol Biol 382:1089–1106
Frank MK, Clore GM et al (1995) Structural and dynamic characterization of the urea denatured state of the immunoglobulin binding domain of streptococcal protein G by multidimensional heteronuclear NMR spectroscopy. Protein Sci 4:2605–2615
Freund SM, Wong KB et al (1996) Initiation sites of protein folding by NMR analysis. Proc Natl Acad Sci USA 93:10600–10603
Gront D, Kmiecik S et al (2007) Backbone building from quadrilaterals: A fast and accurate algorithm for protein backbone reconstruction from alpha carbon coordinates. J Comput Chem 28:1593–1597
Hubner IA, Shimada J et al (2004) Commitment and nucleation in the protein G transition state. J Mol Biol 336:745–761
Jackson SE (1998) How do small single-domain proteins fold? Fold Des 3:R81–R91
Jackson SE, el Masry N et al (1993) Structure of the hydrophobic core in the transition state for folding of chymotrypsin inhibitor 2: a critical test of the protein engineering method of analysis. Biochemistry 32:11270–11278
Jackson SE, Fersht AR (1991) Folding of chymotrypsin inhibitor 2. 1. Evidence for a two-state transition. Biochemistry 30:10428–10435
Jonic S, Venien-Bryan C (2009) Protein structure determination by electron cryo-microscopy. Curr Opin Pharmacol 9:636–642
Kazmirski SL, Wong KB et al (2001) Protein folding from a highly disordered denatured state: the folding pathway of chymotrypsin inhibitor 2 at atomic resolution. Proc Natl Acad Sci USA 98:4349–4354
Klein-Seetharaman J, Oikawa M et al (2002) Long-range interactions within a nonnative protein. Science 295(5560):1719–1722
Klepeis JL, Lindorff-Larsen K et al (2009) Long-timescale molecular dynamics simulations of protein structure and function. Curr Opin Struct Biol 19:120–127
Kmiecik S, Gront D et al (2007b) Towards the high-resolution protein structure prediction. Fast refinement of reduced models with all-atom force field. BMC Structural Biology 7:43
Kmiecik S, Kolinski A (2007a) Characterization of protein-folding pathways by reduced-space modeling. Proc Natl Acad Sci USA 104:12330–12335
Kmiecik S, Kolinski A (2008) Folding pathway of the B1 domain of protein G explored by multiscale modeling. Biophys J 94:726–736
Kolinski A (2004) Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 51:349–371
Kolinski A, Bujnicki JM (2005) Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins 61:84–90
Kolinski A, Skolnick J (2004) Reduced models of proteins and their applications. Polymer 45:511–524
Krantz BA, Mayne L et al (2002) Fast and slow intermediate accumulation and the initial barrier mechanism in protein folding. J Mol Biol 324:359–371
Krukenberg KA, Forster F et al (2008) Multiple conformations of E. coli Hsp90 in solution: insights into the conformational dynamics of Hsp90. Structure 16:755–765
Kuszewski J, Clore GM et al (1994) Fast folding of a prototypic polypeptide: the immunoglobulin binding domain of streptococcal protein G. Protein Sci 3:1945–1952
Kuwajima K (1989) The molten globule state as a clue for understanding the folding and cooperativity of globular-protein structure. Proteins 6:87–103
Lazaridis T, Karplus M (1997) “New view” of protein folding reconciled with the old through multiple unfolding simulations. Science 278(5345):1928–1931
Lee SY, Fujitsuka Y et al (2004) Roles of physical interactions in determining protein-folding mechanisms: molecular simulation of protein G and alpha spectrin SH3. Proteins 55:128–138
Levinthal C (1968) Mossbauer Spectroscopy in Biological Systems. In: Debrunner P, Tsibris J, Munck E (eds) Proceedings of a meeting held at Allerton house, Monticello, Illinois. University of Illinois Press, Urbana, Illinois, pp 22–24
Li A, Daggett V (1998) Molecular dynamics simulation of the unfolding of barnase: characterization of the major intermediate. J Mol Biol 275:677–694
Li L, Shakhnovich EI (2001) Constructing, verifying, and dissecting the folding transition state of chymotrypsin inhibitor 2 with all-atom simulations. Proc Natl Acad Sci USA 98:13014–13018
Liwo A, Khalili M et al (2005) Ab initio simulations of protein-folding pathways by molecular dynamics with the united-residue model of polypeptide chains. Proc Natl Acad Sci USA 102:2362–2367
Matouschek A, Kellis JT Jr et al (1989) Mapping the transition state and pathway of protein folding by protein engineering. Nature 340(6229):122–126
Matouschek A, Serrano L et al (1992) The folding of an enzyme. IV. Structure of an intermediate in the refolding of barnase analysed by a protein engineering procedure. J Mol Biol 224:819–835
McCallister EL, Alm E et al (2000) Critical role of beta-hairpin formation in protein G folding. Nat Struct Biol 7:669–673
Mittermaier A, Kay LE (2006) New tools provide new insights in NMR studies of protein dynamics. Science 312(5771):224–228
Otzen DE, Itzhaki LS et al (1994) Structure of the transition state for the folding/unfolding of the barley chymotrypsin inhibitor 2 and its implications for mechanisms of protein folding. Proc Natl Acad Sci USA 91:10422–10425
Park SH, Shastry MC et al (1999) Folding dynamics of the B1 domain of protein G explored by ultrarapid mixing. Nat Struct Biol 6:943–947
Prieto L, de Sancho D et al (2005) Thermodynamics of Go-type models for protein folding. J Chem Phys 123:154903
Ptitsyn OB (1995) Molten globule and protein folding. Adv Protein Chem 47:83–229
Reich L, Weikl TR (2006) Substructural cooperativity and parallel versus sequential events during protein unfolding. Proteins 63:1052–1058
Roder H, Maki K et al (2006) Early events in protein folding explored by rapid mixing methods. Chem Rev 106:1836–1861
Rothwarf DM, Scheraga HA (1996) Role of non-native aromatic and hydrophobic interactions in the folding of hen egg white lysozyme. Biochemistry 35:13797–13807
Rueda M, Ferrer-Costa C et al (2007) A consensus view of protein dynamics. Proc Natl Acad Sci USA 104:796–801
Russel D, Lasker K et al (2009) The structural dynamics of macromolecular processes. Curr Opin Cell Biol 21:97–108
Salvatella X, Dobson CM et al (2005) Determination of the folding transition states of barnase by using PhiI-value-restrained simulations validated by double mutant PhiIJ-values. Proc Natl Acad Sci USA 102:12389–12394
Schaeffer RD, Fersht A et al (2008) Combining experiment and simulation in protein folding: closing the gap for small model systems. Curr Opin Struct Biol 18:4–9
Scheraga HA, Khalili M et al (2007) Protein-folding dynamics: overview of molecular simulation techniques. Annu Rev Phys Chem 58:57–83
Serrano L, Matouschek A et al (1992) The folding of an enzyme. III. Structure of the transition state for unfolding of barnase analysed by a protein engineering procedure. J Mol Biol 224:805–818
Shimada J, Shakhnovich EI (2002) The ensemble folding kinetics of protein G from an all-atom Monte Carlo simulation. Proc Natl Acad Sci USA 99:11175–11180
Shortle D (1996) The denatured state (the other half of the folding equation) and its role in protein stability. Faseb J 10:27–34
Shortle D, Ackerman MS (2001) Persistence of native-like topology in a denatured protein in 8 M urea. Science 293(5529):487–489
Thirumalai D, Klimov DK (1999) Deciphering the timescales and mechanisms of protein folding using minimal off-lattice models. Curr Opin Struct Biol 9:197–207
Udgaonkar JB, Baldwin RL (1988) NMR evidence for an early framework intermediate on the folding pathway of ribonuclease A. Nature 335(6192):694–699
Wong KB, Clarke J et al (2000) Towards a complete description of the structural and dynamic properties of the denatured state of barnase and the role of residual structure in folding. J Mol Biol 296:1257–1282

Chapter 13

Error Estimation of Template-Based Protein Structure Models

Daisuke Kihara, Yifeng David Yang, and Hao Chen

Abstract

The tertiary structure of proteins provides a rich source of information for understanding protein function and evolution. Computational protein tertiary structure prediction has made significant progress over more than a decade owing to the advancement of techniques and the growth of sequence and structure databases. However, methods for assessing the quality of predicted structure models are not well established. Quality assessment of structure models is important for reranking and selecting the best possible models from a pool of models as a postprocessing step in structure prediction, and many methods have therefore been developed in this direction. Recently, it has also been recognized that model-quality assessment is crucial for the practical use of a model, such as the design and interpretation of biochemical experiments. For such practical applications of a computational model, the real-value quality of the model should be predicted, which is a different task from reranking alternative models. The quality (error) of a model determines its potential practical application, ranging from protein design and the design of site-directed mutagenesis experiments to ligand–protein docking prediction and function prediction from structure. In this chapter, we discuss the importance of real-value error estimation and give an overview of the existing methods.

D. Kihara, Department of Biological Sciences, College of Science; Department of Computer Science, College of Science; Markey Center for Structural Biology, Purdue University, West Lafayette, IN, USA e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_13, © Springer Science+Business Media, LLC 2011

13.1 Introduction

Protein tertiary structure prediction from amino acid sequence has made steady progress owing to advances in techniques as well as the increase in the number of solved structures available as templates for modeling. It is now often possible to build atomic-detailed models that can be used for redesigning protein
function (Ashworth et al. 2006). However, predicting an accurate atomic-detailed model is still not always possible. Depending on the techniques employed and the template structures available for modeling, the accuracy of a model ranges from a root mean square deviation (RMSD) to the native structure of 1.5–2 Å (typically in comparative modeling using a template closely related to the target) to 6 Å (threading), and even over 10 Å when the prediction goes significantly wrong. In the latest Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment documented in the journal Proteins (CASP7, held in the summer of 2006), the average GDT-HA score of models submitted by the top five groups in the template-based modeling (TBM) category was 50.3 (Kopp et al. 2007). In the free-modeling (FM) category, the average GDT-TS score of the top groups was around 30–40, as read from Fig. 13.2B of the assessors’ report (Jauch et al. 2007). The GDT-HA and GDT-TS scores are defined as the average, over four distance thresholds, of the percentage of residues that do not deviate from the target structure by more than the threshold: 0.5, 1.0, 2.0, and 4.0 Å for GDT-HA and 1.0, 2.0, 4.0, and 8.0 Å for GDT-TS. Thus, very roughly speaking, the CASP results imply that on average about half of the residue positions in a model are “correct” in TBM, while only about one-third of the residues are predicted correctly in FM. Therefore, it should be realized that a computational model may have significant errors even if it is constructed by current state-of-the-art methods.

Fig. 13.1 Definition of the SPAD score. SPAD quantifies the diversity of suboptimal alignments around the optimal alignment of a target and a template protein. The difference (distance) of a position in an alignment as compared with an alternative alignment is computed on a dynamic programming (DP) matrix. The target protein sequence is placed on the horizontal axis while the template protein is aligned on the vertical axis. The thick line represents the DP path of the optimal alignment of the two proteins. The distance from a position in the optimal alignment to the i-th suboptimal alignment is defined as the average of the horizontal (d_i^H) and the vertical (d_i^V) distance to the i-th suboptimal alignment. The local SPAD score for a position in the optimal alignment is the average of the distance to the set of suboptimal alignments considered. Then, the global SPAD score of the optimal alignment is defined as the average of the local SPAD score of each position in the alignment. In principle, the computation of SPAD is applicable to any threading method that uses dynamic programming

Methods for error estimation or quality assessment of structure models have drawn much attention in recent years (Kihara et al. 2009). Roughly speaking, there are three purposes or types of quality assessment of protein structure models. First, in structural biology, the stereochemical properties of experimentally solved structures are routinely examined. The second type is to rerank predicted models, i.e., to evaluate the relative accuracy among the models, such that the most native-like structure can be selected from a pool of constructed models. The third purpose is to predict the real-quality value of a model, such as the RMSD of the model to the native structure. These three purposes overlap significantly, but they are not identical. Thus, methods developed for one purpose are not necessarily suitable for the other purposes.

In experimental structural biology, validation of tertiary structures built from experimental data is an important step. Thus, earlier works on protein structure validation focused on identifying potential errors in models built from X-ray diffraction patterns of protein crystals. Tools developed for that purpose include PROCHECK (Laskowski et al. 1993), MOLPROBITY (Davis et al. 2004), protein volume evaluation (PROVE) (Pontius et al. 1996), and WHATCHECK (Hooft et al. 1996). These methods compare stereochemical properties of a protein structure, such as bond lengths, bond angles, hydrogen bonds, and atom clashes, with their regular values sampled from a set of representative protein structures of good resolution. The same methods could be applied to assess the quality of predicted structures (Bhattacharya et al. 2008).
However, these validation tools may not always be suitable for analyzing predicted structures, because they concern small deviations of distances or angles, a level of accuracy that may not be meaningful to expect of predicted structures of moderate accuracy.

The second type of model-quality assessment, reranking of predicted models, has been the most thoroughly studied in recent years in the context of protein structure prediction. In a structure prediction procedure, particularly in ab initio structure prediction, which usually generates a large number of alternative models, selecting a few of the most native-like models is essential. Of course, each structure prediction method has its own scoring function that guides model building, but an outside measure specific to quality assessment can often offer a useful evaluation. For example, clustering of models to find the most populated folds has proven to be an effective way of selecting near-native models (Betancourt and Skolnick 2001; Shortle et al. 1998). In addition, reranking of models is also an important step in a meta-server approach, where models generated by different methods are compared and often combined (Kolinski and Bujnicki 2005). To rerank models, various aspects of the models are evaluated, ranging from structures and energetic terms at the atomic-detailed level, the residue level, and the global-fold level, and often to the sequence-alignment level. In the next section we briefly discuss structural characteristics used as scoring terms in quality assessment.
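The GDT-TS and GDT-HA definitions quoted in the Introduction can be sketched as a short calculation. This is a simplified illustration under stated assumptions: the per-residue deviations are taken as given (for example, Cα distances after a single superposition of model and native), whereas the actual GDT procedure searches for the superposition that maximizes the residue count at each threshold. The function and variable names are hypothetical.

```python
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # thresholds in Å, as quoted in the text
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)  # thresholds in Å, as quoted in the text


def gdt(distances, cutoffs):
    """Average, over the cutoffs, of the percentage of residues within each cutoff.

    distances -- per-residue deviations (Å) between model and native,
                 assumed to come from one fixed superposition (simplification)
    cutoffs   -- iterable of distance thresholds in Å
    """
    if not distances:
        raise ValueError("at least one residue is required")
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(fractions)


# Made-up deviations for a six-residue toy model.
toy_distances = [0.4, 0.9, 1.8, 3.5, 7.0, 12.0]
gdt_ts = gdt(toy_distances, GDT_TS_CUTOFFS)
gdt_ha = gdt(toy_distances, GDT_HA_CUTOFFS)
```

Because GDT-HA uses tighter thresholds, it can never exceed GDT-TS for the same set of deviations, which is why it is the more demanding measure for template-based models.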

The third type of quality assessment, i.e., prediction of the real value of the quality of a model (e.g., its RMSD to the native structure), is less well studied. Methods for the previous two types are not necessarily suitable for predicting a real-quality value of a model, because the accuracy of detailed stereochemical properties (i.e., the first type) does not guarantee global structural similarity of a model to the native structure (Wroblewska and Skolnick 2007; Melo and Sali 2007). Interestingly, it has been demonstrated that structure models of a wrong fold can have good detailed stereochemistry (Wroblewska and Skolnick 2007). Moreover, the best-ranked model in a given pool, as identified by the second type of quality assessment method, is not guaranteed to reach a given RMSD, e.g., less than 3 Å. To predict a real value of model quality, the same structural characteristics used in model reranking methods could be employed. However, real-value quality prediction would be more difficult, because there is no rationale for most quality assessment measures to correlate well with the absolute value of the model quality. Exceptions would be terms that evaluate the target–template alignment, a typical one being the sequence identity between the target sequence to be modeled and a template protein. The relationship between the sequence similarity and the structural divergence of proteins is well established (Chothia and Lesk 1986; Wilson et al. 2000). Real-value quality assessment methods would be most useful for biologists who would like to use structure models in practice to aid their wet-lab experiments. If the accuracy of a model can be predicted, the model can be used for a purpose appropriate to its estimated accuracy. It is important to note that a low-resolution model is still useful for certain purposes (Baker and Sali 2001).
High-resolution models with an RMSD of 1–1.5 Å are useful for almost any application in which the tertiary structure of a protein can be useful, including the study of catalytic mechanisms of enzymes, structure-based protein engineering, and drug design. A model with an RMSD of around 4 Å is still useful, for example, for designing site-directed mutagenesis experiments (Wells et al. 2006; Skowronek et al. 2006), for chemical labeling, and for performing small-ligand-docking predictions (Wojciechowski and Skolnick 2002; Vakser 1996). If the fold of a model is expected to be correct (an RMSD of about 6 Å), the function of the protein could still be predicted using the predicted tertiary structure (Baker and Sali 2001; Skolnick et al. 2000; Hawkins and Kihara 2007; Kihara and Skolnick 2004). Therefore, it is important to establish real-value quality estimation methods, so that a model can be used wisely with knowledge of its limitations. In this chapter, we focus our review on quality assessment methods of this type, as the methods of the other two types have recently been thoroughly reviewed in another article (Kihara et al. 2009).

13.2 Overview of Quality Assessment Measures

We first give a brief overview of the structural characteristics used as scoring terms in quality assessment. For more details, please refer to the recent review (Kihara et al. 2009).

13.2.1 Physics-Based Score

One of the most straightforward ways of assessing model quality would be to employ a physics-based all-atom force field. Previous works range from using the molecular mechanics energy (Kmiecik et al. 2007) to free-energy computation of structure models, including the molecular mechanics–Poisson–Boltzmann surface area (MM-PBSA) free energy (Lee et al. 2001; Feig and Brooks III 2002); the MM-Generalized Born implicit solvation model with a surface area-dependent term (MM-GBSA) (Wroblewska and Skolnick 2007; Feig and Brooks III 2002); the explicit simulation/implicit solvent (ES/IS) method, which computes the solvation free energy from short molecular dynamics simulations with explicit solvent (Vorobjev and Hermans 2001); and the colony energy approach combined with MM-PBSA, which assesses conformational entropy by explicitly sampling the conformational space in the vicinity of a reference structure (Fogolari and Tosatto 2005). Wroblewska and Skolnick showed that reoptimization of the relative weights of the energy components of the AMBER force field yielded a significant improvement in the scoring and refinement of protein models (Wroblewska et al. 2008).

13.2.2 Knowledge-Based Potential

Alternatively, structure models can be evaluated using knowledge-based statistical potentials. A knowledge-based potential captures the preference for a certain structural property of atoms or amino acid residues in protein structures by counting the number of observed cases of the property and normalizing it by the expected number of counts. Many different types of knowledge-based potentials have been developed, including atom- or residue-contact potentials, main-chain torsion angles (Betancourt 2008; Tosatto and Battistutta 2007), atom/residue-level burial/exposure preference (Holm and Sander 1992), atom/residue-packing preference (Gregoret and Cohen 1991; Melo and Sali 2007), and the accessible surface area of atoms/residues (Melo and Feytmans 1997; McConkey et al. 2003; Melo et al. 2002). Atom- or residue-contact potentials are among the best studied. Pioneering work was carried out by Sippl, who used knowledge-based contact potentials to identify errors in protein crystal structures (Hendlich et al. 1990; Sippl 1993). In principle, knowledge-based atom-contact potentials are designed to evaluate structures with atomic-level accuracy, but they have been shown to perform well on predicted structures of moderate accuracy too (Melo and Feytmans 1997; Pettitt et al. 2005; McConkey et al. 2003). When models are not expected to have atomic-level accuracy, examining residue contacts can be advantageous for quality assessment (Melo et al. 2002). Verify3D assesses the structural environment of residues, which is defined by a combination of the secondary structure, burial status, and polarity of positions in a structure (Eisenberg et al. 1997).
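The counting scheme described above can be illustrated with a minimal Python sketch of a residue-contact potential. Everything specific here is an assumption made for illustration: the function name `contact_potential`, the random-mixing reference state derived from residue composition, the pseudocount, and the use of a negative logarithm to turn the observed/expected ratio into an energy-like score. It is one common construction, not the definition used by any particular method cited in this section.

```python
import math
from collections import Counter

def contact_potential(observed_pairs, residue_counts, pseudocount=1.0):
    """Return {(a, b): -ln(N_obs / N_exp)} for unordered residue-type pairs.

    observed_pairs : list of (type_a, type_b) contacts counted in known structures
    residue_counts : {type: number of residues of that type in the same structures}
    """
    total_contacts = len(observed_pairs)
    total_residues = sum(residue_counts.values())
    # Count contacts regardless of the order in which the two types are given.
    observed = Counter(tuple(sorted(pair)) for pair in observed_pairs)
    types = sorted(residue_counts)
    potential = {}
    for idx, a in enumerate(types):
        for b in types[idx:]:
            fa = residue_counts[a] / total_residues
            fb = residue_counts[b] / total_residues
            # Assumed reference state: contacts form at random given composition.
            pair_prob = fa * fb if a == b else 2 * fa * fb
            n_exp = pair_prob * total_contacts + pseudocount
            n_obs = observed[(a, b)] + pseudocount
            potential[(a, b)] = -math.log(n_obs / n_exp)
    return potential

# Hypothetical toy data: leucine-leucine contacts are over-represented.
contacts = [("L", "L")] * 6 + [("L", "K")] * 2 + [("K", "K")] * 2
pot = contact_potential(contacts, {"L": 5, "K": 5})
```

In this toy input, pairs observed more often than expected receive negative (favorable) scores and pairs observed less often receive positive scores, which is the behavior a model-scoring term relies on.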

300

D. Kihara et al.

13.2.3 Assessing Alignment Quality

In parallel to the structure-based terms, scores that assess the validity of the alignment between a target and a template are effective for template-based modeling. Most simply, the quality of a target–template alignment can be evaluated from raw alignment scores, such as the sequence identity, the Smith–Waterman alignment score, or a threading score between a target and a template (in the case of threading). More frequently, the statistical significance of a raw score is considered, e.g., the E-value in BLAST (Altschul et al. 1990) or the Z-score used in threading algorithms. The Z-score of a raw alignment score can also be obtained from a distribution of alignment scores of shuffled sequences (Pearson and Lipman 1988). Similarly, for predicting the quality of local regions, raw scores of local alignment regions can be used (Tress et al. 2003; Zhang et al. 1999; Lee et al. 2007; Tondel 2004). Another strategy for estimating the reliability of an alignment is to compare it explicitly with alternative alignments of the same pair of sequences, i.e., to consider its consistency with suboptimal alignments (Mevissen and Vingron 1996; Vingron and Argos 1990; Vingron 1996; Saqi and Sternberg 1991; Jaroszewski et al. 2002; John and Sali 2003). Instead of explicitly computing numerous suboptimal alignments, several other methods assign a probability (reliability) to each position in an alignment. Examples are methods that use the partition function to express the probability of alternative alignments (Zhang and Marr 1995; Schlosshauer and Ohlsson 2002; Miyazawa 1995; Koike et al. 2004) and methods that use hidden Markov models (Yu and Smith 1999; Cline et al. 2002). Recently, we have also developed a quality assessment score based on the consistency of a target–template alignment with suboptimal alignments, named SPAD (SuboPtimal Alignment Diversity) (Chen and Kihara 2008). SPAD quantifies the divergence of suboptimal alignments around the optimal alignment in the dynamic programming matrix. The SPAD score was shown to correlate significantly not only with alignment shift-level errors but also with global and local structural-level errors (i.e., the RMSD to the native structure and the distance between corresponding residues of a model and its native structure) of structure models built from optimal alignments. We explain SPAD in more detail in the next section.
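The shuffling-based Z-score mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the PRSS program: the scoring scheme (match, mismatch, and linear gap values), the choice to shuffle the target sequence, the number of shuffles, and the function names are all assumptions.

```python
import random
import statistics

def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty (toy scoring scheme)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffle_z_score(target, template, n_shuffles=200, seed=0):
    """Z-score of the raw score against scores of shuffled target sequences."""
    rng = random.Random(seed)
    raw = global_alignment_score(target, template)
    residues = list(target)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps composition, destroys residue order
        background.append(global_alignment_score("".join(residues), template))
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    return (raw - mu) / sigma if sigma > 0 else 0.0

# Hypothetical sequence, aligned to itself for a clearly significant score.
seq = "MKTAYIAKQRQISFVKSHFSRQ"
z_self = shuffle_z_score(seq, seq)
```

Because shuffling preserves amino acid composition, the Z-score measures how much of the raw score comes from residue order rather than from composition alone.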

Fig. 13.2 Examples of estimated global and local errors by the SPAD score. Predicted structures of two CASP7 targets produced by our group, Chen-Tan-Kihara, are shown. The left panels show the superimposition of the predicted (black) and the native (gray) structures of the target. The middle panels show the predicted (dotted line) and the actual (solid line) Cα distance error of the models. The right panels show the sausage representation of the models, where the radius of the tube is proportional to the estimated Cα distance error. (a) A model for T0367. The native structure is 2hsbA, which is 125 amino acids long. The model used 1ufbA as the template. The global RMSD of the model is predicted to be 4.3 Å, while the actual RMSD of the model to the native is 3.9 Å. (b) A model for T0378 (native: 2i6dA; length: 254 amino acids). The template of the model is 1x7oA. The actual/predicted RMSD of the model to the native is 3.6/3.9 Å

13.3 The SPAD Score

13.3.1 Definition of the SPAD Score

Figure 13.1 shows how the SPAD score is defined. When an alignment between a target sequence and a template structure is computed by the dynamic programming (DP) algorithm (Needleman and Wunsch 1970), the alignment is represented as a path in the DP matrix. In Fig. 13.1, the top-scoring (i.e., optimal) target–template alignment is represented as a thick path from the upper left corner to the lower right corner. Because each suboptimal alignment is also represented as its own path, the consistency of a position in the optimal alignment with the suboptimal alignments can be quantified by counting the number of grid cells between the paths. Thus, the local SPAD score of a position m in the optimal alignment (lSPAD_m) is defined as:

$$\mathrm{lSPAD}_m = \frac{1}{n}\sum_{i=1}^{n} \mathrm{lALD}_{im} \qquad (13.1)$$

where lALD_im is the distance at position m in the optimal alignment to suboptimal alignment i, which is defined as $\mathrm{lALD}_{im} = (d_i^V + d_i^H)/2$ (Fig. 13.1), and n is the number of suboptimal alignments considered. Averaging lSPAD_m over all the positions in the optimal alignment yields the global-level SPAD, gSPAD:

$$\mathrm{gSPAD} = \frac{1}{l}\sum_{m=1}^{l} \mathrm{lSPAD}_m \qquad (13.2)$$

where l is the length of the optimal alignment. Suboptimal alignments are computed by the algorithm proposed by Vingron and Argos (1990). In their algorithm, the maximal number of possible suboptimal alignments of a pair of sequences of lengths M and N is M × N, because for each cell in the DP matrix (i.e., each pair of residues from the two sequences), it computes the optimal alignment that goes through the cell. The number of suboptimal alignments considered, n, is set to n = 0.1 × M × N (i.e., the top 10% highest-scoring suboptimal alignments). A profile–profile alignment (Wang and Dunbrack Jr 2004), which uses a profile-matching score and a secondary structure-matching score, is employed to compute the optimal and suboptimal alignments of a target and a template (Chen and Kihara 2008).
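To make Eqs. (13.1) and (13.2) concrete, the following Python sketch computes local and global SPAD values from alignment paths. The representation of a path as a list of (row, column) cells, the function names, and in particular the way the two distances are measured (nearest cell of the suboptimal path in the same column for the vertical distance, and in the same row for the horizontal distance) are assumptions for illustration; the exact geometric definition is given in Fig. 13.1 and Chen and Kihara (2008). The sketch also assumes global paths, so that every row and column visited by the optimal path is visited by each suboptimal path.

```python
def local_ald(cell, suboptimal_path):
    """Assumed lALD: mean of the vertical and horizontal offsets to another path."""
    i, j = cell
    # Vertical offset: nearest cell of the suboptimal path in the same column.
    d_v = min(abs(si - i) for si, sj in suboptimal_path if sj == j)
    # Horizontal offset: nearest cell of the suboptimal path in the same row.
    d_h = min(abs(sj - j) for si, sj in suboptimal_path if si == i)
    return (d_v + d_h) / 2

def local_spad(optimal_path, suboptimal_paths):
    """Eq. (13.1): average lALD over the n suboptimal paths, per position m."""
    n = len(suboptimal_paths)
    return [sum(local_ald(cell, path) for path in suboptimal_paths) / n
            for cell in optimal_path]

def global_spad(optimal_path, suboptimal_paths):
    """Eq. (13.2): average of lSPAD over the l positions of the optimal path."""
    scores = local_spad(optimal_path, suboptimal_paths)
    return sum(scores) / len(scores)

# Hypothetical 4 x 4 example: one shifted path and one identical to the optimum.
optimal = [(0, 0), (1, 1), (2, 2), (3, 3)]
shifted = [(0, 0), (1, 0), (2, 1), (3, 2), (3, 3)]
l_spad = local_spad(optimal, [shifted, optimal])   # [0.0, 0.5, 0.5, 0.0]
g_spad = global_spad(optimal, [shifted, optimal])  # 0.25
```

Positions where the suboptimal paths coincide with the optimal path contribute zero, so regions with stable alignments receive low local SPAD values.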

13.3.2 Correlation of SPAD to RMSD of Models

We prepared a large dataset of 5,232 template-based structure models and examined how well SPAD correlates with the global and local quality of the models. Models of varying quality were obtained by using template structures at three levels of similarity to the target proteins, i.e., templates in the same family as the target, the same superfamily, and the same fold. Target–template alignments were computed by the in-house profile–profile alignment program mentioned above, and MODELLER (Eswar et al. 2008) was used to construct the tertiary structure of each model from its alignment. The correlation coefficient computed for log–log plots of the RMSD of the models against SPAD (Eq. (13.2)) is 0.598, 0.630, and 0.384 for the models using templates of family-, superfamily-, and fold-level similarity, respectively (refer to Table 13.2A in Chen and Kihara (2008)). These correlation coefficients are much higher than those of the other sequence alignment-based measures examined, namely the sequence identity, the threading Z-score, and the Z-score by PRSS (an alignment shuffling program) (Pearson and Lipman 1988). Moreover, interestingly, for the models constructed with templates of family- and superfamily-level similarity, SPAD correlates better with the RMSD than does the discrete optimized protein energy (DOPE) score (normalized by the number of atom contacts), the target function of MODELLER. The correlation of the normalized DOPE score to the RMSD is 0.453, 0.617, and 0.587 for the family-, superfamily-, and fold-level template models, respectively. This implies that the alignment quality has a dominant influence on the final structural quality in the structure modeling process of MODELLER.

13.3.3 Correlation to the Local Quality of Models

We also examined the correlation of the local SPAD score (Eq. (13.1)) to the local quality of models, which is defined as the Euclidean distance between corresponding Cα atoms of the model and the native structure when they are globally superimposed. The correlation is 0.565, 0.509, and 0.277 for models built from templates of family-, superfamily-, and fold-level similarity, respectively. These values are lower than the correlation of the global SPAD to the RMSD but much higher than those of other simple local alignment-based scores, such as the conservation in multiple sequence alignments, the ratio of gaps at the alignment position, and the average BLOSUM score of the alignment position (Table 13.2B in Chen and Kihara (2008)). Based on the linear correlation observed in the log–log plots of the SPAD score against the RMSD and against the local Cα distance error, two equations are obtained:

$$\mathrm{RMSD} = \exp\bigl(0.3576 \times \ln(\mathrm{gSPAD}) + 1.882\bigr) \qquad (13.3)$$

$$\mathrm{C}\alpha\ \text{distance error} = \exp\bigl(0.3294 \times \ln(\mathrm{lSPAD}) + 1.645\bigr) \qquad (13.4)$$

Figure 13.2 shows two examples of actual predictions of the global RMSD and the local Cα distance error of structure models using the two equations. These two structure models were predicted for targets T0367 and T0378 in CASP7. The in-house profile-based threading program was used to find the template and make the alignment, and MODELLER was then run to build the structural model. In these examples, the global RMSD of the models is predicted quite well. The predicted Cα distance error (the middle panels) captures the poorly predicted regions, although the absolute values of the predicted and actual errors do not agree well in some regions. The right panels show sausage representations of the models, which intuitively convey the predicted Cα distance error as well as the overall fold of the models.
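Equations (13.3) and (13.4) translate directly into code. The short Python sketch below uses the coefficients given above; the function names and the handling of a SPAD value of zero (returned as a zero error, the mathematical limit of the expression) are assumptions added for illustration.

```python
import math

def predict_global_rmsd(g_spad):
    """Eq. (13.3): predicted global RMSD (Å) from the global SPAD score."""
    if g_spad < 0:
        raise ValueError("SPAD is an average of distances and cannot be negative")
    if g_spad == 0:
        return 0.0  # assumed limit: exp(a * ln(x) + b) -> 0 as x -> 0
    return math.exp(0.3576 * math.log(g_spad) + 1.882)

def predict_local_error(l_spad):
    """Eq. (13.4): predicted Cα distance error (Å) from the local SPAD score."""
    if l_spad < 0:
        raise ValueError("SPAD is an average of distances and cannot be negative")
    if l_spad == 0:
        return 0.0  # assumed limit, as above
    return math.exp(0.3294 * math.log(l_spad) + 1.645)

rmsd_at_one = predict_global_rmsd(1.0)   # equals exp(1.882)
error_at_one = predict_local_error(1.0)  # equals exp(1.645)
```

Both fits are power laws in SPAD, since exp(a ln x + b) = e^b x^a. At a SPAD value of 1 the logarithmic term vanishes, so the predictions reduce to exp(1.882), about 6.6 Å, and exp(1.645), about 5.2 Å.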

13.4 Real-Value Quality Assessment of Structure Models

In this section, we overview three quality assessment methods that predict real-value quality. Finally, we introduce a method we recently developed, named SubAqua (Suboptimal Alignment-based quality assessment method), which uses the SPAD score as a main component of its scoring terms.

13.4.1 Tondel’s Method

Tondel showed that the global RMSD and the total residue contact area of homology models can be predicted well by a regression model that combines sequence alignment-based scores between a target and a template (Tondel 2004). The scores employed are the sequence identity, the number of non-aligned residues in the target–template alignment, and an amino acid similarity score (PAM250) for each position in the alignment. The method was tested on a set of homology models of kinases, whose RMSD to the native varies over a relatively small range, from 1 to 7 Å. Since the score of each alignment position is used as a variable in the regression model, this method can be applied only to the specific protein family for which the regression model was built. However, as the author discusses, it is possible to construct a regression model for each major protein family.

13.4.2 ProQ

ProQ uses a neural network to predict the real values of two scores of structural similarity to the native structure, the LGscore (Cristobal et al. 2001) and the MaxSub score (Siew et al. 2000) (Wallner and Elofsson 2006). The authors found that combining different types of quality assessment measures improves the accuracy of correct fold prediction. They combined seven terms of different natures, ranging from coarse-grained to atomic-detailed structural features of a model. Two terms capture overall global features of a model: the fraction of the protein modeled and the fatness, which is the ratio of the longest to the shortest axis when the structure is fit to an ellipsoid. Two other terms describe main-chain-level features: the agreement of the model with the template structure as measured by LGscore or MaxSub, and the agreement between the actual and predicted secondary structure of the model. In addition, two residue-level features are included: the fraction of residues in four different bins of accessible surface area and the fraction of residue–residue contacts classified into four categories. Finally, an atomic-detailed structural feature is captured by the fraction of atom–atom contact types observed in the model.

13.4.3 TVSMod

Eramian et al. developed a method named TVSMod, which predicts the RMSD as well as the number of correctly predicted residues in a model (residues located within 3.5 Å of the corresponding residues in the native structure, named No3.5 Å) using support vector machine (SVM) regression that combines up to nine alignment and structure features (Eramian et al. 2008). These features are the sequence identity between a target and a template; the percentage of gaps in the target–template alignment; a distance-dependent residue-contact potential; a distance-dependent atom-contact potential (DOPE); a residue-based accessible surface statistical potential score; a composite score named GA341, which combines residue-level contact and accessibility scores, a score measuring structural compactness, and the sequence identity (John and Sali 2003); and agreement scores between the predicted and actual secondary structure of the model. The residue- and atom-contact potential values and the composite score are normalized by computing a Z-score relative to the score distribution of 200 random sequences with the same amino acid composition and the same structure as the query model. Consistent with the ProQ paper (Wallner and Elofsson 2006), they reported that combining these scores improves the accuracy, although no individual score correlates strongly with the RMSD or No3.5 Å.


The SVM is trained in a model-specific fashion; that is, it is developed using structure models of similar size and secondary structure content to the query model. Interestingly, for each input structure model to be evaluated, the SVM is trained on the fly using a large database of 5,790,899 template-based models of known quality. First, if the aligned target and template sequences share more than 85% identity, the system simply predicts an RMSD of 0.5 Å and a No3.5 Å of 1.0 without taking any further steps. If the target and template are not so closely related, the model database is scanned to find all examples in which the same region of the template was used either as a template or as the target sequence. A region of a model in the database is considered equivalent to a region in the query model if the starting and the ending residues are each within ten residues and the difference is within 10% of the length of the query. Subsequently, the selected models are filtered further by the similarity of their score values for the residue- and atom-contact potentials, the residue-based accessible surface area potential, and the composite score for the distance and surface potential score. Finally, the SVM is trained on the tailored training dataset. The authors reported a high correlation coefficient of 0.84 between the predicted and actual RMSD and 0.86 between the predicted and actual No3.5 Å. The advantage of using a query-tailored training dataset is that the structure-based potentials, e.g., the contact potentials, correlate better with the model-quality values.
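The region-equivalence rule used to assemble the query-tailored training set can be expressed as a small predicate. The Python sketch below is only one reading of the rule as summarized above: it assumes that "the difference" refers to the difference in region length, and the function name, the (start, end) tuple representation, and the example coordinates are hypothetical. The actual TVSMod implementation may differ.

```python
def regions_equivalent(query, candidate, max_shift=10, max_len_frac=0.10):
    """Decide whether a database region matches the query region.

    query, candidate : (start, end) residue positions on the same template.
    Assumed reading: both end points lie within `max_shift` residues, and the
    region lengths differ by at most `max_len_frac` of the query length.
    """
    q_start, q_end = query
    c_start, c_end = candidate
    q_len = q_end - q_start + 1
    c_len = c_end - c_start + 1
    return (abs(q_start - c_start) <= max_shift
            and abs(q_end - c_end) <= max_shift
            and abs(q_len - c_len) <= max_len_frac * q_len)

def select_training_regions(query, database_regions):
    """Keep only database regions equivalent to the query region."""
    return [region for region in database_regions
            if regions_equivalent(query, region)]

# Hypothetical regions: only the first candidate satisfies all three conditions.
selected = select_training_regions((10, 110), [(15, 112), (25, 112), (10, 80)])
```

A filter of this kind narrows the training data to models built on the same template region as the query, which is what allows raw potential values to be compared across models.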

13.4.4 The SubAqua Method

Recently, we extended the idea of using suboptimal alignments for model-quality assessment to develop a better quality assessment method, named SubAqua (Suboptimal Alignment-based quality assessment method) (Yang et al. 2009). The web server is available at http://kiharalab.org/SubAqua/. SubAqua combines the SPAD score introduced above (Chen and Kihara 2008) with other structure-based scoring terms. It predicts both a global and a local real-value quality measure: the RMSD of a model to its native structure, and the error of the Cα positions relative to the corresponding residues when the model is superimposed on the native structure. Since SubAqua uses suboptimal alignments as one of its scoring terms, it is best suited to evaluating template-based models. Below we briefly outline the work.

13.4.4.1 Correlation of Quality Assessment Terms to RMSD

We first prepared a large dataset of template-based models of varying quality, whose RMSD ranges from around 1 to 20 Å. Pairs of proteins that share family-, superfamily-, and fold-level similarity were selected from the dataset of Lindahl and Elofsson (Lindahl and Elofsson 2000), which resulted in 1,076, 1,395, and 2,761 pairs, respectively. Optimal alignments of the pairs were computed using a profile–profile alignment (Wang and Dunbrack Jr 2004) and then fed to MODELLER (Eswar et al. 2008) to build the tertiary structure models.


Using the dataset of template-based models, we first examined more than a dozen quality assessment measures in terms of their correlation coefficient to the RMSD of the models (Table 13.1). Five different types of measures are examined: those based on target–template alignments, those describing the overall model fold, those concerning local residue environments, those evaluating the stereochemistry of atoms, and composite model-quality assessment scores. Individual scores are explained in the table legend. Roughly consistent with the observation by Eramian et al. (2008), none of the individual scores by itself correlates strongly with the RMSD of the models in this diverse dataset. Among the measures examined, the alignment-based scores have relatively high correlations of over 0.5, which implies that the modeling procedure (by MODELLER) depends strongly on the quality of the target–template alignments. SPAD shows a better correlation when log-transformed. This may be because the possible alternative alignments diverge rapidly as the sequence similarity drops. We also found that normalizing the Verify3D score by the model length (L) or by the square of the length (L²) improves the correlation, and that taking the logarithm of Verify3D/L² improves the correlation further, to 0.53. Normalization by the length is expected to improve the correlation, since Verify3D is a sum of residue-based scores. Normalizing by L² may improve the correlation further because Verify3D implicitly takes residue contacts into account as the environment of a residue. The other residue-level and atomic-level scores have insignificant correlation to the RMSD. ProQ also shows insignificant correlation to the RMSD, which is consistent with Table 13.1 in Eramian et al., which reports correlation coefficients of 0.57 and 0.44 for ProQ-MX and ProQ-LG (Eramian et al. 2008).

13.4.4.2 Variable Selection for Constructing Regression Models

Next, we employed the forward stepwise variable selection procedure to build a linear regression model with a meaningful subset of the variables in Table 13.1. In this procedure, variables are added to the regression model sequentially until adding more variables makes no significant contribution. The contribution of a variable to a regression model is measured by the partial R², which indicates how much more of the variance (R²) is explained by adding the variable. Two variables, log(SPAD) and log(Verify3D/L²), are selected in this order as the variables contributing most to the prediction of the global RMSD, giving a model R² value of 0.586 (Table 13.2). The remaining variables, from the PRSS Z-score to the normalized DOPE, are selected with statistical significance, but their contribution to the model as judged by the partial R² is marginal. Therefore, we decided to use only the two variables, and the resulting linear regression model is as follows:

$$\mathrm{RMSD} = -4.99 + 2.25 \times \log(\mathrm{SPAD}) - 2.17 \times \log(\mathrm{Verify3D}/L^2) \qquad (13.5)$$

Figure 13.3 shows the relationship between the actual and predicted RMSDs of the structure models in the entire dataset. The correlation coefficient is 0.77.
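The regression of Eq. (13.5) is straightforward to evaluate in code. In the Python sketch below, the function name and the example inputs are hypothetical, and the logarithms are assumed to be natural logarithms; the chapter writes "log" without stating the base, so this should be checked against Yang et al. (2009) before any quantitative use. The last line checks the reported numbers against each other: for a least-squares linear model, the correlation between predicted and actual values equals the square root of the model R².

```python
import math

def predict_rmsd(spad, verify3d, length):
    """Eq. (13.5): predicted global RMSD (Å) of a template-based model.

    spad     : global SPAD score of the target-template alignment (> 0)
    verify3d : raw Verify3D score of the model (> 0)
    length   : number of residues in the model
    Natural logarithms are an assumption; the source does not state the base.
    """
    return (-4.99
            + 2.25 * math.log(spad)
            - 2.17 * math.log(verify3d / length ** 2))

# Hypothetical inputs, chosen only to show the direction of each term.
base = predict_rmsd(spad=0.5, verify3d=40.0, length=100)
more_divergent = predict_rmsd(spad=2.0, verify3d=40.0, length=100)
better_environment = predict_rmsd(spad=0.5, verify3d=60.0, length=100)

# Model R^2 of 0.586 (Table 13.2) corresponds to r = sqrt(0.586), about 0.77.
implied_correlation = math.sqrt(0.586)
```

A larger SPAD (less stable alignment) raises the predicted RMSD, whereas a larger Verify3D score lowers it, matching the signs of the two coefficients.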


Table 13.1 Correlation coefficient to the global RMSD of structure models

| Type of variable | Variable | Correlation coefficient |
|---|---|---|
| Alignment | Seq. identity | 0.58* |
| | Length | 0.24 |
| | PRSS Z-score | 0.63* |
| | SPAD | 0.55* |
| | log(SPAD) | 0.71* |
| Overall fold | Compactness (Sc) | 0.19 |
| Residue environment | ERRAT | 0.35 |
| | TAP | 0.34 |
| | Verify3D | 0.09 |
| | Verify3D/L | 0.45 |
| | Verify3D/L² | 0.46 |
| | log(Verify3D/L²) | 0.53* |
| Atom environment | DOPE/Nh² | 0.29 |
| | ANOLEA1 | 0.29 |
| | ANOLEA2 | 0.02 |
| | PROCHECK1 | 0.05 |
| | PROCHECK2 | 0.16 |
| Quality assessment scores | ProQ-MX | 0.39 |
| | ProQ-LG | 0.31 |
| | GA341 | 0.17 |
| | pG | 0.01 |

Absolute values of the correlation coefficients are shown. Scores with a correlation coefficient of over 0.5 (highlighted in gray in the original table) are marked with an asterisk. The alignment-based scores are the sequence identity between the target and the template sequences; the number of residues of the model; the Z-score of the alignment score computed from a score distribution of shuffled sequences by PRSS; and the SPAD score, which considers the consistency of the target–template alignment relative to suboptimal alignments. The compactness is defined as the total volume of the amino acids relative to that of a sphere whose diameter is the maximum distance between residue pairs (Melo and Sali 2007). Of the three residue-level scores, ERRAT evaluates the number of non-bonded interactions between heavy atoms (Colovos and Yeates 1993); TAP evaluates the torsion angle propensity of amino acids (Tosatto and Battistutta 2007); and Verify3D assesses the fitness of each residue in a model to its structural environment, defined by the total/polar burial area and the secondary structure (Luthy et al. 1992). In addition to the raw Verify3D score, three variations of the score normalized by the model length (L) are also examined. Atomic-detailed structure is examined by four measures: DOPE is an atom distance-dependent statistical potential (Shen and Sali 2006). Here we normalized DOPE by the square of the number of heavy atoms (Nh) in a model. ANOLEA1 and ANOLEA2 are outputs of the ANOLEA method, which evaluates atom contact and accessible surface propensity (Melo and Feytmans 1997). PROCHECK1 is the percentage of residues in the disallowed region of the Ramachandran plot and PROCHECK2 is the percentage of residues with bad contacts, both of which come from the PROCHECK program (Laskowski et al. 1993). ProQ-MX and ProQ-LG are the predicted MaxSub and LG scores, respectively, from the neural network-based program ProQ (Wallner and Elofsson 2006). GA341 combines atom contact and solvent accessibility potentials, the sequence identity, and the compactness score (Sc), and pG is the predicted probability that a model has a correct fold, obtained from a Bayesian classifier that uses GA341 and the model length (Melo et al. 2002)

Table 13.2 Variable selection for linear regression model

| Step | Variable | Partial R² | Model R² | P (F)ᵃ |
|---|---|---|---|---|
| 1 | log(SPAD) | 0.499 | 0.499 | <0.001 |
| 2 | log(Verify3D/L²) | 0.087 | 0.586 | <0.001 |
| 3 | PRSS Z-score | 0.014 | 0.600 | <0.001 |
| 4 | ERRAT | 0.010 | 0.610 | <0.001 |
| 5 | Sc | 0.007 | 0.617 | <0.001 |
| 6 | ProQ-MX | 0.004 | 0.622 | 0.0001 |
| 7 | DOPE/Nh² | 0.003 | 0.624 | 0.0015 |

ᵃ The p-value of the F-value is another statistical metric showing the significance of the contribution of the variable to the regression model. A smaller p-value means a more significant contribution.

Fig. 13.3 Predicted and actual RMSD values of template-based models. Equation (13.5) is used to predict RMSD

We have also employed logistic regression to predict whether a model’s RMSD is smaller than a certain value (i.e., a “correct” structure) or not (i.e., an “incorrect” structure). We used 6.0 Å as the threshold value of the RMSD. Forward variable selection was used with the same set of candidate variables (shown in Table 13.1). Again, the same two variables, log(SPAD) and log(Verify3D/L²), are selected as the most significant variables:

$$\log\left(\frac{p}{1-p}\right) = -7.93 + 1.62 \times \log(\mathrm{SPAD}) - 1.48 \times \log(\mathrm{Verify3D}/L^2) \qquad (13.6)$$

where p is the probability that a model is correct (i.e., an RMSD of below 6.0 Å). The model dataset has 1,843 correct models and 3,389 incorrect models. Using the logistic regression model (Eq. (13.6)), 4,376 models (83.64%) are correctly classified as either correct or incorrect.

We conclude the following from these results. First, as also observed in other related studies, no single term is sufficient by itself to predict the global RMSD of models (Table 13.1), but combining terms improves the prediction of the RMSD (Table 13.2). Among the quality assessment terms tested, those that evaluate the quality of target–template alignments correlate better with the quality of the models than atom- or residue-based structural terms do (Table 13.1). Moreover, the forward stepwise variable selection procedure identified two variables as the most significant contributors to the linear regression, namely log(SPAD) and log(Verify3D/L²), both of which evaluate coarse-grained features of models (Table 13.2). Adding more structure-based terms, including those that evaluate atomic-detailed structures (ERRAT and DOPE), improves the linear regression model, but their contribution is marginal (Table 13.2). We extend this discussion later when summarizing the chapter.

13.4.4.3 Two-Step Procedure to Predict Local Quality

Next, we developed a regression model for predicting the local quality of models, i.e., the Cα distance between corresponding residues of a model and its native structure. The individual scoring terms examined do not correlate significantly with the Cα distance (Table 13.3), and regression models constructed by combining these scores do not show sufficient correlation either. However, we find that the prediction performance improves significantly when the predicted global RMSD (Eq. (13.5)) is integrated. The predicted global RMSD correlates relatively well with the Cα distance, with a correlation coefficient of 0.44. Therefore, we designed a two-step procedure that integrates the predicted global RMSD into the prediction of the local quality of individual residues. First, given a protein structure model, the SubAqua method predicts the global RMSD using Eq. (13.5). Then, the Cα distance is predicted by a linear regression combining the predicted global RMSD, the gap ratio, and log(localSPAD/SPAD + 1). These three variables are selected by the forward stepwise variable selection. The resulting regression is as follows:

$$\mathrm{C}\alpha\ \text{distance} = -4.04 + 0.94 \times (\text{predicted RMSD}) + 16.59 \times (\text{gap ratio}) + 3.55 \times \log(\mathrm{localSPAD}/\mathrm{SPAD} + 1) \qquad (13.7)$$

The gap ratio is the fraction of gaps at the residue position in the multiple sequence alignment, and the localSPAD is a measure of the divergence of suboptimal alignments at a local position, expressed by Eq. (13.1). The Cα distance predicted by the regression (Eq. (13.7)) has a correlation coefficient of 0.51 to the actual Cα distance. Although this correlation coefficient is not very large, it does improve on that of the predicted global RMSD alone (0.44).

Table 13.3 Variable selection for linear regression model for local quality prediction

| Step | Variable | Partial R² | Model R² | P (F) | Corr. coeff.ᵃ |
|---|---|---|---|---|---|
| 1 | Predicted global RMSD | 0.1940 | 0.1940 | <0.0001 | 0.44 |
| 2 | Gap ratio | 0.0453 | 0.2393 | <0.0001 | 0.24 |
| 3 | log(localSPAD/SPAD + 1) | 0.0229 | 0.2622 | <0.0001 | 0.21 |
| 4 | local Verify3D | 0.0066 | 0.2688 | <0.0001 | 0.32 |
| 5 | log(localVerify3Dpositive² + 1)ᵇ | 0.0048 | 0.2736 | <0.0001 | –0.14 |
| 6 | Conservationᶜ | 0.0038 | 0.2774 | <0.0001 | –0.29 |
| – | Mutation scoreᵈ | – | – | – | –0.32 |
| – | Local ERRAT | – | – | – | 0.19 |

ᵃ The correlation coefficient against the Cα distance.
ᵇ localVerify3Dpositive is a non-negative local Verify3D score assigned to each residue; negative localVerify3D scores are replaced with 0.
ᶜ The conservation is the fraction of the most abundant residue at the position in the multiple sequence alignment of the target protein used for alignment with the template.
ᵈ Average BLOSUM45 score of a column in the profile.
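The two-step procedure can be sketched by chaining Eqs. (13.5) and (13.7). The Python code below is a minimal illustration, not the SubAqua implementation: the function names and input values are hypothetical, natural logarithms are assumed because the base is not stated, and the last term of Eq. (13.7) is read as log((localSPAD/SPAD) + 1), so that a position with zero local SPAD contributes nothing.

```python
import math

def predict_rmsd(spad, verify3d, length):
    """Step 1, Eq. (13.5): global RMSD (Å); natural log is an assumption."""
    return (-4.99
            + 2.25 * math.log(spad)
            - 2.17 * math.log(verify3d / length ** 2))

def predict_ca_distance(pred_rmsd, gap_ratio, local_spad, global_spad):
    """Step 2, Eq. (13.7): predicted Cα distance (Å) for one residue."""
    # Assumed reading of the last term: log((localSPAD / SPAD) + 1).
    return (-4.04
            + 0.94 * pred_rmsd
            + 16.59 * gap_ratio
            + 3.55 * math.log(local_spad / global_spad + 1))

def two_step_local_quality(spad, verify3d, length, residues):
    """Predict the global RMSD first, then reuse it for every residue.

    residues : iterable of (gap_ratio, local_spad) pairs, one per position.
    """
    rmsd = predict_rmsd(spad, verify3d, length)
    distances = [predict_ca_distance(rmsd, gap, l_spad, spad)
                 for gap, l_spad in residues]
    return rmsd, distances

# Hypothetical model: a well-aligned position and a gapped, divergent one.
rmsd, distances = two_step_local_quality(
    spad=0.5, verify3d=40.0, length=100,
    residues=[(0.0, 0.0), (0.2, 0.5)],
)
```

Because the predicted global RMSD enters every residue's estimate with the same weight, it sets the baseline error level of the whole model, and the gap ratio and local SPAD then raise the estimate at poorly aligned positions.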

13.5 Summary

In this chapter, we introduced methods for quality assessment of protein models that predict real-value estimates of error. Quality assessment of protein structure models has recently been studied increasingly in the context of model selection (reranking) from a pool of predicted models, which is an important post-processing step for ab initio structure prediction. What we addressed here are methods that are closely related but serve a different purpose: real-value error estimation of a single protein structure model, which is crucial information for applying structure models to practical biological problems. Estimating real-value error is probably more difficult than reranking models, since most metrics, especially knowledge-based or physical potentials, have no clear rationale for indicating the degree of real-value global or local error in a model. Probably the only exceptions are sequence alignment-based scores, including the classical sequence identity between the target and template, which are empirically known to correlate significantly with the RMSD of structures (Chothia and Lesk 1986).

Here, we overviewed four methods for real-value quality assessment. We can learn the following from these methods. First, as expected, sequence alignment-based scores are useful for predicting real-value errors (Tondel, TVSMod, SubAqua). Moreover, metrics that evaluate the significance of an alignment in comparison with alternative alignments (the PRSS Z-score, the SPAD score) correlate better with the RMSD than the simple sequence identity between a target and a template. Second, structure-based terms in general, particularly those that examine atomic-detailed structure, do not correlate strongly with real-value errors by themselves, but combinations of terms can improve the correlation (ProQ, TVSMod, SubAqua). Third, to make structure-based terms correlate better with the global RMSD, Tondel and Eramian et al. (TVSMod) presented the very interesting idea of constructing a query-dependent dataset for training the parameters of the prediction algorithm (Tondel used regression analysis and TVSMod is based on SVM). Obviously, values of knowledge-based or physics-based potentials computed for different proteins cannot be compared directly, since the systems are different. Thus, in order to use such potential values for predicting real-value quality, some pre-processing (e.g., normalization) is needed. The idea of the query-dependent dataset is to train the prediction algorithm only on models of known quality that use the same templates as the query model, so that the potential values correlate as well as would be expected for decoys of the same protein. Fourth, the predicted global RMSD of a model is a useful variable for predicting the local quality at each residue position (SubAqua). Thus, predicting quality in multiple steps, from global to local or from coarse-grained to finer-grained, seems to be a valid strategy.

The tertiary structure of proteins provides crucial information for elucidating the function of proteins and its mechanism. Computational structure models are expected to serve biological research in the same way, but this is possible only when the errors of the models are well estimated. When the absolute value and location of the error in a model are known, the model can be used effectively for practical purposes appropriate to the estimated error range. Thus, real-value error estimation is key to bridging structure prediction and practical application, and thereby capitalizes on the tremendous efforts made in past years to develop protein structure prediction methods.
Acknowledgments This work was supported in part by grants from the National Institute of General Medical Sciences of the National Institutes of Health (R01GM075004 and U24GM077905) and from the National Science Foundation (IIS0915801, DMS0800568 and EF0850009).

References Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410 Ashworth J, Havranek JJ, Duarte CM, Sussman D, Monnat RJ Jr, Stoddard BL et al (2006) Computational redesign of endonuclease DNA binding and cleavage specificity. Nature 441:656–659 Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294:93–96 Betancourt MR (2008) Knowledge-based potential for the polypeptide backbone. J Phys Chem B 112 5058–5069 Betancourt MR, Skolnick J (2001) Finding the needle in a haystack: educing native folds from ambiguous ab initio protein structure predictions. J Comput Chem 22:339–353 Bhattacharya A, Wunderlich Z, Monleon D, Tejero R, Montelione GT (2008) Assessing model accuracy using the homology modeling automatically software. Proteins 70:105–118 Chen H, Kihara D (2008) Estimating quality of template-based protein models by alignment stability. Proteins 71:1255–1274 Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826


D. Kihara et al.

Cline M, Hughey R, Karplus K (2002) Predicting reliable regions in protein sequence alignments. Bioinformatics 18:306–314 Colovos C, Yeates TO (1993) Verification of protein structures: patterns of nonbonded atomic interactions. Protein Sci 2:1511–1519 Cristobal S, Zemla A, Fischer D, Rychlewski L, Elofsson A (2001) A study of quality measures for protein threading models. BMC Bioinform 2:5 Davis IW, Murray LW, Richardson JS, Richardson DC (2004) MOLPROBITY: structure validation and all-atom contact analysis for nucleic acids and their complexes. Nucleic Acids Res 32:W615–W619 Eisenberg D, Luthy R, Bowie JU (1997) VERIFY3D: assessment of protein models with threedimensional profiles. Methods Enzymol 277:396–404 Eramian D, Eswar N, Shen MY, Sali A (2008) How well can the accuracy of comparative protein structure models be predicted? Protein Sci 17:1881–1893 Eswar N, Eramian D, Webb B, Shen MY, Sali A (2008) Protein structure modeling with MODELLER. Methods Mol Biol 426:145–159 Feig M, Brooks CL III (2002) Evaluating CASP4 predictions with physical energy functions. Proteins 49:232–245 Fogolari F, Tosatto SC (2005) Application of MM/PBSA colony free energy to loop decoy discrimination: toward correlation between energy and root mean square deviation. Protein Sci 14:889–901 Gregoret LM, Cohen FE (1991) Protein folding. Effect of packing density on chain conformation. J Mol Biol 219:109–122 Hawkins T, Kihara D (2007) Function prediction of uncharacterized proteins. J Bioinform Comput Biol 5:1–30 Hendlich M, Lackner P, Weitckus S, Floeckner H, Froschauer R, Gottsbacher K et al (1990) Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J Mol Biol 216: 167–180 Holm L, Sander C (1992) Evaluation of protein models by atomic solvation preference. J Mol Biol 225:93–105 Hooft RW, Vriend G, Sander C, Abola EE (1996) Errors in protein structures. 
Nature 381:272 Jaroszewski L, Li W, Godzik A (2002) In search for more accurate alignments in the twilight zone. Protein Sci 11:1702–1713 Jauch R, Yeo HC, Kolatkar PR, Clarke ND (2007) Assessment of CASP7 structure predictions for template free targets. Proteins (S 8) 69:57–67 John B, Sali A (2003) Comparative protein structure modeling by iterative alignment, model building and model assessment. Nuc Acid Res 31: 3982–3992. Kihara D, Chen H, Yang YD (2009) Quality assessment of computational protein models. Curr Protein Pept Sci 10:216–228 Kihara D, Skolnick J (2004) Microbial genomes have over 72% structure assignment by the threading algorithm PROSPECTOR_Q. Proteins 55:464–473 Kmiecik S, Gront D, Kolinski A (2007) Towards the high-resolution protein structure prediction. Fast refinement of reduced models with all-atom force field. BMC Struct Biol 7:43 Koike R, Kinoshita K, Kidera A (2004) Probabilistic description of protein alignments for sequences and structures. Proteins 56:157–166 Kolinski A, Bujnicki JM (2005) Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins (S 7) 61:84–90 Kopp J, Bordoli L, Battey JN, Kiefer F, Schwede T (2007) Assessment of CASP7 predictions for template-based modeling targets. Proteins (S 8) 69:38–56 Laskowski RA, MacArthur MW, Moss DS, Thornton JM (1993) Procheck – A program to check the stereochemical quality of protein structures. J Appl Crystallogr 26:283–291 Lee M, Jeong CS, Kim D (2007) Predicting and improving the protein sequence alignment quality by support vector regression. BMC Bioinform 8:471


Lee MR, Tsai J, Baker D, Kollman PA (2001) Molecular dynamics in the endgame of protein structure prediction. J Mol Biol 313:417–430 Lindahl E, Elofsson A (2000) Identification of related proteins on family superfamily and fold level. J Mol Biol 295:613–25 Luthy R, Bowie JU, Eisenberg D (1992) Assessment of protein models with three-dimensional profiles. Nature 356:83–85 McConkey BJ, Sobolev V, Edelman M (2003) Discrimination of native protein structures using atom–atom contact scoring. Proc Natl Acad Sci USA 100:3215–3220 Melo F, Feytmans E (1997) Novel knowledge-based mean force potential at atomic level. J Mol Biol 267:207–222 Melo F, Sali A (2007) Fold assessment for comparative protein structure modeling. Protein Sci 16:2412–2426 Melo F, Sanchez R, Sali A (2002) Statistical potentials for fold assessment. Protein Sci 11:430–448 Mevissen HT, Vingron M (1996) Quantifying the local reliability of a sequence alignment. Protein Eng 9:127–132 Miyazawa S (1995) A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng 8:999–1009 Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453 Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448 Pettitt CS, McGuffin LJ, Jones DT (2005) Improving sequence-based fold recognition by using 3D model quality assessment. Bioinformatics 21:3509–3515 Pontius J, Richelle J, Wodak SJ (1996) Deviations from standard atomic volumes as a quality measure for protein crystal structures. J Mol Biol 264:121–136 Saqi MA, Sternberg MJ (1991) A simple method to generate non-trivial alternate alignments of protein sequences. J Mol Biol 219:727–732 Schlosshauer M, Ohlsson M (2002) A novel approach to local reliability of sequence alignments. 
Bioinformatics 18:847–854 Shen M, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15:2507–2524 Shortle D, Simons KT, Baker D (1998) Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci USA 95:11158–11162 Siew N, Elofsson A, Rychlewski L, Fischer D (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 16:776–785 Sippl MJ (1993) Recognition of errors in three-dimensional structures of proteins. Proteins 17: 355–362 Skolnick J, Fetrow JS, Kolinski A (2000) Structural genomics and its importance for gene function analysis. Nat Biotechnol 18:283–287 Skowronek KJ, Kosinski J, Bujnicki JM (2006) Theoretical model of restriction endonuclease HpaI in complex with DNA predicted by fold recognition and validated by site-directed mutagenesis. Proteins 63:1059–1068 Tondel K (2004) Prediction of homology model quality with multivariate regression. J Chem Inf Comput Sci 44:1540–1551 Tosatto SC, Battistutta R (2007) TAP score: torsion angle propensity normalization applied to local protein structure evaluation. BMC Bioinform 8:155 Tress ML, Jones D, Valencia A (2003) Predicting reliable regions in protein alignments from sequence profiles. J Mol Biol 330:705–718 Vakser IA (1996) Low-resolution docking: prediction of complexes for underdetermined structures. Biopolymers 39:455–464 Vingron M (1996) Near-optimal sequence alignment. Curr Opin Struct Biol 6:346–352 Vingron M, Argos P (1990) Determination of reliable regions in protein sequence alignments. Protein Eng 3:565–569


Vorobjev YN, Hermans J (2001) Free energies of protein decoys provide insight into determinants of protein stability. Protein Sci 10:2498–2506 Wallner B, Elofsson A (2006) Identification of correct regions in protein models using structural alignment and consensus information. Protein Sci 15:900–913 Wang G, Dunbrack RL Jr (2004) Scoring profile-to-profile sequence alignments. Protein Sci 13:1612–1626 Wells GA, Birkholtz LM, Joubert F, Walter RD, Louw AI (2006) Novel properties of malarial S-adenosylmethionine decarboxylase as revealed by structural modeling. J Mol Graph Model 24:307–318 Wilson CA, Kreychman J, Gerstein M (2000) Assessing annotation transfer for genomics: quantifying the relations between protein sequence structure and function through traditional and probabilistic scores. J Mol Biol 297:233–249 Wojciechowski M, Skolnick J (2002) Docking of small ligands to low-resolution and theoretically predicted receptor structures. J Comput Chem 23:189–197 Wroblewska L, Jagielska A, Skolnick J (2008) Development of a physics-based force field for the scoring and refinement of protein models. Biophys J 94:3227–3240 Wroblewska L, Skolnick J (2007) Can a physics-based all-atom potential find a protein’s native structure among misfolded structures? I. Large scale AMBER benchmarking. J Comput Chem 28:2059–2066 Yang YD, Spratt P, Chen H, Park C, Kihara D (2010) Sub-AQUA: real-value quality assessment of protein structure models. Protein Eng Des Sel 23:617–32 Yu L, Smith TF (1999) Positional statistical significance in sequence alignment. J Comput Biol 6:253–259 Zhang MQ, Marr TG (1995) Alignment of molecular sequences seen as random path analysis. J Theor Biol 174:119–129 Zhang Z, Berman P, Wiehe T, Miller W (1999) Post-processing long pairwise alignments. Bioinformatics 15:1012–1019

Chapter 14

Evaluation of Protein Structure Prediction Methods: Issues and Strategies Anna Tramontano and Domenico Cozzetto

Abstract The Internet abounds with tools for predicting protein structure from sequence, and it also provides access to databases of protein three-dimensional models. This wealth of methods and repositories can be very useful for designing experiments and interpreting their results, as attested by several examples in the literature. On the other hand, life scientists need to select the most appropriate resource for their problem of interest. The structural bioinformatics community has devised worldwide initiatives – which are described in this chapter – to objectively monitor the state of the art in the field. The challenges in assessing the accuracy of structural models, in comparing different approaches, and in detecting and measuring the extent of progress over time will be discussed here together with some of the solutions adopted by the community. Finally, we will briefly describe a few examples of protein structure analysis and prediction that have been instrumental in shedding light on relevant biomedical problems.

14.1 Introduction The Internet traces its origins back to the United States ARPANET (Advanced Research Projects Agency NETwork) project during the late 1960s. Over the following 20 years, its infrastructure evolved along with some popular applications – emails for instance – that are still very widespread (Tanenbaum 2006). Nowadays, many research areas – including computational biology and bioinformatics – benefit enormously from the speed at which data can be accessed, which allows for maintenance and update of massive databases. In 1991, Tim Berners–Lee announced the CERN (Conseil Européen pour la Recherche Nucléaire) World Wide Web project, which “started with the philosophy that much academic information should be freely

A. Tramontano (B) Department of Biochemical Sciences, “Sapienza” University of Rome, Rome, Italy; Istituto Pasteur – Fondazione Cenci Bolognetti, “Sapienza” University of Rome, Rome, Italy e-mail: [email protected]

A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_14, © Springer Science+Business Media, LLC 2011



available to anyone”.¹ This event marked the beginning of a new era, in which people can easily access billions of interlinked documents and virtually anything that can be digitally encoded. At the time of writing, newcomers to molecular modelling can retrieve several million results using Google’s PageRank algorithm (Brin and Page 1998) for the queries “protein structure prediction methods”, “protein modelling software”, and “protein model database” – which is a clear sign of both the intellectual fascination and the tantalizing elusiveness of the folding problem. Inevitably, this considerable number of items confronts life scientists with the problem of identifying a tool suitable for their purposes. Of course, several of the Google-retrieved hits will point to data, statistics, ideas, opinions, and commercial advertisements. A thorough look at such extensive lists could help select only those web pages that refer to the most up-to-date programmes or databases freely available for download or on-line use. Unfortunately, this screening would still be inadequate to rank the sites – and by extension the protein models they store or build – according to their reliability. Indeed, while traditional publications undergo critical review before they can circulate, no rules or standards apply in general to ensure the quality of web releases. Inescapably, controversial resources appear on-line side by side with excellent ones, the former contributing to the potential propagation of erroneous or confusing information. Experts in the areas of protein structure and modelling have long recognized that random or incorrect choices have dangerous effects on the work of both computational and experimental scientists. Indeed, structure prediction tools and model repositories cannot be regarded as black boxes unless they also provide reliable quality estimates – ideally both for each predicted residue and for whole models – that conform to specific standards. 
While looking forward to achieving this goal, the structural bioinformatics community is running rigorous worldwide initiatives in order to objectively monitor the state of the art in the field. The results of such benchmarks are freely available, and users can find out which methods produce the most accurate results and are thus expected to be more reliable. This chapter mainly focuses on the experiments named Critical Assessment of techniques for protein Structure Prediction (Moult et al. 1995, 1997, 1999, 2001, 2003, 2005, 2007, 2009) – CASP for short. They preceded all similar endeavours in the field and have had the greatest impact so far, as confirmed by the level of participation (Fig. 14.1) and by the launch of analogous efforts in other areas of computational biology (Hirschman et al. 2005; Janin et al. 2003; Reese et al. 2000). Next to the historical focus on three-dimensional (3D) modelling, CASP also benchmarks methods for the inference of other related features – including biological function, domain boundaries, disordered regions, and model quality – that have gained their own relevance over recent years. The basic structure and organization of the experiments have not changed since 1994. Every 2 years, the organizers gather the sequences of soon-to-be determined

¹ The original post containing the World Wide Web Executive Summary is still available at http://groups.google.com/group/alt.hypertext/msg/395f282a67a1916c.


Fig. 14.1 Increase in the numbers of targets (squares), participating groups (inverted triangles), and 3D models (circles) submitted from CASP1 (1994) to CASP8 (2008). All data are available on the CASP website

protein structures – the CASP targets – from individual experimental groups and structural genomics centres. The CASP website (http://www.predictioncenter.org/) grants access to this information to participating groups. The same sequences serve as test set for fully automated structure prediction systems, allowing the unbiased comparison of their results with those of human experts and of meta-servers – i.e. programmes that combine the results from many independent servers into a single output. As soon as the frantic prediction season ends and the experimental structures become available, a hectic evaluation phase follows. This starts with the target dissection into domains and their classification by modelling difficulty, which depends on their similarity to structures in the Protein Data Bank (PDB) (Berman et al. 2000). In particular, target domains belong to the template-based modelling (TBM) category if structures similar to them exist in the database, regardless of whether they can or cannot be easily identified by sequence-based methods. All remaining domains fall in the free modelling (FM) category (Tress et al. 2009). It should be mentioned that the distinction between these two classes is rather fuzzy – and so they can partly overlap – due to the continuous nature of the fold space (Sadreyev et al. 2009; Sippl 2009). The Protein Structure Prediction Centre compares all submissions with the corresponding experimental data, thus deriving a huge amount of measurements of the prediction accuracy of Cα traces, side-chain angles, structurally conserved and divergent regions, and ligand-binding sites (Kryshtafovych et al. 2009b). Independent assessors employ these and possibly other measures to blindly evaluate the results – the predictors’ identity not being revealed to them until the results


are finalized. The employment of multiple evaluation criteria and the participation of new assessors every time – with rare exceptions – aim at reducing bias as well. The organizers finally host a meeting where the community discusses the outcome of the test, the assessors’ conclusions, and the methodologies that performed best. The CASP website provides permanent and open access to the numerical evaluation results of the findings and to the material presented by the assessors at the meetings with the purpose of the widest possible dissemination. Moreover, the journal Proteins: Structure, Function and Bioinformatics devotes a special issue to the experiment that collects papers by the organizers, the assessors, and the best prediction groups. In light of the CASP experience, the following sections detail the challenges and the corresponding solutions for the evaluation of model accuracy, for the critical comparison of different strategies, and for the detection and measure of progress over time. Confident a priori estimates of protein model quality are instrumental for taking proper advantage of the synergy between computational and experimental analyses in everyday practice, and the chapter pays due attention to this aspect before describing a few examples of protein structure analysis and prediction useful to relevant biomedical investigations.

14.2 Numerical Evaluation of Model Quality The accuracy of a computational model is measured in terms of its closeness to the native structure after optimal superposition. In principle, this similarity can be measured at increasing levels of detail by considering larger and larger subsets of superimposed atoms – from Cα atoms only to all atoms in the protein. The assessment of structure prediction results provided by the CASP organizers typically focuses on the overall quality of the Cα trace, which can be done through one or more of the following measures. The traditional measure of structural similarity is the root mean square deviation (rmsd), which is defined as:

$$\mathrm{rmsd} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^2} \tag{14.1}$$

where – here and in the following – $d_i$ represents the Euclidean distance between the $i$th pair of corresponding atoms and $N$ is the number of such pairs. Its use is very common among structural biologists for its conceptual simplicity and its effectiveness in quantifying the extent of protein flexibility, for example, for studying nuclear magnetic resonance (NMR) structure ensembles or different structural determinations of the same protein in diverse conditions. Albeit central in the first three rounds of CASP (Mosimann et al. 1995; Venclovas et al. 1997; Zemla et al. 1999), the rmsd is not really adequate to assess the quality of 3D models because it is prone to magnify errors rather than to identify well-predicted regions – contrary to the end users’ interests. Furthermore,


Fig. 14.2 RMS/coverage graph (Hubbard 1999) for the server HHpred5 prediction of the CASP8 target T0389 – the rhodanese domain of the human dual specificity phosphatase 16, PDB code: 2VSW. The x-axis indicates the fraction of modelled Cα atoms that can be superimposed on the experimental structure and the y-axis reports the corresponding rmsd values in Angstroms. Such graphs and their variants – including Global Distance Test plots (Zemla et al. 2001) – are useful to visually compare multiple 3D models for a specific target and to identify the best predictions, corresponding to the lower rightmost lines

it tends to increase with protein size as a result of its dependence on the number of equivalent atom pairs (Fig. 14.2). Hence, the alignment of (nearly) identical numbers of residues is necessary but not sufficient to interpret the same rmsd score for two pairs of proteins of different lengths as an indication of the same level of structural similarity. Moreover, as a consequence of its quadratic nature, this parameter gives relatively more weight to large deviations between computational and experimental structures. High rmsd values can result from a few local structural variations – possibly involving highly flexible regions – even when the global topologies of the model and the structure are largely similar. Threshold-based metrics attempt to alleviate the above drawbacks by means of one or more reasonable, empirically derived distance cut-offs. For instance, the Levitt–Gerstein score (Levitt and Gerstein 1998)

$$LG = M \sum_{i=1}^{N} \frac{1}{1 + \left(\frac{d_i}{d_0}\right)^2} \tag{14.2}$$

weighs the residue pairs for which $d_i < d_0 = 5$ Å relatively more than those at larger distances, thus making the measure more sensitive to the global topology than to local structural differences. Later, the LG function inspired other model quality


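measures, including the MaxSub score (Siew et al. 2000) and the TM-score (Zhang and Skolnick 2004)

$$S = \frac{1}{q}\sum_{i=1}^{N} \frac{1}{1 + \left(\frac{d_i}{d_0}\right)^2} \tag{14.3}$$

$$\text{TM-score} = \mathrm{Max}\left[\frac{1}{q}\sum_{i=1}^{N} \frac{1}{1 + \left(\frac{d_i}{d_0(q)}\right)^2}\right] \tag{14.4}$$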

where $q \geq N$ is the number of residues in the experimental structure. Apart from the technical details, notice that in Eq. (14.3) $d_0$ is constant – and actually set to 3.5 Å – while in Eq. (14.4) it varies as a function of the target length $q$. This change makes the TM-score independent of protein size, permitting the direct comparison of the quality of 3D models of different proteins. The Global Distance Test-Tertiary Structure (GDT-TS) (Zemla et al. 2001) is another very popular evaluation measure that addresses the rmsd shortcomings from a completely different point of view. Most end users simply do not care about the predicted regions that are too far apart from the native state to derive meaningful biological insights. GDT-TS implements this perspective by leaving out wrongly predicted regions from its calculation rather than by penalizing them. Specifically, GDT-TS is the mean of the maximum percentages $p_d$ of predicted residues that are at most $d$ Angstroms away from the corresponding experimental positions. Four distance cut-offs are used – namely, 1, 2, 4 and 8 Å – to find as many separate optimal alignments of the predicted and experimental structures by the LGA package (Zemla 2003) and to obtain the associated $p_d$ values.

$$\text{GDT-TS} = \frac{p_1 + p_2 + p_4 + p_8}{4} \tag{14.5}$$
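The averaging step in the GDT-TS definition can be illustrated with a minimal Python sketch. It is not the LGA implementation: the function name `gdt_score`, the input layout (one list of per-residue distances per cut-off) and the example distances are all hypothetical, the search for the optimal superposition at each cut-off is assumed to have been done elsewhere, and percentages are taken relative to a residue count supplied by the caller.

```python
from typing import Dict, Sequence


def gdt_score(distances_by_cutoff: Dict[float, Sequence[float]],
              n_residues: int) -> float:
    """Mean, over all cut-offs, of the percentage of residues within the cut-off.

    distances_by_cutoff maps each cut-off d (in Angstroms) to the per-residue
    distances obtained from the superposition optimized for that cut-off.
    """
    percentages = []
    for cutoff, distances in distances_by_cutoff.items():
        within = sum(1 for dist in distances if dist <= cutoff)
        percentages.append(100.0 * within / n_residues)
    return sum(percentages) / len(percentages)


# Hypothetical distances for a 10-residue target. For brevity the same
# superposition is reused for every cut-off, which a real search would not do.
example_distances = [0.4, 0.8, 1.5, 1.9, 2.5, 3.5, 5.0, 7.0, 9.0, 12.0]
gdt_ts = gdt_score({d: example_distances for d in (1.0, 2.0, 4.0, 8.0)},
                   n_residues=10)
print(f"GDT-TS = {gdt_ts:.1f}")
```

In this toy case 2, 4, 6 and 8 of the 10 residues fall within 1, 2, 4 and 8 Å, so the four percentages are 20, 40, 60 and 80 and their mean is 50. The two residues beyond 8 Å are simply not counted, however far away they are.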

The selected thresholds span a relatively wide range that rewards models with a roughly correct fold and scores highest those with a backbone conformation nearly identical to the target structure. Naturally, alternative cut-offs can be used; for example, the Global Distance Test-High Accuracy (GDT-HA) (Kryshtafovych et al. 2007b) employs values of 0.5, 1, 2, and 4 Å and is therefore more sensitive to the fine details of the modelled main-chain conformation.

$$\text{GDT-HA} = \frac{p_{0.5} + p_1 + p_2 + p_4}{4} \tag{14.6}$$
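The contrast between the quadratic rmsd of Eq. (14.1) and the bounded per-residue terms of Eqs. (14.2)–(14.4) can be made concrete with a small Python sketch. Several things here are assumptions rather than content of this chapter: the function names are hypothetical, the distances are invented, a single fixed superposition is used (so the maximization in Eq. (14.4) is not performed), and the length-dependent cut-off $d_0(q) = 1.24\,(q-15)^{1/3} - 1.8$ is taken from the TM-score paper cited above.

```python
import math
from typing import Sequence


def rmsd(distances: Sequence[float]) -> float:
    """Eq. (14.1): root mean square of the pair distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def bounded_sum(distances: Sequence[float], d0: float) -> float:
    """Sum of 1 / (1 + (d_i / d0)^2), the term shared by Eqs. (14.2)-(14.4)."""
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)


def tm_d0(q: int) -> float:
    """Length-dependent cut-off; formula assumed from Zhang and Skolnick (2004)."""
    if q < 19:
        raise ValueError("formula only gives a positive cut-off for q >= 19")
    return 1.24 * (q - 15) ** (1.0 / 3.0) - 1.8


def tm_term(distances: Sequence[float], q: int) -> float:
    """Bracketed quantity of Eq. (14.4) for one given superposition."""
    return bounded_sum(distances, tm_d0(q)) / q


# Two hypothetical models of a 100-residue target.
model_a = [2.0] * 100               # uniformly mediocre
model_b = [1.0] * 95 + [20.0] * 5   # accurate core, one badly placed loop

for name, dists in (("A", model_a), ("B", model_b)):
    print(name, round(rmsd(dists), 2), round(tm_term(dists, q=100), 3))
```

Model B has the larger rmsd because its five outliers enter Eq. (14.1) squared, yet it obtains the higher bounded score because each outlier can cost at most one unit of the sum. This is the behaviour that the threshold-based measures were designed to provide.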

As previously asserted, all such measures can estimate the reliability of the overall fold prediction, but do not take into account the details of the structure, such as the side-chain conformations. The quality of rotamer prediction is usually evaluated by comparing the torsion angles between the model and the target structure for each side chain and by calculating the fraction of χ1 and χ2 angles, which deviate less


than a set upper limit – typically 30° (Read and Chavali 2007; Tramontano and Morea 2003; Tress et al. 2005). The measure can be applied to all side chains or to buried and/or non-ambiguous ones only. More complex calculations can be used to evaluate how well the hydrogen bond interactions in the experimental structures are reproduced in the protein models (Kopp et al. 2007) or take into account the placement of functional or terminal groups relative to entire domains (Keedy et al. 2009). As one might expect, such detailed measures are worth considering only for structural models with a reasonably correct overall fold.
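The side-chain criterion described above amounts to counting torsion angles that agree within a tolerance, with the one subtlety that angles are periodic. The following Python sketch shows this under stated assumptions: the function names and the χ1 values are hypothetical, angles are in degrees, and no special handling is included for residues with symmetric side chains.

```python
from typing import Sequence


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def fraction_correct(model_angles: Sequence[float],
                     target_angles: Sequence[float],
                     tolerance: float = 30.0) -> float:
    """Fraction of torsion angles deviating by less than `tolerance` degrees."""
    pairs = list(zip(model_angles, target_angles))
    correct = sum(1 for m, t in pairs if angular_difference(m, t) < tolerance)
    return correct / len(pairs)


# Hypothetical chi1 angles for four residues (model versus experiment).
model_chi1 = [-60.0, 175.0, 65.0, -170.0]
target_chi1 = [-65.0, -178.0, 180.0, 60.0]
chi1_accuracy = fraction_correct(model_chi1, target_chi1)
print(f"chi1 within 30 degrees: {chi1_accuracy:.2f}")
```

The second residue shows why the wrap-around matters: 175° and −178° differ by only 7°, although their naive difference is 353°. Here two of the four angles are within the tolerance, giving a fraction of 0.50.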

14.3 The Identification of Successful Strategies Most methods for inferring the 3D atomic coordinates of protein structures from their amino acid sequences are knowledge-based, that is, they resort to heuristic rules that are extracted from the analysis of the relationship between the primary and tertiary structures of the proteins in the PDB. The nature of the relationship between the target sequence and known protein structures establishes their conventional categorization into comparative modelling, fold recognition, and fragment-based techniques. Evolutionarily related proteins preserve not only sequence similarity but also structural features (Chothia and Lesk 1986; Flores et al. 1993; Hubbard and Blundell 1987; Russell et al. 1997). Comparative or homology modelling is feasible when sequence analysis tools can detect a clear evolutionary relationship with known protein structures, as described in Chapter 13. On the other hand, highly different sequences can adopt similar structures, and indeed the number of different protein topologies seems to be limited (Chothia 1992; Orengo et al. 1993). Fold recognition methods aim at assessing whether the target sequence is compatible with known protein structures, despite very low or even undetectable sequence conservation. In some ways, fold recognition extends the scope of comparative modelling, so the community currently refers to both techniques as “template-based modelling”. Finally, unrelated proteins exhibit modest local sequence–structure correlation ranging from a few residues to supersecondary structural motifs (Bystroff et al. 1996; Han and Baker 1996). Fragment-based methods, discussed in Chapters 10 and 11, are useful to assemble three-dimensional models from lists of structure fragments when template-based methods do not provide convincing results. 
The distinction between such methodologies is blurring as a consequence of their cross-fertilization, of ceaseless database growth and of the development of more and more sophisticated homology searching methods. Physics-based or ab initio approaches tackle the protein structure prediction problem within a purely physical framework, building on the observation that polypeptide chains reach nearly invariably the same conformation in physiological conditions (Anfinsen 1973). Consequently, these methods aim at solving this optimization problem by searching the conformational space for the minimum free-energy configuration usually using stochastic search strategies (see Chapters 3 and 9).


Devising effective benchmarks to critically test and compare such diverse approaches is far from trivial. In the past, developers used to assess the performance of prediction algorithms by reproducing known structures. However, this practice is likely to overestimate the reliability of knowledge-based protocols because the test set could be biased by data used for their parameterization. Though well established, the principle of separate training and test sets is hard to comply with in practice, since subtle biases could remain even after standard filtering procedures. The introduction of blind tests is one of the foremost contributions of CASP to computational biology. Their implementation requires the availability of new experimental structural data on an appropriate timescale for as many proteins as possible, so as to derive general conclusions. Collecting bona fide blind predictions from several sources is the initial step of the experiments. The quality of individual models can be calculated through the metrics introduced in the previous section, which, however, cannot always be readily used for model evaluation. When proteins lack evolutionary relationships to known folds, the decoys could considerably depart from the experimental structure and thereby attain very low scores, whichever measure of similarity is used. In these cases, careful visual inspection is needed to identify predictions that succeed in modelling key structural features. On the other hand, numerical measures are well suited to automatically evaluate the accuracy of template-based predictions and to compare the performance of different methods on a large scale. The following describes the evaluation procedures adopted in CASP – which the community has widely accepted and which can thus be regarded as standard – but it will not dwell on the technical details that can be found elsewhere (Cozzetto et al. 2009a). 
GDT-TS has proved to be a satisfactory measure of model accuracy since CASP4, and it has often been complemented with other metrics (Cozzetto et al. 2009a; Kinch et al. 2003; Kopp et al. 2007; Tramontano et al. 2001; Tramontano and Morea 2003; Tress et al. 2005; Wang et al. 2005). The scores associated with each model have to be combined, i.e. summed or averaged, over the set of submissions by a given predictor in order to rank the methods. Nonetheless, the direct combination of the raw scores does not take into account the targets’ intrinsic modelling difficulties that mainly originate from the decrease in size and similarity of the structurally conserved regions as evolutionary distance grows. Hence, producing a model with high GDT-TS is much more challenging and remarkable when the target’s homologs of known structure are remote than when they are close. Sensible scoring systems should not uniformly weigh these different cases; rather they should reward the former ones. In CASP4, the assessor evaluated group performance by means of Z-scores derived from the target-based GDT-TS value distributions (Tramontano et al. 2001). These normalized scores reflect the relative accuracy of each model with respect to those of all available predictions. This strategy rewards more those models that are closer to the native structure than the “average” model and therefore implicitly takes into account the difficulty of the target prediction. All assessors followed this solution in subsequent CASP editions. The next issue is the appraisal of the statistical significance of the results, i.e. the assessment of whether the differences in the results by various methods are due to a genuinely better performance or to stochastic variations. Answering this question

14

Evaluation of Protein Structure Prediction Methods: Issues and Strategies


can be tricky or even impossible if the experimental design is not proper, for example, when individual methods produce models for different subsets of the target list. In CASP5, the assessor first conducted statistical comparisons of group performance by testing the null hypothesis that two groups could achieve identical results on the set of common predictions (Tramontano and Morea 2003). This strategy has since had great impact, convincing not only the CASP assessors and organizers but also the editorial boards of the leading journals in the field of bioinformatics. As often happens, statistical and biological significance do not necessarily match: an observed difference in model quality can be very unlikely to arise by chance and yet be of little or no practical importance.
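The normalization and the comparison on common targets described above can be illustrated with a minimal sketch. It assumes a plain population Z-score per target and simply collects paired differences on the targets both groups predicted; the function names and the toy GDT-TS values are invented for illustration, and the actual CASP procedures include refinements that are not reproduced here.

```python
from statistics import mean, pstdev


def target_z_scores(gdt_by_group):
    """Convert the raw GDT-TS values submitted for ONE target into Z-scores.

    gdt_by_group maps a group name to that group's GDT-TS on this target.
    """
    values = list(gdt_by_group.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all models equally accurate: no group stands out
        return {group: 0.0 for group in gdt_by_group}
    return {group: (v - mu) / sigma for group, v in gdt_by_group.items()}


def paired_differences(scores, group_a, group_b):
    """Score differences restricted to targets predicted by BOTH groups."""
    common = [t for t in scores if group_a in scores[t] and group_b in scores[t]]
    return [scores[t][group_a] - scores[t][group_b] for t in common]


# Toy data (hypothetical): target -> group -> GDT-TS.
raw = {
    "T1": {"A": 80.0, "B": 70.0, "C": 60.0},  # an easy target
    "T2": {"A": 30.0, "B": 20.0, "C": 10.0},  # a hard target
    "T3": {"A": 50.0, "C": 40.0},             # group B did not submit
}
z = {target: target_z_scores(groups) for target, groups in raw.items()}
diffs = paired_differences(z, "A", "B")  # only T1 and T2 contribute
```

In this toy set, group A obtains the same Z-score on T1 and T2 although its raw GDT-TS differs by 50 units, which is the sense in which the normalization implicitly accounts for target difficulty. The list `diffs` could then be passed to a paired significance test of one's choice.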

14.4 Recognition of Progress in Protein Structure Prediction

The field of protein modelling has exerted great influence since the beginning of structural biology. Over about four decades, scientists have unceasingly tried out new ideas and procedures in order to conclude this fascinating “grail quest”. While the definitive solution is deemed out of reach in the foreseeable future, both the number of amino acid sequences that can be assigned approximate and yet useful three-dimensional atomic coordinates computationally and the accuracy of the corresponding structural models are increasing. From the end user’s standpoint this is satisfactory, and protein structure prediction is indeed by now a widely used tool in the life sciences. However, developers of methods are interested in understanding whether the observed progress is due to a genuine improvement of the methods or linked to the availability of larger datasets of sequences and structures. The Human Genome Project (Lander et al. 2001; Venter et al. 2001) fostered remarkable advances in sequencing technologies (Mardis 2008), which in turn are now producing public sequence data at an unprecedented high-throughput rate. The year 2004 saw the release of the first meta-genomic analyses from environmentally derived samples (Tyson et al. 2004; Venter et al. 2004) and the analysis of human genetic variation across 1,000 individuals represents the current excitement in genome-centred research (http://www.1000genomes.org/). Whenever possible, most knowledge-based methods for structure prediction conveniently exploit the information provided by the sequences belonging to the same evolutionary family as the target protein. Indeed, larger sequence databases enhance the ability to detect potential templates – especially in the case of distantly related proteins – and to infer correct alignments, thus benefiting both template-based and template-free techniques.
At the same time, structural genomics initiatives are ongoing worldwide with a major focus on the determination of selected structures to expand the coverage of the protein fold space (Chandonia and Brenner 2006; Liu et al. 2007; Sali 1998). Of course, their output is bound to further expand the scope and reliability of empirical protocols that profit from experimental structural data to model protein fragments and domains.


A. Tramontano and D. Cozzetto

In general, the results of the comparison of different methods on the same test set are independent of database growth, as long as the prediction time window is relatively short. In CASP, new sequences or structures very seldom become available during the prediction period. By contrast, the matter requires special attention when contrasting the results of different approaches on different test sets, which is necessary for assessing the advancement of the field over the course of time. For this purpose, an appropriate scale is needed to estimate the difficulty of building a confident structure prediction for a specific protein sequence at different time points and therefore using different sequence and structure databases. In CASP, the definition of 3D modelling difficulty is based on two well-established features: the sequence and structure conservation with respect to the best-available single template – i.e. the most similar protein of known structure. In particular, the latter is identified by superimposing the target on each protein in the PDB through the LGA programme (Zemla 2003). The fraction of residues that are aligned within 5 Å and the fraction of aligned pairs that are identical are computed from the LGA-derived structural alignment. The average of these two figures is used as an estimate of the target prediction difficulty (Kryshtafovych et al. 2005, 2007a, 2009a; Venclovas et al. 2001, 2003). This estimate is based on reasonable assumptions, but it might be biased by different factors. Nonetheless, its use has highlighted some progress from CASP1 to CASP6 and much more marginal improvement in the subsequent editions of the experiment (Kryshtafovych et al. 2009a). As mentioned before, most knowledge-based methods make extensive use of evolutionary information in the form of multiple sequence alignments.
This implies that the number and phylogenetic distribution of the sequences employed in a modelling exercise are relevant factors in estimating the prediction difficulty. Although it accounts for the structural effect associated with the accumulation of mutations, the percent sequence identity or similarity derived from the pair-wise alignment of the target and template sequences is not an adequate parameter for measuring the difficulty of detecting suitable templates and of obtaining the correct alignment. Indeed, an alignment between the target and template sequences almost always takes into account the sequences of the whole protein family, and the number of sequences in a family is not constant over time. Figure 14.3 schematically illustrates how a multiple alignment including the target, the template, and other similar sequences can facilitate both the detection of the template and the alignment. Each node in the graph represents a sequence, while edge labels s(a,b) indicate the percent sequence identity between the proteins that they connect. The availability of related sequences determines the difficulty – and by extension the accuracy – of the pair-wise alignment of the target and the template. Despite very low pair-wise sequence identity, the alignment of very distantly related amino acid sequences turns out more and more often to be unexpectedly correct, thanks to the inclusion of additional and much more conserved intermediate sequences. The situation parallels the effort of crossing a broad river by hopping from one emerging stone to another: regardless of the river’s width, the difficulty of the task depends on the longest distance between two stones that must unavoidably be traversed. Back to the protein sequence space, a reasonable


Fig. 14.3 Toy illustration of the procedure to calculate the difficulty of obtaining a correct alignment for a pair of target and template sequences – denoted by t and T, respectively. First, the sequences spanning at least 80% of the aligned region between t and T – coloured in black – are identified (a). Their pair-wise sequence identities are then used to weigh the edges of a completely connected graph (b). The thick path from t to T is the one that bridges them using pair-wise alignments between sequences with at least 38% identical residues. Notice that t and T only have 16% identical amino acids and that there is no path connecting them where the minimum label is larger than 38

estimate of the target–template alignment difficulty consists in the hardest pair-wise alignment – i.e. the one between the pairs of proteins sharing the lowest sequence identity – required to bridge their evolutionary distance, given the set of homologous proteins retrieved from the database. Therefore, given the set Pt,T of paths from the target t to the template T, the relevant figures are the elements p that maximize the


minimum similarity between pairs of traversed nodes (Cozzetto and Tramontano 2005):

$$\mu_{t,T} = \max_{p \in P_{t,T}} \; \min_{\{a,b\} \in E(p)} s(a,b) \qquad (14.7)$$
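Equation (14.7) is an instance of the maximin (widest-path) problem on a weighted graph, so it can be evaluated with a Dijkstra-like search. The sketch below is one possible way to do this and is not taken from the cited work; the function name, the edge-table layout and the toy identities (chosen only to echo the 16% and 38% values of Fig. 14.3) are assumptions.

```python
import heapq


def bottleneck_identity(identity, target, template):
    """Maximum over paths of the minimum edge label, as in Eq. (14.7).

    identity maps frozenset({a, b}) to the percent sequence identity s(a, b).
    Returns None if the template cannot be reached from the target.
    """
    # Build adjacency lists from the undirected edge table.
    neighbours = {}
    for edge, s in identity.items():
        a, b = tuple(edge)
        neighbours.setdefault(a, []).append((b, s))
        neighbours.setdefault(b, []).append((a, s))

    best = {target: float("inf")}      # best bottleneck known for each node
    heap = [(-best[target], target)]   # max-heap emulated with negated keys
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == template:
            return width
        if width < best.get(node, float("-inf")):
            continue                   # stale heap entry, already improved
        for other, s in neighbours.get(node, []):
            new_width = min(width, s)  # a path is as hard as its worst step
            if new_width > best.get(other, float("-inf")):
                best[other] = new_width
                heapq.heappush(heap, (-new_width, other))
    return None


# Toy graph (hypothetical identities): t and T share 16% identity directly,
# but the chain t - x - y - T never drops below 38%.
edges = {
    frozenset({"t", "T"}): 16,
    frozenset({"t", "x"}): 45,
    frozenset({"x", "y"}): 38,
    frozenset({"y", "T"}): 52,
    frozenset({"t", "y"}): 20,
    frozenset({"x", "T"}): 25,
}
mu = bottleneck_identity(edges, "t", "T")  # 38 for this toy graph
```

Because the first time the template is popped from the heap its bottleneck is already optimal, the search can stop there without enumerating all paths.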

To a first approximation, μt,T estimates the difficulty of finding the correct evolutionary correspondence between the target and the template, based on the public sequence database at the time of the prediction, and is therefore useful for analyzing the alignment quality for equally difficult protein pairs at different times. This method was first applied to compare the results obtained in the comparative modelling category in CASP4 and CASP5, leading to the conclusion that the observed improvement in the scope and quality of homology modelling was due to the availability of larger and more finely sampled protein families rather than to the improvement of the methods (Cozzetto and Tramontano 2005). More recent analyses have investigated the outcome of successive CASP editions, confirming marginal advances in alignment methods over time and pointing to database growth as the main factor driving progress in the area (Cozzetto and Tramontano 2008; Tramontano et al. 2008). Notwithstanding the usefulness of the methods described above, the relationship between the expected prediction difficulty and the actual model quality needs to be investigated further. Multiple sequence alignments usually result from the comparison of two profiles that provide statistical descriptions of many sequence subsets and not from a chain of independent pair-wise matches. Moreover, both approaches presented here assume that protein structure prediction techniques employ a single template, which is becoming more and more atypical. Indeed, albeit more distantly related, additional templates can provide structural information about the regions not present in the single best one and/or increase the confidence in the prediction. It follows that the number and structural distribution of available templates should be taken into account in estimating the difficulty of modelling a given protein structure.
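For comparison, the single-template difficulty scale described earlier in this section (the average of structural coverage within 5 Å and sequence identity) can be sketched as follows. The input layout, the normalization of coverage by target length and the choice of computing identity only over the pairs within the cutoff are assumptions made for this illustration; the exact conventions are given in the cited CASP papers.

```python
def template_similarity(aligned_pairs, target_length, cutoff=5.0):
    """Average of structural coverage and sequence identity (both in 0..1).

    aligned_pairs: list of (target_residue, template_residue, distance_in_angstrom)
    taken from a structural superposition. Higher values mean an easier target.
    """
    close = [(a, b) for a, b, d in aligned_pairs if d <= cutoff]
    coverage = len(close) / target_length
    identity = sum(a == b for a, b in close) / len(close) if close else 0.0
    return (coverage + identity) / 2.0


# Toy alignment (hypothetical): 10-residue target, 8 aligned pairs,
# 6 of them within 5 Å, 3 of those identical.
pairs = [
    ("A", "A", 1.2), ("L", "L", 0.9), ("K", "R", 2.5), ("G", "G", 3.1),
    ("V", "I", 4.0), ("S", "T", 4.8), ("D", "D", 6.3), ("E", "Q", 7.5),
]
score = template_similarity(pairs, target_length=10)
```

Unlike μt,T, this figure depends only on the target and its single best template, so it does not change as more intermediate sequences accumulate in the databases.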

14.5 A Priori Estimates of Model Quality

Hitherto the discussion has dealt with the assessment of model accuracy with respect to the corresponding experimental reference data. As stated above, this is of great interest for developers of methods, but what users really need is the ability to assess the quality of a model when the experimental structure is not available – i.e. to estimate it a priori. Indeed, prediction tools and model repositories (Castrignano et al. 2006; Kiefer et al. 2009; Pieper et al. 2009) rarely provide an estimate of the expected divergence between the models and the corresponding native conformations – an essential parameter to decide whether they are suitable for a given application. This aspect is also crucial for the development and improvement of fold recognition and fragment-based methods and for meta-predictive techniques that evaluate several alternative models and select the most likely one.


The relationship between sequence and structure variation in homologous proteins (Chothia and Lesk 1986) or the parameter μt,T defined in Eq. (14.7) can help only to a limited extent, since neither can be applied to template-free models, which do not start from an initial sequence alignment among homologous proteins. Tools for the analysis of solved protein structures and computational models first appeared more than a decade ago and were usually based on the quality of the model stereochemistry, using parameters calibrated on well-refined high-resolution structures (Hooft et al. 1996; Laskowski et al. 1996). Other approaches for discriminating correctly folded models among a set of decoys have also long been available (Eisenberg et al. 1997; Sippl 1993), but the problem is far from being solved. This has prompted the structural bioinformatics community to devote greater effort to the model quality assessment (MQA) problem. Indeed, over the last 4 years more than 20 original research papers have appeared on this hot topic (Kryshtafovych and Fidelis 2009). Thus far, molecular mechanics energy-based functions, empirical potentials and machine-learning approaches – which take molecular environment, hydrogen bonding, secondary structure, solvent exposure, pair-wise residue interactions, and molecular packing into account – have been reported to achieve different degrees of success. On the other hand, the ease of access to several independent fully automated prediction tools allows for the collection of alternative models for the same target protein. This has fostered the development of consensus-based strategies that score each prediction according to its similarity to the set of available models (Fischer 2003; Ginalski et al. 2003; Lundstrom et al. 2001; Wallner and Elofsson 2005; Wallner et al. 2003).
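The idea behind consensus-based scoring can be shown with a minimal sketch: each model is scored by its mean similarity to all the other models available for the same target. The pairwise similarity matrix is taken as given (in practice it would come from a structural comparison measure), and both the function name and the toy values are hypothetical; this is not the algorithm of any specific published method.

```python
def consensus_scores(similarity):
    """Score each model by its mean similarity to all other models.

    similarity[i][j] is a pairwise structural similarity in 0..1
    between models i and j (assumed symmetric).
    """
    n = len(similarity)
    scores = []
    for i in range(n):
        others = [similarity[i][j] for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


# Toy matrix (hypothetical): models 0-2 resemble each other, model 3 is an outlier.
sim = [
    [1.0, 0.8, 0.7, 0.2],
    [0.8, 1.0, 0.9, 0.3],
    [0.7, 0.9, 1.0, 0.1],
    [0.2, 0.3, 0.1, 1.0],
]
scores = consensus_scores(sim)
best_model = max(range(len(scores)), key=scores.__getitem__)
```

The sketch also makes the main limitation visible: with a single model the score is undefined (set to 0.0 here), and an outlier is penalized whether or not it happens to be the correct structure.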
In 2006, the CASP organizers launched MQA as a new assessment category, thus acknowledging the need to identify effective protocols for the task (Moult et al. 2007). The experiment takes advantage of the server models submitted for assessment in the TBM and FM categories. Each model submitted by an automatic server for each of the targets in CASP is immediately made available to the community. Originally, this decision was meant to reduce the load on servers during the prediction season, since many predictors used server predictions as starting points for their models. An interesting side effect of this decision is that such models can be used as targets for MQA groups, who are asked to provide blind estimates of both their overall and residue-level reliability. In CASP, global quality estimates have to be real numbers ranging from 0.0 to 1.0 – where 1.0 denotes allegedly perfect 3D models and lower numbers stand for less accurate decoys. Residue-based confidence scores are to represent distances in Angstroms between the corresponding Cα atoms of the 3D model and the native structure. The first two editions of this test led to the development of a sound and robust assessment procedure that reflects the multifaceted nature of the MQA problem (Cozzetto et al. 2007, 2009b). Of course, picking the best 3D model out of several alternatives for a particular protein is extremely useful for the correct interpretation of experimental data or for the design of further experiments. On the other hand, the ability to correctly rank all models according to their quality can be used for optimizing the results of a model-building exercise. The analysis of the correlation between the predicted and observed confidence measures has proven that


Fig. 14.4 Target-based Pearson’s correlation coefficients as a function of the 3D modelling difficulty (Kryshtafovych et al. 2009a). Pcons results (black squares) are compared with the average correlation obtained by all other participating groups (empty circles). Vertical lines indicate the range of r values for all CASP8 single-domain targets, excluding T0498

consensus-based approaches are significantly better than those scoring one model at a time. Based on the analysis of thousands of quality estimates, Figs. 14.4 and 14.5 emphasize the good results of the Pcons method (Larsson et al. 2009; Wallner and Elofsson 2007). In CASP8, other consensus-based approaches turned out to be similarly effective (Archie et al. 2009; Benkert et al. 2009; Cheng et al. 2009; McGuffin 2009). Undoubtedly, the success of clustering-based techniques bodes well for the enhancement of fold recognition, fragment-based, and meta-predictive approaches. On the other hand, end users should be aware of their limitations, which primarily stem from the characteristics of the 3D model set at hand. Of course, these methods are of very limited use when only a few structure predictions – or even just one in the hardest cases – are available or when the decoys differ considerably from each other. Furthermore, while effective in several instances, the basic assumption that the most recurrent structural features are more likely to be correct does not hold in general. Indeed, the engineered CASP8 targets T0498 and T0499 adopt very distinct folds, although they differ in only 3 residues out of 56 (He et al. 2008). Most servers predicted both structures starting from similar templates in the PDB, which in fact were suitable for the latter protein only. Consequently, the performance of consensus-based MQA programmes was rather poor for T0498, for which only three prediction groups could rank the server models achieving a positive, albeit limited, correlation. In particular, Fiser-QA-Comb was able to score highest a model that was 4.91 GDT-TS units away from the native structure using statistical pair potentials (Rykunov and Fiser 2007). Not surprisingly, engineered proteins pose tricky challenges to structure prediction methods.
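The target-based correlation analysis mentioned above compares, for the models of a single target, the predicted quality estimates with the observed accuracy. A minimal sketch with a hand-written Pearson coefficient is given below; the function name and the toy numbers are invented for illustration and do not reproduce any CASP data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


# Toy target (hypothetical): predicted global quality (0..1) for four
# server models versus their observed GDT-TS.
predicted = [0.9, 0.7, 0.5, 0.2]
observed = [85.0, 70.0, 40.0, 30.0]
r = pearson_r(predicted, observed)
```

A value of r close to 1 indicates that the estimates rank and scale the models well for that target, whereas a case such as T0498, where the consensus is misleading, would show up as a low or negative r.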



Fig. 14.5 Two local quality estimates submitted by Pcons (solid lines) at CASP8 compared with the observed Cα distances (dashed lines). Predictions in (a) and (b) relate to server models T0458TS085_1 and T0465TS495_1, respectively. Targets T0458 and T0465 posed very different prediction difficulties (Kryshtafovych et al. 2009a), as the former was in the TBM category while the latter in the FM one. Nonetheless, predicted and observed distances are highly correlated in both cases

Another conclusion derived from the analysis of the CASP results is that there is still much room for improvement in methods able to estimate the reliability of a protein structure prediction on an absolute scale. A proper solution to this problem would be extremely attractive not only because it could be applied to individual models, but also because it would make it possible to identify the quality threshold required for a model to be used in a specific biological application.

14.6 Applications of Protein Models to Biomedical Research

As stated before, the knowledge of the three-dimensional structure of a protein is essential for understanding the details of its molecular function and can be a


useful guide for the design of effective experiments, as has been shown in several instances. On the other hand, the real cogent question is in which cases a model of a protein can effectively replace an experimental structure. Ideally, one would like to have a single quality figure associated with a model and a set of rules dictating which applications of the model are reasonable given its estimated quality, since models of different quality can be effectively used for different purposes. In some cases only very accurate models are useful, but in others even a very approximate model can be instrumental for fostering interesting discoveries. Perhaps the most effective way to convince the reader that this is the case is to describe a few examples. They have been chosen because the authors of this chapter were involved in the work and are therefore familiar with the specific details. Many more examples, equally if not more interesting, can be easily retrieved from the scientific literature. The first one that we will illustrate is a case where a very approximate model provided a wealth of information. This is the case of the Hepatitis C virus (HCV) NS3 protease. HCV is an RNA-positive virus (Lindenbach and Rice 2005), meaning that its genome is made of RNA which serves both the role of storing the genetic information and of functioning as mRNA. The virus encodes a large polyprotein of about 1,000 amino acids that is processed by cellular proteases and by a virally encoded protease located at the N terminus of a protein called NS3. In 1990, available information about the virus was very limited. It was impossible to replicate the virus in vitro and the only animal model was the chimpanzee – not a very easy animal to handle in the laboratory.
The only possibility for testing potential inhibitors of the virus was to express its proteins and test their function in heterologous systems, but the precise location of the cleavage sites of the polyprotein – which would give rise to the individual proteins – was unknown. At the time, methods for the detection of distant homologs had not yet appeared: neither iterative or profile searches nor Hidden Markov Models (HMMs) were available. On the other hand, the analysis of the sequence of the virus had allowed for the detection of a serine protease, and several structures of serine proteases were available. The strategy was the following (Pizzi et al. 1994): all known structures of serine proteases were superimposed on each other and a structure-based multiple sequence alignment was manually derived from the superposition. The putative viral protease sequence was manually aligned to the derived multiple sequence alignment and modelled on the basis of a consensus protease structure. The sequence identity between the viral protease and any other protein of known structure was below 15% and, as stated before, this implied that the homology model could only have a very low accuracy. Yet, during evolution the functional parts of proteins are subjected to a strong evolutionary pressure and are therefore better conserved than the rest. Incidentally, this is one major advantage of comparative modelling: the most relevant parts of the structure are bound to be modelled with higher accuracy. The inspection of the active site of the model of the viral protease showed that the bottom of the specificity pocket – i.e. the site of the protein that recognizes the substrate – was closed by a phenylalanine, which led to a rather small and hydrophobic environment for the side chain of the cleavable substrate.


An inspection of all the residues in proximity to a phenylalanine in the database of protein structures highlighted the possibility that a cysteine could interact favourably with it. Indeed, next to the expected cleavage sites of the polyprotein – only approximately known by an estimate of the size of the cleaved products – a cysteine was always present. The prediction was experimentally confirmed and paved the way for a number of useful experiments to dissect the function of the enzyme (Pizzi et al. 1994). Another feature that was observed by inspecting the model was the presence of a putative metal-binding site. Not only was the prediction experimentally confirmed, but it was also instrumental in facilitating the expression of the protein in heterologous systems (De Francesco et al. 1996). When the structure became available a few years later (Kim et al. 1996), it turned out that the prediction of the structure of the specificity pocket and of the metal-binding site was correct, but also that – as expected – the overall structure of the model was far from accurate. It would have been impossible to use the model for any detailed analysis, such as drug design experiments. Nevertheless, its availability was invaluable in speeding up the understanding of the biology of the virus by at least a few years. A more recent – and different – example regards the possibility of interfering with the cytoadherence mechanism of the Plasmodium parasite, the causative agent of malaria, a disease responsible for an estimated 300–500 million clinical cases and 1–3 million deaths annually (Patil et al. 2009). Of the four species of Plasmodium infecting humans (Plasmodium falciparum, Plasmodium malariae, Plasmodium vivax, and Plasmodium ovale), P.
falciparum is the only one able to cytoadhere to the surface of postcapillary endothelial cells through members of the PfEMP1 family of proteins, encoded by the var genes, which bind to the intercellular adhesion molecule 1 (ICAM-1) receptor (Kraemer and Smith 2006). The structure of ICAM-1 has been experimentally determined (Chakravorty and Craig 2005), while that of PfEMP1 is unknown. In this case the interest lies in understanding the detailed intermolecular interactions of the two proteins and, for that, one needs both an accurate model or structure of each partner and a method for modelling their interaction. The model of the PfEMP1 interaction domain can be built on the basis of the structure of the EBA175 protein, identified by a method (FUGUE) (Shi et al. 2001) that combines conventional profile and HMM methods using both sequence and structure information and improved environment-specific substitution tables. The sequence identity between the target and template was about 30% over the entire binding domain (Bertonati and Tramontano 2007). A model of the complex between the model of PfEMP1 and ICAM-1 can be obtained using docking techniques. In the particular case described here, the RosettaDock program (Gray et al. 2003) was used. The resulting model of the complex was consistent with a number of experimental data. For example, all but one of the residues of ICAM-1 shown to be important for binding were within 10 Å of at least one residue of PfEMP1. Furthermore, it is known that fibrinogen binds to ICAM-1 and effectively inhibits the attachment of PfEMP1 to ICAM-1, suggesting that the binding sites of PfEMP1 and fibrinogen on ICAM-1 overlap at least partially. The fibrinogen-binding site


of ICAM-1 has been located near residues D26 and P70 and, in the model, these residues are near the putative interface between ICAM-1 and PfEMP1, consistent with the experimental observations. An ICAM-1 gene variant – named Kilifi – presenting the K29M substitution in the first N-terminal Ig-like ICAM-1 domain has been found in Kenya and Gambia. The mutant has a lower binding affinity towards some PfEMP1 strains. According to the model, residue K29 of ICAM-1 interacts with PfEMP1 in a region which is different in strains with different binding affinity for the mutant. The consistency with experimental data indicates that the model is sufficiently accurate to allow for the design of small peptides likely to interfere with the binding of the two proteins and of great utility as tools for understanding the interaction. Models can also be used in more creative ways, for example, for assessing interesting features of complex genomes, such as the human one. After years of efforts, the number of genes in the human genome has not yet been established. The most recent estimate lies between 20,000 and 25,000 – a figure far too low to explain the complexity of the organism. An international effort (ENCODE) (Birney et al. 2007) aiming at the exhaustive identification and verification of functional sequence elements is presently ongoing. Its pilot phase – initially limited to a region of about 1% of the human genome – has revealed that a multitude of different transcripts are made at any given locus and that many of them originate from alternatively spliced RNAs. Can the alternative splicing mechanism provide the “missing” genes? In other words, are these transcripts actually translated into functional proteins?
Our approach consisted in building template-based three-dimensional models of the putative products and in analyzing them to try to answer the above questions, as well as to assess to what extent alternative splicing can generate different functions – a highly debated matter. There are 2,608 annotated transcripts in the ENCODE dataset and 1,097 of these are predicted to be protein coding – a large proportion of them leading to identical protein sequences. These coding sequence-identical variants are alternatively spliced in the 5′ and 3′ untranslated regions (Tress et al. 2007). Structures are known for proteins from 42 different loci (almost 10% of the total). All the others were analyzed to evaluate whether a reliable model could be built using comparative modelling techniques. Once the models were generated, one could ask what the effect of alternative splicing events on the structures would be by manually inspecting several of their features. When the alternative splicing isoform consisted in a protein with a deleted region with respect to the main isoform, one could assess whether the deletion was expected to affect the packed hydrophobic core of the protein and whether the residues flanking the deletion were so distant in the structure that major rearrangements would be required in the spliced isoform to preserve chain connectivity. If the alternatively spliced isoform presented an insertion with respect to the main one, one could verify whether the location of the insertion was exposed on the surface of the protein or compatible with the presence of its main secondary structural elements. The effect of the substitution of regions by alternative splicing


events is more difficult to appraise unless the difference in length between the two alternative exons is sufficiently large – in which case they could be thought of as insertions or deletions. The most striking result of the analysis was that in more than half of the analyzed cases, the resulting protein structure is likely to be substantially altered in relation to that of the principal sequence (Tress et al. 2007) and to give rise to a “problematic” protein structure – i.e. a protein that could not be thought of as the main isoform plus or minus a peripheral part of the structure. Are these putative alternative splicing products functional? Very hard to say. Some other proteins could interact with such a “problematic” isoform and stabilize its structure, or the latter could fold into a completely different conformation with respect to the main one. These hypotheses are difficult to test not only computationally, but also experimentally. Nevertheless, these phenomena are unlikely to be the explanation for all the observed cases. It follows that a sizeable fraction of the alternatively spliced products are probably either never translated or degraded because of their inability to form a properly folded stable structure: yet another surprise coming from genomic analysis and yet another area where protein structure prediction has revealed its usefulness.

14.7 Conclusions and Outlook

Structural bioinformatics is providing life scientists with many tools, mainly aimed at providing three-dimensional information on the molecules of life. The progress in the field has been impressive, not only in terms of coverage and accuracy of the methods, but also in their accessibility and in the availability of tools and initiatives for assessing their reliability. We focused here on two aspects – the evaluation of the accuracy of methods and models and their applicability to problems of biomedical relevance. The former concerns both the assessment of the quality of methods and the possibility of establishing a priori the reliability of a three-dimensional model of a protein. The latter is still an open problem and we would like to see substantial improvements, since this would permit the confident usage of model repositories – thus saving considerable time and effort – and would increase the already large set of examples where structural bioinformatics has contributed to important discoveries. We hope we have shown that the field is well aware of the relevance of these issues and that it is trying to focus efforts in the direction of providing life scientists with reliable, tested, and available tools. On the other hand, life scientists should make sure to be aware of recent developments and pay attention to the results of the assessments that are accessible to all.

Acknowledgments

The authors gratefully acknowledge support from the Italian Ministry of Labour, Health, and Social Policies, contract no. onc_ord 25/07. This work was partially supported by KAUST (award no. KUK-I1-012-43) and MIUR (FIRB Rete Italiana di Proteomica and Italbionet).


References Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230 Archie JG, Paluszewski M, Karplus K (2009) Applying undertaker to quality assessment. Proteins 77 (Suppl 9):191–195 Benkert P, Tosatto SC, Schwede T (2009) Global and local model quality estimation at CASP8 using the scoring functions QMEAN and QMEANclust. Proteins 77 (Suppl 9):173–180 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242 Bertonati C, Tramontano A (2007) A model of the complex between the PfEMP1 malaria protein and the human ICAM-1 receptor. Proteins 69:215–222 Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day, N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SC, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu, M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermuller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi, HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, 
Patel S, Tammana H, Chrast J, Henrichsen CN, Kai, C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei, CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu, Y, Green ED, Karaoz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou, M, Nikolaev S, Montoya-Burgos JI, Loytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim, J, Bhinge AA, Jiang N, Liu, J, Yao, F, Vega VB, Lee CW, Ng P, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Xu M, Haidar JN, Yu Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PI, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrimsdottir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VV, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, LindbladToh, K, Lander ES, Koriabine M, Nefedov M, Osoegawa 
K, Yoshinaga Y, Zhu B, de Jong PJ (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799–816 Brin S, Page L (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seventh International World-Wide Web Conference (WWW 1998)


Bystroff C, Simons KT, Han KF, Baker D (1996) Local sequence-structure correlations in proteins. Curr Opin Biotechnol 7:417–421 Castrignano T, De Meo PD, Cozzetto D, Talamo IG, Tramontano A (2006) The PMDB protein model database. Nucleic Acids Res 34:D306–309 Chakravorty SJ, Craig A (2005) The role of ICAM-1 in Plasmodium falciparum cytoadherence. Eur J Cell Biol 84:15–27 Chandonia JM, Brenner SE (2006) The impact of structural genomics: expectations and outcomes. Science 311:347–351 Cheng J, Wang Z, Tegge AN, Eickholt J (2009) Prediction of global and local quality of CASP8 models by MULTICOM series. Proteins 77 (Suppl 9):181–184 Chothia C (1992) Proteins. One thousand families for the molecular biologist. Nature 357:543–544 Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826 Cozzetto D, Kryshtafovych A, Ceriani M, Tramontano A (2007) Assessment of predictions in the model quality assessment category. Proteins 69 (Suppl 8):175–183 Cozzetto D, Kryshtafovych A, Fidelis K, Moult J, Rost B, Tramontano A (2009a) Evaluation of template-based models in CASP8 with standard measures. Proteins 77 (Suppl 9):18–28 Cozzetto D, Kryshtafovych A, Tramontano A (2009b) Evaluation of CASP8 model quality predictions. Proteins 77 (Suppl 9):157–166 Cozzetto D, Tramontano A (2005) Relationship between multiple sequence alignments and quality of protein comparative models. Proteins 58:151–157 Cozzetto D, Tramontano A (2008) Advances and pitfalls in protein structure prediction. Curr Protein Pept Sci 9:567–577 De Francesco R, Urbani A, Nardi MC, Tomei L, Steinkuhler C, Tramontano A (1996) A zinc binding site in viral serine proteinases. Biochemistry 35:13282–13287 Eisenberg D, Luthy R, Bowie JU (1997) VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol 277:396–404 Fischer D (2003) 3D-SHOTGUN: a novel cooperative fold-recognition meta-predictor.
Proteins 51:434–441 Flores TP, Orengo CA, Moss DS, Thornton JM (1993) Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci 2:1811–1826 Ginalski K, Elofsson A, Fischer D, Rychlewski L (2003) 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 19:1015–1018 Gray JJ, Moughon S, Wang C, Schueler-Furman O, Kuhlman B, Rohl CA, Baker D (2003) Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331:281–299 Han KF, Baker D (1996) Global properties of the mapping between local amino acid sequence and local structure in proteins. Proc Natl Acad Sci USA 93:5814–5818 He Y, Chen Y, Alexander P, Bryan PN, Orban J (2008) NMR structures of two designed proteins with high sequence identity but different fold and function. Proc Natl Acad Sci USA 105:14412–14417 Hirschman L, Yeh A, Blaschke C, Valencia A (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6 (Suppl 1):S1 Hooft RW, Vriend G, Sander C, Abola EE (1996) Errors in protein structures. Nature 381:272 Hubbard TJ (1999) RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions. Proteins (Suppl 3):15–21 Hubbard TJ, Blundell TL (1987) Comparison of solvent-inaccessible cores of homologous proteins: definitions useful for protein modelling. Protein Eng 1:159–171 Janin J, Henrick K, Moult J, Eyck LT, Sternberg MJ, Vajda S, Vakser I, Wodak SJ (2003) CAPRI: a critical assessment of predicted interactions. Proteins 52:2–9 Keedy DA, Williams CJ, Headd JJ, Arendall WB 3rd, Chen VB, Kapral GJ, Gillespie RA, Block JN, Zemla A, Richardson DC, Richardson JS (2009) The other 90% of the protein: assessment beyond the calphas for CASP8 template-based and high-accuracy models. Proteins 77 (Suppl 9):29–49


Kiefer F, Arnold K, Kunzli M, Bordoli L, Schwede T (2009) The SWISS-MODEL Repository and associated resources. Nucleic Acids Res 37:D387–392 Kim JL, Morgenstern KA, Lin C, Fox T, Dwyer MD, Landro JA, Chambers SP, Markland W, Lepre CA, O’Malley ET, Harbeson SL, Rice CM, Murcko MA, Caron PR, Thomson JA (1996) Crystal structure of the hepatitis C virus NS3 protease domain complexed with a synthetic NS4A cofactor peptide. Cell 87:343–355 Kinch LN, Wrabl JO, Krishna SS, Majumdar I, Sadreyev RI, Qi Y, Pei J, Cheng H, Grishin NV (2003) CASP5 assessment of fold recognition target predictions. Proteins 53 (Suppl 6):395–409 Kopp J, Bordoli L, Battey JN, Kiefer F, Schwede T (2007) Assessment of CASP7 predictions for template-based modeling targets. Proteins 69 (Suppl 8):38–56 Kraemer SM, Smith JD (2006) A family affair: var genes, PfEMP1 binding, and malaria disease. Curr Opin Microbiol 9:374–380 Kryshtafovych A, Fidelis K (2009) Protein structure prediction and model quality assessment. Drug Discov Today 14:386–393 Kryshtafovych A, Fidelis K, Moult J (2007a) Progress from CASP6 to CASP7. Proteins 69 (Suppl 8):194–207 Kryshtafovych A, Fidelis K, Moult J (2009a) CASP8 results in context of previous experiments. Proteins 77 (Suppl 9):217–228 Kryshtafovych A, Krysko O, Daniluk P, Dmytriv Z, Fidelis K (2009b) Protein structure prediction center in CASP8. Proteins 77 (Suppl 9):5–9 Kryshtafovych A, Prlic A, Dmytriv Z, Daniluk P, Milostan M, Eyrich V, Hubbard T, Fidelis K (2007b) New tools and expanded data analysis capabilities at the Protein Structure Prediction Center. Proteins 69 (Suppl 8):19–26 Kryshtafovych A, Venclovas C, Fidelis K, Moult J (2005) Progress over the first decade of CASP experiments.
Proteins 61 (Suppl 7):225–236 Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee, HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu, J, Hood L, Rowen L, Madan A, Qin, S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, 
Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght


A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921 Larsson P, Skwark MJ, Wallner B, Elofsson A (2009) Assessment of global and local model quality in CASP8 using Pcons and ProQ. Proteins 77 (Suppl 9):167–172 Laskowski RA, Rullmann JA, MacArthur MW, Kaptein R, Thornton JM (1996) AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J Biomol NMR 8:477–486 Levitt M, Gerstein M (1998) A unified statistical framework for sequence comparison and structure comparison. Proc Natl Acad Sci USA 95:5913–5920 Lindenbach BD, Rice CM (2005) Unravelling hepatitis C virus replication from genome to function. Nature 436:933–938 Liu J, Montelione GT, Rost B (2007) Novel leverage of structural genomics. Nat Biotechnol 25:849–851 Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A (2001) Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci 10:2354–2362 Mardis ER (2008) The impact of next-generation sequencing technology on genetics. Trends Genet 24:133–141 McGuffin LJ (2009) Prediction of global and local model quality in CASP8 using the ModFOLD server. Proteins 77 (Suppl 9):185–190 Mosimann S, Meleshko R, James MN (1995) A critical assessment of comparative molecular modeling of tertiary structures of proteins. Proteins 23:301–317 Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A (2007) Critical assessment of methods of protein structure prediction – Round VII.
Proteins 69:3–9 Moult J, Fidelis K, Kryshtafovych A, Rost B, Tramontano A (2009) Critical assessment of methods of protein structure prediction – Round VIII. Proteins 77 (Suppl 9):1–4 Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A (2005) Critical assessment of methods of protein structure prediction (CASP) – Round 6. Proteins 61 (Suppl 7):3–7 Moult J, Fidelis K, Zemla A, Hubbard T (2001) Critical assessment of methods of protein structure prediction (CASP): round IV. Proteins (Suppl 5):2–7 Moult J, Fidelis K, Zemla A, Hubbard T (2003) Critical assessment of methods of protein structure prediction (CASP) – Round V. Proteins 53 (Suppl 6):334–339 Moult J, Hubbard T, Bryant SH, Fidelis K, Pedersen JT (1997) Critical assessment of methods of protein structure prediction (CASP): round II. Proteins (Suppl 1):2–6 Moult J, Hubbard T, Fidelis K, Pedersen JT (1999) Critical assessment of methods of protein structure prediction (CASP): round III. Proteins (Suppl 3):2–6 Moult J, Pedersen JT, Judson R, Fidelis K (1995) A large-scale experiment to assess protein structure prediction methods. Proteins 23:ii–v Orengo CA, Flores TP, Taylor WR, Thornton JM (1993) Identification and classification of protein fold families. Protein Eng 6:485–500 Patil AP, Okiro EA, Gething PW, Guerra CA, Sharma SK, Snow RW, Hay SI (2009) Defining the relationship between Plasmodium falciparum parasite rate and clinical disease: statistical models for disease burden estimation. Malar J 8:186 Pieper U, Eswar N, Webb BM, Eramian D, Kelly L, Barkan DT, Carter H, Mankoo P, Karchin R, Marti-Renom MA, Davis FP, Sali A (2009) MODBASE, a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 37:D347–354 Pizzi E, Tramontano A, Tomei L, La Monica N, Failla C, Sardana M, Wood T, De Francesco R (1994) Molecular model of the specificity pocket of the hepatitis C virus protease: implications for substrate recognition. Proc Natl Acad Sci USA 91:888–892


Read RJ, Chavali G (2007) Assessment of CASP7 predictions in the high accuracy template-based modeling category. Proteins 69 (Suppl 8):27–37 Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, Lewis SE (2000) Genome annotation assessment in Drosophila melanogaster. Genome Res 10:483–501 Russell RB, Saqi MA, Sayle RA, Bates PA, Sternberg MJ (1997) Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J Mol Biol 269:423–439 Rykunov D, Fiser A (2007) Effects of amino acid composition, finite size of proteins, and sparse statistics on distance-dependent statistical pair potentials. Proteins 67:559–568 Sadreyev RI, Kim BH, Grishin NV (2009) Discrete-continuous duality of protein structure space. Curr Opin Struct Biol 19:321–328 Sali A (1998) 100,000 protein structures for the biologist. Nat Struct Biol 5:1029–1032 Shi J, Blundell TL, Mizuguchi K (2001) FUGUE: sequence–structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310:243–257 Siew N, Elofsson A, Rychlewski L, Fischer D (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 16:776–785 Sippl MJ (1993) Recognition of errors in three-dimensional structures of proteins. Proteins 17:355–362 Sippl MJ (2009) Fold space unlimited. Curr Opin Struct Biol 19:312–320 Tanenbaum AS (2006) Computer networks. Prentice Hall PTR, Upper Saddle River, NJ Tramontano A, Cozzetto D, Giorgetti A, Raimondo D (2008) The assessment of methods for protein structure prediction. Methods Mol Biol 413:43–57 Tramontano A, Leplae R, Morea V (2001) Analysis and assessment of comparative modeling predictions in CASP4. Proteins (Suppl 5):22–38 Tramontano A, Morea V (2003) Assessment of homology-based predictions in CASP5.
Proteins 53 (Suppl 6):352–368 Tress M, Ezkurdia I, Grana O, Lopez G, Valencia A (2005) Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins 61 (Suppl 7):27–45 Tress ML, Ezkurdia I, Richardson JS (2009) Target domain definition and classification in CASP8. Proteins 77 (Suppl 9):10–17 Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, Yeats C, Olason PL, Albrecht M, Hegyi H, Giorgetti A, Raimondo D, Lagarde J, Laskowski RA, Lopez G, Sadowski MI, Watson JD, Fariselli P, Rossi I, Nagy A, Kai W, Storling Z, Orsini M, Assenov Y, Blankenburg H, Huthmacher C, Ramirez F, Schlicker A, Denoeud F, Jones P, Kerrien S, Orchard S, Antonarakis SE, Reymond A, Birney E, Brunak S, Casadio R, Guigo R, Harrow J, Hermjakob H, Jones DT, Lengauer T, Orengo CA, Patthy L, Thornton JM, Tramontano A, Valencia A (2007) The implications of alternative splicing in the ENCODE protein complement. Proc Natl Acad Sci USA 104:5495–5500 Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428:37–43 Venclovas C, Zemla A, Fidelis K, Moult J (1997) Criteria for evaluating protein structures derived from comparative modeling. Proteins (Suppl 1):7–13 Venclovas C, Zemla A, Fidelis K, Moult J (2001) Comparison of performance in successive CASP experiments. Proteins (Suppl 5):163–170 Venclovas C, Zemla A, Fidelis K, Moult J (2003) Assessment of progress over the CASP experiments.
Proteins 53 (Suppl 6):585–595 Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew, I, Fasulo


D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan, W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun, J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan, C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V Istrail S Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, 
Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X (2001) The sequence of the human genome. Science 291:1304–1351 Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304:66–74 Wallner B, Elofsson A (2005) Pcons5: combining consensus structural evaluation and fold recognition scores. Bioinformatics 21:4248–4254 Wallner B, Elofsson A (2007) Prediction of global and local model quality in CASP7 using Pcons and ProQ. Proteins 69 (Suppl 8):184–193 Wallner B, Fang H, Elofsson A (2003) Automatic consensus-based fold recognition using Pcons, ProQ and Pmodeller. Proteins 53 (Suppl 6):534–541 Wang G, Jin Y, Dunbrack RL Jr (2005) Assessment of fold recognition predictions in CASP6. Proteins 61 (Suppl 7):46–66 Zemla A (2003) LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res 31:3370–3374 Zemla A, Venclovas C, Moult J, Fidelis K (2001) Processing and evaluation of predictions in CASP4. Proteins (Suppl 5):13–21 Zemla A, Venclovas C, Moult J, Fidelis K (1999) Processing and analysis of CASP3 protein structure predictions. Proteins (Suppl 3):22–29 Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57:702–710

Index

A Accessible surface area (ASA), 263 Adenylate kinase (AKE) Cα displacements, 169 conformational rearrangement, 168 crystallographic structures, 168 decomposition of, 173 density matrix of pairwise root mean square distances, 170 dynamical evolution of GROMACS software, 169 PME method, 169 experimental and theoretical studies, 168 free and bound forms, 168 free-energy landscape, 169 minima, 171 FRET experiments, 168 Gaussian network model for, 172 Gaussian character, 169 large-scale conformational changes, 174 lowest-energy excitation, 171 modes from trajectory, 169 near-native free-energy landscape, 168 NMR relaxation experiments, 168 open and closed arrangements, 168, 172 protein internal dynamics, 167 robustness, 172 simple point charge (SPC), 169 subdivisions, 173 time evolution of, 171 total fluctuation, 171 Adrenomedullin receptor (AMDR), 269 ADSs, see Antecedent domain segments (ADSs) Aggregation process “dry steric zipper” organization, 122 fibril-forming peptides study, 121

fragments of Aβ and protein tau, 122 hydrophobicity and hydrogen bonding forces, 121 of PHF6 peptides, 121–122 simulations with, 123 of protein, 121 stages of, 121 All-atom models of biomolecules, 90 sampling with all-atom force fields, 92 See also Protein models at resolutions All-atom potentials for proteins DMD-adapted potential of Dokholyan group, 113 Lund potential, 113 molecular dynamics (MD) method, 112 AMDR, see Adrenomedullin receptor (AMDR) Anfinsen thermodynamic hypothesis, 61, 144, 258 Antecedent domain segments (ADSs), 249 ASA, see Accessible surface area (ASA) ASP, see Atomic solvation potential (ASP) Assisted model building with energy refinement (Amber), 91 See also Protein models at resolutions Atom–atom contacts, 139 Atomic solvation potential (ASP), 97 Atomistic models, 90 ATTRACT algorithm, 24 B Backbone Building from Quadrilaterals (BBQ) program, 25 Backbone hydrogen bond potential, 114 Backbone root mean square deviation (bRMSD), 118–119 BBQ program, see Backbone Building from Quadrilaterals (BBQ) program Bethe approximation, 134 Bovine pancreatic trypsin inhibitor (BPTI), 37


342 BPTI, see Bovine pancreatic trypsin inhibitor (BPTI) bRMSD, see Backbone root mean square deviation (bRMSD) Build-up procedure for computation of protein structure, 64 C Cα –Cβ -side group (CABS) models, 14–17 modeling software, 22 CAPRI, see Critical Assessments of Prediction of Interactions (CAPRI) CASP, see Critical Assessment of Techniques for Protein Structure Prediction (CASP) Chain molecules, conformations, 1 Chemistry at Harvard molecular mechanics (CHARMM) force field, 56 Coarse-grained protein models all-atom potential energy function by Boltzmann principle, 37 by EDMC, 36 illustration of, 46 UNRES model, 36 apamin and melittin as lowest in energy optimized potential energy function, 39 APP and crambin as training proteins optimized potential energy function, 39 application of ab initio protein folding, 40 backbone hydrogen bonds patterns, 40 side-chain contact patterns, 40 biomolecular and soft-material systems, 36 BPTI use residue–residue five-state model for, 37 Cα atoms, distances between from PDB, 38 CABS model, 40 CASP4–CASP8, 40 categories of potentials by factorization, 35 potentials by force-matching method, 35 statistical potentials from structural databases, 35 distance-dependent and orientationdependent statistical potentials, 39 effective energy, formula for, 44–45 energy-embedding method, 39 energy functions, 39 force fields derivation, 45

Index optimization, 41 physical connection, 47 probability of conformation, 47 REE and PMF, 46 Gay–Berne model, 41 global optimization algorithm, 39 history of, 37 hydrogen-bonding and long-range side-chain contact energy, 39 knowledge-based potentials, 38 derivation of, 47–48 factor expansion of PMF, 51–55 interaction scheme, 49 protein–protein and protein–ligand complexes, 51 protein structure representation, 48 for protein structures, 47 reference state, 49 representation of, 49 statistical data derived from, 50–51 mean-field-based analytical formulas, 40 Miyazawa–Jernigan interaction energies, 41 MONSSTER and TOUCHSTONE approaches, 40 Monte Carlo dynamics algorithm using crambin, 40 folding model α-helical and β-sheet proteins, 40 folding simulations of protein A, 40 leucine zipper, 40 repressor of primer (ROP) dimer, 40 physics-based and knowledge-based potentials, 39 PRIMO all-atom force fields 1–4 interactions, 96 all-atom reconstructions based on, 94 ASP term, 97 bonded interactions between, 95 bond geometries, 94 CHARMM22/CMAP force field, 95 distance-based spline-interpolated potentials, 95 effective angle interaction, 95 GBMV formalism, 97 generalized Born (GB) formalism, 97 implicit solvent terms in, 97 interaction potential, 94 interaction sites, 93 Lennard–Jones interactions, 96 NMR J-coupling data, 96 non-bonded terms in, 96–97

Index on-the-fly from, 95 partial charges and scaling factor, 97 Poisson theory, 97 potential of mean force (PMF) as function, 97 reconstructing, 96 sampling with, 98 solvation free energy, 97 transferability, 98 for protein structures features, 43 pseudo-energy, 38 quasi-chemical approximation, 37 residue–residue interaction by Boltzmann principle, 38 Gaussian-like functional form, 38 Lennard–Jones-like functional form, 38 potentials, 48–49 RFE function of united peptide chain, 41 SICHO model, 40 implementation of, 100 interaction potential, 98–99 Kyte–Doolittle hydrophobicity scale, 99 MONSSTER program, 100 potential overlap, 99 sampling with, 98 side-chain packing, 98 statistical orientation-dependent side-chain–side-chain interaction potentials, 39 structure-based contact potentials, 37 theoretical approach protein structure and dynamics, 36 UNRES force field, 40 virtual-bond geometry, 37 Z -score optimization approach, 41 Coarse-grained United-Residue (UNRES) model, 36 Coil–globule collapse transition, 5 Collapse transition and secondary structures in proteins, 8 Conformational space annealing (CSA) method, 62 Conformational space representation Brownian dynamics, 43 continuous-space representation, 43 discrete representations of, 44 fragment approach, 44 intermediate-resolution lattice models, 44 low-resolution lattice models, 44 threading/fold recognition approach, 44 in torsional angles, 43 virtual-bond dihedral angles, 43

343 Continuous-space models, 25, 44 Cowpea Chlorotic Mottle Virus (CCMV) capsid nanoindentation of, 200, 202 fullerene-like truncated icosahedral symmetry, 201 Hookean reversible regime, 201 proteins in, 201 structure-based molecular dynamics study of, 201 Critical Assessment of Techniques for Protein Structure Prediction (CASP), 236, 296 average GDT-HA score of models, 296 CASP6 targets crystallographic and predicted structure, 17 computationally predicted protein structures with “golden standard,” 237 databank, 17 decoys, 147 definition of 3D modelling, 324 energy-based protein structure prediction, 41 evaluation procedures in, 322 identify and assemble templates, 86 “meta-prediction” in, 237–238 MQA critical judgment, 327 protein structure prediction, 64 secondary structure alignment, 236 targets used as test set for iterative structure refinement protocol, 100 TASSER and variants, 247 three-dimensional (3D) modelling, 316 Web site, 317–318 Critical Assessments of Prediction of Interactions (CAPRI), 23 CSA method, see Conformational space annealing (CSA) method Cubic lattice hydrophobic polar (HP) models, 10 Cysteine slipknots and knots, 189–190 D Database-free predictions of protein structure, 63 Data-driven docking, 23 Decoys threading results for two-body potentials for, 152 z-scores, 151

344 Define secondary structure of proteins (DSSP) assignment, 27 Delaunay tessellation, 139–140 DFIRE state, see Distance scaled, Finite Ideal-gas Reference state (DFIRE state) Discrete molecular dynamics (DMD)-adapted potential, 113 Discrete optimized protein energy (DOPE) score, 302 Distance-dependent atom-contact potential, 304 Distance scaled, Finite Ideal-gas Reference state (DFIRE state), 138 Distant-dependent potential functions DFIRE state, 138 distance-dependent energy functions, 138 distance-dependent potential functions, 138 effective potential energy for pairwise interactions, 138 KBP function, 138 Miyazawa–Jernigan potential function, 137 pair probability distribution, 138 RAPDF, 138 uniform density model, 138 uniform density reference state, 138 Distant-independent potential functions Bethe approximation, 134 contact matrix for pairs of residues, 134 effective inter-residue contact energies, 134 Miyazawa–Jernigan contact potentials, 134 residue–residue close contacts, 134 sample weighing alignment energy of residues, 136 average collapse energy, 136 average contact energy, 135 energy difference, 136–137 hydrophobic effect, 137 long-range energy, 135 repulsive energies, 136 threading of sequences, 137 two-body contact energies, 134–135 DOPE score, see Discrete optimized protein energy (DOPE) score DSSP assignment, see Define secondary structure of proteins (DSSP) assignment E EDMC, see Electrostatically Driven Monte Carlo (EDMC) EF-CG method, see Effective force-coarsegraining (EF-CG) method

Effective energy function optimization, 57 energy-gap optimization approaches, 58 energy-landscape theory, 58 experimental and calculated with optimized force field, 59 illustration of ordering of energy levels, 59 native-like and non-native conformations, 58 thermodynamic and structural data, 58 Z-score value, 58 Effective force-coarse-graining (EF-CG) method, 57 Elastic network models, 166–167 See also Low-energy collective excitations Electrostatically Driven Monte Carlo (EDMC), 36 Energy-based prediction methods, 63 Energy landscape paving (ELP), 213–214 Energy-landscape theory, 58 Energy-ranked conformations, 62 Error estimation methods, 297 Excluded volume, 1, 5–6, 12, 14–15, 48–49, 99, 115, 147 Explicit simulation/implicit solvent (ES/IS) method, 299 F Fast Fourier Transform algorithms, 23 Flexible docking conformational changes, 24 experimental data ATTRACT algorithm, 24 NMR chemical shifts, 24 RDC, 24 four-body statistical pseudo-potentials, 24 RosettaDock algorithm, 23 side chains optimization by free-space side-chain, 24 by rotamer library, 24 Flory’s mean-field theory chain dimension as function of temperature, 3 expansion factor, 3 free energy and conformational properties, 3 ideal chain dimensions, 3 quasi-chemical approximation, 3 zero-order approximation for folding/collapse transition, 3 Fluctuating bond models, 13 Folding thermodynamics backbone root mean square deviation (bRMSD), 118

calculated stability order, 117 circular dichroism (CD) and NMR experiments, 117 global free-energy, 117 GS-α3W simulations, 118–119 hydrogen bond-based nativeness, 117 hydrophobicity energy, 117 melting behavior, 118 stability differences among peptides, 116 statistically reliable results, 116 Force-field optimization for foldability, 57 Force-matching method centers of masses of groups of atoms, 55 CHARMM force field, 56 EF-CG approaches, 57 Euclidean norm of vector, 55 force-field optimization for foldability, 57 four-bead model, 56 least-squares fitting, 56 long-range forces, 56 pairwise contributions, 56 RFE function, 55 spline coefficients, 56 transferability, 57 Yvon–Born–Green equations in liquid-state theory, 56 Four-bead model, 56 Four-body contact scoring function, 139 Four-body statistical pseudo-potentials, 24 Fragment approach, 64 Fragment assembly aforementioned methods for, 247 ADSs, 249 descriptors, calculation, 248 de novo protein structure, 242 ABLE method, 245 CSA method, 244 FRAGFOLD, 245 GenTHREADER FR algorithm, 245 global structure, 245 HMMSTER method, 244 I-SITES library, 243 MEMC, 244 PDB database, 245 ROKKO and ROKKY, 243 ROSETTA use, 243–245 SALAMI method, 245–246 SIMFOLD, 243 UNDERTAKER method, 245 URMS, 245 and folding simulations FR alignments, 246 FRankenstein’s Monster method, 247

GS-KudlatyPred in, 247 HCPM method, 247 I-TASSER, 246–247 MetaMQAPcons method, 247 MQAP software, 247 REMC sampling method, 246 SPICKER, 246 TASSER, 247 TASSER and PROSPECTOR threading method, 246 ZAM method, 246 methods and resources, 242 Free-energy landscape of proteins, 162 Freely jointed chain model, 2 G Gaussian chain, 2 Gaussian-overlap model of solvation, 54 Gay–Berne model, 41 Generalized Born with molecular volume (GBMV) formalism, 97 Generalized-ensemble techniques, 216 control parameter space, random walks in expectation values, 221 Monte Carlo/molecular dynamics, 221 physical quantities, 220 replica-exchange method, 221 simulated tempering, 220 temperature distribution, 220–221 temperature, probability distribution, 220 model space, random walks in Hamilton-exchange method, 221 hopping, 221 van der Waals repulsion, 222 optimizing efficiency of flow distribution, direct measurement, 223–224 replicas moving up, fraction, 222 temperature distribution, 223 transition probabilities, 224 tunneling, 224 order parameter space, random walks in canonical distribution, 217 Gaussian-shaped repulsive potentials, 219 metadynamics-based methods, 220 Monte Carlo/molecular dynamics simulation, 217 multicanonical simulations, 218 protein-folding simulations, 219 re-weighting techniques, 217 simulated scaling, 219

stochastic tunneling, 218 Tsallis generalized mechanics formalism, 218 umbrella sampling, 217 Wang–Landau sampling, 217 Genome-wide protein structure prediction ab initio/de novo methods, 258 Anfinsen hypothesis, 258 CAS on-lattice and off-lattice models, 262 comparative modeling (CM), 256–257 in databases number of protein sequences, 256 E. coli, medium-sized ORFs confidence score, 266 C-score distribution, histogram, 266–267, 269, 273 PROSPECTOR_3 threading algorithm, 266 SPICKER, 266 transmembrane proteins, 267 in human genome GPCRs, structural modeling of, 267–272 I-TASSER methods, 261 ASA, 263 BETACON, 263 to Chlamydia trachomatis, application, 272–273 diagram of, 263 HMM-HMM search (HHsearch), 262 large-scale benchmarks, structure prediction, 264–266 LOMETS and FUGUE, 262 MUSTER, 262 PDB structures, 264 PROSPECT and PROSPECTOR3, 262 SPARKS2, 262 SPICKER, 263 STRIDE, 263 SVMSEQ and SVMCON, 263 SVMSEQ guide, 262 TASSER Monte Carlo simulation, 264 template-score (TM-score), 262 ORFs and GPCRs, 255 PDB, 255, 257 post-genomic era, 256 rapid strides, 257 RMSD, 257 scale structure predictions, pioneering efforts MODBASE database, 259

MODELLER, program, 259 MODPIPE pipeline, 259 multi-genomic scale comparative structure, 259 Mycoplasma genitalium, 258 ORFs, 259 and profile-profile alignment, 258 PSI-BLAST, 259 ROSETTA structure prediction method, large-scale test, 260 Saccharomyces cerevisiae, 259 SCOP, 260 sequence–profile, 258 TOUCHSTONE and PROSPECTOR, 259 structural genomics (SG) project, 256 TASSER methods CAS model, 261 and I-TASSER, 255, 258 Needleman–Wunsch dynamic programming algorithm, 261 PROSPECTOR3, 260 PSIPRED, structure information from, 260–261 SPICKER, 261 Z-score, 261 three-dimensional (3D) structure, 256 Geometrical restraints, 63–64 Geometric potential functions alpha shape of protein molecules, 139 Voronoi tessellation procedure, 139 GINZU/ROBETTA meta-server, 244 Global optimization problem, 61–62 Gō-like models, 42 GPCRs, see G protein-coupled receptors (GPCRs) G protein-coupled receptors (GPCRs), 255 Groningen molecular simulation (GROMOS), 91 H Hemoglobin, X-ray-resolved structures, 160 High-coordination lattice protein models alpha-carbon trace of globular protein, 15 backbone vectors, 14 CABS lattice-based model, 14–15 conformational transitions, 12 main-chain conformation, 14 non-physical fluctuations of bond length, 13 SICHO concept, 16 three-dimensional “chess-knight” representation, 12

with side groups, 13 High-resolution reduced modeling model system structure dynamics from computer simulation folding pathways explored by CABS, 289 observations, 288 protein folding studies, paradigm systems, 285 CABS energy, comparison of, 287 model system structure dynamics, experimental findings, 286 ROSETTA, 287 system evolution from, 287 Homology modelling, 321 Homopolymeric model of protein collapse, 8–9 Hookean reversible regime, 201 Human genome GPCRs, structural modeling of, 267 ADORA2A, 271 all-against-all comparison, 269 AMDR, 269 blind-test, 271 C-terminal tail and, 270 CXCR4 chemokine receptor, structure, 269 intracellular loop (ICL3), 270 I-TASSER protocol, 271–272 LOMETS, 272 neuropeptide Y Y1 receptor, first model, 270 PROSPECTOR3, 270–271 from registered entries, 268 RMSD of, 270–271 TASSER method, 268 TASSER models, 270 T4 lysozyme (T4L), 270 TM-helix identification program, 268 TM-helix regions, 271 X-ray/nuclear magnetic resonance (NMR) structure for, 268 Hybrids, multiple template-based models and BIOINBGU server, 239–240 3D-SHOTGUN, 239–240 fitness function, 242 FRANKENSTEIN3D method, 241 FRankenstein’s Monster approach, 240 FR methods, 239 FR server, 241 MetaMQAP score, 240–241 MODELLER, 240–241 MULTICOM-cluster method, 242

Protein Recombination method, 241 RMSD, 240 sequence–structure, 240–241 SHGUM, 240 Hydrodynamic interactions, 199–201 I Induced-fit theory, 160 Intermediate-resolution lattice models, 44 Intracellular adhesion molecule (ICAM-1) receptor structure of, 331 Inverse Boltzmann relationship Boltzmann distribution, 130 coarse graining level, 130 descriptor type, 131 energetic terms, 132–133 linear approximation, 131 linear potential function, 132 microstates of system, 130 net potential energy, 132 pairwise contact potential extraction from structures, 132 partition function for protein, 130 potential energy function, 131 probability distribution, 131 of occurrence, 132 of structural feature, 133 state interaction energies, 131 statistical mechanics, 131 See also Knowledge based potentials I-TASSER methods, see Iterative Threading Assembly Refinement (I-TASSER) methods Iterative refinement with protein models native state and CHARMM22/rdie and CHARMM19/EEF1, 102–103 model resolution, 103 PRIMO/GB and PRIMO/rdie, 102–103 refinement cycles, 104 scoring function, 104 using RMSD, 102 sampling protocol, 100 all-atom molecular dynamics simulations, 101 average sampling efficiency ratios, 104–105 PRIMO molecular dynamics simulations, 101 range of, 103 SICHO lattice Monte Carlo sampling, 101–102

structure refinement protocol, 100 Iterative structure refinement framework Monte Carlo simulations, 88 quantitative measure of sampling efficiency favor/disfavor sampling, 89 native-like structures at given cycle, 89–90 probability of conformation with coordinate RMSD, 89 sampling efficiency ratio, 90 standard deviation, 89 refinement scheme, 87 RMSD-based scoring function, 88 sampling–scoring cycles, 88 Iterative Threading Assembly Refinement (I-TASSER) methods, 255 to Chlamydia trachomatis, application local and global structure alignment-based method, 272–273 ORFs in, 272 ortholog, structure, 272–273 large-scale benchmarks, structure prediction loop modeling, 264–265 MODELLER, 265 PDB structure, 265–266 RMSDs, 264 ROSETTA, 265 success rate of, 265 TOUCHSTONE, 265 K Kendall correlation coefficients, 151 Knot-related topologies, 189 Knots dynamics, 193–198 Knowledge based potentials, 60–61 applications of CASP decoy, 147 decoy sets, 147 discrimination of native structure from decoys, 128 four-body sequence-based contact potentials (4CP-seq) and (4CP-non-seq), 147–148 machine learning algorithms, 149 new proteins design, 128 in predicting protein stability, 149 protein design, 149 protein docking, 128 protein stability and binding affinity prediction, 128 threading results, 148

used in protein–protein docking predictions, 148–149 used optimized non-linear design potential functions, 149 computational studies, 129 developing steps, 128 experimental observations, 129 functions atomic-level potentials, 129 coarse-grained potentials, 129 inverse Boltzmann relationship Boltzmann distribution, 130 coarse graining level, 130 descriptor type, 131 energetic terms, 132–133 linear approximation, 131 linear potential function, 132 microstates of system, 130 net potential energy, 132 pairwise contact potential extraction from structures, 132 partition function for protein, 130 potential energy function, 131 probability distribution, 131 probability of occurrence, 132 probability of structural feature, 133 state interaction energies, 131 statistical mechanics, 131 physical interactions, 129 quasi-chemical approximation desolvation of residues, 133 i-type and j-type residue, self-interactions, 133 self-contact energy, 133 solvent–solvent interaction potential, 134 Kyte–Doolittle hydrophobicity scale, 99 L Lattice models and continuous-space models, 44 of liquids, 4 protein folding and structure prediction with, 16–17 Lattice polymers with protein-like features collapse transition, 7 Levitt–Gerstein’s score, 319–320 Low-energy collective excitations coarse-grained description and elastic network models, 166–167 protein dynamical domains, 167 generic entry, 163 inter-substate and intra-substate fluctuations, 164–165

structural fluctuations in substates root mean square inner product (RMSIP), 165 structural substates, 163 dissimilarity cost function, 164 K-medoids-partitioning scheme, 164 Low-resolution lattice models, 44 Lund potential full-scale thermodynamic simulations of peptides, 115 native geometries studied using, 117 using PROFASI, 115 M Mass-weighted Hessian, 162 MaxSub score, 320 Mean-field approximation, 134 Mechanical unfolding AFM experiments, 119 optical tweezers, 119 ubiquitin and fibronectin, study of, 119–120 Mechanics–Poisson–Boltzmann surface area (MM-PBSA), 299 Membrane proteins, 198–199 Minimal protein-like models, 9 beta-barrel type target structure, 9 coil or turn-type local structures, 9 cubic lattice hydrophobic polar (HP) models, 10 ersatz of hydrogen bonds, 11 globular protein folding, 12 Greek-key topology, 9 motif, 11 lattice anisotropy effects, 10 potential energy, 11 protein-folding motifs, 10 putative native-like structure, 11 REMC sampling method, 11 two-state behavior of, 11–12 types of residues, 10 Miyazawa–Jernigan contact statistical potentials, 128 Miyazawa–Jernigan interaction energies, 41 Modeling of new structures from secondary and tertiary restraints (MONSSTER) program, 100 Model Quality Assessment Methods (MQA), 237, 327 Molecular dynamics (MD), 217, 221, 282 Boltzmann constant, 211 forces, 211 leapfrog algorithm, 211

method, 112 Newton’s law for, 210 temperature and kinetic energy, 211 Verlet algorithm, 210–211 Monte Carlo techniques, 5 study of polymer collapse transitions, 4 MQA, see Model Quality Assessment Methods (MQA) Multi-body potentials, 139 Delaunay tessellation algorithms, 140 four-body contact potentials Boltzmann formula, 142 construction of, 140–142 energy function, 142 identification of residue points for use in, 141 relative values of, 143 residue types combination, 141 total free energy, 142 Multicanonical ensemble Monte Carlo method (MEMC), 244 Multicopy (MC) algorithms, 26 Multiscale-coarse-graining (MS-CG) method, 55 Multiscale flexible docking with CABS, 24 all-atom reconstructions BBQ program, 25 main chain and beta carbons rebuilding, 25 side-chain fitting via side-chain replacement, 25 clustering procedure, 27 continuous space model, 25 crystallographic resolution, 28 flow chart of, 27 lattice representation, 25 LINUX computing unit, 27 MC algorithms, 26 mean-square dispersion, 28 Molecular Dynamics simulations, 26 peptide docking to receptor protein DSSP assignment, 27 hydrogen-bonding patterns, 27 protein–protein docking FTdock program, 30 intra-molecular restraints, 29 Rop homo-dimer, 29 structure obtained from, 29–30 pseudo-random mechanisms, 25 REMC simulations, 27 treating of flexibility de novo assembly, 26 fully flexible docking, 26

intra-protein interactions, 27 multi-body packing effects, 27 semi-flexible docking, 26 Multi-source threader (MUSTER), 262 Myoglobin, X-ray-resolved structures, 160 N Needleman–Wunsch dynamic programming algorithm, 261 Neuropeptide Y Y1 receptor model, 270 Non-homology (free modeling)-based structure prediction methods, 150 Numerical evaluation, model quality alternative cut-offs use for, 320 complex calculations used, 321 Euclidean distance between, 318 GDT-TS evaluation, 320 Levitt-Gerstein’s score, 319–320 MaxSub score and TM-score, 320 nuclear magnetic resonance (NMR), 318 root mean square deviation, 318 rotamer prediction, quality, 320 threshold-based metrics, 319 O Open reading frames (ORFs), 259 OPLS, see Optimized potentials for liquid simulations (OPLS) Optimization method, 143–144 Optimized potential functions, see Knowledge based potentials Optimized potentials for liquid simulations (OPLS), 91 P Particle mesh Ewald (PME) method, 169 PDB, see Protein Data Bank (PDB) Pearson’s correlation coefficient, 151 Peptide docking defined, 22 See also Flexible docking PFF potential of Wenzel group, 113 DMD method, 114 helical proteins for, 114 Lennard–Jones interactions, 114 Physics-based potentials, 60–61 PME method, see Particle mesh Ewald (PME) method PMF, see Potential of mean force (PMF) Polymers chain contour in, 2 computer simulations of, 5 dimensions as function of temperature, 8 Monte Carlo simulations, 5

Position specific iterated-basic local alignment search tool (PSI-BLAST), 259 Potential of mean force (PMF), 46 factor expansion of analytical expressions, 54 cluster-cumulant functions, 52 component energies, 51–52 first-order factors, 52 fourth-order factors, 53 Gaussian-overlap model of solvation, 54 illustration of physical meaning, 53 intrasite/intersite energy, 51–52 for pair of isobutane molecules, 54 second-order factors, 52–53 third-order factors, 53 μ-Potential of Shakhnovich group with Monte Carlo (MC) methods, 114 PDB-derived information, 114 Potentials‘R’Us database, 150 Prediction methods biomedical research, protein models applications, 329 ENCODE, international effort, 332 FUGUE, method, 331 Hepatitis C (HCV) virus, case, 330 Hidden Markov Models, 330 ICAM-1 receptor, structure, 331–332 metal-binding site, 331 models, 332–333 P. falciparum, 331 PfEMP1, model of, 331–332 Plasmodium parasite, 331 polyprotein, 331 RosettaDock program, 331 viral protease sequence, 330 CASP website, 317–318 experiments, structure and organization, 316–317 meta prediction CAFASP-CONSENSUS group, 237–238 CASP experiments, 237 3D-JURY, 239 FR method, 238 hybrids, multiple template-based models, 239–242 meta-server approach for, 238 PCONS, newest version, 239 model quality, numerical evaluation alternative cut-offs use for, 320 complex calculations used, 321 Euclidean distance between, 318

GDT-TS evaluation, 320 Levitt-Gerstein’s score, 319–320 MaxSub score and TM-score, 320 nuclear magnetic resonance (NMR), 318 root mean square deviation, 318 rotamer prediction, quality, 320 threshold-based metrics, 319 model quality, priori estimates, 326 CASP, 327 clustering-based techniques, 328 Fiser-QA-Comb, 328 GDT-TS units, 328 MQA, 327 Pcons method, 328–329 Pearson’s correlation coefficients, 328 sequence and structure variation, relationship between, 327 sound and robust assessment procedure, 327 PDB and TBM, 317 protein structure and modelling, experts in, 316 protein structure, classification and critical evaluation CASP, 236–237 CM approach, 236 comparative approach, 237 comparative modeling (CM), 232 de novo approach, 236 fold-recognition (FR), 236 homology-based modeling errors, 236 knowledge-based methods, 233 methods, 234–235 MODELLER, 236 MQAPs, 237 and physical approaches for, 232 physics-based methods model, 232 SWISS-MODEL, 236 in protein structure, recognition of progress CASP, 324, 326 grail quest, 323 Human Genome Project, 323 LGA-derived structural alignment, 324 target and template sequences, illustration, 325 RMS/coverage graph, 319 successful strategies, identification of blind predictions, bona fide, 322 CASP5 assessor, 323 comparative/homology modelling, 321 fold recognition methods, 321 GDT-TS value distributions, 322

knowledge-based protocols, 322 physics-based/ab initio approaches, 321 protein structures, 3D atomic coordinates, 321 stochastic variations, 322 template-based modelling, 321 targets, 317 US ARPANET project, 315 Program protein structure predictor employing combined threading to optimize results (PROSPECTOR), 259 PROSPECTOR3, 260 Protein Data Bank (PDB), 38, 255, 257, 317 Protein dynamics and thermodynamics, 1, 13, 16, 35, 111, 168, 281–283, 285, 289 Boltzmann constant, 66 Brownian dynamics, 66 CABS model of, 66 coarse-grained protein-folding simulations, 66 coordinates and momenta of particles Cartesian coordinates, 65 by Newton’s equations of motion, 65 Dirac function, 66 experimental studies of, 65 folding trajectory of, 67 friction coefficient, 65 generalized-ensemble algorithms, 68 KMT algorithm and knot, 190 knotted structures and, 198 Langevin equations of motion, 65 Monte Carlo (MC) methods, 66 non-conservative forces, 65 physics-based united-residue UNRES force field, 67 REMD method, 68–69 scalability graph, 67 scaling data, 68 sequence of folding events, 66 small-scale motions, 66 WHAM technique, 69 Protein folding dynamics, multiscale approach, 281 experiment and simulation, structural dynamics from denatured state, 283–284 experimental techniques, timescale resolution, 283 folding process, all-atom simulations of, 285 macromolecular processes, definitions, 284 MD, 283–284

nuclear magnetic resonance (NMR) spectroscopy, 282–283 X-ray crystallography, 282 by high-resolution reduced modeling protein folding studies, paradigm systems, 285–287 timescale limitations, 285 molecular dynamics (MD), 282 Protein intermediate model (PRIMO) all-atom force fields 1–4 interactions, 96 all-atom reconstructions based on, 94 ASP term, 97 bonded interactions between, 95 bond geometries, 94 CHARMM22/CMAP force field, 95 distance-based spline-interpolated potentials, 95 effective angle interaction, 95 GBMV formalism, 97 generalized Born (GB) formalism, 97 implicit solvent terms in, 97 interaction potential, 94 interaction sites, 93 Lennard–Jones interactions, 96 NMR J-coupling data, 96 non-bonded terms in, 96–97 on-the-fly from, 95 partial charges and scaling factor, 97 Poisson theory, 97 potential of mean force (PMF) as function, 97 reconstructing, 96 sampling with, 98 solvation free energy, 97 transferability, 98 Protein models at resolutions all-atom models of biomolecules, 90 sampling with all-atom force fields, 92 assisted model building with energy refinement (Amber), 91 atomistic models, 90 CHARMM force field, 90 coarse-grained models, 90, 92 PRIMO, 93–98 covalent molecular bonding geometries, 91 GROMOS, 91 interaction potential, 92 Lennard–Jones potential, 91 non-bonded interactions, 91 OPLS, 91 PRIMO and SICHO model, 91

spline-based torsion cross-correlation term, 91–92 van der Waals dispersion interactions, 91 Protein structure prediction, 64 Protein Structure Prediction Centre, 317–318 Protein thermodynamics, 51 Protein volume evaluation (PROVE), 297 Pseudo-random mechanisms, 25 PSI-BLAST, see Position specific iterated-basic local alignment search tool (PSI-BLAST) Q Quality assessment terms to RMSD, correlation MODELLER, 305 template-based models, dataset, 306 Verify3D score, 306 Quasi-chemical approximation desolvation of residues, 133 i-type and j-type residue, self-interactions, 133 self-contact energy, 133 solvent–solvent interaction potential, 134 R Random flight model, 2 RDC, see Residual dipolar coupling (RDC) REMUCA, see Replica-exchange multicanonical method (REMUCA) REMUCAREM, see Replica-exchange multicanonical method with replica exchange (REMUCAREM) Replica-exchange molecular dynamics, 63 Replica Exchange Monte Carlo (REMC) sampling method, 11, 26, 246 Replica-exchange multicanonical method (REMUCA), 69 Replica-exchange multicanonical method with replica exchange (REMUCAREM), 69 Residual dipolar coupling (RDC), 24 Residue–residue five-state model, 37 Restricted free-energy (RFE) function, 41 Rigid docking procedures, 23 CAPRI and, 23 data-driven docking, 23 Fast Fourier Transform algorithms, 23 geometric hashing procedures, 23 scoring functions, 23 steps in binary structures generation, 23 structures scoring, 23

Root mean square deviation (RMSD), 240, 257, 296 ROSETTA coarse-grained statistical potential, 64 RosettaDock algorithm, 23 Rouse model, 3 beads-and-springs Rouse chain, 4 dynamics of, 6 “phantom” chain approximation, 4 S SCOP, see Structural classification of proteins (SCOP) SCWRL program, 25 Secondary-structure prediction algorithms, 63 Second-order approximation, 134 Semi-coarse-grained model all-atom backbone and united-residue side chain, 41 hydrogen-bonding correlation, 41 knowledge-based force field, 42 Miyazawa–Jernigan interaction energies, 41 Monte Carlo and molecular dynamics, 42 side-chain interactions, 41 Semiflexible polymers behavior of, 8 cooperative collapse transition of, 9 trans-type conformations, 7 Sequence threading, 64 Side chain-only (SICHO) model, 16–17, 40 implementation of, 100 interaction potential, 98–99 Kyte–Doolittle hydrophobicity scale, 99 MONSSTER program, 100 potential overlap, 99 sampling with, 98 side-chain packing, 98 Side chain–side chain potential, 115 Simulation techniques advanced updates Gaussian step method, 215 Metropolis algorithm, 215–216 probability density function, 216 Rugged Metropolis (RM), 215–216 chameleon behavior, 226–227 folding pathway, 226 free-energy configuration of, 226 generalized-ensemble techniques, 216 control parameter space, random walks in, 220–221 model space, random walks in, 221–222 optimizing efficiency of, 222–224

order parameter space, random walks in, 217–220 global optimization techniques, 214 molecular dynamics Boltzmann constant, 211 forces, 211 leapfrog algorithm, 211 Newton’s law for, 210 temperature and kinetic energy, 211 Verlet algorithm, 210–211 Monte Carlo trial, 211 Metropolis algorithm, 212 thermodynamic quantities, 212 N-terminal strand in caching mechanism, 226 optimization techniques deterministic methods, 213 ELP, 213–214 and genetic algorithms, 213 RMSD, 214 stochastic algorithms, 213 protein, configuration of, 225 RMS deviation, 225 unfolding simulations protein configurations, poor sampling, 214–215 Slicing, 332–333 Smith–Waterman alignment score, 300 SPAD, see SuboPtimal Alignment Diversity (SPAD) Spearman’s rank correlation coefficient, 151 Square lattice polymers, 5 Statistical force fields for coarse-grained protein models CABS model, 146 distance-dependent pairwise energy, 146 OPUS-Ca potentials, 146 statistical potentials, 146 UNRES force field, 147 Statistical potentials types of distance-dependent, 128 distance-independent, 128 geometric, 128 Statistical protein contact potentials comparative analysis of contact potential matrices of hydrophobicity, 144 electrostatic interactions for potentials, 144–145 graphical illustration of correlations, 145 one-body approximation, 146

quasi-chemical principle, 145 residue-type-dependent factor, 144 Structural classification of proteins (SCOP), 260 Structure-based models, 42 application dynamics of knots, 193–198 hydrodynamic interactions, 199–200 mechanical strength of 17,134 proteins, 191–192 proteinic mechanical clamps, 193 proteins in membrane, 198–199 virus capsids, nanoindentation of, 200–202 β-cyclodextrins, 191 construction of atomic-overlap-based criterion, 184 chirality of residue, 184 damping constant, 185 dihedral angles, 184 dispersion of random forces, 185 disulfide bridges, 185 fifth-order Gear predictor-corrector scheme, 186 force-displacement curves, 186 hydrophobicity scale, 186 Langevin terms, 185 LJ3 model, 187–188 Miyazawa–Jernigan couplings, 187–188 molecular dynamics simulations, 186 native and non-native contacts in biomolecule, 182 Newton’s equations of motion, 185 and physical properties, 183 potentials, 183, 188 of protein for Trp-cage miniprotein, 185 pulling speeds, 183 “steered” all-atom molecular dynamics computations, 187 thermal fluctuations and, 187 thermostating, 186 of DNA and dendrimers, 188 bead models, 189–190 fluctuations in, 191 energy parameter, 182 folding process and, 181 free energy nature, 181 implementations, 181 Lennard-Jones potential, 181, 184 for native state proteins, 181

poly(amidino amine) (PAMAM) dendrimer ink, 191 poly(propylene imine) (PPI), 191 of proteins, 182–186 stretching/thermal denaturization, 182 SubAqua method constructing regression models, variable selection, 306 RMSD of structure models, correlation coefficient, 307 template-based models, predicted and actual RMSD values, 308 variable selection for, 308 quality assessment terms to RMSD, correlation MODELLER, 305 template-based models, dataset of, 306 Verify3D score, 306 two-step procedure to predict local quality linear regression model, variable selection, 309 RMSD, 310 SuboPtimal Alignment Diversity (SPAD), 300 score defined, 301–302 DOPE score, 302 global and local errors, examples, 301 local quality of models, correlation, 302–303 MODELLER, 302 optimal alignment, defined, 301 to RMSD of models, correlation, 302 suboptimal alignments, 301 target–template alignments, 302 Support vector machine (SVM), 304 T TASSER methods, see Threading Assembly Refinement (TASSER) methods Template-based modelling (TBM), 296, 317 Template-based protein structure models and CASP, 296 error estimation, methods for, 297 GDT-HA score, 296 model-quality assessment predicted models, reranking of, 297 real value of quality of models, prediction, 298 target–template alignment, 298 PROCHECK and MOLPROBITY, 297 protein volume evaluation (PROVE), 297 quality assessment measures, overview, 298

assessing alignment quality, 300 ES/IS method, 299 knowledge-based potential, 299 physics-based score, 299 SPAD, 300 RMSD, 297–298 SPAD score defined, 301–302 definition of, 296 DOPE score, 302 global and local errors, examples, 301 local quality of models, correlation, 302–303 MODELLER, 302 optimal alignment, defined, 301 to RMSD of models, correlation, 302 suboptimal alignments, 301–302 target–template alignments, 302 structure models, quality assessment, 297 structure models, real-value quality assessment ProQ, neural network, 304 SubAqua method, 305–306, 308–310 Tondel’s method, 303–304 TVSMod, 304–305 TBM and FM, 296–297 and WHATCHECK, 297 Thermodynamic-clustering technique, 63 Threading Assembly Refinement (TASSER) methods, 64, 255 large-scale benchmarks, structure prediction loop modeling, 264–265 MODELLER, 265 PDB structure, 265–266 RMSDs, 264 ROSETTA, 265 success rate of, 265 TOUCHSTONE, 265 model views, 271

Needleman–Wunsch dynamic programming algorithm, 261 PSIPRED, structure information from, 260–261 Threading results for two-body potentials for decoys, 152 Three-dimensional “chess-knight” representation, 12 Thymidine kinase 1p6x, slipknot conformation, 193 Tondel’s method, 304 PAM250 method, 303 Trans-activator of transcription (TAT), 162 flat free-energy landscape, 174 free-energy minima, 175 MD trajectories, 175 Two-bead model, 56 U United residues (UNRES), 25–26 Unit-vector root mean square distance (URMS), 245 UNRES model, see Coarse-grained United-Residue (UNRES) model URMS, see Unit-vector root mean square distance (URMS) V Verdier–Stockmayer dynamics of cubic lattice chain, 6 Virus capsids, nanoindentation, 200–203 Voronoi tessellation procedure, 139 W Weighted-histogram analysis method (WHAM), 63 Y Yvon–Born–Green equations in liquid-state theory, 56