FOUNDING EDITORS
Sidney P. Colowick and Nathan O. Kaplan
Preface
Five years ago, Academic Press published parts A and B of volumes of Methods in Enzymology devoted to Macromolecular Crystallography, which we had edited. The editors of the series, in their wisdom, requested that we assemble the present volumes. We have done so with the same logical style as before, moving smoothly from methods required to prepare and characterize high quality crystals and to measure high quality data, in the first volume, to structure solving, refinement, display, and evaluation in the second. Although we continue to look forward in these volumes, we also look resolutely back in time by having recruited three chapters of reminiscence from some of those on whose shoulders we stand in developing methods in modern times: Brian Matthews, Michael Rossmann, and Uli Arndt. A spiritually similar contribution opens the second volume: David Blow's introduction to our Phases section has his personal reflections on the impact that Johannes Bijvoet has had on modern protein crystallography.

In the earlier volumes, we foreshadowed a time when macromolecular crystallography would become as automated as the technique applied to small molecules. That time is not quite upon us, but we all feel the rattling of the windows from the heavy tread of high-throughput synchrotron-based macromolecular crystallography. As for the previous volumes, we have tried to provide in this volume sufficient reference that those becoming immersed in the field might find an explanation of methods they confront, while hopefully also stimulating others to create the new and better methods that sustain intellectual vitality.

The years since publication of parts A and B have seen amazing advances in all areas of the discipline. Super high brightness synchrotron sources (Advanced Photon Source in the United States, European Synchrotron Radiation Facility in Europe, and Super Photon Ring-8 in Japan) are producing numerous important results even while the older sources are increasing productivity. Proteomics and structural genomics have appeared in the lexicon of all biologists and have become vital research programs in many laboratories. In the spirit of the time, these chapters approach many of the methods that are pertinent to high-throughput structure determination. There are now robots for large-scale screening of crystal-growth conditions using sub-microliter volumes, which were accessible only in a few dedicated research laboratories a decade ago. Similarly, automation has begun to assume increasing roles in cryogenic specimen changing for data collection; many laboratories are building and beginning to use robots for this purpose.
The first and largest section of technical chapters dissects the cutting-edge methods for thinking about or accomplishing crystal growth, including theoretical aspects, using physical chemistry to understand and improve crystal diffraction quality, robotics, and cryocrystallography. The other large section addresses phasing. A profound shift has occurred with the growing appreciation that map interpretation and model refinement are inseparable from the phase problem itself. Various methods of integrating the two processes in automated algorithms constitute an important step toward realization of high-throughput. More importantly perhaps, they improve the resulting structures themselves. New algorithms for representing the variance parameters have come into wider practice. The database of solved macromolecular structures has grown to the point where its statistical properties now afford impressive insight and can be used to improve the quality of structures. Concurrently, simulation methods have become more accessible, reliable, and relevant. The validation process is therefore one that impacts a widening sphere of activities, including homology modeling and the presentation and analysis of conformational, packing, and surface properties. Many of these are reviewed in the concluding chapters.

We take little credit, either for the quality of the volume, which goes to the chapter authors, or for comprehensive coverage of competing methods. We will happily accept blame for mistakes and omissions. Academic Press has remained supportive and helpful throughout the long and trying process of completing this job, earning our sincere appreciation.

Charles W. Carter
Robert M. Sweet
Contributors to Volume 374

Article numbers are in parentheses following the names of contributors. Affiliations listed are current.
Jan Pieter Abrahams (8), Biophysical Structural Chemistry, Leiden Institute of Chemistry, 2300 RA Leiden, The Netherlands
Axel T. Brunger (3), The Howard Hughes Medical Institute and Departments of Molecular and Cellular Physiology, Neurology, and Neurological Sciences, Stanford Synchrotron Radiation Laboratory, Stanford University, 1201 Welch Road, Stanford, California 94305
Paul D. Adams (3), Lawrence Berkeley Laboratory, 1 Cyclotron Road, Berkeley, California 94720
Sergey V. Buldyrev (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Vadim Alexandrov (23), Department of Biochemistry and Biophysics, Texas A & M University, College Station, Texas 77843
Kyle Burkhardt (17), Research Collaboratory for Structural Bioinformatics, Department of Chemistry, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854
W. Bryan Arendall III (18), Department of Biochemistry, Duke University, Duke Building, Durham, North Carolina 27708
Nenad Ban (8), Institute for Molecular Biology and Biophysics, Swiss Federal Institute of Technology, CH-8093 Zurich, Switzerland
Raul E. Cachau (15), Advanced Biomedical Computer Center, Frederick, Maryland 21703
Stephen Cammer (22), University of California San Diego Libraries, 9500 Gilman Drive, La Jolla, California 92093
Joel Berendzen (3), Biophysics Group, Los Alamos National Laboratory, Los Alamos, New Mexico 87545
Charles W. Carter, Jr. (7, 22), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Helen M. Berman (17), Research Collaboratory for Structural Bioinformatics, Department of Chemistry, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854
D. M. Blow (1), 26 Riversmeet, Appledore, Bideford, Devon EX39 1RE, United Kingdom
Zbigniew Dauter (5), Synchrotron Radiation Research Section, NCI, Brookhaven National Laboratory Building, Upton, New York 11973
Jose M. Borreguero (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Feng Ding (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Eleanor J. Dodson (3), Department of Chemistry, University of York, Heslington, York YO1 5DD, United Kingdom
Nikolay V. Dokholyan (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Andrzej Joachimiak (15), Structural Biology Sciences, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
Jochen Junker (23), Max-Planck-Institut für Biophysikalische Chemie, D-37070 Göttingen, Germany
Zukang Feng (17), Research Collaboratory for Structural Bioinformatics, Department of Chemistry, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854
Michel H. J. Koch (24), European Molecular Biology Laboratory, Hamburg Outstation, D-22603 Hamburg, Germany
András Fiser (20), Department of Biochemistry and Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York 10461
W. G. Krebs (23), San Diego Supercomputer Center, University of California San Diego, La Jolla, California 92093
Roger Fourme (4), Soleil (CNRS-CEA-MEN), Bâtiment 209d, Université Paris XI, 91898 Orsay Cedex, France
Mark Gerstein (23), Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520
Ralf W. Grosse-Kunstleve (3), Lawrence Berkeley Laboratory, 1 Cyclotron Road, Berkeley, California 94720
Dorit Hanein (10), The Burnham Institute, La Jolla, California 92037
Jan Hermans (19), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Barry Honig (21), Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032
Victor S. Lamzin (11), European Molecular Biology Laboratory, Hamburg Outstation, 22603 Hamburg, Germany
Richard J. Morris (11), European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, United Kingdom
Garib N. Murshudov (14), Chemistry Department, University of York, Heslington, York YO1 5DD, United Kingdom
Ronaldo A. P. Nagem (5), CBME, Laboratório Nacional de Luz Síncrotron and Instituto de Física Gleb Wataghin, Unicamp Caixa, CEP 13084-971 Campinas SP, Brazil
Tom Oldfield (13), European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Cambridge CB10 1SD, United Kingdom
Thomas R. Ioerger (12), Texas A & M University, College Station, Texas 77843
Miroslav Z. Papiz (14), Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, United Kingdom
Ronald Jansen (23), Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520
Anastassis Perrakis (11), Netherlands Cancer Institute, Department of Carcinogenesis, 1066 CX Amsterdam, The Netherlands
Donald Petrey (21), The Howard Hughes Medical Institute, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032
Alberto Podjarny (15), Structural Biology Sciences, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
Igor Polikarpov (5), Instituto de Física de São Carlos, Universidade de São Paulo, Av. Trabalhador São-carlense, 13560 São Carlos SP, Brazil
Thierry Prangé (4), LURE (CNRS-CEA-MEN), Bâtiment 209d, Université Paris XI, 91898 Orsay Cedex, France
David C. Richardson (18), Department of Biochemistry, Duke University, Duke Building, Durham, North Carolina 27708
Eugene L. Shakhnovich (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
George M. Sheldrick (3), Lehrstuhl für Strukturchemie, Göttingen University, D-37077 Göttingen, Germany
H. Eugene Stanley (25), Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Dmitri I. Svergun (24), Institute of Crystallography, Russian Academy of Sciences, 117333 Moscow, Russia
Lynn F. Ten Eyck (16), National Partnership for Advanced Computational Infrastructure, San Diego Supercomputer Center, La Jolla, California 92093
Jane S. Richardson (18), Department of Biochemistry, Duke University, Duke Building, Durham, North Carolina 27708
Thomas C. Terwilliger (2, 3), Los Alamos National Laboratory, Los Alamos, New Mexico 87545
Jeffrey Roach (6), Department of Chemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
Alexander Tropsha (22), Department of Medicinal Chemistry and Natural Products, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599
Mark A. Rould (7), Department of Physiology, University of Vermont, School of Medicine, Burlington, Vermont 05405
James C. Sacchettini (12), Texas A & M University, College Station, Texas 77843
Andrej Šali (20), Mission Bay Genentech Hall, University of California at San Francisco, San Francisco, California 94143
Celia Schiffer (19), Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, Massachusetts 01655
Marc Schiltz (4), LURE (CNRS-CEA-MEN), Bâtiment 209d, Université Paris XI, 91898 Orsay Cedex, France
Thomas R. Schneider (3, 15), Lehrstuhl für Strukturchemie, Göttingen University, D-37077 Göttingen, Germany
J. Tsai (23), Department of Biochemistry and Biophysics, Texas A & M University, College Station, Texas 77843
Maria G. W. Turkenburg (3), Department of Chemistry, University of York, Heslington, York YO1 5DD, United Kingdom
Isabel Usón (3), Lehrstuhl für Strukturchemie, Göttingen University, D-37077 Göttingen, Germany
Patrice Vachette (24), LURE, Bâtiment 209d, Université Paris-Sud, F-91898 Orsay Cedex, France
Iosif I. Vaisman (22), School of Computational Sciences, George Mason University, Manassas, Virginia 20110
Niels Volkmann (10), The Burnham Institute, La Jolla, California 92037
Charles M. Weeks (3), Hauptman-Woodward Medical Research Institute, 73 High Street, Buffalo, New York 14203
Martyn D. Winn (14), Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, United Kingdom
John Westbrook (17), Research Collaboratory for Structural Bioinformatics, Department of Chemistry, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854
Kam Y. J. Zhang (9), Department of Structural Biology, Plexxikon, Inc., Berkeley, California 94710
[1] How Bijvoet Made the Difference: The Growing Power of Anomalous Scattering

By D. M. Blow

History
Johannes Bijvoet (1892–1980) made pioneering contributions to the determination of noncentrosymmetric structures. He was the first to exploit the isomorphous replacement method to reveal a noncentrosymmetric structure, using isomorphous sulfate and selenate salts to determine the structure of strychnine on the basis of two projections.1–3 In space group C2, the selenium atoms, one in each asymmetric unit, make a centrosymmetric array, and the structure factors of the heavy atoms (with appropriate choice of origin) are all real. The isomorphous difference then determines the real part of the strychnine structure factor, but the sign of the imaginary part of the structure factor is undefined. The best estimate of the strychnine structure factor is its real part. This leads to an electron density map in which the structure and its inverse are superimposed, with symmetry C2/m. The structure of the strychnine molecule was deduced by discarding one of each pair of atoms related by the mirror, using the same principles that Carlisle and Crowfoot4 had used in separating the two images of the cholesteryl iodide molecule generated by the "heavy atom" method. In both cases, the authors deriving their structure did not know which interpretation was a true representation of the molecule, and which was its inverted image.

Bijvoet recognized that anomalous scattering could be used to identify the correct enantiomorph of a noncentrosymmetric structure. He wrote,5

    There is in principle a general way of determining the sign [of a phase angle]. . . . We can use the abnormal scattering of an atom for a wavelength just beyond its absorption limit. . . . It also becomes possible to attribute the d or l structure to an optically active compound on actual grounds and not merely by a basic convention.
Nishikawa and Matsukawa6 and Coster et al.7 had observed departure from Friedel's law8 in diffraction from opposite polar faces of a zinc sulfide crystal.

1 C. Bokhoven, J. C. Schoone, and J. M. Bijvoet, Kon. Ned. Akad. Wet. 51, 825 (1948).
2 C. Bokhoven, J. C. Schoone, and J. M. Bijvoet, Kon. Ned. Akad. Wet. 52, 120 (1949).
3 C. Bokhoven, J. C. Schoone, and J. M. Bijvoet, Acta Crystallogr. 4, 275 (1951).
4 H. C. Carlisle and D. M. Crowfoot, Proc. R. Soc. A 184, 64 (1945).
5 J. M. Bijvoet, Kon. Ned. Akad. Wet. 52, 313 (1949).
In a beautifully clear exposition of anomalous scattering effects, Bijvoet9 drew on this example:

    Normal X-ray reflection does not detect any difference between one side [of the octahedral faces of a zinc blende crystal], a dull and poorly developed tetrahedron plane, and the other, a shining well-developed one. In this respect, it is less sensitive than the human eye. Coster, however, chose a radiation—Lα1 radiation of gold—which just excites the K electrons of zinc. . . . Now X-ray analysis not only detects a difference, but it concludes—and this is, of course, completely impossible for the human eye—that it is the dull plane that has the zinc plane facing outwards.
In 1951 Bijvoet and colleagues10 observed the intensity differences between Friedel-related pairs of X-ray reflections from sodium rubidium tartrate crystals. These observed differences showed that the convention established by Emil Fischer to discuss the configuration of bonds at asymmetric carbon atoms, especially in sugars, by good chance represents the true three-dimensional enantiomorph of these molecules. This was a substantial achievement, but Bijvoet was looking much further ahead. His visionary paper in 195411 opens by mentioning

    . . . the great successes of X-ray analysis that determined structures as complicated as those of sterols and alkaloids and that now approach the domain of Nature's most complicated biochemical compounds, the proteins. . . .
A flow chart (see Fig. 1) sets the agenda for structure determination for the next half-century. It shows how isomorphous substitution and anomalous scattering can determine phases for all noncentrosymmetric reflections. But there is a question mark. Bijvoet warns,11

    It has not yet been thoroughly investigated whether the small effect of the anomalous scattering will be measurable for a sufficient part of the reflections involved in a complete Fourier synthesis.
How thrilled he would have been to know that tunable X-ray sources could produce such measurable effects that anomalous scattering could solve protein structures on its own! These methods began to be introduced in the last year of his life.
6 S. Nishikawa and R. Matsukawa, Proc. Imp. Acad. Jpn. 4, 96 (1928).
7 D. Coster, K. S. Knol, and J. A. Prins, Z. Phys. 63, 345 (1930).
8 G. Friedel, Comptes Rendus 157, 1533 (1913).
9 J. M. Bijvoet, Endeavour 14, 71 (1955).
10 J. M. Bijvoet, A. F. Peerdeman, and J. A. van Bommel, Nature 168, 271 (1951).
11 J. M. Bijvoet, Nature 173, 888 (1954).
[Fig. 1 appears here: Bijvoet's flow chart of phase determination in the isomorphous substitution method, covering the centrosymmetric cases (heavy atom on or off a center, algebraic amplitude addition) and the noncentrosymmetric case (location of the heavy atom, amplitude addition in a vector diagram, resolution of the phase ambiguity by anomalous scattering, and determination of absolute configuration).]

Fig. 1. The Bijvoet presentation of phase determination in the isomorphous substitution method. Redrawn with permission from Nature 173, 888–891. Copyright 1954 Macmillan Magazines Limited.
Anomalous scattering became popular. Pepinsky's group12 devised a type of anomalous difference Patterson function, known as the Ps function, which is the sine transform of the intensities:

\[ P_s(\mathbf{u}) = \frac{1}{V} \sum_{\mathbf{h}} |F(\mathbf{h})|^2 \sin(2\pi\,\mathbf{h}\cdot\mathbf{u}) \]

Because the sine function is odd [sin(−x) = −sin(x)], the terms of this summation are only nonzero to the extent that intensities for h and −h differ. The Ps function is antisymmetric, and its positive and negative peaks represent vectors between an anomalous scatterer and a normal scatterer. Peerdeman and Bijvoet13 and Ramachandran and Raman14 both discovered a simple way to use a centrosymmetric array of anomalous scatterers in a noncentrosymmetric structure to derive the imaginary components of the structure factors.
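The sine-transform character of the Ps function is easy to verify numerically. The sketch below is illustrative only (the reflection list, intensities, and one-dimensional cell are invented); it shows that Friedel-symmetric intensities cancel exactly and that Ps(−u) = −Ps(u).

```python
import math

# Toy reflection list: h -> |F(h)|^2, including Friedel mates +h and -h.
# The values are invented purely to illustrate the summation.
intensities = {1: 110.0, -1: 90.0,   # a Bijvoet pair whose intensities differ
               2: 50.0,  -2: 50.0}   # a pair obeying Friedel's law
V = 1.0                              # unit-cell volume (arbitrary units)

def ps(u, data, volume):
    """Pepinsky-type Ps(u) = (1/V) * sum_h |F(h)|^2 * sin(2*pi*h*u), 1-D case."""
    return sum(i * math.sin(2.0 * math.pi * h * u) for h, i in data.items()) / volume

for u in (0.1, 0.25, 0.4):
    print(f"Ps({u:.2f}) = {ps(u, intensities, V):8.3f}")
    # Only the h = +/-1 pair contributes: sine is odd, so terms with
    # |F(h)|^2 == |F(-h)|^2 cancel exactly, and Ps(-u) = -Ps(u).
```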
Personal Notes

During 1954–1957 I was a student working on ways to apply isomorphous replacement to phase noncentrosymmetric reflections of proteins, and especially to deal with the ambiguities and errors that appeared to dominate the results in practice.15,16 I met Johannes Bijvoet at an international crystallography meeting in Madrid in April 1956, introduced myself, and outlined my research project to him. I remember him as a strongly built man with slightly receding hair (Fig. 2), who was gentle and encouraging to the nervous student talking to him—indeed, he was clearly excited by the progress in developing methods to exploit isomorphous replacement in proteins. He spoke excellent English, and his attitude was warm and friendly.

I met him again at an International Crystallography Congress in Cambridge in August 1960. By that time the subject of protein crystallography had become established by three-dimensional electron density maps for hemoglobin and myoglobin, and anomalous scattering had been used to help with the phasing of hemoglobin. Bijvoet was enthusiastic about these developments. He retired in 1962 and I did not see him again, but I had one other personal involvement. In 1972 I was elected to the Royal Society, and a few days later (before the formal admission ceremony) I learned that a vote was to be taken on the election of Bijvoet as a Foreign Member of the
12 Y. Okaya, Y. Saito, and R. Pepinsky, Phys. Rev. 98, 1857 (1955).
13 A. F. Peerdeman and J. M. Bijvoet, Acta Crystallogr. 9, 1012 (1956).
14 G. N. Ramachandran and S. Raman, Curr. Sci. 25, 348 (1956).
15 D. M. Blow, Proc. R. Soc. A 247, 302 (1958).
16 D. M. Blow and F. H. C. Crick, Acta Crystallogr. 12, 794 (1959).
Fig. 2. Johannes Bijvoet. Photograph courtesy of Han Meijer.
Royal Society. I was told it would be in order for me to vote, as I had already been elected. It was a pleasure and an honor to travel to London to cast my vote for his election.

Anomalous Scattering in Proteins
In 1956, a consistent and obvious difference was observed between the diffracted intensities of a Friedel pair for a low-order reflection of myoglobin (what is now known as a Bijvoet difference).17 Wyckoff ruled out that it was an experimental artifact, or that it was dependent on solvent concentration, and it was recognized to be an anomalous scattering effect.

17 J. C. Kendrew, G. Bodo, H. M. Dintzis, J. Kraut, and H. W. Wyckoff, unpublished data (1956).
Fig. 3. (a) Normal and anomalous components of the heavy atom structure factor for reflection h, and for reflection −h. (b) Comparison of the heavy atom structure factor FH(h) with the complex conjugate of its Friedel mate FH(−h)*.
These studies were made with CuKα radiation, for which the iron atom of myoglobin has a significant anomalous scattering component (about 3.4 electrons). We can now recognize that this effect could have given clear evidence about the position of the iron atom in the myoglobin crystals. It suggested that anomalous scattering might provide useful phase information, even though the effects at accessible wavelengths were much smaller than those of isomorphous replacement.

Let us refer18 to the normal part of the heavy atom structure factor as F′H. This is calculated using the real component of the atomic scattering factor, f0 + f′. (In practice, f′ is a negative quantity.) The anomalous part F″H is calculated using the imaginary part of the atomic scattering factor, if″. As indicated in Fig. 3,

\[ F'_H(-\mathbf{h}) = F'_H(\mathbf{h})^* \]
\[ F''_H(-\mathbf{h}) = -F''_H(\mathbf{h})^* \]
In the simple case of a centrosymmetric distribution of heavy atoms (which always exists for a single heavy atom site in a space group with an even-fold symmetry axis), the normal structure factor F′H of the heavy atom is real. In this case, the isomorphous replacement method estimates the cosine of the phase angle, but gives no information about its sine; measurement of the Bijvoet difference estimates the sine of the phase angle, but gives no information about its cosine (Fig. 1). In a general case (Fig. 4), the isomorphous replacement method and the anomalous scattering method give orthogonal information about the phases.
18 Notation: When discussing anomalous scattering, the subscript P refers to all the ordered atoms in the crystal whose atomic scattering factors are real. The subscript H refers to atoms that exhibit significant anomalous scattering, usually assumed to be all of the same type. The structure factor FH has two components: F′H, which arises from the normal part f0 + f′ of the scattering factor of the H atoms, and F″H, which arises from the anomalous part f″ of their scattering factor. When all the anomalous scatterers are of the same type, F″H is in quadrature with F′H.
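These relations are easy to check numerically. The sketch below is purely illustrative (the atom positions, scattering factors, and the protein contribution FP are invented); it evaluates F′H and F″H from the structure-factor summation, confirms the two relations above, and shows that adding a normally scattering contribution produces a Bijvoet difference.

```python
import cmath

# Illustrative only: two heavy atoms related by a center of symmetry
# (1-D fractional coordinates) with an invented scattering factor f0 + f' + i f''.
heavy_xs = [0.13, -0.13]
f0_plus_fp, fpp = 75.0, 7.0     # real part (f0 + f') and imaginary part (f'')

def structure_factor(h, xs, f):
    """F(h) = sum_j f * exp(2*pi*i*h*x_j) for identical point atoms."""
    return sum(f * cmath.exp(2j * cmath.pi * h * x) for x in xs)

def FpH(h):   # normal part F'_H(h)
    return structure_factor(h, heavy_xs, f0_plus_fp)

def FppH(h):  # anomalous part F''_H(h), computed with i*f''
    return structure_factor(h, heavy_xs, 1j * fpp)

h = 3
print(FpH(-h), FpH(h).conjugate())       # F'_H(-h)  =  F'_H(h)*
print(FppH(-h), -FppH(h).conjugate())    # F''_H(-h) = -F''_H(h)*

# Adding an invented, normally scattering "protein" contribution F_P shows
# that the Friedel mates then differ in amplitude (a Bijvoet difference).
FP = 300.0 * cmath.exp(1j * 1.0)
F_plus  = FP + FpH(h) + FppH(h)
F_minus = FP.conjugate() + FpH(-h) + FppH(-h)   # F_P(-h) = F_P(h)* for real scatterers
print(abs(F_plus), abs(F_minus))
```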
Fig. 4. Harker constructions for (a) isomorphous replacement difference; (b) Bijvoet amplitude difference, showing how they give orthogonal phase information. In (a) the real part of the scattering by the heavy atoms is used to calculate the structure factor F′H appropriate for isomorphous replacement. In (b), the Bijvoet difference is due to opposite effects of the imaginary part of the heavy atom scattering factor on F(h) and on [F(−h)]*.
Considering the effects with CuKα too small, Blow15 plated a rotating anode with chromium and made measurements on a mercury derivative of hemoglobin, using CrKα radiation (f″ for mercury then estimated as 15 e at 2.29 Å). This was a mistake, because the need for large absorption corrections at this wavelength seriously prejudiced precise observation of Bijvoet differences. It was subsequently concluded that MoKα radiation would have been more suitable, being relatively close to the mercury L absorption edge, but at a wavelength at which absorption errors are much smaller. The Bijvoet differences did give significant information to resolve ambiguities in the phases determined by isomorphous replacement, but the large errors made a quantitative estimate difficult. The results were simply categorized as Bijvoet difference probably positive, insufficient information, or as Bijvoet difference probably negative. Even this information was useful in estimating phases, when available isomorphous replacements left a large ambiguity in phase.15

Using only CuKα radiation, similar methods were employed by Cullis et al.19 to help resolve ambiguities of phase left by the isomorphous
19 A. F. Cullis, H. Muirhead, M. F. Perutz, M. G. Rossmann, and A. C. T. North, Proc. R. Soc. A 265, 15 (1961).
replacement technique, in determining the hemoglobin structure to 5.8-Å resolution.

The squares of the Bijvoet amplitude differences provide a set of Fourier coefficients of an approximate Patterson function of the anomalous scatterers (closely analogous to the difference Patterson for an isomorphous pair). Using CuKα radiation, the iron atoms of hemoglobin are the only important anomalous scatterers in the molecule and Rossmann showed how the iron atom positions could be determined directly from the Bijvoet differences.19a Blow and Rossmann20 showed that a recognizable but more noisy electron density map could be obtained using only the data from the parent crystal and a single isomorphous derivative, including anomalous scattering observations, the method now known as SIRAS (single isomorphous replacement with anomalous scattering).
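As a toy illustration of that idea (all numbers invented, and a one-dimensional cell standing in for the real three-dimensional synthesis), squared Bijvoet amplitude differences can be used directly as Fourier coefficients of an approximate Patterson function of the anomalous scatterers:

```python
import math

# Hypothetical Bijvoet pairs: h -> (|F(h)|, |F(-h)|).  Real coefficients would
# come from a measured data set; these numbers are invented for illustration.
bijvoet = {1: (101.0, 99.0), 2: (55.0, 60.0), 3: (80.2, 79.1), 4: (41.0, 44.5)}

def anomalous_difference_patterson(u, pairs):
    """P(u) ~ sum_h (|F(h)| - |F(-h)|)^2 * cos(2*pi*h*u), 1-D analogue."""
    return sum((fp - fm) ** 2 * math.cos(2.0 * math.pi * h * u)
               for h, (fp, fm) in pairs.items())

# Scan the 'cell'; peaks indicate vectors between anomalous scatterers.
for i in range(11):
    u = i / 10.0
    print(f"u = {u:.1f}   P(u) = {anomalous_difference_patterson(u, bijvoet):8.2f}")
```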
Methods of Analysis at Fixed Wavelength

Blow and Rossmann20 did not simply use the sign of the observed Bijvoet difference to resolve the ambiguity of phase left by the single isomorphous replacement. Instead, they followed a procedure similar to that of Blow and Crick,16 in which a probability is assigned to every possible phase angle depending on how accurately it fits the observations. For a particular reflection h, the observations of |FPH(h)| and |FPH(−h)| were treated as separate observations. The calculated heavy-atom structure factors FH(h) and FH(−h)* were calculated using a complex atomic scattering factor f0 + f′ + if″ (Fig. 1). The analysis was carried out as though the two members of the Friedel pair were separate isomorphous derivatives.

An improvement was suggested by North,21 who pointed out that the errors in exploiting the Bijvoet difference are far smaller than those that arise in isomorphous replacement, and the Bijvoet difference can be interpreted with greater precision. The implications of the isomorphous difference are confused by departures from ideal isomorphism between "parent" and "derivative," but there is no corresponding inaccuracy affecting the Bijvoet difference. Also, because measurements are made on the same crystal, often under similar geometric conditions, the amplitude difference is measured more accurately. North suggested a different algorithm for calculation of the phase probabilities, depending on the three observed structure amplitudes, |FP(h)|, |FPH(h)|, and |FPH(−h)|, and on the estimated normal and anomalous components of the scattering by the heavy atoms, F′H(h) and F″H(h). However, as North recognized, the
19a M. G. Rossmann, Acta Crystallogr. 14, 383 (1961).
20 D. M. Blow and M. G. Rossmann, Acta Crystallogr. 14, 1195 (1961).
21 A. C. T. North, Acta Crystallogr. 18, 212 (1965).
algorithm depended on an approximation and could be used in different ways.

This was a considerable improvement, but Matthews22 found a better formulation. The essence of the method was to change the variables used in the analysis. Instead of working with the observed quantities |FPH(h)| and |FPH(−h)|, Matthews worked with the mean structure amplitude ½(|FPH(h)| + |FPH(−h)|) and the Bijvoet amplitude difference |FPH(h)| − |FPH(−h)|. The mean structure amplitude estimates the structure amplitude that would exist if f″ were zero, designated F′PH(h). This is used in the usual way with |FP(h)| and with the calculated normal part of the heavy atom structure factor F′H(h), to obtain a phase probability distribution by isomorphous replacement. The Bijvoet amplitude difference is used in a similar way with the calculated anomalous part of the heavy atom structure factor. This second contribution to the phase probability distribution is independent of any assumption about isomorphism with the parent crystal P.

The Bijvoet amplitude difference (|FPH(h)| − |FPH(−h)|) is used to develop a phase probability distribution derived from anomalous effects. Because there are no errors due to nonisomorphism, the intrinsic errors in interpreting the Bijvoet difference are much smaller, so the root-mean-square (RMS) lack of closure E″ is smaller, leading to a more tightly defined phase distribution. This formulation22 is valid even when different types of anomalous scatterer exist, but usually one type of anomalous scatterer is assumed. This method of phase determination was coded by L. F. Ten Eyck23 and by J. E. Ladner24 into a widely used program PHARE (now incorporating a maximum likelihood refinement procedure and called MLPHARE25). The Ten Eyck phasing algorithm seems to have been used without significant change.

In that era X-ray analysis of proteins was practicable only when using characteristic radiations such as CuKα. Investigators concentrated on the effects of f″, whose effect causes a Bijvoet difference, and tended to ignore f′, which modifies the magnitude of the isomorphous difference, because at the given wavelength it is a fixed quantity. Minor criticisms of the Matthews algorithm22 can be made.

1. The normal part of the scattering F′PH(h) is the complex quantity ½[FPH(h) + FPH(−h)*]. But Matthews approximates |F′PH(h)| as ½[|FPH(h)| + |FPH(−h)|].
22 B. W. Matthews, Acta Crystallogr. 20, 82 (1966).
23 L. F. Ten Eyck, J. Mol. Biol. 100, 3 (1976).
24 J. E. Ladner, personal communication (2002).
25 Z. Otwinowski, "CCP4 Study Weekend Proceedings" (W. Wolf, P. R. Evans, and A. G. W. Leslie, eds.), p. 80. Daresbury Laboratory, Warrington, UK, 1991.
Fig. 5. The two triangles shown include the length |F′PH(h)|, which in each triangle can be calculated from the lengths of the other sides, and from two related (but unknown) angles. Equating two trigonometric expressions for |F′PH(h)| leads to Eq. (1).
By straightforward trigonometry (Fig. 5) (see also Burling et al.26),

\[ |F'_{PH}(\mathbf{h})|^{2} = \tfrac{1}{2}\left(|F_{PH}(\mathbf{h})|^{2}_{\mathrm{obs}} + |F_{PH}(-\mathbf{h})|^{2}_{\mathrm{obs}}\right) - |F''_{PH}(\mathbf{h})|^{2}_{\mathrm{calc}} \qquad (1) \]
Subscripts "obs" and "calc" emphasize that the calculated part of this expression is a small correction to the value derived from observation. In practice the error will often be on the order of 1% and it will rarely exceed 3–4% unless F″PH is extraordinarily large.

2. The isomorphous replacement method, as conventionally used, gives a phase probability distribution for the parent crystal FP. The phase probability distribution derived from the Bijvoet difference applies to the normal part of the scattering from the derivative crystal F′PH. That means it is the phase probability distribution for (FP + F′H). This fact was ignored by Matthews, who estimated the phase probability as if these two distributions applied to the same quantity.

The errors created by these simplified assumptions were insignificant in relation to the precision of phase estimation at the time.

A slightly different approach to isomorphous replacement was introduced by Hendrickson and Lattman.27 They devised a method to summarize the phase probability distribution by four coefficients A, B, C, and D, which essentially represent the first two Fourier components of the phase probability curve. To achieve this, they expressed the lack of closure error ε(φ) as the lack of agreement of observed and calculated intensity,

\[ \varepsilon(\varphi) = \bigl|\, |F_P|\exp(i\varphi) + F_H \,\bigr|^{2} - |F_{PH}|^{2} \]
26 F. T. Burling, W. I. Weis, K. M. Flaherty, and A. T. Brünger, Science 271, 72 (1996).
27 W. A. Hendrickson and E. E. Lattman, Acta Crystallogr. B 26, 136 (1970).
(In this expression |FP| and |FPH| are derived from the observed intensities, and FH is calculated from the heavy atom parameters.) This method has been incorporated into a number of computer programs. A criticism of it would be that the observational error in intensity tends to be proportional to the intensity, so that larger errors ε are usually encountered for intense reflections. In contrast, the observational error in amplitude is fairly constant for weak and medium strength reflections. Moreover, errors due to nonisomorphism are not correlated with intensity. Therefore the root-mean-square error E = ⟨ε(φbest)²⟩^(1/2) depends on the intensity. Blow and Crick16 used the amplitude error

\[ \xi(\varphi) = \bigl|\, |F_P|\exp(i\varphi) + F_H \,\bigr| - |F_{PH}| \]

in their analysis, and this justifies using a single value for the root-mean-square lack of closure E = ⟨ξ(φbest)²⟩^(1/2) at a given resolution, independent of the observed intensity (see also Kumar and Rossmann28).

A new era in the use of anomalous scattering began when Hendrickson and Teeter29 spectacularly demonstrated the possibilities of using anomalous scattering on its own, by total determination of the crambin structure using the anomalous scattering of its six sulfur atoms in CuKα radiation. In terms of a Harker diagram (Fig. 4b), this method (now called SAD: single-wavelength anomalous diffraction) produces a phase ambiguity equivalent to that of a single isomorphous replacement, but still gives important information. Because the constellation of six sulfur atoms will never be centrosymmetric there is no restriction on the indicated phase angle. The resulting image is not the structure plus its inverse, but a noisy image of the structure, which can be refined using other information, especially at high resolution. As is discussed in the final section of this article, these methods have now become powerful.
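The two definitions of lack of closure are easy to compare on a grid of trial phases. The sketch below uses invented amplitudes and an invented heavy-atom contribution; the point is only that the intensity-based error ε(φ) scales with the size of the reflection while the amplitude-based error ξ(φ) does not.

```python
import cmath
import math

# Invented observations for one reflection of an isomorphous pair.
FP_obs  = 120.0          # |F_P|, parent amplitude
FPH_obs = 135.0          # |F_PH|, derivative amplitude
FH      = 30.0 * cmath.exp(1j * math.radians(40.0))  # calculated heavy-atom F_H

def eps_intensity(phi):
    """Hendrickson-Lattman style lack of closure on intensities."""
    return abs(FP_obs * cmath.exp(1j * phi) + FH) ** 2 - FPH_obs ** 2

def xi_amplitude(phi):
    """Blow-Crick lack of closure on amplitudes."""
    return abs(FP_obs * cmath.exp(1j * phi) + FH) - FPH_obs

for deg in range(0, 360, 45):
    phi = math.radians(deg)
    print(f"phi={deg:3d}  eps={eps_intensity(phi):10.1f}  xi={xi_amplitude(phi):7.2f}")
```

A phase probability in the Blow-Crick spirit would then be taken proportional to exp(−ξ(φ)²/2E²), with E the root-mean-square lack of closure at that resolution.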
Synchrotron Radiation

In the late 1970s synchrotron radiation became more accessible, the first beamline facilities for macromolecular crystallography were set up, and for the first time experiments became feasible using any chosen wavelength. The possibilities of using anomalous scattering at several wavelengths were recognized early by Phillips et al.30
28 A. Kumar and M. G. Rossmann, Acta Crystallogr. D 52, 518 (1996).
29 W. A. Hendrickson and M. Teeter, Nature 290, 107 (1981).
30 J. C. Phillips, A. Wlodawer, J. M. Goodfellow, K. D. Watenpaugh, L. C. Sieker, L. H. Jensen, and K. O. Hodgson, Acta Crystallogr. A 33, 445 (1977).
It had now become possible to exploit the changes in f′ at different wavelengths, as well as the existence of Bijvoet differences. Karle31 made a new analysis of the problem that was rigorous and general. It developed the possibility of determining phase angles using anomalous scattering by a single crystal at different wavelengths [now known as MAD: multiple-wavelength anomalous diffraction (or dispersion)]. Karle's analysis, although presented in a general way, concentrates on the usual case in practice, in which there is one type of anomalous scatterer and other types of atom whose anomalous scattering factors f′ and f″ can be taken as zero. To preserve the notation already adopted, this article shall identify the scattering by these two parts of the structure by subscripts H and P, respectively (Karle uses subscripts 2 and 1). The subscript PH is omitted, because it refers to the whole structure of the crystal under investigation, including normal and anomalous scatterers. Following Karle, a structure factor F(h) without subscript refers to scattering by the whole crystal. Because the crystal includes anomalous scatterers, this structure factor depends on the wavelength and is written λF(h).

An important feature of Karle's analysis31 lies in the definition of the "normal" and "anomalous" parts of the structure. The normal structure factors Fⁿ are calculated as though all the electrons in the structure scatter normally. The anomalous component λFᵃ is the correction that must be made to give the actual scattering at some wavelength λ. To emphasize wavelength dependence, structure factors that are dependent on wavelength are given a presuperscript λ:
\[ {}^{\lambda}F(\mathbf{h}) = F^{n}(\mathbf{h}) + {}^{\lambda}F^{a}(\mathbf{h}) \]
(Although λFᵃ derives entirely from the heavy atoms that exhibit anomalous scattering, the subscript H is not needed, because there is no other anomalous scattering.) If the parameters of the anomalous scatterers are known (including the complex atomic scattering factors at wavelength λ), λFᵃ may be calculated. In contrast to the earlier approaches discussed above, the effects of changes to the real parts of the atomic scattering factors f′ are now included in the anomalous component λFᵃ. This facilitates comparison of scattering at different wavelengths. But the anomalous scattering λFᵃ is not in quadrature with the normal part Fⁿ. The relationship F″H(−h) = −F″H(h)* stated above does not apply to λFᵃ because F″H is only one component of λFᵃ. The notation has many advantages, but it does not allow the Bijvoet difference and the Bijvoet amplitude difference to be interpreted so simply. The two notations are compared in Fig. 6.
31 J. Karle, Int. J. Quantum Chem. Quantum Biol. Symp. 7, 357 (1980).
Fig. 6. Comparison of Karle notation using Fⁿ, λFᵃ with "pseudo-isomorphous" notation using FP, F′H, and F″H. F′H and F″H are orthogonal because all the atoms scattering anomalously are assumed to be of the same type.
Karle’s analysis31 expands the algebraic expression for each observable intensity as a linear combination of terms which depend on four variables. n j2 , and the other two depend on the sine Two variables are jFPn j2 and jFH n a and cosine of the angle ( ), which determines their phase relationship. If F a can be calculated from the parameters of the heavy atoms, this angle leads to a phase angle for the normal component of the structure factor F n. At each wavelength two independent intensity observations [F(h)]2 and [F(h)]2 provide two independent quantities that depend on these four variables. If measurements are made at two wavelengths, giving four independent observations, the system is determined in principle. If three wavelengths are used there are six equations in four unknowns, and standard linear algebra can give a definite least-squares solution (as in MADLSQ32). The precision of the result depends on the properties of the normal matrix of the equations—on how well ‘‘conditioned’’ they are. When the equations are solved for each reflection, the Fourier transform n j2 provides a Patterson function of the array of anomalous scatterers. of jFH From this, the parameters of the heavy atoms may be determined. Compared with the isomorphous replacement method, anomalous scattering analysis is relatively error-free. The experimental errors arise from inaccuracies of intensity measurement, and from inaccurate estimation of the anomalous scattering component, arising from errors in the estimated parameters of the atoms that cause it, at the particular wavelengths employed. There is also a significant but usually small error that arises from the assumption that all of the atoms in the P part of the structure are ‘‘normal’’ scatterers, with f 0 and f 00 precisely zero. For the Karle method,31 it was assumed at first that no sophisticated error analysis was 32
32 W. A. Hendrickson, Trans. Am. Crystallogr. Assoc. 21, 245 (1985).
needed, and many structures were solved in this way, using three and sometimes four different wavelengths. These usually include two very close wavelengths at the maximum anomalous scattering f″ ("peak") and at the absorption edge ("edge," where the change f′ to the normal scattering is maximized), and one or more "remote" wavelengths where f′ is fairly small, although f″ may be significant.

A study of selenobiotinyl streptavidin by MAD undertook a direct analysis of experimental error.33 The formulas derived from Karle's analysis31 were rearranged into a form parallel to Hendrickson and Lattman's expressions for MIR (multiple isomorphous replacement).28 The lack of agreement of observed and calculated intensities ε(φ) at each wavelength could be expressed directly in terms of four quantities A, B, C, and D, which allow a phase probability distribution to be calculated, leading to a best phase and figure of merit for each structure factor. The method was reported to give a smaller phase error, and the map using the resulting Fourier coefficients was significantly enhanced in appearance and ease of interpretation, compared with results from MADLSQ.
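The least-squares step of the explicit approach can be sketched as a small linear system. Everything below is invented for illustration (wavelengths, scattering factors, and a synthetic noise-free answer), and the coefficients follow one common linearization of the Karle equations in which each measured intensity is a linear combination of two squared amplitudes and the cosine and sine cross terms; the exact choice of the four unknowns varies with the decomposition adopted, and the real MADLSQ program differs in detail.

```python
import numpy as np

# Invented (f', f'') for the anomalous scatterer at three wavelengths, plus its
# normal scattering factor f0.
f0 = 34.0
fprime  = {"peak": -7.0, "edge": -9.5, "remote": -2.5}
fdprime = {"peak":  6.0, "edge":  3.5, "remote":  3.0}

# Unknowns: x = [|Fn|^2, |FnH|^2, |Fn||FnH|cos(dphi), |Fn||FnH|sin(dphi)],
# with dphi the phase difference between the two normal components.
x_true = np.array([250.0**2, 40.0**2,
                   250.0 * 40.0 * np.cos(1.1), 250.0 * 40.0 * np.sin(1.1)])

rows, obs = [], []
for lam in ("peak", "edge", "remote"):
    a = (fprime[lam]**2 + fdprime[lam]**2) / f0**2
    b = 2.0 * fprime[lam] / f0
    c = 2.0 * fdprime[lam] / f0
    for sign in (+1, -1):                      # Friedel mates h and -h
        row = [1.0, a, b, sign * c]
        rows.append(row)
        obs.append(np.dot(row, x_true))        # synthetic, noise-free intensity

A, y = np.array(rows), np.array(obs)
x_fit, *_ = np.linalg.lstsq(A, y, rcond=None)  # six equations, four unknowns
print("recovered:", x_fit)
print("true:     ", x_true)
# With real data the conditioning of A (how different the wavelengths are)
# controls how precisely the four quantities are determined.
```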
Two Ways with MAD

An alternative to the Karle approach is to apply the method of Matthews,22 originally developed to deal with anomalous dispersion at a single wavelength in conjunction with isomorphous replacement. Hendrickson and Ogata34 and Smith and Hendrickson35 have contrasted the two approaches. When the Matthews approach is applied to MAD, it is referred to as "pseudo-MIR" by Smith and Hendrickson. The methods based on Karle's approach,31–33 developed specifically for multiple wavelength studies, are called the "explicit" approach.

An important practical difference between the methods arises in identifying the positions of the anomalous scatterers. In pseudo-MIR the observed Bijvoet amplitude difference directly provides coefficients [|F(h)| − |F(−h)|]² for an anomalous difference Patterson synthesis as in Rossmann.19a The coefficients from observations at several wavelengths may be combined. In the explicit approach, the Karle simultaneous equations generate |FⁿH|², which provide coefficients for a Patterson function of the anomalous scatterers. The advantages and disadvantages of these approaches are discussed briefly by Smith and Hendrickson.35
33 A. Pähler, J. L. Smith, and W. A. Hendrickson, Acta Crystallogr. A 46, 537 (1990).
34 W. A. Hendrickson and C. M. Ogata, Methods Enzymol. 276, 494 (1997).
35 J. L. Smith and W. A. Hendrickson, in "International Tables for Crystallography" (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 299. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001.
Once the positions of the anomalous scatterers have been established by either of these approaches, an estimate of FⁿH (= |FⁿH| exp iφⁿH) can be calculated, and the parameters of the H atoms can be refined by many available methods. When this has been done, either the explicit or the pseudo-isomorphous method may be used to obtain phases. In the explicit approach, quantities proportional to cos(φⁿP − φⁿH) and sin(φⁿP − φⁿH) have already been calculated, so that φⁿH leads directly to the phase angle φⁿP, which allows calculation of an electron density map representing the scattering density of the normal scatterers. Alternatively, the Karle equations are revisited to generate a best phase and figure of merit using the ABCD algorithm.

An adaptation of the MIRAS (multiple isomorphous replacement with anomalous scattering) approach36,37 uses the Bijvoet difference to give phase information at each wavelength as in the Matthews method.22 The changes in F′H at different wavelengths are used like isomorphous replacement differences. This approach worked well, but it includes an approximation because it is not defined which phase is being determined. (Each Bijvoet difference indicates the phase at a different wavelength.) Terwilliger38 suggested further approximations, but Burling and colleagues carried out a more precise analysis.26 In this scheme, every pair of observed intensities, either related as a Bijvoet pair or related by a wavelength change, can be treated separately to provide a phase probability curve. These results were compared with those obtained by Hendrickson's method32 and consistent and significant improvement in phasing accuracy was reported. Similar results are reported by other authors. Some differences remain, however, about which phase is to be determined. Burling et al.26 chose diffraction at the "remote" wavelength to represent the "parent," but comment that this represents a difference from calculating the phase of Fⁿ as defined by Karle.31

A Better Way?
We must be clear about what phase is actually to be determined. There are two sensible choices: either the phase of FP (the phase of the structure factor corresponding to the normally scattering atoms in the crystal, but omitting the anomalous scatterers), or the phase of Fⁿ (the phase of the structure factor if all the electrons in the crystal scattered normally). The
36 V. Ramakrishnan, J. T. Finch, V. Graziano, P. L. Lee, and R. M. Sweet, Nature 362, 219 (1993).
37 V. Ramakrishnan and V. Biou, Methods Enzymol. 276, 538 (1997).
38 T. C. Terwilliger, Acta Crystallogr. D 50, 17 (1994).
Fourier transform of FP will show the electron density of all the normal scatterers in the crystal; the Fourier transform of Fⁿ will show the density of all the electrons in the crystal. In the MAD technique, neither of these structure factors is observable. But because the structure factors FH corresponding to the scattering by the anomalous scatterers are calculable from their parameters (equivalently, the structure factor λFᵃ caused by anomalous scattering effects may be calculated), this creates no fundamental problem.

A straightforward approach was suggested by Bella and Rossmann,39 who chose to estimate the phase of FP. Each experimental observation of |F(h)| or |F(−h)|, together with the calculated contribution of the anomalous scatterers FH(h) or FH(−h)*, follows the usual relationship
\[ F(\mathbf{h}) = F_P(\mathbf{h}) + F_H(\mathbf{h}) \]
Using the Harker construction, each observation generates a circle of possible values for FP(h) (Fig. 7), but because of observational and systematic errors the circles do not all intersect perfectly. In the MAD technique there is no direct measure of |FP|, and Bella and Rossmann exploited a method of analysis in which the most probable phase is identified as that where the Harker circles intersect most closely.19 This method has several advantages. The analysis is done directly in terms of the observed quantities |F(h)|. Each observation is treated in an equivalent way. There is total clarity about which phase is being evaluated. Compared with the isomorphous replacement method, there is a complication because there is no direct measure of |FP|.

The analysis does not need to be restricted to the method of selecting close intersection.19 For any chosen value of FP (specifying both amplitude and phase), a lack of closure for each observation of |F(±h)| is readily calculated. As seen on the Harker diagram, it is simply the radial distance xᵢ of this value of FP from the corresponding Harker circle (Fig. 8). This approach could be applied equally well to find a "best" value of the quantity Fⁿ defined by Karle.31

A more sophisticated method of analysis is embodied in the program SHARP (statistical heavy atom refinement and phasing).40 In SHARP, all possible values of an "unperturbed native structure factor" FP* are considered, using a maximum-likelihood formulation in which errors in all observations and parameters of the problem including the "lack of closure" are retained as variables. In this way the effects of correlations and feedback in estimated parameters for a particular reflection are
39 J. Bella and M. G. Rossmann, Acta Crystallogr. D 54, 159 (1998).
40 E. de la Fortelle and G. Bricogne, Methods Enzymol. 276, 472 (1997).
Fig. 7. Harker diagram for the MAD technique. To keep the diagram as simple as possible only two wavelengths are illustrated, denoted as λ1 and λ2. Data from reflection h are identified by a subscript plus symbol, and data from its Friedel mate −h are denoted by a subscript negative symbol. A possible value for FP is indicated as a dashed line, but because |FP| is not observable, its amplitude is unknown. The open circle represents the structure factor if all atoms scattered normally.
included in the analysis, and the resulting estimate of the complex quantity FP* is said to be unbiased. In SHARP every observation is handled equivalently, and all contribute to the likelihood function for FP* (a two-dimensional function representing its amplitude as well as its phase). The absence of any observation of |FP| does not change the formulation of the problem, so the MAD technique can be treated in the same way as usual.

Around the turn of the century, MAD became the most frequently used technique for direct determination of an unknown macromolecular structure (where the structure cannot be inferred from a homologous structure). The different approaches remain in competition. In either case, the analysis proceeds by two steps, the first of which obtains the positions of the anomalous scatterers by Patterson methods, either using the Karle equations for |FⁿH|², or using coefficients derived directly from the Bijvoet differences. Phase determination (or maximum likelihood analysis) then uses all the parameters of the anomalous scatterers. Again, two methods
Fig. 8. An enlarged view of a small part of Fig. 7. The solid square indicates a point chosen as a possible origin for the FP vector. The radial distance, measured along FP, between this point and any Harker circle represents a lack of closure x(FP) for the corresponding observation of |F(±h)|. One such distance is identified.
are available (explicit, using the Karle equations, or pseudo-isomorphous), and there is no reason why the second step should use the same formulation as the first.
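A brute-force version of the "closest intersection" idea is easy to write down. The sketch below is illustrative only (the heavy-atom contributions and amplitudes for one reflection are invented); it scores trial values of FP by the sum of squared radial distances to the Harker circles and keeps the best. SHARP replaces this kind of grid search by a full maximum-likelihood treatment.

```python
import cmath
import math

# Invented data for one reflection: each observation i supplies a calculated
# heavy-atom contribution FH_i (complex) and an observed amplitude |F_i|.
# Because F_i = F_P + FH_i, each observation constrains F_P to a Harker circle
# of radius |F_i| centred at -FH_i.
observations = [
    (complex(30.0,  12.0), 215.0),
    (complex(30.0, -12.0), 201.0),
    (complex(22.0,   9.0), 212.0),
    (complex(22.0,  -9.0), 204.0),
]

def total_misfit(FP):
    """Sum of squared radial lack-of-closure distances x_i(F_P)."""
    return sum((abs(FP + FH) - Fobs) ** 2 for FH, Fobs in observations)

# Brute-force search over amplitude and phase of F_P.
best = min(
    (total_misfit(amp * cmath.exp(1j * math.radians(deg))), amp, deg)
    for amp in range(150, 260, 2)
    for deg in range(0, 360, 2)
)
misfit, amp, deg = best
print(f"best |F_P| ~ {amp}, phase ~ {deg} deg, misfit = {misfit:.1f}")
```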
The Future Is SAD

It seems likely, however, that the various improvements to analyze MAD data more correctly are fading into insignificance. The MAD technique is losing ground to SAD. SAD has problems similar to those of SIR, because there are only two measurements, |F(h)| and |F(−h)|, so that they indicate two possible values for the phase angle as in Fig. 4b. If the distribution of anomalous scattering electrons does not have a tendency to centrosymmetry, the phases of the anomalous scattering contributions will be fairly random, and will not tend to generate false symmetry of the kind that Bokhoven et al. encountered with strychnine.3

Without other information, the "best" value for F(h) is the mean of the two possible structure factors indicated by the two phase angles. The use of this mean structure factor allows considerable error, introducing noise into the electron density map. In the absence of other
errors, the electron density introduced by this noise is equal to the signal, equivalent to a mean phase error of 45°. But, in fact, we know that interpretable maps have been obtained from data with phase errors larger than 45°. Phase errors contribute to "noise" distributed over the whole map, whereas the correct components of the phase information contribute density to "signal" at the atomic positions. As the resolution improves, additional reflections will not only sharpen the image but will also increase the signal-to-noise ratio. For this reason the consequences of phase ambiguity become less severe.

A further method to improve phase estimation was first used by Hendrickson and Teeter.29 It becomes more effective as the diffraction from the anomalous scatterers becomes a larger part of the total scattering. If the scattering by the anomalous scatterers FH(h) represents a significant fraction of the total scattering F(h), the phase angle of F(h) is more likely to be close to that of FH(h), and this may partly resolve the ambiguity between two possible phases. Sim41 derived the relevant probability function.

In the last decade there has been great progress in the improvement of poor-quality electron density maps. In addition to improved refinement procedures for the parameters of anomalous scatterers, the solvent flattening and histogram matching algorithms have become powerful. At a resolution at which individual atoms can be resolved, direct methods allow further refinement. Many authors have reported excellent results using SAD. A significant advantage is that because only one wavelength is used, the complication of resetting the apparatus to precisely defined wavelengths is avoided. Data collection can proceed without interruption, so that radiation damage problems are greatly reduced, and more accurate measurement is possible.

Brodersen et al.42 determined two structures from a direct SAD analysis in 2000. From a thorough comparison of MAD and SAD techniques to interpret eight different structures, Rice et al.43 concluded that "the combination of SAD phasing and solvent flattening will be sufficient to determine most structures." Dauter et al.44 have reported on 13 structures, mostly macromolecules, which with one exception proved interpretable by SAD. An important advance is the use of heavy halide ions in the supernatant, which frequently bind to favorable sites on the protein surface and can be used as anomalous scatterers.45
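The two-fold SAD ambiguity and its "mean structure factor" resolution can be made concrete with a little geometry. In the sketch below (all numbers invented) FP must lie simultaneously on a circle of radius |F(h)| centred at −FH(h) and a circle of radius |F(−h)| centred at −FH(−h)*; the two intersection points are the two possible phases, and their mean is the noisy "best" estimate discussed above.

```python
import cmath
import math

# Illustrative numbers only.  F'_H(h) is invented; for a single type of
# anomalous scatterer F''_H is in quadrature with F'_H (ratio 0.2 is made up).
FHp  = 28.0 * cmath.exp(1j * 0.9)     # F'_H(h)
FHpp = 0.20j * FHp                    # F''_H(h)
FH_plus  = FHp + FHpp                 # F_H(h)
FH_minus = FHp - FHpp                 # F_H(-h)*, from F'_H(-h)=F'_H(h)*, F''_H(-h)=-F''_H(h)*

FP_true = 180.0 * cmath.exp(1j * 2.3) # invented "answer", used only to make
r1 = abs(FP_true + FH_plus)           # consistent amplitudes |F(h)| ...
r2 = abs(FP_true + FH_minus)          # ... and |F(-h)|

# F_P lies on a circle of radius r1 about -F_H(h) and radius r2 about -F_H(-h)*.
c1, c2 = -FH_plus, -FH_minus
d = abs(c2 - c1)
a = (r1**2 - r2**2 + d**2) / (2 * d)
h = math.sqrt(max(r1**2 - a**2, 0.0))
mid = c1 + a * (c2 - c1) / d
off = 1j * h * (c2 - c1) / d
cand1, cand2 = mid + off, mid - off   # the two possible F_P values (SAD ambiguity)
best = 0.5 * (cand1 + cand2)          # the "mean" structure factor discussed above

for name, z in [("candidate 1", cand1), ("candidate 2", cand2),
                ("mean (best)", best), ("true F_P", FP_true)]:
    print(f"{name:12s} |F_P|={abs(z):7.2f}  phase={math.degrees(cmath.phase(z)):7.1f} deg")
```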
41 G. A. Sim, Acta Crystallogr. 12, 813 (1959).
42 D. E. Brodersen, E. de la Fortelle, C. Vonrhein, G. Bricogne, J. Nyborg, and M. Kjeldgaard, Acta Crystallogr. D 56, 431 (2000).
43 L. M. Rice, T. N. Earnest, and A. T. Brunger, Acta Crystallogr. D 56, 1413 (2000).
44 Z. Dauter, M. Dauter, and E. Dodson, Acta Crystallogr. D 58, 494 (2002).
45 Z. Dauter and M. Dauter, Structure 9, R21 (2001).
Bijvoet's work has come full circle. Despite the development of sophisticated algorithms for MAD techniques, emphasis is returning to a method that relies simply on measurement of the mean intensity and the Bijvoet difference at a single wavelength, for every reflection.

Acknowledgment

I thank Gerard Bricogne and Brian Matthews for constructive criticism and helpful comment.
[2] SOLVE and RESOLVE: Automated Structure Solution and Density Modification

By Thomas C. Terwilliger

The analysis of X-ray diffraction measurements from macromolecular crystals and their interpretation in terms of a model of the macromolecule constitute a process that consists of many steps involving significant decisions to be made. Major steps include structure solution (scaling, heavy atom location, and phasing), density modification, model building, and refinement. The complexity of this process has for many years required the involvement of a highly trained crystallographer for reasonable decision-making and successful completion of the process. The combination of several factors has made it possible to carry out the critical structure solution process (scaling through phasing) in a fully automated fashion (SOLVE1). In addition, automated methods for refinement and model building with high-resolution data have been developed (wARP/ARP; Perrakis et al.2,3). The separate automation of structure solution and of model building and refinement presents the promise of full automation of the entire structure determination process from scaling diffraction data to a refined model.

The software packages SOLVE and RESOLVE comprise a suite for automated structure solution using multiple isomorphous replacement (MIR), multiwavelength anomalous dispersion (MAD), single-wavelength anomalous diffraction (SAD), and other approaches. Their defining feature is that they can carry out all the steps necessary for structure solution
1. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 849 (1999).
2. A. Perrakis, T. K. Sixma, K. S. Wilson, and V. S. Lamzin, Acta Crystallogr. D Biol. Crystallogr. 53, 448 (1997).
3. A. Perrakis, R. Morris, and V. S. Lamzin, Nat. Struct. Biol. 6, 458 (1999).
in a fully automated way. SOLVE can scale data, find heavy atom sites, and calculate phases, while RESOLVE can improve density, find patterns of electron density corresponding to helices and sheets, and build a preliminary model. The theory and operation of both SOLVE and RESOLVE have been described in detail.1,4,5 In addition, an overview of SOLVE and a comparison with other methods for finding heavy atom sites is presented elsewhere (see [3] in this volume).5a This chapter reviews how SOLVE and RESOLVE operate, emphasizing the computational approaches and philosophy that have been employed.

General Computational Approaches for Automated Structure Solution in SOLVE and RESOLVE
SOLVE and RESOLVE use three basic principles to carry out automated structure solution. The first is to create a seamless set of subprograms that carry out all the operations that are needed. The second is to develop scoring algorithms that can be used to replace conventional decision-making steps, and the third is to make these part of software systems that are highly error tolerant.

Creating a sequential set of subprograms that carry out all the tasks needed for structure solution is an obvious necessity for automated structure solution. Somewhat less obvious is the importance of having each subprogram provide, in a convenient fashion, all the information necessary for the next one to operate. Many software packages contain programs that can carry out all the steps of structure solution, but often the key parameters for one step (e.g., heavy atom sites) are simply part of the printout from a previous step. The key requirement for automation is that this information and all necessary data to go with it be passed from one step to another in as straightforward a fashion as possible. It is possible to automate by scanning output files for parameters, but this is far more difficult than simply passing on the key information.

Decision-making is the most complex part of structure solution, and is important in density modification and model building as well. There are many ways to make decisions in a process such as structure solution, which is iterative and somewhat branched. The approach taken in SOLVE is to use a scoring system to replace the conventional decision-making process.
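These three principles can be conveyed in a short, schematic sketch (Python, written for this chapter; the class, the scoring function, and the toy data are hypothetical and are not part of SOLVE or RESOLVE):

from dataclasses import dataclass

VERY_POOR = -1.0e6  # sentinel score for any trial that fails outright

@dataclass
class Trial:
    """The structured information each step hands to the next: sites and a score."""
    sites: list
    score: float = VERY_POOR

def score_trial(trial, score_fn):
    """Replace a human decision with a number.

    Any failure inside the scoring chain (zero occupancies, a numerical
    problem, even a programming error in a subprogram) is caught and
    recorded as a very poor score, so that trial is simply out-competed.
    """
    try:
        trial.score = score_fn(trial.sites)
    except Exception:
        trial.score = VERY_POOR
    return trial

def pick_best(trials, score_fn):
    """Decision-making as optimization: keep whichever trial scores best."""
    return max((score_trial(t, score_fn) for t in trials), key=lambda t: t.score)

# Toy usage: the empty trial raises inside the scoring function and is rejected.
toy_score = lambda sites: sum(sites) / len(sites)
best = pick_best([Trial([0.8, 0.7]), Trial([0.9, 0.9, 0.8]), Trial([])], toy_score)
print(best.sites, best.score)

The point of the sketch is simply that every step returns a structured object carrying what the next step needs, and that a failure produces an uncompetitive score rather than halting the pipeline.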
4. T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 55, 1863 (1999).
5. T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 58, 2082 (2002).
5a. C. M. Weeks, P. D. Adams, J. Berendzen, A. T. Brunger, E. J. Dodson, R. W. Grosse-Kunstleve, T. R. Schneider, G. M. Sheldrick, T. C. Terwilliger, M. G. W. Turkenburg, and I. Usón, Methods Enzymol. 374, [3], 2003 (this volume).
In SOLVE, what is scored is the quality of each potential heavy atom solution. The advantage of this approach is that once a scoring system is devised, the decision-making process simply becomes an optimization process for which there are many well-known algorithms.

Error-tolerant programming is a final and useful element in automating complex procedures. Although it would be convenient to eliminate all errors in algorithms and in programming these algorithms, in practice it is difficult to reduce these to fewer than about 1 error per 1000 lines of software code. In a package with 150,000 lines of code, it is reasonable to expect hundreds of errors. The approach used in SOLVE and RESOLVE is to minimize the effects of these difficult-to-identify errors by using the scoring and optimization system to reject analyses that result in identifiable errors. For example, any time the refinement of heavy atom positions in SOLVE fails for any reason, from zero occupancy of all sites to a programming error leading to a failed refinement, the score for that heavy atom solution is recorded as "very poor." That particular solution is then rejected and other similar solutions are considered instead. As other solutions may have different characteristics, they may refine successfully and be used, getting around either truly incorrect solutions or programming errors that prevent successful refinement. This approach rests on the reasonable premise that most errors will lead to poor scores.

SOLVE: Automated Structure Solution
The SOLVE software1 is capable of full automation of structure solution for MIR (isomorphous replacement) and MAD or SAD (anomalous diffraction) data. SOLVE can begin with raw measurements of crystallographic intensities, scale the data, carry out the process of finding the heavy atom sites, refine parameters, and calculate an electron density map. The automation of this key step shows the feasibility of automating all steps in structure determination. The SOLVE software also demonstrates the usefulness of a decision-making process based on a scoring algorithm and provides a basis for decision-making in full structure determination.

Key Technical Developments for Automated Structure Solution

SOLVE contains a number of advances in the central steps in structure solution as well as the seamless linkage and decision-making necessary for automation. SOLVE includes developments in the treatment of MAD data, estimation of heavy atom structure factor amplitudes (FA values), heavy atom refinement, and phasing, which in turn have made automated structure solution practical.
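The first of these developments, described in the next paragraph, treats a MAD data set as if it were a SIRAS data set. As a rough illustration of the kind of quantities involved for a single reflection, the following sketch was written for this chapter; it is not SOLVE's code, and its choices of wavelength pairs are deliberately crude:

import numpy as np

def siras_like_quantities(f_plus, f_minus, peak=0, remote=-1):
    """Crude reduction of one reflection's MAD measurements to SIRAS-like terms.

    f_plus, f_minus: |F+| and |F-| at each wavelength (equal-length sequences).
    peak:   index of the wavelength with the largest f'' (strongest anomalous signal).
    remote: index of a wavelength far from the absorption edge.

    The mean amplitude stands in for a "native" amplitude, the dispersive
    difference between remote and peak wavelengths stands in for an isomorphous
    difference, and the Bijvoet difference at the peak wavelength stands in for
    the anomalous difference.  SOLVE forms optimized, uncertainty-weighted
    versions of these quantities (see the data preparation discussion in [3]).
    """
    f_plus = np.asarray(f_plus, dtype=float)
    f_minus = np.asarray(f_minus, dtype=float)
    f_mean = 0.5 * (f_plus + f_minus)
    pseudo_native = f_mean.mean()
    dispersive_difference = f_mean[remote] - f_mean[peak]
    anomalous_difference = f_plus[peak] - f_minus[peak]
    return pseudo_native, dispersive_difference, anomalous_difference

# One reflection measured at peak, inflection, and remote wavelengths:
print(siras_like_quantities([102.0, 100.5, 98.0], [97.0, 99.5, 98.2], peak=0, remote=2))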
A MAD data set containing any number of wavelengths of data can be viewed, to a good approximation, as if it were a single isomorphous replacement data set with anomalous scattering (SIRAS6). This meant that our fast and unbiased method for refinement of heavy atom parameters using an origin-removed difference Patterson function7 could be used for heavy atom refinement. A method for estimation of amplitudes of heavy atom structure factors (FA values) in an optimal way from MAD data was also developed.8 This allowed the calculation of optimal heavy atom Patterson functions that could be searched with our automated heavy atom search procedure9 to find initial partial solutions for the locations of anomalously scattering atoms in the crystals. Once the parameters describing the locations, occupancies, and thermal factors for the anomalously scattering atoms were refined, we used our Bayesian correlated MAD phasing approach to calculate optimal estimates of the crystallographic phases.10

For multiple isomorphous replacement (MIR) structure solution, SOLVE incorporates methods for phasing that include a detailed analysis of error estimates.11 It also includes methods for calculating phases in cases in which the derivatives have substantial lack of isomorphism, but in which this nonisomorphism is correlated among several derivatives.12 The use of this Bayesian correlated phasing allows the SOLVE software to handle MIR data sets that contain serious nonisomorphism in exactly the same way as it deals with data sets that are highly isomorphous.

Decision-Making in Automated Structure Solution

One of the most time-consuming steps in macromolecular structure solution is the testing of many possible arrangements of heavy atoms in the crystal for compatibility with the X-ray data. In the multiwavelength (MAD) technique, for example, the differences in X-ray diffraction at different X-ray wavelengths can be used to generate sets of potential arrangements of heavy atoms in the crystal, but they generally cannot be used directly to identify which one of these arrangements is correct. Consequently, the crystallographer is faced with the tedious prospect of using manual or semiautomated methods to generate potential heavy atom arrangements, using each heavy atom arrangement to calculate an electron
6. T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 50, 17 (1994).
7. T. C. Terwilliger and D. Eisenberg, Acta Crystallogr. A 39, 813 (1983).
8. T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 50, 11 (1994).
9. T. C. Terwilliger, S.-H. Kim, and D. Eisenberg, Acta Crystallogr. A 43, 1 (1987).
10. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 53, 571 (1997).
11. T. C. Terwilliger and D. Eisenberg, Acta Crystallogr. A 43, 6 (1987).
12. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 52, 749 (1996).
density map, and looking at each map on a graphics screen to subjectively evaluate whether it looks like a picture of a protein. In parallel with this, the crystallographer evaluates in a subjective way whether each heavy atom arrangement appears to be compatible with the differences in diffraction at different X-ray wavelengths used to generate it (using the Patterson function), and whether a subset of the heavy atom sites in each arrangement can be used to derive the other sites.13

A key feature of the SOLVE software is the introduction of a uniform scoring scheme for evaluating heavy atom arrangements. For each criterion that crystallographers have used to compare potential solutions, a Z-score is calculated that reflects how each trial arrangement compares with the mean and standard deviation of the set of trial solutions as a whole. The composite Z-score for each trial arrangement is then used as the criterion for choosing the arrangement leading to a good electron density map. At every stage at which a decision would ordinarily be made by a crystallographer, such as whether to include a particular heavy atom site in a heavy atom arrangement or not, the choice is made by picking the arrangement that leads to the higher score. In this way, a complicated decision-making process is converted into a well-defined optimization process.

Scoring Criteria for Heavy Atom Partial Structures
We developed four criteria for evaluating the quality of a heavy atom partial structure. These are as follows:

- Evaluation of the match between a heavy-atom model and the Patterson function
- Cross-validation difference Fourier maps
- Figure of merit (internal consistency) of the phasing
- Analysis of the native Fourier maps

The match between a heavy atom model and the Patterson function has always been an important criterion in the MIR and MAD methods.14 Our scoring in this case essentially consists of the average value of the Patterson function at the predicted locations of peaks (based on the model), weighted by a factor based on the number of heavy atom sites in the trial solution.

Our cross-validation difference Fourier method is based on the idea of Dickerson et al.13 for MIR cross-validation, in which one derivative is omitted from phasing and the others are used to calculate phases and a heavy
13. R. E. Dickerson, J. C. Kendrew, and B. E. Strandberg, Acta Crystallogr. 14, 1188 (1961).
14. T. L. Blundell and L. N. Johnson, "Protein Crystallography," p. 368. Academic Press, New York, 1976.
atom difference Fourier for it. We extended this to MAD data by omitting one site at a time and using all others to calculate phases and a heavy atom Fourier. Those sites that have a high peak height in the resulting map are likely to be correct.

The third criterion for scoring is the figure of merit. Although the figure of merit is sensitive to errors in heavy atom occupancies, the origin-removed Patterson refinement procedure we use11 yields essentially unbiased estimates of these parameters, so the criterion is useful.1

The final criterion for judging the quality of a heavy atom arrangement is the quality of the resulting electron density map. There are several features of protein crystals that could be used for such a measure. One of these would be the connectivity of the positive density in the map.15 Another feature of protein crystals is the presence of distinct regions of protein and solvent. The electron density maps in protein regions are rough, whereas in the solvent regions they are flat. Protein molecules are relatively compact, and when they pack together in a crystal the regions between them are filled with solvent. As the solvent molecules are not fixed, and as an X-ray diffraction experiment averages the scattering of X-rays both over time and over the many repetitions of the protein in the crystal, the electron density in the solvent region is generally flat. In contrast, the electron density in the protein region is rough, as it is high at the locations of protein atoms and low between them. We have found that the variation in local roughness is a powerful indicator of the quality of an electron density map.16,17 The SOLVE software uses the variation in local roughness as a measure of the quality of the electron density map. Using this measure, it can reliably differentiate between a random map and a noisy map of a protein molecule at just about the same map quality (a signal-to-noise ratio of about 1) that a crystallographer can.1,16

SOLVE: Working Automated System for Structure Solution
We have put the algorithms and decision-making process described above into a single system (SOLVE) that is capable of fully automated structure solution.1 The information that is required from a user consists of (1) the locations of data files, (2) space group and symmetry information, (3) the identity of the anomalously scattering atom, scattering factor estimates, and the number of sites (for MAD data), and (4) the number of
15. D. Baker, A. E. Krukowski, and D. A. Agard, Acta Crystallogr. D Biol. Crystallogr. 49, 186 (1993).
16. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 501 (1999).
17. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 1872 (1999).
Fig. 1. SOLVE electron density map for IF-5A. Reprinted from T. C. Terwilliger, ‘‘Maximum-likelihood density modification for X-ray crystallography’’, in Image reconstruction from incomplete data, SPIE, vol. 4123, pp. 243–247.
amino acid residues in the asymmetric unit (for proteins). The output from this system consists of information about heavy atom locations, estimates of crystallographic phases, and an electron density map. The software has been used to solve structures as large as the ribosome 30S subunit18 and with as many as 56 selenium sites (W. Smith and C. Jansen, unpublished results).

Figure 1 shows an example of an electron density map produced automatically by the SOLVE software.19 The X-ray data consisted of three wavelengths of selenomethionine MAD data collected to a resolution of 2.1 Å. The protein (initiation factor 5A; IF-5A) contains 149 amino acids, and there is one molecule in the asymmetric unit of space group P4122. SOLVE found three of the four selenium atoms (the selenomethionine at position 7 was relatively disordered). The electron density map produced by SOLVE was highly interpretable. Figure 1 shows this map overlaid with the final refined atomic model.
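The map-quality criterion described above, the variation in local roughness of the electron density, can be sketched in a few lines. The following is an illustrative calculation on a gridded map written for this chapter, not the scoring code used in SOLVE:

import numpy as np
from scipy.ndimage import uniform_filter

def local_roughness_score(rho, box=5):
    """Score a gridded electron density map by the variation in local roughness.

    rho: 3D numpy array of map values.  For each grid point, the local
    roughness is the RMS deviation of the density within a small box; a map
    with distinct protein (rough) and solvent (flat) regions shows a large
    spread of local roughness values, whereas a uniformly random map does not.
    """
    mean = uniform_filter(rho, size=box)
    mean_sq = uniform_filter(rho * rho, size=box)
    roughness = np.sqrt(np.clip(mean_sq - mean * mean, 0.0, None))
    return roughness.std() / (roughness.mean() + 1e-9)

# A map with a flat half and a noisy half scores higher than pure noise.
rng = np.random.default_rng(0)
protein_like = rng.normal(size=(20, 20, 40))
protein_like[..., :20] *= 0.1          # "solvent" half is nearly flat
random_map = rng.normal(size=(20, 20, 40))
print(local_roughness_score(protein_like), local_roughness_score(random_map))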
18. W. M. Clemons, J. L. C. May, B. T. Wimberly, J. P. McCutcheon, M. S. Capel, and V. Ramakrishnan, Nature 400, 833 (1999).
19. T. S. Peat, J. Newman, G. S. Waldo, J. Berendzen, and T. C. Terwilliger, Structure 6, 1207 (1998).
RESOLVE: Statistical Density Modification
Density modification is a method for improving the quality of electron density maps by incorporating real-space information such as the flatness of the solvent region.20 If density modification techniques could be made even more powerful than they already are, then structures could be solved with fewer methionines for selenomethionine MAD, with weakly diffracting crystals, and with less diffraction data. In addition, because of the high volume of data collection in structural genomics efforts and the limited supply of the necessary synchrotron beam time, it is advantageous to collect a minimal amount of X-ray data sufficient to obtain phase information. The MAD method requires several complete data sets to be collected at different X-ray wavelengths, whereas the related SAD (single-wavelength anomalous diffraction) method21 requires only one complete data set with anomalous measurement. When the SAD method is used alone, it leads to incomplete phasing information, but with density modification it can lead to highly interpretable electron density maps.22,23

Basis for Density Modification

Many density modification methods have been developed. Exceptionally powerful for phase improvement are solvent flattening and noncrystallographic symmetry averaging.20,24–27 Additional density modification methods include histogram matching and phase extension,28 entropy maximization,29 iterative skeletonization,30,31 and iterative model building and refinement.2,3 The fundamental basis of density modification is that there are many possible sets of structure factor amplitudes and phases that are all reasonable based on the limited experimental data. Those structure factors that lead to maps that are most consistent with both the
20. B. C. Wang, Methods Enzymol. 115, 90 (1985).
21. Z. J. Liu, E. S. Vysotski, C. J. Chen, J. P. Rose, J. Lee, and B. C. Wang, Protein Science 9, 2085 (2000).
22. Z. Dauter and M. Dauter, J. Mol. Biol. 289, 93 (1999).
23. M. A. Turner, C. S. Yuan, R. T. Borchardt, M. S. Hershfield, G. D. Smith, and P. L. Howell, Nat. Struct. Biol. 5, 369 (1998).
24. J. P. Abrahams and A. G. W. Leslie, Acta Crystallogr. D Biol. Crystallogr. 52, 30 (1996).
25. G. Bricogne, Acta Crystallogr. A 30, 395 (1974).
26. K. D. Cowtan and P. Main, Acta Crystallogr. D Biol. Crystallogr. 49, 148 (1993).
27. F. M. D. Vellieux and R. J. Read, Methods Enzymol. 277, 18 (1997).
28. K. Y. J. Zhang and P. Main, Acta Crystallogr. A 46, 377 (1990).
29. S. B. Xiang, C. W. Carter, G. Bricogne, and C. J. Gilmore, Acta Crystallogr. A 49, 193 (1993).
30. D. Baker, C. Bystroff, R. J. Fletterick, and D. A. Agard, Acta Crystallogr. D Biol. Crystallogr. 49, 429 (1993).
31. C. Wilson and D. A. Agard, Acta Crystallogr. A 49, 97 (1993).
experimental data and prior knowledge about what the electron density map should look like are the most likely overall.

Until recently, the statistical foundation of density modification was poorly developed.32–34 The procedure used to carry out density modification has involved iterations of calculating an electron density map, using the best estimates of phases; modifying the map by flattening the solvent or otherwise making it conform to expectations; calculating "model" phases; and estimating new phases based on a weighted average of model and experimental phases.20 The problem with this approach is that the model phases are partly based on the experimental ones, so that it is not clear how to weight the model and experimental phases in an optimal fashion. Several methods have been devised to address this problem, including "solvent flipping"35 and cross-validation,32,36 but neither of these approaches fully addresses the fundamental problem of the correlation between model and experimental phases.34

We have invented a method of density modification based on a statistical formulation that preserves the independence of model and experimental phases (see Terwilliger4,5 for details). This statistical density modification technique (previously known as maximum-likelihood density modification) is able to make much better use of knowledge about characteristics of electron density maps than the earlier methods because of its improved and completely different statistical treatment.34 In particular, statistical density modification is able to treat properly the relationship between experimental phases and phase information obtained by solvent flattening. In addition, statistical density modification is capable of taking advantage of information specifying which parts of the modified electron density map are known and which are not, whereas previous formulations could not.

The statistical density modification method we have developed can be applied to a wide range of situations in which information from experimental measurements is to be combined with information derived from expectations about plausible electron density arrangements in a map. These range from solvent flattening and noncrystallographic symmetry averaging to phasing using a partial model and molecular replacement.5 Furthermore, as the method has a sound statistical basis, it can be combined with probabilistic methods for molecular fragment detection in an electron density map37 to yield phase improvement and to serve as the basis for model building.
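In schematic form (the notation here is ours; the full treatment is given under Mathematics of Statistical Density Modification below and in Terwilliger4,5), the quantity being maximized can be written as a sum of an experimental term and a map term:

% Total log-probability of a set of structure factors {F_h}: an experimental
% likelihood term plus a map term, the latter an integral over the unit cell
% of a local log-probability of the electron density computed from {F_h}.
\[
  \ln P_{\mathrm{total}}\bigl(\{F_h\}\bigr)
  \;=\; \ln P_{\mathrm{obs}}\bigl(\{F_h\}\bigr)
  \;+\; \int_{V} \ln p\bigl(\rho(\mathbf{x};\{F_h\})\bigr)\, d^{3}x,
  \qquad
  \rho(\mathbf{x};\{F_h\}) \;=\; \frac{1}{V}\sum_{\mathbf{h}} F_{\mathbf{h}}\,
  e^{-2\pi i\,\mathbf{h}\cdot\mathbf{x}} .
\]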
32. K. D. Cowtan and P. Main, Acta Crystallogr. D Biol. Crystallogr. 52, 43 (1996).
33. K. Cowtan, Acta Crystallogr. D Biol. Crystallogr. 55, 1555 (1999).
34. K. Cowtan, in "CCP4 Newsletter": http://www.dl.ac.uk/CCP/CCP4/newsletter38/07_gaussian.html
35. J. P. Abrahams, Acta Crystallogr. D Biol. Crystallogr. 53, 371 (1997).
36. A. L. U. Roberts and A. T. Brunger, Acta Crystallogr. D Biol. Crystallogr. 51, 990 (1995).
Statistical Density Modification

The general idea of statistical density modification is simple. It combines the experimental information that is available about the probability of a particular value of the phase for each reflection with an examination of the electron density map that results from such a set of phases. Statistical density modification is a way of finding the set of phases that is compatible with the experimental information and that produces the most plausible electron density map.

There are a number of cases in which we may have a good idea about what the features in an electron density map should look like. If we have even a poor set of crystallographic phases, then the electron density maps of most macromolecules will show regions that are relatively flat (the solvent region) and others that have a large amount of variation (where the macromolecule is located). Once the solvent region is identified, we can be confident that the true electron density in that region is nearly constant. This means we have a great deal of information about the pattern of electron density in that region. Another case occurs if we have been able to identify a feature such as an α-helix in the map. As we know what α-helices look like, we may have a better idea of the electron density in that region than the map alone provides. The power of density modification (and statistical density modification in particular) is that by choosing a set of phases that are compatible with the experimental information and that lead to a plausible map, all the other features of the map become clearer as well, even those about which we have no information. This comes about because the crystallographic phases are improved by the density modification.20,25

The key requirement for statistical density modification is that we have an estimate, for each point in our electron density map, of what values of electron density are plausible. This can be in the form of a probability distribution for each point in the map, or an estimate of electron density and an uncertainty, for example.

Mathematics of Statistical Density Modification

The goal of statistical density modification is to come up with the set of crystallographic phases that is the most likely given all available information. To carry out statistical density modification,4,5 we express the logarithm of the probability of a set of structure factors as the sum of two basic quantities: (1) the log probability that we would have measured the
37. K. Cowtan, Acta Crystallogr. D Biol. Crystallogr. 54, 750 (1998).
observed set of structure factors if this structure factor set were correct, and (2) the log probability that the map resulting from this structure factor set is consistent with our prior knowledge about this and other macromolecular structures. In this formulation, density modification consists of maximizing the total probability. To maximize this probability it is necessary both to define a "map probability function" and to have a practical way of finding structure factors that maximize it.

We developed a formulation of the map probability function that allows a straightforward and rapid optimization of the total probability.4,5 The log probability for an electron density map is written as the integral over the map of a local log probability of electron density. In essence, we look at the map, point by point, and evaluate how likely it is that the true electron density at this point has the value given in the map. For example, if we look in the solvent region, it is unlikely that the true electron density would have a high or low value. We have shown that as long as the first and second derivatives of the local log probability of electron density with respect to electron density can be calculated, a steepest ascent method can be used to optimize the total probability when expressed in this way.4,5 In this broad class of situations, a fast Fourier transform (FFT)-based method can be used to approximate derivatives of the total map log probability function with respect to each structure factor. These derivatives in turn can then be used in a Taylor series expansion to approximate the total map log probability function as a function of each structure factor. This makes it practical to optimize the total probability because the other terms (a priori knowledge of phases and experimental phase information) are also normally expressed separately for each structure factor.

Outline of Cycle of Solvent Flattening-Based Phase Improvement with RESOLVE

The process used to improve crystallographic phases with this maximum-probability algorithm is straightforward in concept. At the start of the process, the crystallographic phases are known only approximately (an experimentally derived probability distribution for their possible values is known). Each cycle of solvent flattening using statistical density modification (RESOLVE) involves six basic steps (see Terwilliger5 for more details). They are as follows:

- Calculation of an electron density map based on current best estimates of crystallographic phases
- Estimation of the probability that each point in the map lies in the protein or solvent region
- Calculation of expected probability distributions for values of electron density in the protein and solvent regions (based on statistics of maps generated from model structures)
- Calculation of the local map log probability function (the logarithm of the probability that each value of electron density in the map is consistent with the designated protein and solvent regions)
- Calculation of how the probability of the map would change if an individual phase were changed
- Adjustment of each phase to maximize its contribution to the total probability (the probability of the map plus the probability based on the experimental measurements of the phase)

The local map log probability function is a critical element in our statistical density modification approach. This probability function can include any type of expectations about the electron density value at a particular point in the map. In particular, we have shown that expectations about electron density values at points both in the solvent region and in the protein region of a protein crystal can be included in statistical density modification and that this approach can be powerful for improving crystallographic phases.4,5 We have also shown that the same approach can be used to incorporate detailed information about patterns of electron density in a map, such as those corresponding to secondary structural elements in a protein structure (see below and Terwilliger5).

Example of Statistical Density Modification Using Solvent Flattening

Figure 2 illustrates the power of statistical density modification based on solvent flattening.5 Figure 2A shows a section through a model (perfect) electron density map based on the refined model of IF-5A (from Fig. 1). We then created a poor "experimental" electron density map by using just one of the three selenium atoms used in the IF-5A selenomethionine MAD structure solution example shown in Fig. 1 to calculate phases. This electron density map is shown in Fig. 2B. The correlation coefficient of this initial map to one calculated using the final model is only 0.37. Crystals of IF-5A contain 60% solvent, and the solvent region can be identified from this initial electron density map (notice that the contours of electron density are less pronounced on the right side of Fig. 2B, where the solvent is located). We tested both our statistical density modification technique and existing methods (dm, Cowtan and Main32; SOLOMON, Abrahams35) for the improvement in map quality obtained with solvent flattening. Figure 2C shows the RESOLVE-modified map. It has a correlation coefficient to the model map of 0.79, and the strand running vertically next to the
Fig. 2. Section through electron density maps. (A) Model map; (B) map created by SOLVE (using just one selenium in phasing); (C) map produced from (B) by RESOLVE (maximum-likelihood density modification); and (D) map produced from (B) by dm (conventional density modification). See text for details. Reprinted from T. C. Terwilliger, "Maximum-likelihood density modification for X-ray crystallography," in Image Reconstruction from Incomplete Data, SPIE, vol. 4123, pp. 243–247.
solvent region is clearly visible. Figure 2D shows the dm-modified map. It has a correlation coefficient to the model map of 0.65, and the density is much less clear. A similar result (correlation coefficient of 0.63) was obtained with SOLOMON.

Example of Pattern Matching with Statistical Density Modification

Figure 3 illustrates how statistical density modification can be combined with a search for fragments of secondary structure in an electron density map. We tested a pattern-matching approach to density modification, using the armadillo repeat region of β-catenin, which is largely helical and contains 50% solvent.38 This structure was solved at a resolution of 2.7 Å, using MAD phasing on 15 selenium atoms incorporated into methionine residues in the protein. To make the test suitably difficult, we used only 3 of the 15 selenium atoms in calculating initial phases. As expected, this led to a noisy map; the correlation coefficient of this map with a map calculated on the basis of phases from the refined model was only 0.33 (Fig. 3A). The statistical density modification approach (without any pattern recognition) resulted in a great improvement in the map, with a correlation coefficient of 0.62 (Fig. 3C).

Next we identified the location of helices in the map, using an FFT-based method37 in which a template of a helix (in all orientations) was placed at all locations in the unit cell and the correlation of the template with the local electron density was calculated. Those locations where there was a high correlation were considered to be locations of helices (Fig. 3B). The expected electron density in the region was then estimated from the template, and these expectations about the map, combined with the expectation of a flat solvent region, were used in statistical density modification. The statistical density modification with pattern recognition of helices improved the map even more substantially, with an overall correlation coefficient of 0.67 (Fig. 3D). This density-modified map is of sufficiently high quality that a model could be built into most of it, yet it is derived using phases based on just 3 selenium atoms in 700 amino acid residues and an initial map that is completely uninterpretable. This example shows that advances in density modification techniques will allow even further extensions of the range of targets that are accessible to X-ray crystallographic structure analysis. This example of pattern recognition combined with density modification also suggests the possibility of iterative pattern recognition and map improvement as a way of building up an atomic model of a macromolecule.
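The FFT-based template search used here to locate helices can be illustrated schematically. The sketch below, written for this chapter, scores a single template orientation against a map by translational cross-correlation; a real search, such as the method of Cowtan,37 also loops over template orientations and normalizes the correlation locally. The data in the example are synthetic:

import numpy as np

def translational_correlation(rho, template):
    """Cross-correlate a template with a map over all translations via FFT.

    rho:      3D map on a grid (numpy array).
    template: 3D template of the same shape (zero-padded); both are
              mean-subtracted before correlation.
    Returns a grid of correlation scores; high values flag candidate placements.
    """
    r = rho - rho.mean()
    t = template - template.mean()
    return np.fft.ifftn(np.fft.fftn(r) * np.conj(np.fft.fftn(t))).real

# Toy example: hide a blob in a noisy map and recover its offset.
rng = np.random.default_rng(1)
rho = rng.normal(scale=0.2, size=(32, 32, 32))
blob = np.zeros_like(rho)
blob[4:8, 4:8, 4:8] = 1.0
rho += np.roll(blob, (10, 5, 7), axis=(0, 1, 2))        # "helix" placed at an offset
scores = translational_correlation(rho, blob)
print(np.unravel_index(scores.argmax(), scores.shape))  # expected (10, 5, 7)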
38. A. H. Huber, W. J. Nelson, and W. I. Weis, Cell 90, 871 (1997).
Fig. 3. Pattern matching and statistical density modification. (A) "Experimental" electron density map obtained with 3 selenium atoms in 700 amino acids used in phasing. The model is the refined model of β-catenin (Huber et al.38). (B) Electron density based on helical segments recognized by pattern matching. Note that some of the helices were recognized in this poor map, but not all. (C) Electron density after statistical density modification (without pattern matching) of the map shown in (A). (D) Electron density after statistical density modification with pattern matching. CC, correlation coefficient.
The identification of helices in Fig. 3A and their use in making the map shown in Fig. 3D correspond to building part of the atomic model (the main-chain atoms for some of the helical segments) for β-catenin. We expect that this process could be extended and repeated to build up a large part of the structure.

Conclusions and Summary
SOLVE and RESOLVE have shown that it is possible to automate a significant part of the macromolecular X-ray structure determination process. The key elements of seamless and compatible subprograms, scoring algorithms, and error-tolerant software systems have been important in implementing these programs. The principles used in SOLVE and
RESOLVE can be applied to other aspects of structure determination as well, suggesting that full automation of the entire structure determination process from scaling diffraction data to a refined model will be possible in the near future.
[3] Automatic Solution of Heavy-Atom Substructures

By Charles M. Weeks, Paul D. Adams, Joel Berendzen, Axel T. Brunger, Eleanor J. Dodson, Ralf W. Grosse-Kunstleve, Thomas R. Schneider, George M. Sheldrick, Thomas C. Terwilliger, Maria G. W. Turkenburg, and Isabel Usón

Introduction
With the exception of small proteins that can be solved by ab initio direct methods1 or proteins for which an effective molecular replacement model exists, protein structure determination is a two-step process. If two or more measurements are available for each reflection with differences arising only from some property of a small substructure, then the positions of the substructure atoms can be found first and used as a bootstrap to initiate the phasing of the complete structure. Historically, substructures were first created by isomorphous replacement in which heavy atoms (usually metals) are soaked into crystals without displacing the protein structure, and measurements were made from both the unsubstituted (native) and substituted (derivative) crystals. When possible, measurements were made also of the anomalous diffraction generated by the metals at appropriate wavelengths. Now, it is common to incorporate anomalous scatterers such as selenium into proteins before crystallization and to make measurements of the anomalous dispersion at multiple wavelengths.

The computational procedures that can be used to solve heavy-atom substructures include both Patterson-based and direct methods. In either case, the positions of the substructure atoms are determined from difference coefficients based on the measurements available from the diffraction experiments, as summarized in Table I. The isomorphous difference magnitude, |ΔF|iso (= ||FPH| − |FP||), approximates the structure amplitude |FH cos(φ)|, and the anomalous-dispersion difference magnitude, |ΔF|ano (= ||F+| − |F−||), approximates 2|FH″ sin(φ)|. (The angle φ is the difference between the phase of the whole protein and that of the substructure.)
1. G. M. Sheldrick, H. A. Hauptman, C. M. Weeks, R. Miller, and I. Usón, in "International Tables for Crystallography" (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 333. Kluwer Academic, Dordrecht, The Netherlands, 2001.
TABLE I
Measurements Used for Substructure Determination(a)

Acronym      | Type of experiment                                           | Measurements
SIR          | Single isomorphous replacement                               | FP, FPH
SIRAS        | Single isomorphous replacement with anomalous scattering     | FP, FPH+, FPH−
MIR          | Multiple isomorphous replacement                             | FP, FPH1, FPH2, ...
MIRAS        | Multiple isomorphous replacement with anomalous scattering   | FP, FPH1+, FPH1−, FPH2+, FPH2−, ...
SAD or SAS   | Single anomalous dispersion or single anomalous scattering   | FPH+, FPH− at one wavelength
MAD          | Multiple anomalous dispersion                                | FPH+, FPH− at several wavelengths

(a) The notation used for the structure factors is FP (native protein), FPH (derivative), FH or FA (substructure), and F+ and F− (for F(hkl) and F(−h,−k,−l), respectively, in the presence of anomalous dispersion).
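As a minimal illustration of the difference magnitudes defined above and in Table I (a sketch written for this chapter, not code from any of the programs discussed):

def iso_difference(f_p, f_ph):
    """Isomorphous difference |dF|iso = ||F_PH| - |F_P||, approximating |F_H cos(phi)|."""
    return abs(abs(f_ph) - abs(f_p))

def ano_difference(f_plus, f_minus):
    """Anomalous difference |dF|ano = ||F+| - |F-||, approximating 2|F_H'' sin(phi)|."""
    return abs(abs(f_plus) - abs(f_minus))

# One reflection: native and derivative amplitudes, and a Bijvoet pair.
print(iso_difference(f_p=250.0, f_ph=265.0))        # 15.0
print(ano_difference(f_plus=262.0, f_minus=259.5))  # 2.5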
When SIRAS or MAD data are available, the differences can be combined to give an estimate of the complete FA structure factor.2,3

Both Patterson and direct methods require extremely accurate data for the successful determination of substructures. Care should be taken to eliminate outliers and observations with small signal-to-noise ratios, especially in the case of single anomalous differences. Fortunately, it is usually possible to be stringent in the application of appropriate cutoffs because the problem is overdetermined in the sense that the number of available observations is much larger than the number of heavy-atom positional parameters. In particular, it is important that the largest isomorphous and anomalous differences be reliable. The coefficients that are used consider small differences between two or more much larger measurements, so errors in the measurements can easily disguise the true signal. If there are even a few outliers in a data set, or some of the large coefficients are serious overestimates, substructure determination is likely to fail.

Patterson and direct-methods procedures have been implemented in a number of computer programs that permit even large substructures to be determined with little, if any, user intervention. (The current record is 160 selenium sites.) The methodology, capabilities, and use of several such
2. J. Karle, Acta Crystallogr. A 45, 303 (1989).
3. W. Hendrickson, Science 254, 51 (1991).
popular programs and program packages are described in this chapter. The SOLVE4 program, which uses direct-space Patterson search methods to locate the heavy-atom sites, provides a fully automated pathway for phasing protein structures, using the information obtained from MIR or MAD experiments. The two major software packages currently in use in macromolecular crystallography [i.e., the Crystallography and NMR System (CNS5) and the Collaborative Computational Project Number 4 (CCP46)] provide internally consistent formats that make it easy to proceed from heavy-atom sites to density map, but user intervention is required. CNS employs both direct-space and reciprocal-space Patterson searches. The CCP4 suite includes programs for computing Pattersons as well as the direct-method programs RANTAN7 and ACORN.8 The dual-space direct-method programs SnB9,10 and SHELXD11,11a provide only the heavy-atom sites, but they are efficient and capable of solving large substructures currently beyond the capabilities of programs that use only Patterson-based methods. SnB uses a random number generator to assign initial positions to the starting atoms in its trial structures, but SHELXD strives to obtain better-than-random initial coordinates by deriving information from the Patterson superposition minimum function. In some cases, this has significantly decreased the computing time needed to find a heavy-atom solution. Other direct-method programs (e.g., SIR200012), not described in this chapter, also can be used to solve substructures.

Pertinent aspects of data preparation are described in detail in the following sections devoted to the individual programs. Automated or semiautomated procedures for locating heavy-atom sites operate by generating many trial structures. Thus, a key step in any such procedure is the scoring or ranking of trial structures by some measure of quality in such a way that
4. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 55, 849 (1999).
5. A. T. Brunger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J.-S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, and G. L. Warren, Acta Crystallogr. D. Biol. Crystallogr. 54, 905 (1998).
6. Collaborative Computational Project Number 4, Acta Crystallogr. D. Biol. Crystallogr. 50, 760 (1994).
7. J.-X. Yao, Acta Crystallogr. A 39, 35 (1983).
8. J. Foadi, M. M. Woolfson, E. J. Dodson, K. S. Wilson, J.-X. Yao, and C.-D. Zheng, Acta Crystallogr. D. Biol. Crystallogr. 56, 1137 (2000).
9. R. Miller, S. M. Gallo, H. G. Khalak, and C. M. Weeks, J. Appl. Crystallogr. 27, 613 (1994).
10. C. M. Weeks and R. Miller, Acta Crystallogr. D. Biol. Crystallogr. 55, 492 (1999).
11. G. M. Sheldrick, in "Direct Methods for Solving Macromolecular Structures" (S. Fortier, ed.), p. 401. Kluwer Academic, Dordrecht, The Netherlands, 1998.
11a. T. R. Schneider and G. M. Sheldrick, Acta Crystallogr. D. Biol. Crystallogr. 58, 1772 (2002).
12. M. C. Burla, M. Camalli, B. Carrozzini, G. L. Cascarano, C. Giacovazzo, G. Polidori, and R. Spagna, Acta Crystallogr. A 56, 451 (2000).
any probable solution can be identified. Therefore, the methods used to accomplish this are described for each program, along with methods for validating the correctness of individual sites. Where applicable, methods used to determine the correct hand (enantiomorph) and refine the substructure also are described. Finally, interesting applications to large selenomethionine derivatives, substructures phased by weak anomalous signals, and substructures created by short halide cryosoaks are discussed.

SOLVE
In favorable cases, the determination of heavy-atom substructures using MAD or MIR data is a straightforward, although often lengthy, process. SOLVE4 is designed to automate fully the analysis of such data. The overall approach is to link together into one seamless procedure all the steps that a crystallographer would normally do manually and, in the process, to convert each decision-making step into an optimization problem. A somewhat more generalized description of SOLVE, together with a description of RESOLVE, a maximum-likelihood solvent-flattening routine, appears in the chapter by T. Terwilliger (see [2] in this volume12a).

The MAD and MIR approaches to structure solution are conceptually similar and share several important steps. In each method, trial partial structures for the heavy or anomalously scattering atoms often are obtained by inspection of difference-Patterson functions or by semiautomated analysis.13–15 These initial structures are refined against the observed data and used to generate initial phases. Then, additional sites and sites in other derivatives can be found from weighted difference or gradient maps using these phases. The analysis of the quality of potential heavy-atom solutions is also similar for the two methods. In both cases, a partial structure is used to calculate native phases for the entire structure, and the electron density that results is then examined to see whether the expected features of the macromolecule can be found. In addition, the figure of merit of phasing and the agreement of the heavy atom model with the difference Patterson function are commonly used to evaluate the quality of a solution. In many cases, an analysis of heavy-atom sites by sequential deletion of individual sites or derivatives is also an important criterion of quality.16
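Several steps of the procedure described below weight individual estimates by their uncertainties (for example, combining the dispersive-difference estimates from different pairs of wavelengths during data preparation). A generic inverse-variance weighting sketch, written for this chapter rather than taken from SOLVE, is:

import numpy as np

def weighted_combine(estimates, sigmas):
    """Combine independent estimates x_i with uncertainties sigma_i.

    Uses inverse-variance weights w_i = 1/sigma_i**2, giving the combined
    value sum(w_i * x_i) / sum(w_i) and uncertainty 1/sqrt(sum(w_i)).
    """
    x = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    combined = np.sum(w * x) / np.sum(w)
    sigma = 1.0 / np.sqrt(np.sum(w))
    return combined, sigma

# Three estimates of the same dispersive difference from different wavelength pairs:
print(weighted_combine([12.1, 11.4, 13.0], [0.8, 1.5, 2.0]))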
12a. T. C. Terwilliger, Methods Enzymol. 374, [2], 2003 (this volume).
13. T. C. Terwilliger, S.-H. Kim, and D. Eisenberg, Acta Crystallogr. A 43, 1 (1987).
14. G. Chang and M. Lewis, Acta Crystallogr. D. Biol. Crystallogr. 50, 667 (1994).
15. A. Vagin and A. Teplyakov, Acta Crystallogr. D. Biol. Crystallogr. 54, 400 (1998).
16. R. E. Dickerson, J. C. Kendrew, and B. E. Strandberg, Acta Crystallogr. 14, 1188 (1961).
Data Preparation

SOLVE prepares data for heavy-atom substructure solution in two steps. First, the data are scaled using the local scaling procedure of Matthews and Czerwinski.17 Second, MAD data are converted to a pseudo-SIRAS form that permits more rapid analysis.18 Systematic errors are minimized by scaling all types of data (e.g., F+ and F−, native and derivative, and the different wavelengths of MAD data) in similar ways and by keeping different data sets separate until the end of scaling. The scaling procedure is optimized for cases in which the data are collected in a systematic fashion. For both MIR and MAD data, the overall procedure is to construct a reference data set that is as complete as possible and that contains information from either a native data set (for MIR) or from all wavelengths (for MAD data). This reference data set is constructed for just the asymmetric unit of data and is essentially the average of all measurements obtained for each reflection. The reference data set is then expanded to the entire reciprocal lattice and used as the basis for local scaling of each individual data set (see Terwilliger and Berendzen4 for additional details).

For MAD data, Bayesian calculations of phase probabilities are slow.19,20 Consequently, SOLVE uses an alternative procedure for all MAD phase calculations except those done at the final stage. This alternative is to convert the multiwavelength MAD data set into a form that is similar to that used for SIRAS data. The information in a MAD experiment is largely contained in just three quantities: a structure factor Fo corresponding to the scattering from nonanomalously scattering atoms, a dispersive or isomorphous difference at a standard wavelength λo (ΔISO(λo)), and an anomalous difference (ΔANO(λo)) at the same standard wavelength.18 It is easy to see that these three quantities could be treated just like an SIRAS data set with the "native" structure factor FP replaced by Fo, the derivative structure factor FPH replaced by Fo + ΔISO(λo), and the anomalous difference replaced by ΔANO(λo). In this way, a single data set with isomorphous and anomalous differences is obtained that can be used in heavy-atom refinement by the origin-removed Patterson refinement method and in phasing by conventional SIRAS phasing.21 The conversion of MAD data to a pseudo-SIRAS form that has almost the same information content requires two important assumptions. The first assumption is that the structure factor
17. B. W. Matthews and E. W. Czerwinski, Acta Crystallogr. A 31, 480 (1975).
18. T. C. Terwilliger, Acta Crystallogr. D. Biol. Crystallogr. 50, 17 (1994).
19. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 53, 571 (1997).
20. E. de la Fortelle and G. Bricogne, Methods Enzymol. 277, 472 (1997).
21. T. C. Terwilliger and D. Eisenberg, Acta Crystallogr. A 43, 6 (1987).
corresponding to anomalously scattering atoms in a structure varies in magnitude, but not in phase, at various X-ray wavelengths. This assumption will hold when there is one dominant type of anomalously scattering atom. The second assumption is that the structure factor corresponding to anomalously scattering atoms is small compared with the structure factor from all other atoms.

The conversion of MAD to pseudo-SIRAS data is implemented in the program segment MADMRG.18 In most cases, there is more than one pair of X-ray wavelengths corresponding to a particular reflection. The estimates from each pair of wavelengths are all averaged, using weighting factors based on the uncertainties in each estimate. Data from various pairs of X-ray wavelengths and from various Bijvoet pairs can have different weights in their contributions to the total. This can be understood by noting that pairs of wavelengths that differ considerably in dispersive contributions would yield relatively accurate estimates of ΔISO(λo). In the same way, Bijvoet differences measured at the wavelength with the largest value of f″ will contribute by far the most to estimates of ΔANO(λo). The standard wavelength choice in this analysis is arbitrary because values at any wavelength can be converted to values at any other wavelength. The standard wavelength does not even have to be one of the wavelengths in the experiment, although it is convenient to choose one of them.

Heavy-Atom Searching and Phasing

The process of structure solution can be thought of largely as a decision-making process. In the early stages of solution, a crystallographer must choose which of several potential trial solutions may be worth pursuing. At a later stage, the crystallographer must choose which peaks in a heavy-atom difference Fourier are to be included in the heavy-atom model, and which hand of the solution is correct. At a final stage, the crystallographer must decide whether the solution process is complete and which of the possible heavy-atom models is the best. The most important feature of the SOLVE software is the use of a consistent scoring algorithm as the basis for making all these decisions.

To make automated structure solution practical, it is necessary to evaluate trial heavy-atom solutions (typically 300–1000) rapidly. For each potential solution, the heavy atom sites must be refined and the phases calculated. In implementing automated structure solution, it was important to recognize the need for a trade-off between the most accurate heavy-atom refinement and phasing at all stages of structure solution and the time required to carry it out. The balance chosen for SOLVE was to use the most accurate available methods for final phase calculations and
to use approximate, but much faster, methods for all intermediate refinements and phase calculations. The refinement method chosen on this basis was origin-removed Patterson refinement,22 which treats each derivative in an MIR data set independently, and which is fast because it does not require phase calculation. The phasing approach used for MIR data throughout SOLVE is Bayesian-correlated phasing,21,23 a method that takes into account the correlation of nonisomorphism among derivatives without slowing down phase calculations substantially.

Once MIR data have been scaled, or MAD data have been scaled and converted to a pseudo-SIRAS form, automated searches of difference Patterson functions are then used to find a large number (typically 30) of potential one-site and two-site solutions. In the case of MIR data, difference-Patterson functions are calculated for each derivative. For MAD data, anomalous and dispersive differences are combined to yield a Bayesian estimate of the Patterson function for the anomalously scattering atoms.24 In principle, Patterson methods could be used to solve the complete heavy-atom substructure, but the approach used in SOLVE is to find just the initial sites in this way and to find all others by difference Fourier analysis. This initial set of one-site and two-site trial solutions becomes a list of "seeds" for further searching.

Once each of the potential seeds is scored and ranked, the top seeds (typically five) are selected as independent starting points in the search for heavy-atom solutions. For each seed, the main cycle in the automated structure-solution algorithm used by SOLVE consists of two basic steps. The first is to refine heavy-atom parameters and to rank all existing solutions generated from this seed so far, on the basis of the four criteria discussed below. The second is to take the highest-ranking partial solution that has not yet been analyzed exhaustively and use it in an attempt to generate a more complete solution. Generation of new solutions is carried out in three ways: by deletion of sites, by addition of sites from difference Fouriers, and by reversal of hand. A partial solution is considered to have been analyzed exhaustively when all single-site deletions have been considered, when no more peaks that result in improvement can be found in a difference Fourier, when inversion does not cause improvement, or when the maximum number of sites specified by the user has been reached. In each case, new solutions generated in these ways are refined, scored, and ranked, and the cycle is continued until all the top trial solutions have been analyzed fully and no new possibilities are found. Throughout this process, a tally of the
22 T. C. Terwilliger and D. Eisenberg, Acta Crystallogr. A 39, 813 (1983).
23 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 52, 749 (1996).
24 T. C. Terwilliger, Acta Crystallogr. D. Biol. Crystallogr. 50, 11 (1994).
solutions that have already been considered is kept, and any duplicates are eliminated. In some cases, one clear solution appears early in this process. In other cases, there are several solutions that have similar scores at early (and sometimes even late) stages of the analysis. When no one possibility is much better than the others, all the seeds are analyzed exhaustively. On the other hand, if a promising partial solution emerges from one seed, then the search is narrowed to focus on that seed, deletions are not carried out until the end of the analysis, and many peaks from the difference Fourier analysis are added simultaneously so as to build up the solution as quickly as possible. Once the expected number of heavy-atom sites is found, then each site is deleted in turn to see whether the solution can be further improved. If this occurs, then the process is repeated in the same way by addition and deletion of sites and by inversion until no further improvement is obtained.

At the conclusion of the SOLVE algorithm, an electron-density map and phases for the top solution are reported in a form that is compatible with the CCP4 suite.6 In addition, command files that can be modified to look for additional heavy-atom sites or to construct other electron-density maps are produced. If more than one possible solution is found, the heavy-atom sites and phasing statistics for all of them are reported.

Scoring, Site Validation, Enantiomorph Determination, and Substructure Refinement

Scoring of potential heavy-atom solutions is an essential part of the SOLVE algorithm because it allows ranking of solutions and appropriate decision-making. Scoring, validation, and enantiomorph determination are all part of the same process, and they are carried out continuously during the solution process. For each trial solution, SOLVE first refines the heavy-atom substructure against the origin-removed Patterson function. Then, it scores the trial solutions using four criteria that are described in detail below: agreement with the Patterson function, cross-validation of heavy-atom sites, the figure of merit, and nonrandomness of the electron-density map. The scores for each criterion are normalized to those for a group of starting solutions (most of which are incorrect) to obtain a so-called Z score. The total score for a solution is the sum of its Z scores after correction for anomalously high scores in any category. SOLVE identifies the enantiomorph, using the score for the nonrandomness criterion. All the other scores are independent of the hand of the heavy-atom substructure, but the final electron-density map will be just noise if anomalous differences are measured and the hand of the heavy atoms is incorrect.
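The Z-score bookkeeping lends itself to a compact illustration. The sketch below is not the SOLVE code: Python with NumPy is assumed, and the criterion names, the reference set of starting solutions, and the cap applied to anomalously high scores are all illustrative choices.

    import numpy as np

    def z_scores(raw_scores, reference_scores, cap=None):
        """Normalize raw criterion scores against a reference group of (mostly
        incorrect) starting solutions; optionally cap anomalously high values."""
        mean, sd = np.mean(reference_scores), np.std(reference_scores)
        z = (np.asarray(raw_scores, float) - mean) / sd
        return z if cap is None else np.minimum(z, cap)

    def total_score(criteria, reference, cap=3.0):
        """criteria/reference: dicts mapping criterion name -> per-trial scores.
        The total score of each trial is the sum of its Z scores over all criteria."""
        total = 0.0
        for name in criteria:
            total = total + z_scores(criteria[name], reference[name], cap)
        return total

    # Hypothetical example: two criteria, three trial solutions, and a reference
    # set of 40 starting solutions for each criterion.
    rng = np.random.default_rng(0)
    reference = {"patterson": rng.normal(0.4, 0.1, 40), "fom": rng.normal(0.45, 0.05, 40)}
    trials = {"patterson": [1.10, 0.42, 0.38], "fom": [0.61, 0.46, 0.44]}
    print(np.argsort(total_score(trials, reference))[::-1])  # trials ranked best first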
Consequently, the nonrandomness score can be used effectively in later stages of structure solution to identify the correct enantiomorph.

Patterson Agreement. The first criterion used by SOLVE for evaluating a trial heavy-atom solution is the agreement between calculated and observed Patterson functions. Comparisons of this type have always been important in the MIR and MAD methods.25 The score for Patterson function agreement is the average value of the Patterson function at predicted peak locations after multiplication by a weighting factor based on the number of heavy-atom sites in the trial solution. The weighting factor4 is adjusted such that, if two solutions have the same mean value at predicted Patterson peaks, the one with the larger number of sites receives the higher score. In some cases, predicted Patterson vectors fall on high peaks that are not related to the heavy-atom solution. To exclude these contributions, the occupancies of each heavy-atom site are refined so that the predicted peak heights approximately match the observed peak heights at the predicted interatomic positions. Then, all peaks with heights more than 1σ larger than their predicted values are truncated. The average values are corrected further for instances in which more than one predicted Patterson vector falls at the same location by scaling that peak height by the fraction of predicted vectors that are unique.

Cross-Validation of Sites. A cross-validation difference Fourier analysis is the basis of the second scoring criterion. One at a time, each site in a solution (and any equivalent sites in other derivatives for MIR solutions) is omitted from the heavy-atom model, and the phases are recalculated. These phases are used in a difference Fourier analysis, and the peak height at the location of the omitted site is noted. A similar analysis, in which a derivative is omitted from phasing and all other derivatives are used to phase a difference Fourier, has been used for many years.16 The score for cross-validation difference Fouriers is the average peak height after weighting by the same factor used in the difference Patterson analysis.

Figure of Merit. The mean figure of merit of phasing, m,25 can be a remarkably useful measure of the quality of phasing despite its susceptibility to systematic error.4 The overall figure of merit is essentially a measure of the internal consistency of the heavy-atom solution with the data. Because heavy-atom refinement in SOLVE is carried out using origin-removed Patterson refinement,22 occupancies of heavy-atom sites are relatively unbiased. This minimizes the problem of high occupancies leading to inflated figures of merit. In addition, using a single procedure for phasing allows
25 T. L. Blundell and L. N. Johnson, ‘‘Protein Crystallography.’’ Academic Press, New York, 1976.
comparison among solutions. The score based on figure of merit is simply the unweighted mean for all reflections included in phasing.

Nonrandomness of Electron Density. The most important criterion used by a crystallographer in evaluating the quality of a heavy-atom solution is the interpretability of the resulting electron-density map. Although a full implementation of this criterion is difficult, it is quite straightforward to evaluate instead whether the electron-density map has general features that are expected for a crystal of a macromolecule. A number of features of electron-density maps could be used for this purpose, including the connectivity of electron density in the maps,26 the presence of clearly defined regions of protein and solvent,27–33 and histogram matching of electron densities.31,34 The identification of solvent and protein regions has been used as the measure of map quality in SOLVE. This requires that there be both solvent and protein regions in the electron-density map. Fortunately, for most macromolecular structures the fraction of the unit cell that is occupied by the macromolecule is in the suitable range of 30–70%. The criteria used in scoring by SOLVE are based on the solvent and protein regions each being fairly large, contiguous regions.33 The unit cell is divided into boxes having each dimension approximately twice the resolution of the map, and the root-mean-square (rms) electron density is calculated within each box without including the F(000) term in the Fourier synthesis. Boxes within the protein region will typically have high values of this rms electron density (because there will be some points where atoms are located and other points that lie between atoms) whereas boxes in the solvent region will have low values because the electron density will be fairly uniform. The score, based on the connectivity of the protein and solvent regions, is simply the correlation coefficient of the rms density for adjacent boxes. If there is a large contiguous protein region and a large contiguous solvent region, then adjacent boxes will have highly correlated values. If the electron density is random, there will be little or no correlation. On the other hand, the correlation may be as high as 0.5 or 0.6 for a good map.
26 D. Baker, A. E. Krukowski, and D. A. Agard, Acta Crystallogr. D. Biol. Crystallogr. 49, 186 (1993).
27 B.-C. Wang, Methods Enzymol. 115, 90 (1985).
28 S. Xiang, C. W. Carter, Jr., G. Bricogne, and C. J. Gilmore, Acta Crystallogr. D. Biol. Crystallogr. 49, 193 (1993).
29 A. D. Podjarny, T. N. Bhat, and M. Zwick, Annu. Rev. Biophys. Biophys. Chem. 16, 351 (1987).
30 J. P. Abrahams, A. G. W. Leslie, R. Lutter, and J. E. Walker, Nature 370, 621 (1994).
31 K. Y. J. Zhang and P. Main, Acta Crystallogr. A 46, 377 (1990).
32 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 55, 501 (1998).
33 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D. Biol. Crystallogr. 55, 1872 (1999).
34 A. Goldstein and K. Y. J. Zhang, Acta Crystallogr. D. Biol. Crystallogr. 54, 1230 (1998).
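As a rough illustration of this criterion (not the SOLVE implementation), the rms density can be accumulated in boxes and the correlation of adjacent boxes used as the score. NumPy is assumed, and the grid size, box edge, and the restriction to neighbors along a single axis are simplifying assumptions.

    import numpy as np

    def nonrandomness_score(rho, box):
        """rho: electron-density map on a 3-D grid (F(000) term omitted);
        box: box edge in grid points (roughly twice the resolution).
        Score = correlation coefficient of the rms density in adjacent boxes."""
        nx, ny, nz = (s // box for s in rho.shape)
        rms = np.empty((nx, ny, nz))
        for i in range(nx):
            for j in range(ny):
                for k in range(nz):
                    blk = rho[i*box:(i+1)*box, j*box:(j+1)*box, k*box:(k+1)*box]
                    rms[i, j, k] = np.sqrt(np.mean((blk - blk.mean())**2))
        # correlate each box with its neighbor along the first axis only (a simplification)
        return np.corrcoef(rms[:-1].ravel(), rms[1:].ravel())[0, 1]

    # A random map scores near 0; a map with distinct high- and low-variance
    # ("protein" and "solvent") regions scores clearly above 0.
    rng = np.random.default_rng(1)
    flat = rng.normal(size=(24, 24, 24))
    structured = flat.copy()
    structured[:, :, :12] *= 3.0
    print(nonrandomness_score(flat, 4), nonrandomness_score(structured, 4))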
The four-point scoring scheme described above provides the foundation for automated structure solution. To make it practical, the conversion of MAD data to a pseudo-SIRAS form and the use of rapid origin-removed, Patterson-based, heavy-atom refinement have been critical. The remainder of the SOLVE algorithm for automated structure solution is largely a standardized form of local scaling, an integrated set of routines to carry out all the calculations required for heavy-atom searching, refinement, and phasing as well as routines to keep track of the lists of current solutions being examined and past solutions that have already been tested. SOLVE is an easy program to use. Only a few input parameters are needed in most cases, and the SOLVE algorithm carries out the entire process automatically. In principle, the procedure also can be thorough: many starting solutions can be examined, and difficult heavy-atom structures can be determined. In addition, for the most difficult cases, the failure to find a solution can be useful in confirming that additional information is needed.

Crystallography and NMR System
The Crystallography and NMR System (CNS)5 implements a novel Patterson-based method for the location of heavy atoms or anomalous scatterers.35 The procedure is implemented using a combination of direct-space and reciprocal-space searches, and it can be applied to both isomorphous replacement and anomalous scattering data. The goal of the algorithm is to make it practical to locate automatically a subset of the heavy atoms without manual interpretation or intervention. Once the sites have been located, CNS provides tools for heavy-atom refinement, phase estimation, density modification, and heavy-atom model completion. These tools, known as task files, are scripts written in the CNS language and are supplied with reasonable default parameters. Using these task files, the process of phasing is greatly simplified and initial electron-density maps, even for large complex structures, can be calculated in a relatively short time. CNS has been used successfully to solve problems with up to 40 sites36 and 66 selenium sites (see Applications, below).

Data Preparation

Sigma Cutoffs and Outlier Elimination. The peaks in a Patterson map correspond to interatomic vectors of the crystal structure.37 However, the
35 R. W. Grosse-Kunstleve and A. T. Brunger, Acta Crystallogr. D. Biol. Crystallogr. 55, 1568 (1999).
36 M. A. Walsh, Z. Otwinowski, A. Perrakis, P. M. Anderson, and A. Joachimiak, Struct. Fold. Des. 8, 505 (2000).
atoms are not point scatterers, and there are errors associated with experimental data, making the interpretation of the Patterson map difficult. Therefore, steps are taken to minimize the amount of error that is introduced. In practice, the suppression of outliers can be essential to the success of a heavy-atom search.38 In CNS, reflections are first rejected on the basis of their signal-to-noise ratio (‘‘sigma cutoff’’). This is performed on both the observed amplitudes and the computed difference between pairs of amplitudes. For the computation of differences, the observed amplitudes are scaled relative to each other, using overall k-scaling and B-scaling in order to compensate for systematic errors caused by differences between crystals and data collection conditions. Additional reflections are rejected if their amplitudes or difference amplitudes deviate too much from the corresponding root-mean-square (rms) value for all of the data in their resolution shell (‘‘rms outlier removal’’). Empirical observation has led to the values of the rejection criteria shown in Table II. Except for the instances noted in Table II, these values can generally be used without modification.

TABLE II
Default Parameters for CNS Automated Heavy-Atom Search Procedure

Parameter | Default value(a) | Comment(b)
Number of sites | 2/3 of total expected | Typically not all sites are well ordered, and it is easy to add additional sites using gradient map methods once phasing has started with the 2/3 partial solution
Minimum Bragg spacing | 4.0 Å | If there are a large number of heavy-atom sites per macromolecule, a higher resolution limit (3.5 Å) may be required
Averaging of Patterson maps | No | If solutions are not found with a single map, then multiple maps can be tried
Special positions | No | Can be set to true if the heavy atoms have been soaked into the crystal
Sigma cutoff on ΔF | 1 | Decrease to 0 for FA structure factors
RMS outlier cutoff on F for native or on ΔF for difference Patterson maps | 4 | Increase to 10 for FA structure factors
Expected increase in correlation coefficient for dead-end test | 0.01 | When there are a large number of heavy-atom sites, it may be necessary to decrease this value (to 0.005)

(a) Values present in the heavy_search.inp task file supplied with CNS.
(b) Situations in which the default parameter may require modification.
37 M. J. Buerger, ‘‘Vector Space.’’ John Wiley & Sons, New York, 1959.
38 G. M. Sheldrick, Methods Enzymol. 276, 628 (1997).
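A minimal sketch of the two filters, a sigma cutoff followed by rms outlier removal in resolution shells, is given below. It is not the CNS code: NumPy is assumed, the shell binning is an arbitrary equal-count scheme, and the test is applied to amplitudes rather than to difference amplitudes for brevity.

    import numpy as np

    def filter_reflections(f, sigf, d_spacing, sigma_cut=1.0, rms_cut=4.0, nshells=20):
        """Return a boolean mask of reflections surviving a sigma cutoff
        (F/sig(F) >= sigma_cut) and an rms outlier test applied per resolution
        shell (|F| <= rms_cut * rms(F) within the shell)."""
        f = np.asarray(f, float); sigf = np.asarray(sigf, float)
        keep = f / sigf >= sigma_cut                       # sigma cutoff
        # assign each reflection to a resolution shell (equal counts per shell)
        order = np.argsort(d_spacing)[::-1]
        shell = np.empty(len(f), int)
        shell[order] = np.arange(len(f)) * nshells // len(f)
        for s in range(nshells):
            in_shell = (shell == s) & keep
            if not in_shell.any():
                continue
            rms = np.sqrt(np.mean(f[in_shell] ** 2))
            keep &= (shell != s) | (np.abs(f) <= rms_cut * rms)   # rms outlier removal
        return keep

    # toy usage with synthetic data
    rng = np.random.default_rng(2)
    f = np.abs(rng.normal(100, 30, 1000)); f[::97] *= 10          # plant a few outliers
    mask = filter_reflections(f, sigf=np.full(1000, 10.0),
                              d_spacing=rng.uniform(2, 20, 1000))
    print(mask.sum(), "of", mask.size, "reflections kept")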
Combining Patterson Maps. CNS provides the option to average Patterson maps based on different data sets. For example, several MAD wavelengths or a combination of isomorphous and anomalous difference maps can be combined. This is useful if the signal in any individual data set is too weak to locate the heavy atoms unambiguously. A small signal-to-noise ratio in the observed data leads to noise in the Patterson maps. The combination of data increases the signal-to-noise ratio in the resulting Patterson map by averaging out the noise and, therefore, improves the chances of locating the heavy-atom positions (Fig. 1d).

Using FA Structure Factors. If MAD data are available, it is possible to define structure factors FA that are approximations to the component of the observed structure factors resulting from the anomalous scatterers.2,3,18 FA structure factors can be calculated using programs such as XPREP,39 MADSYS,3 or the MADBST module of SOLVE.4 Although CNS does not perform FA estimation, the heavy-atom search procedure can make use of this information, which has been found to increase the chances of locating the correct sites (Fig. 1e). Ideally, an algorithm for the estimation of FA structure factors includes a careful treatment of outliers similar to the sigma cutoff and rms outlier removal outlined above. If this is the case, the parameters for the sigma cutoff and rms outlier removal in CNS should be adjusted to include all data in the heavy-atom search procedure (see Table II).

Heavy-Atom Searching

The CNS heavy-atom search procedure (Fig. 2) consists of four stages that are described in more detail by Grosse-Kunstleve and Brunger.35 In the first stage, the observed diffraction intensities are filtered by the criteria described above, and two or more Patterson maps (calculated from MIR, MAD, or MIRAS data) can be averaged. The second stage consists of a Patterson search by either a reciprocal-space single-atom fast translation function, by a direct-space symmetry minimum function, or by a combination of both. Combination searches have been shown to be the most accurate.35 A given number (typically 100) of the highest peaks in the resulting Patterson search map are sorted and subsequently used as initial trial sites. The third stage consists of a sequence of alternating reciprocal-space or direct-space Patterson searches as well as Patterson-correlation
39 Written by G. Sheldrick. Available from Bruker Advanced X-Ray Solutions (Madison, WI).
[Fig. 1: five panels, (a)–(e); each panel plots the correlation coefficient (CC, 0 to 0.6) against trial number.]
Fig. 1. Results of automated CNS heavy-atom search with the MAD data from 2-aminoethylphosphonate transaminase. Sixty-six selenium sites are present in the asymmetric unit. Automated searches for 44 sites (two-thirds of the expected total) were performed. In all cases, 100 trial solutions were generated and sorted by the correlation coefficient (F2F2). (a) No solutions were found using the anomalous ΔF structure factors at the high-energy remote wavelength, as indicated by no separation between the trials. (b) A few solutions were found using the anomalous ΔF structure factors at the peak wavelength. (c) The anomalous ΔF structure factors at the inflection-point wavelength found more solutions, indicating a larger anomalous signal than the peak wavelength. (d) Using combined anomalous ΔF structure factors at the inflection-point wavelength and the dispersive differences between the inflection point and high-energy remote gave an even higher success rate. (e) Finally, the greatest success rate was with FA structure factors calculated from all three wavelengths, using XPREP.39
(PC) refinements40 starting with each of the initial trial sites. The highest peak whose distances to its symmetry-equivalent points and to all preexisting sites exceed the given cutoff distance is selected. If two or more sites already have been placed, a dead-end elimination test is performed.
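The dead-end test, described more fully after Fig. 2, simply compares the correlation coefficient before and after the newest site is added. The sketch below shows only this control flow; the functions patterson_search and pc_refine_cc are hypothetical placeholders, and the threshold is the Table II default.

    def build_solution(seed_sites, expected_sites, patterson_search, pc_refine_cc,
                       min_cc_gain=0.01):
        """Grow a heavy-atom trial solution site by site, stopping at a dead end.
        patterson_search(sites) -> candidate new site; pc_refine_cc(sites) -> (sites, cc)."""
        sites, cc = pc_refine_cc(list(seed_sites))
        while len(sites) < expected_sites:
            new_site = patterson_search(sites)
            trial_sites, trial_cc = pc_refine_cc(sites + [new_site])
            if trial_cc - cc < min_cc_gain:       # dead-end test: no useful gain
                break
            sites, cc = trial_sites, trial_cc     # accept the new site and continue
        return sites, cc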
[Fig. 2. Flow chart of the CNS heavy-atom search procedure. A first Patterson search generates a list of initial trial sites. For each trial site, candidate peaks are accepted only if their distances to all existing sites fall within the specified range; all sites are then subjected to positional and/or B-factor PC refinement; and Patterson searches for further sites continue until either the expected number of sites has been placed or a dead end is reached, at which point the sites are written to a file.]
The correlation coefficient computed before placing and refining the last new site is compared with the correlation coefficient computed after the addition of the new site. If the target value does not increase by a specified amount, typically 0.01 (see Table II), then the search for that particular initial trial site is deemed to have reached a dead end, and no additional sites are placed. Otherwise, another Patterson search is carried out until the expected number of sites is found. The final stage consists of sorting the solutions ranked by the value of the target function (a correlation coefficient)
40 A. T. Brunger, Acta Crystallogr. A 47, 195 (1991).
of the PC refinement. If the correct solution has been found, it is normally characterized by the best value of the target function and a significant separation from incorrect solutions (compare, e.g., Fig. 1a and b).

Reciprocal-Space Method: Single-Atom Fast Translation Function. A single heavy-atom site is translated throughout an asymmetric unit, and the standard linear correlation coefficient of F²patt and F²calc(t) (referred to as F2F2) is computed for each position t:

$$\mathrm{F2F2}(t) = \frac{\sum_{H}\bigl(F^{2}_{H,\mathrm{patt}}-\langle F^{2}_{\mathrm{patt}}\rangle\bigr)\bigl(F^{2}_{H,\mathrm{calc}}-\langle F^{2}_{\mathrm{calc}}\rangle\bigr)}{\sqrt{\sum_{H}\bigl(F^{2}_{H,\mathrm{patt}}-\langle F^{2}_{\mathrm{patt}}\rangle\bigr)^{2}\,\sum_{H}\bigl(F^{2}_{H,\mathrm{calc}}-\langle F^{2}_{\mathrm{calc}}\rangle\bigr)^{2}}} \qquad (1)$$

The summations are computed for all Miller indices H, and ⟨F²⟩ denotes the mean of F² over all Miller indices. Other target expressions can be used, including the correlation coefficients between Fpatt and Fcalc(t), between E²patt and E²calc(t), and between Epatt and Ecalc(t), where the E values are normalized structure factors (see Dual-Space Direct Methods, below). The F2F2 target function is preferred because it permits the use of a fast translation function (FTF),41 which is 300–500 times faster35 than the conventional translation function.42 Thus, the FTF makes the automated reciprocal-space heavy-atom search procedure practical even for large numbers of sites. The reciprocal-space search for an additional site is similar to the search for the initial trial sites, except that the previously placed sites are kept fixed and are included in the structure-factor (Fcalc) calculation.41

Direct-Space Method: Symmetry and Image-Seeking Minimum Functions. The symmetry minimum function (SMF)43–45 makes maximal use of the information contained in the Harker regions. The computation of an SMF requires a Patterson map as well as a table of the unique Harker vectors and their weights.43 These Harker vectors and weights are supplied automatically by CNS. The image-seeking minimum function (IMF)43,45 can be used to locate additional sites once one or more are placed. Computing an IMF map is equivalent to a deconvolution of the Patterson map using knowledge of the already placed heavy-atom sites. Because of coincidental overlap of peaks in the Patterson map, thermal motion of the sites, and noise in the data, the IMF maps typically provide only limited information for macromolecular crystal structures.
41 J. Navaza and E. Vernoslova, Acta Crystallogr. A 51, 445 (1995).
42 M. Fujinaga and R. J. Read, J. Appl. Crystallogr. 20, 517 (1987).
43 P. G. Simpson, R. D. Dobrott, and W. N. Lipscomb, Acta Crystallogr. 18, 169 (1965).
44 F. Pavelcik, J. Appl. Crystallogr. 19, 488 (1986).
45 M. A. Estermann, Nucl. Instr. Methods Phys. Res. A 354, 126 (1995).
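Equation (1) is an ordinary linear correlation coefficient between squared amplitudes, and a direct (slow) evaluation for a single trial arrangement can be sketched as below. NumPy is assumed; the point-atom, space-group-P1 structure-factor routine and the numerical values are illustrative stand-ins, and the fast translation function that makes the real search practical is not reproduced.

    import numpy as np

    def f2f2(f_patt_sq, f_calc_sq):
        """Linear correlation coefficient between F^2(patt) and F^2(calc) [Eq. (1)]."""
        a = np.asarray(f_patt_sq, float) - np.mean(f_patt_sq)
        b = np.asarray(f_calc_sq, float) - np.mean(f_calc_sq)
        return np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b))

    def f_calc_sq(hkl, sites):
        """|F|^2 for a set of fractional coordinates (equal point scatterers, P1)."""
        phases = np.exp(2j * np.pi * (hkl @ np.asarray(sites).T))
        return np.abs(phases.sum(axis=1)) ** 2

    hkl = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [2, 1, 3], [1, 2, 1]])
    f_patt_sq = np.array([4.0, 9.0, 1.0, 2.5, 6.0])       # illustrative "observed" values
    trial = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.7]]            # a hypothetical two-site trial
    print(round(f2f2(f_patt_sq, f_calc_sq(hkl, trial)), 3))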
Peak Search and Special Position Check. The list of initial trial sites is determined by a peak search in the single-atom FTF, the SMF, or their combination. A grid point is considered to be a peak if the corresponding density in the map is at least as high as that of its six nearest neighbors. Redundancies due to space-group symmetry and allowed origin shifts are automatically removed. Similarly, additional sites are determined by a peak search in the FTF, the IMF, or their combination. The treatment of redundancies due to symmetry is fully integrated into the search procedure. Sites at or close to a special position can be accepted or rejected. In the latter case, the shortest distance to all its symmetry-equivalent sites is computed for each of the trial sites. If this distance is less than a given cutoff distance (typically 3.5 Å), the site is rejected. Because selenomethionine substitution is the predominant technique for introducing anomalous scatterers into a macromolecule, the rejection of peaks on special positions is set to be the default. However, if heavy atoms have been soaked, cocrystallized, or chemically reacted with the macromolecule, a site could be located on a special position. In such cases, it is appropriate to search for heavy atoms first with special positions rejected and then with them accepted in order to determine whether further sites are found.

Scoring Trial Structures

The result of the CNS heavy-atom search is a number of trial solutions, each containing up to the specified maximum number of sites. There are typically as many of these trial solutions as were requested by the user before running the heavy_search.inp task file. However, when the input Patterson map has only a small number of peaks, it is possible that there will be fewer trial solutions found. The trial solutions can be ranked by the scoring function (which is typically F2F2, the correlation between the squared amplitudes), but other score functions can be used. Although the absolute value of the correlation coefficient could be used as a guide to the correctness of each trial solution, empirical observation has shown that a more informative guide is the presence of solutions with correlation coefficients that are outstanding compared with the rest (Fig. 1). Similar observations have also been made by the authors of other automatic programs for locating heavy atoms.9 The heavy_search.inp task file creates a list file (heavy_search.list) that contains an unsorted list of the score function for each trial solution. Each solution with a correlation score that is 1.5σ above the mean of all the solutions is marked with a plus sign (+). To interpret the results easily, the list of configurations can be sorted by correlation coefficient and then plotted graphically (Fig. 1). In the majority of cases encountered to date, if the
solution with the highest correlation is also more than 1.5σ above the mean, then all or most of the heavy-atom positions in that solution are correct.

Substructure Refinement, Site Validation, and Enantiomorph Determination

The trial solutions produced by the automated heavy-atom search are used to determine initial phases to generate an electron-density map. Several different tasks must be performed in order to refine the heavy-atom substructure, calculate phases, complete the heavy-atom model, resolve the enantiomorph, and possibly resolve phase ambiguities. A similar approach is followed for MAD, SAD, and (M/S)IR(AS) experiments. In all cases, the following methods are employed.

Substructure Refinement. The heavy-atom sites located automatically with CNS are refined and phase probability distributions generated using the ir_phase.inp or mad_phase.inp task files, which deal with isomorphous replacement and anomalous diffraction, respectively. A generalized phase-refinement formulation is used in which lack-of-closure expressions are calculated between a user-selected reference data set and all other data sets.46,47 A maximum-likelihood target function47 is employed that makes use of an error model similar to that of Terwilliger and Eisenberg.21 Coordinates, B-factors and, when appropriate, occupancies are refined using the Powell conjugate gradient minimization algorithm.48

Site Validation. The heavy-atom positions are not extensively validated during the search procedure; instead, the refinement of B-factors during each cycle decreases the contribution from incorrect sites. After phase calculation, the gradient map technique is used to validate the existing sites further, and also to detect sites missing from the current model.49 The gradient map is a Fourier synthesis calculated from the first derivative of the phasing target function, which can be interpreted as a difference map. A positive peak, clearly separated from any existing atom, corresponds to an atom missing from the heavy-atom model whereas a negative peak, located at the position of an existing atom, indicates that this atom is either incorrectly placed or has been assigned an incorrect chemical type or occupancy. Anisotropic motion of atoms in the substructure also can lead to peaks in the gradient map close to existing sites.

Enantiomorph Determination. The use of the gradient map method in combination with substructure refinement allows the heavy-atom model
46 J. C. Phillips and K. O. Hodgson, Acta Crystallogr. A 36, 856 (1980).
47 F. T. Burling, W. I. Weis, K. M. Flaherty, and A. T. Brunger, Science 271, 72 (1996).
48 M. J. D. Powell, Math. Program. 12, 241 (1977).
49 G. Bricogne, Acta Crystallogr. A 40, 410 (1984).
to be completed even though the correct hand of the heavy-atom configuration is often still unknown. In CNS, the correct hand is determined by repeating the phase determination with the alternate hand followed by inspection of the two electron-density maps (see below). In the majority of cases, obtaining the alternative hand is achieved simply by inverting the coordinates about the origin. However, in the case of enantiomorphic space groups, the space group must be changed at the same time as the coordinates are inverted (e.g., P61 is mapped to P65). In addition, in a small number of space groups, the inversion of the coordinates is not about the origin, but rather about some other point in the unit cell. The CNS task file flip_sites.inp automatically takes account of both of these situations. Once phasing has been performed with the two possible choices of heavy-atom coordinates, the electron-density maps can be compared to determine which hand is correct. Making this decision from the raw experimental phases is feasible only with high-quality MIR(AS) or MAD data sets. In such cases, the solvent boundary, secondary structure elements, or atomic detail in the electron-density map can show clearly which heavy-atom configuration is correct. However, in the general case the raw experimental phases are not sufficient to reveal such features. In particular, in the case of a single anomalous diffraction (SAD) or a single isomorphous replacement (SIR) experiment, it is not possible to distinguish the two hands in this way because of the bimodal phase distributions that are produced. Therefore, it is usually better to perform phase improvement by density modification in the form of solvent flattening or solvent flipping50 to resolve the phase ambiguity present in the SAD and SIR cases. The CNS task file density_modify.inp should be used to improve the phases irrespective of the type of phasing experiment. After density modification of phases from both heavy-atom hands, the electron-density maps usually identify the correct hand unambiguously and generate maps good enough to begin model building.
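A sketch of the simple case handled by flip_sites.inp is shown below: inversion of the fractional coordinates through the origin together with the swap of an enantiomorphic space-group pair. The table of pairs is deliberately partial, and the space groups that require inversion about a point other than the origin are not treated.

    # Map each member of an enantiomorphic pair to its partner (partial list, for illustration)
    ENANTIOMORPHIC_PAIRS = {"P31": "P32", "P32": "P31", "P41": "P43", "P43": "P41",
                            "P61": "P65", "P65": "P61", "P62": "P64", "P64": "P62"}

    def flip_hand(sites, space_group):
        """Invert fractional coordinates through the origin (x,y,z -> -x,-y,-z, kept in [0,1))
        and, for enantiomorphic space groups, switch to the partner space group."""
        flipped = [tuple((-x) % 1.0 for x in site) for site in sites]
        return flipped, ENANTIOMORPHIC_PAIRS.get(space_group, space_group)

    sites = [(0.12, 0.30, 0.45), (0.80, 0.05, 0.66)]
    print(flip_hand(sites, "P61"))   # inverted coordinates, space group becomes P65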
Dual-Space Direct Methods: SnB and SHELXD

Direct methods are techniques that use probabilistic relationships among the phases to derive values of the individual phases from the measured amplitudes. The purpose of this section is to give a concise summary of these techniques as they apply to substructure determination. The basic theory underlying direct methods,51 as well as macromolecular applications
50 J. P. Abrahams and A. G. W. Leslie, Acta Crystallogr. D. Biol. Crystallogr. 52, 30 (1996).
51 C. Giacovazzo, in ‘‘International Tables for Crystallography’’ (U. Shmueli, ed.), Vol. B, p. 201. Kluwer Academic, Dordrecht, The Netherlands, 1996.
of direct methods,1 have been reviewed; the reader is referred to these sources for additional details. Historically, direct methods have targeted the determination of complete structures, especially small molecules containing fewer than 100 nonhydrogen atoms. In the early 1990s, the size range of routine direct-methods applications was extended by almost an order of magnitude through a procedure that has come to be known as Shake-and-Bake.52,53 The distinctive feature of this procedure is the repeated and unconditional alternation of reciprocal-space phase refinement (Shaking) with a complementary real-space process that seeks to improve phases by applying constraints (Baking). This algorithm has been implemented independently in two computer programs, SnB9,10 and SHELXD11,11a (alias Halfbaked or SHELXM). These programs provide default parameters and protocols for the phasing process, but they allow easy user intervention in difficult cases.

It has been recognized for some time that the formalism of direct methods carries over to substructures when applied to single isomorphous54 (SIR) or single anomalous55 (SAD or SAS) difference data. MIR data can be accommodated simply by treating the data separately for each derivative, and MAD data can be handled by examining the anomalous differences for each wavelength individually or by combining them together in the form of FA structure factors.2,3 The dispersive differences between two wavelengths of MAD data also can be treated as pseudo-SIR differences. If substructure determination were the only concern, it is unclear whether it would be best to measure anomalous scattering data a few times for each of three wavelengths or many times for one wavelength. What is clear is that high redundancy leads to a highly beneficial reduction in measurement errors. SnB and SHELXD can both use either |ΔFano| or |FA| values, and so far both approaches have worked well. SnB is normally applied to peak-wavelength anomalous differences computed using the DREAR56 program suite, and SHELXD is normally applied to |ΔFano| or |FA| values that have been calculated using XPREP.39 It is reassuring to know that one wavelength is generally sufficient for substructure determination when not all wavelengths were measured or when one or more wavelengths were in error. In addition, treating the wavelengths separately allows for useful cross-correlation of sites (see below, Site Validation).
52 C. M. Weeks, G. T. DeTitta, R. Miller, and H. A. Hauptman, Acta Crystallogr. D. Biol. Crystallogr. 49, 179 (1993).
53 C. M. Weeks, G. T. DeTitta, H. A. Hauptman, P. Thuman, and R. Miller, Acta Crystallogr. A 50, 210 (1994).
54 K. S. Wilson, Acta Crystallogr. B 34, 1599 (1978).
55 A. K. Mukherjee, J. R. Helliwell, and P. Main, Acta Crystallogr. A 45, 715 (1989).
56 R. H. Blessing and G. D. Smith, J. Appl. Crystallogr. 32, 664 (1999).
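The dual-space alternation itself can be summarized schematically. The sketch below outlines only the control flow shared by the two programs; the phase-refinement, Fourier-synthesis, and peak-picking steps are passed in as hypothetical callables rather than implemented.

    def shake_and_bake(random_start, n_cycles, refine_phases, e_map_from, peaks_from,
                       phases_from_sites):
        """One trial of the dual-space procedure: alternate reciprocal-space phase
        refinement ('shake') with real-space peak picking ('bake')."""
        sites = random_start()                    # random trial atoms (or Patterson-seeded)
        phases = phases_from_sites(sites)         # initial phases by structure-factor calculation
        for _ in range(n_cycles):
            phases = refine_phases(phases)        # shake: tangent formula or minimal function
            e_map = e_map_from(phases)            # Fourier synthesis with |E| coefficients
            sites = peaks_from(e_map)             # bake: keep the top peaks as the new model
            phases = phases_from_sites(sites)     # back to reciprocal space
        return sites, phases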
The largest substructure solved so far by direct methods contained 160 independent selenium sites.57 The upper limit of size is unknown, but, by analogy to the complete structure case, it is reasonable to think that it is at least a few hundred sites. In all likelihood, the inherently noisier nature of difference data and the fact that |ΔFano| and |FA| values provide imperfect approximations to the substructure amplitudes mean that the maximal substructure size that can be accommodated is probably less than that of complete structures. Although, at present, full-structure direct-methods applications require atomic-resolution data of 1.2 Å or better, the resolution of the data typically collected for isomorphous replacement or MAD experiments is sufficient for direct-methods determinations of substructures. Because it is rare for heavy atoms or anomalous scatterers to be closer than 3–4 Å, data having a maximum resolution in this range are adequate.

Data Preparation

Normalization. To take advantage of the probabilistic relationships that form the foundation of direct methods, the usual structure factors, F, must be replaced by the normalized structure factors,58 E. The condition ⟨|E|²⟩ = 1 is always imposed for every data set. Unlike ⟨|F|⟩, which decreases as sin(θ)/λ increases, the values of ⟨|E|⟩ are constant for concentric resolution shells. Similarly, correction factors (ε) are applied that take into account the average intensities of particular classes of reflections as a result of space-group symmetry.59 The distribution of |E| values is, in principle, and often in practice, independent of the unit cell size and contents, but it does depend on whether a center of symmetry is present. Normalization is a necessary first step in data processing for direct-methods computations. It can be accomplished simply by dividing the data into resolution shells and applying the condition ⟨|E|²⟩ = 1 to each shell. Alternatively, a least-squares-fitted scaling function can be used to impose the normalization condition. The procedures are similar regardless of whether the starting information consists of |F|, |ΔF| (iso or ano), or |FA| values and leads to |E|, |ΔE|, or |EA| values. Mathematically precise definitions of the SIR and SAD difference magnitudes, |ΔE|, that take into account the atomic scattering factors |fj| = |fj° + fj′ + ifj″| have been presented by Blessing and Smith56 and implemented in the program DIFFE that is distributed as part
57 F. von Delft, T. Inoue, S. A. Saldanha, H. H. Ottenhof, F. Schmitzberger, L. M. Birch, V. Dhanaraj, M. Witty, A. G. Smith, T. L. Blundell, and C. Abell, Struct. 11, 985 (2003).
58 H. A. Hauptman and J. Karle, ‘‘Solution of the Phase Problem. I. The Centrosymmetric Crystal.’’ ACA Monograph No. 3. Polycrystal Book Service, Dayton, OH, 1953.
59 U. Shmueli and A. J. C. Wilson, in ‘‘International Tables for Crystallography’’ (U. Shmueli, ed.), Vol. B, p. 190. Kluwer Academic, Dordrecht, The Netherlands, 1996.
of the SnB package. The |FA| values that are used in SHELXD to form |EA| values are computed in XPREP,39 using algorithms similar to those employed in the MADBST component of SOLVE.4

Sigma Cutoffs and Outlier Elimination. Direct methods are notoriously sensitive to the presence of even a small number of erroneous measurements. This is especially problematical in the case of difference data, which can be quite noisy. The best antidote is to eliminate any questionable measurement before initiating the phasing process. Fortunately, it is possible to be stringent in the application of cutoffs because the number of difference reflections that must be phased is typically a small fraction of the total available observations. In small-molecule cases in which all reflections accessible to copper radiation have been measured, it is normal to phase about 10 reflections for every atom to be found, and this means that about 15% of the total data are used. In substructure cases, the unit cell for an N-site problem will be much larger than it would be for a small molecule with the same number of atoms to be positioned. Thus, the number of possible reflections will also be much larger, and many more can be rejected if necessary. In fact, only 2–3% of the total possible reflections at 3 Å need be phased in order to solve substructures using direct methods, but these reflections must be chosen from those with the largest |ΔE| values. The DIFFE56 program rejects data pairs (|E1|, |E2|) [i.e., SIR pairs (|EP|, |EPH|), SAD pairs (|E+|, |E−|), and pseudo-SIR dispersive pairs (|Eλ1|, |Eλ2|)] or difference E magnitudes (|ΔE|) that are not significantly different from zero or deviate markedly from the expected distribution. The following tests are applied; the default values supplied by the SnB interface for the cutoff parameters (TMAX, XMIN, YMIN, ZMIN, and ZMAX) are shown in parentheses and are based on empirical tests with known data sets.60,61

1. Pairs of data are excluded if |(|E1| − |E2|) − median(|E1| − |E2|)| / {1.25 median[|(|E1| − |E2|) − median(|E1| − |E2|)|]} > TMAX (6.0).
2. Pairs of data are excluded for which either |E1|/σ(|E1|) or |E2|/σ(|E2|) < XMIN (3.0).
3. Pairs of data are excluded if ||E1| − |E2|| / [σ²(|E1|) + σ²(|E2|)]^(1/2) < YMIN (1.0).
4. Normalized differences |ΔE| are excluded if |ΔE|/σ(|ΔE|) < ZMIN (3.0).
5. Normalized differences |ΔE| are excluded if [|ΔE| − |ΔE|MAX]/σ(|ΔE|) > ZMAX (0.0).
60 G. D. Smith, B. Nagar, J. M. Rini, H. A. Hauptman, and R. H. Blessing, Acta Crystallogr. D. Biol. Crystallogr. 54, 799 (1998).
61 P. L. Howell, R. H. Blessing, G. D. Smith, and C. M. Weeks, Acta Crystallogr. D. Biol. Crystallogr. 56, 604 (2000).
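Tests 2–4 translate directly into array operations. The sketch below is a simplified stand-in for DIFFE (NumPy assumed; the synthetic data and propagated uncertainties are illustrative, and tests 1 and 5 are omitted).

    import numpy as np

    def diffe_style_filter(e1, sig_e1, e2, sig_e2, de, sig_de,
                           xmin=3.0, ymin=1.0, zmin=3.0):
        """Keep only pairs whose individual |E| values are significant (test 2), whose
        difference is significant relative to its propagated error (test 3), and whose
        normalized difference magnitude |dE| is significant (test 4)."""
        keep = (e1 / sig_e1 >= xmin) & (e2 / sig_e2 >= xmin)               # test 2
        keep &= np.abs(e1 - e2) / np.sqrt(sig_e1**2 + sig_e2**2) >= ymin   # test 3
        keep &= de / sig_de >= zmin                                        # test 4
        return keep

    rng = np.random.default_rng(3)
    e1, e2 = np.abs(rng.normal(1, 0.5, 500)), np.abs(rng.normal(1, 0.5, 500))
    s1 = np.full(500, 0.1); s2 = np.full(500, 0.1)
    de = np.abs(e1 - e2); sde = np.sqrt(s1**2 + s2**2)
    mask = diffe_style_filter(e1, s1, e2, s2, de, sde)
    print(mask.sum(), "of 500 difference magnitudes retained")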
The parameter TMAX is used to reject data with unreliably large values of ||E1| − |E2|| in the tails of the (|E1| − |E2|) distribution. This test assumes that the distribution of (|E1| − |E2|)/σ(|E1| − |E2|) should approximate a zero-mean unit-variance normal distribution for which values less than −TMAX or greater than +TMAX are extremely improbable. The quantity |ΔE|MAX is a physical least upper bound such that |ΔE|MAX = Σ|Δfj|/[ε Σ|fj|²]^(1/2) for SIR data and |ΔE|MAX = Σfj″/[ε Σ(fj″)²]^(1/2) for SAD data.

Resolution Cutoffs. Before attempting to use MAD or SAD data to locate the anomalous scatterers, a critical decision is to choose the resolution to which the data should be truncated. If data are used to a higher resolution than is supported by significant dispersive and anomalous information, the effect will be to add noise. Because direct methods are based on normalized structure factors, which emphasize the high-resolution data, they are particularly sensitive to this. Because there is some anomalous signal at all the wavelengths in the MAD experiment, a good test is to calculate the correlation coefficient between the signed anomalous differences ΔF at different wavelengths as a function of the resolution. A good general rule is to truncate the data where this correlation coefficient falls below 25–30%. Table III (calculated using XPREP39) illustrates three different cases. In case A, the high values involving the peak (PK) and inflection-point (IP) data show that it is not necessary to truncate the data because there is significant MAD information at the highest resolution collected. A poorer correlation would be expected with the low-energy remote data (LR), which has a much smaller anomalous signal. In case B, it is advisable to truncate the data to about 3.9 Å (which indeed led to a successful solution using SHELXD). Case C is clearly hopeless and, in fact, could not be solved. For SAD data collected at a single wavelength, it is still possible to use the correlation coefficient between the anomalous differences collected from two crystals, or from one crystal in two orientations, before merging the two data sets. Such information is also available from the CCP4 programs SCALA and REVISE (see Collaborative Computational Project Number 4, below).

Heavy-Atom Searching and Phasing

The phase problem of X-ray crystallography may be defined as the problem of determining the phases of the normalized structure factors E when only the magnitudes |E| are given. Owing to the atomicity of crystal structures and the redundancy of the known magnitudes, the phase problem is overdetermined. This overdetermination implies the existence of relationships among the phases that are dependent on the known magnitudes alone, and the techniques of probability theory have identified the linear
TABLE III
Correlation Coefficients (%) Between High-Energy Remote Data and Other Wavelengths as a Function of Resolution Range

A. Apical domain,a (3 SeMet in 144 residues), C2221
Abbreviations: PK, peak; IP, inflection point; LR, low-energy remote.
a M. A. Walsh, I. Dementieva, G. Evans, R. Sanishvili, and A. Joachimiak, Acta Crystallogr. D. Biol. Crystallogr. 55, 1168 (1999).
b M. Selmer, S. Al-Karadaghi, G. Hirokawa, A. Kaji, and A. Liljas, Science 286, 2349 (1999).
combinations of three phases whose Miller indices sum to zero (i.e., ΦHK = φH + φK + φ−H−K) as relationships useful for determining unknown structures. (The quantities ΦHK are known as structure invariants because their values are independent of the choice of origin of the unit cell.) The conditional probability distribution of the three-phase or triplet invariants depends on the parameter AHK, where AHK = (2/N^(1/2))|EH EK E−H−K| and N is the number of atoms, here presumed to be identical, in the asymmetric unit of the corresponding primitive unit cell.62 Probabilistic estimates of the invariant values are most reliable when the associated normalized magnitudes (|EH|, |EK|, and |E−H−K|) are large and the number of atoms in the unit cell is small. Thus, it is the largest |ΔE| or |EA|, remaining after the application of all appropriate cutoffs, that are phased in direct-methods substructure determinations. The triplet invariants involving these reflections are generated, and a sufficient number of those invariants with the highest AHK values are retained to achieve the desired invariant-to-reflection ratio (e.g., SnB uses a default ratio of 10:1). The inability to obtain a sufficient
62 W. Cochran, Acta Crystallogr. 8, 473 (1955).
number of accurate invariant estimates is the reason why full-structure phasing by direct methods is possible only for the smallest proteins.

‘‘Multisolution’’ Methods and Trial Structures. Once the values for some pairs of phases (φK and φ−H−K) are known, the triplet structure invariants can be used to generate further phases (φH) which, in turn, can be used iteratively to evaluate still more phases. The number of cycles of phase expansion or refinement that must be performed depends on the size of the structure to be determined. Older, conventional, direct-methods programs operate in reciprocal space alone, but the SnB and SHELXD programs alternate phase improvement in both reciprocal and real spaces within each cycle. To obtain starting phases, a so-called multisolution or multitrial approach63 is taken in which the reflections are each assigned many different starting values in the hope that one or more of the resultant phase combinations will lead to a solution. Solutions, if they occur, must be identified on the basis of some suitable figure of merit. Typically, a random-number generator is used to assign initial values to all phases from the outset.64 A variant of this procedure employed in SnB is to use the random-number generator to assign initial coordinates to the atoms in the trial structures and then to obtain initial phases from a structure-factor calculation.

The efficiency of direct methods, however, often can be improved considerably by using better-than-random starting trial structures that are, in some way, consistent with the Patterson function. In SHELXD, this is accomplished by computing a Patterson minimum function (PMF)65 to screen for likely candidates. First, one presumes that the strongest general Patterson peaks may well correspond to a vector between two heavy atoms. For a selected number (e.g., 100) of these vectors, the pair of atoms related by the vector are subjected to a number of random translations (e.g., 99,999). For each of these potential two-atom trial structures, all the symmetry-equivalent atoms are found, the Patterson-function values corresponding to the unique vectors between all of these atoms are calculated and sorted in ascending order, and then the PMF scoring criterion is computed as the mean value of the lowest (e.g., 30%) values in this list. For each two-atom vector, the random translation with the highest PMF is retained. Next, the two-atom trial structures are extended to N atoms by using a technique that involves the computation of a full-symmetry Patterson superposition minimum function (PSMF).37 A list containing all symmetry equivalents of the two starting atoms is generated. Then, each pixel of the PSMF map is
63 G. Germain and M. M. Woolfson, Acta Crystallogr. B 24, 91 (1968).
64 R. Baggio, M. M. Woolfson, J.-P. Declercq, and G. Germain, Acta Crystallogr. A 34, 883 (1978).
65 C. E. Nordman, Trans. Am. Crystallogr. Assoc. 2, 29 (1966).
assigned a value equal to the PMF for all vectors in the list and a dummy atom placed at that pixel. Finally, the N − 2 highest peaks in the PSMF map are obtained by interpolation and sorting, and then they are added to the trial structure. Tests using SHELXD have shown that this combination of direct and Patterson methods produces more complete and precise solutions than just using the Patterson methods alone. To make this method applicable in space group P1, SHELXD places an extra atom at the origin and performs random translations of the two-atom fragment.

Reciprocal-Space Phase Refinement or Expansion: Shaking. Once a set of initial phases has been chosen, it must be refined against the set of structure invariants whose values are presumed known. So far, two optimization methods (tangent refinement and parameter-shift reduction of the minimal function) have proved useful for extracting phase information in this way. Both of these optimization methods are available in both SnB and SHELXD, but SnB uses the minimal function by default whereas SHELXD uses the tangent formula. The tangent formula66

$$\tan(\varphi_H) = \frac{\sum_{K}\left|E_K E_{H-K}\right|\sin(\varphi_K+\varphi_{H-K})}{\sum_{K}\left|E_K E_{H-K}\right|\cos(\varphi_K+\varphi_{H-K})} \qquad (2)$$
is the relationship used in conventional direct-methods programs to compute φH given a sufficient number of pairs (φK, φH−K) of known phases. It is also an option within the phase-refinement portion of the dual-space Shake-and-Bake procedure.67,68 In each cycle, SnB uses the tangent formula to redetermine all the phases, a process referred to as tangent-formula refinement. On the other hand, SHELXD performs a process of tangent expansion in which, during each cycle, the phases of (typically) the 40% highest calculated E magnitudes are held fixed while the phases of the remaining 60% are determined by the tangent formula. The tangent formula suffers from the disadvantage that, in space groups without translational symmetry, it is perfectly fulfilled by a false solution with all phases equal to zero, thereby giving rise to the so-called ‘‘uranium-atom’’ solution with one dominant peak in the corresponding Fourier synthesis. In conventional direct-methods programs, the tangent formula is often modified in various ways to include (explicitly or implicitly) information from the so-called negative quartet or four-phase structure invariants69,70 that are
66 J. Karle and H. A. Hauptman, Acta Crystallogr. 9, 635 (1956).
67 C. M. Weeks, H. A. Hauptman, C.-S. Chang, and R. Miller, Trans. Am. Crystallogr. Assoc. 30, 153 (1994).
68 G. M. Sheldrick and R. O. Gould, Acta Crystallogr. B 51, 423 (1995).
dependent on the smallest as well as the largest E magnitudes. Such modified tangent formulas do indeed largely overcome the problem of false minima for small structures, but because of the dependence of quartet term probabilities on 1/N, they are little more effective than the normal tangent formula for large structures. Constrained minimization of an objective function like the minimal function71,72

$$R(\varphi) = \sum_{H,K} A_{HK}\left[\cos\Phi_{HK} - I_1(A_{HK})/I_0(A_{HK})\right]^2 \Big/ \sum_{H,K} A_{HK} \qquad (3)$$
provides an alternative approach to phase refinement or phase expansion. R(φ) is a measure of the mean-square difference between the values of the triplets calculated using a particular set of phases and the expected probabilistic values of the same triplets as given by the ratio of modified Bessel functions [i.e., I1(AHK)/I0(AHK)]. The minimal function is expected to have a constrained global minimum when the phases are equal to their correct values for some choice of origin and enantiomorph. The minimal function also can be written to include contributions from quartet invariants, although their use is not as imperative as with the tangent formula because the minimal function does not have a minimum when all phases are zero. An algorithm known as parameter shift73 has proved to be quite powerful and efficient as an optimization method when used within the Shake-and-Bake context to reduce the value of the minimal function. For example, a typical phase-refinement stage consists of three iterations or scans through the reflection list, with each phase being shifted a maximum of two times by 90° in either the positive or negative direction during each iteration. The refined value for each phase is selected, in turn, through a process that involves evaluating the minimal function using the original phase and each of its shifted values.53 The phase value that results in the lowest minimal-function value is chosen at each step. Refined phases are used immediately in the subsequent refinement of other phases.

Real-Space Constraints: Baking. Peak picking is a simple but powerful way of imposing an atomicity constraint. Karle74 found that even a relatively small, chemically sensible, fragment extracted by manual interpretation of a small-molecule electron-density map could be expanded
69 H. Schenk, Acta Crystallogr. A 30, 477 (1974).
70 H. Hauptman, Acta Crystallogr. A 30, 822 (1974).
71 T. Debaerdemaeker and M. M. Woolfson, Acta Crystallogr. A 39, 193 (1983).
72 G. T. DeTitta, C. M. Weeks, P. Thuman, R. Miller, and H. A. Hauptman, Acta Crystallogr. A 50, 203 (1994).
73 A. K. Bhuiya and E. Stanley, Acta Crystallogr. 16, 981 (1963).
74 J. Karle, Acta Crystallogr. B 24, 182 (1968).
into a complete solution by transformation back to reciprocal space and then performing additional iterations of phase refinement with the tangent formula. Automatic real-space electron-density map interpretation in the Shake-and-Bake procedure consists of selecting an appropriate number of the largest peaks in each cycle to be used as an updated trial structure without regard to chemical constraints other than a minimum allowed distance between atoms (e.g., 1.0 Å for full structures and 3–3.5 Å for substructures). If markedly unequal atoms are present, appropriate numbers of peaks (atoms) can be weighted by the proper atomic numbers during transformation back to reciprocal space in a subsequent structure-factor calculation. Thus, a priori knowledge concerning the chemical composition of the crystal is used, but no knowledge of constitution is required or used during peak selection. It is useful to think of peak picking in this context as simply an extreme form of density modification appropriate when the resolution of the data is small compared with the distance separating the atoms. In theory, under appropriate conditions it should be possible to substitute alternative density-modification procedures such as low-density elimination75,76 or solvent flattening,27 but no practical applications of such procedures have yet been made. The imposition of physical constraints counteracts the tendency of phase refinement to propagate errors or produce overly consistent phase sets. For example, the ability to eliminate chemically impossible peaks at special positions using a symmetry-equivalent cutoff distance (similar to the procedure described in the Crystallography and NMR System section) prevents the occurrence of most cases of false minima.10

In its simplest form as implemented in the SnB program, peak picking consists of simply selecting the top N E-map peaks, where N is the number of unique nonhydrogen atoms in the asymmetric unit. This is adequate for small-molecule structures. It has also been shown to work well for heavy-atom or anomalously scattering substructures where N is taken to be the number of expected substructure sites.60,77 For larger structures or substructures (e.g., N > 100), the number of peaks selected is reduced to 0.8N peaks, thereby taking into account the probable presence of some atoms that, owing to high thermal motion or disorder, will not be visible. An alternative approach to peak picking used in SHELXD is to begin by selecting approximately N top peaks, but then to eliminate some of them (typically one-third) at random. By analogy to the common practice in macromolecular crystallography of omitting part of a structure from a
75 M. Shiono and M. M. Woolfson, Acta Crystallogr. A 48, 451 (1992).
76 L. S. Refaat and M. M. Woolfson, Acta Crystallogr. D. Biol. Crystallogr. 49, 367 (1993).
77 M. A. Turner, C.-S. Yuan, R. T. Borchardt, M. S. Hershfield, G. D. Smith, and P. L. Howell, Nat. Struct. Biol. 5, 369 (1998).
Fourier calculation in hopes of finding an improved position for the deleted fragment, this version of peak picking is described as making a random omit map. It has the potential for being a more efficient search algorithm.

Scoring Trial Structures

SnB and SHELXD compute figures of merit that allow the user to judge the quality of a trial structure and decide whether or not it is a solution. It is worth repeating the caution given above (see Crystallography and NMR System). Although it is sometimes possible to give absolute values that strongly indicate a solution, it is safer to consider relative values. A true solution should have one or more figure-of-merit values that are outstanding relative to the nonsolutions, which generally are in the majority.

Minimal Function. The minimal function itself, R(φ) [Eq. (3)], is a highly reliable figure of merit, provided that it has been calculated directly from the constrained phases corresponding to the final peak positions.53 This figure of merit is computed by both programs, and solutions typically have the smallest values. The SnB graphical user interface provides an option for checking the status of a running job by displaying a histogram of the minimal-function values for all trials that have been processed so far, as illustrated in Fig. 3 for the peak-anomalous difference data for a 30-site selenomethionyl (SeMet) substructure.77 A clear bimodal distribution of figure-of-merit values is a strong indication that a solution has, in fact, been found. Confirmation that this is true for trial 913 in the example in Fig. 3 can be obtained by inspecting a trace of the minimal-function value as a function of refinement cycle (Fig. 4). Solutions usually show an abrupt decrease in value over a few cycles, followed by stability at the lower value.

Crystallographic R. SnB and SHELXD compute RCRYST = Σ||EO| − |EC|| / Σ|EO|. This figure of merit, which is also highly reliable, has small values for solutions.

PATFOM. The Patterson figure of merit, PATFOM, is the mean Patterson minimum function value for a specified number of atoms. It is computed by SHELXD. Although the absolute value depends on the structure in question, solutions almost always have the largest PATFOM values.

Correlation Coefficient. The correlation coefficient42 computed in SHELXD is defined by

$$\mathrm{CC} = \frac{\sum wE_oE_c\sum w - \sum wE_o\sum wE_c}{\left\{\left[\sum wE_o^2\sum w - \left(\sum wE_o\right)^2\right]\left[\sum wE_c^2\sum w - \left(\sum wE_c\right)^2\right]\right\}^{1/2}} \qquad (4)$$

with default weights w = 1/[0.1 + σ²(E)].
Fig. 3. This bimodal histogram of minimal function (RMIN) values for 1000 trials suggests that there are 39 solutions. RTRUE and RRANDOM are theoretical values for true and random phase sets, respectively.53
Fig. 4. Plots of the minimal-function value over 60 cycles (a) for a solution (trial 913) and (b) for a nonsolution (trial 914).
with default weights w = 1/[0.1 + σ²(E)]. Solutions typically have the largest values for this figure of merit. Values of 0.7 or greater when based on all, or almost all, of the |E| data for full structures strongly indicate that a solution has been found. Also, when computed in SHELXD for substructures using |EA| data, values greater than 0.4 typically indicate a solution. SnB also computes a correlation coefficient, but this criterion has not been found to be reliable for substructures when based on the limited number of |E| difference data normally used.
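Two of these figures of merit are simple enough to compute directly from arrays of observed and calculated |E| values. The sketch below uses numpy; the array names and the fallback unit weights are illustrative, and the weighting scheme shown is only the default quoted above.

# Sketch of the crystallographic R and the weighted correlation coefficient
# of Eq. (4); not taken from SnB or SHELXD source code.
import numpy as np

def r_cryst(e_obs, e_calc):
    """Crystallographic R on normalized structure-factor magnitudes."""
    return np.sum(np.abs(e_obs - e_calc)) / np.sum(e_obs)

def weighted_cc(e_obs, e_calc, sigma_e=None):
    """Weighted correlation coefficient between |Eo| and |Ec| [Eq. (4)]."""
    w = np.ones_like(e_obs) if sigma_e is None else 1.0 / (0.1 + sigma_e**2)
    sw = np.sum(w)
    num = sw * np.sum(w * e_obs * e_calc) - np.sum(w * e_obs) * np.sum(w * e_calc)
    den_o = sw * np.sum(w * e_obs**2) - np.sum(w * e_obs)**2
    den_c = sw * np.sum(w * e_calc**2) - np.sum(w * e_calc)**2
    return num / np.sqrt(den_o * den_c)

As emphasized above, it is the separation between a trial's values and the bulk of the nonsolutions, rather than any absolute number, that identifies a solution.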
Site Validation
Direct-methods programs provide as output a file of peak positions, for one or more of the best trials, sorted in descending order according to the electron density at those positions on the Fourier map. For an N-site substructure, SnB provides 1.5N peaks for each trial. The user must then decide which, and how many, of these peaks correspond to actual atoms. The first N peaks have the highest probability of being correct, and in many cases this simple guideline is adequate. Sometimes, there will be a significant break in the density values between true and false peaks, and, when this occurs in the expected place, it is additional confirmation. In other cases, a conservative approach is to accept the 0.8N to 0.9N top peaks, compute a difference Fourier map, and compare the peaks on this map to the original direct-methods map.
Crossword Tables. The Patterson superposition function is the basis of the crossword table,78,79 introduced in SHELXS-8680 and available also in SHELXD, that provides another way to assess which of the heavy-atom sites are correct and, in some cases, to recognize the presence of noncrystallographic symmetry. Each entry in the table links the potential atom forming the row with the potential atom forming the column. For each pair of atoms, the top number is the minimum distance between them, taking the space-group symmetry into account. The bottom number is the Patterson minimum function (PMF) value calculated from all vectors between the two atoms, also taking symmetry into account. The first vertical column is based on the self-vectors (i.e., the vectors between one atom and its symmetry equivalents). In general, wrong sites can be recognized by the presence in the table of several zero PMF values (negative values are replaced by zero). Table IV shows the crossword table for the Cu Kα anomalous ΔF data for a HiPIP with two Fe4S4 clusters in the asymmetric unit.81 It is easy to find the two clusters (atoms 1–4 and 5–8) by looking for Fe–Fe distances of approximately 2.8 Å, and the PMF values for the eight correct atoms are, in general, higher than those involving spurious atoms despite the weakness of the anomalous signal.
78 G. M. Sheldrick, Z. Dauter, K. S. Wilson, and L. C. Sieker, Acta Crystallogr. D Biol. Crystallogr. 49, 18 (1993).
79 G. M. Sheldrick, in "Direct Methods for Solving Macromolecular Structures" (S. Fortier, ed.), p. 131. Kluwer Academic, Dordrecht, The Netherlands, 1998.
80 G. M. Sheldrick, J. Mol. Struct. 130, 9 (1985).
81 I. Rayment, G. Wesenberg, T. E. Meyer, M. A. Cusanovich, and H. M. Holden, J. Mol. Biol. 228, 672 (1992).
TABLE IV Crossword Table for Location of Eight Iron Atoms
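The two numbers in each crossword-table entry can be produced as follows. This is a schematic sketch only: it assumes a callable patterson(u, v, w) that returns an interpolated Patterson value for a fractional-coordinate vector, a list symops of (rotation matrix, translation) pairs, numpy arrays of fractional site coordinates, and an orthogonal-cell approximation for distances; none of these names comes from SHELXD itself.

# Schematic crossword-table entry for one pair of candidate sites.
import numpy as np

def min_image(vec):
    """Map a fractional vector into the range [-0.5, 0.5)."""
    return vec - np.round(vec)

def crossword_entry(site_i, site_j, symops, patterson, cell_lengths):
    """Return (minimum distance, Patterson minimum function) for one pair."""
    dists = []
    pmf = None
    for rot, trans in symops:
        v = min_image(site_i - (rot @ site_j + trans))
        dists.append(np.linalg.norm(v * cell_lengths))   # orthogonal-cell approximation
        p = patterson(*v)
        pmf = p if pmf is None else min(pmf, p)
    return min(dists), max(pmf, 0.0)                     # negative PMF values -> 0

A row or column containing several zero PMF entries flags a probably spurious site, as described above.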
Comparison of Trials. When trying to decide which peaks are correct, it is also helpful to compare the peak positions from two or more solutions. Peaks recurring in several solutions are more likely to be real. However, in order to do this comparison, one must take into account the fact that different solutions may have different origins and/or enantiomorphs. A stand-alone program for doing this is available,82 and the capability of making such comparisons automatically for all space groups will be available in future versions of SnB and SHELXD. The usefulness of peak correlation is illustrated by an example for a 30-site SeMet substructure.61,77 Table V presents the relative rankings of peaks, from nine other trials, that correspond to peaks 29–45 of trial 149, which had the lowest minimal-function value for the peak-wavelength difference data for crystal 1. The top 29 peaks for trial 149 were correct selenium positions, but peak 30 (the Nth peak) was spurious. Peak 33 of trial 149 was found to have a match on every other map, and indeed, it did correspond to the final selenium site. It appears that, in general, the same noise is not reproduced on different maps, especially maps originating from different data sets. Thus, peak correlation can be used to identify correct peaks ranking below the Nth peak.
82 G. D. Smith, J. Appl. Crystallogr. 35, 368 (2002).
TABLE V Trial Comparison for 30-Site Substructure
Each column corresponds to one trial, identified by crystal, wavelength (a), and trial number: 1/PK/149, 1/PK/31, 1/PK/158, 1/PK/165, 1/PK/176, 1/IP/104, 1/HR/23, 2/IP/476, 2/PK/93, and 2/HR/86. The rows correspond to peaks ranked 29, 31, 33, 34, 37, 39, 40, and 45 in trial 149, and the table entries give the rank of the matching peak, if any, in each of the other nine trials.
(a) The wavelengths are peak (PK), inflection point (IP), and high-energy remote (HR).
Enantiomorph Determination
Because all publicly distributed direct-methods programs, including SnB and SHELXD, work with only |E|, difference |E|, or |EA| values, they have no way to determine the proper hand. Both enantiomorphs are found with equal frequency among the solutions. If a structure crystallizes in an enantiomorphic space group, either of the space groups may be used during the direct-methods step, but chances are 50% that, at a later stage, the coordinates will have to be inverted and the space group changed to its enantiomorph in order to produce an interpretable protein map. A direct-methods formalism has been proposed83 that uses both |E+| and |E−| and, in theory, should make it possible to produce only solutions with the proper hand. However, this theory has never been successfully applied to actual experimental data. Similarly, it should be noted that solutions occur at all permitted origin positions with equal frequency. This means that, in the MIR case, cross-phasing is necessary to ensure that all derivatives are referred to the same origin. A direct-methods formalism84 exists that should automatically do this, but it has never been implemented in a distributed program.
83 H. Hauptman, Acta Crystallogr. A 38, 632 (1982).
84 S. Fortier, C. M. Weeks, and H. Hauptman, Acta Crystallogr. A 40, 646 (1984).
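Trying the other hand is mechanically simple. The sketch below, which is only an illustration, inverts fractional coordinates and swaps an enantiomorphic space group for its partner; the dictionary lists only a few example pairs, and the handful of space groups (e.g., I41, F4132) that require inversion about a point other than the origin are deliberately ignored here.

# Minimal sketch of switching a substructure to the opposite hand.
ENANTIOMORPH_PAIRS = {"P41": "P43", "P43": "P41",
                      "P31": "P32", "P32": "P31",
                      "P6122": "P6522", "P6522": "P6122"}

def invert_hand(sites, space_group):
    """sites: list of (x, y, z) fractional coordinates."""
    inverted = [(-x % 1.0, -y % 1.0, -z % 1.0) for x, y, z in sites]
    return inverted, ENANTIOMORPH_PAIRS.get(space_group, space_group)

Both hands are then carried through phasing, and the one that yields an interpretable protein map is kept.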
Substructure Refinement
Fourier refinement, often called E-Fourier recycling, has been used for many years in direct-methods programs to improve the quality and completeness of solutions.85 Additional refinement cycles are performed in real space alone, using many more reflections than is possible in the direct-methods steps that are dependent on the accuracy of triplet-invariant relationships. In SHELXD, the final model can be improved further by occupancy or isotropic displacement parameter (Biso) refinement for the individual atoms,86 followed by calculation of the Sim87- or sigma-A88-weighted map. The development of a common interface89 for SnB and the PHASES package90 permits coordinates determined by direct methods to be passed easily for conventional substructure phase refinement and protein phasing, and for SHELXD this facility is provided by the program SHELXE.90a
Collaborative Computational Project Number 4
Unlike many other packages, the Collaborative Computational Project Number 4 (CCP4) suite is a set of separate programs that communicate via standard data files rather than having all operations integrated into one huge program. This has some disadvantages in that it is less easy for programs to make decisions about what operation to do next even though communication is now being coordinated through a graphical user interface (CCP4i). The advantage of loose organization is that it is easy to add new programs or to modify existing ones without upsetting other parts of the suite. Data Preparation The CCP4 suite provides a number of programs (i.e., SCALA,91 TRUNCATE,92 and SCALEIT) that are useful in preparing data for experimental phasing. SCALA treats scaling and merging as different operations, thereby allowing an analysis of data quality before merging. For isomorphous replacement studies, the native data can be used as the reference set, and all of the derivatives scaled to it. This provides 85
85 G. M. Sheldrick, in "Crystallographic Computing" (D. Sayre, ed.), p. 506. Clarendon Press, Oxford, 1982.
86 I. Usón, G. M. Sheldrick, E. de la Fortelle, G. Bricogne, S. di Marco, J. P. Priestle, M. G. Grütter, and P. R. E. Mittl, Struct. Fold. Des. 7, 55 (1999).
87 G. A. Sim, Acta Crystallogr. 12, 813 (1959).
88 R. J. Read, Acta Crystallogr. A 42, 140 (1986).
89 C. M. Weeks, R. H. Blessing, R. Miller, R. Mungee, S. A. Potter, J. Rappleye, G. D. Smith, H. Xu, and W. Furey, Z. Kristallogr. 217, 686 (2002).
90 W. Furey and S. Swaminathan, Methods Enzymol. 277, 590.
90a G. M. Sheldrick, Z. Kristallogr. 217, 644 (2002).
91 P. R. Evans, in "Recent Advances in Phasing." Proceedings of CCP4 Study Weekend (1997).
92 G. S. French and K. S. Wilson, Acta Crystallogr. A 34, 517 (1978).
well-parameterized "local" scales. For MAD data, all sets are scaled in one pass, gross outliers are rejected (e.g., any measurement four to five times greater than the mean), and then each data set is merged separately to give a weighted mean for each reflection. A detailed analysis of the data is provided in a graphical form. Useful information is given on the scale factors themselves (which can often pinpoint rogue images), on the Rmerge values, and on the correlation coefficients between wavelengths for MAD data (coefficients <0.4 suggest a resolution cutoff; see discussion of Table III). Various scaling models related to the experiment can be used. The scale factor is a function of the primary beam direction, treated either as a smooth function of the rotation angle or as an image-by-image correction. In addition, the scale may be a function of the secondary beam direction, acting principally as an absorption correction, expanded either as spherical harmonics or as an interpolated three-dimensional function of the rotation angle and the spatial coordinates of the measured spot on the detector. The secondary beam correction is related to the absorption anisotropy correction described by Blessing,93 and the interpolated three-dimensional correction is similar to that described by Kabsch.94 Optimum scaling depends a great deal on exactly how the data were collected, and it is not possible to lay down rules for all cases.
TRUNCATE can convert merged intensities to amplitudes in two ways. The simplest way is just to take the square root of the intensities, setting any negative values to zero. Alternatively, a best estimate of F can be calculated from I, σ(I), and the distribution of intensities in resolution shells. This has the effect of forcing all negative observations to be positive and of inflating the weakest reflections (<3σ) because an observation significantly smaller than the average intensity is likely to be underestimated. TRUNCATE also analyzes the data to verify that the expected distributions are satisfied. It generates a Wilson plot that should be linear for resolution shells greater than 4 Å, moments for the intensities (which are excellent indicators of twinning), the cumulative intensity distribution (another clue to both twinning and sometimes noncrystallographic symmetry), and an analysis of anisotropy. All these criteria need to be examined carefully before using the data.
93 R. H. Blessing, Acta Crystallogr. A 51, 33 (1995).
94 W. Kabsch, J. Appl. Crystallogr. 21, 916 (1988).
95 P. L. Howell and G. D. Smith, J. Appl. Crystallogr. 25, 81 (1992).
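Only the simpler of the two conversions described above is sketched here, under the assumption that merged intensities and their standard deviations are already available as numpy arrays; the Bayesian French and Wilson estimate, which inflates weak and negative intensities using the intensity distribution in resolution shells, is what TRUNCATE actually prefers and is not reproduced.

# Sketch of the plain square-root conversion of intensities to amplitudes.
import numpy as np

def simple_truncate(intensities, sigmas):
    """Return (F, sigF) from merged intensities by plain square root."""
    i_pos = np.clip(intensities, 0.0, None)      # negative I set to zero
    f = np.sqrt(i_pos)
    # Propagate sigma(I) to sigma(F), guarding against division by zero
    # for the weakest reflections.
    sig_f = np.where(f > 0.0, sigmas / (2.0 * f), np.sqrt(sigmas))
    return f, sig_f

The statistical checks listed above (Wilson plot, intensity moments, cumulative distribution) should still be run on the intensities before any such conversion is trusted.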
SCALEIT puts all data sets on the same relative scale and uses normal probability plots95 to test whether the differences between them are significant. First, the reflections in each resolution bin are sorted according to the value of δ(real) = (FP − FPH)/[σ²(FP) + σ²(FPH)]^1/2, where FPH and σ(FPH) are the scaled values for the derivative. For each reflection, the corresponding δ(expected) is then calculated assuming a normal distribution, and δ(real) is plotted against δ(expected). If the native and scaled derivative data sets are essentially identical (in statistical parlance, they represent two samplings of the same population), the normal probability plot will be linear with a slope of unity and an intercept of zero. The size of the substructure contribution can be gauged by the deviation of the slope and intercept from these values, and the variation with resolution indicates to what resolution the heavy-atom contribution extends. A similar analysis can be applied to MAD data to estimate the significance of the dispersive and anomalous differences.
96 H.-F. Fan, M. M. Woolfson, and J.-X. Yao, Proc. R. Soc. Lond. A 442, 13 (1993).
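The normal probability analysis described above can be sketched in a few lines. The function names and the use of a simple least-squares fit are assumptions for illustration; a production implementation would work bin by bin in resolution and use more careful expected order statistics.

# Sketch of a normal probability plot for isomorphous differences.
import numpy as np
from scipy.stats import norm

def probability_plot(fp, sig_fp, fph, sig_fph):
    """Return (slope, intercept) of delta(real) regressed on delta(expected)."""
    delta_real = np.sort((fp - fph) / np.sqrt(sig_fp**2 + sig_fph**2))
    n = delta_real.size
    # Expected quantiles for n samples drawn from N(0, 1).
    delta_expected = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    slope, intercept = np.polyfit(delta_expected, delta_real, 1)
    return slope, intercept

A slope close to unity indicates no significant heavy-atom signal in that bin; a slope noticeably greater than unity indicates a usable substructure contribution.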
Heavy-Atom Searching and Phasing
The CCP4 suite includes two direct-methods programs that can be used to locate heavy-atom sites using a variety of difference structure-factor coefficients. The simplest approach is to use the best SAD or SIR difference. Program REVISE96 can be used to estimate FA or FM, the full contribution from the substructure. Normalized difference magnitudes, |E|, are computed using program ECALC.
RANTAN. RANTAN7 is a classic direct-methods program that performs reciprocal-space phase refinement. The program determines reflections for fixing the origin and enantiomorph, and then assigns a set of random phases with default weights of 0.25 to a starting set of large |E| values. The phases are refined by the tangent formula and expanded to include the whole set of large E magnitudes. Up to five sets of refined phases and weights with the best combined figures of merit are output.
ACORN. ACORN8 is a fast ab initio procedure for solving structures when the data are sufficient to separate atomic sites in the E maps. In the case of substructures, 4-Å data (sometimes even lower) will usually suffice. The initial phase sets are generated from the atomic coordinates of a putative structural fragment. The fragment can be made up in various ways. In simple cases, such as metalloproteins or heavy-atom substructures, it is sufficient to generate many trial structures starting from a single randomly placed atom. The reflections are divided into three groups (strong, medium, and weak) according to their |E| values. Correlation coefficients (CC; see Dual-Space Direct Methods, above) between the observed and calculated E values for each class are used in different ways throughout the procedure. All reflections are used to select likely trial sets. The strong and weak reflections are used in the phase refinement, and the CC for the medium reflections provides a simple criterion of correctness for a phase set.
The starting phase sets are refined primarily using dynamic density modification, supplemented by Patterson superposition and real-space Sayre equation refinement. Dynamic density modification (DDM) eliminates the negative densities and truncates the highest density. For the first cycle, this truncation will occur at the sites of the starting coordinate(s). During later cycles, the density is modified according to a formula based on the standard deviation of the map and the cycle number. Patterson superposition generates a semisharpened Patterson sum-function map from the starting fragment. Sayre equation refinement is carried out in real space, using fast Fourier transforms instead of working directly with the phase relationships. The equations are identical, but the real-space formulation is much faster. ACORN first uses DDM for many cycles. Then, if no solution can be found, a few cycles of Sayre equation refinement are performed. This may modify the phase set sufficiently to allow the DDM algorithm to function more effectively.
Scoring Trial Structures
ACORN will stop automatically if the value of CC for the medium E values becomes greater than a preset value during DDM, thereby indicating that a probable solution has been found. This CC value needs to be adjusted according to the data quality, particularly when searching for anomalous scatterers using SAD or MAD data. Another criterion for success, similar to that used in SnB, is that the same solution is found more than once. In CCP4, this is checked using the phased translation function, a function that detects similar solutions after taking both hands (enantiomorphs) and alternative origins into account. The third, and most significant, criterion is whether the trial solution gives the appropriate number of sites with more-or-less appropriate peak heights.
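The real-space route to the Sayre-equation step mentioned above can be illustrated very simply: square the current map, Fourier transform it, and keep the phases. The toy sketch below ignores the grid choice, scaling, and E-value weighting that ACORN actually applies; it is meant only to show why FFTs replace the explicit phase relationships.

# Toy illustration of Sayre-equation refinement via the transform of rho^2.
import numpy as np

def sayre_phases(rho):
    """rho: 3-D numpy array holding the current map; returns updated phases."""
    f_squared = np.fft.fftn(rho * rho)     # transform of rho squared
    return np.angle(f_squared)             # keep the phases, discard amplitudes

def update_map(amplitudes, phases):
    """Recombine observed amplitudes with the refined phases."""
    return np.real(np.fft.ifftn(amplitudes * np.exp(1j * phases)))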
Site Validation, Enantiomorph Determination, and Substructure Refinement
Within CCP4, the program MLPHARE is used to refine the substructure sites and to generate protein phases. Initially, it is usually sufficient to refine putative sites against the centric data or some other subset. Typically, the refinement is enormously overdetermined (i.e., there are many more observations than parameters), and the refined phases are sufficiently good to allow the cross-checking of sites and the choice of hand. Numerical criteria are the figure of merit and phasing power, both of which are useful criteria for assessing whether a new site is improving the solution or not. However, it is difficult to define an absolute required value for either of these quantities. Another useful criterion is the extended Cullis R factor, defined as <lack of closure>/<isomorphous difference>. (The isomorphous difference is |FPH − FP|; the lack of closure is |FPH − |FP + FH||, where |FP + FH| is the vector sum of the calculated FH and FP using the current best protein phases.) This is the most reliable signal for a usable derivative. For centric data, values less than 0.6 are excellent, and values greater than 0.9 indicate that something is not right. If a new site does not reduce the existing Cullis R value, it is probably not correct.
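For centric reflections the Cullis R factor is particularly easy to write down, because FP and FH are collinear and the vector sum reduces to an algebraic sum with a signed FH. The sketch below assumes that the arrays come from an MLPHARE-style refinement; the names are placeholders, not program variables.

# Sketch of the Cullis R factor for centric reflections.
import numpy as np

def cullis_r(fp, fph, fh_signed):
    """fp, fph: native and derivative amplitudes; fh_signed: signed heavy-atom contribution."""
    iso_diff = np.abs(fph - fp)
    lack_of_closure = np.abs(fph - np.abs(fp + fh_signed))
    return np.sum(lack_of_closure) / np.sum(iso_diff)

By the guidelines above, a value below about 0.6 on centric data indicates an excellent derivative, and adding a site should lower, not raise, this number.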
Applications
This section contains a discussion of applications of the programs described above to substructures that can be regarded in some way as being at the cutting edge. These applications include large selenomethionine derivatives, substructures phased by weak anomalous signals, and substructures created by soaking protein crystals in cryobuffers containing concentrated halide salts. The tabulations presented below should not be regarded as a complete survey of the literature. The intention here is to focus on how to use the programs effectively in these challenging situations.
Large Substructures
Improvements in data collection instruments and methods have permitted macromolecular diffraction data, especially small anomalous-scattering differences, to be measured much more accurately. At the same time, the use of genetic engineering to replace methionine by selenomethionine97 (SeMet) has provided a convenient means for inserting many anomalous scatterers into large proteins. In the last 3 or 4 years, this has resulted in a dramatic increase in the size of the substructures that have challenged phasing methods and the programs that implement them. So far, as shown in Table VI, the programs described above (especially those that employ direct methods) have met this challenge well, and the upper size of substructures manageable with current software has clearly not been reached yet.
97 W. A. Hendrickson, J. R. Horton, and D. M. LeMaster, EMBO J. 9, 1665 (1990).
In general, recognizing when a solution has occurred has not been a problem, but selecting all the correct sites has been difficult in a few cases (e.g., carrier protein reductase and lactonizing enzyme) and was aided by a careful consideration of the noncrystallographic symmetry. On the other hand, some of the largest studies have proceeded smoothly once a solution was identified. For example, in the case of the bifunctional enzyme DmpFG, the 86 top selenium sites found by SnB were put into the ADDSOLVE component of the SOLVE package for refinement and to search for additional sites. ADDSOLVE found 14 more sites (total of 100 of 108 Se), and it was followed by solvent flattening using RESOLVE. Then, two rounds of ARP/wARP98 tracing, extending the resolution to 1.7 Å for the native data, found 2330 residues (88%) automatically.
KPHMT. The 2.8-Å, peak-wavelength, anomalous data set (136,609 unique reflections) for the largest SeMet substructure, ketopantoate hydroxymethyltransferase (KPHMT) from Escherichia coli, was highly redundant (average multiplicity per Friedel mate, 10.6), complete (all data, 99.9%; anomalous completeness, 99.8%), accurate (Rsym = 0.120; Ranom = 0.073), and had a good signal-to-noise ratio [I/σ(I) = 25.6 overall; I/σ(I) = 6.0 in the highest resolution shell]. It is assumed that the high quality of the data was important for a successful outcome. Although the KPHMT substructure was originally solved by SnB, it can also be solved by SHELXD using the peak data alone. In fact, the use of Patterson-based trial structures in SHELXD improves the success rate (percentage of trials going to solution) by perhaps an order of magnitude, resulting in one solution every 14 h on an 800-MHz Athlon PC when the full 2.8-Å data set is used. On the other hand, an experimental version of SnB that uses the sine-enhanced minimal function99 also gives a significantly improved success rate relative to the distributed program (SnB version 2.1).
The best SHELXD solution (2.8-Å data, 0.75-Å Fourier map grid) for KPHMT had 145 of the top 160 peaks, and 149 of the top 200 peaks, within 2 Å of the methionine sulfurs in the native structure. These matches could be improved to 152 and 157 peaks, respectively, by combining the phases for the best 16 solutions. In this case, 97 of the peaks were actually within 0.5 Å of the sulfur positions. In comparison, the original SnB solution (3.5-Å data, 2-Å grid) gave corresponding matches of 122 and 127 peaks. The KPHMT structure consists of two independent decamers with N-terminal methionines that have never been found. The strategy followed by von Delft57 in solving the structure was to take the top 120 SnB peaks
98 A. Perrakis, R. Morris, and V. S. Lamzin, Nat. Struct. Biol. 6, 458 (1999).
99 H. Xu, H. A. Hauptman, and C. M. Weeks, Acta Crystallogr. D Biol. Crystallogr. 58, 90 (2002).
TABLE VI Twenty Selenomethionine Substructures with 40 or More Sites

Protein | Space group | d (Å) | kDa/asymmetric unit | Program used (a) | Actual sites | Sites found
Cyanase (b) | P1 | 3.0 | 170 | SHELXD | 40 | 40
Pyruvate dehydrogenase: E1 (c) | P21 | 3.5 | 200 | SnB | 40 | 40
EphB2 receptor SAM domain (d) | P41 | 1.95 | 78 | SnB | 48 | ?
Arylamine N-acetyltransferase (e) | P21212 | 4.0 | 240 | SnB | 48 | 48
MutS repair protein (f) | P212121 | 3.0 | 230 | SnB | 48 | 32
Target protein MP883 (g) | P212121 | 3.0 | 180 | SHELXD | 50 | 50
Confidential | P212121 | 2.4 | 183 | SOLVE | 52 | 52
d-Hydantoinase (h) | C2221 | 3.0 | 300 | SOLVE | 54 | 54
Confidential | P21 | 4.0 | 61 | SOLVE | 56 | 56
Cap-binding complex (i) | P212121 | 3.0 | 300 | SHELXD | 57 | 57
Nicotinamide nucleotide transhydrogenase (j) | P21 | 3.0 | 160 | SHELXD | 59 | 58
Tryparedoxin peroxidase (k) | P21 | 3.2 | 230 | SOLVE | 60 | 46
Gastroenteritis viral protease (l) | P21 | 2.9 | 198 | SnB | 60 | 37
2-Aminoethylphosphonate transaminase (m) | P21 | 2.55 | 270 | SHELXD | 66 | 66
Human HMG-CoA reductase (n) | P21 | 2.6 | 200 | SnB | 68 | 45
Acyl carrier protein reductase (o) | P21 | 3.0 | 204 | SnB | 69 | 31
d-Mannoheptose 6-epimerase (p) | P21 | 3.0 | 370 | SnB | 70 | 65
Muconate lactonizing enzyme (q) | P212121 | 4.0 | 112 | SnB | 80 | 57
Pseudomonas sp. DmpFG (r) | P212121 | 2.2 | 280 | SnB | 108 | 86
Ketopantoate hydroxymethyltransferase (s) | P21 | 3.5 | 567 | SnB | 160 | 120
(a) Program used for the original solution. SnB applications used peak-wavelength anomalous |E| data. SOLVE and SHELXD applications used MAD |EA| data.
(b) M. A. Walsh, Z. Otwinowski, A. Perrakis, P. M. Anderson, and A. Joachimiak, Struct. Fold. Des. 8, 505 (2000).
(c) P. Arjunan, N. Nemeria, A. Brunskill, K. Chandrasekhar, M. Sax, Y. Yan, F. Jordan, J. R. Guest, and W. Furey, Biochemistry 41, 5213 (2002).
(d) C. D. Thanos, K. E. Goodwill, and J. U. Bowie, Science 283, 833 (1999).
(two-thirds of the originally expected 180 sites), refine them with SHARP,20 and locate the other 40 sites using difference Fouriers. This strategy resulted in a map that could be interpreted easily. AEP Transaminase. Table VII compares the application of several programs to the data for the 66-site SeMet substructure of 2-aminoethylphosphonate (AEP) transaminase. The data are of high quality, but the selenium absorption edge was missed because of problems with the beamline at the time of data collection. As a result, what was thought to be the inflection-point data actually had the strongest anomalous signal. Despite this complication, all the programs tested could solve the structure although there is variation with respect to the data set that gives the highest success rate. The superiority of the combined (direct methods and Patterson) approach that uses Patterson-based seeds to generate the starting structures is apparent. Because CNS uses a dead-end criterion to terminate the Patterson search and, typically, the search is abandoned early when the anomalous signal is poor, the average time per trial will usually be less for the less successful runs. The CCP4 program ACORN runs trials in an order dependent on the scoring for a single randomly positioned
(e) J. C. Sinclair, J. Sandy, R. Delgoda, E. Sim, and M. E. Noble, Nat. Struct. Biol. 7, 560 (2000).
(f) M. H. Lamers, A. Perrakis, J. H. Enzlin, H. H. K. Winterwerp, N. deWind, and T. K. Sixma, Nature 407, 711 (2000).
(g) Berkeley Structural Genomics Center, personal communication.
(h) J. Abendroth, K. Niefind, and D. Schomburg, J. Mol. Biol. 320, 143 (2002).
(i) C. Mazza, M. Ohno, A. Segref, I. W. Mattaj, and S. Cusack, Mol. Cell 8, 383 (2001).
(j) P. A. Buckley, J. B. Jackson, T. R. Schneider, S. A. White, D. W. Rice, and P. J. Baker, Struct. Fold. Des. 8, 809 (2000).
(k) M. S. Alphey, C. S. Bond, E. Tetaud, A. H. Fairlamb, and W. N. Hunter, J. Mol. Biol. 300, 903 (2000).
(l) K. Anand, G. J. Palm, J. R. Mesters, S. G. Siddell, J. Ziebuhn, and R. Hilgenfeld, EMBO J. 21, 3213 (2002).
(m) C. C. H. Chen, A. Kim, H. Zhang, A. J. Howard, G. Sheldrick, D. Dunaway-Mariano, and O. Herzberg, Biochemistry 41, 13162 (2002).
(n) E. S. Istvan, M. Palnitkar, S. K. Buchanan, and J. Deisenhofer, EMBO J. 19, 819 (2000).
(o) A. C. Price, Y.-M. Zhang, C. O. Rock, and S. W. White, Biochemistry 40, 12772 (2001).
(p) A. M. Deacon, Y. S. Ni, W. G. Coleman, Jr., and S. E. Ealick, Struct. Fold. Des. 8, 453 (2000).
(q) M. Merckel, T. Kajander, A. M. Deacon, A. Thompson, J. G. Grossman, N. Kalkkinen, and A. Goldman, Acta Crystallogr. D Biol. Crystallogr. 58, 727 (2002).
(r) B. A. Manjasetty, J. Powlowski, and A. Vrielink, Proc. Natl. Acad. Sci. USA 100, 6992 (2003).
(s) F. von Delft, T. Inoue, S. A. Saldanha, H. H. Ottenhof, F. Schmitzbergera, L. M. Birch, V. Dhanaraj, M. Witty, A. G. Smith, T. L. Blundell, and C. Abell, Structure 11, 985 (2003).
TABLE VII Success Rates for 2-Aminoethylphosphonate Transaminase Data Sets (a)

Program | CNS | SnB | SHELXD (b) | SHELXD (c) | ACORN
Trials run | 100 | 1000 | 1000 | 1000 | Variable
Time per trial (d) | 600 (e) | 250 | 90 | 40 | —
Success rate:
  IP | 7% | 12.1% | 15.0% | 42.4% | 1 of 17
  PK | 3 | 4.1 | 9.3 | 38.0 | 1 of 81
  HR | 0 | 0.2 | 2.5 | 14.8 | 0
  IP/HR | — | 16.0 | 6.8 | 12.7 | —
  PK/HR | — | 0.1 | 0.0 | 0.0 | —
  IP + IP/HR | 13 | — | 13.7 | 65.3 | —
  FA | 17 | 3.8 (f) | 6.1 | 56.4 | 1 of 26

(a) The data sets are as follows: inflection point (IP), peak (PK), high-energy remote (HR), IP and HR dispersive differences (IP/HR), PK and HR dispersive differences (PK/HR), combined IP and IP/HR, and FA structure factors computed using XPREP.39
(b) Random-atom trial structures.
(c) Patterson-seeded trial structures.
(d) Seconds on a 300-MHz SGI R12000.
(e) Average time per trial for the FA data set (estimated from a run on an 833-MHz Compaq Alpha).
(f) Optimum parameters differ from the default values used for single-wavelength differences.
starting atom. ACORN terminates as soon as it finds what it regards as a solution. SOLVE builds trial structures in ways that make an exact comparison with the other programs difficult. The inflection-point data for AEP transaminase were input to SOLVE, and the automatic protocol for SAD data (specifying a maximum number of 66 sites) was used. SOLVE found 66 sites in 7 h on a 500-MHz Compaq Alpha (10 h on a 300-MHz SGI R12000), and 65 of these matched the 66 known Se sites with distances in the range of 0.06–0.75 Å. RESOLVE then took the 66 sites and found all six NCS operators automatically, carried out NCS averaging and solvent flattening, and autobuilt a model including side chains for 78% of the 2232 residues.
Weak Anomalous Signals
It has long been the dream of crystallographers to use the resonant scattering from naturally occurring elements, in particular sulfur, to phase protein structures. However, the K absorption edges of sulfur and other, smaller atoms such as phosphorus and chlorine correspond to wavelengths
longer than 4 Å, well beyond the tunable range (0.8–2.0 Å) of most synchrotrons. Furthermore, the severe absorption and radiation-damage problems encountered at such long wavelengths are likely to be insurmountable in most cases. It is fortunate, then, that elements such as sulfur retain some anomalous scattering effect even at wavelengths far removed from their absorption edges. It has been 20 years since Hendrickson and Teeter pioneered the use of sulfur anomalous diffraction to solve the structure of a small protein, crambin.100 Similar applications have been slow to follow, principally because of the difficulty in measuring the small anomalous signal with sufficient accuracy. However, as the applications summarized in Table VIII attest, the ways and means are now being found to conduct the necessary experiments successfully. Tetragonal hen egg-white lysozyme101 and the metalloprotease thermolysin102 are previously known test structures used to demonstrate feasibility. Obelin was the first de novo structure determined by sulfur anomalous-scattering data and solvent flattening, with the latter step carried out at 3.0 Å using the iterative single-wavelength anomalous scattering method first proposed by Wang.27 In the second de novo determination, that of the C1 subunit of α-crustacyanin, the top six peaks corresponded to a single member of each of the six disulfide moieties present in the asymmetric unit. In some cases, it was necessary to deviate from the default parameters used for the determination of substructures with stronger signals (e.g., use larger phase-to-atom ratios or decrease sigma cutoffs). (See also [5] in this volume.103)
Two facts stand out regarding the examples in Table VIII. First, X-rays in the wavelength range of 1.5 to 2.0 Å are chosen to reach a workable compromise that minimizes absorption and radiation-damage effects while maintaining some anomalous signal. (At λ = 1.54 Å, the f″ values are 0.56 electrons for sulfur and 0.70 for chlorine.) Second, highly redundant data are measured in an attempt to maximize accuracy. For example, in the lysozyme study by Weiss,104 no solutions were obtained when the data were truncated such that the redundancy factor was 13 or less. One solution out of 5000 trials was obtained with a redundancy of 16, but this increased to 40 per 5000 trials (0.8% success) when the redundancy was 25.
100 W. A. Hendrickson and M. M. Teeter, Nature 290, 107 (1981).
101 C. C. F. Blake, G. A. Mair, A. C. T. North, D. C. Phillips, and V. R. Sarma, Proc. R. Soc. B 167, 365 (1967).
102 B. W. Matthews, J. N. Jansonius, P. M. Colman, B. P. Schoenborn, and D. Duporque, Nat. New Biol. 238, 37 (1972).
103 R. A. P. Nagem, I. Polikarpov, and Z. Dauter, Methods Enzymol. 374, [5], 2003 (this volume).
104 M. S. Weiss, J. Appl. Crystallogr. 34, 130 (2001).
TABLE VIII Substructure Determinations Using Weak Anomalous Signals

Protein | Wavelength used (Å) | Redundancy | Space group | d (Å) | kDa/asymmetric unit | Program used | Actual sites | Sites found
Lysozyme (a) | 1.54 | 23 | P43212 | 1.8 | 14 | SHELXD | S10Cl8 | 17
Lysozyme (b) | 1.54 | >16 | P43212 | 1.63 | 14 | SnB | S10Cl8 | ?
Thermolysin (c) | 1.5–2.1 | 35–40 | P6122 | 1.83 | 35 | SnB | ZnCa5S3 | 15 (d)
Obelin (e) | 1.74 | 6 | P62 | 3.5 | 22 | SOLVE | S8Cl | 9
α-Crustacyanin (f) | 1.77 | 11 | P212121 | 2.6 | 40 | SnB | S12 | 6 (S–S)

(a) Z. Dauter, M. Dauter, E. de la Fortelle, G. Bricogne, and G. M. Sheldrick, J. Mol. Biol. 289, 83 (1999).
(b) M. S. Weiss, J. Appl. Crystallogr. 34, 130 (2001).
(c) M. S. Weiss, T. Sicker, and R. Hilgenfeld, Structure 9, 771 (2001).
(d) Selected for semiautomated refinement using MLPHARE,6 DM,6 and ARP/wARP.98
(e) Z.-J. Liu, E. S. Vysotski, C.-J. Chen, J. P. Rose, J. Lee, and B.-C. Wang, Protein Sci. 9, 2085 (2000).
(f) E. J. Gordon, G. A. Leonard, S. McSweeney, and P. F. Zagalsky, Acta Crystallogr. D Biol. Crystallogr. 57, 1230 (2001).
Short Halide Soaks
Pioneering work has shown that the phasing power of the chloride anions present in tetragonal lysozyme can be exploited further by substituting their higher homologs bromine and iodine, either by replacing the NaCl in the crystallization buffer by NaBr105 or by a quick soak (less than 1 min) of crystals in a cryobuffer containing concentrated (e.g., 0.25–1.0 M) halide salt.106 The latter method appears to be generally applicable, and it leads to incorporation of anomalous scatterers into the ordered solvent regions around protein molecules.106,107 The bromine K absorption edge at 0.92 Å can be employed for MAD experiments, and either bromine or iodine can be used in the SAD or SIRAS approach. In practice, the use of a single, near-remote wavelength has been used effectively to solve structures of bromine-soaked crystals. Prolonging the soak time beyond about 20 s does not seem to lead to greater incorporation of halide ions, but a higher concentration of salt leads to more sites with higher occupancies.
Table IX contains a listing of some previously unknown protein structures determined with the aid of halide cryosoaks. Direct methods were used to locate the halide substructures. The primary difference between these applications and those described in the previous sections is that the total number of sites to be found was uncertain. (Fortunately, the formula FANOM/F = 2^1/2 [(f″ NA^1/2)/(6.7 NP^1/2)], where NA and NP are the numbers of anomalously scattering and protein atoms, respectively, gives an indication of the equivalent number of fully occupied anomalous sites when applied to the low-resolution data.107) It appears, however, that this uncertainty has not been a significant problem, and the number of sites selected from the direct-methods map can be arbitrary. There is no sharp boundary between the strong, highly occupied sites and noise. In general, it is a good idea to underestimate the number of sites initially so that figures of merit do not become "diluted" by the inclusion of incorrect sites. Additional sites can be located easily using appropriate residual maps.
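The formula quoted above is easy to use in both directions, either to predict the expected low-resolution anomalous signal or to convert a measured signal into an equivalent number of fully occupied sites. In the sketch below the f″ value and atom counts are user-supplied guesses, not program defaults.

# Back-of-the-envelope use of FANOM/F = 2^(1/2) f" NA^(1/2) / (6.7 NP^(1/2)).
import math

def expected_ratio(n_anom, n_protein, f_double_prime):
    """Expected FANOM/F for n_anom equivalent fully occupied sites."""
    return math.sqrt(2.0) * f_double_prime * math.sqrt(n_anom) / (6.7 * math.sqrt(n_protein))

def equivalent_sites(measured_ratio, n_protein, f_double_prime):
    """Invert the formula to estimate the number of fully occupied sites."""
    return (measured_ratio * 6.7 * math.sqrt(n_protein) /
            (math.sqrt(2.0) * f_double_prime)) ** 2

# Example (illustrative numbers): a 2% low-resolution anomalous ratio with
# about 4000 protein atoms and bromide f" of roughly 3.5 electrons would
# correspond to approximately three fully occupied sites.
# print(equivalent_sites(0.02, 4000, 3.5))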
Obtaining the Programs
Detailed information about each of the programs described in this chapter, including instructions for downloading, can be obtained from their respective Web sites (Table X).
105 Z. Dauter and M. Dauter, J. Mol. Biol. 289, 93 (1999).
106 Z. Dauter, M. Dauter, and K. R. Rajashankar, Acta Crystallogr. D Biol. Crystallogr. 56, 232 (2000).
107 Z. Dauter and M. Dauter, Structure 9, R21 (2001).
TABLE IX Substructure Determinations Using Halide Soaks

Protein | Salt concentration | Soak time (s) | Space group | d (Å) | kDa/asymmetric unit | Program used | Sites used | Total sites
β-Defensin-2 (a) | 0.25 M KBr (b) | 60 | P21212 | ? | 16 | SHELXS | 9 | ?
Yeast YKG9 (c) | 0.5 M NaBr | 45 | P43212 | 2.8 | 36 | SnB (d) | 7 | ?
PCP (e) | 1.0 M NaBr | 30 | P62 | 1.8 | 37 | SHELXD | 9 | 22
Thioesterase 1 (f) | 1.0 M NaBr | 20 | P21 | 1.8 | 56 | SnB | 7 | 40

(a) D. M. Hoover, K. R. Rajashankar, R. Blumenthal, A. Puri, J. J. Oppenheim, O. Chertov, and J. Lubkowski, J. Biol. Chem. 275, 32911 (2000).
(b) 0.25 M KI also used.
(c) Y.-S. J. Ho, L. M. Burden, and J. H. Hurley, EMBO J. 19, 5288 (2000).
(d) SHELXS also used.
(e) Z. Dauter, M. Li, and A. Wlodawer, Acta Crystallogr. D Biol. Crystallogr. 57, 239 (2001).
(f) Y. Devedjev, Z. Dauter, S. R. Kuznetsov, T. L. Z. Jones, and Z. S. Derewenda, Struct. Fold. Des. 8, 1137 (2000).
TABLE X Software Web Sites for Substructure Determination
The various sites feature a variety of instructional material. For example, the CNS distribution contains Web-based tutorials describing the steps required for MIR, MAD, and SAD phasing. The SnB site features a short tutorial on direct methods. Most of these programs are available at no cost to nonprofit organizations. Acknowledgments The authors thank Stephen Potter, who created and maintained a Web site for reporting substructure results; the many users who responded with information about their applications; and Osnat Herzberg and Celia Chen, who allowed us to use their data for 2-aminoethylphosphonate transaminase as test data. We are also grateful for financial support: NIH GM-46733 (C.M.W.), NIH 1P50GM-62412 (P.D.A.), and U.S. Department of Energy Contract No. DEAC03-76SF00098 (P.D.A). The CCP4 programs have been collected and developed under the auspices of Collaborative Computing Project Number 4, in Protein Crystallography, supported by the U.K. Science and Engineering Research Councils and coordinated at Daresbury Laboratory.
[4] Use of Noble Gases Xenon and Krypton as Heavy Atoms in Protein Structure Determination
By Marc Schiltz, Roger Fourme, and Thierry Prangé
Introduction
Xenon and krypton derivatives of proteins can be obtained by subjecting a native protein crystal to a xenon or krypton gas atmosphere pressurized in the range of 1–100 bar.1 The noble gas atoms are able to diffuse rapidly toward potential interaction sites in proteins via the solvent channels that are always present in crystals of macromolecules. The number and occupancies of xenon/krypton-binding sites vary with the
applied pressure. The interaction of noble gas atoms with proteins is the result of noncovalent weak-energy van der Waals forces and therefore the process of xenon/krypton binding is completely reversible. It also implies that noble gas binding induces only marginal perturbations to the surrounding molecular structures. As a consequence, xenon and krypton derivatives are highly isomorphous with the native crystals. Xenon and krypton are able to bind to a large variety of sites in proteins, including closed intramolecular hydrophobic cavities, accessible enzymatic sites, intermolecular cavities, and channel pores. Site-specific mutagenesis can be used to create xenon- and krypton-binding cavities in proteins. In the present chapter various aspects of the preparation of xenon and krypton derivatives and their use as heavy atoms or anomalous scatterers in protein crystallography are presented. Brief Historical Survey
From Discovery of Xenon Anesthesia to First Structural Studies on Protein–Xenon Complexes Experimental and theoretical studies of the interactions of xenon and krypton with proteins date back to 1946, when Lawrence et al.2 noticed that the solubility of xenon in olive oil is similar to that observed for other nonpolar nonhydrogen bond-forming anesthetics such as cyclopropane or ethylene. On the basis of these observations they predicted that inert gases possess narcotic properties. This hypothesis was validated in 1951 by Cullens and Gross,3 who reported the first xenon anesthesia in humans. However, the high cost of purified xenon gas has precluded its widespread use as a general anesthetic in surgical procedures, although this situation has changed more recently.4 Nevertheless, the fact that simple spherically symmetrical atoms without any permanent dipoles can produce narcosis triggered considerable interest in the use of xenon and krypton as prototype probes to experimentally investigate the molecular mechanisms of anesthesia.5,6 In particular, it was found by solubility measurements that
1 We use the unit bar for pressures throughout this chapter: 1 bar = 10^5 Pa = 0.987 atm = 750.06 Torr = 14.50 lb/in².
2 J. Lawrence, W. F. Loomis, C. A. Tobias, and F. H. Turpin, J. Physiol. 105, 197 (1946).
3 S. C. Cullens and E. G. Gross, Science 113, 580 (1951).
4 J. Leclerc, R. Nieuviarts, B. Tavernier, B. Vallet, and P. Scherpereel, Ann. Fr. Anesth. Reanim. 20, 70 (2001).
5 R. M. Featherstone and C. A. Muehlbaecher, Pharmacol. Rev. 15, 97 (1963).
6 R. M. Featherstone and W. Settle, Actual. Pharmacol. 27, 69 (1974).
Fig. 1. A schematic representation of the four xenon-binding sites in sperm whale myoglobin as reported in Ref. 21. Site 1 is the major binding site.
xenon reversibly binds to hemoglobin, myoglobin,7,8 and albumin.9 These observations motivated Schoenborn and coworkers to launch a number of pioneering X-ray studies on crystallized xenon–protein complexes in 1965–1969. These X-ray diffraction measurements were carried out on protein crystals put under a pressurized xenon gas atmosphere of typically 2–2.5 bar. The noticeable solubility of xenon in water10 allows for the rapid diffusion of xenon atoms toward potential sites in protein molecules via the numerous solvent channels that are always present in protein crystals. Xenon was shown to bind to a discrete site in sperm whale metmyoglobin,11 deoxymyoglobin,12 and alkaline metmyoglobin.13 The binding site is located in the proximal cavity, next to the prosthetic group and opposite the oxygen-binding site (see Fig. 1). The xenon atom is in close contact with the heme-linked histidine and a pyrrole ring of the porphyrin group. A second binding site with lower occupancy was detected in the alkaline form of metmyoglobin. A single discrete xenon-binding site was also observed in each subunit of horse hemoglobin.14 The binding sites are almost identical for the α and β subunits but different from the myoglobin-binding site.
7 C. A. Muehlbaecher, F. L. Debon, and R. M. Featherstone, Anesthesiol. Clin. 1, 937 (1963).
8 C. A. Muehlbaecher, F. L. Debon, and R. M. Featherstone, Mol. Pharmacol. 2, 86 (1966).
9 H. L. Conn, Jr., J. Appl. Physiol. 16, 1065 (1961).
10 E. Wilhem, R. Battino, and R. J. Wilcock, Chem. Rev. 22, 219 (1977).
11 B. P. Schoenborn, H. C. Watson, and J. C. Kendrew, Nature 207, 28 (1965).
12 B. P. Schoenborn and C. L. Nobbs, Mol. Pharmacol. 2, 495 (1966).
13 B. P. Schoenborn, J. Mol. Biol. 45, 297 (1969).
14 B. P. Schoenborn, Nature 207, 760 (1965).
Preliminary reports also showed xenon to bind to renin15 and to the protein subunit of tobacco mosaic virus,16 but no full structural studies have been conducted on these proteins, so that the detailed nature of the binding sites remains unknown. It is also worth mentioning that the complexes of sperm whale myoglobin with cyclopropane16,17 and dichloromethane18 (two other anesthetic agents), as well as with HgI3, AuI3, and I3,19 have been solved by X-ray crystallography, showing all these molecules to bind to the same site as xenon. Even a single N2 molecule is known to bind to this site at a high gas pressure of 145 bar.20
Further Studies of Interactions of Xenon with Macromolecules
Further experimental studies on myoglobin–xenon complexes were carried out by Tilton and co-workers in the 1980s. Crystallographic studies at the somewhat higher xenon pressure of 7 bar revealed that, in addition to the major binding site, three secondary sites are present with xenon occupancies close to 0.5.21 These secondary sites are all located in preexisting hydrophobic cavities in the protein matrix (see Fig. 1). The sites are surrounded by aliphatic and aromatic side chains and structural as well as energy considerations suggest that they are void in the absence of xenon (i.e., not filled with water molecules). The detailed analysis of the myoglobin–xenon complex shows that no significant structural rearrangement in the protein is associated with the binding of xenon.21,22 As a consequence, the native and xenon-complexed protein structures are highly isomorphous to each other. Model calculations,22 molecular dynamics simulations,23 and nuclear magnetic resonance (NMR) studies24 were used to investigate the thermodynamics and kinetics of xenon binding. Thermodynamic data were also obtained from solution studies conducted by Ewyng and Maestas.25
15 B. P. Schoenborn and R. M. Featherstone, Adv. Pharmacol. 5, 1 (1967).
16 B. P. Schoenborn, Fed. Proc. Fed. Am. Soc. Exp. Biol. 27, 888 (1968).
17 B. P. Schoenborn, Nature 214, 1120 (1967).
18 A. C. Nunes and B. P. Schoenborn, Mol. Pharmacol. 9, 835 (1973).
19 R. H. Kretsinger, J. Mol. Biol. 31, 305 (1968).
20 R. F. Tilton, Jr., I. D. Kuntz, Jr., and G. A. Petsko, Biochemistry 27, 6574 (1988).
21 R. F. Tilton, Jr., I. D. Kuntz, Jr., and G. A. Petsko, Biochemistry 23, 2849 (1984).
22 R. F. Tilton, Jr., U. C. Singh, S. J. Weiner, M. Connolly, I. D. Kuntz, Jr., P. A. Kollman, N. Max, and D. Case, J. Mol. Biol. 192, 443 (1986).
23 R. F. Tilton, Jr., U. C. Singh, I. D. Kuntz, Jr., and P. A. Kollman, J. Mol. Biol. 99, 195 (1988).
24 R. F. Tilton, Jr. and I. D. Kuntz, Jr., Biochemistry 21, 6850 (1982).
25 G. J. Ewyng and S. Maestas, J. Phys. Chem. 74, 2341 (1970).
Using Xenon as Heavy Atom in Protein Crystallography
The use of xenon as a heavy atom to solve the phase problem in protein crystallography was first suggested in 1967 by Schoenborn and Featherstone,15 who immediately realized its potential advantages by stating that "the use of xenon as a 'heavy atom' is of some interest in protein crystallography. . . . Xenon is a little 'lighter' than desirable for a heavy atom, but this is counteracted by the fact that xenon protein complexes show a very high degree of isomorphism with the native crystals—a fact often not true with most of the commonly used 'heavy atoms' which are generally ionic groups capable of inducing some disorder into the native structure." However, it was not until 1991 that Vitali et al.26 demonstrated that SIRAS27 phases computed from the xenon complex of sperm whale myoglobin yielded an interpretable electron density map for that protein with data collected on a laboratory source at the Cu Kα wavelength. Following an initiative by Prangé, a comprehensive research project on xenon and krypton derivatives was started in our laboratory in 1993. The immediate outcomes of this project were (1) the design of a simple and generally applicable method for the preparation and X-ray data collection of isomorphous noble-gas derivatives,28 (2) the investigation of xenon and krypton binding to proteins other than myoglobin and hemoglobin,29,30 (3) the use of the anomalous signals of xenon and krypton in protein phasing,31–33 and (4) the use of xenon derivatives for the determination of phases of unknown protein structures.34–36
26 J. Vitali, A. H. Robbins, S. C. Almo, and R. F. Tilton, J. Appl. Crystallogr. 24, 931 (1991).
27 Abbreviations used: MIR(AS), multiple isomorphous replacement (with anomalous scattering); SIRAS, single-isomorphous replacement with anomalous scattering; MAD, multiwavelength anomalous scattering.
28 M. Schiltz, T. Prangé, and R. Fourme, J. Appl. Crystallogr. 27, 950 (1994).
29 M. Schiltz, R. Fourme, I. Broutin, and T. Prangé, Structure 3, 309 (1995).
30 T. Prangé, M. Schiltz, L. Pernot, N. Colloc'h, S. Longhi, W. Bourguet, and R. Fourme, Protein Struct. Funct. Genet. 30, 61 (1998).
31 M. Schiltz, W. Shepard, R. Fourme, T. Prangé, E. de la Fortelle, and G. Bricogne, Acta Crystallogr. D Biol. Crystallogr. 53, 78 (1997).
32 Å. Kvick, O. Svensson, W. Shepard, E. de la Fortelle, T. Prangé, R. Kahn, M. Schiltz, G. Bricogne, and R. Fourme, J. Synchrotron Radiat. 4, 287 (1997).
33 M. Schiltz, Ph.D. thesis. Université de Paris XI, Orsay, France, 1997.
34 N. Colloc'h, M. El Hajji, B. Bachet, G. Lhermite, M. Schiltz, T. Prangé, B. Castro, and J. P. Mornon, Nat. Struct. Biol. 4, 947 (1997).
35 I. Li de la Sierra, L. Pernot, T. Prangé, P. Saludjian, M. Schiltz, R. Fourme, and G. Padrón, J. Mol. Biol. 269, 129 (1997).
36 W. Bourguet, M. Ruff, P. Chambon, H. Gronenmayer, and D. Moras, Nature 375, 377 (1995).
A kind of breakthrough was achieved in 1994, when the structure of the ligand-binding domain of the human nuclear receptor RXR-α was solved with a xenon derivative prepared at LURE at a gas pressure of 20 bar.36 This was the first published structure in which a xenon derivative had been used to determine the phases of an unknown protein structure. It established the credentials of the method and triggered the attention of the crystallographic community. Since then, xenon has been used successfully for the structure determination of a significant number of other proteins (see Selected Case Studies, below). Various improvements and extensions to the method were proposed by research teams in Graz (Austria),37–39 at Stanford-SSRL,40–42 at the University of Oregon,43,44 at ELETTRA (Italy) and EMBL-Hamburg (Germany),45 and by our own group at LURE (France).31–33
Useful Properties of Xenon and Krypton Derivatives
Since the discovery of xenon fluorides in 1962, the covalent compounds of noble gases have been extensively studied. With proteins, however, the interactions of xenon and krypton are of noncovalent origin and they are therefore similar in nature to the forces that give rise to the formation of other well-known complexes of xenon and krypton with small molecules. The best characterized of these complexes are the clathrates and, in particular, the xenon and krypton hydrates.46–48 In these compounds, the noble gas atoms are enclosed as "guests" in the cavities formed by a host structure. The host structure can be a lattice of water molecules (in the case
Since the discovery of xenon fluorides in 1962, the covalent compounds of noble gases have been extensively studied. With proteins, however, the interactions of xenon and krypton are of noncovalent origin and they are therefore similar in nature to the forces that give rise to the formation of other well-known complexes of xenon and krypton with small molecules. The best characterized of these complexes are the clathrates and, in particular, the xenon and krypton hydrates.46–48 In these compounds, the noble gas atoms are enclosed as ‘‘guests’’ in the cavities formed by a host structure. The host structure can be a lattice of water molecules (in the case 37
O. Sauer, A. Schmidt, and C. Kratky, J. Appl. Crystallogr. 30, 476 (1997). O. Sauer, Ph.D. thesis. Karl-Franzens Universita¨ t, Graz, Austria, 2001. 39 O. Sauer, M. Roth, T. Schirmer, G. Rummel, and C. Kratky, Acta Crystallogr. D Biol. Crystallogr. 58, 60 (2002). 40 M. H. B. Stowell, M. Soltis, C. Kisker, J. W. Peters, H. Schindelin, D. C. Rees, D. Cascio, L. Beamer, P. John Hart, M. C. Wiener, and F. G. Whitby, J. Appl. Crystallogr. 29, 608 (1996). 41 S. M. Soltis, M. H. B. Stowell, M. C. Wiener, G. N. Philips, and D. C. Rees, J. Appl. Crystallogr. 30, 190 (1997). 42 A. Cohen, P. Ellis, N. Kresge, and M. Soltis, Acta Crystallogr. D Biol. Crystallogr. 57, 233 (2001). 43 M. L. Quillin, W. A. Breyer, I. J. Grisworld, and B. W. Matthews, J. Mol. Biol. 302, 955 (2000). 44 M. L. Quillin and B. W. Matthews, Acta Crystallogr. D Biol. Crystallogr. 58, 97 (2002). 45 K. Djinovic-Carugo, P. Everitt, and P. Tucker, J. Appl. Crystallogr. 31, 812 (1998). 46 R. de Forcrand, C. R. Acad. Sci. 176, 355 (1923). 47 R. de Forcrand, C. R. Acad. Sci. 181, 15 (1925). 48 M. Von Stackelberg and H. R. Mu¨ ller, Z. Elektrochem. 58, 25 (1954). 38
of hydrates) or organic molecules such as hydroquinone, phenol, and p-cresol.49 In xenon hydrate, the gas atoms are enclosed in two different types of cavities, which have diameters of, respectively, 5.2 and 5.9 Å.48 Xenon and krypton are also known to bind noncovalently to α-cyclodextrin50 and into zeolites.51 In these latter cases, the binding sites are accessible, as opposed to clathrates, where they are closed cavities. Both types of sites are relevant for the interaction of xenon and krypton with proteins, as binding has been observed both to closed cavities (as in myoglobin) and to accessible sites (as in serine proteinases).29 Because the binding of xenon and krypton is due to noncovalent interactions, it will be useful to briefly review the atomic properties of noble gases that determine the magnitude of these forces. A summary of relevant physical and chemical properties of xenon and krypton is presented in Table I.
Molecular Forces in Xenon/Krypton–Protein Interactions
As noble gas atoms have a zero net charge and are spherically symmetric, Coulomb interactions, hydrogen bonding, and dipole–dipole interactions cannot be involved in the binding of xenon and krypton to proteins. Thus, the only possible attractive interactions between noble gas atoms and proteins are charge-induced, dipole-induced, and London (dispersion) forces. The key physical parameter in these interactions is the electronic polarizability of the noble gas atoms. The usual repulsive forces between atoms and molecules that are in close contact with each other also play an important role because they determine the minimum size that a cavity must have for xenon or krypton to bind into it. Under the influence of an external electric field E, an electric dipole P is induced in a xenon or krypton atom. The induced dipole can then establish attractive electrostatic interactions with surrounding electric charges or dipoles. This is the physical basis of the forces involved in xenon or krypton binding to proteins. Assuming a linear response, the induced dipole is proportional to the external field strength, P = αE, where the proportionality coefficient α is called the electronic polarizability. This is a characteristic parameter for a given atom type and, qualitatively, it expresses the ease with which the electron cloud can be displaced. Most atoms and diatomic molecules have rather low polarizabilities52 (1.63 for argon, 1.60 for O2, 1.48 for H2O), but for larger molecules and for diatomic molecules, which have a large number of electrons, the values are higher (4.6 for
49 L. Mandelcorn, Chem. Rev. 59, 827 (1959).
50 F. Cramer and F. M. Hengelein, Chem. Ber. 90, 2572 (1957).
51 J. E. Cline, Health Phys. 40, 71 (1981).
52 Electronic polarizabilities are expressed in units of (4πε0) Å³ = 1.11 × 10⁻⁴⁰ C² m² J⁻¹.
TABLE I Properties of Xenon and Krypton

Property | Xenon | Krypton
Atomic number | 54 | 36
Atomic mass | 131.29 | 83.8
Melting point at 1.013 bar | 161.25 K | 115.95 K
Boiling point at 1.013 bar | 165.05 K | 119.75 K
Density (gas) at 293.15 K | 5.897 g/liter | 3.74 g/liter
Critical temperature | 289.7 K | 209.4 K
Critical pressure | 58.4 bar | 54.3 bar
Solubility (a) in:
  Water | 0.1178 | 0.0670
  Methanol | 2.20 |
  Ethanol | 2.47 |
  Propanol | 2.65 |
  Hexanol | 2.61 |
  Cyclopentane | 5.75 |
  Cyclohexane | 5.00 |
  Propanal | 2.80 |
  Heptanal | 2.98 |
  Acetic acid | 2.66 |
  n-Butanoic acid | 2.89 |
Electronic polarizability | 4.04 (4πε0) Å³ | 2.48 (4πε0) Å³
van der Waals radius | 2.16 Å | 2.01 Å
First ionization potential | 12.13 eV | 14.00 eV
Absorption edges (b):
  K | 34.582 keV / 0.3585 Å | 14.322 keV / 0.8657 Å
  LI | 5.42 keV / 2.29 Å | 1.92 keV / 6.46 Å
  LII | 5.10 keV / 2.43 Å |
  LIII | 4.78 keV / 2.59 Å |

(a) Ostwald coefficients at 293.15 K.
(b) Energy/wavelength.
Cl2, 4.5 for C2H6, 8.2 for CHCl3) (data from Refs. 5 and 53). It is noticeable that the polarizability of xenon is quite large (4.00), given that it is a single atom. For krypton, the polarizability is 2.46, which is significantly lower and explains why krypton usually binds with a lower occupancy to known xenon-binding sites.31 As mentioned above, three types of interactions can occur between noble gas atoms and molecular groups in proteins.

1. Charge-induced dipole interactions: In this case, the induced dipole in the xenon or krypton atom is created by an external electric charge.
53 J. Israelachvili, "Intermolecular and Surface Forces." Academic Press, London, 1985.
These interactions could thus occur when a charged protein group is located in the vicinity of the noble gas atom. The energy for this type of interaction is proportional to the electronic polarizability of the noble gas, proportional to the square of the charge, and inversely proportional to the fourth power of the distance separating the center of the charge from the center of the noble gas atom.53 These are therefore potentially strong interactions (as compared with dipole-induced dipole and London interactions). They are also active over larger distances. In proteins, however, charged groups are usually exposed to the solvent (and/or involved in salt bridges). Whereas this type of interaction may therefore be important in the nonspecific binding of xenon and krypton to the surface of protein molecules, the discrete binding sites observed in crystallographic experiments are usually not directly solvent exposed.

2. Dipole-induced dipole interactions: In this case, the induced dipole in the xenon or krypton atom is created by an external electric dipole. These interactions occur when the noble gas atom is located in the vicinity of polar groups in the protein. The binding energy involved in this type of interaction is proportional to the electronic polarizability of the noble gas atom, proportional to the magnitude of the dipole moment, and inversely proportional to the sixth power of the distance separating the center of the dipole from the center of the gas atom.53 Hence, it is a truly short-range interaction. In a number of binding sites observed in protein crystals, xenon and krypton are bound close to polar groups such as the hydroxyl side chain of the active-site serine in serine proteinases.29

3. London interactions (also called dispersion interactions): These interactions exist between all molecules and atoms, even those that are uncharged and nonpolar. They are usually described as arising from the interaction of the instantaneous dipoles in both molecules (or atoms). Each molecule (or atom) possesses fluctuating dipoles according to the particular instantaneous distribution of its electrons. These instantaneous dipoles constantly change direction and magnitude, each existing only for a minute fraction of time. However, the net overall interaction between these instantaneous dipoles is an attractive force, as is demonstrated by a quantum-mechanical perturbation treatment.54,55 The interaction energy is given by
U(r) = −(3/2) [Ia Ib / (Ia + Ib)] [αa αb / ((4πε0)² r⁶)]
where a and b label the two atoms that are involved in the interaction, I stands for the first ionization potential, and α stands for the electronic polarizability. The distance between the two interacting atoms is denoted by r.
54 F. London, Z. Phys. 63, 245 (1930).
55 F. London, Trans. Faraday Soc. 33, 8 (1937).
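To make the magnitude of these dispersion energies concrete, the short sketch below evaluates the London formula for a xenon atom in contact with a single protein atom. The xenon polarizability and ionization potential are taken from Table I; the values assumed for the protein atom (an aliphatic carbon-like group with a polarizability of about 1.8 Å³ and an ionization potential of about 11 eV) and the 4.0 Å contact distance are illustrative assumptions, not numbers given in the text.

```python
K_T_300K_EV = 0.02585  # thermal energy kT at ~300 K, expressed in eV

def london_energy_ev(alpha_a, alpha_b, ion_a_ev, ion_b_ev, r_angstrom):
    """London dispersion energy U(r) from the formula above.

    Polarizabilities are given as alpha/(4*pi*eps0) in units of A^3 (the
    convention of Table I) and ionization potentials in eV, so that with
    the distance r in A the result comes out directly in eV.
    """
    return -1.5 * (ion_a_ev * ion_b_ev / (ion_a_ev + ion_b_ev)) \
               * alpha_a * alpha_b / r_angstrom**6

# Xenon parameters from Table I; the carbon-like parameters and the 4.0 A
# separation are illustrative assumptions.
u = london_energy_ev(4.04, 1.8, 12.13, 11.0, 4.0)
print(f"U = {u:.4f} eV  ({u / K_T_300K_EV:.2f} kT at 300 K)")
```

Each such pairwise contact contributes only a fraction of kT, which is consistent with the statement below that the binding energy builds up from the many simultaneous contacts a xenon atom makes within a well-fitting cavity.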
Once more, the interaction energy is proportional to the polarizability of the noble gas atom (as well as to the polarizability of the other interacting atom) and inversely proportional to the sixth power of the distance separating the two atoms. These energies are again weak, usually only a few units of kT, but the xenon and krypton atoms in a binding site typically interact with several protein atoms, so that the individual interaction energies add up. The binding of xenon and krypton to proteins is dominated by London interactions, mainly because the majority of binding sites are formed by nonpolar groups (aliphatic and/or aromatic side chains). But even if there are polar groups in a binding site, it can be shown33,53 that the London interactions generally contribute to a much larger extent to the binding than do the dipole-induced dipole forces. In crystals of Fusarium solani pisi cutinase, xenon was found to bind into the enzymatic site of the molecule, with the closest protein group being the hydroxyl side chain of the active-site serine. The mutation of this serine into alanine (i.e., changing a polar group into a nonpolar group) did not significantly alter the ability of xenon to bind into the site.33 Finally, the size of the noble gas atom is another important parameter because it determines the minimum size of a cavity that can act as a potential binding site. The van der Waals radius of xenon is 2.16 Å (data from Refs. 53 and 56), which makes this atom comparable in size to a methane molecule (van der Waals radius, 2.00 Å), but significantly larger than the effective van der Waals radius of water (1.4 Å). Krypton is slightly smaller (van der Waals radius, 2.01 Å) and might therefore bind to sites too small to accommodate a xenon atom. Insights into the structural factors that affect the binding of noble gases to proteins have been obtained from a major experimental investigation conducted by Quillin et al.43 These authors determined the crystal structures of pseudo-wild-type and cavity-containing mutant phage T4 lysozymes in the presence of argon, krypton, and xenon. In the engineered, predominantly apolar cavities of varying size and shape, the noble gases bind preferentially at highly localized sites that appear to be defined by constrictions in the walls of the cavities. The investigators conclude that the plasticity of the protein matrix permits the repulsion due to increased ligand size, in going from argon to xenon, to be more than compensated for by the attraction due to increased ligand polarizability. A review of xenon- and krypton-binding sites in various proteins has been presented by Prangé et al.30 (see also Fig. 2). It reveals that
56 A. Bondi, "Physical Properties of Molecular Crystals, Liquids and Glasses." John Wiley & Sons, New York, 1968.
Fig. 2. Examples of xenon- and krypton-binding sites in proteins. Top: A buried hydrophobic cavity in hen egg-white lysozyme. The molecular surface was computed with the Connolly algorithm. The site is inaccessible in terms of a static description of the protein structure, but both xenon and krypton are shown to bind into this cavity.31,33 Bottom left: Details of the environment of the bound xenon atom in the cavity in hen egg-white lysozyme. This cavity is delimited by hydrophobic residues: a leucine, a valine, two isoleucines, and a methionine. Bottom right: The xenon-binding site in the enzyme urate oxidase,34 showing the superimposed models of the native structure and the xenon derivative, refined independently. The small changes induced by the xenon are illustrated.
noble gases can bind into preexisting, nonfunctional intra- and intermolecular hydrophobic cavities, as is the case in myoglobin; into ligand- and substrate-binding pockets, as in RXR-α36 and serine proteinases29; and into the pore of channel-like coiled-coil structures, as is the case in COMP,57 bacteriophage T4 fibritin M,58 and DMSO reductase.59 The majority of these sites are hydrophobic, but in serine proteinases and in
Fig. 3. Solubilities of noble gases. The plot represents the concentration (in mole fraction) of xenon and krypton in pure water as a function of their partial pressure. Data computed from Ref. 61.
COMP, the xenon atoms displace water molecules that are well defined in the native structures.

Solubilities of Noble Gases

The macroscopic properties that are of interest to us are the solubilities of xenon and krypton in water and in various organic solvents. Numerous experimental data have been tabulated.60–64 The concentrations of xenon and krypton in water at 298.15 K as a function of gas pressure are plotted in Fig. 3. It should be noted that these data are valid for pure water. It has been observed that a 22% glycerol–water mixture—a common medium used in cryocrystallography—is less efficient than pure water at dissolving rare
57 V. N. Malashkevitch, R. A. Kammerer, V. P. Efimov, T. Schulthess, and J. Engel, Science 274, 761 (1996).
58 S. V. Strelkov, Y. Tao, M. M. Shneider, V. V. Mesyanzhinov, and M. G. Rossmann, Acta Crystallogr. D Biol. Crystallogr. 54, 805 (1998).
59 H. Schindelin, C. Kisker, J. Hilton, K. V. Rajagopalan, and D. C. Rees, Science 272, 1615 (1996).
60 H. L. Clever, in "Krypton, Xenon and Radon Gas Solubilities," Solubility Data Series, Vol. 2, IUPAC Commission V.8. Elsevier Science, Amsterdam, 1979.
61 E. Wilhelm, R. Battino, and J. Wilcock, Chem. Rev. 77, 219 (1976).
62 R. P. Kennan and G. L. Pollack, J. Chem. Phys. 93, 2724 (1990).
63 R. P. Kennan, G. L. Pollack, and J. F. Himm, J. Chem. Phys. 90, 6569 (1989).
64 G. L. Pollack, J. F. Himm, and J. J. Enyeart, J. Chem. Phys. 81, 3239 (1984).
gases.38 This is probably also the case for concentrated saline solutions, because the ions tend to compete with the noble gas atoms for solvation, but no experimental data have been reported so far. The Ostwald coefficient65 for the solubility of xenon in water at a partial gas pressure of 1 bar is 0.2094 at a temperature of 273.15 K and drops in a nearly linear fashion to 0.1198 at 293.15 K (the corresponding values for krypton are 0.1099 and 0.0670). Xenon is thus more soluble in water than are, for instance, N2, O2, and CO, but less soluble than CO2. It is this appreciable solubility of noble gases in water that allows them to diffuse through the solvent channels of protein crystals to the binding sites on the macromolecules. In organic solvents, the solubility of xenon is substantially larger63 (Table I), which places it in the category of hydrophobic solutes and reflects its ability to establish London interactions in apolar environments. As an example, the Ostwald coefficient for xenon in cyclohexane is 5.75 at a temperature of 293.15 K. The large differences in xenon and krypton solubility between polar and nonpolar solvents can be exploited in low-resolution X-ray contrast-variation experiments to distinguish between lipid or detergent regions and aqueous phase regions in crystals, as was demonstrated with OmpF porin.39

X-Ray Scattering Properties

With 54 electrons, xenon belongs to the category of atoms that qualify as "heavy" in protein crystallography, although it is certainly "lighter" than many standard heavy atoms such as mercury, platinum, or uranium. Substantial progress has been achieved in experimental and theoretical methods to tackle the phase problem with the isomorphous replacement method. Improved data collection and processing strategies (which exploit small intensity differences) and, most importantly, optimal statistical phasing methods66 now allow the extraction of useful phase information from much weaker signals. As a consequence, the fact that xenon is not as heavy as mercury is no longer a serious drawback. Even krypton, with its 36 electrons per atom, is useful in isomorphous replacement. Similarly, short soaks in solutions containing halide ions (Br⁻, I⁻) or certain cations (Cs⁺, Gd³⁺)67,68 have become popular in producing useful heavy
65 Ostwald solubility is the ratio of the concentration of dissolved gas molecules in a liquid solvent to their concentration in the gas phase at equilibrium. Ostwald solubility is a thermodynamically important, as well as an intuitive, measure of solubility.
66 E. de La Fortelle and G. Bricogne, Methods Enzymol. 276, 472 (1997).
67 Z. Dauter, M. Dauter, and K. R. Rajashankar, Acta Crystallogr. D Biol. Crystallogr. 56, 232 (2000).
68 R. A. Nagem, Z. Dauter, and I. Polikarpov, Acta Crystallogr. D Biol. Crystallogr. 57, 996 (2001).
Fig. 4. Anomalous scattering factors f′ and f″ (in electrons) for xenon as a function of wavelength (Å).
atom derivatives. In all these cases, the good isomorphism of the derivatives (combined with their anomalous scattering properties) largely outweighs their somewhat smaller signal strength as compared with classical heavy-atom derivatives. In addition, xenon and krypton exhibit interesting anomalous scattering properties (Table I). The f′ and f″ values for xenon as a function of wavelength are plotted in Fig. 4. The absorption edges of xenon are situated either at short wavelengths (K edge at 0.358 Å) or at long wavelengths (the L edges are all above 2 Å). Despite the fact that these wavelengths are not routinely accessible on most synchrotron beamlines and that data collection with soft X-rays is still experimentally challenging,69 anomalous scattering experiments in SIRAS mode have been carried out successfully at both the K edge32 and the LI edge.69,70 For standard experiments on laboratory sources, it is important to note that the residual anomalous signal from the LI edge is still f″ = 7.35 at the Cu Kα wavelength. This is sufficiently large to provide useful phase information, which, furthermore, is complementary to the isomorphous phasing signal. The anomalous scattering
69 M. Cianci, P. J. Rizkallah, A. Olczak, J. Raftery, N. E. Chayen, P. Zagalsky, and J. R. Helliwell, Acta Crystallogr. D Biol. Crystallogr. 57, 1219 (2001).
70 S. Yuda, H. Kobayashi, T. Matssumoto, and T. Nonaka, "SPRING8 User Experiment," Report 1999A0337-NL-np. Japan Synchrotron Radiation Research Institute (JASRI), 1999.
factors for krypton are smaller, and f″ does not exceed 4 over the range of wavelengths used in protein crystallography. However, the K absorption edge of krypton is located at a wavelength of 0.8655 Å, which is accessible on most tunable synchrotron beamlines and is, in fact, close to the absorption edges of Se and Br, two atoms that are frequently used as anomalous scatterers in MAD experiments on proteins and nucleic acids, respectively. Both SIRAS31 and MAD42 experiments have been carried out successfully at the krypton K edge. White-line features are absent from the K edges of both xenon32 and krypton,31 implying that no specific near-edge enhancement of the anomalous signal can be exploited.

Potential Advantages of Xenon and Krypton Derivatives

Having outlined some useful physical and chemical properties of noble gases, we are now in a position to summarize the potential advantages of xenon and krypton as heavy atoms and anomalous scatterers in protein crystallography.

1. Provided that binding sites are present, xenon and krypton derivatives can be prepared with relative ease. It is sufficient to produce native crystals and put them under gas pressure. In contrast to the preparation of classic derivatives, no modification of the buffer or crystal mother liquor is required (except for the addition of cryoprotectants if the crystals are to be frozen). The addition of xenon and krypton has little effect on the pH or ionic strength of mother liquors. Of course, collection of X-ray data on crystals under gas pressure and the freezing of pressurized crystals present a number of technical challenges of their own, but satisfactory solutions have been developed to address these issues, as shown in the next section.

2. Xenon and krypton derivatives usually present a high degree of isomorphism. This is a direct consequence of the weak noncovalent interactions that are involved in the binding of noble gases to proteins. Although rearrangements of side chains and displacements of water molecules may occur on xenon or krypton binding, these are usually small, local structural changes that only marginally degrade the quality of the phases obtained from such derivatives.

3. Because of their affinity for apolar environments, xenon and krypton atoms are likely to bind to sites that are different from those of standard heavy atoms such as Hg or Pt, which bind predominantly to specific functional groups. In MIR phasing, noble gas derivatives may therefore provide phase information that is truly complementary to that obtained from other derivatives. The structure determination of the enzyme urate oxidase from Aspergillus flavus34 illustrates this advantage: Hg and Pb derivatives had been obtained by standard soaking techniques, but the
major binding sites of both cations were close to each other, thus limiting their phasing power. Subsequently, a xenon derivative was prepared and its single binding site was found to be far away from the Hg and Pb sites. Combining the data from all derivatives led to a substantial improvement of the MIR phases and allowed the structure to be solved.

4. When there are several binding sites, their number and occupancies can be modified by varying the applied gas pressure. This can give rise to several different derivatives.

5. The anomalous scattering properties of xenon and krypton were outlined above. Although edge experiments with xenon are feasible, the true advantage of this heavy atom resides in its large anomalous signal at the standard Cu Kα wavelength. This makes it an ideal choice for SIRAS experiments on a laboratory home source.26 Krypton has the advantage that K edge experiments are routinely feasible on most synchrotron beamlines. The absence of white-line features in the absorption spectrum of krypton31 implies that the maximum anomalous signal that can be exploited is rather modest (2f″ is on the order of 7–7.5 at the peak wavelength). However, single- or multiple-wavelength phases can often be improved by also using isomorphous differences with respect to the native structure. With xenon or krypton derivatives, this additional phase information comes practically for free, because native data are almost always available.

6. Not all proteins contain noble gas-binding sites. From the experience gathered at LURE and SSRL,40 we can estimate conservatively that the success rate for xenon binding is higher than 30–40%. It is, however, possible to generate noble gas-binding sites by site-directed mutagenesis, as was first suggested by Vitali et al.26 Quillin and Matthews44 have demonstrated that by truncating leucine and phenylalanine residues to alanine, it is possible to create noble gas-binding sites in the core of T4 lysozyme. Multiple mutations offer the possibility of obtaining multiple xenon derivatives, with different binding sites. This method is of general applicability, including cases in which selenomethionine incorporation may be difficult or impossible (e.g., in expression systems other than Escherichia coli). The mutated protein may be destabilized owing to cavity formation, but this disadvantage is largely outweighed by the fact that expression levels for mutant proteins grown in rich media are typically much higher than those obtained for wild-type proteins grown in minimal media during selenomethionine incorporation. Quillin and Matthews44 estimate that if leucine-to-alanine substitutions are made at random, there is, for each mutation, a chance of about 30% of generating a useful noble gas-binding site. The same success rate is likely to apply to mutations of other bulky hydrophobic residues (phenylalanines, isoleucines, and valines) to alanine.
Preparation and X-Ray Data Collection of Noble Gas Derivatives
The strategies for preparing isomorphous noble gas derivatives differ, depending on whether X-ray data are collected at room temperature or on flash-frozen crystals. At room temperature, because the process of xenon/krypton binding is completely reversible, the gas pressure must be maintained during the data collection. On the other hand, once pressurized crystals are frozen at cryogenic temperatures, the gas atoms are trapped at the binding sites so that data collection can proceed in the standard way, without the need to maintain gas pressure. Both techniques will be described below.

Room Temperature Experiments

Historically, all xenon-binding experiments performed before 1997 were carried out at room temperature. The protein crystal is subjected to a pressurized xenon gas atmosphere and the pressure is maintained during X-ray data collection. Schoenborn et al.11 did not give any detailed description of the pressure cell they used, merely stating that "the crystals were mounted in a special cassette." Vitali et al.26 used crystals mounted and pressurized in sealed quartz capillaries. However, their technique of flame-sealing the capillary while the crystal is already mounted and pressurized had a rather high failure rate. Tilton71 also presented an enclosed fixture with beryllium shrouds for diffraction studies under high pressures (up to 400 bar), but this has not been found particularly useful for the more moderate pressures that are applied in noble gas-binding experiments.

Pressurization Apparatus. In the design of our own pressure cell,28 we have drawn on the idea of mounting and pressurizing crystals in quartz capillaries. Preliminary tests have shown that, despite a wall thickness of only 1/100 mm, these capillaries resist pressures up to 25 bar. For higher pressures (up to 60 bar) we have successfully used capillaries with an increased wall thickness of 3/100 mm. The pressure cell is depicted in Fig. 5. The main part is a small brass piece that connects two perpendicularly mounted Swagelock male gas fittings and that can be fixed on a standard Huber goniometer head. The smaller 1/16-in. gas fitting is connected to the gas reservoir via PEEK 1/16-in. tubing (Teflon tubing should not be used as it tends to become porous at high pressures). The larger 1/8-in. gas fitting connects a small copper tube into which the quartz capillary is glued with epoxy. The capillary is flame-sealed at one end and open at the end that connects with the gas fitting. The device is supplemented (Fig. 6) by a metering valve, a bleeder valve, and a Swagelock "quick" connector, which allows
71 R. F. Tilton, Jr., J. Appl. Crystallogr. 21, 4 (1988).
Fig. 5. The pressure cell for room temperature xenon- and krypton-binding diffraction experiments. (A) The cell mounted on a standard goniometer head. (B) The spare parts of the pressure cell. Further explanations are given in text.
Fig. 6. Details of the xenon/krypton line. Explanations are given in text.
one to disconnect (and reconnect) the pressure cell from the xenon line without loss of pressure inside the cell. The Swagelock gas fittings and the high-pressure tubing are routinely used in gas chromatography and can be purchased from specialized suppliers.
With this apparatus, the maximum pressure that can be reached may be limited by the pressure in the gas cylinder. It is, however, possible to insert a small hand pump between the gas tank and the pressure cell. This allows gas to be pumped from the cylinder into the pressure cell, even when the pressure in the cylinder has dropped to a low level.

Capillary Preparation. X-ray absorption by the pressurized xenon or krypton gas in the capillary can be substantial, especially at high pressures. One way to circumvent this problem is to reduce the amount of gas in the X-ray beam path. It is thus of great importance to select a capillary of the smallest possible diameter that is consistent with the size of the crystal. The selected capillary should be carefully inspected under a light microscope to make sure no microcracks are present. The sealed end should be inspected with particular scrutiny and, if necessary, flame-sealing should be redone. The mounting to the copper tube is done with standard epoxy glue. The bending moment exerted on the walls of an ideal cylinder put under gas pressure and firmly fitted at both ends is proportional to the square of its length. Thus, capillaries must be mounted as short as possible: 15 mm or less. In our experience, the most frequent cause of capillary explosion is failure to respect this recommendation. If the crystals are rare and precious, it might be advisable first to check the mounted but empty capillary under N2 pressure.

Crystal Mounting and Pressurization. Using a flame-drawn Pasteur pipette, the sealed bottom end of the capillary is filled with a small amount of buffer. Care must be taken to avoid inclusion of air bubbles. This is usually enough to prevent dehydration of the crystal. The crystal can then be transferred into the capillary with a Lindemann tube fitted to a 20-μl Gilson pipette. An alternative method of crystal mounting is described in Fig. 7. The capillary is filled with buffer and the crystal is deposited at the top of the buffer column. With the capillary in a vertical position, the crystal will sink down by gravity. The excess buffer can then be removed with a flame-drawn Pasteur pipette. Whichever method is used, it is important to ensure that no remaining buffer drops block the access of gas to the crystal; that is, the interior walls of the capillary should be dried out as well as possible with small strips of filter paper. The pressure cell can then be connected to the gas tank. A few (typically five) pressurization–depressurization cycles (by alternately opening and closing the metering valve and the bleeder valve) are sufficient to purge the residual air out of the system. The crystal is then pressurized to the selected pressure, and X-ray data collection can start after a short equilibration time (10 min seems to be long enough28,40). To avoid losing large amounts of gas in the event of a capillary explosion or leak, the tank valve is closed immediately after pressurization. Monitoring the pressure inside
Fig. 7. Mounting a protein crystal in the room temperature cell. Explanations are given in text.
the system during the first few minutes will also allow the detection of any leak. For obvious safety reasons, experimenters must be adequately trained in the handling of gas-pressurized equipment. Even though the explosive power of a tiny capillary is relatively small, the wearing of safety glasses is mandatory at all steps of the procedure. Initial pressurization is best performed behind protective shields.

X-Ray Data Collection. Once the crystal is pressurized, data collection can proceed in the usual way, except for one point that needs particular attention. Because of the possible absorption of X-rays by the pressurized gas in the capillary, the crystal should be oriented on the spindle axis at an angle such that the X-ray beam paths in the capillary are minimal for the diffracted beams (see Fig. 8). In other words, it should be the incident beam that is absorbed by the pressurized gas, not the diffracted beams. Absorption of the incident beam can easily be corrected for by a single angle-dependent scale factor. It is much more difficult to correct for the absorption of the diffracted beams, because it depends on many more geometric parameters.
Fig. 8. Absorption of X-rays by pressurized gas in the capillary. (A) If the crystal is oriented in such a way that it is the incident X-ray beam that travels through the gas atmosphere, then only that beam will be absorbed and the overall effect is a reduction in intensity that is the same for all diffracted beams. (B) If the crystal is oriented in such a way that the diffracted beams travel through the gas atmosphere, then the absorption will affect the intensity of each beam differently, resulting in a deterioration of data quality.
Two main problems may occur in room temperature experiments: absorption of X-rays and formation of xenon or krypton hydrate. As already mentioned above, the absorption of X-rays by pressurized xenon gas may be substantial, especially at longer wavelengths. The linear absorption coefficients of xenon and krypton gas under standard conditions (pressure, 1.013 bar; temperature, 273.15 K) are plotted in Fig. 9.72 To obtain values at different pressures or temperatures, one can use the ideal gas equation, considering that linear absorption coefficients are proportional to the gas density. Values for the intensity reduction of X-ray beams in a xenon gas atmosphere at various pressures and for selected path lengths and wavelengths are presented in Table II. We recall that it is essential to minimize the beam path in the xenon gas atmosphere by selecting a capillary of the smallest possible diameter consistent with the crystal size. Absorption problems are considerably reduced at shorter wavelengths. However, anomalous scattering factors also diminish correspondingly, and the choice of an appropriate wavelength will result from a tradeoff between
72 E. B. Saloman, J. H. Hubbel, and J. H. Scofield, At. Data Nucl. Data Tables 38, 1 (1988).
Fig. 9. Linear absorption coefficients μ (cm⁻¹) for xenon and krypton at a temperature of 273.15 K and a gas pressure of 1.013 bar, computed from data reported in Ref. 72. To compute values for other pressures or temperatures, the ideal gas equation can be used, considering that μ is proportional to the gas density.
the two factors. Because the main phasing power of a xenon derivative comes from its isomorphous signal, it is probably wise to sacrifice part of the anomalous signal strength for more accurate data at shorter wavelengths. On a laboratory home source, the choice of wavelengths is of course more limited, and for data collections at the Cu Kα wavelength one should aim for high redundancy in the data. Carefully applied local scaling methods will then allow one to correct for most of the systematic errors that are due to absorption.31 With krypton gas, absorption is much less of a problem. However, in experiments at the K edge there will be increased background noise due to fluorescence from the pressurized krypton gas. The same holds true for experiments at the xenon K edge. Again, minimizing the beam path through the pressurized gas will help to solve this problem. It is important to realize that the concentration of xenon or krypton in the gas phase is much higher than in the liquid phase surrounding the crystal (Ostwald coefficients are less than 1). Thus, absorption is primarily due to the pressurized gas in the capillary and not to the gas dissolved in the crystal mother liquor.
TABLE II
Absorption of X-Rays by Pressurized Xenon Gas^a

                                        X-ray beam path length (mm)
Wavelength (Å)     Xe pressure (bar)    0.1      0.3      0.6
1.54 (Cu Kα)       10                   0.84     0.59     0.36
                   25                   0.65     0.28     0.08
                   50                   0.43     0.08     0.06
0.90               10                   0.96     0.88     0.78
                   25                   0.90     0.74     0.54
                   50                   0.81     0.54     0.29
0.70 (Mo Kα)       10                   0.98     0.94     0.88
                   25                   0.95     0.85     0.72
                   50                   0.90     0.72     0.52

^a Reported here are computed values of I/I0 (where I0 is the intensity of the incident X-ray beam and I is the intensity reduced by absorption) for various xenon gas pressures and X-ray beam path lengths.
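The values of I/I0 in Table II follow from Beer–Lambert attenuation, with the linear absorption coefficient scaled in proportion to the gas density via the ideal gas equation, as described above. The sketch below is a minimal illustration of this calculation; the coefficient used for xenon at the Cu Kα wavelength and the assumed room temperature are illustrative values back-calculated from the tabulated transmissions, not numbers quoted in the text.

```python
import math

def transmission(mu_stp, pressure_bar, temperature_k, path_mm):
    """Beer-Lambert transmission I/I0 through a column of pressurized gas.

    mu_stp is the linear absorption coefficient (cm^-1) at 1.013 bar and
    273.15 K (the conditions of Fig. 9); it is scaled linearly with the
    gas density using the ideal gas equation.
    """
    mu = mu_stp * (pressure_bar / 1.013) * (273.15 / temperature_k)  # cm^-1
    return math.exp(-mu * path_mm / 10.0)  # path converted from mm to cm

# Xenon at the Cu K-alpha wavelength, assuming 293.15 K and an illustrative
# coefficient of ~1.9 cm^-1 at standard conditions.
for p in (10, 25, 50):
    row = [round(transmission(1.9, p, 293.15, x), 2) for x in (0.1, 0.3, 0.6)]
    print(f"{p:>2} bar: {row}")
```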
Xenon and krypton hydrates are solid clathrates formed by a host structure of water molecules in which the noble gas atoms are retained as guests in closed cavities. In some cases, we have observed the formation of xenon hydrate during X-ray data collection on protein crystals.28 The hydrate forms with the mother liquor and leads to the disruption of the crystal lattice and the appearance of a characteristic powder diffraction pattern on the detector. The salts and/or organic solvents that are often present inhibit the formation of hydrates to a certain extent. In our experience, at pressures below 15 bar, xenon hydrate formation is not a serious problem, except for crystal mother liquors that contain only small amounts of salts or organic solvents and at temperatures close to 0°. The addition of cryoprotectants will also help to prevent hydrate formation. One advantage of using krypton is that its hydrate forms only at much higher pressures.

Cryocrystallographic Experiments

Cryocrystallographic experiments, in which the protein crystals are flash-frozen in a stream of cold N2 vapor,73 have revolutionized the field of macromolecular crystallography, mainly because they drastically reduce the rate of radiation damage in the samples.74 The standard technique of
73 E. F. Garman and T. R. Schneider, J. Appl. Crystallogr. 30, 211 (1997).
74 A. C. M. Young, J. C. Dewan, C. Nave, and R. F. Tilton, Jr., J. Appl. Crystallogr. 26, 309 (1993).
flash-freezing crystals is, however, not directly applicable to noble gas derivatives, the reason being that it is difficult to freeze crystals while at the same time maintaining them under xenon or krypton gas pressure. The boiling points of xenon and krypton are indeed higher than the cryogenic temperatures at which crystals are usually frozen. Attempts to freeze crystals by plunging the pressurized capillaries into liquid N2 lead to the immediate condensation of xenon in the capillary. A possible way out of this quandary is to freeze the crystals at pressures above the critical point of the gas. In this case, the phase change is continuous, taking place on a time scale of tens of seconds—enough time to flash-freeze the crystal and release the gas pressure.31 The alternative method, which is now used in all published cryopressurization cells, consists of separating the pressurization and flash-freezing steps. It has indeed been found that, although xenon and krypton binding is reversible, the kinetics of the binding process in crystals is such that the gas pressure can be released before flash-freezing, provided the time lapse between the two steps is only on the order of a few seconds. Once the crystal is frozen, the xenon or krypton atoms remain at the binding sites even in the absence of external gas pressure. This is due to the high viscosity of the amorphous phase that the crystal water forms at cryogenic temperatures. It may also be related to the low vapor pressures of xenon and krypton at these temperatures.

Kinetics of Noble Gas Binding. The kinetics of xenon binding to crystals of porcine pancreatic elastase was first investigated by Schiltz et al.,28 who showed that the binding is essentially completed within minutes after the initial pressurization. A more detailed investigation of the kinetics of xenon binding and unbinding to myoglobin crystals was conducted by Soltis et al.41 They showed that the diffusion of xenon atoms from the tight binding sites following depressurization occurs on a time scale of minutes. Similar observations were made by Sauer et al.,37 who intentionally prolonged the time between gas pressure release and shock freezing by 40 s and found that this leads to only a 20% decrease in the occupancy of the xenon atoms. Thus, it can be assumed that during the few seconds required for xenon pressure release and freeze quenching, only a small fraction of the dissolved xenon should leave the protein crystal.

Pressurization Cells. Pressurization cells for cryocrystallographic experiments have been presented by Sauer et al.,37 Soltis et al.,41 Djinovic-Carugo et al.,45 and Vernède and Fontecilla-Camps.75 Commercially available devices (see Fig. 10) include the Xcell from Oxford Cryosystems,76 the Cryo-Xe-Siter from Rigaku/MSC, and the Xenon Chamber
75 X. Vernède and J. C. Fontecilla-Camps, J. Appl. Crystallogr. 32, 505 (1999).
76 Oxford Cryosystems, Acta Crystallogr. D Biol. Crystallogr. 55, 724 (1999).
Fig. 10. Three commercially available devices for pressurization and flash-freezing of xenon and krypton derivatives. (A) The Xcell from Oxford Cryosystems; (B) the Xenon Chamber from Hampton Research; (C) the Cryo-Xe-Siter from Rigaku/MSC.
from Hampton Research. We do not attempt to give detailed descriptions of each apparatus; they all operate in a similar mode: the crystal is mounted in a loop attached to a crystal cap pin. The free-standing film technique77 may also be used. The pressure cell essentially consists of a
small gas chamber containing a magnet onto which the crystal cap can be mounted. A small amount of buffer is deposited in the chamber to avoid dehydration of the crystal. The chamber is then sealed and pressurized xenon or krypton gas is let in. After 5 min of pressurization, the gas pressure is released by opening a bleeder valve, and the mounted crystal can be extracted and immediately flash-frozen, either in a stream of cold N2 vapor or by directly plunging it into a cryogenic liquid. The Cryo-Xe-Siter device from Rigaku/MSC has a two-chamber design, which allows flash-cooling of the crystal while it is still in the pressurized atmosphere. This alleviates problems during the delicate step of dismounting the crystal holder from the pressure chamber before flash-freezing. The design goals for pressure cells have been laid down by Sauer et al.37: (1) dehydration of the crystal and its surrounding cryoprotectant during pressurization must be avoided; (2) the transfer of the crystal after pressure release must be as speedy as possible; (3) the cell should be designed to minimize gas turbulence around the crystal during pressure changes, to avoid blowing the crystal from its mount; (4) the total volume of the chamber should be small, to prevent excessive consumption of expensive xenon or krypton gas; (5) it should be possible to observe the crystal through a microscope during pressurization; and (6) the instrument should be simple to build and easy to operate. For a commercialized apparatus, the price will also be an important criterion. They also noted that not all design criteria can be equally well addressed with a single instrument. In fact, the diversity of available pressurization cells simply reflects the varying priorities that different constructors have given to each of these design criteria. A glance at Fig. 10 reveals the outcome of these differing choices. In opting for one or another of the devices, prospective users should therefore carefully weigh the advantages and drawbacks of each apparatus and how these match their own needs. From an entirely personal point of view, we find the Mark 2 pressure cell developed by Sauer et al.37 appealing because it presents a well-balanced realization of these various design goals. Finally, we should mention that Panjikar and Tucker78 have demonstrated that xenon can penetrate paraffin oil and panjelly layers surrounding crystals. With this method, cryocooled xenon derivatives of porcine pancreatic elastase could be prepared successfully without the need to transfer the crystals to a cryoprotectant buffer and to maintain them under the vapor pressure of this buffer during the pressurization step.

Room Temperature Experiments versus Cryocrystallography. The most distinct advantage of protein cryocrystallography over room temperature
77 T. Teng, J. Appl. Crystallogr. 23, 387 (1990).
78 S. Panjikar and P. Tucker, J. Appl. Crystallogr. 35, 117 (2002).
experiments is the dramatically reduced radiation damage.74 This leads to a longer sample lifetime (especially with intense synchrotron radiation), which in turn allows the collection of more data from a single crystal, thus reducing intercrystal scaling errors as well as other systematic errors. Furthermore, the specific drawbacks of noble gas experiments at room temperature are essentially resolved: absorption is not a major problem because crystals, once frozen, do not have to be maintained under gas pressure; xenon or krypton hydrate formation is usually avoided, even at high pressures, because the onset of clathrate formation is generally much slower than the pressurization times.37 At first glance, it thus seems that there is little to be said in favor of room temperature experiments with noble gases. However, the difficulties in obtaining suitable cryoprotectant buffers for flash-freezing crystals are rarely reported. Increased mosaicities, anisotropic intensity falloffs, irreproducible cell parameter changes, and increased nonisomorphism between crystals are pathologies frequently observed in flash-cooling experiments. In contrast, the room temperature pressurization and data collection procedure that we have described above presents a number of interesting advantages that cannot be exploited in cryocrystallographic experiments.

1. Derivative and native data can be collected on the same crystal, in the same orientation, provided the sample is sufficiently resistant to radiation damage. This strategy allows many systematic errors to cancel out in the process of extracting phase information from the measured intensity differences.31

2. It is possible to modify the gas pressure and to collect data at several different pressures on the same crystal. This can be useful to produce derivatives with varying numbers of sites and substitution levels.

3. If radiation damage is a serious problem, the method enables one to pressurize several crystals in the same capillary, which can then be used for consecutive data collections under nearly identical conditions.

The user should therefore carefully consider whether or not to conduct noble gas experiments at room temperature. If significant radiation damage occurs, then cryocooling is probably the only option. If, however, the samples are stable in the beam (especially on a laboratory source), then it is worth considering the collection of native and derivative data on the same crystal at room temperature. In our experience, this strategy yields isomorphous differences of high accuracy, giving the best possible phasing information.31
Fig. 11. Isotherms of xenon and krypton binding to porcine pancreatic elastase at temperatures of 281.15 K (8°), 293.15 K (20°), and 301.15 K (28°). The occupancies of the noble gas atoms at the various pressure points were determined from X-ray diffraction experiments. Langmuir isotherms were fitted to the experimental data points by a least-squares procedure.
Which Gas Pressures Can Be Used?

The question arises as to the exact relationship between the substitution level of xenon or krypton in a protein crystal and the applied gas pressure. Insight into the thermodynamics of noble gas binding has been obtained from a crystallographic analysis of xenon and krypton binding to porcine pancreatic elastase at various gas pressures and temperatures.33 In this protein, xenon and krypton bind at a single location, which is the enzymatic site.29 The occupancies of the xenon or krypton atoms at various temperatures and as a function of gas pressure are plotted in Fig. 11. A simple interpretation of these data follows. Gas binding to the protein can be considered a two-step process:

1. Solvation of the gas atoms in the liquid solvent part of the crystal:

Xe(gas) ⇌ Xe(solv)

Assuming ideal behavior, this process follows Henry's law:

PXe(gas) = hXe(T) XXe(solv)
where PXe(gas) is the partial pressure of xenon in the gas atmosphere that is in equilibrium with the crystal, hXe(T) is the Henry's law constant at the temperature T, and XXe(solv) is the mole fraction of xenon dissolved in the crystal liquid.

2. Binding of xenon from the solvent phase to the discrete sites on the protein molecules (P):

Xe(solv) + P–0 ⇌ P–Xe

Here, P–0 denotes a protein molecule with an unoccupied site. Again, assuming ideal behavior, this process can be considered a chemical equilibrium obeying the law of mass action, with equilibrium constant K(T):

K(T) = NP–Xe / (NP–0 XXe(solv))

X denotes a mole fraction, whereas N denotes a number of moles; NP,tot is the total number of protein moles involved. Combining the two equilibrium equations gives

K(T) = NP–Xe hXe(T) / [(NP,tot − NP–Xe) PXe(gas)]

and after rearrangement, an expression for the occupancy of the gas atoms is found:

occupancy(T, PXe(gas)) = NP–Xe / NP,tot = γ(T) PXe(gas) / [1 + γ(T) PXe(gas)]

with

γ(T) = K(T) / hXe(T)
This is the functional form of the Langmuir isotherm equation that describes the adsorption of gases onto a surface in single layers.79 γ(T) is the temperature-dependent binding constant of the complete two-step process. Langmuir isotherms can be fitted to the experimental data of xenon and krypton binding to elastase, as shown in Fig. 11. When γ(T)PXe(gas) is much smaller than 1, the occupancy is a nearly linear function of the partial pressure, the proportionality constant being γ(T). On the other hand, as the pressure increases such that γ(T)PXe(gas) becomes much larger than 1, the occupancy asymptotically approaches unity. The value of the binding constant will of course vary from one protein to another and from one binding site to another.
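As a practical illustration of the isotherm, the sketch below inverts the occupancy expression to estimate the binding constant from a single refined occupancy measured at one pressure and then predicts occupancies at other pressures (a use discussed further below). The occupancy of 0.5 at 8 bar is an illustrative assumption, not a value from the text, and gamma stands for the binding constant of the two-step process as defined above.

```python
def langmuir_occupancy(pressure_bar, gamma):
    """Occupancy = gamma*P / (1 + gamma*P), the Langmuir isotherm derived above."""
    return gamma * pressure_bar / (1.0 + gamma * pressure_bar)

def gamma_from_occupancy(occupancy, pressure_bar):
    """Invert the isotherm: gamma = occupancy / ((1 - occupancy) * P)."""
    return occupancy / ((1.0 - occupancy) * pressure_bar)

# Illustrative example: a refined occupancy of 0.5 observed at 8 bar.
gamma = gamma_from_occupancy(0.5, 8.0)   # 0.125 bar^-1
for p in (1, 5, 15, 30, 50):
    print(f"{p:>2} bar -> predicted occupancy {langmuir_occupancy(p, gamma):.2f}")
```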
79 S. W. Benson, in "The Foundations of Chemical Kinetics." Robert E. Krieger, Malabar, FL, 1982.
Also, if several binding sites are present in one protein molecule, gas binding might exhibit some degree of cooperativity, as is indeed the case in myoglobin.37 In such cases, the simple description in terms of Langmuir isotherms does not hold.

What are the practical consequences of this relation for xenon or krypton binding? If a noble gas derivative is found to be weakly substituted, it should nevertheless be possible to refine (or at least estimate) the occupancies of the xenon or krypton sites. From the refined occupancies, a rough estimate of the binding constant can be obtained, assuming that one is still in the linear regime. Having an estimate of the binding constant will in turn allow one to choose the pressure range that will give the desired substitution level, or it will enable one to get an idea of the occupancies at the maximum experimentally reachable pressure. The maximum reachable pressures vary from one device to another. For cryofreezing experiments, the Mark 2 pressure cell of Sauer et al.37 seems to hold the current record, because the authors report experiments at pressures as high as 100 bar.39 At room temperature, the above-mentioned problems of X-ray absorption and possible clathrate formation become more and more severe as the pressure is raised and, broadly speaking, preclude experiments at pressures above 30 bar. In our laboratory, we usually perform room temperature experiments at a "standard" pressure of 15 bar. This seems to be a good compromise, at which absorption problems are still tractable and at which one can, at the same time, expect a good substitution level for xenon in a large number of cases. With krypton gas, it is possible to go to much higher pressures (50 bar) in either room temperature or cryotemperature experiments.31,42 Langmuir isotherms are thus useful tools to characterize noble gas binding to proteins. It is also possible to compute thermodynamic quantities (free energies) of xenon or krypton binding from these experimental data. In particular, it has been shown in the case of elastase that the temperature dependence of the binding constant can be entirely ascribed to the entropic change associated with the loss of translational degrees of freedom as the xenon or krypton atoms are transferred from the gas phase to discrete binding sites.33

Selected Case Studies
Xenon and Krypton Binding to Porcine Pancreatic Elastase Since our first studies of xenon binding to proteins, the 26-kDa protein elastase has become a kind of guinea pig for testing new experimental methods of noble gas derivatization. It has been used to explore the anomalous scattering of both krypton and xenon in SIRAS mode at their respective K edges.
For the krypton experiment,31 data were collected at the high-energy side of the krypton K edge (λ = 0.86 Å) on the DW21b beamline at the LURE-DCI synchrotron. A single crystal was used for both native and derivative data collection, under a krypton gas pressure of 56 bar. The occupancy of the single krypton site was found to be 0.5, giving isomorphous and anomalous signal strengths of, respectively, 15 and 2 electrons (at zero scattering angle). This derivative was used successfully for phase determination in SIRAS mode. After density modification by solvent flipping, the resulting 1.87-Å electron density map is of exceptionally high quality (see Fig. 12) and has a correlation coefficient of 0.90 with a map computed from the refined native structure. Such is the quality of the map, which is based purely on experimental data (i.e., before any model building), that more than 50 water molecules appear as spherical density peaks. That the equivalent of half a krypton atom is sufficient to phase a 26-kDa protein is by itself a remarkable result. The key ingredients for the success of this experiment were (1) the near-perfect isomorphism of the derivative, (2) the high quality of the data and the optimal design of the experiments, (3) the reduction of systematic errors in the isomorphous and anomalous intensity differences by collecting all data sets on the same, adequately preoriented sample and by careful local scaling of the data sets, and (4) the optimal statistical treatment of the isomorphous and anomalous differences by the maximum likelihood method with the program SHARP.66

The xenon experiment32 was carried out on the ID11 beamline at the ESRF, with X-rays tuned to the high-energy side of the K edge (λ = 0.36 Å) and with a gas pressure of 16 bar. It essentially confirmed the conclusions drawn from the krypton experiment with regard to the importance of careful data collection and processing strategies. In addition, this experiment was the first fully documented report of a complete and successful protein crystallography experiment carried out with X-rays of ultrashort wavelength. More recently, an analogous experiment was conducted at the LI edge of xenon (λ = 2.3 Å) on a frozen specimen. The data were collected at the ELETTRA synchrotron on a beamline that is specifically designed for data collection at longer wavelengths.80

N-Myristoyltransferase

The structure determination of N-myristoyltransferase81 also proves to be a nice illustration of the genuine advantages that highly isomorphous noble gas derivatives can offer. A three-wavelength MAD data set of
80 M. S. Weiss, T. Sicker, K. Djinovic-Carugo, and R. Hilgenfeld, Acta Crystallogr. D Biol. Crystallogr. 57, 689 (2001).
Fig. 12. Experimental electron density maps (after solvent flattening) for porcine pancreatic elastase obtained from SIRAS experiments. Left: The SIRAS map obtained from a xenon derivative with data collected at the high-energy side of the Xe K edge (λ = 0.3585 Å). Right: The SIRAS map obtained from a krypton derivative with data collected at the high-energy side of the Kr K edge (λ = 0.8655 Å). Superimposed on the maps is the model of the refined protein structure. It should be noted that these maps were computed from purely experimental data, that is, before any model building.
standard quality was collected on the Se-Met protein, but the determination of the positions of the 12 Se atoms proved intractable. At a subsequent stage, a flash-frozen xenon derivative was prepared at a gas pressure of 10 bar, and data were collected on a laboratory source with Cu Kα radiation. A second crystal was subjected to exactly the same treatment, but the xenon gas was allowed to diffuse out of the crystal before cryocooling, thus giving a highly isomorphous native data set. The authors report that the relative weakness of the isomorphous differences (Riso = 0.091) almost led them to discard the xenon derivative. However, the isomorphous difference Patterson map showed clear cross-vectors, and 10 xenon sites could be refined with SHARP. The resulting SIRAS phases allowed the straightforward detection of the positions of the Se atoms by difference Fourier analysis. Most surprisingly, the xenon SIRAS phases alone (after solvent flattening) turned out to be of better quality than both the combined Xe/Se phases and the pure Se-MAD phases (after solvent flattening). Eventually, the structure was solved from the solvent-flattened xenon SIRAS map.

Further Successful Applications

More data on protein structure determinations that involved xenon derivatives are presented in Table III.34–36,57–59,69,81–99 This summary is not intended to be exhaustive, but it shows the wide variety of protein classes for which noble gas derivatives have been found useful in structure determination. In many cases, xenon was used in conjunction with other heavy-atom derivatives in MIR or MIR + MAD mode. In a large number of these cases, the xenon derivatives turned out to be those that had the best
81 S. A. Weston, R. Camble, J. Colls, G. Rosenbrock, I. Taylor, M. Egerton, A. D. Tucker, A. Tunnicliffe, A. Mistry, F. Mancia, E. de la Fortelle, J. Irwin, G. Bricogne, and R. A. Pauptit, Nat. Struct. Biol. 5, 213 (1998).
82 N. Krauß, W. D. Schubert, O. Klukas, P. Fromme, H. T. Witt, and W. Saenger, Nat. Struct. Biol. 3, 965 (1996).
83 L. J. Beamer, S. F. Caroll, and D. Eisenberg, Science 276, 1861 (1997).
84 M. Welch, N. Chinardet, L. Mourey, C. Birck, and J. P. Samama, Nat. Struct. Biol. 5, 25 (1998).
85 L. Mourey, J. D. Pédelacq, C. Birck, C. Fabre, P. Rougé, and J. P. Samama, J. Biol. Chem. 273, 12914 (1998).
86 X. D. Su, L. N. Gastinel, D. E. Vaughn, I. Faye, P. Poon, and P. J. Bjorkman, Science 281, 991 (1998).
87 K. Vålegard, A. C. van Scheltinga, M. D. Lloyd, S. Ramaswamy, A. Perrakis, A. Thompson, H. J. Lee, J. E. Baldwin, C. J. Schofield, J. Hajdu, and I. Andersson, Nature 394, 805 (1998).
88 M. Hilge, S. M. Gloor, W. Rypniewki, O. Sauer, T. D. Heightman, W. Zimmermann, K. Winterhalter, and K. Piontek, Structure 6, 1433 (1998).
TABLE III
Protein Structures Solved with Xenon Derivatives^a

[For each structure, the table lists the molecule, the xenon-derivative conditions (gas pressure in bar, room temperature or cryocooled data collection, X-ray source and wavelength, and number of xenon sites), the phasing method, the resolution limit (Å), and the reference (Refs. 34–36, 57–59, 69, and 81–99).]

Abbreviations: NR, not reported; RT, room temperature; CC, cryocooling; RA, rotating anode; SR, synchrotron radiation; MIR(AS), multiple isomorphous replacement (with anomalous scattering); SIRAS, single-isomorphous replacement with anomalous scattering; MAD, multiwavelength anomalous scattering; MR, molecular replacement.
^a Reported here are only cases in which xenon derivatives have been used to solve an unknown protein structure; that is, test cases and studies of xenon complexes with known protein structures have been excluded.
phasing power. In a significant fraction of cases, xenon derivatives alone were used in SIRAS mode.

Other Applications of Xenon and Krypton in Structural Biology

This chapter has focused on the use of noble gases as heavy atoms and anomalous scatterers in protein crystallography. However, xenon in particular has been used for other interesting applications in macromolecular crystallography, of which we mention the mapping of hydrophobic cavities in proteins,21,30 the exploration of pathways for gas access in hydrogenases,100 the low-resolution tracing of detergent in crystals of membrane proteins,38 the tracking of putative substrate-binding cavities in methane monooxygenase hydroxylase,101 and the study and modeling of protein–small molecule interactions in engineered cavities of T4 lysozyme.43,102 Further, the use of the xenon isotopes 129Xe (spin I = 1/2) and 131Xe (spin I = 3/2) to investigate the binding of xenon to proteins and membranes by NMR spectroscopy was pioneered by Miller et al.103 Applications to the exploration of hydrophobic sites in solution are also to be mentioned.104 Laser-polarized 129Xe can be used effectively in high-resolution
89 M. Machius, L. Henri, and J. Deisenhofer, Proc. Natl. Acad. Sci. USA 96, 11717 (1999). 90 P. A. Williams, J. Cosme, V. Sridhar, E. F. Johnson, and D. E. McRee, Mol. Cell 5, 121 (2000). 91 M. Harel, G. Kryger, T. L. Rosenberry, W. D. Mallender, T. Lewis, R. J. Fletcher, J. M. Guss, I. Silman, and J. L. Sussman, Protein Sci. 9, 1063 (2000). 92 X. Zhang, L. Zhou, and X. Cheng, EMBO J. 19, 3509 (2000). 93 C. Bompard-Gilles, H. Remaut, V. Villeret, T. Prangé, L. Fanuel, M. Demarcelle, B. Joris, J. M. Frere, and J. van Beeumen, Structure 8, 971 (2000). 94 T. Zhou, M. Daugherty, N. V. Grishin, A. L. Osterman, and H. Zhang, Structure 8, 1247 (2000). 95 A. Dong, J. A. Yoder, X. Zhang, L. Zhou, T. H. Bestor, and X. Cheng, Nucleic Acids Res. 29, 439 (2001). 96 W. C. Wang, W. H. Hsu, F. T. Chien, and C. Y. Chen, J. Mol. Biol. 306, 251 (2001). 97 G. Capitani, R. Rossmann, D. F. Sargent, M. G. Grutter, T. J. Richmond, and H. Hennecke, J. Mol. Biol. 311, 1037 (2001). 98 F. D. Schubot, C. J. Chen, J. P. Rose, T. A. Dailey, H. A. Dailey, and B. C. Wang, Protein Sci. 10, 1980 (2001). 99 F. D. Schubot, I. A. Kataeva, D. L. Blum, A. K. Shah, L. G. Ljungdahl, J. P. Rose, and B. C. Wang, Biochemistry 40, 125524 (2001). 100 Y. Montet, P. Amara, A. Volbeda, X. Vernede, C. Hatchikian, M. J. Field, M. Frey, and J. C. Fontecilla-Camps, Nat. Struct. Biol. 4, 523 (1997). 101 D. A. Whittington, A. C. Rosenzweig, C. A. Frederick, and S. J. Lippard, Biochemistry 40, 3476 (2001). 102 G. Mann and J. Hermans, J. Mol. Biol. 302, 979 (2000). 103 K. W. Miller, N. V. Reo, A. J. M. Schoot-Uiterkamp, D. P. Stengle, T. R. Stengle, and K. L. Williamson, Proc. Natl. Acad. Sci. USA 78, 4946 (1981). 104 C. Landon, P. Berthault, F. Vovelle, and H. Desvaux, Protein Sci. 10, 762 (2001).
magnetic resonance imaging of living organisms105 and the radioactive isotope 133Xe is used in nuclear medicine to study cerebral blood flow and pulmonary function.106 There has also been a considerable resurgence of interest in the anesthetic action of xenon. We quote only some representative reports of the large number of experimental investigations on the molecular mechanism of xenon anesthesia.107–109 Finally, with the advent of low-flow, closed-system anesthesia machines, xenon is beginning to be used as a clinical anesthetic agent, especially as a replacement for nitrous oxide, over which it seems to have many advantages.110 As such, xenon has been predicted to become the "anesthesia for the 21st century."111

Concluding Remarks
After a long period of latency, the usefulness of noble gas derivatives for isomorphous replacement and anomalous scattering methods is now firmly established. Xenon and krypton derivatives have a number of attractive advantages, and the number of cases in which they have been used successfully in structure determinations is steadily growing. The good phasing power of noble gas derivatives is the result of their high degree of isomorphism and of their optimal anomalous scattering properties, permitting the computation of phases from a single derivative, even for relatively large structures. The technical aspects of noble gas derivatization and X-ray data collection are now well mastered, thanks to the joint efforts of several research teams around the world. With the current emphasis on high-throughput macromolecular crystallography, the use of xenon and krypton derivatives should feature prominently in the list of available tools to solve protein structures. The method of generating noble gas-binding sites by site-directed mutagenesis bears with it the promise of an almost universally applicable experimental phasing method.

Acknowledgments

This work was supported by EXMAD European contract ref. HPRI CT-1999-50015.
105 M. S. Albert, G. D. Cates, B. Driehuys, W. Happer, B. Saam, C. S. Springer, Jr., and A. Wishnia, Nature 370, 199 (1994). 106 G. B. Saha, W. J. MacIntyre, and R. T. Go, Semin. Nucl. Med. 24, 324 (1994). 107 N. P. Franks, R. Dickinson, S. L. M. de Sousa, A. C. Hall, and W. R. Lieb, Nature 396, 324 (1998). 108 T. Yamakura and R. A. Harris, Anesthesiology 93, 1095 (2000). 109 T. Yamakura, C. Borghese, and R. A. Harris, J. Biol. Chem. 275, 40879 (2000). 110 H. Schwilden and J. Schuttler, Anästhesiol. Intensivmed. Notfallmed. 36, 640 (2001). 111 J. A. Joyce, J. Am. Assoc. Nurse Anesth. 68, 259 (2000).
[5] Phasing on Rapidly Soaked Ions

By Ronaldo A. P. Nagem, Igor Polikarpov, and Zbigniew Dauter

Introduction
There are three basic ways of solving macromolecular crystal structures: the molecular replacement method, direct methods, and the heavy-atom method. Molecular replacement involves the use of a known search model, closely similar to the macromolecule being investigated. Direct methods are routinely used to solve the structures of small molecules, where the diffraction data extend to atomic resolution. As a consequence, neither of these two methods can be used to solve novel structures whose crystals diffract to lower than atomic resolution. The most general method of solving novel crystal structures is therefore a heavy-atom approach in its various modifications. In general, in this approach the initial phases are derived from differences in crystal scattering caused by the presence of a small number of heavy and/or anomalously diffracting atoms, which can be inherently present in the native macromolecule or introduced by chemical derivatization. Various types of diffraction differences can be employed. In the single or multiple isomorphous replacement (SIR or MIR) methods only the differences between the intensities of the native and derivative crystals are used. If anomalous differences are used in addition, the method becomes isomorphous replacement with anomalous scattering (SIRAS or MIRAS). If only the differences between Friedel-related intensities at one or more X-ray wavelengths are used, the technique is termed single- or multiple-wavelength anomalous dispersion (SAD or MAD), respectively. The details of each of these approaches are comprehensively discussed in many classic textbooks.1,2 The heavy atoms providing the initial phase information can be present in the original macromolecule, such as certain transition metals in metalloproteins or, more recently proposed as general tools, sulfur in proteins3,4 and phosphorus in nucleic acids.5 However, most general is derivatization before or after crystallization. An example of the former is
1 T. L. Blundell and L. N. Johnson, "Protein Crystallography." Academic Press, New York, 1976. 2 J. Drenth, "Principles of Protein X-Ray Crystallography," 2nd Ed. Springer-Verlag, Heidelberg, Germany, 1999. 3 Z. Dauter, M. Dauter, E. de La Fortelle, G. Bricogne, and G. M. Sheldrick, J. Mol. Biol. 289, 83 (1999). 4 E. Micossi, W. N. Hunter, and G. A. Leonard, Acta Crystallogr. D Biol. Crystallogr. 58, 21 (2002). 5 Z. Dauter and D. A. Adamiak, Acta Crystallogr. D Biol. Crystallogr. 57, 990 (2001).
the production of selenomethionine protein variants6 by genetic engineering for the MAD technique. The classic derivatization approach involves prolonged soaking of the native crystals in dilute solutions of various heavy metal salts and coordination compounds. Such soaking procedures are time consuming and often unsuccessful, owing to the lack of heavy-atom binding or the deterioration of crystal quality. Dauter and co-workers have proposed that certain simple anions or cations suitable for phasing, such as halides or alkali metals, can be introduced into protein crystals by rapid soaks in appropriate cryoderivatization solutions7,8 (the "quick cryosoaking" approach). This procedure combines, in one rapid single step, derivatization and cryogenic protection. Immediately before freezing the crystal for data collection, a native crystal is immersed for a short period of time in a cryoprotectant solution drop that also contains a high concentration of the appropriate salt. Compared with classic soaks, which combine low heavy-metal concentration and long immersion times, this procedure is able to generate good isomorphous derivatives significantly faster. This approach is based on the somewhat different chemical behavior of halide and alkali ions in comparison with the classic heavy-atom compounds. Protein crystals contain a significant proportion of liquid solvent phase, filling the voids between the more or less globular protein molecules. Various small chemical compounds can diffuse through the solvent channels within protein crystals, and this has often been used to obtain, for example, enzyme complexes with inhibitors, cofactors, and so on. This diffusion is quick, as evidenced by the frequent presence of glycerol molecules bound to the protein surface after a short (2- to 5-s) immersion of the crystal in a cryoprotecting solution containing glycerol.9 The rapid soak approach uses this property of protein crystals, which allows small ions to diffuse within a short time into the solvent regions surrounding the protein molecules and adopt ordered sites at their surface.

Ions Used for Rapid Soaks
Both negatively charged heavy halides and positively charged heavy alkali ions have been proposed for this fast derivatization approach. 6
6 W. A. Hendrickson and C. M. Ogata, Methods Enzymol. 276, 494 (1997). 7 Z. Dauter, M. Dauter, and K. R. Rajashankar, Acta Crystallogr. D Biol. Crystallogr. 56, 232 (2000). 8 R. A. P. Nagem, Z. Dauter, and I. Polikarpov, Acta Crystallogr. D Biol. Crystallogr. 57, 996 (2001). 9 J. Lubkowski, Z. Dauter, F. Yang, J. Alexandratos, G. Merkel, A. M. Skalka, and A. Wlodawer, Biochemistry 38, 13512 (1999).
The presence of chloride anions has been observed in the structures of proteins crystallized from solutions containing a significant concentration of sodium chloride, for example, in tetragonal lysozyme.3,10 When lysozyme was crystallized from a solution containing NaBr11,12 or NaI,13 a number of sites occupied by these halides appeared at the protein surface. This observation, and the analysis of data collected on a few test crystals, led to the proposal of using soaked bromides and iodides for phasing.7 The two heavier halides, bromine and iodine, display a significant anomalous signal in the range of wavelengths easily accessible by most synchrotron beam lines. Bromine has its K absorption edge at 0.92 Å (13,474 eV) and is appropriate for phasing by the MAD method. Bromine has one more electron than selenium, and it has been used for MAD solution of oligonucleotide structures after substituting thymine by the almost isostructural bromouracil.14 It can be considered the nucleic acid equivalent of the SeMet substitution in protein crystallography.6 The iodine absorption edges (K at 0.37 Å and LI at 2.39 Å) are not easily accessible, and iodine is therefore not suitable for MAD work. However, it retains a significant anomalous signal (f'' = 6.8 electron units) at the copper characteristic wavelength of 1.54 Å. Iodine has been used as a heavy atom in protein crystallography after chemical modification of the tyrosine aromatic rings.15,16 Chlorine, the halide lighter than bromine or iodine, has its K edge at a long wavelength (4.39 Å), and displays only a small anomalous effect at more accessible wavelengths (f'' = 0.70 at 1.54 Å and f'' = 0.88 at 1.74 Å). Nevertheless, with accurately measured data it is possible to use its anomalous signal for phasing.3,17,18
10 C. C. F. Blake, G. A. Mair, A. C. T. North, D. C. Phillips, and V. R. Sarma, Proc. R. Soc. Lond. B Biol. Sci. 167, 365 (1967). 11 K. Lim, A. Nadarajah, E. L. Forsythe, and M. L. Pusey, Acta Crystallogr. D Biol. Crystallogr. 53, 240 (1998). 12 Z. Dauter and M. Dauter, J. Mol. Biol. 289, 93 (1999). 13 L. K. Steinrauf, Acta Crystallogr. D Biol. Crystallogr. 54, 767 (1998). 14 J. L. Smith and W. A. Hendrickson, in "International Tables for Crystallography" (M. G. Rossmann and E. Arnold, eds.), Vol. F, Chapter 14.2.1. Kluwer Academic, Dordrecht, The Netherlands, 2001, p. 299. 15 L. Q. Chen, J. P. Rose, E. Breslow, D. Yang, W. R. Chang, W. F. Furey, M. Sax, and B. C. Wang, Proc. Natl. Acad. Sci. USA 88, 4240 (1991). 16 L. Brady, A. M. Brzozowski, Z. S. Derewenda, E. Dodson, G. Dodson, S. Tolley, J. P. Turkenburg, L. Christiansen, B. Huge-Jensen, L. Norskov, L. Thim, and U. Menge, Nature 343, 767 (1990). 17 C. Lehmann, Ph.D. thesis, University of Göttingen, Göttingen, Germany, 2000. 18 P. J. Loll, Acta Crystallogr. D Biol. Crystallogr. 57, 977 (2001).
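The quoted f'' values give a quick feel for how large an anomalous signal such soaked scatterers can provide. The sketch below is a rough, hedged illustration using a standard Crick–Magdoff-style rule of thumb rather than anything stated in this chapter; the effective normal scattering per protein atom (about 6.7 electrons) and the count of roughly 7.7 non-hydrogen atoms per residue are conventional assumptions, not values from the text.

```python
import math

def bijvoet_ratio(n_sites, f_double_prime, n_residues,
                  z_eff=6.7, atoms_per_residue=7.7):
    """Rule-of-thumb estimate of <|dF+-|>/<|F|> for n_sites ordered
    anomalous scatterers in a protein of n_residues residues."""
    n_atoms = n_residues * atoms_per_residue      # non-H protein atoms (assumed)
    return math.sqrt(2.0 * n_sites) * f_double_prime / (math.sqrt(n_atoms) * z_eff)

# e.g. ten iodide sites (f'' = 6.8 e at the copper characteristic wavelength)
# soaked into a 300-residue protein
print(f"{bijvoet_ratio(10, 6.8, 300):.1%}")       # roughly 9%
```

Estimates of this kind only indicate whether a proposed soak is likely to yield a measurable Bijvoet signal; the actual signal depends on site occupancies and data quality.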
The use of heavy alkali metals for rapid cryosoaks was proposed as an extension to the quick cryosoaking approach with halides.8 The heavier alkali metals rubidium and cesium have two electrons more than the halides bromine and iodine, respectively. This difference, extremely important from the chemical point of view, allows Cs or Rb cations to occupy different positions in the crystal structure when compared with the positions occupied by I or Br anions. This fact adds additional flexibility to the procedure of quick cryosoaking and permits combined use of such derivatives in MIR(AS) phasing, even when none of them is strong enough to produce interpretable electron-density maps individually. The K absorption edge of rubidium (0.82 Å) is in the same range as the Br and Se K edges, which makes it a suitable atom for MAD phasing. Indeed, it has been tested as a useful MAD phasing source.19 On the other hand, cesium atoms, even though possessing a strong anomalous signal with f'' = 7.90 electrons at a wavelength of 1.54 Å, are not suitable for MAD experiments. The Cs absorption edges (K at 0.34 Å and LI at 2.17 Å) are far from the wavelength range accessible at most synchrotron beam lines. The anomalous signal of cesium has been used for phasing in the past, for example, for gramicidin.20

Differences between Classic Reagents and Ions Used for Quick Soaks
The ions used for the quick-soak approach differ in their chemical properties from the classic heavy-atom reagents. Halides in water solution occur as uncoordinated, monoatomic anions, although they interact with water through hydrogen bonds. Alkali cations are coordinated by water molecules, but the coordination is not strong, and the aquo ligands of the metal can easily be exchanged, for example, by carboxyl or carbonyl oxygen atoms. In contrast, in many standard heavy-atom reagents the ligands are strongly coordinated or covalently bound to the metal.21 To bind to the appropriate protein sites such reagents must undergo partial hydrolysis or a similar chemical substitution. They usually form strong complexes or bind covalently to certain chemical functions of the protein, for example, mercury reagents with cysteine sulfhydryl groups. If present in higher concentration, these reagents often disrupt the protein intermolecular
19 S. Korolev, I. Dementieva, R. Sanishvili, W. Minor, Z. Otwinowski, and A. Joachimiak, Acta Crystallogr. D Biol. Crystallogr. 57, 1008 (2001). 20 B. A. Wallace, W. A. Hendrickson, and K. Ravikumar, Acta Crystallogr. B 46, 440 (1990). 21 D. Carvin, S. A. Islam, M. J. E. Sternberg, and T. E. Blundell, in "International Tables for Crystallography" (M. G. Rossmann and E. Arnold, eds.), Vol. F, Chapter 12.1. Kluwer Academic, Dordrecht, The Netherlands, 2001, p. 247.
interactions, damaging the crystalline order and adversely influencing the crystal diffraction. The usual procedures involve long (several hours or days) soaks at low, millimolar concentration of the appropriate reagent, allowing the chemical reactions to proceed slowly. Bromide and iodide anions are soft, monoatomic, and polarizable. They are attracted to the protein surface through various types of relatively weak, noncovalent interactions. First, because of their negative charge they can form ion pairs with the positively charged arginine and lysine side-chain functions. Second, they can accept hydrogen bonds from various proton donors, such as protein amides (in the main or side chains) and hydroxyls, as well as solvent water molecules. Third, they can interact with the hydrophobic surfaces of the protein. All these interactions are observed in protein crystals containing halide sites. With respect to chemical interactions with proteins, halides are not highly specific. The halide-binding interactions do not require any slow chemical reaction to take place and can be formed quickly. The cations are more specific in the character of their binding. Alkali ions have a preference for oxygen functions such as carboxyls (negatively charged), carbonyls, or water. Their sites lie in the vicinity of a few side-chain carboxyls, or main- and side-chain carbonyls, available at the protein surface, and usually have a number of water ligands. Rubidium and cesium ions are not strongly demanding of coordination geometry and can have five to eight ligands.19 Because they do not coordinate water molecules strongly, the substitution of water ligands by protein oxygens takes place rapidly. As stated above, binding of halides and alkali ions is not strong, and their sites around the protein surface are only partially occupied even if their concentration in the mother liquor is higher than 1 M. All these ions can share sites with water molecules, and their occupancy reflects the equilibrium competition between ion and water in binding, with variable strength, to the various protein functions. It is difficult to estimate accurately the absolute occupancy factors of these sites. The relative occupancies of the strongest anomalous sites in a few example structures are given in Table I. Figure 1 illustrates the most typical sites and coordination of soaked ions from the crystal structures solved by their use.

Procedure
A number of macromolecular structures from a variety of organisms with different functions have been solved using the quick cryosoaking approach with halides and alkali metals. In Table II8,22–34 we show several derivatization aspects of some of these structures, including the size of the protein,
soaking time, cryoderivatization conditions, and so on. It is easy to see from Table II that there is no general recipe for all types of proteins. However, a few instructive steps, acquired with practice, can be followed to enhance the chance of obtaining suitable derivatives for phasing. Normally, the cryoderivatization solution is prepared from the original mother liquor. The simple addition of a cryoprotectant and an appropriate salt in high concentration is often enough to produce a good cryoderivatization solution. Depending on the crystallization conditions, a complete or partial substitution of salts containing various anions by bromides or iodides can also be performed. As in the SptP:SicP complex structure determination,33 the sodium chloride used during crystallization was completely replaced by sodium bromide during derivatization. Analogously, lithium, sodium, and potassium can be replaced by cesium or rubidium. In cases with a saturated crystallization solution, a partial substitution of reagents is likely to work. The results indicate that the quick cryosoaking approach can be used with success under a number of adverse crystallization conditions. Different types of precipitants have been used so far, including polyethylene glycol (PEG) of several sizes, ammonium sulfate in various concentrations, and others. The use of other additives, such as detergent,26 azide,25 sucrose,27 and even ammonium sulfate in high concentration,30 does not seem to affect the derivatization drastically.
22 F. F. Vajdos, M. Ultsch, M. L. Schaffer, K. D. Deshayes, J. Liu, N. J. Skelton, and A. M. de Vos, Biochemistry 40, 11022 (2001). 23 D. M. Hoover, K. R. Rajashankar, R. Blumenthal, A. Puri, J. J. Oppenheim, O. Chertov, and J. Lubkowski, J. Biol. Chem. 275, 32911 (2000). 24 C. Chang, A. Mooser, A. Plückthun, and A. Wlodawer, J. Biol. Chem. 276, 27535 (2001). 25 J.-P. Declercq, C. Evrard, A. Clippe, D. V. Stricht, A. Bernard, and B. Knoops, J. Mol. Biol. 311, 751 (2001). 26 R. A. P. Nagem, D. Colau, L. Dumoutier, J.-C. Renauld, C. Ogata, and I. Polikarpov, Structure 10, 1051 (2002). 27 Y.-S. J. Ho, L. M. Burden, and J. H. Hurley, EMBO J. 19, 5288 (2000). 28 A. Wlodawer, M. Li, Z. Dauter, A. Gustchina, K. Uchida, H. Oyama, B. M. Dunn, and K. Oda, Nat. Struct. Biol. 8, 442 (2001). 29 A. M. Golubev, R. A. P. Nagem, J. R. Brandão Neto, K. N. Neustroev, E. V. Eneyskaya, A. A. Kulminskaya, A. N. Savel'ev, and I. Polikarpov, in preparation (2003). 30 Y. Devedjiev, Z. Dauter, S. R. Kuznetsov, T. L. Z. Jones, and Z. S. Derewenda, Structure 8, 1137 (2000). 31 L.-J. Baker, J. A. Dorocke, R. A. Harris, and D. E. Timm, Structure 9, 539 (2001). 32 C. E. Dann, J.-C. Hsieh, A. Rattner, D. Sharma, J. Nathans, and D. J. Leahy, Nature 412, 86 (2001). 33 C. E. Stebbins and J. E. Galán, Nature 414, 77 (2001). 34 A. Rojas, R. A. P. Nagem, K. N. Neustroev, A. M. Golubev, E. V. Eneyskaya, A. A. Kulminskaya, and I. Polikarpov, in preparation (2003).
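The arithmetic behind assembling a cryoderivatization drop of the kind described above is trivial and can be scripted. The sketch below is purely illustrative: the stock concentrations, drop size, and helper name are assumptions rather than values from the text, and it simply mixes mother liquor, a concentrated salt stock, and a cryoprotectant to the desired final concentrations.

```python
def drop_recipe(drop_ul=5.0, salt_M=1.0, salt_stock_M=4.0,
                cryo_pct=20.0, cryo_stock_pct=100.0):
    """Volumes (in microliters) of salt stock, cryoprotectant stock, and
    mother liquor needed for one cryoderivatization drop (illustrative)."""
    v_salt = drop_ul * salt_M / salt_stock_M       # C1 * V1 = C2 * V2
    v_cryo = drop_ul * cryo_pct / cryo_stock_pct
    v_mother = drop_ul - v_salt - v_cryo
    if v_mother < 0:
        raise ValueError("stocks are too dilute for the requested drop")
    return v_salt, v_cryo, v_mother

# a 5-ul drop at 1 M NaBr and 20% glycerol, made from 4 M and 100% stocks
print(drop_recipe())    # -> (1.25, 1.0, 2.75)
```

In practice the salt is often simply dissolved directly in the cryoprotected mother liquor, as the chapter describes; the point of the sketch is only that the dilution of the original precipitant can be kept small.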
TABLE I
Anomalous Scatterer Sites in Some Crystal Structures^a
^a Described in Table II. The 10 strongest sites are listed, with their corresponding peak heights normalized to the highest peak in the anomalous difference-Fourier map.
The choice between ethylene glycol and glycerol for cryogenic protection is in principle not very important for the derivatization process itself; it should rather be made for each solution so as to provide complete cryogenic protection. Another decision that must be made before preparation of the derivative is the choice of the correct salt. If MAD data collection can be performed, the use of bromide and rubidium salts such as NaBr, KBr, or RbCl is recommended. The pH of the mother liquor may suggest the most appropriate salt: in principle, at low pH the protein molecules are positively charged, which makes halides the better option, whereas at high pH the alkali metals may be recommended. These recommendations are based more on chemical intuition than experience, and more extensive
studies with different pH values are required for a final conclusion. On the other hand, if the MAD approach is not applicable, the use of iodide (LiI, NaI, or KI) or cesium (CsCl) salts, combined with a longer X-ray wavelength, is advised. It seems appropriate to mention that even though the content of the asymmetric unit can only be estimated during data collection, and such information cannot be used to verify the applicability of the quick cryosoaking approach beforehand, the results indicate that even larger macromolecules such as β-galactosidase34 or the SptP:SicP complex33 can be solved by this technique. Similarly, macromolecular crystals with a solvent content as low as 35% have already been solved.26,24
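These selection rules can be collected into a small helper for quick reference. The function below merely encodes the recommendations given above (bromide/rubidium salts when MAD is accessible, iodide/cesium salts with a longer wavelength otherwise, halides at low pH, alkali metals at high pH); its name and the numeric pH cutoffs are illustrative assumptions, since the text gives no thresholds.

```python
def suggest_soak_salt(mad_available: bool, ph: float) -> str:
    """Illustrative encoding of the chapter's salt-selection guidance."""
    halides = "NaBr or KBr" if mad_available else "LiI, NaI, or KI"
    alkalis = "RbCl" if mad_available else "CsCl"
    if ph < 6.0:    # low pH: protein positively charged, halides favored
        return halides
    if ph > 8.0:    # high pH: alkali metal cations favored
        return alkalis
    return f"{halides} (or {alkalis})"

print(suggest_soak_salt(mad_available=True, ph=4.5))   # -> NaBr or KBr
```

As the authors stress, the pH rule is chemical intuition rather than established practice, so in marginal cases both a halide and an alkali metal derivative are worth preparing.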
Bromides and iodides do not require long soaking times. Experience shows that soaking for longer than 30 s is not necessary and may lead to deterioration of the diffraction power. Surprisingly, it has been observed that short halide soaks can improve the crystal diffraction35 and sometimes even cause a phase transition of the crystal to a different symmetry.36 Metal ions seem to require somewhat longer soaking for successful derivatization.
35 M. Harel, R. Kasher, A. Nicolas, J. M. Guss, M. Balass, M. Fridkin, A. B. Smit, K. Brejc, T. K. Sixma, E. Katchalski-Katzir, J. L. Sussman, and S. Fuchs, Neuron 32, 265 (2001). 36 Z. Dauter, M. Li, and A. Wlodawer, Acta Crystallogr. D Biol. Crystallogr. 57, 239 (2001).
Fig. 1. Stereo plot of the representative sites of cryosoaked ions. Surrounding residues are shown if they have at least one atom closer than 4.5 Å from the ion. Coordination distances to polar atoms closer than 3.5 Å are marked. (A–D) Bromide sites in acyl protein thioesterase I; (E–H) cesium sites in trypsin inhibitor.
Examples
α-Galactosidase from Trichoderma reesei

The first crystals of α-galactosidase from Trichoderma reesei were obtained in 1993.37 Since then, much effort has been spent on solving the phase problem for this protein. In spite of using a number of heavy metals for derivatization (nearly 20 chemicals including Pt, Ag, Au, U, W, and
37 A. M. Golubev and K. N. Neustroev, J. Mol. Biol. 231, 933 (1993).
TABLE II
Examples of Crystal Structures Solved by Cryosoaking

Protein | Cryoderivatization conditions | Soak time (s) | Phasing method | No. of sites | Resolution (Å)^a | Ref.
Insulin-like growth factor-I from Homo sapiens | 25% (w/v) PEG 3350, 30% MPD, 0.2 M sodium cacodylate (pH 6.5), 2.8 mM deoxy-BIGCHAP, 1.0 M NaBr | 30 | MAD | 1 Br plus 6 Cys S | 2.00 (1.80) | 22
β-Defensin-2 from Homo sapiens | 36% PEG 4000, 0.32 M lithium sulfate, 0.16 M MOPS (pH 7.1), 10% glycerol, 0.25 M KBr (0.25 M KI) | 60 | MIRAS | 9 Br (9 I) | 2.00 (1.40) | 23
C-terminal domain of TonB from Escherichia coli | 28–30% PEG 3350, 0.1 M Tris (pH 7.5), 50–100 mM calcium chloride, 1.0 M KBr | 50 | MAD | 4 Br | 2.50 (1.55) | 24
Peroxiredoxin 5 from Homo sapiens | 1.6 M ammonium sulfate, 0.1 M sodium citrate (pH 5.3), 0.2 M potassium sodium tartrate, 1 mM 1,4-dithio-DL-threitol, 0.02% (w/v) azide, 20% (v/v) glycerol, 1.0 M NaBr | 30 | MAD | 5 Br | 1.90 (1.50) | 25
Trypsin inhibitor from Copaifera langsdorffi | 20–25% PEG 8000, 0.1 M sodium acetate (pH 4.5), 20% ethylene glycol, 1.0 M CsCl | 300 | SIRAS | 5 Cs | 2.00 (2.00) | –
Interleukin-22 from Homo sapiens | 0.9 M sodium tartrate, 0.1 M HEPES (pH 7.5), Triton X-100 detergent, 15% ethylene glycol, 0.125 M NaI | 180 | SIRAS | 10 I | 1.92 (1.92) | 26
GAF domain YKG9 from Saccharomyces cerevisiae | 2.5 M ammonium sulfate, 0.05 M lithium sulfate, 30% sucrose, 0.1 M Tris-HCl (pH 8.0), 10% (v/v) glycerol, 0.5 M NaBr | 45 | MAD | 7 Br | 2.80 (1.90) | 27
Carboxyl proteinase from Pseudomonas sp. 101 | 1.0 M ammonium sulfate, 0.005 M guanidine, 0.1 M sodium citrate (pH 3.3), 18% glycerol, 1.0 M NaBr | 30 | SAD | 9 Br | 1.80 (1.40) | 28
α-Galactosidase from Trichoderma reesei | 15% PEG 3350, 100 mM potassium phosphate, 10% glycerol, 0.26 M CsCl | 480 | SIRAS | 10 Cs | 1.60 (1.60) | 29
Acyl protein thioesterase I from Homo sapiens | 42% saturation ammonium sulfate, 0.1 M sodium acetate (pH 5.0), 20% (v/v) glycerol, 1.0 M NaBr | 20 | SAD | 22 Br | 1.80 (1.50) | 30
Thiamine pyrophosphokinase from Saccharomyces cerevisiae | 25% PEG-MME 2000, 0.1 M ammonium sulfate, 0.1 M sodium acetate (pH 5.1), 50 mM sodium chloride, 1.0 M NaBr | 45 | MAD | 12 Br | 2.00 (1.80) | 31
Cysteine-rich domain of Sfrp-3 from Mus musculus | 0.1 M HEPES (pH 6.6), 33% PEG 3350, 0.5 M NaBr | 40 | MAD | 9 Br | 1.90 (1.90) | 32
SptP:SicP complex (2:4) from Salmonella typhimurium | 5–10% PEG 6000, 15% glycerol, 2.0 M NaBr | 30 | MAD | 31 Br | 2.50 (1.90) | 33
β-Galactosidase from Penicillium sp. | 15% PEG 8000, 50 mM sodium phosphate (pH 4.0), 30% ethylene glycol, 0.25 M NaI (CsCl) | 180 (300) | SIRAS | 13 I (12 Cs) | 1.95 (1.85); 2.04 (1.85) | 34

^a Data set resolution used for phasing. Resolution in parentheses refers to the maximum resolution of all data sets used.
rare earth elements), all "derivatives" suffered from the absence of heavy-atom binding. The method of quick cryosoaking was then used to overcome this difficulty. In the first trial, native crystals of α-galactosidase (P212121; a = 46.5, b = 79.1, c = 119.4 Å) were soaked in a cryoprotectant solution containing in addition 0.1–0.5 M KI and then used for X-ray diffraction data collection. Initially, these diffraction data, collected at the PCr beamline38 at the Brazilian National Synchrotron Light Source (LNLS), could not be used for phasing because the search for iodine binding sites failed (nonisomorphism was also observed; a = 41.9, b = 79.9, c = 120.0 Å). A second quick derivative, prepared with 0.2 M CsCl in the cryoprotectant solution, was then used for 1.6-Å resolution data collection at the same beamline. The incorporation of Cs atoms was successful, and the RSPS39 and SnB40 programs independently found a few equivalent Cs sites, using solely the anomalous signal of the Cs atoms. However, similarly to the I-soaked crystal, the Cs-soaked crystal (a = 42.1, b = 80.4, c = 120.0 Å) was nonisomorphous to the native. Both soaked crystals showed a difference of almost 10% in the a cell parameter compared with the native crystals. Therefore, data for the iodine pseudo-derivative and the Cs derivative of α-galactosidase were used as native and derivative, respectively, for initial SIRAS phasing. Ten Cs sites were used by SHARP41 in the SIRAS approach, followed by DM,42 and the resulting phases were good enough for automatic model building by wARP.43

β-Galactosidase from Penicillium Species

One of the highest molecular weight protein structures solved so far with a derivative prepared according to the quick cryosoaking procedure was β-galactosidase from a Penicillium sp., with 110 kDa (one molecule) in the asymmetric unit. Initial X-ray diffraction studies revealed that β-galactosidase crystallized in space group P43 with unit cell parameters a = b = 110.9 Å, c = 161.0 Å and diffracted to 1.85-Å resolution. An iodine derivative was prepared by immersion of a native crystal in mother liquor solution containing, in addition, 0.25 M NaI and 30% ethylene glycol. Crystals were visually stable in the derivatization solution and
38 I. Polikarpov, L. A. Perles, R. T. de Oliveira, G. Oliva, E. E. Castellano, R. C. Garratt, and A. Craievich, J. Synchrotron Radiat. 5, 72 (1997). 39 CCP4, Acta Crystallogr. D Biol. Crystallogr. 50, 760 (1994). 40 C. M. Weeks and R. Miller, J. Appl. Crystallogr. 32, 120 (1999). 41 E. de La Fortelle and G. Bricogne, Methods Enzymol. 276, 472 (1997). 42 K. Cowtan, Acta Crystallogr. D Biol. Crystallogr. 55, 1555 (1999). 43 A. Perrakis, R. Morris, and V. S. Lamzin, Nat. Struct. Biol. 6, 458 (1999).
did not suffer large changes in cell parameters or loss of diffraction power compared with native crystals. The SnB40 program used the normalized anomalous differences of this derivative to locate the halide substructure. Phase calculation performed by SHARP41 in the SIRAS approach with 13 iodine sites gave a mean figure of merit of 0.37 in the 27.0- to 2.60-Å resolution range. The final electron density map obtained after density modification with SOLOMON44 was used by wARP43 for automatic model building. The number of built residues increased in each cycle; however, convergence was slow and 3 days were required to obtain the final model (95% complete). Even though just one halide derivative was used for phasing and solving the crystal structure, we mention that a second quick cryosoaked derivative was obtained during model building. This time, CsCl was used instead of NaI in the derivatization solution. Twelve cesium sites were used for SIRAS phasing. Similar to the first phase calculation, this one gave a mean figure of merit of 0.37 in the same resolution range. When the native and both derivative data sets were combined in MIRAS phasing, the resulting figure of merit was 0.52 and the electron density map showed significant improvement compared with either of the SIRAS maps.

Acyl Protein Thioesterase

Crystals of human acyl protein thioesterase I were grown from a solution containing a high concentration of ammonium sulfate in a monoclinic cell with two molecules of 228 residues each in the asymmetric unit. They were soaked for 20 s in the mother liquor with added 1 M NaBr. The diffraction data were collected to 1.8-Å resolution at an energy 50 eV higher than the Br absorption edge. The data displayed a clear anomalous signal and the structure was solved by SAD, using only this data set. The initial seven Br sites were located by SnB.40 They were input to SHARP,41 which after two iterations identified 22 Br sites in the residual maps and produced a phase set with an overall figure of merit of 0.40. The strongest 18 Br sites formed two groups of almost identical constellations, clearly identifying the presence of a noncrystallographic 2-fold axis. The application of density modification with DM42 increased the figure of merit to 0.85 and produced an easily interpretable map. The majority of the residues were built automatically using wARP.43 At the end of the refinement, the anomalous difference Fourier map identified a total of 40 bromide sites located at the surface of the two independent protein molecules. They were included in the refinement of the final model with either full
44 J. P. Abrahams and A. G. W. Leslie, Acta Crystallogr. D Biol. Crystallogr. 52, 30 (1996).
or half occupancies, depending on the appearance of the corresponding peaks in the anomalous difference map.

Conclusions
The quick cryosoaking approach was established in 2000, and since then a number of macromolecular crystal structures have been solved with this method. The results obtained so far indicate that, owing to several intrinsic features, it is particularly applicable to high-throughput crystallographic projects. The approach can be used with a great number of different crystallization solutions, and the presence of various compounds, such as sugars, additives, or even precipitants, in high concentration does not impede the fast incorporation of halide or alkali metal ions into the solvent regions surrounding the protein molecules. Moreover, halides or alkali metals are less likely than some heavy-metal salts to react with certain compounds used during crystallization (e.g., they do not precipitate with phosphate anions). Another interesting point is that little preparative effort and only a short time are required to produce a potential derivative. All the equipment and chemical compounds used in this approach can be found in a simple protein crystallization laboratory. A single 2- to 5-µl derivatization solution drop is used in each trial, and the crystal soak time is usually less than 1 min. In addition, one can see immediately whether the crystal is stable in the solution and can modify the salt concentration and/or soaking time to obtain a better derivative. The choice between a halide and an alkali metal salt must be made before the derivatization procedure; this selection does not mean that a second salt cannot also be used for another derivative. Sometimes this decision can be made easily, depending on the types of compounds used in the crystallization solution. Iodides or bromides can replace chlorides. Similarly, lithium, sodium, and potassium can be substituted by cesium or rubidium. Because preparation of derivatives is fast, and data collection normally is performed with frozen crystals, both types of derivatives can be prepared and immediately frozen for data collection. Moreover, the use of halide and alkali metal salts during derivatization opens up the possibility of using two essentially different derivatives to solve a protein structure through MIR(AS) when neither of them alone is able to do so. Bromide and rubidium salts have an advantage over iodide and cesium salts in that the former can easily be used for MAD, because their K absorption edges are in a similar energy range to the Se edge, the scatterer used most often for MAD phasing. The latter derivatives, on the other hand, with K absorption edges in the vicinity of 0.35 Å, are not suitable
for MAD phasing, even though some X-ray diffraction experiments have been done at this energy.45 Nevertheless, iodine and cesium atoms possess significant anomalous signal at longer wavelengths, which makes them appropriate for SIRAS, MIRAS, and SAD phasing. Specifically, at the copper-anode characteristic wavelength (1.54 Å), f'' of the I and Cs atoms is 6.8 and 7.9 electrons, respectively. In general this new approach was proposed as an alternative way of preparing derivatives when a protein does not bind heavy-metal atoms or is not amenable to the preparation of a SeMet variant.

Acknowledgments

The authors thank A. Golubev for providing substantial information about galactosidase phasing. This work was supported in part by grant 98/06218-6 from FAPESP (Brazil). The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
45 K. Takeda, H. Miyatake, S.-Y. Park, M. Kawamoto, K. Miki, and N. Kamiya, poster presented at the 7th International Conference on Biology and Synchrotron Radiation, São Pedro, Brazil, July 30 to August 4 (2001).
[6] The Bijvoet-Difference Fourier Synthesis

By Jeffrey Roach

Introduction
There is no dispute over the profound influence anomalous dispersion-based techniques have had on macromolecular X-ray crystallography. Anomalous difference Patterson synthesis and both single- and multiple-wavelength anomalous dispersion phasing have become standard techniques of structural analysis. Typically the initial goal of such an analysis is the determination of the locations of the anomalously scattering atoms. Bijvoet-difference Fourier synthesis provides both a method of establishing absolute configuration and an approximation to the heavy-atom substructure of the system in question. In general, this approximation is sufficient to identify the locations of the most significant anomalous scatterers. Once these locations are known, the positions of the minor anomalous scatterers can be determined by the more delicate imaginary Fourier synthesis.
For X-ray wavelengths typical of crystallographic structure determination, the scattering factors for light atoms such as C, N, or O are calculated with the assumption that the energy of the incident X-ray is far greater than the binding energy of the electrons of the given atom. Therefore the scattering from each atom is proportional to that of a free electron and the scattering factor is a real-valued function of the scattering angle. This assumption fails, however, when calculating the scattering factors of heavier atoms such as Fe, Se, and even P and S. In this case the binding energy of the electrons is no longer negligible and scattering of incident X-rays involves resonance with the natural frequency of the bound electrons. To account for resonance, the native scattering factor of the atom is modified by a small complex-valued correction. That is, if an anomalously scattering atom has native scattering factor $f_h^0$, the scattering factor corrected to account for anomalous dispersion takes the form

$f_h = f_h^0 + f' + i f''$

where $f'$ and $f''$ represent the real and imaginary anomalous corrections, respectively. Although the anomalous corrections depend on wavelength, they are assumed not to depend on the reciprocal vector h. The complex-valued anomalous dispersion corrections cause the Fourier series, $\rho(x)$, defined by the structure factors to take on complex values and consequently to differ from the true electron density, which is real and nonnegative. To emphasize this difference, Hendrickson and Sheriff1 suggest the term general density function.

Friedel's Law
It is well known that Friedel's law, $\bar{F}_h = F_h$, holds only in the absence of anomalous dispersion, and consequently only for real-valued electron densities. Note that the overscore denotes the complex conjugate of the structure factor with the negative of the index, $\bar{F}_h \equiv \overline{F_{-h}}$. An analogous result holds for purely imaginary densities, namely, $\bar{F}_h = -F_h$. Together these observations establish formulas for the structure factors of the real and imaginary parts of the electron density in terms of the structure factors of the complex-valued density. Consider the structure factors $F_h$ of the electron density $\rho$ on the unit cell V:

$F_h = \int_V \rho(x)\, e^{2\pi i\, h \cdot x}\, d^3x$
1 W. A. Hendrickson and S. Sheriff, Acta Crystallogr. A 43, 121 (1987).
Letting the overscore on $\rho$ denote the complex conjugate, $\rho$ takes only real values if and only if $\rho(x) = \overline{\rho(x)}$ for all x. In terms of structure factors,

$F_h = \int_V \rho(x)\, e^{2\pi i\, h \cdot x}\, d^3x = \int_V \overline{\rho(x)}\, e^{2\pi i\, h \cdot x}\, d^3x = \overline{F_{-h}} = \bar{F}_h$   (1)

Similarly, $\rho$ takes on only imaginary values if and only if $\rho(x) = -\overline{\rho(x)}$ for all x. In terms of structure factors,

$F_h = \int_V \rho(x)\, e^{2\pi i\, h \cdot x}\, d^3x = -\int_V \overline{\rho(x)}\, e^{2\pi i\, h \cdot x}\, d^3x = -\overline{F_{-h}} = -\bar{F}_h$   (2)

The application of Eq. (2) is not immediately obvious; however, it is the basis for the method of determining absolute configuration of chiral structures described later in this section. A complex-valued electron density map $\rho$ is composed of two real-valued maps,

$\frac{F_h + \bar{F}_h}{2} + i\,\frac{F_h - \bar{F}_h}{2i}$   (3)

By Eq. (1), the first summand defines an entirely real map, and by Eq. (2), the second summand defines a purely imaginary map. Because their sum determines the map $\rho$, Eq. (3) must follow. Conversely, if $\rho$ is a real-valued map then its imaginary component must be zero; consequently the coefficients $(F_h - \bar{F}_h)/2i$ must be zero and $\bar{F}_h = F_h$. Analogously, if $\rho$ is an entirely imaginary-valued map then its real part must be zero; therefore the coefficients $(F_h + \bar{F}_h)/2$ must be zero and $\bar{F}_h = -F_h$.
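The decomposition in Eq. (3) is easy to verify numerically. The short sketch below is an illustration on a synthetic one-dimensional "general density" (an assumed toy example, not part of the original text): it builds a complex map, computes its Fourier coefficients with numpy, and checks that the two sets of coefficients in Eq. (3) transform back to the real and imaginary parts of the map.

```python
import numpy as np

# Synthetic 1-D "general density": a real component plus a small imaginary
# component standing in for the f'' contribution of an anomalous scatterer.
# (np.fft uses the e^{-2 pi i} sign convention; the identities below hold
# for either sign choice.)
N = 64
x = np.arange(N)
rho = np.exp(-0.5 * ((x - 20) / 3.0) ** 2).astype(complex)
rho[45] += 0.3j

F = np.fft.fft(rho)                        # F_h for h = 0 .. N-1
F_bar = np.conj(np.roll(F[::-1], 1))       # conj(F_-h), the "overscore" coefficients

# Eq. (3): coefficients of the real and imaginary component maps
re_coeffs = (F + F_bar) / 2
im_coeffs = (F - F_bar) / 2j

assert np.allclose(np.fft.ifft(re_coeffs), rho.real)
assert np.allclose(np.fft.ifft(im_coeffs), rho.imag)

# Friedel's law holds for the real component only: its coefficients satisfy
# C_h = conj(C_-h), a relation broken by the full complex density.
assert np.allclose(re_coeffs, np.conj(np.roll(re_coeffs[::-1], 1)))
assert not np.allclose(F, np.conj(np.roll(F[::-1], 1)))
```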
The formulas given in Eq. (3) represent the reciprocal-space expression of

$\Re\rho(x) = \frac{\rho(x) + \overline{\rho(x)}}{2}, \qquad \Im\rho(x) = \frac{\rho(x) - \overline{\rho(x)}}{2i}$

Note that the structure factors of $\overline{\rho}$ are given by

$\int_V \overline{\rho(x)}\, e^{2\pi i\, h \cdot x}\, d^3x = \overline{\int_V \rho(x)\, e^{2\pi i\,(-h) \cdot x}\, d^3x} = \overline{F_{-h}}$

Explicitly, the action of complex conjugation in map space corresponds to complex conjugation and inversion through the origin in reciprocal space. The dual observation, that complex conjugation in reciprocal space corresponds to complex conjugation and inversion in map space, provides an effective method of determining absolute configuration with anomalous dispersion data. The application is of particular interest for biological systems, which often exhibit chirality. Consider a structure given by electron density $\rho(x)$ and structure factors $F_h$. The enantiomer of this structure then has electron density $\tilde\rho(x) = \rho(-x)$ and structure factors $\tilde{F}_h$, where

$\tilde{F}_h = \int_V \rho(-x)\, e^{2\pi i\, h \cdot x}\, d^3x = \int_V \rho(x)\, e^{-2\pi i\, h \cdot x}\, d^3x = F_{-h}$

Now if $\rho$ is strictly real we have that $|F_h| = |\tilde{F}_h|$, and consequently the diffraction patterns of $\rho$ and $\tilde\rho$ will be identical. Therefore nonanomalous scattering is unable to distinguish between a structure $\rho$ and its enantiomer. In the presence of anomalous dispersion, however, the two can be distinguished. Consider the imaginary part of the electron density, $\Im\rho$, calculated with $\overline{F_h}$, that is, the correct magnitude, $|F_h|$, but the enantiomer's phase, $-\varphi_h$:

$\frac{1}{|V|}\sum_h \frac{\overline{F_h} - F_{-h}}{2i}\, e^{-2\pi i\, h \cdot x} = -\frac{1}{|V|}\sum_h \frac{F_h - \overline{F_{-h}}}{2i}\, e^{-2\pi i\, h \cdot (-x)} = -\,\Im\rho(-x)$   (4)

where $|V|$ denotes the volume of the unit cell. Therefore the imaginary part of the map corresponding to the phases of the enantiomer will consist of negative electron density concentrated at the enantiomer of the anomalous scattering centers. It is worthwhile to note that it is a consequence of Eq. (2) that Eq. (4) holds for any imaginary-valued electron density $\rho$ with structure factors $F_h$:

$\frac{1}{|V|}\sum_h \overline{F_h}\, e^{-2\pi i\, h \cdot x} = -\frac{1}{|V|}\sum_h F_h\, e^{-2\pi i\, h \cdot (-x)} = -\rho(-x)$
In particular, the method applies when the electron density is known only up to an approximation $\hat\rho$—provided the approximation is sufficient to distinguish the correct structure, $\Im\hat\rho(x)$, from $-\Im\hat\rho(-x)$.

Approximate Electron Density

The essential problem with constructing the imaginary-valued part of the electron density with Eq. (3) is that phases must be determined for both structure factors in a Friedel-related pair, that is, for both $F_h$ and $F_{-h}$. Typically this is not the case, for example, in phases determined by isomorphous replacement. The construction presented by Kraut2 provides an approximation that may be more suitable when only one phase is known for each Friedel pair. Assume that a phase $\varphi'_h$ has been determined for one structure factor of each Friedel pair. The other structure factor in the Friedel pair is given the phase $\varphi'_{-h} = -\varphi'_h$. This defines an electron-density map, $\hat\rho$, whose structure factors, $\hat{F}_h$, satisfy a weak form of Friedel's law in which $|\hat{F}_h|$ is not, in general, equal to $|\hat{F}_{-h}|$, yet $\varphi'_{-h} = -\varphi'_h$. According to Eqs. (1) and (2) of the previous section, the map $\hat\rho$ will take both real and imaginary values. The imaginary part of $\hat\rho$, $\Im\hat\rho$, is the so-called Bijvoet-difference Fourier as defined by Kraut.2 Explicitly, the Bijvoet-difference Fourier is the map given by the structure factors

$\frac{|F_h| - |F_{-h}|}{2i}\, e^{i\varphi'_h}$

Note that the factor of i in the denominator introduces a phase shift of 90°. Although the Bijvoet-difference Fourier was intended to be applied to systems where only one phase is known for each Friedel pair, it is possible to use this approximation when both phases $\varphi_h$ and $\varphi_{-h}$ are known for some (or all) reflections. In this case Kraut2 suggests the phases be defined as follows:

$\varphi'_h = \tfrac{1}{2}\,[\varphi_h - \varphi_{-h}]$   (5)

for each reflection h. Note that this set of phases will satisfy $\varphi'_{-h} = -\varphi'_h$. Chacko and Srinivasan3 and Hendrickson and Sheriff1 have shown that when the phases for both Friedel mates are fairly accurately known, the imaginary map generated from Eq. (3) will lead to better results.
2 J. Kraut, J. Mol. Biol. 35, 511 (1968). 3 K. K. Chacko and R. Srinivasan, Z. Kristallogr. 131, 88 (1970).
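Continuing the toy one-dimensional illustration used above (again an assumed synthetic example, not from the text), the Kraut coefficients can be compared directly with the exact imaginary map of Eq. (3). Here the phase of the real ("protein") component is used as a stand-in for an experimentally determined protein phase.

```python
import numpy as np

N = 128
x = np.arange(N)
protein = sum(np.exp(-0.5 * ((x - c) / 2.0) ** 2) for c in (18, 40, 75, 101))
rho = protein.astype(complex)
rho[60] += 0.2j                              # anomalous (imaginary) density

F = np.fft.fft(rho)
F_minus = np.roll(F[::-1], 1)                # F_-h

# Exact imaginary map, Eq. (3): requires the phases of both Friedel mates.
exact = np.fft.ifft((F - np.conj(F_minus)) / 2j).real

# Bijvoet-difference Fourier: Bijvoet amplitude differences plus a single
# phase per pair (phi'_-h = -phi'_h); in practice phi' would come from,
# e.g., isomorphous replacement phasing of the protein.
phi = np.angle((F + np.conj(F_minus)) / 2)
kraut = np.fft.ifft((np.abs(F) - np.abs(F_minus)) / 2j * np.exp(1j * phi)).real

print("anomalous site:", 60)
print("strongest exact peak: ", int(np.argmax(exact)))
print("strongest Kraut peak: ", int(np.argmax(kraut)))
print("map correlation: %.2f" % np.corrcoef(exact, kraut)[0, 1])
```

For a weak imaginary component the strongest feature of the Kraut map is expected near the anomalous site, at reduced weight and on a noisier background, which is the behavior analyzed in the remainder of this section.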
Although the map generated using phases given by Eq. (5) will correctly predict the location of the most significant anomalously scattering atoms, the density around all anomalously scattering atoms will be weaker, in some cases by as much as a factor of four. Consequently, weaker anomalously scattering atoms such as P and S often are not visible. Note that if the positions of the most significant anomalous scatterers are determined using the Bijvoet-difference Fourier, this information can be used to generate phases for both structure factors in a Friedel pair. Once suitable phases are determined, Eq. (3) can be used. Details on this procedure are available in Chacko and Srinivasan3 and in Hendrickson and Sheriff.1 To investigate the differences between the approximation $\hat\rho$ and the true density $\rho$, consider the following relationship between the structure factors:

$\hat{F}_h = F_h - F_h\,(1 - \cos\varepsilon_h) + i\,F_h \sin\varepsilon_h$

where $\varepsilon_h = \varphi'_h - \varphi_h$. Let $M_c$ and $M_s$ be the maps given by structure factors $(1 - \cos\varepsilon_h)$ and $\sin\varepsilon_h$, respectively. Thus, in terms of maps,

$\hat\rho = \rho - \rho * M_c + i\,\rho * M_s$

where $*$ denotes convolution. Separating $\hat\rho$ into real and imaginary parts gives

$\Re\hat\rho = \Re\rho - \Re\rho * [\Re M_c + \Im M_s] + \Im\rho * [\Im M_c - \Re M_s]$   (6)

and

$\Im\hat\rho = \Im\rho - \Im\rho * [\Re M_c + \Im M_s] - \Re\rho * [\Im M_c - \Re M_s]$   (7)

Defining the error maps $M_1$ and $M_2$ as

$M_1 = \Re M_c + \Im M_s, \qquad M_2 = \Im M_c - \Re M_s$

gives

$\Re[\hat\rho] = \Re\rho - \Re\rho * M_1 + \Im\rho * M_2$   (8)

and

$\Im[\hat\rho] = \Im\rho - \Im\rho * M_1 - \Re\rho * M_2$   (9)

Note that the real and imaginary parts of $\rho$ are intertwined in each part of $\hat\rho$. Because the real part of the true map is far stronger than the imaginary part of the true map, the influence of $\Re\rho * M_2$ in the imaginary part of the approximation is significantly more disruptive than the influence of $\Im\rho * M_2$ in the real part of the approximation. Thus the Bijvoet-difference
TABLE I
Phase Error and Map Correlation between Approximate Density and True Density
approximation, in fact, has two sources of error: the first due to the convolution of the true map with the error maps and the second due to the real part of $\rho$. The exact nature of the error maps themselves, being defined entirely with respect to the phase error, is essentially random. However, certain general properties will hold; for example, using Eq. (1) it is straightforward to verify that $M_1$ and $M_2$ take only real values. Furthermore, although in general neither $M_1$ nor $M_2$ will be centrosymmetric, it is the case that $M_1$ will be centrosymmetric when $\sin\varepsilon_{-h} = \sin\varepsilon_h$. Similarly, $M_2$ will be centrosymmetric when $\cos\varepsilon_{-h} = \cos\varepsilon_h$. Note that both of these conditions hold when Eq. (5) is used to generate phases for the Bijvoet-difference Fourier. This special case has been studied thoroughly in Hendrickson and Sheriff,1 where a result analogous to Eqs. (6) and (7) is developed.

An Example
To conclude this analysis with an example, consider the solved structure of the H42Q mutant of the high-potential iron protein.4 The structure factors of this model structure were calculated to 1.0 Å with anomalous scattering corrections given by Creagh and McAuley.5 Using the calculated phases as a basis, we construct a single hemisphere of phases as follows. For each Friedel pair, h and −h, let t be a uniformly distributed random number between 0 and 1. Consider the phase defined by
4 E. Parisini, F. Capozzi, P. Lubini, V. Lamzin, C. Luchinat, and G. M. Sheldrick, Acta Crystallogr. D Biol. Crystallogr. 55, 1773 (1999). 5 D. C. Creagh and W. J. McAulley, in "International Tables for Crystallography" (A. J. C. Wilson, ed.), Vol. C, p. 206. Kluwer Academic, Dordrecht, The Netherlands, 1992.
Fig. 1. The Fe-S cluster of the imaginary map (A) and the Bijvoet-difference Fourier (B). Both maps are contoured at the same level. Note the missing density around the sulfur atoms in the Bijvoet-difference Fourier.
Fig. 2. The Fe-S cluster of the Bijvoet-difference Fourier at low contour level. Note that even at this contour level the map lacks density around the sulfur atoms.
$\varphi'_h = (1 - t)\,\varphi_h - t\,\varphi_{-h}, \qquad \varphi'_{-h} = -\varphi'_h$

By introducing an element of randomness, this system is intended to represent a typical phase set at an early stage of structure determination. As indicated in the previous section, the approximation to the real part of the map is much better than the approximation to the imaginary part of the map. Phase errors and map correlation coefficients for both parts are found in Table I. Focusing on the Fe-S cluster, Fig. 1A and B
shows the true imaginary map and the Bijvoet-difference Fourier contoured at the same level. The Bijvoet-difference Fourier correctly identifies the iron atoms; however, even at lower contour levels (Fig. 2), the S atoms are indistinguishable from the background noise. Unfortunately, the low correlation coefficient and high phase errors of Table I give a rather pessimistic and somewhat misleading view of the Bijvoet-difference Fourier. The strength of the Bijvoet-difference Fourier is in its ability to determine the locations of the most significant anomalously scattering atoms, particularly when phases are available for only one structure factor in each Friedel pair.

Acknowledgments

This work was supported by the National Science Foundation under grant CCR-0086013.
[7] Isomorphous Difference Methods

By Mark A. Rould and Charles W. Carter, Jr.

Every crystallographer has the opportunity to apply isomorphous difference Fourier methods to better understand their favorite macromolecule. By these relatively simple methods, one can see extremely small changes in otherwise identical structures with a high degree of confidence. Surprisingly, over the decades these powerful methods have been largely forgotten by mainstream crystallographers. The purpose of this chapter is to remind us what these grossly underutilized methods are, how they are carried out, and when they are applicable—or can be made applicable. These methods are certain to be more widely practiced as macromolecular crystallography evolves from being a form of atomic-level wildlife photography to a bona fide science in which specimens are modified and characterized in order to test structural or biochemical hypotheses. In particular, these methods are eminently applicable to structure-based drug design, in which hundreds or thousands of isomorphous crystals bearing different small molecule ligands need to be rapidly but thoroughly analyzed.

Visualizing Differences between Similar Crystal Structures
Given crystals of two or more similar structures, there are several ways of determining the differences between them. The most common procedure is to build and refine a model for the structure in one of the crystals,
[7]
145
isomorphous difference methods
shows the true imaginary map and the Bijvoet-difference Fourier contoured at the same level. The Bijvoet-difference Fourier correctly identifies the iron atoms; however, even at lower contour levels (Fig. 2), the S atoms are indistinguishable from the background noise. Unfortunately, the low correlation coefficient and high phase errors of Table I give a rather pessimistic and somewhat misleading view of the Bijvoet-difference Fourier. The strength of the Bijvoet-difference Fourier is in its ability to determine the locations of the most significant anomalously scattering atoms, particularly when phases are available for only one structure factor in each Friedel pair. Acknowledgments This work was supported by the National Science Foundation under grant CCR-0086013.
[7] Isomorphous Difference Methods By Mark A. Rould and Charles W. Carter, Jr. Every crystallographer has the opportunity to apply isomorphous difference Fourier methods to better understand their favorite macromolecule. By these relatively simple methods, one can see extremely small changes in otherwise identical structures with a high degree of confidence. Surprisingly, over the decades these powerful methods have been largely forgotten by mainstream crystallographers. The purpose of this chapter is to remind us what these grossly underutilized methods are, how they are carried out, and when they are applicable—or can be made applicable. These methods are certain to be more widely practiced as macromolecular crystallography evolves from being a form of atomic-level wildlife photography to a bona fide science in which specimens are modified and characterized in order to test structural or biochemical hypotheses. In particular, these methods are eminently applicable to structure-based drug design, in which hundreds or thousands of isomorphous crystals bearing different small molecule ligands need to be rapidly but thoroughly analyzed. Visualizing Differences between Similar Crystal Structures
Given crystals of two or more similar structures, there are several ways of determining the differences between them. The most common procedure is to build and refine a model for the structure in one of the crystals,
and then use that model to rebuild and refine models for the second and subsequent structures. Differences are determined by superimposing and comparing the final refined models, or by analysis of difference distance matrices. In either case, the comparison involves the atomic coordinates of the models. Even the best models determined at moderate resolution have overall root mean square (RMS) errors on the order of 0.2 Å, with some atoms deviating from their ‘‘true’’ positions by many times this distance. This intrinsic error in the refined atomic coordinates thus places a lower limit on the atomic shifts that can be determined with confidence. Larger shifts between models, which appear to be significantly greater than the coordinate error, can still be artifactual, particularly in regions that are not well ordered or are present in multiple conformations. Nonetheless, to the vast majority of structural biologists, comparing molecular structures is synonymous with comparing molecular models. When crystals of related structures are isomorphous with each other, we have the preferable opportunity to eliminate the intermediate use of atomic coordinates by using the experimental data directly in the comparison. Isomorphous difference imaging methods were developed originally to visualize heavy atom positions, either of minor sites or in a new derivative, in the intermediate stages of phase determination using heavy atom derivatives. They are, however, completely general and, as we will show, considerably more widely applicable. Ideally, we would like to compare the true structures present in the crystals by subtracting their unbiased electron densities:

Δρxyz = ρPΔ,xyz − ρP,xyz

where the subscripts P and PΔ denote the two related isomorphous states. P is the parent structure, and might be the native protein, for example. PΔ is an isomorphous variant of the parent, such as a site-directed mutant or the protein with a substrate or inhibitor added. Referring to Fig. 1, FP and FPΔ are the complex structure factor vectors for these two ‘‘states’’ of the macromolecule, and fΔ is the complex structure factor vector that represents the difference between the two states. If P and PΔ are isomorphous states, so that the structure factors are sampling the continuous molecular transforms at the same points in reciprocal space, then

FPΔ = FP + fΔ

(Note that these are complex structure factor vectors, not just amplitudes.) The difference electron density that we are after is thus the Fourier transform of fΔ:

Δρxyz = FT{fΔ} = FT{FPΔ − FP} = FT{FPΔ} − FT{FP}
Fig. 1. The relationship between parent and isomorphous variant structure factor vectors: FPΔ = FP + fΔ; when |fΔ| ≪ |FP|, φP ≈ φPΔ.
Expressed in terms of amplitudes and phases, the difference electron density, Δρxyz, is given by

Δρxyz = FT{|FobsPΔ| exp(iφPΔ)} − FT{|FobsP| exp(iφP)}

If the two crystals are isomorphous, and the differences between them are small (i.e., |fΔ| ≪ |FP|), then φPΔ ≈ φP, and so

Δρxyz ≈ FT{(|FobsPΔ| − |FobsP|) exp(iφP)}

Of course, phases are not observable, and a choice must be made between available phase sets. Although MIR and MAD methods can give high-quality phases free of model bias, conventional wisdom has it that calculated phases are closer to the true phases, even if they may suffer from some degree of model bias. This gives the equation for the difference Fourier as it is commonly applied:

Δρxyz = FT{(|FobsPΔ| − |FobsP|) exp(iφP,calc)}     (1)
Thus, to a first approximation, the isomorphous difference map is calculated as the Fourier transform of the differences between the two sets of observed amplitudes with phases usually calculated from one of the models. In practice, it may be useful to examine the effects of different phase sets, including unbiased experimental phases. The formulation in Eq. (1) has several important advantages over comparing refined structural models. As these advantages seem to be underappreciated, it is worth reviewing them here. One of the greatest strengths of isomorphous difference Fourier maps is their insensitivity to model bias. One might ask why phases calculated by back-transforming one of the models do not bias the isomorphous
difference Fourier map in the same way as the comparison of the models themselves. This absolutely would be the case if we used

Δρxyz = FT{|FobsPΔ| exp(iφPΔ,calc)} − FT{|FobsP| exp(iφP,calc)}

Such a map is equivalent to the differences between the refined models. We cannot then know whether features in it are real or simply due to bias inherent in the two models. A key distinction is that, because the amplitudes are observed, the calculated phases are the sole source of model bias in these maps. Expanding Eq. (1) for the difference Fourier to represent it as the difference between two maps,

Δρxyz ≈ FT{(|FobsPΔ| − |FobsP|) exp(iφP,calc)} = FT{|FobsPΔ| exp(iφP,calc)} − FT{|FobsP| exp(iφP,calc)}

reveals that the model bias effectively cancels out; that is, the model bias in the first term is to a good approximation the same as in the second term, because both use the same phases. On closer examination, one may ask why these maps work at all! As discussed above, the exact difference electron density map is

Δρxyz = FT{fΔ} = FT{|fΔ| exp(iφΔ)}

That is, if we want a map of the differences, we should transform the amplitudes for the difference in the structures—which is not the same as the difference in the observed amplitudes between the two structures—along with the phases for the structural difference, φΔ. Note that in the practical approximation to the difference map,

Δρxyz ≈ FT{(|FobsPΔ| − |FobsP|) exp(iφP,calc)}

the phases are those of one of the models, not at all the phases for the structural difference. In fact, we expect there to be almost no correlation between the phases we are using for these maps, φP,calc, and the true phases that we should be using, φΔ. Even worse, we know that phases dominate the Fourier transform!1 So why does this method work? Let’s consider the three limiting cases, shown in Fig. 2. In the first case, when fΔ is in the same direction as FP, the difference in observed amplitudes |FPΔ| − |FP| is identical to the amplitude of the difference structure factor, |fΔ|. Because the vectors point in the same direction, the phase of the difference structure factor is the same as the phase of the parent, φΔ = φP.
K. Cowtan, www.yorvic.york.ac.uk/~cowtan/fourier/magic.html
Fig. 2. The three limiting cases relating the difference structure factor vector to the differences between the parent and variant amplitudes. Case I: fΔ in the same direction as FP; |FPΔ| − |FP| ≈ |fΔ| and φΔ ≈ φP. Case II: fΔ in the opposite direction to FP; |FPΔ| − |FP| ≈ −|fΔ| and φΔ ≈ φP + 180°. Case III: fΔ perpendicular to FP; |FPΔ| − |FP| ≈ 0 and φΔ ≈ φP + 90°.
In the second case, fΔ points in the opposite direction to FP. The phases thus differ by 180°: φΔ = φP + 180°. The difference in observed amplitudes |FPΔ| − |FP| is the negative of the amplitude of the difference structure factor, −|fΔ|, and so exactly compensates for the anticorrelated phases [|fΔ| exp(i(φP + 180°)) = −|fΔ| exp(iφP)]. Furthermore, note that for a given fΔ, cases I and II give the largest values of the differences in observed amplitudes. In the final limiting case, fΔ is perpendicular to FP. Because |fΔ| ≪ |FP|, the lengths of FPΔ and FP are almost the same, and thus the difference in observed amplitudes is almost zero. This is fortunate, because the phase of the difference structure factor differs from the phase of the protein by 90° in this case. Because the Fourier transform is a linear transform, these small terms will contribute weakly to the difference electron density map. The bottom line is that our practical equation for the difference electron density, Eq. (1), is most accurate for the terms that contribute most to the Fourier transform. When it matters the most, the approximations are most applicable. When the difference between the observed amplitudes is large, that is, when fΔ is either correlated or anticorrelated with FP, the
difference (FobsPΔ − FobsP) combined with phases from the protein, φP, gives an excellent approximation to the true difference structure factor, fΔ. When the phases deviate most from φP, that is, when fΔ is nearly perpendicular to FP, the difference between the observed amplitudes is nearly zero, and these terms contribute negligibly to the Fourier transform. The magnitude differences between the observed amplitudes themselves determine how much a given Fourier term will contribute to the image: amplitude differences most closely related to the true structural differences contribute most strongly to the Fourier sum.

Illustration of Strength of Difference Fourier Methods
A simple computational experiment illustrates the strength of differential crystallography to detect, for instance, a minuscule shift in position of a single atom. This simulation can be carried out with XPLOR, CNS, CCP4, or any similar program (a minimal stand-alone sketch is also given below).

1. Calculate structure factors (amplitudes and phases) by inverse Fourier transform of your favorite macromolecular model.
2. Shift one atom by 0.001 Å (one-thousandth of an angstrom).
3. Calculate new structure factor amplitudes as in step 1.
4. Subtract the amplitudes in step 3 from the amplitudes in step 1 and, using phases from step 1, generate the difference Fourier map to a resolution of 3 Å. (To better simulate a 3-Å data set, adjust all the atomic B-factors of the model so the average is about 40 Å².)

An example of the resulting difference electron density map is shown in Fig. 3, in which one atom in the 58-kDa poly(A) polymerase model2 was shifted by 0.001 Å, and the above-described procedure was applied. The resulting 3-Å resolution difference density map is shown, contoured at +30.0 σ and −30.0 σ. While the shift in the CD atomic position of the isoleucine is imperceptible, a 70 σ peak and hole flank the atom. Of course, these are simulated data with no measurement errors. By adding noise to simulate real amplitudes and phases, one can determine that detection of a shift of a few hundredths of an angstrom in the position of a single atom is still several σ above the noise level in the difference electron density map for reasonably attainable data quality. A practical example of this calculation was used to document the significance of atomic shifts of 0.2 Å in active site Zn-S distances in cytidine deaminase on approaching the transition state.3

2. G. Martin, W. Keller, and S. Doublie, EMBO J. 19, 4193 (2000).
3. S. Xiang, S. A. Short, R. Wolfenden, and C. W. Carter, Jr., Biochemistry 35, 1335 (1996).
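The same experiment can be mimicked without any crystallographic package. The sketch below is a toy, one-dimensional Python/NumPy illustration of steps 1–4, not the XPLOR/CNS/CCP4 protocol itself; the cell length, atom positions, B-factor, and grid size are invented for illustration. It computes structure factors by direct summation, shifts one "atom" by 0.001 Å, and synthesizes the difference density from the amplitude differences combined with the parent phases.

```python
import numpy as np

# Toy 1-D "crystal": invented cell length and fractional atom positions.
a = 50.0                                         # cell edge (Angstroms)
x = np.array([0.12, 0.31, 0.47, 0.66, 0.83])     # fractional coordinates
B = 40.0                                         # average B-factor (A^2), as in step 4

hmax = int(a / 3.0)                              # index limit for ~3-A resolution
h = np.arange(1, hmax + 1)

def structure_factors(xfrac):
    """F(h) = sum_j exp(2*pi*i*h*x_j), damped by exp(-B*s^2) with s = h/(2a)."""
    damp = np.exp(-B * (h / (2.0 * a)) ** 2)
    return damp * np.exp(2j * np.pi * np.outer(h, xfrac)).sum(axis=1)

F_P = structure_factors(x)                       # parent (step 1)
x_var = x.copy()
x_var[2] += 0.001 / a                            # shift one atom by 0.001 A (step 2)
F_PD = structure_factors(x_var)                  # variant (step 3)

# Step 4: difference coefficients (|F_PD| - |F_P|) with the parent phases [Eq. (1)].
coeff = (np.abs(F_PD) - np.abs(F_P)) * np.exp(1j * np.angle(F_P))

grid = np.linspace(0.0, 1.0, 500, endpoint=False)
drho = np.zeros_like(grid)
for hh, c in zip(h, coeff):                      # Fourier synthesis, Friedel mates included
    drho += 2.0 * (c.real * np.cos(2 * np.pi * hh * grid)
                   + c.imag * np.sin(2 * np.pi * hh * grid))

sigma = drho.std()
print("peak %.1f sigma at x = %.3f" % (drho.max() / sigma, grid[drho.argmax()]))
print("hole %.1f sigma at x = %.3f" % (drho.min() / sigma, grid[drho.argmin()]))
# The strongest peak and hole flank the shifted atom at x = 0.47, even though the
# 0.001-A shift is far smaller than the nominal 3-A resolution of the data.
```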
Fig. 3. Positive and negative peaks in an isomorphous difference Fourier map resulting from a 0.001-Å shift of a single atom (arrow) in a 58-kDa protein, using idealized data to 3.0-Å resolution. The maps are contoured at ±30.0 σ. The peak heights and hole depths are 70 σ.
Practical Methodology
Keys to Maintaining Isomorphism

Isomorphism literally means ‘‘same form’’: same cell parameters, same orientation and overall conformation of the macromolecules in the unit cell, same everything except for the single small difference we deliberately introduced. In practice, isomorphism is necessary to ensure that equivalent reflections in different crystals sample the continuous molecular transform at identical points in reciprocal space. It is therefore critical to the success of difference Fourier methods, and is maintained by treating the crystals identically at all stages of the experiment. A generally applicable approach is to prepare a stabilizing/cryoprotecting solution in which the crystals are stable for a long time, at least for the typical duration of a soak, hours to days. Various substrates, inhibitors, regulatory compounds, or their analogs can be dissolved in the stabilizing/cryoprotecting solution. Care should be taken to maintain the pH of the solution, and the concentration of its other components, as the test compounds are added. To maximize the retention of isomorphism, these soaks are preferred over quick dips in cryoprotectant before flash freezing. Temperature is particularly important and often overlooked, especially in the use of cryoprotection. Although different crystals can easily be kept at exactly the same temperature during data collection, the relevant parameter is the temperature at the instant they plunge into their frigidly rigid state: this temperature determines the cell parameters, orientation, and
conformation the crystals retain for the duration of the experiment. Just as it was important to keep capillary-mounted crystals at the same temperature from experiment to experiment, care must be taken to keep the preplunge temperature constant from crystal to crystal. In particular, although crystal annealing4 sometimes improves diffraction, it tends to be less reproducible and often introduces nonisomorphism. In practice, cell parameters determined on area detectors are generally too inaccurate to discount isomorphism on their basis alone. In most data reduction software, cell parameters are lumped in with the dozen or so interdependent variables that are adjusted in order to best match the observed and predicted locations of the reflections on the diffraction image. Because these variables are strongly coupled, errors in one parameter propagate into errors in the other parameters; for instance, a small error in the crystal-to-detector distance is compensated by an adjustment (error) in the cell dimensions. Differences in measured cell parameters therefore should never discourage one from testing for isomorphism between the diffraction amplitudes of the various crystals. Conversely, identical cell parameters do not necessarily imply isomorphism.

Calculating Difference Fourier Maps

A protocol for computation of the difference Fourier map is given in Fig. 4. Careful examination of the practical equation for the difference electron density map [Eq. (1)] shows that, aside from the assumptions of isomorphism and small differences between the two structures, there are three important computational considerations. These are pairing of the observed amplitudes; bringing the two data sets to the proper scale; and detection and deletion of deviant observations that disproportionately degrade map quality.

Pairing. Clearly, the approximation that |fΔ| ≈ (|FobsPΔ| − |FobsP|) will not hold if either |FobsPΔ| or |FobsP| is missing for a given H K L. The coefficients of isomorphous difference Fourier maps comprise only the amplitude differences for which FobsPΔ and FobsP have both been measured (or otherwise accurately estimated). Although obvious, this simple requirement is often overlooked in implementation, resulting in uninterpretable maps. Two situations can make the process of pairing reflections between data sets difficult: (1) asymmetric unit definitions often overlap but are not identical. Thus, different data reduction programs often give the final unique
J. M. Harp, D. E. Timm, and G. J. Bunick, Acta Crystallogr. D Biol. Crystallogr. 54, 622 (1998).
Fig. 4. Protocol for computation of an isomorphous difference Fourier map. The derivative data set (Fder) is scaled to the native data set (Fnat); the cross R-factor (Riso) is computed, and if it is unreasonably high an alternate-indexing transform may be applied to the H K L indices of the derivative data set; excessively large differences are rejected within resolution shells; the scaled differences are combined with native phases calculated by inverse Fourier transform of the refined native model (.PDB file) to calculate the difference Fourier map; and peaks and holes in the difference density are picked in the vicinity of the model, yielding peak and hole coordinates (.PDB file). The option to transform the reflection indices is available only when the symmetry of the reciprocal lattice exceeds the Laue symmetry of the space group.
set of reflections in different asymmetric units. In most cases, this is not a problem because subsequent crystallographic programs (such as CAD in CCP4) convert the input reflections to their own preferred asymmetric unit. For the few programs that do not, the easiest way to overcome the lack of overlap is to expand one of the data sets to a full sphere, and find the matches with the asymmetric unit of the other data set; (2) even when the same reduction program is used for all data sets, for space groups in which the symmetry of the Bravais lattice is greater than the Laue symmetry of the diffraction intensities, degenerate indexing can confuse proper pairing. In these cases, the H K L indices of one crystal may not correspond to the same H K L indices of another identical crystal indexed in an alternate but equally valid manner. This obviously gives rise to unreasonably large cross-R-factors. The simplest solution is to apply the reciprocal space symmetry operator that is missing from the Laue group to the indices of one of the data sets, taking care not to flip the hand. Degenerate indexing is discussed in more detail in Dauter.5 5
Z. Dauter, Acta Crystallogr. D Biol. Crystallogr. 55, 1703 (1999).
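As a concrete illustration of the pairing step, the short Python sketch below pairs two data sets held as dictionaries keyed by (h, k, l) and, if the cross R-factor is suspiciously high, tries an alternative indexing operator. Everything here is an assumption made for illustration: the dictionary layout, the single overall scale inside the cross R-factor, and the example operator (h, k, l) → (k, h, −l), which is one of the valid re-indexing choices for a tetragonal crystal in point group 4. The sketch also assumes that one of the data sets has already been expanded to a full sphere of reflections, as suggested above, so that a plain key lookup suffices.

```python
import numpy as np

def pair(f1, f2):
    """Keep only reflections measured in both data sets (dict: (h,k,l) -> |F|)."""
    common = sorted(set(f1) & set(f2))
    return (np.array([f1[k] for k in common]),
            np.array([f2[k] for k in common]))

def cross_r(f_nat, f_der):
    """Riso on amplitudes after a single overall linear scale."""
    f_der = f_der * (f_nat.sum() / f_der.sum())
    return np.abs(f_nat - f_der).sum() / (0.5 * (f_nat + f_der)).sum()

def reindex(f, op):
    """Apply an alternative-indexing operator to every reflection."""
    return {op(hkl): amp for hkl, amp in f.items()}

def tetragonal_alt(hkl):
    h, k, l = hkl
    return (k, h, -l)      # assumed indexing-ambiguity operator for point group 4

def best_pairing(f_native_full_sphere, f_derivative):
    """Return the indexing convention (as-is or re-indexed) giving the lower Riso."""
    r_asis = cross_r(*pair(f_native_full_sphere, f_derivative))
    r_alt = cross_r(*pair(f_native_full_sphere, reindex(f_derivative, tetragonal_alt)))
    if r_alt < r_asis:
        return "re-indexed (k, h, -l)", r_alt
    return "as indexed", r_asis
```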
Scaling. |FobsPΔ| and |FobsP| need to be put on the same scale before taking their difference. A linear scale factor and resolution-dependent scale (either isotropic B-factor or anisotropic B-factor matrix) that minimizes the overall least-squares differences is often used. Because the difference in total scattering mass is usually small (unless the differences are due to heavy atoms), this is a reasonable approximation. Alternatively, for diffraction data of sufficiently high resolution, Wilson scaling can be used to put both data sets on an absolute scale. After either form of scaling, there are often regions of reciprocal space where the number of paired reflections with |FobsPΔ| > |FobsP| is significantly different from the number with |FobsPΔ| < |FobsP|. One would expect that in any neighborhood in reciprocal space, |FobsPΔ| would be greater than |FobsP| about half the time, and vice versa. Because data collection on image plates or other area detectors does not typically incorporate an empirical correction for absorption and decay, systematic errors are often present. Cryocrystallographic methods have reduced the severity of this problem compared with crystals mounted in capillaries, but in most cases some residual error remains that is not averaged out by redundancy of observations. Local scaling6 of the data sets often solves this problem. In most cases, one data set is collected with a high degree of redundancy, typically the native or reference crystal, and subsequent ones are optimized to collect the unique set as rapidly as possible. Local scaling adjusts the amplitudes of the latter data set to be on the same scale as the former (i.e., the native amplitudes are not modified). In local scaling, each reflection of one data set is individually scaled using a sphere of reflections centered on that reflection and the corresponding sphere of reflections in the other data set (Fig. 5). The reflection being scaled (and optionally its closest neighbors) are omitted from the sphere before calculating the local scale factor, to reduce bias. This method is described in more detail elsewhere.7 After scaling by any method, one can obtain a quantitative estimate of how similar the new crystal is to previously collected crystals with a simple statistic. The Riso or ‘‘cross-R-factor’’ is the mean fractional deviation between observed amplitudes:

Riso = Σ |FP − FPΔ| / Σ [(FP + FPΔ)/2]

where the sum is over all pairs of reflections observed in both data sets. An analogous equation can be written for the intensities, yielding the ‘‘Riso on I’’ instead of the ‘‘Riso on F.’’ The larger the value of Riso, whether due to nonisomorphism or large local changes in the structure, the greater the difference between φP and φPΔ, and thus the less applicable is the approach described above for calculating the difference density. In practice, Riso values (on amplitudes) up to about 25% still give interpretable maps. Low values of Riso (< 10%) give remarkably clear difference density, even when subtle differences are present.

6. B. W. Matthews and E. W. Czerwinski, Acta Crystallogr. A 31, 480 (1975).
7. M. A. Rould, Methods Enzymol. 276, 461 (1997).

Fig. 5. Local scaling of individual reflections in reciprocal space. Each reflection FHKL,PΔ to be scaled is compared with a local neighborhood sphere of FPΔ reflections and the corresponding local neighborhood sphere of FP reflections; the reflection being scaled (and optionally its closest neighbors) is excluded from the neighborhood.

Identifying and Deleting Deviant Observations. On occasion a reflection is mismeasured, perhaps due to overlap with stray scattering, such as from ice crystals or contaminating Kβ radiation, or reflections may be partially occluded by part of the diffraction apparatus or beam stop holder. These aberrant reflections can give rise to large difference terms in the Fourier transform, which show up as ripples in the electron density map, obscuring the true signal. As discussed above, the true largest differences carry the largest signal, so it is crucial to determine which of the large differences are likely due to mismeasurement. A histogram of the absolute value of the amplitude differences has the approximate form of a skewed bell curve (Fig. 6). It is noteworthy that neither the mean nor most probable absolute difference is zero. Thus, when determining the standard deviation of the distribution of differences, it is important to calculate it as the square root of the mean square deviation from the mean absolute difference, rather than simply as the root–mean–square difference (relative to zero). Looked at from this perspective, the distribution fits a standard curve rather well. Nearly all the differences lie within 4σ of the mean, with just a few large differences straying past that. Differences exceeding that are suspect.
Fig. 6. Representative histogram of absolute differences in amplitude between a parent protein and an isomorphous variant. Absolute differences along the abscissa are expressed in terms of RMS deviation from the mean difference.
Empirically, discarding those differences greater than 4 or 5 standard deviations from the mean improves the quality of the difference map significantly. There is, of course, a lower limit to the absolute difference, 0, which usually occurs well within 2σ of the mean. Further, because we expect the true isomorphous difference signal to vary substantially with resolution, the histogram of differences and calculation of mean absolute difference and RMS deviations from that mean should be calculated in shells of resolution. Alternative approaches to identifying outliers are described elsewhere.8 Some classes of reflections have expected average amplitudes that differ from those of general reflections. For example, the allowed reflections on a two-fold screw axis are on average twice as strong as general reflections. This effect will propagate into the difference terms involving those reflections. Normalizing the differences between reflections by their expected average amplitudes, often denoted εHKL,9 corrects for this effect while checking for outliers. In addition, because centric and acentric reflections show different amplitude distributions, it is often worthwhile to treat them separately for the purpose of determining outliers. Scripts for generating difference Fourier maps, incorporating the three major components of pairing, scaling, and outlier rejection, are available in the standard distribution of most crystallographic software suites, including CNS and CCP4. For example, the script in CNS has the filename fourier_map.inp, and is found in the ‘‘input files’’ directory in the standard installation.

8. R. J. Read, Acta Crystallogr. D Biol. Crystallogr. 55, 1759 (1999).
9. J. M. Stewart and J. Hauptman, Acta Crystallogr. A 32, 1005 (1976).
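The following Python/NumPy sketch strings the scaling, Riso, and outlier-rejection components together for paired amplitude arrays. It is only an outline under stated assumptions: a simple log-linear fit for the overall scale and isotropic B (production programs refine these differently), equal-count resolution shells on s² = (sin θ/λ)², and a cutoff of 4 RMS deviations from the mean absolute difference within each shell, as recommended above.

```python
import numpy as np

def scale_linear_B(f_nat, f_der, s2):
    """Fit log(f_nat/f_der) = log(k) - B*s2 by least squares and apply k, B to f_der."""
    y = np.log(f_nat) - np.log(f_der)
    A = np.column_stack([np.ones_like(s2), -s2])
    (logk, B), *_ = np.linalg.lstsq(A, y, rcond=None)
    return f_der * np.exp(logk - B * s2)

def r_iso(f_nat, f_der):
    """Cross R-factor on amplitudes: Riso = sum|dF| / sum((Fnat + Fder)/2)."""
    return np.abs(f_nat - f_der).sum() / (0.5 * (f_nat + f_der)).sum()

def keep_mask(dF, s2, nshell=10, cutoff=4.0):
    """Flag |dF| values more than `cutoff` RMS deviations above the mean |dF|,
    evaluated separately in equal-count resolution shells."""
    keep = np.ones(dF.size, dtype=bool)
    for shell in np.array_split(np.argsort(s2), nshell):
        adf = np.abs(dF[shell])
        mean = adf.mean()
        rms = np.sqrt(((adf - mean) ** 2).mean())   # RMS deviation from the mean |dF|
        keep[shell] = adf <= mean + cutoff * rms
    return keep

# Typical use, given paired arrays f_nat, f_der and s2 for each reflection:
#   f_der = scale_linear_B(f_nat, f_der, s2)
#   print("Riso = %.3f" % r_iso(f_nat, f_der))
#   dF = f_der - f_nat                  # variant minus parent, as in Eq. (1)
#   good = keep_mask(dF, s2)
#   # dF[good], combined with parent phases, are the difference Fourier coefficients.
```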
Resolution Range

Although there is a tendency to use the entire range of data collected, one can often improve the signal-to-noise ratio of the map by judicious choice of resolution cutoffs. The lowest resolution terms in the difference Fourier synthesis are limited primarily by the quality of the phases. Low-resolution phases determined by MIR or MAD methods usually have a high figure of merit. Phases calculated by back-transforming even a well-refined model generally decrease in quality at resolutions lower than 6 or 8 Å in the absence of a bulk solvent correction. If cross-validated crystallographic R-factors in the lowest resolution shells are comparable to the values in the midrange, then the calculated phases at those resolutions should likewise be of comparable quality. At higher resolutions, nonisomorphism between two crystals gives rise to errors in the amplitude differences due to sampling of the molecular transform at different points in reciprocal space. A monotonic decrease in the RMS deviation from the mean difference as a function of increasing resolution generally correlates with an isomorphous pair of crystals. This is not unexpected, because the magnitude of the mean scattering due to the differences should decrease with resolution just as the mean amplitudes themselves do. When nonisomorphism is present, the RMS deviation from the mean absolute difference decreases, reaches a minimum, and then increases as a function of resolution. Because noise begins to overtake signal at resolutions higher than where the RMS deviation from the mean reaches a minimum, maps calculated to this resolution tend to provide clearer difference density than maps made to the resolution limit of the collected data.

Interpreting Difference Density Maps

With a little practice, difference electron density maps are straightforward to interpret. Quite simply, if the difference density map is derived from coefficients of the form (|FB| − |FA|), then positive density in the map corresponds to locations where the electron density in crystal B is greater than the electron density in crystal A, and vice versa for negative density. Intuitively, such difference density is easiest to understand as the density of crystal B summed with the negative of the density of crystal A. For example, consider the case in which scattering matter is present in crystal B and absent in crystal A. In this case, the additional density is reflected by positive density (peaks) in the difference map at the locations of the additional atoms. This might result from a bound small molecule or a large shift in a protein side chain. Conversely, an isolated hole (negative density) indicates that additional atom(s) are present in crystal A that are absent in crystal B.
Binding of ligands, mutation of residues, or changes in environment can induce structural shifts in and around the macromolecule. These give rise to features in the difference map that are often misinterpreted by novices. The interpretation of shifts depends on the distance of the shift relative to the diameter of the atom. In the simplest case, when the shift distance is larger than the diameter of the shifted atom, the difference density that results has a peak centered on the atom at its new location, and a hole centered on where the atom originally was. This is shown in Fig. 7. A more complicated situation results when an atom shifts by a distance less than its diameter. A peak–hole pair also results in the difference density, but in this case exaggerates the true shift; that is, the distance between the minimum of the hole and the maximum of the peak is greater than the distance the atom actually shifts. This is most easily understood graphically by inspection of Fig. 8. This enhanced shift effect must be taken into account if one builds a model for the new state based on the original model. Because the probability of a paired strong peak and hole arising due to noise and approximately centered on an atom of the model is very low, a peak–hole pair occurring in a difference map indicates a small but credible shift in an atom.10 A side chain flanked on one side by positive difference density and on the other by negative is as nearly certain an indication of a shift away from the hole and toward the peak as crystallography can provide. An isolated peak or hole directly on top of a model atom suggests that the atom has changed its occupancy at that position, or that the atom has been replaced with a more or less electron dense atom. Finally, on occasion a peak in the difference map centered on an atom is enshrouded or encircled by negative density (or vice versa). This has three potential causes. It could indicate that the atom or moiety in the peak has become more ordered than in the original state; that is, that its B-factor has decreased (Fig. 9). A related cause is that a well-ordered bound moiety has displaced less ordered moieties such as solvent molecules. In a third case, if the atom is a particularly strong scatterer, such as a heavy atom, and is present in only one of the two crystals, then alternating shells of positive and negative density can arise due to truncation of the Fourier transform (the absence of higher resolution terms). Such a ripple can be anisotropic, and obscures interpretation of difference density in the immediate vicinity of the heavy atom. An example of a real difference Fourier map is shown in Figs. 10 and 11. For the parent data set, cryodiffraction data were collected from a crystal of tetragonal lysozyme soaked in stabilizing solution. The variant was a lysozyme crystal treated similarly, except that the stabilizing solution 10
S. Xiang, S. A. Short, R. Wolfenden, and C. W. Carter, Jr., Biochemistry 36, 4768 (1997).
Fig. 7. Interpretation of difference density for the case of an atom that has shifted by a distance greater than the diameter of the atom.
contained 2.8 M acetone. Both crystals diffracted to better than 1.75 Å, giving an Riso on amplitudes of 7.4%. The isomorphous difference Fourier map was computed to a resolution of 1.75 Å, using calculated phases from the partially refined parent model. Even though the map in Fig. 10 covers the entire molecule, differences are localized to a small region encompassing the active site. A close-up of this region (Fig. 11) reveals that in the presence of acetone, the carboxamide side chain of Asn-59 rotates
90°, from a vertical orientation to a horizontal one. Concomitantly, Asp-52 shifts to the left and pivots slightly clockwise. These subtle structural changes could be detected at a significance level 3σ above the noise, within minutes of completion of data collection, without the need for further refinement.

Refinement of Isomorphous Structures

Crystallographic refinement of a structure that is isomorphous with a previously refined structure is expedited in two ways. Obviously, the first refined model can serve as a starting point for manual rebuilding of models for subsequent structures. Second, one can use a refinement protocol called
Fig. 8. Interpretation of difference density for the case of an atom that has shifted by a distance less than the diameter of the atom. Note that the peak (maximum) and hole (minimum) in the difference map are not centered on the new and original atom positions.
‘‘difference refinement,’’11 particularly when the primary interest is in determining the differences between the two refined models. The key assumptions underlying this method are that the residual model errors will be the same among these isomorphous structures, and that the major source of model error is the inadequacy of the model to completely represent the true structure within the crystal. Given these assumptions, difference refinement minimizes the squared functional residual

[(|Fcalc,PΔ| − |Fcalc,P|) − (|Fobs,PΔ| − |Fobs,P|)]² / (σ²PΔ + σ²P + E²diff)

the numerator of which can be rearranged to

[(|Fobs,PΔ| − |Fcalc,PΔ|) − (|Fobs,P| − |Fcalc,P|)]²
11. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 51, 609 (1995).
Fig. 9. Interpretation of difference density for an atom that has become more ordered (i.e., has a lower B-factor) in the variant than in the parent structure. The occupancy (area under the curve) remains the same.
Fig. 10. An example of the typical signal-to-noise ratio of isomorphous difference Fourier maps. Lysozyme is the parent crystal, and lysozyme soaked in 2.8 M acetone is the variant. Positive and negative density are contoured at 4.8 σ. The map covers the entire molecule. Difference density above and below the model is due to symmetry-related lysozyme molecules.
Fig. 11. A close-up of the active site in Fig. 10, with difference Fourier maps contoured at 4 σ.
The first term is the conventional crystallographic residual for the new structure currently being refined, whereas the second term is the final residual remaining from refinement of the parent structure. Difference refinement is thus related to conventional refinement but with the ‘‘unmodelable’’ model error subtracted. The denominator weights each term by the reciprocal of the sum of the measurement error estimates (σ²PΔ and σ²P) and an estimation of the expected residual error in the difference refinement, Ediff (usually estimated in shells of resolution). Implementation is rather straightforward, by replacing the amplitudes Fobs,PΔ used in conventional refinement with [|Fobs,PΔ| − (|Fobs,P| − |Fcalc,P|)], and σPΔ with (σ²PΔ + σ²P + E²diff)^(1/2). Conventional variance-weighted refinement can then be carried out on the new model with any crystallographic refinement program. When some nonisomorphism is present, such that the residual model errors are not identical between structures, the appropriate refinement scheme is a hybrid of independent and difference refinement methods. A Bayesian approach to this refinement has been developed.12 As with the difference refinement protocol described above, the Bayesian difference refinement protocol is straightforward to implement and incurs almost no additional computational overhead.
12. T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 52, 1004 (1996).
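In practice the substitution amounts to a few array operations before refinement. The sketch below (Python/NumPy; the variable names and the scalar placeholder for Ediff are assumptions made for illustration, and Ediff would normally be estimated in shells of resolution) builds the effective amplitudes and sigmas that take the place of Fobs and σ in a conventional variance-weighted refinement.

```python
import numpy as np

def difference_refinement_data(f_obs_pd, sig_pd, f_obs_p, sig_p, f_calc_p, e_diff):
    """Effective data for difference refinement: subtract the parent's residual
    error (|Fobs,P| - |Fcalc,P|) from the variant amplitudes, and combine both
    measurement errors with the expected residual difference error E_diff."""
    f_eff = f_obs_pd - (f_obs_p - f_calc_p)
    sig_eff = np.sqrt(sig_pd ** 2 + sig_p ** 2 + e_diff ** 2)
    return f_eff, sig_eff

# The returned (f_eff, sig_eff) replace (Fobs, sigma) for the new structure in any
# conventional crystallographic refinement program.
```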
Conclusion
The sensitivity of isomorphous differences to subtle structural changes and their relative insensitivity to the phase set chosen for imaging make difference maps based on Eq. (1) the method of choice for many of the structural studies carried out subsequent to initial structure determinations. As they provide direct, visual access to what is inherent in the data, they can be of the greatest value when substitution is weak, or when subtle structural changes are to be demonstrated. Difference density maps are superior to the more commonly used residual density maps {|Fobs| − |Fcalc|, φcalc} for detecting and validating the presence of bound ligands, owing both to the relative insensitivity of the difference map to model bias, and to the close adherence to the observed data. Indeed, these maps directly resolve many current controversies regarding the purported binding of ligands,13,14 as well as clarify the changes in local side-chain and main-chain conformations upon binding ligands or arising from mutation.
13. M. A. Hanson and R. C. Stevens, Nat. Struct. Biol. 7, 687 (2000).
14. B. Rupp and B. Segelke, Nat. Struct. Biol. 8, 663 (2001).
[8] X-Ray Crystallographic Structure Determination of Large Asymmetric Macromolecular Assemblies

By Jan Pieter Abrahams and Nenad Ban

Introduction
X-ray crystallography is currently the only technique available that permits high-resolution structure determination of extremely large macromolecular complexes. Since the first crystal structure of a protein, myoglobin, was solved in 1960, an increasing number of large structures and macromolecular assemblies has been determined by crystallographic methods. Among these are symmetric complexes such as viruses (e.g., Harrison et al.1), the bacterial chaperonin GroEL,2 and the proteasome.3 Also, large asymmetric assemblies could be determined by X-ray crystallography, such 1
1. S. C. Harrison and A. Jack, J. Mol. Biol. 97, 173 (1975).
2. K. Braig, Z. Otwinowski, R. Hegde, D. C. Boisvert, A. Joachimiak, A. L. Horwich, and P. B. Sigler, Nature 371, 578 (1994).
3. J. Lowe, D. Stock, B. Jap, P. Zwickl, W. Baumeister, and R. Huber, Science 268, 533 (1995).
as membrane proteins,4 DNA–protein complexes such as the nucleosome,5 large enzymes such as F1-ATPase,6 yeast and bacterial RNA polymerases,7,8 and ribosomes and ribosomal subunits9–11 (Table I). Estimation of diffraction phases for crystals of symmetrical assemblies has been discussed in several excellent reviews (e.g., Rossmann12), and is treated only in passing in this chapter. Here we discuss the additional challenges posed by large asymmetric assemblies. These additional challenges have two causes:

1. Data sets need to be complete because there are no a priori constraints to estimate missing diffraction intensities.
2. Initial phase information needs to be of a higher quality because there are no noncrystallographic symmetry constraints on the phases.

The last decade has seen several improvements in technology and methods for data collection and crystal structure determination, which made it possible to approach particularly complex crystallographic problems. Some of these improvements include the use of cryocrystallography, the increased size and sensitivity of area detectors, the availability of high-intensity synchrotrons and beamlines with a tunable wavelength, an increase in computer speed, and improved software for data integration and phasing of the measured amplitudes. Here we discuss how these improvements have permitted some of the most difficult structure determinations yet.

Challenges in Crystal Production
All the usual problems that come with growing crystals apply to large asymmetric complexes. Here we discuss only some of the more recent advances in crystal treatment that can dramatically improve their diffraction qualities. Crystal dehydration by the addition of higher concentrations of precipitating agents or cryoprotecting agents has been a useful way to improve 4
4. J. Deisenhofer, O. Epp, K. Miki, R. Huber, and H. Michel, J. Mol. Biol. 180, 385 (1984).
5. K. Luger, A. W. Mader, R. K. Richmond, D. F. Sargent, and T. J. Richmond, Nature 389, 251 (1997).
6. J. P. Abrahams, A. G. Leslie, R. Lutter, and J. E. Walker, Nature 370, 621 (1994).
7. P. Cramer, D. A. Bushnell, and R. D. Kornberg, Science 292, 1863 (2001).
8. G. Zhang, E. A. Campbell, L. Minakhin, C. Richter, K. Severinov, and S. A. Darst, Cell 98, 811 (1999).
9. N. Ban, P. Nissen, J. Hansen, P. B. Moore, and T. A. Steitz, Science 289, 905 (2000).
10. B. T. Wimberly, D. E. Brodersen, W. M. Clemons, Jr., R. J. Morgan-Warren, A. P. Carter, C. Vonrhein, T. Hartsch, and V. Ramakrishnan, Nature 407, 327 (2000).
11. F. Schluenzen, A. Tocilj, R. Zarivach, J. Harms, M. Gluehmann, D. Janell, A. Bashan, H. Bartels, I. Agmon, F. Franceschi, and A. Yonath, Cell 102, 615 (2000).
12. M. G. Rossmann, Curr. Opin. Struct. Biol. 5, 650 (1995).
TABLE I
Large Structures Solved by X-Ray Crystallography at Resolutions Better Than 3.0 Å
(Columns: space group; unit cell (Å); heavy atom derivatives, e.g., Ta6Br12 cluster, two Ir compounds, Hg3N3C18O12H24, methylthioxorhenium, Os hexamine, Os pentamine, Os bipyridine, Lu chloride, and Ir hexamine; resolution (Å); and reference.)
diffraction properties. This procedure, which perturbs crystal packing and often can result in dramatically different unit cell dimensions or even induce space group changes, was described for nucleosome core particle crystals13,14 and for various proteins.15 In the case of yeast RNA polymerase II, dehydration of crystals through addition of PEG 400 (PEG 6000 was the initial precipitating agent) resulted in an 11-Å shrinkage of the crystallographic a axis to 131 Å and improvement of their diffraction properties.7 These procedures should generally be attempted with large assemblies because they are more likely to suffer from poor diffraction and variable unit cell dimensions. The procedure can be performed through serial transfers in small steps (1–2% concentration difference per step) to higher concentrations of the precipitating agent (usually different types of PEG, sucrose, or 2-methyl-2,4-pentanediol are tried) with at least a 15-min incubation time between steps or by dialysis. Crystal annealing by slow transfer to 4° before flash-freezing improved diffraction of large ribosomal subunit crystals. Similar procedures could be attempted by reducing the temperature even further to below 0°. Flash-annealing, in which the frozen crystals mounted in the goniometer head are thawed and rapidly refrozen, using a cold nitrogen stream at 100 K, has also been successful.16 On occasion, dramatic changes of the unit cell or even of the space group are observed as a consequence of these procedures (in the case of the large ribosomal subunit the crystallographic 575-Å-long c axis was shortened by 30 Å on dehydration). Such changes produce a different sampling of the reciprocal space or a changing of the orientation of the molecule. These changes can be used for the improvement of phases through intercrystal averaging, provided there is no associated conformational change of the molecule. In the case of the large ribosomal subunit, such procedures were useful for phase improvement only to about 4.5-Å resolution, because of poor diffraction of the related crystal forms and probable nonisomorphism between subunits in different packing arrangements.

Challenges in Data Collection
Compared with crystals of average-sized proteins, crystals with large unit cells diffract more weakly, yet produce a larger number of more closely spaced unique reflections. Overlap of these Bragg reflections 13
13. T. J. Richmond, J. T. Finch, B. Rushton, D. Rhodes, and A. Klug, Nature 311, 532 (1984).
14. M. M. Struck, A. Klug, and T. J. Richmond, J. Mol. Biol. 224, 253 (1992).
15. B. Schick and F. Jurnak, Acta Crystallogr. D Biol. Crystallogr. 50, 563 (1994).
16. J. I. Yeh and W. G. J. Hol, Acta Crystallogr. D Biol. Crystallogr. 54, 479 (1998).
presents additional problems, and the larger the spot size, the more severe the overlap. With an optimally tuned X-ray beam, the spot size is limited by the point-spread function of the detector and by the inherent disorder of the crystal. Disorder of the crystal affecting the shape and the size of diffracted reflections can be caused, for example, by (anisotropic) mosaicity or ‘‘compression-wave’’ crystal packing disorder.17,18 The number of diffraction orders that can be resolved is usually lower than the one predicted on the basis of the size of the X-ray beam focused on the detector. Overcoming these difficulties requires the best that modern crystallographic instrumentation can provide, and this usually means synchrotron radiation and electronic, area-sensitive X-ray detectors. It is worth the experimenter’s time to learn a little synchrotron and detector lore in order to critique the experimental setup that might be presented by the beamline staff. The biggest problem is usually that the area-sensitive detectors do not have adequate point-to-point resolution. With these detectors, if one were to illuminate the X-ray-sensitive surface with a point source of X-rays (something like 20 μm is quite practical) the image that is detected may be as much as 400 μm wide at 5% of the peak of the spot. This represents a real practical limit to how closely diffraction spots can be spaced on the detector face. Given that limitation, it still pays to make individual reflections as small and bright as practical. Modern high-brightness (high intensity and highly parallel) synchrotron sources (see Hart19) give the best performance from any specimen crystal. With these, the beam size simply needs to be adjusted to be some fraction of this 400-μm limit to give the optimum peak intensity. Of course, to make the background on the detector as low as possible, the beam size should not be any bigger than the actual size of the specimen itself. Even with older synchrotron sources one can tweak conditions to achieve good-quality data. The focused X-ray beams converge to the specimen position. One can make sure that the focus point for the beam is at or near the surface of the detector to obtain the smallest spot. Of course, one doesn’t get something for nothing: because the beam is converging, the aperture at the crystals actually limits the width of the beam, so the only portion of the beam that is used is the central ray. Large area detectors are crucial for efficient data collection. The separation of spots will improve with an increase in the crystal-to-detector distance. Furthermore, increasing the distance between the detector and the crystal substantially improves the signal-to-noise ratio of the diffracted
17. J. R. Helliwell, J. Crystal Growth 90, 259 (1988).
18. M. E. Wall, J. B. Clarage, and G. N. Phillips, Structure 5, 1599 (1997).
19. M. Hart, Methods Enzymol. 368, 239 (2003).
intensities, because the background radiation per pixel is inversely proportional to the square of the distance. However, increasing the crystal-to-detector distance requires the detector to be sufficiently large to permit data collection at high diffraction angles. If the detector is not large enough one must resort to heroic measures: moving the detector far back and offsetting the 2θ axis to capture high-angle data (an unpleasant option because it is difficult to obtain complete data), or finding a different facility with more nearly adequate capabilities. Unlike in virus crystallography, incomplete or poorly sampled data severely compromise phasing calculations for large asymmetric unit cells. The size of the detector becomes even more important when collecting anomalous data. This is why the Mar345, the largest commonly used area detector available, was often the detector of choice for high-resolution data collections on large macromolecular assemblies in spite of its disadvantages regarding the slow readout time. During the data collection on the large macromolecular assemblies mentioned in this chapter the largest available CCD detector was the SBC2 3K × 3K CCD with a 210 × 210 mm active area at the Advanced Photon Source (APS; Argonne National Laboratory, Chicago, IL), and the Brandeis 2K × 2K B4 CCD with a 200 × 200 mm active surface area at the National Synchrotron Light Source (NSLS; Brookhaven National Laboratory, Upton, NY). Already at the time of this writing, new large surface area detectors have become available for data collection on crystals with large unit cell dimensions. When the crystal has only one large unit cell dimension (not uncommon in crystals of large assemblies), spot overlap is reduced by aligning the long axis of the unit cell with the spindle axis (Fig. 1). Intimate knowledge of the crystal habit is helpful for this alignment. However, if the longest axis does not coincide with the axis of highest symmetry, a larger wedge of data needs to be collected to obtain a complete data set. Crystals with large unit cells can have huge resolution-dependent differences in diffracted intensity, often too large to be recorded with a single data collection pass when one is using modern CCD-based detectors, which have a relatively narrow dynamic range. We have noticed that accurately measured low-resolution data are often critical for the successful application of the density-modification procedures. Therefore, data collection of the final high-resolution data set should be carried out in two passes, optimized in turn for the measurement of high-angle and low-angle reflections. Crystal decay owing to radiation damage can be prevented to a large extent by establishing a consistent protocol for crystal stabilization and flash freezing at liquid nitrogen temperatures (see Garman and Doublie20).
E. Garman and S. Doublie, Methods Enzymol. 368, 188 (2003).
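A back-of-the-envelope estimate shows why detector size and distance become limiting. At small diffraction angles, neighbouring orders along a cell axis of length a are separated on the detector by roughly Dλ/a, where D is the crystal-to-detector distance and λ the wavelength. The numbers below (λ = 1.0 Å, a 575-Å axis, and an effective 0.4-mm spot width set by the detector point spread) are assumptions chosen only to illustrate the scale of the problem.

```python
# Minimum crystal-to-detector distance for resolved spots (illustrative numbers).
wavelength_A = 1.0        # X-ray wavelength (Angstroms)
long_axis_A = 575.0       # longest unit-cell axis (Angstroms)
spot_width_mm = 0.4       # effective spot size from the detector point-spread function

# Adjacent orders along the long axis are ~ D * lambda / a apart on the detector,
# so they separate only when D exceeds spot_width * a / lambda.
min_distance_mm = spot_width_mm * long_axis_A / wavelength_A
print("need a crystal-to-detector distance of at least ~%.0f mm" % min_distance_mm)
# ~230 mm before allowing for mosaicity -- hence the need for physically large
# detectors placed far from the crystal, and for aligning the long axis sensibly.
```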
Fig. 1. Diffraction pattern from C2221 crystals of the large ribosomal subunit derivatized with Ir hexamine, aligned with the crystal long axis along the spindle (data were collected at Brookhaven National Laboratory beamline X25). The image shows diffraction to the edge of the MAR345 image plate at about 2.9-Å resolution, and close positioning of the spots on the detector. Inset: Relationship between the morphology of the thin, platelike crystals and the reciprocal space axes. The short edge of the thin plate crystal always corresponds to the long reciprocal space axis, and spot overlap could be prevented only when the X-ray beam was exposing the crystal plate sideways throughout the data collection. For this purpose cryoloops bent by 90° were constructed and used.
Initially pioneered by Haas and Rossmann,21 this method was first successfully applied in ribosome crystallography by Hope et al.22 Even frozen crystals do not have an unlimited lifetime in the X-ray beam, however. Using strong X-ray sources and long exposure times to collect data to the highest possible resolution will reduce the amount of unique data per crystal. Because this means that data from several crystals must be merged to obtain a complete data set, this is advisable only after potential problems with nonisomorphism, crystal quality, and twinning have been resolved. However, frozen crystals often are sufficiently stable to permit collection of complete data sets at medium resolution, even when the unit cell is large. In general, this is a good strategy for sorting out such issues. Less intense beam lines can be used for this purpose, and prescreened crystals can be reused for high-resolution data collection at third-generation synchrotrons. Because anomalous data played a critical role in most of the structure determinations discussed here, every derivative data set should be collected, whenever possible, at the energy of the absorption peak of the anomalous scatterer. Often it is advantageous to collect anomalous data with a perfectly aligned crystal so that Friedel pairs are collected simultaneously. This is possible only at beam lines equipped with the kappa-geometry goniostat and software that allows for quick calculation of crystal offsetting angles. For quick evaluation of the crystal orientation, resolution limits, and isomorphism between crystals, the program package HKL has proved valuable. Using a single exposure, the crystal could be aligned and fully recorded reflections could be scaled against previously collected data sets to provide rapid insight into the crystal characteristics. For this purpose both the R-factor and the χ² statistical parameters were considered as a function of resolution for all fully observed reflections on a single image. Binding of an anomalous scatterer can be confirmed quickly by comparing the fluorescence-excitation spectrum of the crystal with that of an equivalent volume of the original solution used for the soaking experiment (frozen in a cryoloop). If there is significant binding, the crystal shows much stronger fluorescence. Once the data are collected, χ² values for the merged Friedel pairs can be used quickly to evaluate the anomalous signal as a function of resolution. χ² values between 2 and 10 indicate there is a good chance that a useful signal has been measured—see the HKL Manual, and Table II.
21. D. J. Haas and M. G. Rossmann, Acta Crystallogr. B 26, 998 (1970).
22. H. Hope, F. Frolow, K. von Bohlen, I. Makowski, C. Kratky, Y. Halfon, H. Danz, P. Webster, K. S. Bartels, H. G. Wittmann, and A. Yonath, Acta Crystallogr. B 45, 190 (1989).
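A rough version of this check is easy to script. The sketch below (Python/NumPy) computes, in resolution shells, a chi-square of the merged Friedel mates, (I+ − I−)² / (σ+² + σ−²), averaged over pairs; the exact statistic reported by the HKL suite is defined somewhat differently, so this is only an illustration of the idea. Values near 1 are what pure measurement noise would give, whereas shell averages in the range of roughly 2 to 10 point to a usable anomalous signal.

```python
import numpy as np

def friedel_chi2_by_shell(i_plus, sig_plus, i_minus, sig_minus, d_spacing, nshell=8):
    """Mean chi-square of Friedel differences in equal-count resolution shells."""
    chi2 = (i_plus - i_minus) ** 2 / (sig_plus ** 2 + sig_minus ** 2)
    order = np.argsort(d_spacing)[::-1]                # low-resolution shells first
    shells = np.array_split(order, nshell)
    return [(d_spacing[s].max(), d_spacing[s].min(), chi2[s].mean()) for s in shells]

# Each tuple is (low-resolution edge, high-resolution edge, mean chi-square).
# A fall back toward 1 with increasing resolution marks where the anomalous
# signal disappears into the measurement noise.
```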
TABLE II
Statistics for Iridium Hexamine Data Collection at Anomalous Edge

Lower limit (Å)   Upper limit (Å)   I/σ    Linear R-factor   χ²      Anomalous R-factor
30.0              5.46              24.1   0.050             6.519   0.051
5.46              4.34              18.8   0.072             2.217   0.043
4.34              3.79              10.3   0.120             1.452   0.061
3.79              3.45               5.2   0.194             1.165   0.101
3.45              3.20               3.2   0.309             1.513   0.249
All reflections                     11.4   0.091             2.634   0.067
Phasing Challenges
Phasing crystals by the heavy atom isomorphous-replacement method becomes increasingly more difficult for crystals with larger unit cells. Difficulties arise because compounds traditionally used in macromolecular crystallography contain one or only a few heavy atoms. Even with current detector technology such compounds need to react at a large number of sites to produce measurable differences. However, this is likely to yield uninterpretable difference Patterson maps. Advances in direct phasing of heavy or anomalous sites (using methods initially developed for solving the structures of small molecules) are already reducing the severity of this problem (see [3] in this volume23 and, e.g., De Graaff et al.24). Difference Fourier analysis is much more powerful in identifying anomalous scatterers or heavy atoms. However, such analysis requires some phase information, and often low-resolution phases are adequate. To determine such low-resolution phases, it is possible to use electron microscopy three-dimensional reconstructions of the particle. This technique was pioneered in the structure determination of crystals of the tomato bushy stunt virus, for which three-dimensional maps derived from electron micrographs of negatively stained virus particles were employed.25 In that case the crystallographic symmetry unambiguously determined the orientation and the position of the virus particle in the unit cell, and molecular replacement procedures were not necessary. In the case of an asymmetric macromolecular assembly such as the large ribosomal subunit or the entire 23
23. C. M. Weeks, P. D. Adams, J. Berendzen, A. T. Brunger, E. J. Dodson, R. W. Grosse-Kunstleve, T. R. Schneider, G. M. Sheldrick, T. C. Terwilliger, M. G. W. Turkenburg, and I. Uson, Methods Enzymol. 374, [3], 2003 (this volume).
24. R. A. G. De Graaff, M. Hilge, J. L. van der Plas, and J. P. Abrahams, Acta Crystallogr. D Biol. Crystallogr. 57, 1857 (2001).
25. A. Jack and S. C. Harrison, J. Mol. Biol. 99, 15 (1975).
ribosome, a full six-dimensional molecular replacement search is necessary in order to determine the location of the particle inside the crystal. Podjarny and Urzhumtsev26 have described the theoretical considerations of the low-resolution molecular replacement problem in detail. In conventional molecular replacement procedures, low-resolution data (lower than 8 to 10 Å) are excluded from the calculations because diffraction amplitudes in this resolution range have a significant noncollinear solvent contribution, which cannot be adequately reproduced with a molecular model. The success of the method at low resolution (lower than 20 Å) lies in the observation that although the solvent contribution is large, it is essentially collinear with the contribution of the model and therefore amounts to a scale factor. Molecular replacement can be performed using either a flat envelope (like a protein mask) or a density-based envelope. Experience with the large ribosomal subunit and several viruses shows that a flat envelope contains enough information for obtaining a correct molecular replacement solution, and such a map is often more convenient for the calculations. Here we describe briefly the procedure followed for determining the low-resolution phases of the 50S ribosomal subunit of Haloarcula marismortui.27 The original electron microscopy maps were stored as an array of densities, ρ(x, y, z), on an arbitrary scale, sampled along all three directions at equal intervals, which should be no coarser than about one-third of the calculated maximum resolution of the map. The map was reduced to a three-dimensional step function by assigning unit weight to all points of the array for which ρ(x, y, z) was greater than some minimum value, and setting all other points to zero. For the molecular replacement calculations, the volume inside the mask was represented as a collection of atoms. The best signal-to-noise ratio was obtained when we used a uniform B-factor assignment of 200 Å² for all atoms, and when the number of atoms used within the envelope approached the actual number of atoms in the molecule. This was achieved by appropriately changing the sampling of the electron density, thereby controlling the number of points where the fake atoms were introduced. This parameter was important even when the molecular replacement was performed at resolutions where individual atoms of the model remained unresolved. Although several molecular replacement algorithms and programs were tried, the only correct solution with a significant signal-to-noise ratio was obtained using the Patterson correlation coefficient (PCC) search
26 A. D. Podjarny and A. G. Urzhumtsev, Methods Enzymol. 276, 641 (1997).
27 N. Ban, B. Freeborn, P. Nissen, P. Penczek, R. A. Grassucci, R. Sweet, J. Frank, P. B. Moore, and T. A. Steitz, Cell 93, 1105 (1998).
Fig. 2. Internal cross-check of the translation function. Shown are three Harker sections of the translation function for the large ribosomal subunit, using a cryoelectron microscopic reconstruction as the model. The translation function was calculated using data at a resolution between 30 and 80 Å. Contours begin at the 2σ value and are spaced 1σ apart. The consistent, and also the largest, peaks in the sections are marked with connecting lines.
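The conversion of an EM envelope into dummy atoms described above can be sketched roughly as follows; the threshold, the grid handling, and the function name are illustrative assumptions rather than the procedure actually used for the 50S subunit.

```python
import numpy as np

def envelope_to_pseudo_atoms(density, threshold, target_n_atoms, b_factor=200.0):
    """Reduce an EM density grid to a binary step function and represent the
    volume inside the mask as dummy atoms, decimating the grid sampling until
    the number of dummy atoms approaches the expected number of real atoms."""
    mask = density > threshold                     # unit weight inside, zero outside
    step = 1
    while mask[::step, ::step, ::step].sum() > target_n_atoms and step < min(mask.shape):
        step += 1                                  # coarsen the sampling to control the atom count
    grid_indices = np.argwhere(mask[::step, ::step, ::step]) * step
    b_factors = np.full(len(grid_indices), b_factor)   # uniform B-factor, e.g. 200 A^2
    return grid_indices, b_factors   # grid positions (still to be mapped to fractional coordinates) and B-factors
```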
implemented in X-PLOR.28,29 To avoid identifying a false maximum of the correlation coefficient during the translational search, it is useful to perform individual searches for the relative translation between symmetry-related molecules in the crystal along the Harker sections; these sections should produce consistent peaks for the correct solution (Fig. 2). The PCC search procedure yielded solutions that were comparable to the solution obtained by full six-dimensional searches using either the R factor or the correlation coefficient as the target function. Low-angle reflections (30.0 Å < d < 200.0 Å) were important for these calculations and special care was taken to collect those. Once a correct molecular replacement solution was obtained, continuous electron density from electron micrographs was expanded into the unit cell, using the Uppsala Software Factory (Uppsala, Sweden) program package RAVE. This program package can also be used to refine the transformation matrix relating the electron microscopy (EM) reconstruction in an arbitrary unit cell to the particle in the experimental unit cell. This procedure is analogous to the one preceding intercrystal averaging. Amplitudes
28 R. Huber, in "Proceedings of the Daresbury Study Weekend, Daresbury, February," p. 58. Science and Engineering Research Council, The Librarian, Daresbury Laboratory, Daresbury, UK, 1985.
29 A. T. Brunger, Methods Enzymol. 276, 558 (1997).
calculated from a unit cell with fitted continuous electron density, rather than a flat envelope, usually show somewhat better agreement with the measured amplitudes. For the 50S ribosomal subunit, the average phase difference remained below 40° to about 20-Å resolution when the expanded EM density was used. This is approximately the nominal resolution of the expanded map as judged by the point at which the Fourier shell correlation reaches 50%.27 If instead a step function is used, the phase difference increases to above 40° at about 25-Å resolution, and becomes essentially random at resolutions higher than 19 Å (Fig. 6). The phases of the expanded EM density also produced heavy atom peaks in difference Fourier maps that had a higher signal-to-noise ratio. The averaging of Fourier maps based on isomorphous and anomalous differences can achieve additional improvement of the signal-to-noise ratio for the heavy atom peak. Crystallographic MIR or MAD phasing with heavy atom clusters, or more conventional heavy atom derivatives, produced better quality maps at resolutions higher than the nominal resolution of the electron microscopy reconstructions (20 Å in this case). However, the EM molecular replacement phasing performs significantly better than MIR or MAD phasing at low resolution (>30 Å).

Use of Heavy Atom Clusters for Phasing
Large clusters of roughly 10 heavy atoms have long been used to phase macromolecular crystal structures. This was the approach used by Corey and Pauling when they were solving hen egg lysozyme. In 1962 they showed that Ta6Cl14 and Nb6Cl14 clusters bind in a specific way to the crystallized protein, but owing to nonisomorphism these compounds were not useful for phasing. These clusters have a beguiling property that arises from the fact that the power of scattering from a center of electron density is proportional to the square of the number of electrons. Therefore, at a low resolution where the entire cluster appears as a single scattering center, the scattering goes as the square of the sum of the electrons on each atom, providing a strong signal. Unfortunately, at a resolution where the individual atoms are distinguished, the scattering power reverts to the sum of squares or, even worse, drops below that value because of interference. Using heavy atom clusters for phasing of large complexes was suggested by Schneider and Lindqvist30 and Knablein et al.31
30 G. Schneider and Y. Lindqvist, Acta Crystallogr. D Biol. Crystallogr. 50, 186 (1994).
31 J. Knablein, T. Neuefeind, F. Schneider, A. Bergner, A. Messerschmidt, J. Lowe, B. Steipe, and R. Huber, J. Mol. Biol. 270, 1 (1997).
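The low-resolution advantage described above can be put in numbers with a back-of-the-envelope comparison; the atom count and element below are chosen only to mimic an 18-tungsten cluster and are not taken from any specific derivative listed in Table III.

```python
# At low resolution the cluster scatters as a single centre, so the intensity
# scales as (sum of Z)^2; once the atoms are resolved it drops toward sum(Z^2).
z_tungsten, n_atoms = 74, 18
coherent = (n_atoms * z_tungsten) ** 2      # ~1.8e6 electrons^2, cluster acting as one scatterer
independent = n_atoms * z_tungsten ** 2     # ~9.9e4 electrons^2, atoms resolved individually
print(coherent / independent)               # an 18-fold gain in intensity at low resolution
```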
TABLE III
Heavy Atom Clusters Used in Structure Determinations

Heavy atom cluster                  Abbreviation   No. of electrons   System used                                     Source
[Pt(en)I]₂²⁺                        PIP            314                Nucleosome                                      Strem Chemicals (Newburyport, MA)
Tetrakis(acetoxymercuri)methane     TAMM           446                Nucleosome                                      Strem Chemicals (Newburyport, MA)
Ta₆Br₁₂²⁺                           TABR           856                RNA polymerase, small subunit, large subunit    J. Lowe, G. Schneider, N. Brnicevic
Cs₅(PW₁₁O₃₉[Rh₂(CH₃COO)₂])          W11            1250               Large subunit                                   M. Pope, Georgetown University, Washington, D.C.
H₄SiO₄[12WO]                        W12Si          1000               Small subunit                                   M. Pope
Li₁₀[P₂W₁₇O₆₁]                      W17            1800               Small subunit                                   M. Pope
(AsW₉O₃₃)₂(PhSn)₄                   W18            2000               Large subunit                                   M. Pope
Heavy atom clusters have been used for almost all of the large-assembly structure determinations mentioned here, and a list of full molecular formulas for these compounds can be found in Table III. On occasion, these compounds will bind at only a few locations, allowing difference Patterson maps to be readily interpretable (Fig. 3; also see Ban et al.27 and Clemons et al.32). If it is possible to interpret one of the Patterson maps, the correctness of the molecular replacement solution and the interpretation of the derivative difference Patterson maps can be verified simultaneously by calculating difference Fourier maps based on EM-derived phases (Fig. 4). Once locations for the heavy atom clusters are determined, crystallographic phases can be computed by several approaches. The group scattering factors can be calculated using the formulas described by O'Halloran et al.33 Alternatively, the entire cluster can be treated as a rigid body, refining its position and orientation by maximum likelihood minimization of the lack of closure. This option, particularly useful when the clusters are ordered31 or have a unique binding mode due to asymmetric surface
32 W. M. Clemons, Jr., D. E. Brodersen, J. P. McCutcheon, J. L. May, A. P. Carter, R. J. Morgan-Warren, B. T. Wimberly, and V. Ramakrishnan, J. Mol. Biol. 310, 827 (2001).
33 T. V. O'Halloran, S. J. Lippard, T. J. Richmond, and A. Klug, J. Mol. Biol. 194, 705 (1987).
Fig. 3. Harker sections of the combined isomorphous–anomalous difference Patterson maps of the W18 cluster-derivatized crystals of the large ribosomal subunit. The map is contoured at 2σ with 1σ increments. The sections on the left are experimental; the sections on the right are Patterson maps predicted using the refined cluster coordinates, with one major and six minor sites. The major peak consistent on all three Harker sections is 6σ over the background. The map was calculated using data at a resolution between 35 and 14 Å.
Fig. 4. (A) Histogram of the largest positive and negative isomorphous peaks identified in the 14-Å difference Fourier map of the large subunit crystals, using phases derived from the molecular replacement solution of the electron microscopy reconstruction of the large ribosomal subunit represented as an envelope only. (B) Major peak (positive peak 1 in the histogram on the left) as it appears in a section of the difference Fourier electron density map. The map was calculated using data between 35 and 14 Å in resolution and is contoured at 2σ with 1σ increments.
features, is implemented in CNS.34 It is also possible to treat the cluster as a spherically averaged shell. Scattering from such a shell is strong at low resolution but drops dramatically at resolutions similar to the diameter of the cluster, and shows a subsidiary maximum at a resolution equal to approximately half the diameter of the shell35 (Fig. 5). Phase calculation using a spherically averaged cluster is available as an option in SHARP.36 In either case, the phasing power of heavy atom clusters drops dramatically at high resolution because of interference between the atoms constituting the cluster, and frequently because of disorder. Therefore, this approach has not been used for phasing of large assemblies at high resolution. Nevertheless, if the phase calculation is performed carefully, and there are no significant positive or negative ghost peaks in the synthesis
34 A. T. Brunger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J. S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, and G. L. Warren, Acta Crystallogr. D Biol. Crystallogr. 54, 905 (1998).
35 J. Fu, A. L. Gnatt, D. A. Bushnell, G. J. Jensen, N. E. Thompson, R. R. Burgess, P. R. David, and R. D. Kornberg, Cell 98, 799 (1999).
36 E. de La Fortelle and G. Bricogne, Methods Enzymol. 276, 472 (1997).
Fig. 5. (A) Radial distribution (as a function of resolution) of diffracted intensity (F²) for a heavy atom cluster and for randomly distributed heavy atoms. All occupancies were set to 1.0 and temperature factors to 30 Å². (—) W18 heavy atom cluster derivative used for MIR phasing of the large ribosomal subunit9 (only W and Sn atoms were used in this calculation). (· · ·) 22 Os atoms randomly distributed in the unit cell. The heavy atom cluster at low resolution scatters approximately 10 times more strongly than the distributed atoms (at a resolution of 25 Å, cluster scattering approaches 200,000 on this graph) and at about 7.5 Å shows a subsidiary minimum. (–––) Scattering from four Os atoms is shown for comparison. The horizontal axis is linear with respect to sin(θ)/λ. (B) Metal atoms of the heavy atom cluster used in this calculation. This "sandwich" compound contains 18 tungsten and 4 tin atoms. For the complete molecular formula see Table III.
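The qualitative behavior plotted in Fig. 5 can be mimicked with the standard form factor of a spherically averaged shell of scatterers; the radius, atom count, and resolutions below are illustrative guesses, not parameters of the W18 compound.

```python
import numpy as np

def shell_form_factor(s, n_atoms, radius):
    """Spherically averaged scattering amplitude of n_atoms smeared over a shell
    of the given radius: F(s) ~ n_atoms * sin(2*pi*s*R)/(2*pi*s*R). The amplitude
    is largest as s -> 0 and falls to its first node near d ~ 2R (the diameter),
    with only weak subsidiary extrema beyond."""
    x = 2.0 * np.pi * s * radius
    return n_atoms * np.sinc(x / np.pi)      # np.sinc(t) = sin(pi*t)/(pi*t)

resolutions = np.array([100.0, 25.0, 14.0, 7.5, 3.2])     # d-spacings in angstroms
intensity = shell_form_factor(1.0 / resolutions, n_atoms=18, radius=5.0) ** 2
print(intensity)   # arbitrary scale: strong at low resolution, collapsing at high resolution
```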
maps, the medium-resolution phase information obtained opens the door to high-resolution phasing. Using this strategy, it is possible to identify the positions of a large number of bound single heavy atom derivatives by difference Fourier maps, which would otherwise be impossible to locate in difference Patterson maps.

High-Resolution Phasing and Heavy Atom Compounds
F1-ATPase was one of the first examples of the successful application of anomalous scattering in combination with MIR methods, using a relatively small number of bound heavy atoms compared with the size of the macromolecular assembly.6 Many subsequent structure determinations of other large complexes followed this approach. In all of these examples, accurate measurement of diffracted intensities using high-quality crystals and extremely intense synchrotron radiation was critical. For example, if the
expected anomalous signal is 3%, then in the diffraction experiment the signal should be greater than 2σ(I), that is, 0.03I > 2I^(1/2) (assuming that σ = I^(1/2)), which implies that every useful unique measurement must be at least 4500 counts above background. In practice this value will be lower, because anomalous pairs often will be measured more than once in the higher symmetry space groups. Because density modification by solvent flipping is capable of phase extension past the limit of experimental phasing, accurate measurements of native amplitudes are important even in the resolution range where no derivative data are available. Incorporating anomalous scatterers is not always straightforward. In the case of ribosomal subunits, because more than half the particle mass is nucleic acid, the total number of methionines would have been insufficient to provide a signal that would be measurable and useful for phasing. Different problems were faced during the preparation of yeast RNA polymerase where, as is the case with ribosomes, the complex was purified directly from the cells without any overexpression. Here, only partial selenomethionine incorporation could be achieved, because of the toxicity of this compound, and the anomalous differences were used mostly for locating the selenium atoms and for sequence assignment. Successful approaches for structure determination of macromolecular assemblies through the use of anomalous scattering have included the use of various heavy metals with strong white lines at their LIII edge. Lanthanides are well known for this property37 and contribute more than 20 electrons per bound heavy atom, but other elements such as tungsten, osmium, and iridium should also be considered. Osmium and iridium hexamine and pentamine compounds have been particularly useful for structure determinations of RNA molecules and RNA–protein complexes.38 Before phasing, one should estimate the anomalous signal expected from the incorporation of selenomethionine residues, using the Crick and Magdoff equations. For this calculation, consider also the fraction of selenomethionine incorporation, which can be measured by amino acid analysis. A useful Web page maintained by E. Merritt offers information on the absorption edges and theoretical f′ and f″ values (http://www.bmsc.washington.edu/scatter/). This site also allows quick estimation of the anomalous signal in experimental data. Anomalous phasing alone from several or perhaps even a single heavy atom derivative, followed by solvent flipping, can be sufficiently powerful
37 W. I. Weis, R. Kahn, R. Fourme, K. Drickamer, and W. A. Hendrickson, Science 254, 1608 (1991).
38 J. H. Cate, A. R. Gooding, E. Podell, K. Zhou, B. L. Golden, C. E. Kundrot, T. R. Cech, and J. A. Doudna, Science 273, 1678 (1996).
Fig. 6. (A) Electron microscopy phasing. Phase difference between the atomic model phases and the phases obtained by molecular replacement with a cryoelectron microscopy reconstruction of the large ribosomal subunit represented as an envelope (—) or as a continuous electron density (· · ·), and crystallographic MIR phasing using several heavy atom clusters (–––) (for details of heavy atom phasing statistics see Ban et al.27), as a function of resolution. (B) Phasing at different stages of the structure determination. The phase difference between the atomic model phases and the MIR cluster phasing shown as a dashed line in (A) is
to provide high-resolution phasing, as was shown for the small ribosomal subunit.32 The advantage of using anomalous phasing, with data collected at the peak of anomalous scattering, is that the signal is strong while there are none of the nonisomorphism problems associated with MIR methods. A difficult and often subjective aspect of structure determination is judging the quality of electron density maps. Statistical criteria used in heavy atom phasing such as R-Cullis, R-Kraut, or phasing power can be remarkably uninformative when dealing with large unit cells. Because most phasing attempts yield poor statistics, it is often impossible to know whether a particular heavy atom phasing contribution in the end helped to improve the maps. It is also particularly difficult to evaluate the usefulness of phase combination from different sources. Furthermore, because crystals of large macromolecular assemblies diffract so weakly, one is often dealing with data sets that are collected below the optimal exposure necessary for phasing to the diffraction limit of these crystals. In our experience, an objective measure of the quality of the phases can be obtained by calculating difference Fourier maps and comparing the peak heights corresponding to heavy atoms not used in the phasing. In these procedures, which proved extremely useful throughout the structure determination of the large ribosomal subunit, peak heights of the heavy atoms based on anomalous difference Fouriers (to avoid nonisomorphism effects) are calculated and compared for all possible phasing procedures and combinations of various heavy atom data sets. An increased signal-to-noise ratio, judged by the peak height in σ units (number of standard deviations above the mean of the map), corresponds to better phased maps. This quick calculation can also be performed for different resolution ranges, providing an estimate of the phasing as a function of resolution. The procedure allows objective and systematic evaluation of the map quality even at resolution ranges where it is difficult to recognize macromolecular features and therefore impossible to judge objectively any improvement of the map resulting from various phasing attempts. An example of phasing improvement during the course of structure determination of the large ribosomal subunit is summarized in Fig. 6B. At low resolution (<9 Å), MIR phasing was improved by averaging between crystals that were subjected to different stabilization protocols that yielded
shown as a gray line. Phasing at higher resolution was obtained through intercrystal averaging and solvent flipping (–·–·–) (for details of this phasing see Ban et al.39). Combined SAD phasing (–––), combined MIR and SAD phasing (—), and combined MIR, SAD, and phases obtained through averaging (· · ·). Phase difference after solvent flipping using crystallographic amplitudes beyond the limit of experimental phases (—). The horizontal axis is linear with respect to sin(θ)/λ.
Fig. 7. Electron density maps of the large ribosomal subunit contoured at the 1.8σ value. The polypeptide exit tunnel can be seen extending through the middle of the subunit. The atomic model for ribosomal RNA and the backbone for ribosomal proteins are shown in black. Symmetry-related molecules in the crystallographic unit cell are shown without the atomic model. Both maps were calculated at resolutions at which the phase difference with respect to the model reaches 60° in the highest resolution shell. (A) Electron density map before solvent flipping, calculated at a resolution of 4.8 Å, using MIR and SAD phasing with the derivatives described in Ban et al.9 (B) The same area of the electron density map as shown in (A), after solvent flipping.42
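The peak-height criterion described above reduces to a very small calculation once a difference Fourier map has been computed; the sketch below, with a hypothetical map array and site index, simply expresses a peak height in σ units so that different phase sets can be ranked.

```python
import numpy as np

def peak_height_in_sigma(diff_map, site):
    """Height of a difference Fourier peak, in standard deviations above the map
    mean, at a heavy atom site that was not used in the phasing. Larger values
    indicate better phases; the comparison can be repeated per resolution range."""
    return float((diff_map[site] - diff_map.mean()) / diff_map.std())

# e.g. compare peak_height_in_sigma(map_mir, site) with peak_height_in_sigma(map_mir_sad, site),
# where site is the grid index (a tuple) of a known but unused heavy atom position.
```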
related space groups or unit cell dimensions.39 When the resolution improved beyond 6 Å, solvent-flipping procedures became increasingly useful for phase improvement. The resulting phases were then used to identify several new single heavy atom derivatives, which bound at numerous sites, and combined MIR and SAD phases were calculated. The heavy atom phases, which depended critically on anomalous scattering, were poor, but were sufficient for remarkably successful application of solvent-flipping procedures (Fig. 7).

Power of Solvent-Flipping Phase Refinement for Crystals with Large Asymmetric Units
Although we have described a multitude of problems associated with data collection and phasing calculations for crystals with large unit cells, it is not generally appreciated that for such crystals phase refinement
39 N. Ban, P. Nissen, J. Hansen, M. Capel, P. B. Moore, and T. A. Steitz, Nature 400, 841 (1999).
through density modification by solvent flattening is relatively more powerful. The qualitative argument runs as follows.

1. The restraints on the electron density map imposed by the solvent mask translate into phase restraints in reciprocal space.
2. These phase restraints are local in reciprocal space: each structure factor is replaced by a summation of its close neighbors.
3. The more neighbors contribute significantly to this summation, the more accurately the phases can be reconstructed, because random errors cancel each other out more effectively when the summation is over many terms.
4. In this summation each of the neighboring structure factors is multiplied by a different complex structure factor determined by the Fourier transform of the solvent mask (the interference function).
5. Therefore, the larger the unit cell, and/or the higher the resolution of the solvent mask, the more neighboring structure factors contribute to these restraints (if, e.g., most of the intensity of the interference function falls within a radius of 1/(50 Å), this will include about eight structure factors for a P1 crystal with unit cell axes of about 100 Å, yet if the cell axes measure about 300 Å the interference function will include more than 200 terms).
6. Ergo, the larger the unit cell, the more powerful is solvent flattening.

Both a high solvent content and a large unit cell benefit solvent flattening. However, note that the size of the unit cell (or the resolution of the solvent mask) and the relative solvent content affect the success of solvent flattening in different ways. In the case of a large unit cell, the benefit comes from the increased number of structure factor constraints, whereas in the case of a high solvent content the error components of the structure factor constraints are scaled down (see also below).
To quantify this argument, we must first consider why solvent flattening works. The mathematical background is presented below in Eqs. (1)–(15).* In practice, a flat solvent is imposed in real space by multiplying an experimentally phased map with a binary function that defines the shape of the protein and its location in the unit cell:

ρm(x) = ρ(x)g(x)   (1)

where x is a real-space vector; ρm(x) is the modified, solvent-flattened map; ρ(x) is the unmodified map (in the first cycle: the experimentally phased map);
* We use the following conventions: |x| is the amplitude of x; ⟨x⟩ is the expected or mean value of x.
and g(x) is the shape function, which is 0 in the solvent region and 1 in the protein region. According to the convolution theorem, the Fourier transform of the product of two functions is equivalent to the convolution of their Fourier transforms, which, in the case of Eq. (1), results in

Fm(a) = ∫ G(x − a)F(x) dx   (2)

where x is a reciprocal-space vector; Fm(x) is the Fourier transform of ρm(x), the modified map; F(x) is the set of unmodified (experimentally phased) structure factors; and G(x) is the Fourier transform of the shape function g(x), also called the "interference function." It was shown elsewhere that on solvent flattening the map remains biased toward the original density map by a factor that is proportional to the protein content of the crystal40:

∫ G(x − a)F(x) dx = pF(a) + ∫ G°(x − a)F(x) dx   (3)

where p is the protein fraction, which is equal to the magnitude of the origin vector G(0) of G(x), and G°(x) is the interference function with its origin vector G(0) zeroed. Its Fourier transform is

Fourier transform of G°(x) = g(x) − p   (3a)

To remove this bias toward F(x), the interference function G°(x) should be used instead of G(x), which in real space results in solvent flipping rather than flattening40:

Fu(a) = ∫ G°(x − a)F(x) dx   (4)

where Fu(x) is the set of unbiased modified structure factors. Because of the definition of G°(x), the following is true in the absence of phase errors:

(1 − p)Ft(a) = ∫ G°(x − a)Ft(x) dx   (5)

where Ft(x) is the set of structure factors without errors. An important characteristic of G°(x) is its power, a scalar defined by the function s²(G°(x)). It is related to the solvent content of the crystal. According to Parseval's theorem, the power of G°(x) is proportional to
40 J. P. Abrahams, Acta Crystallogr. D Biol. Crystallogr. 53, 371 (1997).
that of its Fourier transform [g(x) − p]. It is straightforward to calculate this value in real space:

s²(G°(x)) ≡ ∫ |G°(x)|² dx = ∫ |g(x) − p|² dx = p(1 − p)   (6)

It is useful to treat the set of structure factors F(x), obtained at any given cycle in the phase refinement process, as the sum of two sets that are (on average) orthogonal:

F(x) = αFt(x) + Fe(x)   (7)

where Fe(x) is the error component of Fm(x) and α is a scalar related to the phase error, chosen such that αFt(x) and Fe(x) are (on average) orthogonal. α may be resolution dependent:

⟨αFt(x) · Fe(x)⟩ = 0   (7a)

Given Eqs. (4)–(7) it follows that

Fu(a) = α(1 − p)Ft(a) + ∫ G°(x − a)Fe(x) dx   (8)

Comparing Eqs. (7) and (8) shows what solvent flattening does: the true component is scaled down by a factor of (1 − p), while the error component Fe(x) is replaced by a different error component. This new error component has a random phase and its mean magnitude is determined by the convolution ∫ G°(x − a)Fe(x) dx. Because G°(x − a) and Fe(x) are not correlated, the mean of ∫ G°(x − a)Fe(x) dx over all structure factors obviously is zero. However, we need to determine the variance of this term, as this is a measure of the remaining errors in each of the structure factors. To do so, we recast the integral in Eq. (8) into a discrete summation and rearrange:

α(1 − p)Ft(a) − Fu(a) = −(1/n) Σi G°(xi − a)Fe(xi)   (9)

where the sum runs over i = 1, . . . , n, and n is the number of relevant structure factor amplitudes in G°(x); n depends on the resolution of the protein mask (d) and the volume of the unit cell (V):

n = (4πV)/(3d³)   (9a)

In practice it is probably better to determine n directly from G°(x) (see Fig. 8). Because G°(xi − a) and Fe(xi) can be treated statistically as uncorrelated for (1 ≤ i ≤ n), the summation of their products, the right-hand term of Eq. (9), can be approximated as the square root of the mean
Fig. 8. Radial distribution of the accumulated power of the interference function derived from a 3.2-Å envelope of the structure of F1-ATPase (Abrahams et al., 1994). On the horizontal axis, the number of structure factors, in spherical bins around the origin; on the vertical axis, the integral ∫|G(x)|² of the squared structure factor amplitudes within these bins, on a linear, arbitrary scale.
squared distance to the origin, resulting from a random walk in two dimensions:

⟨|Σi G°(xi − a)Fe(xi)|⟩ = ⟨ε⟩ n^(1/2)   (10)

where ⟨ε⟩ is the mean amplitude, over all vectors a, of the individual products of the summation in Eq. (10):

⟨ε⟩ = (1/n)⟨Σi |G°(xi − a)Fe(xi)|⟩   (10a)

The value of ⟨ε⟩ is determined by the probability distribution of |G°(xi − a)Fe(xi)|. Applying the central limit theorem to Eq. (10a) results in Eq. (11), which is true on average for large sets of structure factors:

⟨ε⟩ = (1/n)⟨Σi |G°(xi − a)Fe(xi)|⟩ = ⟨|Fe(a)|⟩ (1/n)⟨Σi |G°(xi − a)|⟩   (11)

For an uncorrelated set of structure factors such as G°(x − a), the probability distribution p(|G°(x − a)|) of its amplitudes is given in Eq. (12) (e.g., Drenth41):

p(|G°(x − a)|) = [2|G°(x − a)|/s²(G°(x − a))] exp[−|G°(x − a)|²/s²(G°(x − a))]   (12)

where s²(|G°(x − a)|) is the power of |G°(x − a)| as defined in Eq. (6). From Eqs. (6), (11), and (12) the mean amplitude ⟨ε⟩ of the individual terms of the convolution in Eq. (9) can be calculated by integration:

⟨ε⟩ = ⟨|Fe(x)|⟩ ∫ |G°(x − a)| p(|G°(x − a)|) d|G°(x − a)|
    = ⟨|Fe(x)|⟩ {π s²(|G°(x − a)|)/4}^(1/2) = ⟨|Fe(x)|⟩ {π p(1 − p)/4}^(1/2)   (13)

Combining Eqs. (9), (10), and (13), it can be concluded that solvent flattening reduces the error in an experimentally phased set of structure factors according to

⟨|α(1 − p)Ft(a) − Fu(a)|⟩ = (1/n)^(1/2) ⟨|Fe(x)|⟩ {π p(1 − p)/4}^(1/2)   (14)

where ⟨|F(a)|⟩{π p(1 − p)/4}^(1/2) ≤ ⟨|Fu(a)|⟩ ≤ ⟨|F(a)|⟩(1 − p). Comparing this result with Eq. (15), which is a rearranged form of Eq. (7), indicates the effects of unbiased solvent flattening (solvent flipping) on the average errors:

⟨|αFt(a) − F(a)|⟩ = ⟨|Fe(x)|⟩   (15)

In conclusion, unbiased solvent flattening (solvent flipping) has the following effects.
1. The average error of each structure factor that remains after solvent flipping has a random phase.
2. The expected mean squared amplitude of the error is reduced by a factor determined by p, the protein content of the crystal, and by the inverse square root of n, the number of relevant structure factors in G°(x).
3. The resolution and the absolute volume of the protein mask determine n [see Eq. (9a)].

These theoretical considerations are borne out in practice. For example, the structure of F1-ATPase (54% solvent; unit cell volume approximately 5.5 × 10⁶ Å³) could be solved using the isomorphous differences, extending to 3.2 Å, of just 2.5 Hg atoms.42
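As a minimal sketch of the argument above (assuming the reconstructed Eqs. (4), (9a), and (14), and ignoring map scaling, phase recombination, and iteration, all of which a production implementation needs), the real-space flip and the predicted error reduction can be written as:

```python
import numpy as np

def solvent_flip(rho, protein_mask):
    """One cycle of unbiased solvent flattening (solvent flipping): multiplying the
    map by (g(x) - p) scales the protein region by (1 - p) and flips the solvent
    region by -p, the real-space equivalent of using the interference function
    with its origin term removed."""
    p = protein_mask.mean()                  # protein fraction of the unit cell
    return rho * (protein_mask - p)

def error_reduction_factor(p, cell_volume, mask_resolution):
    """Error amplitude reduction predicted by comparing Eqs. (14) and (15):
    sqrt(pi*p*(1 - p)/4) / sqrt(n), with n = 4*pi*V/(3*d^3) from Eq. (9a),
    where d is the resolution of the solvent mask."""
    n = 4.0 * np.pi * cell_volume / (3.0 * mask_resolution ** 3)
    return np.sqrt(np.pi * p * (1.0 - p) / 4.0) / np.sqrt(n)

# The larger the cell (and the more detailed the solvent mask), the larger n becomes
# and the smaller the residual error, which is the point of this section.
```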
41 J. Drenth, in "Principles of Protein X-Ray Crystallography," p. 122. Springer-Verlag, New York, 1994.
42 J. P. Abrahams and A. G. W. Leslie, Acta Crystallogr. D Biol. Crystallogr. 52, 30 (1996).
In the case of even larger asymmetric structures like ribosomal subunits, the phases could be extended from even lower resolution (6–8 Å) to yield interpretable electron density.39 Next to very careful data processing, these results could be obtained because the large size of the unit cell increases the power of density modification techniques in improving phases.

Conclusions
We have witnessed a dramatic increase in the success of crystallography in solving the structures of large asymmetric subunits. The major contributors to this success are the improved quality and flux of synchrotron beam lines, allowing data to be measured with the highest possible accuracy. More accurate data allow even weak phasing signals to become useful. These weak phasing signals are often essential for the success of the more powerful computational methods that have been developed for determining and refining phases. Finally, a hitherto unrealized benefit of large unit cells has assisted structure determinations of large asymmetric macromolecular complexes: phase refinement through solvent flipping is more powerful for large unit cells.

Acknowledgments

Some of the strategies and procedures described here were developed through inspiring discussions between N.B. and researchers at Yale University in the course of the large ribosomal subunit structure determination: T. A. Steitz, P. B. Moore, and P. Nissen. This work was supported by a Burroughs Wellcome Fund Career Award to N.B. J.P.A. warmly thanks N. Pannu, J. Plaisier, and R. A. G. de Graaff for vital feedback on the mathematical derivations.
[9] Multidimensional Histograms for Density Modification

By Kam Y. J. Zhang

Introduction
Density modification improves the quality of an approximate electron density map by imposing physical constraints based on some conserved features of the correct electron density map. These conserved features are independent of the unknown fine detail of the structural conformation. They are often expressed as constraints on the electron density in various forms, either in real or reciprocal space. Because the structure factor amplitudes
are known, these constraints restrict the value of phases and therefore can be used for phase improvement. Density modification methods generally require an initial map with substantial phase information. In most cases, these phases are obtained from multiple isomorphous replacement (MIR)1 or multiwavelength anomalous dispersion (MAD),2 and they are of pivotal importance when the experimental source of phase information is bimodal, as it is in single-wavelength anomalous scattering (SAD) and single isomorphous replacement methods (SIR). It is also possible to improve maps from other sources, such as molecular replacement. The amount of information in the initial map is dependent on phase accuracy, data resolution, and completeness. As more powerful constraints are incorporated, the density modification can be initiated from lower resolution maps with less accurate phases. Density modification methods are usually implemented as an iterative procedure that alternates between density modification in real space and phase combination in reciprocal space. This paradigm was first proposed by Hoppe and Gassmann3 in their ‘‘phase correction’’ method. This approach takes advantage of the particular properties of the constraints and uses them in a way that is most convenient to implement. A broad range of techniques has been developed to modify electron density maps by imposing chemical or physical information. The most commonly used density modification method is solvent flattening,4 which exploits the observation that the solvent region of the electron density map is featureless at medium resolution because of the high thermal motion and disorder of the solvent molecules. Flattening of the solvent region suppresses noise in the electron density map and thereby improves phases. A complementary method to solvent flattening is histogram matching,5 which modifies the protein region of the map by systematically adjusting the electron density values so that the electron density distribution conforms to an ideal distribution. Sayre’s equation is used to restrain the local shape of the electron density.6 Molecular averaging forces the electron density at equivalent positions to be equal when there are multiple copies of the same molecule in the asymmetric unit.7 Electron density skeletonization
1 M. F. Perutz, Acta Crystallogr. 9, 867 (1956).
2 W. A. Hendrickson, J. R. Horton, and D. M. LeMaster, EMBO J. 9, 1665 (1990).
3 W. Hoppe and J. Gassmann, Acta Crystallogr. B 24, 97 (1968).
4 B. C. Wang, in "Diffraction Methods for Biological Macromolecules" (H. W. Wyckoff, C. H. W. Hirs, and S. N. Timasheff, eds.), Vol. 115, p. 90. Academic Press, Orlando, FL, 1985.
5 K. Y. J. Zhang and P. Main, Acta Crystallogr. A 46, 41 (1990).
6 D. Sayre, Acta Crystallogr. 5, 60 (1952).
7 G. Bricogne, Acta Crystallogr. A 32, 832 (1976).
imposes main-chain connectivity in the electron density, which is characteristic of protein molecules.8–10 Comprehensive descriptions of various density modification techniques can be found in Cowtan and Zhang11 and Zhang et al.12

Density Histogram and Histogram Matching
Histogram matching seeks to bring the distribution of electron density values of a map to that of an ideal map. It has proved to be a powerful method for phase improvement.5,13–15 The electron density histogram of a map is the probability distribution of the electron density values. The density histogram specifies not only the permitted values of the electron density but also their frequencies of occurrence. This distribution contains structural information about the underlying protein structure, such as the types of atoms and their packing. Proteins consist mostly of C, N, O, and a few S atoms, and these atoms are certain characteristic distances apart. The atoms are packed together in protein structures and the packing density is relatively independent of the detailed structure conformation.16,17 The distribution of atomic types, and the distances and angles between different atomic types, are all similar among different structures. Differences in structural conformation arise mainly from the dihedral angles of each residue. The density histogram discards this spatial information and therefore is independent of the factors that make each structure unique. Rather, it captures the commonality between different structures: the
8 C. Wilson and D. A. Agard, Acta Crystallogr. A 49, 97 (1993).
9 D. Baker, C. Bystroff, R. J. Fletterick, and D. A. Agard, Acta Crystallogr. D Biol. Crystallogr. 49, 429 (1993).
10 C. Bystroff, D. Baker, R. J. Fletterick, and D. A. Agard, Acta Crystallogr. D Biol. Crystallogr. 49, 440 (1993).
11 K. D. Cowtan and K. Y. J. Zhang, in "Progress in Biophysics and Molecular Biology" (T. Blundell, ed.), Vol. 72, p. 245. Elsevier Science, Amsterdam, 1999.
12 K. Y. J. Zhang, K. D. Cowtan, and P. Main, in "International Tables for Crystallography" (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 311. Kluwer Academic, Dordrecht, The Netherlands, 2001.
13 K. Y. J. Zhang, K. D. Cowtan, and P. Main, in "Macromolecular Crystallography" (C. W. Carter and R. M. Sweet, eds.), Vol. 277, p. 53. Academic Press, New York, 1997.
14 K. Y. J. Zhang, Acta Crystallogr. D Biol. Crystallogr. 49, 213 (1993).
15 K. D. Cowtan, K. Y. J. Zhang, and P. Main, in "International Tables for Crystallography" (M. G. Rossmann and E. Arnold, eds.), Vol. F, p. 705. Kluwer Academic, Dordrecht, The Netherlands, 2001.
16 B. W. Matthews, J. Mol. Biol. 33, 491 (1968).
17 B. W. Matthews, J. Mol. Biol. 82, 513 (1974).
similar atomic composition and the characteristic distances between atoms. These common features can distinguish correct from incorrect structures. Therefore the ideal density histogram can be used to improve an electron density map5,18,19 or to select a correct phase set among many randomly generated phase sets in ab initio phasing.20 The density histogram is degenerate in encoding structural information. While having an ideal density distribution is a necessary condition for being a correct structure, it is not a sufficient condition. Many incorrect structures may also have ideal density histograms. Moreover, the density histogram does not capture all the common features found in protein structures. Because the density histogram accounts for the value only at a given point and ignores its neighboring environment, any information about the neighborhood of a grid point will be complementary to the density histogram.

Multidimensional Histograms
One way of reducing the degeneracy of the electron density histogram is to incorporate more stereochemical information into the constraints. The electron density histogram takes the density values as independent objects, and no relationship between them is taken into account. Xiang and Carter proposed extending the density histogram to include relationships between neighboring density values via multidimensional histograms, defined as joint probabilities of the density values and their higher-order derivatives.21 Stereochemical information is usually expressed as bond lengths and angles between atoms. This information has been routinely used as restraints in structure refinement.22–24 However, it cannot be applied in the same form to the electron density distribution because the objects in the density distribution are not atomic positions but pixels of electron density. Nevertheless, in macromolecular structure determination, the characteristic geometric shape of the electron density that provides a unique guide for the crucial step of model building derives implicitly from the same bond lengths, bond angles, and atomic types. Thus this characteristic geometric shape expresses the stereochemical information.
18 R. W. Harrison, J. Appl. Crystallogr. 21, 949 (1988).
19 V. Y. Lunin, Acta Crystallogr. A 44, 144 (1988).
20 V. Y. Lunin, A. G. Urzhumtsev, and T. P. Skovoroda, Acta Crystallogr. A 46, 540 (1990).
21 S. Xiang and C. W. J. Carter, Acta Crystallogr. D 52, 49 (1996).
22 A. T. Brünger, J. Kuriyan, and M. Karplus, Science 235, 458 (1987).
23 D. E. Tronrud, L. F. Ten Eyck, and B. W. Matthews, Acta Crystallogr. A 43, 489 (1987).
24 J. H. Konnert and W. A. Hendrickson, Acta Crystallogr. A 36, 344 (1980).
Geometric shape at a particular grid point, a, in an electron density map is not completely defined by the electron density value, ρ(a). Complementary information is provided by the derivatives, ρ^(1)(a), ρ^(2)(a), . . . , ρ^(n)(a), at grid point a. If all the derivatives are known, the electron density, ρ(r), in the neighborhood of a can be expressed as a Taylor series,

ρ(r) = ρ(a) + (r − a)ρ^(1)(a) + · · ·   (1)

Here, ρ^(k) = ∇^k ρ is defined as the kth derivative of ρ, where

∇ = (∇x ∇y ∇z)(i j k)ᵀ = (∂/∂X  ∂/∂Y  ∂/∂Z)(i j k)ᵀ

is the gradient operator. Here, ∂/∂X, ∂/∂Y, and ∂/∂Z represent the partial derivatives along the orthogonal axes X, Y, and Z, respectively, and (i j k) represents the unit vectors along the three orthogonal axes. Successive applications of the gradient operator on the electron density give the successive orders of derivatives. The nth derivative can be represented as the derivative of the (n − 1)th derivative, which implies that each successive derivative contains information about the neighborhood of the (n − 1)th derivative. The ρ(a) in Eq. (1) represents the "average" value of the function in the neighborhood of a. The first-order derivative represents the difference between ρ(r) and its immediate neighbor at (r − a). The second-order derivative represents the differences between its neighbor's neighbors. The higher the order of the derivative, the longer-range the interaction it represents, and therefore the longer the range around r over which the expansion holds. If all these derivatives are known, ρ(r) could be determined precisely using the Taylor series. Even if the derivatives are not known, their distribution will constrain the values of the density and, more importantly, its relationship with its neighbors. A multidimensional histogram therefore could capture neighborhood information about the electron density. The n-dimensional (n-D) histogram is the joint distribution of the derivatives of the electron density to the nth order,

Hn = P(ρ^(0), ρ^(1), . . . , ρ^(n))   (2)
The projection of the n-D histogram along each dimension could also give us the 1D histogram or histograms of lower dimensions, such as
P(ρ^(0)) = Σ_{ρ^(i) ≠ ρ^(0)} P(ρ^(0), ρ^(1), . . . , ρ^(n)) = P(ρ)

P(ρ^(1)) = Σ_{ρ^(i) ≠ ρ^(1)} P(ρ^(0), ρ^(1), . . . , ρ^(n)) = P(g)   (3)

P(ρ^(0), ρ^(1)) = Σ_{ρ^(i) ≠ ρ^(0), ρ^(1)} P(ρ^(0), ρ^(1), . . . , ρ^(n)) = P(ρ, g)
The derivatives of any order can be calculated using the following formulae. The fractional coordinates (x y z) along the crystal axes a, b, and c are transformed to the orthogonal coordinates (X Y Z) along the orthonormal axes (i j k) by the orthogonalization matrix

(X Y Z)ᵀ = [ a   b cos γ   c cos β
             0   b sin γ   c(cos α − cos β cos γ)/sin γ
             0   0         V/(ab sin γ) ] (x y z)ᵀ   (4)

where V = abc(1 − cos²α − cos²β − cos²γ + 2 cos α cos β cos γ)^(1/2). The first derivative along the orthonormal axes (i j k) is then given by

∇ = (∂/∂X  ∂/∂Y  ∂/∂Z)(i j k)ᵀ
  = [ a   b cos γ   c cos β
      0   b sin γ   c(cos α − cos β cos γ)/sin γ
      0   0         V/(ab sin γ) ] (∂/∂x  ∂/∂y  ∂/∂z)ᵀ (i j k)ᵀ   (5)
The second derivative is then given by

∇² = ∇ · ∇ = (∂/∂X  ∂/∂Y  ∂/∂Z)(i j k)ᵀ (i j k)(∂/∂X  ∂/∂Y  ∂/∂Z)ᵀ = ∂²/∂X² + ∂²/∂Y² + ∂²/∂Z²   (6)

The third derivative follows as

∇³ = ∇²∇ = (∂²/∂X² + ∂²/∂Y² + ∂²/∂Z²)(∂/∂X  ∂/∂Y  ∂/∂Z)(i j k)ᵀ   (7)
For even-order derivatives the equation becomes

∇^n = (∇²)^(n/2) = (∂²/∂X² + ∂²/∂Y² + ∂²/∂Z²)^(n/2)   (8)

For odd-order derivatives we have

∇^n = (∇²)^((n−1)/2) ∇ = (∂²/∂X² + ∂²/∂Y² + ∂²/∂Z²)^((n−1)/2) (∂/∂X  ∂/∂Y  ∂/∂Z)(i j k)ᵀ   (9)
ðn1Þ=2
The even order derivatives are scalars that can be calculated by one fast Fourier transform (FFT). The odd order derivatives are vectors that can be calculated by three FFTs. The modulus of the odd order derivatives can be calculated as sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 ðn1Þ=2 2 2 2 2 2 @ @ @ @ @ @ (10) jrn j ¼ þ þ þ þ @X @Y @Z @X 2 @Y 2 @Z2 There is no restriction on the number of components used to construct a multidimensional histogram. The more components used, the more stereochemical information can be encoded in the multidimensional histogram. However, the density derivatives are not independent of each other. The lower order derivatives carry most of the information about the characteristic shape of the electron density. Also considering the computational cost, Xiang and Carter examined only the electron density value and its two lowest order derivatives, gradient and Laplacian,21 EGL ¼ Pð; g; lÞ
(11)
where ¼ ð0Þ is the electron density, g ¼ ð1Þ is the gradient, and l ¼ ð2Þ is the Laplacian. Projections of the above-described three-dimensional histogram give rise to the following one-dimensional histograms, and two-dimensional histograms, Z Z E ¼ PðÞ ¼ Pð; g; lÞdgdl (12) l
G ¼ PðgÞ ¼
l
L ¼ PðlÞ ¼
g
Z Z Pð; g; lÞddl
(13)
Pð; g; lÞddg
(14)
Z Z g
[9]
multidimensional histograms
EG ¼ Pð; gÞ ¼
195
Z Pð; g; lÞdl
(15)
Pð; g; lÞdg
(16)
Pð; g; lÞd
(17)
l
EL ¼ Pð; lÞ ¼
Z g
GL ¼ Pðg; lÞ ¼
Z
The gradient can be calculated by three FFTs with the Fourier coefficients modified accordingly,

g = |∇ρ| = [ (∂ρ/∂X)² + (∂ρ/∂Y)² + (∂ρ/∂Z)² ]^(1/2)   (18)

where

∂ρ/∂X = (2πi/V) Σhkl a1 h F(hkl) exp[2πi(hx + ky + lz)] = gx
∂ρ/∂Y = (2πi/V) Σhkl (a2 h + b2 k) F(hkl) exp[2πi(hx + ky + lz)] = gy
∂ρ/∂Z = (2πi/V) Σhkl (a3 h + b3 k + c3 l) F(hkl) exp[2πi(hx + ky + lz)] = gz   (19)

Similarly, the Laplacian can be calculated by one FFT with the Fourier coefficients modified accordingly,

l = ∇²ρ = −(4π²/V) Σhkl Dhkl F(hkl) exp[2πi(hx + ky + lz)]   (20)

with Dhkl = (a1 h)² + (a2 h + b2 k)² + (a3 h + b3 k + c3 l)², where the elements of the orthogonalization matrix are

a1 = a* sin(β*) sin(γ*)    b2 = b* sin(α*)    c3 = c*
a2 = a* sin(β*) cos(γ*)    b3 = b* cos(α*)
a3 = a* cos(β*)

The a*, b*, c*, α*, β*, and γ* variables are the reciprocal-space cell parameters, and x, y, and z are crystallographic (fractional) coordinates. The orthogonal axes X, Y, and Z were chosen such that X is along the crystallographic axis a and Z is along the c* axis.
To verify the insensitivity of the multidimensional histograms to molecular conformation, Xiang, Carter, and coworkers artificially built three different secondary structures from the same 16-residue peptide taken from cytidine deaminase,25 as well as a random atom model from the same peptide. For simplicity, only the one-dimensional histograms E, G, and L were tested for the four different atomic models. They found that histograms from the different secondary structural conformations, helix, sheet, and loop, are almost the same. In contrast, the histograms of the random atom model differ significantly everywhere from those of the corresponding secondary structures. They noted specifically that the gradient histogram recorded the largest overall differences between the random atom model and the models with regular secondary structures. It was suggested that the gradient histogram encodes much more stereochemical information than either the electron density or the Laplacian histograms. The usefulness of histograms in density modification and ab initio phasing depends on their sensitivity to phase errors. The phase sensitivity of the multidimensional histograms was tested with 3.5-Å X-ray diffraction data of cyclophilin A.26 To give a quantitative measure of the sensitivity of histograms to phase errors, Xiang and Carter defined a histogram R factor, Rh = Σ|P − Pm| / Σ Pm, where P is the histogram of the electron density in question and Pm is the error-free histogram. Rh measures the difference between a histogram and the error-free histogram and, by implication, the phase difference between them, because the structure factor amplitudes are the same in both cases. The changes of Rh with phase error for the three-dimensional histogram, EGL, as well as its various projections to lower dimensions, are nicely illustrated in Fig. 5 of Xiang and Carter.21 The variation of Rh, ΔRh, when the phase error changes from 0° to 90°, is summarized in the following table:
         E      G      L      EG     EL     GL     EGL
ΔRh     0.16   0.36   0.17   0.43   0.29   0.42   0.49
25 L. Betts, S. Xiang, S. A. Short, R. Wolfenden, and C. W. J. Carter, J. Mol. Biol. 235, 635 (1994).
26 H. M. Ke, L. D. Zydowsky, J. Liu, and C. T. Walsh, Proc. Natl. Acad. Sci. USA 88, 9483 (1991).
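A one-dimensional version of this R factor is easy to compute for any trial map; the binning choices below are arbitrary, and a multidimensional version would simply histogram (ρ, g, l) jointly, for example with np.histogramdd.

```python
import numpy as np

def histogram_r_factor(rho_trial, rho_reference, bins=100):
    """R_h = sum|P - P_m| / sum(P_m), comparing the density histogram P of a trial
    map with the error-free reference histogram P_m on a common set of bins."""
    lo = min(rho_trial.min(), rho_reference.min())
    hi = max(rho_trial.max(), rho_reference.max())
    p, _ = np.histogram(rho_trial, bins=bins, range=(lo, hi), density=True)
    p_m, _ = np.histogram(rho_reference, bins=bins, range=(lo, hi), density=True)
    return float(np.abs(p - p_m).sum() / p_m.sum())
```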
Thus, higher dimension histograms have increased sensitivity to phase errors. This indicates that the components of the histograms (the density, the gradient, and the Laplacian) encode somewhat independent stereochemical information. They also found that histograms that contain the gradient have a higher sensitivity to phase error. This enhanced phase sensitivity probably arises because the gradient captures more stereochemical information, owing to its higher sensitivity to molecular shape compared with either the density or the Laplacian. On the basis of their studies, Xiang and Carter concluded that the multidimensional histogram, including additional dimensions composed of the gradient magnitude and the Laplacian of the density, is minimally dependent on molecular folding and packing, while capturing substantially more stereochemical information than the conventional electron density histogram. The multidimensional histogram substantially reduces the degeneracy of the electron density histogram. They suggested that multidimensional histograms could be used as improved targets for density modification and as a more reliable figure of merit for evaluating correct phases.

Multidimensional Histogram as Constraint for Density Modification
Double Histogram Method

In the conventional (1D) electron density histogram matching method,5 a one-to-one mapping is made from the original electron density to the new electron density so that the density histogram of the modified map matches that of the ideal histogram. The order of the electron density values is retained after histogram matching: two grid points with the same electron density value will have the same density value after histogram matching. Therefore, the pattern of peaks and troughs in the modified map is similar to that in the original map. This is necessary in the histogram matching process, because there are many alternative ways of adjusting the electron density values to match an ideal electron density distribution. However, this feature is undesirable, especially when spurious electron density peaks need to be removed and new electron density corresponding to missing atoms needs to be generated. This coupling of the original and modified maps is broken during phase combination or when other constraints are introduced. Alternative ways of decoupling the electron density order between the modified and original maps include incorporating other features of the electron density in addition to the electron density distribution. Refaat et al. proposed a double-histogram matching method in which the density modification takes into account not only the current density
values at a grid point but also some characteristics of the environment of that grid point within some distance.27 They investigated three local density environments: (1) local minimum density, Lmin, (2) local maximum density, Lmax, and (3) local density variance, VL. By local in this context it means within some distance R of the grid point in question. The local minimum and maximum density for a given grid point can be easily determined by comparing the density values of all the grid point within a radius R. The local density variance can be found by the use of Fourier transforms. VL ¼ h2 iL hi2L ¼ 2 W ð WÞ2 2 ¼ =1 =ð2 Þ =ðWÞ =1 ½=ðÞ =ðWÞ 2 ¼ =1 ½G Q =1 ½F Q
(21)
Here, $\mathcal{F}$ and $\mathcal{F}^{-1}$ represent the Fourier transform and inverse Fourier transform, and the symbols $\otimes$ and $\cdot$ represent convolution and multiplication, respectively. F and G are the Fourier transforms of the electron density $\rho$ and the squared electron density $\rho^2$, respectively. Q is the Fourier transform of the weight function W used to calculate the average. Two types of weighting schemes were used to calculate the average density within a given radius. The first was a uniform weight everywhere within a sphere of radius R (ball function, $W_b$). The second was a weight that varies linearly from a maximum at the center to zero at the surface within a sphere of radius R (tent function, $W_t$). The Fourier transforms of both the ball and tent functions can be derived analytically and scaled so that the integration of the function over the sphere gives unity. The Fourier transform of the ball function is

$$
Q_b(s) = \mathcal{F}(W_b) = \frac{3}{4\pi^2 R^2 s^2}\left[\frac{\sin(2\pi R s)}{2\pi R s} - \cos(2\pi R s)\right]
\tag{22}
$$

The Fourier transform of the tent function is

$$
Q_t(s) = \mathcal{F}(W_t) = \frac{3}{2\pi^3 R^3 s^3}\left[\frac{2\,[1 - \cos(2\pi R s)]}{2\pi R s} - \sin(2\pi R s)\right]
\tag{23}
$$
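As an illustration of how Eqs. (21) and (22) can be evaluated in practice, the following Python sketch computes the local density variance of a gridded, periodic map by FFT-based convolution with the unit-integral ball weight. It is an illustration only; the array names, grid spacing, and radius are assumptions rather than values from the original work.

```python
import numpy as np

def ball_weight_ft(shape, spacing, R):
    """Fourier transform of the unit-integral ball of radius R [Eq. (22)],
    written in the equivalent form 3[sin x - x cos x]/x^3 with x = 2*pi*R*s,
    evaluated on the reciprocal grid of a map with the given shape/spacing."""
    freqs = np.meshgrid(*[np.fft.fftfreq(n, d=spacing) for n in shape], indexing="ij")
    s = np.sqrt(sum(f ** 2 for f in freqs))      # |s| in reciprocal-length units
    x = 2.0 * np.pi * R * s
    Q = np.ones(shape)                           # Q(0) = 1 because W integrates to unity
    nz = x > 1e-8
    Q[nz] = 3.0 * (np.sin(x[nz]) - x[nz] * np.cos(x[nz])) / x[nz] ** 3
    return Q

def local_variance(rho, spacing, R):
    """Local density variance V_L = <rho^2>_L - <rho>_L^2 via Eq. (21)."""
    Q = ball_weight_ft(rho.shape, spacing, R)
    mean_rho = np.fft.ifftn(np.fft.fftn(rho) * Q).real        # rho convolved with W
    mean_rho2 = np.fft.ifftn(np.fft.fftn(rho ** 2) * Q).real  # rho^2 convolved with W
    return mean_rho2 - mean_rho ** 2

# Toy example: a random 32^3 map with 1.0 Angstrom grid spacing and R = 5 Angstrom.
rho = np.random.rand(32, 32, 32)
VL = local_variance(rho, spacing=1.0, R=5.0)
```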
In their double-histogram matching procedure, the grid points are divided into 10 groups containing the same number of grid points in each group over 10 different value ranges of the local characteristic, such as the local minimum, local maximum, or local variance of density. Therefore,
27 L. S. Refaat, C. Tate, and M. M. Woolfson, Acta Crystallogr. D Biol. Crystallogr. 52, 252 (1996).
10 different histograms are created, each corresponding to a different range of the local environment. The electron density values within each local environment are modified according to the histogram matching process described by Zhang and Main5 such that the resulting histogram after modification conforms to the ideal histogram of the same local environment. The ideal histograms for the 10 different local environments are obtained from a model structure resembling the one under investigation. To reduce the changes to the density in each cycle, a damping factor, c, was used. The revised modified density, $\rho'$, is given by

$$
\rho' = (1 - c)\,\rho_0 + c\,\rho_m
\tag{24}
$$
where $\rho_0$ and $\rho_m$ are the original and histogram-matched densities, respectively. The double-histogram matching method has been tested on two known protein structures, RNAp128 and 2Zn insulin.29 Various averaging radii and damping factors have been tested. It was found that the best results for RNAp1 are from using the local density variance as the local characteristic with the tent function and a radius of 0.5 Å as the averaging scheme. The mean phase error was reduced by 10° and the map correlation coefficient was improved by 0.14 as compared with the normal density histogram matching method. The best results for 2Zn insulin are from using the local density maximum as the local characteristic with a radius of 0.5 Å and a damping factor of 0.9. The improvement over the normal density histogram matching method was a 4° reduction in phase error and a 0.06 increase in map correlation. Refaat et al. have shown that judicious use of the double-histogram matching method can give appreciably better results than use of the normal density histogram matching procedure. The choice of the damping factor c is consistently indicated at about 0.9, but the best value for the radius of the local characteristic, R, seems to be structure dependent. Good results are obtained with either the local density maximum or the tent-function-weighted local density variance as the local characteristic. However, the most reliable choice of parameters seems to be to use the local maximum density, a damping factor of 0.9, and a value of R in the range of 0.5–0.6 Å, which gives good results for both RNAp1 and 2Zn insulin.
28 S. I. Bezborodova, L. A. Ermekbaeva, S. V. Shlyapnikov, K. M. Polyakov, and A. M. Bezborodov, Biokhimiya 53, 965 (1988).
29 E. N. Baker, T. L. Blundell, J. F. Cutfield, S. M. Cutfield, E. J. Dodson, G. G. Dodson, D. M. Hodgkin, R. E. Hubbard, N. W. Isaacs, C. D. Reynolds, N. Sakabe, and M. Vijayan, Philos. Trans. R. Soc. Lond. B Biol. Sci. 319, 369 (1988).
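A minimal Python sketch of the grouping, per-group matching, and damping steps described above is given below. The helper match_histogram() stands in for conventional 1D histogram matching, and ideal_values / ideal_hists are assumed to hold ideal density values for each local environment; none of these names come from the original paper or from the PERP package.

```python
import numpy as np

def match_histogram(values, ideal_values):
    """1D histogram matching: replace each value by the ideal-distribution
    quantile of the same rank, so the order of the densities is preserved."""
    order = np.argsort(values)
    ideal_sorted = np.sort(np.asarray(ideal_values, dtype=float))
    idx = np.linspace(0, ideal_sorted.size - 1, values.size).astype(int)
    matched = np.empty(values.size, dtype=float)
    matched[order] = ideal_sorted[idx]
    return matched

def double_histogram_match(rho, local_char, ideal_hists, c=0.9, n_groups=10):
    """Partition grid points into n_groups equal-population groups by the local
    characteristic (local minimum, maximum, or variance), match each group's
    density histogram to the ideal histogram of that environment, and damp the
    change with factor c as in Eq. (24)."""
    rho0 = rho.ravel().astype(float)
    rho_m = rho0.copy()
    ranks = np.argsort(np.argsort(local_char.ravel()))       # ranks 0 .. N-1
    group = (ranks * n_groups) // rho0.size                  # equal-sized groups
    for g in range(n_groups):
        sel = group == g
        # ideal_hists[g]: assumed array of ideal density values for environment g
        rho_m[sel] = match_histogram(rho0[sel], ideal_hists[g])
    return ((1.0 - c) * rho0 + c * rho_m).reshape(rho.shape)  # Eq. (24)
```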
The double-histogram matching procedure has been incorporated into a computer program package, PERP (phase extension and refinement program), by Refaat et al.30

Two-Dimensional Histogram Matching Method

In pursuit of an electron density constraint that reduces the degeneracy of the electron density histogram and incorporates more stereochemical information from the underlying structure, Goldstein and Zhang examined the 2D histogram of the joint probability distribution of the electron density and its gradient31 in a manner similar to that of Xiang and Carter.21 They considered an extended scope of protein structures from 16 distinct fold families.32 Electron density gradients were calculated by FFTs in a way similar to that proposed by Xiang and Carter.21 The accumulation of the 2D histogram is similar to that for the 1D density histogram.5 They have systematically examined the 16 structures to study the dependence of the 2D histogram on resolution, overall temperature factor, structural conformation, and phase error. The 2D histogram was found to vary with resolution and overall temperature factor, but was found to be insensitive to structure conformation. The average correlation coefficient between pairs of 2D histograms at the three different resolutions examined was 0.90, with a standard deviation of 0.04. The 2D histogram was also found to be sensitive to phase error. The average correlation coefficient between 2D histograms with a 10° phase difference is 0.71. The variation of the 2D histogram due to structure conformation was estimated to be equivalent to that of a 4° phase error. This establishes the minimal phase error that a 2D histogram matching method could achieve. The conservation of the 2D histogram with respect to structure conformation enables the prediction of the ideal 2D histogram for unknown structures. The sensitivity of the 2D histogram to phase error suggests that it could be used as a target for the density modification method and also as a figure of merit for phase selection in ab initio phasing. Having established the predictability of the 2D histogram, owing to its independence from structural conformation, and its sensitivity to phase error, a 2D histogram matching procedure was developed by Nieh and Zhang to exploit the joint probability distribution of electron density and its gradient as a constraint for density modification.33
30 L. S. Refaat, C. Tate, and M. M. Woolfson, Acta Crystallogr. D Biol. Crystallogr. 52, 1119 (1996).
31 A. Goldstein and K. Y. J. Zhang, Acta Crystallogr. D Biol. Crystallogr. 54, 1230 (1998).
32 C. A. Orengo, T. P. Flores, W. R. Taylor, and J. M. Thornton, Protein Eng. 6, 485 (1993).
33 Y. P. Nieh and K. Y. J. Zhang, Acta Crystallogr. D Biol. Crystallogr. 55, 1893 (1999).
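For orientation, a joint density–gradient histogram of the kind examined by Goldstein and Zhang could be accumulated along the following lines. This sketch assumes an orthogonal cell sampled on a regular periodic 3D grid and uses spectral differentiation; it is not code from the cited work.

```python
import numpy as np

def gradient_magnitude_fft(rho, spacing):
    """Gradient magnitude of a periodic 3D map computed in reciprocal space
    (the derivative along x multiplies each Fourier coefficient by 2*pi*i*s_x)."""
    F = np.fft.fftn(rho)
    grads = []
    for axis, n in enumerate(rho.shape):
        shape = [1, 1, 1]
        shape[axis] = n
        s = np.fft.fftfreq(n, d=spacing).reshape(shape)
        grads.append(np.fft.ifftn(2j * np.pi * s * F).real)
    return np.sqrt(sum(g ** 2 for g in grads))

def density_gradient_histogram(rho, spacing, bins=50):
    """Normalized 2D histogram of (density, gradient magnitude) over all voxels."""
    grad = gradient_magnitude_fft(rho, spacing)
    H, rho_edges, grad_edges = np.histogram2d(rho.ravel(), grad.ravel(), bins=bins)
    return H / H.sum(), rho_edges, grad_edges
```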
The 2D histogram matching on density and its gradients is achieved through two alternating steps of 1D histogram matching on density and 1D histogram matching on gradients. The 1D histogram matching on density follows the method described by Zhang and Main.5 In this method, the new electron density value is derived from the old electron density value through a linear transformation such that the cumulative distribution of the new density values equals the cumulative distribution of the ideal histogram. The histogram matching on gradients follows a similar protocol in which the density value is replaced by the gradients. The modified gradient maps are converted to modified structure factors by the fast Fourier transform method as shown in the following equation:

$$
\begin{aligned}
h\,F(hkl) &= \frac{V}{2\pi i N}\sum_{xyz} g_x \exp[2\pi i(hx + ky + lz)] \\
k\,F(hkl) &= \frac{V}{2\pi i N}\sum_{xyz} g_y \exp[2\pi i(hx + ky + lz)] \\
l\,F(hkl) &= \frac{V}{2\pi i N}\sum_{xyz} g_z \exp[2\pi i(hx + ky + lz)]
\end{aligned}
\tag{25}
$$

Three structure factor sets are generated through the inverse FFT on each of the three modified gradient maps after the histogram matching on gradients. A single structure factor data set is obtained by vector averaging the equivalent reflections in the three structure factor sets. The 2D histogram matching process was implemented in two modes: the parallel mode and the sequential mode. In the parallel mode, the histogram matching on density and gradients is applied in parallel, using the same initial structure factor set. After matching, the two new structure factor sets are combined by vector summation. In the sequential mode, as shown in Fig. 1, the structure factor set calculated after density histogram matching is used as input for the gradient histogram matching, and vice versa. Test results showed that histogram matching in sequential mode gave better phase improvements and converged closer to the ideal 2D histogram in fewer matching cycles compared with the parallel mode. The 2D histogram matching procedure was incorporated into the density modification program SQUASH.14,34 Nieh and Zhang have tested their 2D histogram matching procedure using 2Zn insulin29 with both MIR phases and calculated phases with random errors. The MIR phases contain both random and systematic errors. The contribution of these two components to MIR phase errors varies from structure to structure.
34 K. Y. J. Zhang and P. Main, Acta Crystallogr. A 46, 377 (1990).
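The back-transformation of the three modified gradient maps into a single structure-factor set, Eq. (25) followed by vector averaging of the equivalent reflections, could be sketched as follows for an orthogonal P1 cell on an N-point grid; the function and variable names are illustrative, and the overall sign convention may differ from that of the original program.

```python
import numpy as np

def structure_factors_from_gradients(gx, gy, gz, volume):
    """Recover F(hkl) from three modified gradient maps via Eq. (25) and
    vector-average the three independent estimates."""
    n = gx.shape
    # Signed integer Miller indices h, k, l laid out on the FFT grid.
    h = np.fft.fftfreq(n[0], d=1.0 / n[0]).reshape(-1, 1, 1)
    k = np.fft.fftfreq(n[1], d=1.0 / n[1]).reshape(1, -1, 1)
    l = np.fft.fftfreq(n[2], d=1.0 / n[2]).reshape(1, 1, -1)
    # In numpy's convention, Sum_xyz g * exp[+2*pi*i(hx+ky+lz)] = N * ifftn(g),
    # so each line of Eq. (25) gives F = volume * ifftn(g) / (2*pi*i*m), m = h, k, l.
    estimates, weights = [], []
    for g, m in ((gx, h), (gy, k), (gz, l)):
        with np.errstate(divide="ignore", invalid="ignore"):
            F = volume * np.fft.ifftn(g) / (2j * np.pi * m)
        estimates.append(np.where(m != 0, F, 0.0))   # index m = 0 gives no estimate
        weights.append((m != 0) * np.ones(n))
    w = sum(weights)
    return sum(estimates) / np.where(w > 0, w, 1.0)  # vector average of the estimates
```

F(000), for which all three index factors vanish, is left undetermined here because it cannot be recovered from the gradients.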
Fig. 1. The 2D histogram matching procedure. The 2D histogram matching is achieved through alternating application of 1D histogram matching on electron density and 1D histogram matching on density gradients in the sequential mode. The ideal 1D density histogram and 1D gradient histogram are obtained from the projection of the ideal 2D histogram along the gradient and density axes, respectively. Starting from an initial structure factor set, F0, an electron density map, ρ0, is calculated by a fast Fourier transform, ℱ. The initial map, ρ0, is modified by the 1D histogram matching, H, to produce a new map, ρ0′, whose density histogram conforms to the ideal density histogram. The modified map, ρ0′, is inverse Fourier transformed, ℱ⁻¹, to give a new structure factor set, F1, from which three gradient maps, gx, gy, and gz, along each crystal axis are calculated. The three gradient maps are transformed from the crystal axes system to the orthogonal axes system by the orthogonalization matrix, O. The transformed gradient maps, gu, gv, and gw, are modified by the 1D histogram matching on gradients, similar to that on density, to produce new gradient maps, gu′, gv′, and gw′, whose gradient distribution matches the ideal gradient distribution. The new gradient maps are then transformed from the orthogonal axes to the fractional axes by the deorthogonalization matrix, D, to produce gx′, gy′, gz′. Three sets of structure factors, Fx′, Fy′, Fz′, are obtained from each of the three new gradient maps, gx′, gy′, gz′, by inverse Fourier transforms. These three sets of structure factors, Fx′, Fy′, Fz′, are combined to produce a new set of structure factors, F2, from which a new map is calculated. The process is iterated until the 2D histogram of the modified map matches that of the ideal 2D histogram.
Whereas random errors can be modeled statistically and are therefore easier to eliminate, systematic errors vary from case to case and are difficult to model and therefore more difficult to eliminate. Phase refinement and extension results from phases with random errors will demonstrate the upper limit of phase improvement when no systematic errors are present in the MIR phases. Tests on 2Zn insulin have shown that employing extra constraints based on the density gradients can further reduce the phase errors and improve the overall map quality. For phase refinement and extension to high resolution, the results showed that 2D histogram matching improves the phases more than 1D histogram matching. The phase improvement for the refinement and extension of MIR phases from 1.9 to 1.5 Å was 9.6°. However, the difference between the 2D and 1D histogram matching methods decreases at medium resolution. The phase improvement for the refinement and extension of MIR phases from 3.0 to 2.0 Å was 6.2°. Although the results using the MIR phases of 2Zn insulin provided a typical example of phase refinement and extension, the results from the phases with random errors gave an upper limit of phase improvement when no systematic errors are present in the MIR phases. When tested on phases with randomly generated errors, the 2D histogram matching method improved the 1.9- to 1.5-Å phases by 34.2° versus the 1D histogram matching method. This demonstrates the importance of eliminating systematic errors during any phasing process. The 2D histogram specifies not only the probability of the electron density value for a given grid point in the map but also the local environment around that grid point as reflected by the density gradients, which arise from chemical bonding. It also provides a means of decoupling the electron density order between the modified and original maps through sequential incorporation of the electron density gradient distribution. Test results have demonstrated that the increased sensitivity to phase error in the 2D histogram of the density and gradient can be translated into an improved density modification method. The method used to achieve 2D histogram matching is computationally efficient. The density and gradient maps, and the corresponding structure factors, can be efficiently transformed back and forth through fast Fourier transform techniques. More importantly, 2D histogram matching can be efficiently achieved by applying 1D matching on density and density gradients alternately. The strategy of using alternating 1D histogram matching to achieve 2D histogram matching can be generalized to the matching of higher dimension histograms. By exploring histograms of the density and its higher-order derivatives, it may be possible to obtain a density modification method with further enhanced effectiveness in phase determination, refinement, and extension.
[10] Docking of Atomic Models into Reconstructions from Electron Microscopy

By Niels Volkmann and Dorit Hanein

Introduction
In the three-dimensional structure determination of macromolecules, X-ray crystallography covers the full range from small molecules to large assemblies with molecular masses of megadaltons. The limiting factors are expression, the stability and homogeneity of the structure, and subsequent crystallization. In the case of nuclear magnetic resonance (NMR), structures can be determined from molecules in solution, but the size limit, although increasing, is presently on the order of 100 kDa. Dynamic aspects can be quantified, but again the structures of mixed conformational states cannot be determined. Owing to dramatic improvements in experimental methods and computational techniques, electron microscopy (EM) has matured into a powerful and diverse collection of methods that allow visualization of the structure and dynamics of an extraordinary range of macromolecular assemblies at resolutions spanning from molecular to near atomic.1–6 In addition, cryomethods enable the observation of molecules under nearly physiological conditions in their native aqueous environment.7 Although not hampered by many of the limitations of NMR or crystallography, EM imaging is limited to lower resolution for most biological specimens (10–30 Å), thus precluding atomic modeling directly from the data. Still, atomic models often can be generated by combining high-resolution structures of individual components in a macromolecular complex with a low-resolution structure of the entire assembly.8 Analysis of the resulting models can lead to new hypotheses and contribute important insights into interaction and regulation of the individual molecular
1 W. Kühlbrandt and K. A. Williams, Curr. Opin. Chem. Biol. 3, 537 (1999).
2 W. Chiu, A. McGough, M. B. Sherman, and M. F. Schmid, Trends Cell Biol. 9, 154 (1999).
3 W. Baumeister, R. Grimm, and J. Walz, Trends Cell Biol. 9, 81 (1999).
4 W. Baumeister and A. C. Steven, Trends Biochem. Sci. 25, 624 (2000).
5 H. Stahlberg, D. Fotiadis, S. Scheuring, H. Remigy, T. Braun, K. Mitsuoka, Y. Fujiyoshi, and A. Engel, FEBS Lett. 504, 166 (2001).
6 H. R. Saibil, Nat. Struct. Biol. 7, 711 (2000).
7 J. Dubochet, M. Adrian, J.-J. Chang, J.-C. Homo, J. Lepault, A. W. McDowall, and P. Schultz, Q. Rev. Biophys. 21, 129 (1988).
8 T. S. Baker and J. E. Johnson, Curr. Opin. Struct. Biol. 6, 585 (1996).
components. Thus the combination of atomic resolution structures with EM provides a powerful tool to gain insight into cellular processes.

The Fitting Problem
The combination of high-resolution features of individual components with the more complete picture of macromolecular assemblies that EM produces requires fitting of the atomic structures of the components into the density provided by EM. According to Numerical Recipes in C,9 a genuinely useful fitting procedure should provide the following items:

1. The best possible fitting parameters (globally)
2. Error estimates on these parameters
3. A statistical measure for goodness-of-fit (a confidence interval)

To illustrate the rationale behind this statement, suppose that the third item suggests that a certain "best fit" satisfies the data just as well as any other fit. Providing the best-fitting parameters (item 1) would then be basically meaningless. Items 2 and 3 are especially important in the context of inferring high-resolution information from fitting into low-resolution reconstructions as determined by EM. The accuracy and reliability of the conclusions depend highly on the error margin and confidence of the fit. Despite the importance of items 2 and 3, most of the work on fitting as of today has focused on item 1.

Fitting Methods
The current practice for fitting atomic models into EM reconstructions can be divided into three classes: (1) interactive manual fitting with various degrees of refinement and quality assessment, (2) semiautomatic fitting based on a reduced vector representation of the model and the data, and (3) automated fitting based on density-correlation measures with optional filtering operations. Until recently, interactive manual fitting with optional refinement was the method of choice. However, the other two methodologies are gaining significantly in popularity. Correlation-measure-based fitting approaches, in particular, are available in a variety of implementations. A comparison between various computational fitting methods was performed using calculated error-free densities.10 The comparison gives an
9 W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, "Numerical Recipes in C: The Art of Scientific Computing." Cambridge University Press, Cambridge, 1988.
10 W. Wriggers and S. Birmanns, J. Struct. Biol. 133, 193 (2001).
assessment of the fitting precision of these methods in one particular implementation in an error-free environment. Unfortunately, this comparison is of only limited use for practical purposes. One of the main concerns in fitting atomic structures into EM reconstructions is the presence of fitting artifacts due to experimental errors in the EM data. Implementations of fitting functions that perform well on error-free, perfect data may fail completely in a noisy environment. In particular, edge enhancement algorithms such as convolution with Laplacian operators11 are known to be notoriously sensitive to noise.12 Another potential source of fitting artifacts that was not addressed in the comparison is the tendency of molecules to exhibit local conformational changes (induced fit mechanism) on complex formation.13 In this chapter we first give an overview of the main features of the three classes of fitting methods, with real-life applications in mind. We try to pinpoint potential problem areas of these methods in dealing with experimental data that carry systematic and random errors. We then give a more detailed example of one particular implementation of a density correlation measure. Last, we present the concept of solution sets,14,15 which can provide error estimates (item 2) and confidence intervals (item 3) for fitting of atomic models into reconstructions from EM, and builds the basis for a more quantitative assessment of the final fits.

Manual Fitting
Interactive manual fitting is widely used for combining atomic structures with EM reconstructions.16–23 In this approach, the fit of the model into isosurface envelopes is judged by eye and corrected manually, using
11 W. Wriggers and P. Chacon, Structure 9, 779 (2001).
12 M. Seul, L. O'Gorman, and M. Sammon, "Practical Algorithms for Image Analysis." Cambridge University Press, Cambridge, 2000.
13 T. J. Smith, E. S. Chase, T. J. Schmidt, N. H. Olson, and T. S. Baker, Nature 383, 350 (1996).
14 N. Volkmann, D. Hanein, G. Ouyang, K. M. Trybus, D. J. DeRosier, and S. Lowey, Nat. Struct. Biol. 7, 1147 (2000).
15 N. Volkmann and D. Hanein, J. Struct. Biol. 125, 176 (1999).
16 I. Rayment, H. M. Holden, M. Whittaker, C. B. Yohn, M. Lorenz, K. C. Holmes, and R. A. Milligan, Science 261, 58 (1993).
17 D. Voges, R. Berendes, A. Burger, P. Demange, W. Baumeister, and R. Huber, J. Mol. Biol. 238, 199 (1994).
18 R. Beroukhim and N. Unwin, Neuron 15, 323 (1995).
19 H. Sosa, D. P. Dias, A. Hoenger, M. Whittaker, E. Wilson-Kubalek, E. Sablin, R. J. Fletterick, R. D. Vale, and R. A. Milligan, Cell 90, 217 (1997).
20 A. Hoenger, S. Sack, M. Thormahlen, A. Marx, J. Muller, H. Gross, and E. Mandelkow, J. Cell Biol. 141, 419 (1998).
a modeling program such as O,24 until the fit "looks best." Sometimes this initial fit is refined locally using various reciprocal-space25–28 or real-space scoring functions,29–32 some of which originate in crystallographic refinement or molecular replacement.33 If the components of the assembly under study are large molecules with distinctive shapes at the resolution of the reconstruction, manual fitting often can be performed with relatively little ambiguity.16,34 For example, when the complex of human rhinovirus and an attached Fab was solved by X-ray crystallography, it was found that a model from previous manual docking experiments was accurate to within 4 Å.13 On the other hand, divergent models docked manually in different laboratories using EM data of identical constructs (e.g., microtubule decorated with kinesin) also have been reported.20,35 One obvious disadvantage of the manual docking approach is its subjectivity. Objective scoring functions have been used occasionally to assess the quality of and to refine the initial manual fit. However, local refinement does not necessarily increase precision or resolve ambiguities, because the refinement can easily become trapped in a local maximum close to the initial manual fit that served as a starting point. Local refinement cannot answer the question of whether there is perhaps a better or equivalent fit in some remote corner of parameter space that was missed
21 E. A. Hewat, T. C. Marlovits, and D. Blass, J. Virol. 72, 4396 (1998).
22 R. J. Gilbert, J. L. Jimenez, S. Chen, I. J. Tickle, J. Rossjohn, M. Parker, P. W. Andrew, and H. R. Saibil, Cell 97, 647 (1999).
23 X. Yu, T. Horiguchi, K. Shigesada, and E. H. Egelman, J. Mol. Biol. 299, 1299 (2000).
24 T. A. Jones, J.-Y. Zou, S. W. Cowan, and M. Kjeldgaard, Acta Crystallogr. A 47, 110 (1991).
25 R. H. Cheng, V. S. Reddy, N. H. Olson, A. J. Fisher, T. S. Baker, and J. E. Johnson, Structure 2, 271 (1994).
26 Z. Che, N. H. Olson, D. Leippe, W. M. Lee, A. G. Mosser, R. R. Rueckert, T. S. Baker, and T. J. Smith, J. Virol. 72, 4610 (1998).
27 W. R. Wikoff, G. Wang, C. R. Parrish, R. H. Cheng, M. L. Strassheim, T. S. Baker, and M. G. Rossmann, Structure 2, 595 (1994).
28 M. Mathieu, I. Petitpas, J. Navaza, J. Lepault, E. Kohli, P. Pothier, B. V. Prasad, J. Cohen, and F. A. Rey, EMBO J. 20, 1485 (2001).
29 J. M. Grimes, J. Jakana, M. Ghosh, A. K. Basak, P. Roy, W. Chiu, D. I. Stuart, and B. V. Prasad, Structure 5, 885 (1997).
30 P. L. Stewart, S. D. Fuller, and R. M. Burnett, EMBO J. 12, 2589 (1993).
31 E. Nogales, M. Whittaker, R. A. Milligan, and K. H. Downing, Cell 96, 79 (1999).
32 E. A. Hewat, N. Verdaguer, I. Fita, W. Blakemore, S. Brookes, A. King, J. Newman, E. Domingo, M. G. Mateu, and D. I. Stuart, EMBO J. 16, 1492 (1997).
33 J. Navaza, J. Lepault, F. A. Rey, C. Alvarez-Rua, and J. Borge, Acta Crystallogr. D Biol. Crystallogr. 58, 1820 (2002).
34 T. J. Smith, N. H. Olson, R. H. Cheng, H. Liu, E. S. Chase, W. M. Lee, D. M. Leippe, A. G. Mosser, R. R. Rueckert, and T. S. Baker, J. Virol. 67, 1148 (1993).
35 F. Kozielski, I. Arnal, and R. Wade, Curr. Biol. 8, 191 (1998).
in the initial manual fitting attempt. Only global fitting protocols can address this question.

Fitting Based on Vector Quantization
In this approach the distribution of atoms within the high-resolution structure, as well as the low-resolution reconstruction, is approximated by a small number of vectors (typically about three to six each)36 that are calculated by vector quantization (VQ), a technique used in data compression for image and speech processing applications.37 The use of a small number of vectors reduces the complexity of the fitting problem to a least-squares fit of two coordinate sets, making the method fast. However, the price that must be paid is the loss of information: VQ is known to be a so-called lossy compression technique in its compression implementation.37 In the fitting context, VQ does not preserve the information content of the density but reduces it substantially. In VQ-based fitting, the best fit is selected according to the lowest root–mean–square deviation (RMSD) between the two fitted vector sets. However, the RMSD of matched vector distributions is known to be a poor estimator of the fitting accuracy of the rest of the aligned volumes.38 In addition, the accuracy of the docking and the usefulness of the RMSD as a scoring function are critically dependent on how well the vector distribution represents the atomic structure and the EM reconstruction at the given resolution. Because the exact placement of the VQ vectors is sensitive to experimental noise,11 the method needs to be used carefully in real cases. To account for conformational changes that occur on complex formation, one can perform an atom-based positional refinement after the initial rigid-body fit.10 The movement of the atoms is constrained by a standard molecular dynamics force field and is subject to enforcing a match of the VQ vector distributions. This essentially deforms the atomic structure in the hope that the deformed structure is a better representation of the EM density than the original structure. The procedure introduces a large number of additional degrees of freedom by allowing the atoms to move relatively freely. This can lead to overfitting artifacts, especially because the information content of the data is already reduced by the VQ procedure. This problem can be reduced somewhat by interactively introducing some additional distance constraints (skeletons) during docking.11
36 W. Wriggers, R. A. Milligan, and J. A. McCammon, J. Struct. Biol. 125, 185 (1999).
37 R. Gray, IEEE ASSP Mag. 1 (1984).
38 J. Fitzpatrick, J. West, and C. Maurer, IEEE Trans. Med. Imaging 17, 694 (1998).
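To make the idea concrete, a toy rigid-body version of this scheme is sketched below: k-means stands in for the vector quantization step, the coordinates of above-threshold voxels stand in for the reconstruction, and the vector pairing is found by exhaustive permutation, which is feasible only because the number of vectors is small. This does not reproduce any of the implementations discussed above.

```python
import itertools
import numpy as np
from scipy.cluster.vq import kmeans

def kabsch(P, Q):
    """Least-squares rotation R and translation t mapping point set P onto Q."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Q.mean(axis=0) - R @ P.mean(axis=0)
    return R, t

def vq_fit(atom_xyz, map_xyz, n_vectors=4):
    """Rigid VQ-style fit: reduce atoms and map points to n_vectors codebook
    vectors each, then keep the pairing with the lowest vector RMSD after
    least-squares superposition."""
    code_atoms, _ = kmeans(np.asarray(atom_xyz, dtype=float), n_vectors)
    code_map, _ = kmeans(np.asarray(map_xyz, dtype=float), n_vectors)
    best = (np.inf, None, None)
    for perm in itertools.permutations(range(n_vectors)):
        target = code_map[list(perm)]
        R, t = kabsch(code_atoms, target)
        moved = code_atoms @ R.T + t
        rmsd = np.sqrt(((moved - target) ** 2).sum(axis=1).mean())
        if rmsd < best[0]:
            best = (rmsd, R, t)
    return best  # (vector RMSD, rotation, translation)
```

Here map_xyz could be, for example, np.argwhere(density > cutoff) multiplied by the voxel size; as the text notes, the vector RMSD returned by such a fit is not a reliable indicator of the accuracy of the full-volume alignment.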
A test of the flexible VQ-based fitting method with error-free calculated data10 shows that, even in the absence of noise, distortions of up to 9 Å in a 15-Å resolution map can occur, likely because of an insufficient parameter-to-observable ratio. As a consequence, real-life applications of flexible VQ-based docking tend to lead to unacceptable loss of secondary structure.39 VQ-based fitting is limited to cases in which all density in the EM map is accounted for by the atomic model.11 Missing portions or disordered regions of the atomic model that are present in the EM map need to be modeled and accounted for before VQ-based fitting can be applied. Sometimes the application of iterative rounds of discrepancy mapping14,15 can be used to help alleviate the problem.40 In summary, the high computational speed of the method is appealing, particularly for applications in which accuracy is less of an issue. If accuracy is a concern, the methodology needs to be applied with care in the presence of noise, especially if the flexible fitting option is used. In practice, even rigid-body applications of VQ-based fitting often need to be refined by correlation-based methods in order to make the fit acceptable.41 Several parameters need to be adjusted interactively, making the approach semiautomatic rather than automatic.

Density-Correlation-Based Fitting
Various flavors of automated, global searches using density-correlation measures as fitting criteria are being developed by different laboratories.10,11,14,15,42,43 These flavors vary in the exact mathematical form of the density correlation, in the use of various preprocessing steps including masking and filtering, and in implementation details. Masking operations using the calculated envelope42 or using the atomic model directly43 both enhance high-resolution features and suppress low-resolution information, making them somewhat equivalent to high-pass filtering in Fourier space. The amplification of background noise is a common side effect of high-pass filtering.12 Thus, the success of this type of masking will depend strongly on the noise level in the reconstruction. Convolution with a Laplacian operator11 is also known to be sensitive to, and tends to amplify, noise.12
39 W. J. Rice, H. S. Young, D. W. Martin, J. R. Sachs, and D. L. Stokes, Biophys. J. 80, 2187 (2001).
40 S. A. Darst, N. Opalka, P. Chacon, A. Polyakov, C. Richter, G. Zhang, and W. Wriggers, Proc. Natl. Acad. Sci. USA 99, 4296 (2002).
41 M. Kikkawa, E. P. Sablin, Y. Okada, H. Yajima, R. J. Fletterick, and N. Hirokawa, Nature 411, 439 (2001).
42 A. M. Roseman, Acta Crystallogr. D Biol. Crystallogr. 56, 1332 (2000).
43 M. G. Rossmann, Acta Crystallogr. D Biol. Crystallogr. 56, 1341 (2000).
Although these filters have the potential to boost the signal in the absence of noise,11 they must be used with considerable care if noise is present in the reconstruction. Implementation details can dramatically affect the performance and accuracy of the underlying algorithm. For example, using one particular implementation of a density-correlation measure, the fitting becomes severely inaccurate (more than 10 Å root–mean–square difference from the correct fit) at resolution lower than 15 Å for error-free data.11 The time for completing a typical docking with this implementation is stated as 6 h. A different implementation of essentially the same scoring function14 yields reasonable results even at 100-Å resolution in the presence of noise (root–mean–square difference from the correct fit was 3.5 Å)44 and completes a typical docking task within 5 min on a single-processor Linux box.

Product-Moment Correlation Coefficient as a Fitting Criterion
We now describe in detail a fitting approach based on a global search of parameter space using as a scoring function a real-space, product-moment correlation coefficient (CC).14,15 There are six adjustable parameters involved in the fitting problem of a single rigid body: the three translational parameters x, y, and z, and the rotational parameters α, β, and γ. As a consequence, CC(x, y, z, α, β, γ) is a function of these six parameters. The correlation coefficient used in this fitting approach is defined as

$$
\mathrm{CC} = \frac{\sum_i (a_i - \bar{a})(e_i - \bar{e})}{\sqrt{\sum_i (a_i - \bar{a})^2 \sum_i (e_i - \bar{e})^2}}
\tag{1}
$$

where $a_i$ denotes the density at voxel i, calculated from the atomic model in the trial position; $e_i$ denotes the experimental EM density at voxel i; and the sum is over all i. The overbars denote mean values. The CC is a Pearson-type, product-moment correlation coefficient and is used routinely as a voxel-based similarity measure to align 3D representations of medical data such as those coming from magnetic resonance and computed tomography.45 The CC was shown to be among the best criteria in blind tests using experimental data,46,47 whereas techniques based on surface information performed poorest.48
44 N. Volkmann, in "Biophysical Discussion," Asilomar, www.biophysics.org/discussions/volkmann-speaker.pdf (2002).
45 P. Van den Elsen, E. Pol, T. Sumanawaeera, P. Hemler, S. Napel, and J. Adler, in "Proceedings of Visualization in Biomedical Computing," p. 227. SPIE Press, Rochester, MN, 1994.
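In code, the scoring function of Eq. (1) reduces to a few lines; in a global search it is simply re-evaluated for every trial placement (x, y, z, α, β, γ) of the model density. The sketch below is illustrative and is not the authors' implementation.

```python
import numpy as np

def correlation_coefficient(calc_map, exp_map):
    """Product-moment correlation coefficient of Eq. (1), summed over voxels."""
    a = calc_map.ravel() - calc_map.mean()   # a_i - <a>, model density in trial position
    e = exp_map.ravel() - exp_map.mean()     # e_i - <e>, experimental EM density
    return float((a * e).sum() / np.sqrt((a * a).sum() * (e * e).sum()))
```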
Once the global CC search is done, it is followed by a statistical analysis of the CC distribution. Statistical properties of distributions related to the CC are well characterized and can be used to obtain confidence intervals49 that lead eventually to the definition of "solution sets." These sets contain all fits that satisfy the data within the error margin defined by the chosen confidence level, which implicitly accounts for all error sources in the data and the fitting calculation. Structural parameters of interest such as fitting uncertainty or interaction probabilities50 can be evaluated as properties of these sets. The method can be used to fit modules (domains or substructures) of the assembly individually into the EM reconstruction and thus accounts for conformational changes that can be modeled as relative domain movements.14 Most protein conformational changes involve movements of rigid domains that have their internal structure preserved.51–53 Because the CC is calculated in real space, all types of real-space constraints, including the position of physical labels in the reconstruction such as heavy atom clusters54 or compact protein domains,55 can be easily incorporated into the fitting process by restricting search space accordingly. All information that is contained in the reconstruction is used. No compression, arbitrary cutoffs, or reliance on surface representations are necessary. Biochemical and mutagenesis information can be exploited to assist the fitting process.50

Accuracy of Fitting

To assess the performance of the CC, the crystal structure of the iron-binding 691-residue protein lactoferrin with iron bound (PDB entry 1LFG) was fit into a density calculated from lactoferrin without iron bound (PDB entry 1LFH). This is a good test system, because it displays a
46 C. Studholme, D. L. Hill, and D. J. Hawkes, in "Proceedings of the British Machine Vision Conference" (D. Pycock, ed.), Vol. 1, p. 27. British Machine Vision Association, Southampton, UK, 1995.
47 C. Studholme, D. L. Hill, and D. J. Hawkes, Med. Image Anal. 1, 163 (1996).
48 J. West, J. M. Fitzpatrick, M. Y. Wang, B. M. Dawant, C. R. Maurer, Jr., R. M. Kessler, and R. J. Maciunas, IEEE Trans. Med. Imaging 18, 144 (1999).
49 R. von Mises, "Mathematical Theory of Probability and Statistics." Academic Press, New York, 1964.
50 D. Hanein, N. Volkmann, S. Goldsmith, A. M. Michon, W. Lehman, R. Craig, D. DeRosier, S. Almo, and P. Matsudaira, Nat. Struct. Biol. 5, 787 (1998).
51 W. G. Krebs and M. Gerstein, Nucleic Acids Res. 28, 1665 (2000).
52 M. Gerstein and W. Krebs, Nucleic Acids Res. 26, 4280 (1998).
53 S. Hayward, Proteins 36, 425 (1999).
54 J. F. Hainfeld, J. Struct. Biol. 127, 93 (1999).
55 T. G. Wendt, N. Volkmann, G. Skiniotis, K. N. Goldie, J. Muller, E. Mandelkow, and A. Hoenger, EMBO J. 21, 5969 (2002).
substantial conformational change involving three independently moving rigid-body domains. Size-wise it is somewhere in between the small (about 300 residues) and large (more than 1000 residues) proteins that, while bound to helical filaments such as actin or microtubules, are accessible by EM and image reconstruction techniques that take advantage of the helical symmetry of the filament scaffold for alignment and averaging. Lactoferrin would be too small for reconstruction techniques that do not rely on symmetry (single-particle reconstruction). The smallest particle solved without symmetry so far is the Arp2/3 complex, with about 2000 residues.56 The "gold standard" for the fitting was chosen to be the least-squares fit of the coordinates after dividing the structure into the three rigid-body domains that move relative to each other during the conformational change between 1LFG and 1LFH. This calculation yields a root–mean–square deviation (RMSD) of 0.94 Å if all flexible loops are included. The problem of fitting 1LFG to 1LFH was previously also tackled using the VQ-based flexible fitting approach,10 allowing a direct comparison between the accuracy of the two methods. Modular, CC-based fitting of the atomic model of one lactoferrin conformation into a calculated 15 Å map of the other conformation yields an RMSD of 0.98 Å. This accuracy is essentially the same as that of the least-squares fit using both coordinate sets, our gold standard. The vector-based fitting yielded 4.54 Å without user intervention and 2.72 Å after interactive selection of vectors.10 In problematic regions of the VQ-based fit, the displacement exceeds 9 Å (Fig. 1), making the fit virtually useless for interpretation in that particular region.

Confidence Intervals
It is reassuring that the CC gives such accurate parameter estimates in this test application. However, there are important issues that go beyond the mere finding of the best fitting parameters. Data are generally not exact. They are subject to measurement errors (noise). Thus, typical data never exactly fit the model, even when that model is correct. We need to assess whether or not a model is appropriate, that is, we need to test the goodness-of-fit against some useful statistical standard. The statistical properties of the CC distribution are well characterized.49 In particular, the distribution derived by Fisher’s z-transformation of the CC follows a Gaussian distribution. We can use this fact to estimate confidence intervals. Fisher’s z-transform of the CC is defined by
56 N. Volkmann, K. J. Amann, S. Stoilova-McPhie, C. Egile, D. C. Winter, L. Hazelwood, J. E. Heuser, R. Li, T. D. Pollard, and D. Hanein, Science 293, 2456 (2001).
Fig. 1. Comparison of fits to a calculated map of lactoferrin. The density was calculated at a resolution of 15 Å. The gray chain corresponds to the iron-free conformation that was used to calculate the density. The white chain corresponds to the fitted iron-bound conformation. (a) Correlation-based modular fitting (RMSD 0.98). (b) Vector quantization-based flexible fitting using molecular mechanics rules (RMSD 2.72).10 The blow-up in (b) shows a problematic region in the vector quantization fitting.
$$
z = \frac{1}{2}\,\log\!\left(\frac{1 + \mathrm{CC}}{1 - \mathrm{CC}}\right)
\qquad\text{and}\qquad
\sigma_z = \frac{1}{\sqrt{N - 3}}
\tag{2}
$$
σ_z denotes the standard deviation of the z distribution and N is the number of independent pieces of information in the data (degrees of freedom). N cannot be estimated easily from the EM density because neighboring voxels are not independent, especially if we use oversampling. However, σ_z can be estimated directly from the data if we have more than one data set for the fitting problem. If only a single data set exists, it can be split in two, as commonly done for estimation of EM resolution.57 This is different from the crystallographic cross-validation procedures (e.g., free R factor) that rely on splitting the data into a small test set that is omitted from refinement and a working set used for refinement. Here, we try to generate two independent structures from the same data by splitting it in two (or
57 J. Frank, "Three-Dimensional Electron Microscopy of Macromolecular Assemblies." Academic Press, San Diego, CA, 1996.
more) equal-sized parts and repeat the complete fitting procedure for each of the independent sets. This will give us a CC distribution with a defined standard deviation (related to σ_z). Once we have an estimate for σ_z, we can use the complementary error function to test the hypothesis that a particular CC_i is significantly different from the maximum CC_max:

$$
P_i = \mathrm{erfc}\!\left(\frac{z_{\max} - z_i}{2\,\sigma_z}\right)
\tag{3}
$$

z_max denotes the z value derived from CC_max, z_i that derived from CC_i, and erfc denotes the complementary error function. Now, by choosing a particular confidence level Conf, we can define the solution set {S} as

$$
\{S\}\ \text{is the set of all}\ m(x, y, z, \alpha, \beta, \gamma)\ \text{for which}\ 1 - P_m \leq \mathrm{Conf}
\tag{4}
$$

A model m in the orientation (x, y, z, α, β, γ) is an element of {S} if z_m is not significantly different from z_max.
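The solution-set test of Eqs. (2)–(4) is straightforward to express in code. The sketch below assumes that the CC values of all trial fits and an estimate of σ_z are already in hand; the names are illustrative.

```python
import numpy as np
from scipy.special import erfc

def fisher_z(cc):
    """Fisher's z-transform of a correlation coefficient, Eq. (2)."""
    return 0.5 * np.log((1.0 + cc) / (1.0 - cc))

def solution_set(cc_values, sigma_z, confidence=0.995):
    """Indices of all fits whose CC is not significantly different from the
    best CC at the chosen confidence level, Eqs. (3) and (4)."""
    z = fisher_z(np.asarray(cc_values, dtype=float))
    p = erfc((z.max() - z) / (2.0 * sigma_z))     # Eq. (3)
    return np.nonzero(1.0 - p <= confidence)[0]   # Eq. (4)

# sigma_z may be taken as 1/sqrt(N - 3) for N independent observations, or
# estimated directly from repeated fits against independent half-data-set maps.
```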
Interpretation and Analysis of Solution Sets

In this context, the confidence level can be interpreted as the likelihood of finding the truly correct (or best possible) fit inside the corresponding solution set. In general, the higher the confidence level, the more solutions must be taken into account. In the limiting case, to achieve a likelihood of 1.0 (confidence level, 1.0), that is, to be absolutely sure that the best solution is included, all possible solutions must be included. Otherwise there is always the possibility that the one solution excluded was the correct one. The more restrictive the solution set, the lower the likelihood of having the best fit included. In the limiting case of this extreme, the confidence level is zero when we rely on a single solution (even the one with the best score). This makes sense: the likelihood of picking exactly the correct fit, not even 0.000001 of an angstrom off, is practically zero, no matter how good the fitting procedure. The quality of the fitting can be assessed by analyzing the size (volume) and shape of the corresponding solution set in six-dimensional fitting space. These properties of solution sets and their dependence on the confidence level can be best visualized by plotting P^{1/2} as a function of one of the six docking parameters (termed confidence plots; see Fig. 2). The shape of the sets can be arbitrary and gives information about degeneracies in the fitting. For example, presuming the center of mass is well defined and looking only at the three orientational parameters (α, β, γ), a well-defined fit would yield a spherical solution set (with a small volume). If there is a 2-fold rotational symmetry (or pseudo-symmetry) present, the solution set would consist of two disjointed regions. If the fitting is less well defined around
Fig. 2. Application of the z-transform. This shows an angular scan through the correlation landscape (a) from a docking experiment using a helical reconstruction of microtubules decorated with kinesin and the corresponding atomic model. There are two local maxima showing up in this scan, at 0° (correct solution) and 90°. The correlation (CC) does not fall below 0.7 during the whole scan. The correlation plot is difficult to interpret: it is difficult to tell whether the secondary maximum should be considered a valid solution and what the angular uncertainty would be. The use of the z-transform yields a confidence plot (b) that is much more straightforward to interpret. Plotted is the square root of the confidence measure P [Eqs. (3) and (4)]. The dashed line indicates a confidence level (1 − P) of 99.5%. Every angle that has a value above that line is part of the solution set; every angle with a value below the line can be considered significantly different from the best solution at this confidence level and is not part of the solution set. The angular uncertainty can be estimated from the width of the solution set (here, about 15°). If the confidence level is raised, the line will move down and more solutions will need to be considered. In this example, increasing the confidence level to 99.9% would move the line down far enough so that the secondary solution needs to be considered. At the 99.5% confidence level, the secondary solution does not need to be considered.
one of the axes, the solution set would take the shape of an ellipsoid; the permissible solutions would be smeared out around that axis. A good tool for analyzing the solution sets and for detecting 6D degeneracies is cluster analysis of the best few hundred refined solutions from the global fitting protocol. We use the pairwise RMSD between these fits as a distance measure in a modified agglomerative hierarchical clustering approach. The major modification from the original procedure58 consists of weighting by the individual fitting scores during the cluster merging step. The procedure sorts the fits into clusters with similar orientations and positions, biased toward the solution with the highest score in that particular cluster. This allows efficient detection of local correlation maxima. An inspection of the resulting dendrogram (Fig. 3) gives a good indication of the size and partitioning of the respective solution set.
58 R. Sokal and P. Sneath, "Principles of Numerical Taxonomy." W. H. Freeman, San Francisco, 1963.
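A plain, unweighted version of such a clustering step can be sketched with standard hierarchical clustering tools; the weighting of the cluster merging by the individual fitting scores, which is the modification described above, is omitted here, and the distance cutoff is an assumed parameter.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_fits(pairwise_rmsd, cutoff):
    """Group the best refined fits by the pairwise RMSD between the fitted
    models, using average-linkage agglomerative clustering."""
    condensed = squareform(np.asarray(pairwise_rmsd, dtype=float), checks=False)
    Z = linkage(condensed, method="average")
    labels = fcluster(Z, t=cutoff, criterion="distance")
    return Z, labels   # Z can be passed to scipy.cluster.hierarchy.dendrogram
```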
Fig. 3. Cluster analysis of the correlation distribution. The mean distance between all atoms of the fitted structures is used as a distance criterion for the clustering. The relative strength of the correlation between the corresponding solution and the density is indicated below the zero-distance line of the dendrogram. The higher the correlation, the longer the bar. Any solution with a bar extending into the white area is part of the solution set. The cluster analysis finds local correlation maxima and identifies degeneracies in the fitting. Each cluster corresponds to a local maximum. The various maxima often relate to each other by low-resolution pseudo-symmetry. Of the four local maxima identified in this analysis, three are part of the solution set (A, C, and D). The correct maximum (here subcluster D) is usually associated with a subcluster that has a large number of solution set members with a small mean distance between them. This example is taken from docking the atomic structure of the kinesin motor ncd into the corresponding density isolated from helical reconstructions of ncd-decorated microtubules.55
Once solution sets are determined, parameters of interest are extracted from the sets with all members contributing equally. For example, the center-of-mass position of the fitted model and its experimental uncertainty can be estimated by calculating the mean and standard error of the center-of-mass positions for all solution set members. Similarly, the orientation parameters can be extracted. Because estimates for the expectation value as well as the standard deviation are available, standard statistical tests such as the Student t test can be used to test for the significance of differences between solution sets.14

Robustness of Solution Sets

The final outcome of the fitting procedure described here is a solution set. Once we have decided on a suitable confidence level, we are not interested in the actual values of the CC anymore. Also, the exact location of the global maximum is not of major interest. The only thing one needs to know is whether a particular CC is significantly different from the global CC
maximum. In other words, is this CC a member of the solution set or not? The size and shape of the solution set determine the outcome of the structural interpretations (e.g., fitting uncertainties, interaction probabilities, and so on). It is therefore important to understand how robust the size and shape of the estimated solution set are. In particular, the potential difference in scale between the EM reconstruction and the atomic model has been perceived as a factor that can significantly influence fitting results.43 Note that the solution set will not change if we only shift the global maximum within the set. The solution sets are always more robust estimators than the actual maximum. In the solution set approach, all error sources are automatically accounted for without the need for explicit error modeling. Reduction of experimental errors will show immediately in the size and shape of the solution sets (Fig. 4). The influence of various factors on the size and shape of solution sets was assessed by using calculated data from lactoferrin (1LFH). Because error-free data were used for this assessment, an experimental estimate of σ_z cannot be obtained. Instead, the value of N was estimated. A crystallographic FFT (space group P1) was calculated using a cubic box with twice the maximum diameter of the molecule. With this sampling, the structure factors of the molecular transform should be spatially uncorrelated; any higher sampling (larger box) will result in correlation between the structure factors. The number of structure factors at this sampling should therefore be a fair estimate of the degrees of freedom involved in the molecular transform (which is equivalent to the real-space density by Fourier transform). Comparison with σ_z derived from experimental actomyosin data14 indeed indicated that the number of structure factors up to the resolution in question can be used as a rough estimate for N. A potential problem with using this approximation in practical applications is that the resolution of EM reconstructions is not always well defined. In crystallography, inspection of the X-ray diffraction pattern gives a good indication of where the resolution limit is exceeded and the signal disappears (no more spots). EM reconstructions are not necessarily subsampled in Fourier space but are based on continuous Fourier transforms. In such a case, it is much less straightforward to determine when the signal disappears into the noise. Therefore, an experimental determination of σ_z is preferable, because this is independent of the resolution estimation and also potentially accounts for undetected systematic errors. To test the influence of potentially disrupting parameters on the solution sets, the calculated data were perturbed accordingly and then the CC distribution was reevaluated using the model for 1LFH as derived from modular fitting. The 6D volumes of the solution sets were roughly spherical with respect to orientation and position.
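A rough, back-of-the-envelope version of this degrees-of-freedom estimate (a sketch under the stated box and sampling assumptions, not the authors' calculation): for a cubic P1 cell of edge a = 2D, the number of reciprocal-lattice points inside the resolution sphere of radius 1/d_min is approximately (4π/3)(a/d_min)³, which can then be inserted into Eq. (2).

```python
import numpy as np

def estimate_n(max_diameter, d_min):
    """Approximate number of structure factors (degrees of freedom) for a cubic
    P1 box of edge 2 * max_diameter sampled to resolution d_min: the count of
    reciprocal-lattice points (spacing 1/a) inside a sphere of radius 1/d_min."""
    a = 2.0 * max_diameter
    return (4.0 / 3.0) * np.pi * (a / d_min) ** 3

# Illustrative numbers only: a molecule of 90 Angstrom maximum diameter
# in a 15 Angstrom resolution map.
sigma_z = 1.0 / np.sqrt(estimate_n(90.0, 15.0) - 3.0)
```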
Fig. 4. Influence of various parameters that can be potential sources of systematic errors on solution sets. Shown are scans of the square root of the confidence measure P [Eqs. (3) and (4)] along an arbitrary translation direction (confidence plots). The dotted line parallel to the x axis corresponds to a confidence level of 99.5%. Everything above that line is a member of the solution set defined by that confidence level. The zero on the x axis denotes the position of the correct fit. The dashed graph represents the confidence plot without perturbation at 15 Å. The solution set radius is the x projection of the point where the confidence line (dotted line for 99.5%) and the graph cross. The solution set radius for unperturbed data at 15 Å is about 0.5 Å. (a) Influence of perturbing parameter N of Eq. (2). (b) Influence of scaling errors. The step size is 1%. (c) Influence of errors in reciprocal-space amplitude fall-off estimation. Step size, 50 Å². (d) Influence of resolution limitation. Step size, 5 Å. (e) Additive Gaussian noise. Step size is 0.1 of mean protein density. SNR denotes signal-to-noise ratio. (f) Influence of lowering the contrast between protein and surroundings. Step size, 0.2 of mean protein density.
Thus, the volume of the solution sets can be accurately described by a radius. In the following we parameterized this radius in terms of permitted translational displacements within the solution sets. Two types of behavior can be distinguished (Fig. 4).

1. Most of the parameters display an approximately linear relationship with the solution set radius (solution set radius increase, RI). These parameters include: scaling (0.1-Å RI per 1% misestimation), reciprocal-space intensity fall-off (0.1-Å RI per 50-Å² misestimation in B factor), resolution (1-Å RI per 5-Å loss in resolution), and additive Gaussian random noise (0.1-Å RI for every 10% of mean protein density).

2. Some parameters do not change the sets at all, up to a certain threshold value. Then the effect is dramatic. These parameters include
sampling (smaller than one-fifth of the resolution: no effect) and contrast (as long as the outside density is below the mean protein density: no effect). The parameter N deserves special attention because it is actually used in calculating the confidence interval [Eqs. (2) and (3)]. Reevaluation of the solution sets using 2N and 0.5N in the formula indicates that this changes the solution set radius by only 0.1 Å (Fig. 4a). A misestimation of a factor of two in N (or a corresponding misestimation of σ_z) is not critical. An approximate N is fully sufficient to generate meaningful solution sets. All in all, the solution set size and shape are remarkably robust concerning these parameters, especially if one considers the likely size of those errors in practical applications. Because of the large amount of averaging in most EM reconstruction techniques, the signal-to-noise level should be well above 5, meaning that the corresponding Gaussian noise would increase the solution set radius by less than 0.5 Å. The misestimation in reciprocal-space fall-off is not likely to be much worse than 200 Å² (RI < 0.4 Å). Scaling (i.e., how to convert the voxel size in an EM reconstruction into angstroms) is usually accurate within 1–3%59 (therefore RI < 0.3 Å) and can be improved if additional information (such as the known layer-line positions of filamentous structures added to the sample) is introduced.59 The solution set concept, in conjunction with real-space correlation scoring and evaluation of confidence intervals, leads to more meaningful parameter and uncertainty estimates. The size of the solution set can serve as a normalized goodness-of-fit criterion. The smaller the set, the better the data determine the position of the fitted atomic structure. The statistical nature of the approach allows the use of standard statistical tests, such as the Student t test, to evaluate differences between models of assemblies in different functional states and to gain deeper insight into biological problems than previously possible.

Validation of Results
If the resolution is high enough to resolve residues, and a structure is fit into such a map, one can employ tests for validity based on stereochemistry. At a resolution lower than 10 Å, no side chains or even secondary structure elements can be recognized in the density map. No mechanism exists that can be used to validate fits into maps with resolution lower than 10 Å; there is no molecular Ramachandran plot to help evaluate the quality of a fit. Evaluation of close contacts has been proposed43 but is of limited
59 R. A. Milligan, M. Whittaker, and D. Safer, Nature 348, 217 (1990).
use because substantial local conformational changes close to the interface often occur on complex formation.34 If independent information from labeling, mutagenesis, or other biophysical and biochemical experiments is available, these data can be used to validate the fitting. If the fitting is ambiguous, this additional information can be used actively in the fitting in order to resolve these ambiguities.50 If no external information is available, the only validation tools that can be used are checks for self-consistency, using multiple data sets or by splitting the original data set into multiple parts.

Application Examples
In practice, each docking problem will tend to be different in one way or another, and it is essential to keep in mind that the main objectives are to extract the maximum amount of reliable information while avoiding overinterpretation of the underlying data. In the following sections we describe two application examples to demonstrate the use of the solution set concept and the modular docking approach toward this end.

Actin-Bound Smooth Muscle Myosin

Myosins are a superfamily of actin-based molecular motors, ubiquitous in animal cells. Interaction of myosin with filamentous actin has been implicated in a variety of biological activities including muscle contraction, cytokinesis, cell movement, membrane transport, and certain signal transduction pathways. The filamentous nature of actomyosin complexes has so far hampered all crystallization attempts but also makes this structure an ideal target for EM and helical reconstruction techniques. We analyzed two different strong-binding states of actomyosin (ADP and nucleotide-free, rigor) by electron cryomicroscopy and helical reconstruction.14 The resulting 3D maps had a resolution of approximately 21 Å. Several crystal structures of myosin fragments (more than 20) in different conformations and from different organisms are available. These include a total of three different conformations of fragments with similar length and composition to that used for the EM experiment. Docking of these crystal structures into the EM reconstructions showed that both actin-bound conformations are distinctly different from all unbound myosin conformations imaged by crystallography.14 Analysis of the crystal structures allowed us to define rigid-body domains that move independently during the conformational changes observed by crystallography. The two most significant rigid bodies are the motor domain (MD), which contains the actin-binding interface,
and the light-chain domain (LC), which is believed to act as a lever arm during force production. We divided the structure into three rigid-body domains (MD, LC, and a small domain called the converter) and used the modular fitting approach to model the two actin-bound conformations. A movie of the modular fitting procedure applied to the rigor reconstruction can be viewed at www.burnham.org/papers/actoS1/modu.mov. We used multiple data sets and crystal structures for validation of the solution sets. For example, the fitting of the MD module into the rigor reconstruction was repeated for 12 of the available crystallized MD fragments and into 4 different maps, making 48 independent docking experiments. The resulting solution sets are of well-defined, near-spherical shape and are essentially identical (by t test) for all 48 docking experiments (RMSD, 2.9 Å). Similarly, the docking into the ADP maps results in well-defined, identical solution sets for all 12 crystal structures (RMSD, 2.2 Å). However, a t test between the various rigor and ADP solution sets always shows a significant difference for the MD docking. The difference corresponds to a 9° rotation, a relatively subtle change at 21-Å resolution that went undetected by a previous study with similar-quality reconstructions of the same constructs.60 The use of solution sets and the associated statistical tests was essential in detecting this movement. The solution sets for the subsequent LC docking were less well defined than those for the MD. The main contribution to the spread of solutions was a rotational uncertainty parallel to the actin filament axis (Fig. 5). The amount of uncertainty correlates with the distance from the MD connection. The RMSD of the connection point itself was similar to the RMSD in the respective MD. Note that there was no restriction on the movement of this connection point. The fact that the beginning of the LC domain is well determined while the end allows more variation indicates that the LC can adopt multiple conformations that are pivoted at the same place in the MD and consist primarily of a rotational freedom parallel to the filament axis. This interpretation is independently supported by a variance analysis for helical reconstructions61 applied to smooth muscle actomyosin.14 We tested the Laplacian filter on the MD docking of the rigor data sets. Consistent with results with calculated densities,62 the absolute differences between the correlation values were larger than for density-only correlation. However, analyzing the two correlation distributions for confidence levels using the z-transform reveals that the solution set size is actually
60 M. Whittaker, E. M. Wilson-Kubalek, J. E. Smith, L. Faust, R. A. Milligan, and H. L. Sweeney, Nature 378, 748 (1995).
61 L. E. Rost, D. Hanein, and D. J. DeRosier, Ultramicroscopy 72, 187 (1998).
62 P. Chacon and W. Wriggers, J. Mol. Biol. 317, 375 (2002).
Fig. 5. Solution set for actin-bound smooth muscle myosin.14 (a) Representation of the solution set (several representative solutions) within the experimental density for the rigor state. The MD and LC domains were fitted independently, using a modular docking approach. The line shows the approximate location of the interface between MD and LC. The uncertainty within the MD is homogeneous; the uncertainty in the LC correlates with the distance from the MD/LC interface. (b) Component of RMSD from rotations perpendicular (dashed line) and parallel (solid line) to the actin filament axis as functions of the distance from the MD/LC interface. The plot indicates that the main component of the uncertainty is rotational, parallel to the filament axis and pivoted close to the MD/LC interface.
somewhat larger (0.5 Å) for the data subjected to Laplacian filtering. Thus, there is no advantage in using a Laplacian filter for these data.

Actin-Binding Domain of Fimbrin
Fimbrin is a member of a large superfamily of actin-binding proteins and is responsible for cross-linking actin filaments into ordered, tightly packed 3D networks such as the actin bundles in microvilli or the stereocilia of the inner ear. Similar to actomyosin, the tendency of this complex to form higher-order structures hampers crystallization but also makes it a good candidate for EM and image analysis. Helical reconstructions of actin decorated with one of the actin-binding domains of fimbrin yielded 3D maps at about 25-Å resolution.63
63 D. Hanein, P. Matsudaira, and D. J. DeRosier, J. Cell Biol. 139, 387 (1997).
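To make the Laplacian filter test discussed above concrete, the following is a minimal sketch of what Laplacian-filtered correlation scoring involves, assuming the atomic model has already been converted to a calculated density on the same grid as the EM map. The function and variable names are illustrative; this is not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import laplace

def normalized_correlation(a, b):
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float((a * b).mean())

def score_fit(em_map, model_map, use_laplacian=False):
    """Correlation score between an EM map and a model-derived map (3D numpy arrays)."""
    if use_laplacian:
        # Contour-enhanced matching: correlate Laplacians of both maps instead of the
        # raw densities; sharper discrimination, but more sensitive to noise.
        return normalized_correlation(laplace(em_map), laplace(model_map))
    return normalized_correlation(em_map, model_map)

rng = np.random.default_rng(0)
em = rng.random((32, 32, 32))
print(score_fit(em, em), score_fit(em, em, use_laplacian=True))   # both 1.0 for identical maps
```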
A crystal structure of the same actin-binding domain of fimbrin lacking the N-terminal 110 residues (of a total of 375 residues)64 was docked into the fimbrin portion of the maps to elucidate the interactions between fimbrin and actin and to locate the missing N-terminal domain.50 The analysis of the solution sets for this docking problem indicated a partitioning of the set into two separate subsets and an additional degeneracy along the long axis of the density (Fig. 6). According to the confidence level analysis, the two subsets are equally likely to contain the correct solution and the angle around the long axis is arbitrary. A closer examination of the solution sets using cluster analysis reveals that the solutions are arranged in four subclusters, clustering around four distinct local maxima, three of which are part of the solution set. These maxima are related to each other by low-resolution pseudo-symmetry around the principal axes of the molecule (Fig. 6b). To resolve the ambiguity of this docking, we constructed an additional scoring function based on biochemical and mutagenesis data.50 Using this function, we were able to rule out one of the solution subsets and to restrict the angular range of the second subset.50 The correct subcluster turned out not to contain the absolute maximum but only the third highest local maximum. We repeated the analysis for different resolution ranges to validate these results and check consistency. The organization of the solutions into four subclusters occurred for all resolutions between 25 and 40 Å, and the correct subcluster always came out as the second or third highest local maximum. We tested the use of Laplacian filtering for this docking problem in order to assess its performance in a real-life difficult case (Fig. 6c). The Laplacian-based docking partitioned into three subclusters. The first subcluster corresponds to the absolute maximum of the density-only docking. Surprisingly, the second highest local maximum corresponds to a solution that places the molecule partially outside the density. The third maximum corresponds to the second highest maximum of the density-only docking. Only the first maximum is part of the Laplacian solution set. The correct local maximum does not show up at all. Lowering the resolution improves the situation somewhat: maxima that place the molecule partially outside the density are less frequent and the correct maximum shows up occasionally. However, there is no consistent partitioning of the clustering, nor are there consistent maxima that show up at all resolution ranges, except for the absolute maximum. This analysis clearly shows the dangers of Laplacian filtering. If one relied on an analysis using this technique, one would have to conclude that the absolute maximum is the correct solution. However, this solution is incompatible with mutagenesis and biochemical data.
64 S. C. Goldsmith, N. Pokala, W. Shen, A. A. Fedorov, P. Matsudaira, and S. C. Almo, Nat. Struct. Biol. 4, 708 (1997).
Fig. 6. Docking of the actin-binding domain of fimbrin into density from helical reconstructions of an actin–fimbrin complex.50 (a) Analysis of the solution set from density-based docking. Top: Cluster analysis. It indicates the existence of four main local maxima (mean distance less than 10 Å), three of which (X, Y, and Z) are part of the solution set. These three maxima are related to each other by 180° rotations around the principal axes of the molecule shown in (b) (top): X is related to Y, Y to Z, and Z to X through rotations about these axes. Analysis of biochemical and mutagenesis data indicates that Z is the correct maximum. The orientation of Z is shown with the principal axes and the location of the two CH domains in (b) (top). The other three parts of (b) show the orientations of X, Y, and Z within the density of the helical reconstruction. The middle portion of (a) shows the confidence plot of a scan around the long axis of the density. Solution Z appears at 0° and solution X at 180°. The dashed line corresponds to a confidence level of 99.5%. There is no orientation that can be safely ignored, even if the confidence level were lowered (the dashed line raised) considerably; this axis is degenerate. The bottom part of (a) shows the confidence plot around another of the principal axes. Solution Z appears at 0° and solution Y at 180°. The dashed line corresponds to a confidence level of 99.5%. The solution set partitions into two distinct subsets around Z and Y. (c) Results of Laplacian filter-based docking. Top: Cluster analysis; the remaining portions show the three resulting local maxima (U, V, and W) within the experimental density. Only cluster U is part of the Laplacian-based solution set according to the analysis of the correlation distribution.
If those data were not available, the mistake would go unnoticed. The solution set analysis of unfiltered data, on the other hand, clearly exposes the degeneracy in the docking and identifies the problem.
Conclusions
Apart from interactive manual fitting, a number of semiautomatic and automatic procedures are available for fitting atomic models into reconstructions from electron microscopy. Vector quantization leads to fast algorithms but suffers from potential inaccuracies. Correlation-based approaches are somewhat slower but tend to be more accurate. Masking and filtering operations can enhance the signal-to-noise ratio in the correlation under favorable circumstances but are also more sensitive to noise artifacts. The concept of solution sets leads to the possibility of defining confidence intervals and error margins for the fitting parameters. Because no independent information is available on how a correct fit should look, the definition of confidence intervals and error margins is particularly important in the context of docking atomic structures into low-resolution density maps. For validation of results, tests of self-consistency (cross-validation) using multiple independent reconstructions or splitting of the original data set in two are options.

Acknowledgments
This work was supported by NIH research grants AR47199 (D.H.), U54 GM64346 (Cell Migration Consortium; D.H., N.V.), and GM64473 (N.V.).
[11] ARP/wARP and Automatic Interpretation of Protein Electron Density Maps
By Richard J. Morris, Anastassis Perrakis, and Victor S. Lamzin

Introduction
X-ray crystallography has become a routine tool to aid the investigation of biological phenomena at the atomic level. Once phase information becomes available for the measured structure factor amplitudes, a three-dimensional image of the diffracting electronic matter may be computed. From this electron density distribution a chemically sensible model of the molecule must be derived. However, the initial phase estimates are often poor. Model building affords a means to improve these phases by providing a set of atoms whose parameters may be refined according to some optimization residual and a set of stereochemical restraints, the latter being needed for the refinement to proceed smoothly against diffraction data extending to less than atomic resolution. In this chapter, phase improvement coupled with automated map interpretation and model building is presented as one unified process within the framework of the Automated Refinement Procedure (ARP/wARP) software suite.1

Theory
Pattern Recognition
Map interpretation and model building are pattern recognition problems that consist of mapping features of an electron density distribution onto a chemical model of the molecule under study. These features and their significance depend on the information content of the data, which necessarily depends on the resolution of the diffraction pattern. For resolution around 1.8 Å or higher—electron density maps of good quality, in which density peaks correspond well to atomic centers—map interpretation is an exercise in connecting points to produce well-known covalent geometry. At medium resolution, about 2.5–3.5 Å, atoms lose their individuality and connectivity becomes the important feature to search for.
1 V. S. Lamzin, A. Perrakis, and K. S. Wilson, "International Tables for Crystallography: Crystallography of Biological Macromolecules" (M. Rossmann and E. Arnold, eds.), p. 720. Kluwer Academic, Dordrecht, The Netherlands, 2001.
In the low-resolution range, map interpretation may simply reduce to distinguishing between a macromolecule and its surrounding solvent. In general, pattern recognition2 may be seen as a mapping that takes a continuous input variable f and assigns it to one class ω_k of a finite set of (predefined) output classes Ω = {ω_i | i = 1, . . . , C}. C is the cardinality of the set, |Ω|, which is the number of chosen classes. The pattern recognition function maps feature space F to the classification set Ω. The goal is to extract from a wealth of information only those properties that are interesting to the problem. The set of features chosen to drive classification is often referred to as a feature vector f = (f_1, f_2, . . . , f_D). D is the dimension of feature space, which is the number of features chosen. For map interpretation at medium resolution, classification space may consist of the classes {helix}, {strand}, {loop}, and {nonprotein}. An appropriate feature vector may consist of properties such as the number of, and distances between, density maxima, minima, and saddle points, moments of inertia, and other moments of the electron density distribution. The actual values of a feature vector will be denoted by x = (x_1, x_2, . . . , x_D) and will be called the observed feature vector. An observed feature vector x belongs to class ω_k if and only if the posterior probability of that class is greatest, P(ω_k|x) = max_i P(ω_i|x), where P(ω_i|x) = n P(x|ω_i) P(ω_i), in which n is a normalizing multiplication factor, P(x|ω_i) is the probability (likelihood) of the class ω_i being able to reproduce the observed feature vector x, and P(ω_i) is the prior expectation of this class being observed. The general problem of electron density map interpretation is to find features that can be mapped onto model classes, together with the appropriate mapping functions. For optimal recognition, the features should be as distinct as possible between the individual classes of a given set. A good feature for classification will therefore have a large variance over the entire classification set or, equivalently, different mean values between the classes, and small variances within each class. A standard approach for finding useful features on which to drive classification is initially to choose a large set of all conceivable characteristics. Techniques such as Principal Components Analysis (PCA) may then be employed to analyze the importance of these feature parameters and to introduce new features of higher classification power as linear combinations of the original ones. By choosing only the most significant new features in terms of information content, a large reduction in the dimension of feature space often can be achieved.
2 K. Fukunaga, "Introduction to Statistical Pattern Recognition," 2nd Ed. Academic Press, New York, 1990.
The abundance of structures in the Protein Data Bank3,4 offers sufficient flexibility for the application of learning algorithms to develop, validate, and test possible coordinate-based features.

Problems in Automation
What currently hinders many standard modeling software packages from full automation is the necessity of decisions having to be made by the user. Decision making during the process of model building becomes necessary owing to the failure of current implementations to recognize protein fragments correctly in medium-to-poor quality electron density, or when the fragments or atoms are not placed with sufficient accuracy. This point can be readily shown by adding a random coordinate error to the atomic positions of a refined structure and attempting to find the correct connectivity. Based purely on local geometrical criteria, this becomes increasingly difficult as the inaccuracy of the positions grows, since many pairs of originally nonbonded atoms fall into a valid bonded geometry. This placement error is often caused by poor density. The reasons for poor density are manifold; here we divide them into temporary and permanent. Temporary errors may arise from, for example, poor starting phases from a molecular replacement solution with significant parts of the structure far from the final position or even missing, strong nonisomorphism for MIR, weak anomalous signal in any anomalous scattering technique, and so on. Permanent errors may arise from, for example, disordered loops, nonrandom missing data (e.g., poor resolution; incomplete data), or simply badly measured diffraction data. Both causes will give rise to the same result: the interpretation of the density will be ambiguous and will require a sound knowledge of protein structures and much expertise to be narrowed down successfully to the correct result. Many model-building and modeling packages do a good job of finding connectivity (tracing the main chain) in reasonable density, and database routines then can match the main-chain fragments to a given sequence, build the polypeptide backbone, and place the side chains. In regions of poor density, these methods often break down and one must rely on the eye of an imaginative and experienced crystallographer (an increasingly rare and endangered species) to find the most plausible solution.
3 F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).
4 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Res. 28, 235 (2000).
The above-described distinction between different sources of poor density was made on the basis of our experience with the automated model-building module, warpNtrace,5 of the ARP/wARP package. Through the incorporation of model-building routines into an iterative cycle of refinement, the noninterpretability caused by poor density of the temporary type often can be overcome. If part of the density is correctly interpreted, resulting in a hybrid model consisting of free atoms and the partial model (the atoms that were recognized as part of the protein), subsequent refinement with added restraints on the partial model will provide the next interpretation step with better phases. To initiate this iterative approach, one must have methods that are capable of building parts of a model with reasonable accuracy into density calculated from initial phase estimates. Although a variety of numerical methods have been proposed for dealing with the identification problem in poor density, the most straightforward is to lower the threshold criteria for acceptance of, for example, connectivity, atoms, fragments, and so on. The price to pay for this simplicity is that a significant number of false positives are often introduced, causing the route of the main chain to become ambiguous—we say the chain becomes branched—and requiring some form of further processing. Decisions must be made as to which route to follow, and we refer to the solution of this problem as resolving branch-points.

Model Building and Refinement
In terms of the above-described definitions, the Automated Refinement Procedure (ARP)6 has a simple set of output classes consisting of two elements: (1) a given point in real space is an atomic center; (2) a given point is not an atomic center. With this classification ARP drives its real-space model update based on a feature vector reflecting the density shape. This atomic map interpretation with iterative refinement cycles bears a close resemblance to the successful direct-methods packages SnB7 and SHELX "half-baked."8 The approach has been shown to be an extremely powerful model refinement method enjoying a large radius of convergence. The basic idea of ARP is that the model consists only of what is found in the electron density map. The initial model after map interpretation with ARP consists of a set of atoms that reproduce the density calculated with the current phases. To go from this set of free atoms to a chemical model of the molecule requires a further layer of pattern recognition based on this intermediate interpretation.
5 A. Perrakis, R. J. Morris, and V. S. Lamzin, Nat. Struct. Biol. 6, 458 (1999).
6 V. S. Lamzin and K. S. Wilson, Acta Crystallogr. D Biol. Crystallogr. 49, 129 (1993).
7 C. M. Weeks and R. Miller, J. Appl. Crystallogr. 32, 120 (1999).
8 G. Sheldrick, in "Direct Methods for Solving Macromolecular Structures" (S. Fortier, ed.), p. 401. Kluwer Academic, Dordrecht, The Netherlands, 1998.
It is the macromolecular models, with their rich structural information of atom types and bonds, that have become such an important tool for biology, and not the actual result of a successful diffraction experiment—the electron density—or a representation in terms of just free atoms. A second motivation for model building is that the initial phases often are poor, and substantial improvement is necessary to reproduce faithfully the electron density distribution of the scattering matter within the crystal. A powerful method for improving the phase estimates, and thereby the density, is map interpretation and model building coupled in an iterative manner with refinement—the built model provides restraints with which the diffraction data may be titrated to enhance refinement. This is the underlying idea behind the success of ARP/wARP. The general ARP/wARP flowchart is depicted in Fig. 1.

From Free Atoms to a Protein Model
Our approach is based on atomic entities: the free atoms placed by ARP. Given a set of N candidate positions, S = {x_i | i = 1, . . . , N}, the goal is to find a subset of positions, denoted by M ⊆ S, such that the degree to which the geometrical consequences of this set, G(M), resemble known geometric expectations (prior knowledge) is greatest: M = argmax_{s ⊆ S} [p(G(s))], in which p is some similarity score between the observed and the expected stereochemical parameters—the protein-likeness. Despite the similarity of the equations, note that this formulation reverses the above-described strategy of pattern recognition by now fixing the desired output class (protein) and trying to find the best possible feature vector (geometric quantities within a set of free atoms). One seeks to find a subset of positions that maximizes the protein-likeness of the resulting model. Frequencies for various geometric parameters obtained from an analysis of structures in the PDB can be used as prior geometric expectations. For the purpose of model building, a protein may be thought of as a set of long, nonbranching chains of repetitive units, the main chain or backbone, with a number of short structural units attached to it, the side chains. This is a standard simplification of the problem and one used by most modeling programs. The geometry of the main chain characterizes the tertiary structure of a protein. The main chain itself can be determined to an acceptable degree of accuracy by the positions of the Cα atoms alone.9 Protein model building is therefore often rightly seen as the problem of locating the Cα positions, and we reformulate our task as trying to identify Cα atoms in a set of free atoms.
9 R. M. Esnouf, Acta Crystallogr. D Biol. Crystallogr. 53, 665 (1997).
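A minimal sketch of the protein-likeness idea for a Cα parameterization follows, assuming a single idealized Gaussian prior on consecutive Cα–Cα distances rather than the multidimensional PDB-derived distributions used by ARP/wARP; all numerical values are illustrative.

```python
import numpy as np

EXPECTED_CA_CA = 3.8   # Angstroms, the consecutive Ca-Ca distance in a polypeptide
SIGMA = 0.1            # illustrative spread, not a PDB-derived value

def protein_likeness(ca_positions):
    """Sum of Gaussian log-scores for consecutive Ca-Ca distances along a putative chain."""
    ca_positions = np.asarray(ca_positions, dtype=float)
    d = np.linalg.norm(np.diff(ca_positions, axis=0), axis=1)
    return float(np.sum(-0.5 * ((d - EXPECTED_CA_CA) / SIGMA) ** 2))

good = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.1, 0.0)]
poor = [(0.0, 0.0, 0.0), (2.9, 0.0, 0.0), (8.9, 0.0, 0.0)]
print(protein_likeness(good) > protein_likeness(poor))   # True: the first set is more protein-like
```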
Fig. 1. The ARP/wARP flowchart.
Once the free atoms that best reproduce the expected Cα geometry have been determined, the remaining main-chain atoms can be put in place and the current main-chain hypothesis tested by submitting the current model to refinement against the structure factor amplitudes. These main-chain fragments can then be docked into the sequence and the side chains placed in density.
The Main Chain
We have concentrated on the analysis of expected Cα-backbone geometry (and, as a second class, the expected not-Cα geometry). The parameterization of the problem in terms of Cα geometry represents only an approximation to the problem, but the idea may easily be extended to incorporate as many parameters as is feasible; for more details see Morris et al.10 Multidimensional frequency distributions have been computed from the PDB for all Cα(n)–Cα(n+1)–Cα(n+2)–Cα(n+3) distances, valence angles between nonbonded atoms, and dihedral angles. The distance distributions, together with peptide planarity checks, are used in ARP/wARP to identify Cα–Cα pairs.11 In brief, one searches for pairs of atoms separated by approximately 3.8 Å and checks that there is reasonable density between them at the expected positions of the atomic centers in the peptide plane (Fig. 2A and B). To catch the correct peptide units, a large number of false positives are also accepted. The multidimensional distance and angle distributions derived directly from database analyses are well suited for pattern recognition in sets of accurate candidate positions. The accuracy of the free atom positions is, however, dependent on current phase quality and resolution. The patterns in a free-atoms model differ to a varying degree from those of well-refined structures. Therefore, frequency distributions have been computed that correspond to structures with a wide range of random coordinate errors.
Fig. 2. The main steps in building the main chain. (A) Search for pairs of atoms separated by a distance of approximately 3.8 Å and determine the most likely peptide plane orientation between these atoms (B). This procedure results in a large number of possible peptide planes, indicated by the arrows in (C). Graph-search techniques are then employed to reduce this to the most likely set of nonbranched, nonoverlapping chains (D).
10 R. J. Morris, A. Perrakis, and V. S. Lamzin, Acta Crystallogr. D Biol. Crystallogr. 58, 968 (2002).
11 V. S. Lamzin and K. S. Wilson, Methods Enzymol. 277, 269 (1997).
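A simplified sketch of the pair search of Fig. 2A and B follows, assuming the free atoms and the density grid are already available as numpy arrays; the peptide-plane test is reduced here to a single midpoint density check, and all names and thresholds are illustrative rather than those of warpNtrace.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.ndimage import map_coordinates

def candidate_peptide_pairs(atoms, density, grid_spacing, d_target=3.8, tol=0.4, rho_min=1.0):
    """Return index pairs of free atoms ~3.8 A apart with acceptable density at the midpoint.

    atoms: (N, 3) array of coordinates in Angstroms; density: 3D array on a grid with
    spacing grid_spacing (Angstroms per voxel). Thresholds are illustrative only.
    """
    atoms = np.asarray(atoms, dtype=float)
    tree = cKDTree(atoms)
    pairs = []
    for i, j in tree.query_pairs(r=d_target + tol):
        if abs(np.linalg.norm(atoms[i] - atoms[j]) - d_target) > tol:
            continue
        midpoint = 0.5 * (atoms[i] + atoms[j]) / grid_spacing     # to fractional grid indices
        rho = map_coordinates(density, midpoint.reshape(3, 1), order=1)[0]
        if rho >= rho_min:                                        # crude stand-in for the peptide-plane check
            pairs.append((i, j))
    return pairs
```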
The distributions rapidly lose classification power beyond a coordinate error of about 0.7 Å. Provided the accuracy of the free-atom positioning can be correctly estimated, the appropriate error distributions prove to be a powerful tool for recognizing Cα atoms in sequence. The solution strategy for building the best chain, with choices having to be made at each Cα atom, should ideally consult what has been built so far and what result would be obtained if each possible route were followed from the current point onward, before making a decision. This idea is implicit in the formulation given above of choosing the best chain from all possibilities. When dealing only with a Cα atom model of the main chain, one must determine (1) which free atoms to use, and (2) the directed (N–C or C–N) connections between all Cα atom pairs that comprise one peptide unit. The problem of choosing a subset of atoms and their connections so as to maximize the similarity with expected geometry can therefore be formulated as an optimization problem in which the optimization variables are binary (0/1) (Fig. 2C and D). Each optimization variable c_ij represents a connection between two candidate positions: c_ij = 1 means that a directed connection from atom i to atom j is chosen, and c_ij = 0 that it is not. Constraints must be added to the problem to ensure that every point in a Cα backbone trace has at most one incoming and one outgoing connection. The problem is merely to choose which connections to turn on and which to turn off. This transforms the problem into an exercise in combinatorial optimization. This type of problem has been well studied and belongs to the NP-hard class of problems—a class of problems of such complexity that no known algorithms exist that can be guaranteed to run in polynomial time.12 If one assumes that each free atom has an average number ⟨B⟩ of free atoms to connect to (⟨B⟩ is called the branching average), and that of the N free atoms one can always build chains of length ⟨L⟩, then the number of chains is equal to the number of decisions that have to be made along the chain and is approximately proportional to ⟨B⟩^⟨L⟩. For a modest branching average of 3 and a chain length of 50, the worst-case number of chains exceeds by far the number of seconds elapsed since the Big Bang. This complexity analysis is crude but demonstrates correctly the kind of worst-case behavior to be expected. Enumerating all possible subsets and all possible connections between the elements to find the one with the highest protein-likeness is clearly a formidable task. Rather than evaluating each test chain globally as a single entity, it would be of advantage for approximation schemes to have a handle on decisions at a more local level.
12 C. H. Papadimitriou and K. Steiglitz, "Combinatorial Optimization," p. 194. Dover, Mineola, NY, 1998.
One would like to approximate each chain evaluation by a summation over smaller units, P(G(s)) = P(G({c_ij | i, j ∈ S})) ≈ Σ_u ρ_u({c_ij | i, j ∈ u ⊆ S}), where u runs over the structural units along a chain and over all possible chains, and ρ_u is a unit-based score. The obvious choice for these units would be the actual connection variables, but these have already undergone a quality assessment. Also, they are unsuitable for protein-likeness testing since they fail to capture any 3D structure. The minimum 3D structural information in terms of a Cα parameterization of the problem is provided by the use of fragments consisting of four Cα atoms. From a list of putative Cα atoms one can easily scan all possible Cα fragments of length four. These fragments can be evaluated and stored as structural building blocks for the problem at hand. The main chain can then be built by overlapping the last three atoms of each fragment with the first three of the following one. The summation of a probability-based score means that ρ_u ∝ log P(u); this can be put on a significance scale by using the log-odds ratio, ρ_u ∝ log{P(u|Cα)/P(u|R)}, where the probability has been conditioned on a Cα-atom model and on a random model R. Random here means random under the restriction that the atoms are approximately 3.8 Å apart, since this condition has already been applied at an earlier stage. For inaccurately placed free atoms the classification probability for non-Cα atoms is frequently higher than that for Cα, even for atoms that correspond to Cα positions. This would result in a local classification error, should the building be carried out only as a classification problem. But in the current optimization scheme this results only in an insignificant lowering of the overall chain score. We have developed specific heuristics based on the divide-and-conquer approach outlined above by recasting the optimization problem as a search for the longest path in a weighted graph. The weights of the connections are the scores derived from the geometry. The nonbranching nature of polypeptide chains imposes restrictions on the ideal search strategy. For a given starting point one would like to obtain a set of all single, nonbranching, longest chains. This deep probing into graph structures is accomplished by the depth-first-search (DFS) algorithm.13 The time requirements of the standard algorithm are proportional to the number of nodes and arcs in the graph. The algorithm can readily be modified to keep track internally of all found chains and the fragments used, and to check for geometric clashes.
13 R. Sedgewick, "Algorithms in C++," p. 415. Addison-Wesley, Reading, MA, 1992.
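A much-simplified sketch of the depth-first search for nonbranching chains described here follows, assuming the connections have already been reduced to a scored adjacency structure; the depth limit stands in for the quality-threshold heuristics discussed in the text.

```python
def enumerate_chains(connections, start, max_depth=50):
    """Depth-first enumeration of nonbranching chains in a scored connection graph.

    connections: {node: [(next_node, score), ...]} built from accepted fragments.
    Returns chains as (total_score, [nodes]) sorted best-first.
    """
    chains = []

    def dfs(node, path, visited, score):
        extended = False
        if len(path) < max_depth:
            for nxt, s in connections.get(node, ()):
                if nxt not in visited:          # keep the chain nonbranching and nonoverlapping
                    dfs(nxt, path + [nxt], visited | {nxt}, score + s)
                    extended = True
        if not extended:                        # dead end reached: record the full chain and backtrack
            chains.append((score, path))

    dfs(start, [start], {start}, 0.0)
    return sorted(chains, reverse=True)

conns = {0: [(1, 2.0)], 1: [(2, 1.5), (3, 0.5)], 2: [(4, 1.0)]}
print(enumerate_chains(conns, 0)[0])            # (4.5, [0, 1, 2, 4]) -- the best-scoring chain
```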
When an end node is encountered, the full chain is returned and the algorithm steps back, thereby resetting the availability of those nodes over which the routine back-traced and creating a new chain for each decision that is made. The chains are stored as a list that is returned as the result of the function call. In this manner the whole structure may be systematically searched. A full search through all possible chains can easily become intractable, and one must introduce a number of restrictions and settle for an approximate solution. We circumvent this by limiting the search depth exponentially with the average number of branching points per node. Each accepted four-Cα fragment is assigned a quality score based on the frequency of its geometry in the PDB. The search algorithm can be set up to require a minimum quality of the fragments while scanning for chains. For large problems (above 10,000 candidate positions with a branching average greater than 2) the algorithm attempts first to build chain stretches of high quality before reducing the acceptance threshold level. The high-scoring fragments are most commonly helices, followed by strands. The second measure we have taken is to limit the search depth and/or the total number of chains per node to evaluate. Our initial implementation may be seen in this framework as an extreme case of search depth equal to one.

Sequence Docking and Side-Chain Fitting
The process outlined above delivers a set of main-chain fragments (Cα atoms with the remaining main-chain atoms placed into the positions of best agreement between them). The side chain-building module has two tasks: first, to assign the main-chain fragments to the known protein sequence, and second, to build and refine side chains according to the sequence assignment. For the sequence docking, a feature vector is used that represents the possible connectivity between the free atoms in the vicinity of each Cα. If there is one free atom close to the Cα and this atom is also close to one more atom, the feature vector would be "11." If one further atom is connected, the feature would be "111," and if two atoms are connected to the last one, it would be "1112." For each of the 20 residues the full side-chain connectivity vector is known (e.g., serine is "11," valine is "12," and aspartate is "112"). By comparing each observed feature vector with all 20 full side-chain connectivity vectors, a probability is assigned to each Cα for its side chain being of a certain residue type. In this way each piece of main chain can be represented as a vector of probability vectors (an array)—each probability vector contains 20 values representing how well the observed free atoms match all possible residues.
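A toy sketch of the sequence docking that the next paragraph completes: per-residue probability vectors derived from connectivity strings are slid along the sequence, and the placement with the highest product of probabilities is chosen. The connectivity table is truncated and the probabilities are illustrative, not those used by ARP/wARP.

```python
import numpy as np

CONNECTIVITY = {"S": "11", "V": "12", "D": "112", "A": "1", "G": ""}   # truncated illustrative table

def residue_probabilities(observed, p_match=0.6):
    """Probability vector over residue types for one observed connectivity string."""
    types = list(CONNECTIVITY)
    raw = np.array([p_match if CONNECTIVITY[t] == observed else (1.0 - p_match) / (len(types) - 1)
                    for t in types])
    return types, raw / raw.sum()

def dock_fragment(observed_strings, sequence):
    """Slide the fragment's probability array along the sequence; return the best offset and all scores."""
    types = list(CONNECTIVITY)
    arrays = np.array([residue_probabilities(o)[1] for o in observed_strings])
    scores = []
    for start in range(len(sequence) - len(observed_strings) + 1):
        window = sequence[start:start + len(observed_strings)]
        scores.append(float(np.prod([arrays[k][types.index(res)] for k, res in enumerate(window)])))
    return int(np.argmax(scores)), scores

print(dock_fragment(["11", "12", "1"], "GAVSVADGVA"))   # best offset 3 -> the S-V-A stretch
```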
By sliding that array across the given protein sequence, the probability that a placement is correct is computed as the product, over all residues within the main-chain fragment, of the probability that each observed residue matches the sequence residue it is aligned with. In the next step the difference between the score (probability) for the best placement and the second best score for each fragment (a kind of z-score) is used to derive a confidence score for that placement being unique—that is, presumably correct. After the fragment with the best confidence score is docked, the corresponding sequence space is no longer available for placement of additional fragments; the confidence scores are then updated and the algorithm is iterated until either all fragments are docked or the confidence level falls below a preset threshold. At that stage every residue of all fragments is assigned to a specific residue type. In the final step the best rotamer from the Richardson rotamer database14 is built. The chosen rotamer angles are refined in real space with a target function that includes the interpolated density at atomic centers and geometric considerations (bad contacts, hydrogen bond contacts). During torsional refinement the backbone torsion angles of the residue are also allowed to vary to achieve better Cα placement. The methods here closely resemble those implemented by Jones and Thirup15 in the modeling package O and those published by Oldfield.16

Practice
Applications and Limitations of ARP/wARP
The ARP/wARP package can currently be used for the following applications.
1. Automatic construction of a protein model from diffraction data extending to a resolution of 2.5 Å (in some cases 2.7 Å) or higher and reasonable initial phase estimates from heavy atom methods or molecular replacement. The required phase quality for successful autobuilding can vary greatly and depends on the overall quality of the data and on the resolution (in general, the lower the resolution, the better the phases need to be). Sometimes, localized areas of good density in an otherwise uninterpretable map can provide a sufficient seed for the iterative model building to proceed smoothly.
2. Density modification by free atoms models, which can provide significant phase improvement for data of resolution higher than 3.0 Å, depending on the solvent content.
14 S. C. Lovell, J. M. Word, J. S. Richardson, and D. C. Richardson, Proteins 40, 389 (2000).
15 T. A. Jones and S. Thirup, EMBO J. 5, 819 (1986).
16 T. J. Oldfield, Acta Crystallogr. D Biol. Crystallogr. 57, 82 (2001).
3. Automated solvent building with ARP. This requires diffraction data to a resolution of about 2.5 Å or higher. Protocol validation with Rfree can be employed and is recommended.
4. Side-chain mutations for molecular replacement solutions. The side chain-fitting routines are part of the autobuilding procedure, but they can also be used as a stand-alone application. The side-chain fitting performs well for data of resolution higher than 3.5 Å and can be used whenever density, main chain (with residue assignment), and sequence are available.

Program Interface
We have developed a simple knowledge-based system in the form of automated scripts that set up most standard ARP/wARP applications with reasonable default parameters. These scripts take the user through the whole setup by interactive questioning. The startup script takes care of job initialization and distribution over multiple processors. A parameter file is created that then drives the individual routines of the ARP/wARP package. This can be edited if fine tuning is required. There is plenty of flexibility provided within the scripts, and these options should be experimented with first. In addition to the UNIX shell scripts, a graphical user interface (GUI) has been written using modules of the CCP4i toolbox. For details, see the documentation at www.arp-warp.org.

Example
Autobuilding typically takes about 2 to 12 h on a standard workstation as used for other crystallographic computations (depending on the size of the structure and the initial phase quality) and successfully builds about 70–95% of the structure (again dependent on resolution, phase quality, and the amount of disordered regions). The following example is a novel structure solution kindly provided by R. Meijers before publication. It shows a currently rather untypical example of how the modules of ARP/wARP can be applied to a difficult molecular replacement case. Data were collected at EMBL Outstation Hamburg beamline X11 to a resolution of 1.5 Å. The data are 98% complete and overall of good quality. The highest scoring sequence alignment showed 32% sequence identity in the overlapping regions. The model underwent a series of carefully chosen mutilations, deleting step by step those parts with the least sequence similarity until a molecular replacement solution could be picked up with AMoRe.17 Other PDB models of lower similarity were also attempted, but no solution was found. The successful
17 J. Navaza, Acta Crystallogr. A 50, 157 (1994).
search model is shown in Fig. 3A. The Cα positions of the molecular replacement solution are on average 1.5 Å away from the nearest equivalents in the final model, and less than 70% of the total number of Cα atoms were present. This model proved notoriously difficult to refine. The standard warpNtrace protocol of ARP/wARP failed, and even with customization it was not able to correct for the poor, highly biased starting phases from the molecular replacement solution. The density modification routine wARP was employed, using four independent free-atoms models to calculate a weighted average phase and figure of merit for each structure factor. The improvement gained by such averaging procedures is often not significant in terms of phase quality indicators, but it can be crucial in terms of structure solution.
Fig. 3. (A) A drawing of the model used for molecular replacement. (B) A drawing of the model autobuilt by ARP/wARP, starting from initial phases from the molecular replacement solution. (C) A drawing of the final model.
The initial density map had only about 30% correlation to the final map. The wARP map was given as a starting point to warpNtrace and subjected to 100 ARP cycles (each ARP cycle consisting of real-space density modeling by ARP and three internal reciprocal-space refinement cycles using REFMAC18 from the CCP419 suite) and autobuilding of the main chain after every 10 ARP cycles. The progress of the iterative building and phase refinement is shown in Fig. 4A and B. At the first model-building stage, merely four small chain fragments are found, with a total of 14 built residues. The ARP cycles then attempt to remove or add atoms according to its density criteria. The poor density makes this procedure rather slow in this case. The phases, however, improve gradually and the procedure takes off at cycle 61, after the sixth building cycle. The slight jump in R-factor after each autobuilding cycle is due to the rearrangement, deletion, and addition of atoms to accommodate the built main chain—the system relaxes again after a few refinement cycles. Figure 3B shows a drawing of the autobuilt model and Fig. 3C shows the final model. The Cα atoms show a mean distance to their closest Cα atoms in the autobuilt structure of 0.09 Å.
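The example relies on combining phases from four independent free-atom models into a weighted average phase and figure of merit per structure factor. The sketch below is one standard way to do this—figure-of-merit-weighted vector averaging of the phases—offered only as an illustration of the idea, not as the wARP implementation itself; array names and numbers are made up.

```python
import numpy as np

def average_phases(phases_deg, foms):
    """Combine per-model phases (degrees) and figures of merit, arrays of shape (n_models, n_reflections)."""
    phases = np.radians(np.asarray(phases_deg, dtype=float))
    vectors = np.asarray(foms, dtype=float) * np.exp(1j * phases)   # FOM-weighted unit phase vectors
    mean_vec = vectors.mean(axis=0)
    combined_phase = np.degrees(np.angle(mean_vec)) % 360.0
    combined_fom = np.abs(mean_vec)                                 # length of the averaged vector
    return combined_phase, combined_fom

phases = [[10.0, 200.0], [30.0, 170.0], [20.0, 185.0], [350.0, 190.0]]  # four free-atom models
foms = np.full((4, 2), 0.4)
print(average_phases(phases, foms))
```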
Discussion
ARP/wARP is a software suite (copyrighted by the European Molecular Biology Laboratory) based on the paradigm of viewing model building and refinement as one unified procedure for optimizing phase estimates. The current version, ARP/wARP 6.0, released in July 2002, works with density recognition-driven procedures for placing and removing atoms and is therefore limited to diffraction data extending to about 2.5 Å. The iterative cycles of density modeling by the placement of atoms, unrestrained refinement of their parameters, automated model building, and restrained refinement of the hybrid model provide a powerful means of phase refinement. One receives an almost complete protein model as a by-product. Initial phase estimates may be provided in the form of a molecular replacement solution, MIR/SIRAS/MAD/SAD phases, experimental measurements, witchcraft, or heavy atom sites alone (provided the data extend to atomic resolution). Pattern recognition techniques are a crucial element in such a procedure, and more robust algorithms for medium resolution are currently under development.
18 G. N. Murshudov, A. A. Vagin, and E. J. Dodson, Acta Crystallogr. D Biol. Crystallogr. 53, 240 (1997).
19 Collaborative Computational Project Number 4, Acta Crystallogr. D Biol. Crystallogr. 50, 760 (1994).
Fig. 4. The progress of warpNtrace over 100 ARP cycles with autobuilding after every 10 cycles. (A) The crystallographic R-factor and the free R-factor as a function of the number of ARP cycles. (B) The number of autobuilt residues in the hybrid model as a function of the number of ARP cycles.
Even with better density processing and classification algorithms, the building of a protein model will remain a complex process, and decision-making is required to enhance the state of automation. ARP/wARP is an experimental hypothesis-generating and testing procedure for placing atoms in the most likely places (according to density), and using graph searching combined with geometric comparisons against expected stereochemical parameters to determine the most likely main-chain fragments.
The iterative approach, with maximum-likelihood refinement of the current model using REFMAC at every stage, has proved to be a powerful tool for overcoming the insufficient robustness of the map interpretation routines with respect to phase quality and the inadequate use of knowledge-based decision making during model building.

Acknowledgments
This work was supported by EU Grant BIO2-CT920524/BIO4-CT96-0189 (R.J.M.). The authors thank Keith Wilson, Zbyszek Dauter, Rob Meijers, and Petrus Zwart for fruitful discussions and useful comments; Gérard Bricogne, for helpful suggestions, advice, mathematical rigor, and for generously allowing R.J.M. to write this contribution while working at Global Phasing, Ltd.; Eric Blanc, Pietro Roversi, Claus Flensburg, and Clemens Vonrhein for constructive critique on an initial draft of this manuscript; Garib Murshudov and Eleanor Dodson for help with REFMAC; and all ARP/wARP users for helpful suggestions.
[12] TEXTAL System: Artificial Intelligence Techniques for Automated Protein Model Building
By Thomas R. Ioerger and James C. Sacchettini

Introduction
Significant advances have been made toward improving many of the complex steps in macromolecular crystallography, from new crystal growth techniques, to more powerful phasing methods such as multiwavelength anomalous diffraction (MAD),1 to new computational algorithms for heavy-atom search,2–4 reciprocal-space refinement, and so on.5 However, the step of interpreting the electron density map and building an accurate model of a protein (i.e., determining atomic coordinates) from the electron density map remains one of the most difficult to improve. Currently it takes days to weeks for a human crystallographer to build a structure from an electron density map, often with the help of a 3D visualization and model-building program such as O.6 This manual process is both time-consuming and error-prone.
1 W. A. Hendrickson and C. M. Ogata, Methods Enzymol. 276, 494 (1997).
2 C. M. Weeks, G. T. DeTitta, H. A. Hauptmann, P. Thuman, and R. Miller, Acta Crystallogr. A 50, 210 (1994).
3 E. de la Fortelle and G. Bricogne, Methods Enzymol. 276, 590 (1997).
4 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 849 (1999).
5 A. T. Brünger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J.-S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, and G. L. Warren, Acta Crystallogr. D Biol. Crystallogr. 54, 905 (1998).
6 T. A. Jones, J.-Y. Zou, and S. W. Cowan, Acta Crystallogr. A 47, 110 (1991).
Even with an electron density map of high quality, model building is a long and tedious process.7 There are many sources of noise and errors that can perturb the appearance of the density map.8–10 All these effects contribute to making the density sometimes difficult to interpret. There exist several prior methods for, or related to, automated model building, such as searching fragment libraries,11,12 template convolution and other fast Fourier transform (FFT)-based approaches,13,14 the free-atom insertion method of ARP/wARP,15,16 DADI17 (which uses a real-space correlational search), X-Powerfit,18 MAID,19 MAIN,20 and molecular scene analysis.21–24 However, most of these methods have limitations. For example, fragment library searches11,12 require user intervention to pick Cα coordinates in the sequence (although backbone tracing can help, it does not reliably determine Cα locations), and methods like ARP/wARP and molecular scene analysis seem to work best only at high resolution, for example, around 2.5 Å or better. TEXTAL is a new computer program designed to build protein structures automatically from electron density maps.
7 G. J. Kleywegt and T. A. Jones, Methods Enzymol. 277, 208 (1997).
8 J. S. Richardson and D. C. Richardson, Methods Enzymol. 115, 189 (1985).
9 C. I. Branden and T. A. Jones, Nature 343, 687 (1990).
10 T. A. Jones and M. Kjeldgaard, Methods Enzymol. 277, 173 (1997).
11 T. A. Jones and S. Thirup, EMBO J. 5, 819 (1986).
12 L. Holm and C. Sander, J. Mol. Biol. 218, 183 (1991).
13 G. J. Kleywegt and T. A. Jones, Acta Crystallogr. D Biol. Crystallogr. 53, 179 (1997).
14 K. Cowtan, Acta Crystallogr. D Biol. Crystallogr. 54, 750 (1998).
15 A. Perrakis, T. K. Sixma, K. S. Wilson, and V. S. Lamzin, Acta Crystallogr. D Biol. Crystallogr. 53, 448 (1997).
16 A. Perrakis, R. Morris, and V. Lamzin, Nat. Struct. Biol. 6, 458 (1999).
17 D. J. Diller, M. R. Redinbo, E. Pohl, and W. G. J. Hol, Proteins Struct. Funct. Genet. 36, 526 (1999).
18 T. J. Oldfield, in "Crystallographic Computing 7: Proceedings from the Macromolecular Crystallography Computing School" (P. E. Bourne and K. Watenpaugh, eds.). Oxford University Press, New York, 1996.
19 D. G. Levitt, Acta Crystallogr. D Biol. Crystallogr. 57, 1013 (2001).
20 D. Turk, in "Methods in Macromolecular Crystallography" (D. Turk and L. Johnson, eds.), NATO Science Series I, Vol. 325, p. 148. Kluwer Academic, Dordrecht, The Netherlands, 2001.
21 L. Leherte, S. Fortier, J. Glasgow, and F. H. Allen, Acta Crystallogr. D Biol. Crystallogr. 50, 155 (1994).
22 K. Baxter, E. Steeg, R. Lathrop, J. Glasgow, and S. Fortier, in "Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology," p. 25. American Association for Artificial Intelligence, Menlo Park, CA, 1996.
23 L. Leherte, J. Glasgow, K. Baxter, E. Steeg, and S. Fortier, Artif. Intell. Res. 7, 125 (1997).
24 S. Fortier, A. Chiverton, J. Glasgow, and L. Leherte, Methods Enzymol. 277, 131 (1997).
It uses AI (artificial intelligence) and pattern recognition techniques to try to emulate the intuitive decision-making of experts in solving protein structures. Previously solved structures are exploited to help understand the relationship between patterns of electron density and local atomic coordinates. The method applies this, along with a number of heuristics, to predict the likely positions of atoms in a model for an uninterpreted map. TEXTAL is aimed at solving maps in the 2.5 to 3.5-Å range, which are just at the limit of human interpretability but turn out to be the majority of cases in practice for maps constructed from MAD data. TEXTAL has the potential to reduce one of the bottlenecks of high-throughput structural genomics.25 By automating the final step of model building (for noisy, medium-to-low resolution maps), less effort will be required of human crystallographers, allowing them to focus on regions of a map where the density is poor. TEXTAL will eventually be integrated with other computational methods, such as reciprocal-space refinement26 or statistical density modification,27 to iterate between building approximate models (in poor maps) and improving phases, which can then be used to produce more accurate maps and allow better models to be built. This will be implemented in the PHENIX crystallographic computing environment currently under development at the Lawrence Berkeley National Laboratory.28

Principles of Pattern Recognition and Application to Crystallography
Pattern recognition techniques can be used to mimic the way the crystallographer’s eye processes the shape of density in a region and comprehends it as something recognizable, such as a tryptophan side chain, or a sheet, or a disulfide bridge. The history of statistical pattern recognition is long, and a great deal of research, both theoretical and applied (i.e., development of algorithms), has been done in a wide range of application domains. Much work has been done in image recognition in two dimensions, such as recognizing military vehicles or geographic features in satellite images, faces, fingerprints, carpet textures, parts on an assembly line for manufacturing, and even vegetables for automated sorting29; however,
25
S. K. Burley, S. C. Almo, J. B. Bonanno, M. Capel, M. R. Chance, T. Gaasterland, D. Lin, A. Sali, W. Studier, and S. Swaminathian, Nat. Genet. 232, 151 (1999). 26 G. N. Murshudov, A. A. Vagin, and E. J. Dodson, Acta Crystallogr. D Biol. Crystallogr. 53, 240 (1997). 27 T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 56, 965 (2000). 28 P. D. Adams, R. W. Grosse-Kunstleve, L.-W. Hung, T. R. Ioerger, A. J. McCoy, N. W. Moriarty, R. J. Read, J. C. Sacchettini, and T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 58, 1948 (2002).
patterns in electron density maps are three-dimensional, and less work has been done for these kinds of problems. The basic idea behind pattern recognition (at least for supervised learning) is to ‘‘train’’ the system by giving it labeled examples in several competing categories (such as tanks vs. civilian cars and trucks). Each example is usually represented by a set of descriptive features, which are measurements derived from the data source that characterize the unique aspects of each one (e.g., color, size). Then the pattern recognition algorithm tries to find some combination of feature values that is characteristic of each category that can be used to discriminate among the categories (i.e., given a new unlabeled example, classify it by determining to which category it most likely belongs). There are many pattern recognition algorithms that have been developed for this purpose but work in different ways, including decision trees, neural networks, Bayesian classifiers, nearest neighbor learners, and support vector machines.30,31 The central problem in many pattern recognition applications is in identifying features that reflect significant similarities among the patterns. In some cases, features may be noisy (e.g., clouds obscuring a satellite photograph). In other cases, features may be truly irrelevant, such as trying to determine the quality of an employee by the color of his or her shirt. The most difficult issue of all is interaction among features, where features are present that contain information, but their relevance to the target class on an individual basis is weak, and their relationship to the pattern is recognizable only when they are looked at in combination with other features.32,33 A good example of this is the almost inconsequential value of individual pixels in an image; no single pixel can tell much about the content of an image (unless the patterns were rigidly constrained in space), yet the combination of them all contains all the information that is needed to make a classification. While some methods exist for extracting features automatically,34 currently consultation with domain experts is almost always needed to determine how to process raw data into meaningful, high-level features that are likely to have some form of correlation with 29
29 R. C. Gonzales and R. C. Woods, "Digital Image Processing." Addison-Wesley, Reading, MA, 1992.
30 T. Mitchell, "Machine Learning." McGraw-Hill, New York, 1997.
31 R. O. Duda, P. E. Hart, and D. G. Stork, "Pattern Classification." John Wiley & Sons, New York, 2001.
32 L. Rendell and R. Seshu, Comput. Intell. 6, 247 (1990).
33 G. John, R. Kohavi, and K. Pfleger, in "Proceedings of the Eleventh International Conference on Machine Learning," p. 121. Morgan Kaufmann, San Francisco, 1994.
34 H. Liu and H. Motoda, "Feature Extraction, Construction, and Selection: A Data Mining Perspective." Kluwer Academic, Dordrecht, The Netherlands, 1998.
the target classes, often making manual decisions about how to normalize, smooth, transform, or otherwise manipulate the input variables (i.e., "feature engineering").

A pattern recognition approach can be used to interpret electron density maps in the following way. First, we restrict our attention to local regions of density, which are defined as spheres of 5-Å radius* (a whole map can be thought of and modeled as a collection of overlapping 5-Å spheres). Our goal is to predict the local molecular structure (atomic coordinates) in each such region. This can be done by identifying the region with the most similar pattern of density in a database of previously solved maps, and then using the coordinates of atoms in the known region as a basis for estimating coordinates of atoms in the unknown. To facilitate this pattern-matching process, features must be extracted that characterize the patterns of density in spherical regions and can be used to recognize numerically when two regions might be similar. This idea of feature-based retrieval is illustrated in Fig. 1. Recall that electron density maps are really 3D volumetric data sets (i.e., a three-dimensional grid of density values that covers the space around and including the protein). As we describe in more detail later, features for regions can be computed from an array of local density values by methods such as calculating statistics of various orders, moments of inertia, contour properties (e.g., surface area, smoothness, connectivity), and other geometric properties of the distribution and "shape" of the density. Not all of these features are equally relevant; we use a specialized feature-weighting scheme to determine which ones are most important.

For this pattern recognition approach to work, an important requirement of the features is that they should be rotation-invariant. A feature is rotation-invariant if its value would remain constant even if the region were rotated, i.e., F(region) = F(Rot(θ, region)), where θ is an arbitrary set of rotation parameters that can be used to transform grid-point coordinates locally around the center of the region. Rotation-invariance is important for matching regions of electron density because proteins can appear in arbitrary orientations in electron density maps. If one region has a similar pattern to another, we want to detect this robustly, even if they are in different orientations. Features such as Fourier coefficient amplitudes are well known to be translation-invariant, but they are not rotation-invariant. Therefore, one of the initial challenges in the design of TEXTAL was to

* We selected 5 Å as a standard size for our patterns because (a) this is just about large enough to cover a single side chain, (b) smaller regions would lead to redundancy in the database of feature-extracted regions and fewer predicted atoms per region, and (c) larger regions would contain such complexity in their density patterns that sufficiently similar matches might not be found in even the largest of databases, i.e., they might be unique.
Fig. 1. Illustration of feature-based retrieval. A 5-Å spherical region of density is shown around a histidine residue (centered on the Cα atom). Below it is shown a hypothetical feature vector, consisting of a list of scalar values that are a function of the density pattern in the region. This feature vector may be used to search for other regions with similar patterns of density, which would presumably have a similar profile of feature values, independent of orientation.
develop a set of rotation-invariant numeric features to capture patterns in spherical regions of electron density.

Given a set of rotation-invariant features, this pattern-matching approach may be used to predict local atomic coordinates in arbitrary spherical regions throughout a density map. However, to provide some structure for the process, we subdivide the problem of map interpretation along the traditional lines of decomposition used by human crystallographers: first, we try to identify the backbone (or main chain) of the protein, modeled as linear chains of Cα atoms, and then we apply the pattern-matching process to predict the coordinates of other backbone and side-chain atoms around each Cα. The first step is accomplished by a routine called CAPRA (for "C-Alpha Pattern Recognition Algorithm"). We refer to the second step as LOOKUP, because of the use of a database of previously solved maps. Both routines use pattern recognition (though different techniques), and both rely centrally on the extraction of rotation-invariant features. CAPRA feeds the features into a neural network to predict likely locations35 of Cα atoms in a map. Then LOOKUP is run on each consecutive Cα by extracting features for the 5-Å sphere around it and using them to identify
35 T. R. Ioerger and J. C. Sacchettini, Acta Crystallogr. D Biol. Crystallogr. 58, 2043 (2002).
the most similar region of density in the database of solved maps (with features pre-extracted from Cα-centered regions in maps of known structures), from which coordinates of atoms in the new map may be estimated.

Methods
Overview of TEXTAL System

As an automated model-building system, TEXTAL takes an electron density map as input and ultimately outputs a protein model (with atomic coordinates). There are three overall stages, depicted in Fig. 2, that TEXTAL goes through to solve an uninterpreted electron density map. The first step involves tracing the main chain, which is done by the CAPRA subsystem. The output of CAPRA is a set of Cα chains—a PDB file containing several chains (multiple fragments are possible, due to breaks in the main-chain density), each of which contains one atom per residue (the predicted Cα atoms). These Cα chains are fed as input into the second stage, which we call LOOKUP. During LOOKUP, TEXTAL calculates features for each region around a predicted Cα atom and uses these features to search for regions with similar density patterns in a database of regions from maps of known structures, whose features have been calculated offline. For each Cα, LOOKUP extracts the atoms of the local residue in the best-matching region (from a PDB file) and translates and rotates them into corresponding positions in the uninterpreted map. By concatenating these transformed ATOM records, LOOKUP fills out the Cα chains with all the additional side-chain and backbone atoms; the output of LOOKUP is a complete PDB file of the residues that could be modeled. There are a variety of inconsistencies that might need to be resolved, such as getting the independent residue predictions in a chain to agree on the directionality
Fig. 2. Main stages of TEXTAL. The electron density map is traced by CAPRA to produce the backbone (Cα chains); LOOKUP fills in side-chain and backbone atoms to give an initial model (PDB file); and post-processing produces the final model.
of the backbone, refining the coordinates of backbone atoms to draw close to the ideal 3.8-Å spacing of Cα atoms, and adjusting the side-chain atoms to eliminate any steric conflicts. Also, another major postprocessing step involves correcting the identities of the residues by aligning chains into likely positions in the amino acid sequence of the protein, if known. Because TEXTAL is able to recognize amino acids only by the shape (size, structure) of their side chains, there is ambiguity that occasionally leads to prediction of incorrect amino acid identities in chains. By using a special amino acid similarity matrix that reflects the kinds of mistakes TEXTAL tends to make, we often can determine the exact identity of amino acids based on where those chains fit in the known sequence. This can be used to go back through the LOOKUP routine to select the best match of the correct type, producing a more accurate model.

CAPRA: C-Alpha Pattern Recognition Algorithm

CAPRA operates in essentially four main steps, shown in Fig. 3. First, the map is scaled so that roughly 1.0σ = 1, which is important for making patterns comparable between different maps. Then, a trace of the map is made. The trace gives a connected skeleton of pseudo-atoms (on a 0.5-Å grid) that generally goes through the center of the contours (i.e., approximating the medial axis). Note that the trace goes not only along the backbone, but also branches out into side chains (similar to the output of Bones).6 CAPRA picks a subset of the pseudo-atoms in the trace (which we refer to as "waypoints") that appear to represent Cα atoms. This central step is done in a pattern-analytic way using a neural network, described later. Finally, after deciding on a set of likely Cα atoms, CAPRA must link them together into chains. This is a difficult task because there are often breaks in the density along the main chain, as well as many false connections between contacting side chains in the density. CAPRA uses a combination of several heuristic search and analysis techniques to try to arrive at a set of reasonable Cα chains.

Tracing is done in a way similar to many other skeletonization algorithms,36,37 as follows. First, a list is built of all lattice points on a 0.5-Å grid throughout the map that are contained within a contour of some fixed threshold (we currently use a cutoff of around 0.7σ in density). This leaves hundreds of thousands of clustered points that must be reduced to a backbone. The points are removed by an iterative process, going from worst (lowest density) to best (highest density), as long as they do not create a local break in connectivity.
36 J. Greer, Methods Enzymol. 115, 206 (1985).
37 S. M. Swanson, Acta Crystallogr. D Biol. Crystallogr. 50, 695 (1994).
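As a concrete illustration, a minimal Python sketch of this density-ordered thinning is given below; it is our own schematic rendering, not the TEXTAL Tracer code. The local-connectivity test it relies on (checking the 26-neighborhood of each point) is spelled out in the paragraph that follows. Here `density` is assumed to be a 3D numpy array of σ-scaled map values on the tracing grid, and the 0.7σ cutoff follows the text.

```python
import numpy as np

def neighbor_components(mask, p):
    """Count connected components among the 26 neighbors of p that remain in mask."""
    x, y, z = p
    nbrs = {(i, j, k)
            for i in range(x - 1, x + 2)
            for j in range(y - 1, y + 2)
            for k in range(z - 1, z + 2)
            if (i, j, k) != (x, y, z) and (i, j, k) in mask}
    seen, comps = set(), 0
    for start in nbrs:
        if start in seen:
            continue
        comps += 1
        stack = [start]            # flood-fill one component inside the 3 x 3 x 3 box
        while stack:
            q = stack.pop()
            if q in seen:
                continue
            seen.add(q)
            stack.extend(r for r in nbrs if r not in seen and
                         max(abs(r[0] - q[0]), abs(r[1] - q[1]), abs(r[2] - q[2])) == 1)
    return comps

def thin_to_skeleton(density, cutoff=0.7):
    """Reduce a sigma-scaled 3D density grid to a connected skeleton of grid points."""
    pts = [tuple(int(v) for v in p) for p in zip(*np.nonzero(density > cutoff))]
    mask = set(pts)
    # remove points from lowest to highest density, unless removal would split
    # the local neighborhood into two or more disconnected pieces
    for p in sorted(pts, key=lambda q: density[q]):
        mask.discard(p)
        if neighbor_components(mask, p) > 1:
            mask.add(p)            # keep it: needed to preserve local connectivity
    return mask
```

This version favors clarity over speed; the production Tracer would need a far more efficient data structure for the hundreds of thousands of starting points.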
Fig. 3. Steps within CAPRA: scaling of the density, tracing of the map, predicting Cα locations with a neural network, and linking Cα atoms into chains.
This is evaluated by collecting the 26 surrounding points in a 3 × 3 × 3 box and preventing the elimination of the center point if it would create two or more (disconnected) components among its 26 neighbors. Generally, because the highest density occurs near the center of contour regions, the outer points are removed first, and maintaining connectivity becomes a factor only as the list of points is reduced nearly to a linear skeleton. What remains is roughly a few thousand pseudo-atoms, typically around 10 times as many as expected Cα atoms (due to the closer 0.5-Å spacing along the backbone, and the meandering of the skeleton into side chains).

To determine which of these pseudo-atoms are likely to represent true Cα atoms, CAPRA relies on pattern recognition. The goal is to learn how to associate certain characteristics in the local density pattern with an estimate of the proximity to the closest Cα. CAPRA uses a two-layer feedforward neural network (with 20 hidden units and sigmoid thresholds in each layer) to predict, for each pseudo-atom in the trace, how close it is likely to be to a true Cα (see Ioerger and Sacchettini35 for details). The inputs to the network consist of 19 feature values extracted from the region of density surrounding each pseudo-atom (Table I).
TABLE I
Rotation-Invariant Features Used in TEXTAL to Characterize Patterns of Density in Spherical Regions

Class of features: Statistical
  Mean of density: (1/n) Σ ρi
  Standard deviation: [(1/n) Σ (ρi − ρ̄)^2]^(1/2)
  Skewness: [(1/n) Σ (ρi − ρ̄)^3]^(1/3)
  Kurtosis: [(1/n) Σ (ρi − ρ̄)^4]^(1/4)

Class of features: Symmetry
  Distance to center of mass: |⟨xc, yc, zc⟩|, where xc = (1/n) Σ xi ρi, etc.

Class of features: Moments of inertia and their ratios
  Magnitudes of the primary, secondary, and tertiary moments, and the ratios of primary to secondary, primary to tertiary, and secondary to tertiary moments: compute the inertia matrix, diagonalize it, and sort the eigenvalues.

Class of features: Shape/geometry
  Minimum angle, maximum angle, and sum of angles between density spokes: compute the three distinct radial vectors that have the greatest local density summation.

(Here ρi denotes the density at grid point i of the n points in the region, and ρ̄ their mean.)
The neural network is trained by giving it examples of these feature vectors for high-density lattice points at varying distances from Cα atoms, ranging from 0 to around 6 Å, in maps of sample proteins. The weights in the network are optimized on this data set, using the well-known back-propagation algorithm.38

Given these distance predictions, the set of candidate Cα atoms (i.e., waypoints) is selected as follows. All the pseudo-atoms in the trace are ranked by their predicted distance to the nearest Cα. Then, starting from the top (smallest predicted distance), atoms are chosen as long as they are not within 2.5 Å of any previously chosen atoms. This procedure has the effect of choosing waypoints in a more-or-less random order through the protein, but the advantage is that preference is given to the pseudo-atoms with the highest scores first, since these atoms are generally most likely to be near Cα atoms; decisions on those with lower scores are put off until later. Each trace point selected as a candidate Cα has the property of being locally best, in that it has the closest predicted distance to a true Cα among its neighbors within a 2.5-Å radius. Trace points whose predicted distances to Cα atoms are greater than 3.0 Å are discarded, as they are highly unlikely to really be near Cα atoms and are almost always found in side chains.
38 G. E. Hinton, Artif. Intell. 40, 185 (1989).
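The waypoint-selection step just described is essentially a greedy non-maximum suppression on the predicted distances. The sketch below is an illustrative reading of that procedure rather than the actual CAPRA code: `trace_points` is assumed to be an array of pseudo-atom coordinates in Å and `predicted_dist` the corresponding neural-network outputs, while the 2.5-Å spacing and 3.0-Å rejection threshold come from the text.

```python
import numpy as np

def select_waypoints(trace_points, predicted_dist, min_spacing=2.5, max_pred=3.0):
    """Pick candidate C-alpha positions from trace points ranked by predicted distance."""
    pts = np.asarray(trace_points, dtype=float)
    pred = np.asarray(predicted_dist, dtype=float)
    chosen = []
    for i in np.argsort(pred):                 # smallest predicted distance first
        if pred[i] > max_pred:                 # remaining points are even worse; stop
            break
        if all(np.linalg.norm(pts[i] - pts[j]) >= min_spacing for j in chosen):
            chosen.append(i)                   # locally best within a 2.5-A radius
    return pts[chosen]
```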
Next, the putative Cα atoms are linked together into linear chains using the BUILD_CHAINS routine. There are many choices of how to link the Cα atoms together, since the underlying connectivity of the trace forms a graph with many branches and cycles. BUILD_CHAINS relies on a variety of heuristics to help distinguish between genuine connections along the main chain and false connections (e.g., between side-chain contacts). Possible links between Cα atoms are first discovered by following connected chains of trace atoms to a neighboring Cα candidate not too far away (within 5 Å), provided the path through the trace atoms does not go too close (<2 Å) to another Cα candidate. This links the Cα candidates together in a way that reflects the connectivity of the underlying trace, typically forming an over-connected graph. Next, BUILD_CHAINS attempts to identify potential pieces of secondary structure, using a geometric analysis. All connected fragments of length 7 are enumerated, and they are evaluated for their "linearity" and their "helicity." Linearity is measured by the ratio of the end-to-end distance to the sum of the lengths of the individual links, which is typically between 0.8 and 1.0 for β strands, and helicity is measured by computing the mean deviation from 95° for bond angles and +50° for torsion angles among consecutive Cα atoms within an α helix39 (a schematic version of both measures is sketched below).

Finally, all this information is brought together to make heuristic decisions about which candidate Cα atoms to link together into chains. First, the graph is divided into connected components. Then, for each separate component, a different strategy is applied based on its size. For small connected components (with up to 20 atoms), a depth-first search is used to enumerate all possible non-self-intersecting paths; they are scored with a function that reflects desirable attributes such as overall length, good neural network predictions, consistency with secondary structure, and so on, and the single path with the highest score is selected. For large connected components (with more than 20 atoms), a different strategy is used in which the links in the over-connected component are incrementally clipped down to linear chains, where no atom has more than two connections to other atoms. First, all cycles are broken at the weakest point (with the worst neural network prediction), and then individual links at three-way (or more) branch points are clipped. The links chosen for clipping are selected with a scoring function similar to the one described previously, tending to prefer clipping links to short branches (e.g., side chains) containing atoms with poor neural net scores, and tending to avoid clipping links that fall along putative secondary structures.
39 T. J. Oldfield and R. E. Hubbard, Proteins Struct. Funct. Genet. 18, 324 (1994).
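This is the sketch referred to above: both geometric scores follow directly from their definitions. `frag` is assumed to be a list of seven consecutive candidate Cα coordinates in Å; the 95° and +50° reference values come from the text, while the function names and the exact averaging over the fragment are our own choices.

```python
import numpy as np

def linearity(frag):
    """End-to-end distance divided by summed link lengths (near 1.0 for strands)."""
    frag = np.asarray(frag, dtype=float)
    links = np.linalg.norm(np.diff(frag, axis=0), axis=1)
    return float(np.linalg.norm(frag[-1] - frag[0]) / links.sum())

def _angle(a, b, c):
    """Pseudo-bond angle at b, in degrees."""
    u, v = a - b, c - b
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def _torsion(a, b, c, d):
    """Pseudo-torsion angle for four consecutive C-alpha atoms, in degrees."""
    b1, b2, b3 = b - a, c - b, d - c
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.degrees(np.arctan2(np.dot(m, n2), np.dot(n1, n2)))

def helicity(frag, ref_angle=95.0, ref_torsion=50.0):
    """Mean deviation from ideal helical C-alpha pseudo-angles (small for helices)."""
    frag = np.asarray(frag, dtype=float)
    dev = [abs(_angle(frag[i], frag[i + 1], frag[i + 2]) - ref_angle)
           for i in range(len(frag) - 2)]
    dev += [abs(_torsion(frag[i], frag[i + 1], frag[i + 2], frag[i + 3]) - ref_torsion)
            for i in range(len(frag) - 3)]
    return float(np.mean(dev))
```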
Fig. 4. Illustrations of the main steps of the CAPRA routine. (A) Close-up of tracer points in the area of a helix in CzrA (the estimated distance to the nearest true Cα is predicted for each of these points by a neural network). (B) Selected waypoints (in red), with locally minimum predicted distance to a true Cα and a minimum spacing of 2.5 Å. (C) Bonds (in purple) drawn between connected waypoints, showing links in side chains as well as the main chain. (D) Final subset of connected waypoints forming linear chains (in green).
The procedures of CAPRA are illustrated in Fig. 4, which shows (1) the pseudo-atoms of the trace, (2) waypoints selected on the basis of locally minimal distance-to-true-Cα predictions by the neural network, (3) linking of waypoints based on trace connectivity, and (4) determination of linearized substructure using geometric analysis and other information to identify plausible chains, with cycles removed and branches into side chains clipped off.

LOOKUP: The Core Pattern-Matching Routine

After constructing the main chain with CAPRA, the LOOKUP routine is used to fill in the remaining backbone and side-chain atoms. LOOKUP is called individually on each Cα in the main chain, and the local sets of predicted atoms for each residue are concatenated to produce a complete initial model. LOOKUP takes a pattern recognition approach to predicting the local coordinates of atoms in a region. It tries to find the most similar
region in a database of previously solved maps and bases its prediction of local coordinates on the positions of atoms found in the matched region.

Extraction of Features to Represent Density Patterns. The key to the matching process is the extraction of numerical features that capture and represent various aspects of the patterns of density. As mentioned before, it is important for the features to be rotation-invariant, in order to detect similarities between similar regions that might be in different orientations. Currently, for each region, TEXTAL calculates 19 different features that can be grouped into 4 general classes.40 The first class of features consists of statistical measures, such as the mean, standard deviation, skewness, and kurtosis (third- and fourth-order moments) of density in the region. These measures are clearly rotation-invariant; for example, a region of density would have the same standard deviation even if it were rotated in an arbitrary direction. A second class of features is based on moments of inertia. During feature extraction, TEXTAL calculates the inertia matrix and diagonalizes it to extract the moments of inertia as eigenvalues. These reflect aspects of the symmetry and dispersion of the density in the local region. In addition to the absolute values of the moments themselves, we also compute ratios of the moments as features. Another feature, in a class by itself, is the distance to the center of mass, which measures whether the density in the region is locally balanced or offset. Finally, there is a class of features we call "spokes." These measure geometric properties related to the shape of the density. For regions centered on Cα atoms, there are typically three spokes (or tubes) of density emanating from the center: two for the backbone and one for the side chain. We identify up to three (non-adjacent) directions in space where the weighted sum of the density in those directions is locally maximum, and then we measure the angles among these vectors (minimum, maximum, and sum). In some regions, the spokes sit relatively flat (coplanar), with about 120° between them, and in other regions, they form more of a pyramid or are drawn close together. The formal definitions of these features, 19 in all, can be found in Holton et al.40 While many other features are possible, we have found these to be sufficient. In addition, each feature can be calculated over different radii, so they are parameterized; currently, we use 3 Å, 4 Å, 5 Å, and 6 Å, so every individual feature has four distinct versions.† All these features are rotation-invariant.
40 T. R. Holton, J. A. Christopher, T. R. Ioerger, and J. C. Sacchettini, Acta Crystallogr. D Biol. Crystallogr. 56, 722 (2000).
† Note that these radii can each capture slightly different information about the density pattern in the surrounding spherical region, and they could have different sensitivity to noise.
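To make the feature definitions concrete, the sketch below computes a handful of them (the statistical class, the distance to the density-weighted center of mass, and the moments of inertia with their ratios) for a single spherical region. It is a schematic reading of the published definitions, not the TEXTAL implementation: `density` is assumed to be a σ-scaled 3D numpy array, `origin` and `spacing` describe the grid in Å, and the skewness and kurtosis follow the root-of-central-moment form given in Table I rather than the conventional normalized statistics.

```python
import numpy as np

def region_features(density, origin, spacing, center, radius=5.0):
    """A few rotation-invariant features of the density inside a sphere around `center`."""
    center = np.asarray(center, dtype=float)
    idx = np.indices(density.shape).reshape(3, -1).T           # all grid indices, (N, 3)
    xyz = np.asarray(origin, dtype=float) + idx * spacing      # Cartesian coordinates (A)
    inside = np.linalg.norm(xyz - center, axis=1) <= radius    # clarity over speed
    pts, rho = xyz[inside], density.reshape(-1)[inside]
    mean = rho.mean()

    feats = {
        "mean": float(mean),
        "std": float(rho.std()),
        "skew": float(np.cbrt(((rho - mean) ** 3).mean())),    # cube root of 3rd central moment
        "kurt": float((((rho - mean) ** 4).mean()) ** 0.25),   # 4th root of 4th central moment
    }

    # distance from the region center to the density-weighted center of mass
    com = (pts * rho[:, None]).sum(axis=0) / rho.sum()
    feats["dist_to_com"] = float(np.linalg.norm(com - center))

    # moments of inertia: eigenvalues of the density-weighted inertia tensor
    d = pts - com
    r2 = (d ** 2).sum(axis=1)
    inertia = (rho[:, None, None] *
               (r2[:, None, None] * np.eye(3) - d[:, :, None] * d[:, None, :])).sum(axis=0)
    m1, m2, m3 = sorted(np.linalg.eigvalsh(inertia), reverse=True)
    feats.update({"moment1": m1, "moment2": m2, "moment3": m3,
                  "ratio_12": m1 / m2, "ratio_13": m1 / m3, "ratio_23": m2 / m3})
    return feats
```

Each of these quantities depends only on distances and density values, so rotating the region about its center leaves them unchanged, which is the property the matching step requires.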
They are used both in CAPRA, as inputs to the neural network to predict how close a given pseudo-atom in the trace appears to be to a true Cα, and also in LOOKUP, to search the database of known regions for regions with similar patterns of density.

Searching the Region Database Using Feature Matching. The feature vectors produced by the feature-extraction process can be used to efficiently search a large database of regions from previously solved maps to find similar matches. LOOKUP implements this database-search process (illustrated in Fig. 5). First, given the coordinates of a putative Cα (from CAPRA), the feature vector representing the pattern of density in the region is calculated. Then it is compared to the feature vectors of previously solved regions in the database, and the closest match is identified. Matches are first evaluated by comparing feature vectors, i.e., to find those with minimum feature-based differences, and later by density correlation, as described below. Finally, atoms from the best-matching region are looked up in the structure (assumed to be known), and they are rotated and translated into position in the new model (around the original Cα) by applying the appropriate geometric transformations.
Fig. 5. The LOOKUP process. Input: the Cα coordinates of a new region. (1) Extract its feature vector <f1, f2, ..., fn>. (2) Calculate the weighted Euclidean distance to each feature vector in the database. (3) Sort the matches by distance. (4) Select the top K regions with the smallest distance. (5) Evaluate the density correlation of the region with each of the top K matches (optimizing rotations). (6) Re-rank the top K matches by density correlation. (7) Select the best-matching region. (8) Retrieve the coordinates of the atoms in the matched region from the database (PDB files). (9) Rotate and translate the atoms into position in the new map. Output: predicted coordinates of atoms in the new region. The database of regions is centered on Cα atoms in solved maps with known structures, with pre-extracted feature vectors; each region consists of a PDB id, residue number, coordinates of the center (Cα), and feature values.
The database of feature-extracted regions that TEXTAL uses is derived from back-transformed maps. These are generated by calculating structure factors (by Fourier transform) from a known protein structure, and then inverse-Fourier transforming them to compute the map, using only reflections up to 2.8 Å (to simulate medium-resolution maps). The proteins consist of a subset of 200 proteins from PDBSelect,41,42 which is a representative set of nonredundant, high-resolution structures (with no more than 25% pairwise homology among them). Since these back-transformed maps have minimal phase error, they contain the ideal density patterns around common small-scale motifs found in real proteins (including many variations in side-chain and backbone conformations, contacts, temperature factors, occupancies, etc.). Features are calculated for a 5-Å spherical region around each Cα atom in each protein structure for which we generated a map, producing a database with 50,000 regions. The feature-based differences between regions are measured by computing a weighted Euclidean distance between their feature vectors. The formula for the weighted Euclidean distance between two feature vectors is

dist(R1, R2) = [ Σi wi (Fi(R1) − Fi(R2))^2 ]^(1/2)
where the Fi are the features, the wi are the weights, and R1 and R2 are the regions. The larger the differences between individual feature values, the larger the overall distance will be. Regions that have similar patterns of density should have similar feature values, and hence a low feature-based distance score. During the database search, LOOKUP computes the distance between the probe (unknown) region and every (known) region in the database, and keeps a list of the top K = 400 matches with the least distance, representing potential matches. The weights wi can be used to normalize the feature differences onto a uniform scale, so larger values do not dominate unfairly. Furthermore, the weights may be biased to give higher emphasis to features that are more relevant, and to reduce the influence of less relevant or noisy features. A novel algorithm for weighting features in this way, called SLIDER, is described in Holton et al.40 and Gopal et al.43
41 U. Hobohm, M. Scharf, R. Schneider, and C. Sander, Protein Sci. 1, 409 (1992).
42 U. Hobohm and C. Sander, Protein Sci. 3, 522 (1994).
43 K. Gopal, R. Pai, T. R. Ioerger, T. Romo, and J. C. Sacchettini, in "Proceedings of the 15th Conference on Innovative Applications of Artificial Intelligence," pp. 93–100 (2003).
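In code, this first filtering stage is just a weighted nearest-neighbor query. The sketch below is illustrative only: `db_features` is assumed to be an (N, n_features) array of pre-extracted feature vectors, `db_ids` the matching region identifiers, and `weights` a vector of per-feature weights such as those produced by SLIDER; K = 400 follows the text, and the subsequent density-correlation re-ranking is not shown.

```python
import numpy as np

def weighted_distance(f1, f2, weights):
    """Weighted Euclidean distance between two feature vectors."""
    f1, f2, w = (np.asarray(a, dtype=float) for a in (f1, f2, weights))
    return float(np.sqrt(np.sum(w * (f1 - f2) ** 2)))

def lookup_candidates(query, db_features, db_ids, weights, k=400):
    """Return the K database regions with the smallest feature-based distance."""
    diffs = np.asarray(db_features, dtype=float) - np.asarray(query, dtype=float)
    dists = np.sqrt((np.asarray(weights, dtype=float) * diffs ** 2).sum(axis=1))
    order = np.argsort(dists)[:k]                           # candidates to pass on to the
    return [(db_ids[i], float(dists[i])) for i in order]    # slower density-correlation step
```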
The calculation of feature-based distance, however, is not always sufficient. For example, while two regions that have similar patterns are expected to have similar feature values, it cannot be guaranteed that regions with different patterns will necessarily have different feature values. Therefore, there could be some spurious matches to regions that are not truly similar. Hence we use this selection initially as a filter, to catch at least some similar regions. Then we must follow this up with a more computationally expensive step of evaluating the candidate matches further by calculating density correlation (this is a time-consuming operation that involves searching for the optimal rotation between the two regions; see Holton et al.40 for details). This is done for all the top K matches. They are then re-ranked according to their density correlation score. This gives a more accurate estimate of similarity, but because of the slowness of the calculation and the size of the database, we cannot afford to run it on all possible regions in the database.

Once LOOKUP has identified the region in the database that appears to have the greatest similarity to the region we are trying to model, the final step is to retrieve the coordinates of atoms from the known structure for the map from which the matching region was derived, specifically, the local side-chain and backbone atoms of the residue whose Cα is at the center of the region, and apply the appropriate transformations to place them into position in the new map. The transformation of the coordinates of each atom happens as follows: each atom from the database region is translated to the vicinity of the origin by subtracting the coordinates of the center of the region, and then the rotation matrix that gave the highest density correlation for the match is applied to put it into the appropriate orientation for the new region. Finally, the predicted atom is translated into the new region by adding the coordinates of its center. This procedure is repeated for all the atoms in each region surrounding a Cα atom identified by CAPRA. The resulting side-chain and backbone atoms are written out in the form of ATOM records in a new PDB file, which constitutes the initial, unrefined model generated by TEXTAL for the map.
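The placement of the matched atoms is a simple rigid-body transformation, sketched below under the assumption that `atoms` is an (M, 3) array of coordinates from the database region, `db_center` is the Cα at the center of that region, `rotation` is the 3 × 3 matrix found by the density-correlation search, and `new_center` is the Cα position in the map being built.

```python
import numpy as np

def place_matched_atoms(atoms, db_center, rotation, new_center):
    """Rotate and translate atoms of a matched database region into the new map."""
    local = np.asarray(atoms, dtype=float) - np.asarray(db_center, dtype=float)
    oriented = local @ np.asarray(rotation, dtype=float).T   # apply the optimal rotation
    return oriented + np.asarray(new_center, dtype=float)    # translate onto the new C-alpha
```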
Postprocessing

Once a complete initial model has been constructed (through CAPRA and LOOKUP), the third stage of TEXTAL consists of postprocessing. There are a number of postprocessing procedures that can be applied to help reduce imperfections in the initial model and improve the accuracy of the final model.

The first postprocessing step is a simple routine to fix "flipped" residues, that is, residues whose backbone atoms are going in the wrong direction with respect to their neighbors. After the backbone atoms have been retrieved by LOOKUP for all the Cα atoms in a chain, a calculation of overall directionality for the chain is performed. For each residue i in the chain (except at the termini), the angle between the Cαi–Ci vector and the Cαi–Cαi+1 vector is calculated. If the backbone for residue i is pointing "forward," angle(Ci, Cαi, Cαi+1) should be near 0° (we use a threshold of <45°). If it is pointing "backward," then angle(Ci, Cαi, Cαi−1) should be near 0°. Similarly, directionality can be estimated by computing the angle between the Cαi–Ni vector and the vector to Cαi−1 (near 0° means pointing forward) or to Cαi+1 (near 0° means pointing backward). These angles are computed all the way along the chain, and a majority vote on the most common direction is taken. Then a sweep is made back through the individual residues in the chain; if they are determined to be going in reverse of the selected direction, subsequent matches in the list of candidate regions retrieved during LOOKUP are scanned (in order of highest density correlation down) until a match is found whose backbone atoms are going in a direction consistent with the rest of the chain.

The second postprocessing step is real-space refinement.44–47 TEXTAL employs a simple form of real-space refinement (with coordinates only, not torsion angles). This procedure is not powerful enough to fix all potential problems in the model, but it does an adequate job of adjusting the spacing of backbone atoms between adjacent residues in the same chain. The procedure tries 20 random perturbations of the coordinates of each atom, up to 0.1 Å from its current position, and picks the one that appears to improve the local energy the most. The local energy function is based on the terms of the global energy function to which a specific atom contributes. The terms in the global energy function include the local density of each atom (interpolated from the map), bond distance constraints (a quadratic function of the deviation from the optimal distance), bond angle constraints, torsion angle constraints (mainly for planarity of aromatic side chains and peptide bonds in the backbone), and steric contacts (increasing exponentially if the distance between atoms drops below the sum of their van der Waals radii). After local perturbations to the atomic coordinates are made, a check of the global energy is made, and the new configuration is accepted only if the global energy has decreased. This cycle is repeated up to 100 times.
44 R. Diamond, Acta Crystallogr. A 27, 436 (1971).
45 M. S. Chapman, Acta Crystallogr. A 51, 69 (1995).
46 T. J. Oldfield, Acta Crystallogr. D Biol. Crystallogr. 57, 82 (2001).
47 D. E. Tronrud, Methods Enzymol. 277, 306 (1997).
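The coordinate-only refinement amounts to a randomized local search, which can be sketched as follows. `local_energy(coords, i)` and `global_energy(coords)` are placeholders for the energy terms listed above (density fit, bond lengths and angles, torsions, steric contacts); they are assumptions standing in for the actual TEXTAL functions, while the 0.1-Å maximum shift, 20 trials per atom, and 100 cycles follow the text.

```python
import numpy as np

def real_space_refine(coords, local_energy, global_energy,
                      n_cycles=100, n_trials=20, max_shift=0.1, seed=None):
    """Randomly perturb each atom, keep the best local move, accept cycles that
    lower the global energy."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float).copy()
    for _ in range(n_cycles):
        trial = coords.copy()
        for i in range(len(trial)):
            base = trial[i].copy()
            best_xyz, best_e = base, local_energy(trial, i)
            for _ in range(n_trials):
                direction = rng.normal(size=3)
                direction /= np.linalg.norm(direction)
                trial[i] = base + direction * rng.uniform(0.0, max_shift)
                e = local_energy(trial, i)
                if e < best_e:
                    best_xyz, best_e = trial[i].copy(), e
            trial[i] = best_xyz                 # keep the best of the 20 perturbations
        if global_energy(trial) < global_energy(coords):
            coords = trial                      # accept the cycle only if it helps globally
    return coords
```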
While initially there is no constraint on the connection between backbone atoms at adjacent residues (as they are modeled independently by LOOKUP), this procedure pulls the carbonyl C and amide N atoms of adjacent residues together to a reasonable peptide bond distance, rotates the atoms so the peptide bond is planar, and even tends to pull the neighboring Cα atoms to within about 3.8-Å spacing all the way down the chain. Other real-space refinement routines, such as that in TNT,45,47 could also conceivably be used.

The third postprocessing step, which we are currently in the process of implementing, involves correcting the identities of mislabeled amino acids. Recall that, since TEXTAL models side chains only on the basis of local patterns in the electron density, it cannot always determine the exact identity of the amino acid (e.g., for isosteric residues such as valine and threonine), and occasionally it even predicts slightly smaller or larger residues due to noise perturbing the local density pattern. However, most of the time TEXTAL outputs a residue that is at least structurally similar to the correct residue. We can correct mistakes about residue identities using knowledge of the true amino acid sequence of the protein if we know how the predicted fragment maps into this sequence. The idea is to use sequence alignment techniques to determine where each fragment maps into the true sequence; then the correct identity of each amino acid can be determined, and another scan through the list of candidates returned by LOOKUP can be used to replace the side chains with an amino acid of the correct type at each position. We use a gapped alignment algorithm,48 since TEXTAL occasionally adds or skips an extra Cα in a chain, appearing as an insertion or deletion. However, the gap parameters (gap-open and gap-extension penalties) must be specially optimized for TEXTAL, since the length distribution of gaps is different from that expected for evolutionarily related sequences. Preliminary experiments have shown that the accuracy of alignment of chains is sensitive to length (longer chains are easier to align), and that the alignments can be improved by using a special amino acid similarity matrix that reflects the types of mistakes TEXTAL tends to make (based on empirical statistics of predicting one amino acid for another; TEXTAL tends to confuse residues with similar size and shape, whereas chemical differences are irrelevant, so traditional similarity matrices, e.g., PAM or BLOSUM, could not be used). Further testing of this alignment method is in progress.

Results
Here we report the results of evaluating TEXTAL in detail on two experimental electron density maps. The first protein is CzrA, a dimeric DNA-binding transcription factor.49 CzrA contains 105 amino acids per
48 T. F. Smith and M. S. Waterman, J. Mol. Biol. 147, 195 (1981).
49 Eicken et al., in preparation (2003).
subunit, and is composed of four α helices (10 residues at the C and N termini were disordered and could not be built or refined, so the effective size is 95 residues). Three wavelengths of diffraction data were collected at the BioCARS beamline and phased by the MAD method at a resolution of 2.8 Å. After building an initial model, similarity to another protein in the PDB was discovered, and the final structure was then solved at 2.3 Å by molecular replacement. The final R-factor and Rfree values for CzrA were 0.195 and 0.249. The second protein is mevalonate kinase (MVK), an enzyme involved in isoprene biosynthesis.50 It contains 317 amino acids composed primarily of large β sheets, with a few α helices packed around the outside. The data, also collected at three wavelengths at BioCARS, were phased with selenomethionine MAD at a resolution of 2.4 Å. Both maps were solvent-flattened with DM,51 manually built using O, and refined with CNS.5 The final R-factor and Rfree values for MVK were 0.197 and 0.282.

To test the ability of TEXTAL to build models for medium-resolution maps, 2.8-Å maps were generated from both data sets by truncating the structure factors. The results presented below are from running TEXTAL on these 2.8-Å maps. In addition, we extended the boundaries of each map to cover one complete monomer (with an additional 5-Å border on each side), so the chain could be traced in its entirety (i.e., to avoid having a border of the map cut the molecule into multiple pieces appearing in separate parts of the asymmetric unit).

Tracer Results

The first steps of CAPRA involve scaling and tracing the maps. The maps must first be scaled uniformly to make the density patterns comparable between the target map and the database of previously solved maps; the outcome is that the maps are scaled roughly to a level at which 1.0σ = 1. The algorithm adjusts the scale slightly to negate the effect of different solvent proportions. The trace was constructed using discrete lattice points on a 0.5-Å grid. The resulting pseudo-atoms (tracer points) follow along the centers of the "tubes" of density and have a spacing of roughly 0.5–0.9 Å, depending on the direction. The number of tracer points for each map is shown in Table II.
50 D. Yang, L. W. Shipman, C. A. Roessner, A. I. Scott, and J. C. Sacchettini, J. Biol. Chem. 277, 11559 (2002).
51 K. Cowtan, Joint CCP4 and ESF-EACBM Newsletter on Protein Crystallography 31, 526 (1994).
TABLE II
Statistics of Tracer Output

Protein | Dimensions of map (Å) | Volume as % of ASU | No. of tracer points | No. of tracer points within 3 Å of structure | No. of true Cα atoms in structure | Ratio of tracer points to true Cα
CzrA | 56 × 58 × 39 | 65% | 3459 | 910 | 95 | 9.6
MVK | 42 × 54 × 78 | 163% | 8891 | 2784 | 317 | 8.8
Fig. 6. Examples of Tracer output for (A) a helix in CzrA and (B) three strands of a β sheet in MVK (0.7σ contour shown in yellow).
As Table II shows, the Tracer program greatly compresses the information content of an electron density map to on the order of a few thousand tracer points. We find that, on average, there are around 9 tracer atoms per residue (Fig. 6). The trace forms a compact representation of the density in the map. The core CAPRA routine, described next, takes advantage of this simplified representation by hypothesizing that any true Cα atom will be located near the trace; in fact, CAPRA picks the best local estimates of Cα positions directly from the trace itself (i.e., the waypoints).

CAPRA Results

Next, we examine the operation of the core CAPRA algorithm, which in the end produces linear chains of predicted Cα coordinates. There are four substeps within this routine:
1. Prediction of the estimated distance of each trace point to a true Cα atom (using a neural net)
2. Selection of waypoints among the tracer atoms
3. Discovery of the "structural framework" of the protein, consisting of short fragments of connected waypoints that are either linear or helical
4. Linking the structural fragments together into linear chains

Figure 7 shows the correlation between the predicted and true distances from pseudo-atoms in the trace of CzrA to true Cα atoms in the manually built structure. This scatter plot illustrates that, in general, trace atoms that are farther away from true Cα atoms tend to have higher distances predicted by the neural network (based on using rotation-invariant features of the local density pattern as input). The predictions are not perfect, but there is a clear general trend. This is sufficient to enable CAPRA to pick an initial set of waypoints out of the trace (one per residue) that are likely to be near true Cα atoms, as a basis for subsequent steps. Note that, at this point, the waypoints are restricted to the artificial 0.5-Å grid; but real-space refinement, applied at the end of the whole TEXTAL process, affords an opportunity to adjust the coordinates of the atoms to improve satisfaction of geometric constraints along the backbone, along with the fit to the density.

Given these predictions, the rest of the steps in CAPRA do a nice job of creating long Cα chains that consistently follow the backbone, excluding breaks in the density. The paths and connectivity that it chooses are often
Fig. 7. Correlation between distances predicted by the neural net and true distances from each waypoint to the nearest true Cα in CzrA.
visually consistent with the underlying structure, only rarely traversing false connections through side-chain contacts. It produces candidate Cα atoms spaced roughly 3.8 Å apart on average (although ranging from 2.5 up to 5 Å), and corresponding nearly one-to-one with true Cα atoms, leaving only a few skips or spurious insertions. For example, on a back-transformed 2.8-Å map (with perfect structure factors calculated from the known model), CAPRA was able to identify a single chain of 135 residues in length for 3NLL, an α/β protein with 137 residues. This accuracy is probably due to the fact that, even if one of the lattice points in the trace near a true Cα is mispredicted by the neural net to have a high distance, this is often compensated for by another trace point nearby that is correctly predicted. In between Cα atoms, and also off in side chains, the predicted distance of trace points to Cα atoms increases roughly as expected.

Figure 8 shows stereo views of the overall output of the CAPRA algorithm (Cα chains) for CzrA, along with the Cα trace of the manually built and refined model for comparison. Significant secondary structures can easily be seen to be captured, including several helices. When the predicted Cα chains in the model for CzrA are trimmed down to just those that cover the true structure (monomer), CAPRA is found to output 2 chains of length 65 and 25 residues that cover 94% of the molecule (90 of 95 residues). The break between the chains occurs in the region of a small β-hairpin loop that has weak density. For MVK, CAPRA outputs 10 chains covering 91% of the molecule (287 of 317 residues). Except for a short chain of length 6, the chains ranged in length from 19 to 57 residues, with a mean of 28.7. The breaks tend to occur in loop regions, rather than in regular secondary structures like helices and strands. The only significant part of the molecule that was not built was a solvent-exposed helix plus adjacent turns totaling 18 consecutive residues; this was due to weak density that broke up the connectivity in this region.

The RMSD (root-mean-square deviation) scores for the predicted Cα coordinates in the CAPRA chains, compared with their closest neighbors in the manually built models, were 0.68 and 0.84 Å for CzrA and MVK, respectively. (These RMSD scores were calculated from the final TEXTAL models, which include real-space refinement.)

LOOKUP Results

Given the Cα chains from CAPRA as input, the side-chain coordinates predicted by LOOKUP matched the local density patterns very well, and the additional (non-Cα) atoms in the backbone were also fit very well. In many cases, TEXTAL placed carbonyl oxygens in nearly the correct
Fig. 8. CAPRA chains for CzrA (in green), with the manually built model superimposed (in purple). (All stereo views in this paper are rendered as "cross-eyed" stereo. All molecular images in this paper were made with the computer graphics program Spock, written by Dr. Jon A. Christopher, http://quorum.tamu.edu.)
position, based purely on the shape of the local density around the backbone. For example, in CzrA, 65 of 90 carbonyl oxygens were predicted within 1 Å of their correct position, and 144 of 287 carbonyl oxygens in MVK were predicted within 1 Å of their correct position. On the basis of the placement of carbonyls, TEXTAL also correctly determined the directionality of both predicted chains in CzrA and all 10 chains in MVK (except the shortest one, of length 6).

The all-atom RMSD scores for the TEXTAL models built for CzrA and MVK were 0.88 and 1.00 Å, respectively, relative to the manually built and refined models (see Table III). Because LOOKUP does not know the true identities of the amino acids being modeled at this stage, it sometimes predicts a residue chemically different from the one in the true model at a given location. Hence, there is not necessarily a one-to-one correspondence between atoms in the TEXTAL-generated and true models. Therefore, these RMSD scores were calculated by pairing up atoms between the two models, choosing closest pairs first, and excluding subsequent atoms from matching with atoms already selected for existing pairs. The RMSD scores are the root-mean-square averages over these pairwise (nearest-neighbor) distances, up to a cutoff of 3.0 Å (to reduce the impact of unmodeled regions). To better illustrate the meaning of these RMSD scores, histograms of the individual pairwise atomic distance distributions for both CzrA and MVK are shown in Fig. 9. It is clear that the majority of atoms in the CzrA model are within 1 Å of a mate in the manually built structure. The distribution of pairwise distances is similar for MVK.
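Because this greedy closest-pair convention differs from a sequence-based RMSD, a short sketch may help make it concrete; the pairing rule and the 3.0-Å cutoff follow the text, while the names and data layout are our own. `model` and `reference` are assumed to be (N, 3) and (M, 3) coordinate arrays.

```python
import numpy as np

def closest_pair_rmsd(model, reference, cutoff=3.0):
    """RMSD over greedily matched closest atom pairs within a distance cutoff."""
    model = np.asarray(model, dtype=float)
    reference = np.asarray(reference, dtype=float)
    d = np.linalg.norm(model[:, None, :] - reference[None, :, :], axis=2)
    candidates = sorted((d[i, j], i, j)
                        for i in range(len(model))
                        for j in range(len(reference)) if d[i, j] <= cutoff)
    used_m, used_r, paired = set(), set(), []
    for dist, i, j in candidates:              # closest pairs first; each atom used once
        if i not in used_m and j not in used_r:
            used_m.add(i)
            used_r.add(j)
            paired.append(dist)
    return float(np.sqrt(np.mean(np.square(paired)))) if paired else float("nan")
```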
TABLE III
RMS Scores for TEXTAL Models Compared with Models Built Manually

Protein | CzrA | MVK
Cα RMS (backbone) | 0.68 Å | 0.84 Å
All-atom RMS (closest pairs) | 0.88 Å | 1.00 Å
Fig. 9. Histograms of pairwise distances between atoms in TEXTAL-generated models and nearest neighbors in the true structures for CzrA and MVK, shown in 0.2-Å buckets.
On average, the matches for each region (around each Cα) had a local density correlation of 0.85 for CzrA and 0.81 for MVK. This indicates that the LOOKUP process was able to discover reasonably similar matches in the database (first filtered by feature matching, and then further evaluated and ranked by density correlation). However, LOOKUP does not always predict the amino acid identity at each position correctly; the accuracy, in terms of percent identity, was only 26.7 and 19.3% for CzrA and MVK, respectively. Part of the reason why amino acids are not recognized more precisely is that the only information the method has available at this stage is the local pattern of the density, and many residues have similar or even identical structures. Nonetheless, LOOKUP often chooses a structurally similar residue. We divided the amino acids into the following seven categories of similarly sized residues—{AG}, {CS}, {P}, {VTDNLI}, {QEMH}, {FWY}, and {RK}‡—and computed the frequency with which the TEXTAL model had a residue in the same category as the residue in the true structure.
‡ Note that structural similarity is quite different from the usual biochemical notion of amino acid similarity for the purposes of estimating homology—polarity is irrelevant since it cannot be directly observed in low-resolution electron density patterns; so residues like Thr and Val are considered to match, for example.
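Scoring this kind of size-class agreement is straightforward once the residues of the two models have been paired. The sketch below assumes two aligned strings of one-letter residue codes (predicted and true) and uses the seven categories listed in the text.

```python
# the seven size/shape categories from the text
CATEGORIES = ["AG", "CS", "P", "VTDNLI", "QEMH", "FWY", "RK"]
CLASS_OF = {aa: i for i, group in enumerate(CATEGORIES) for aa in group}

def structural_similarity(predicted, true):
    """Fraction of paired positions whose residues fall in the same size category."""
    pairs = [(p, t) for p, t in zip(predicted, true) if p in CLASS_OF and t in CLASS_OF]
    same = sum(CLASS_OF[p] == CLASS_OF[t] for p, t in pairs)
    return same / len(pairs) if pairs else 0.0

# e.g., structural_similarity("VLIM", "TIVQ") == 1.0, since every pair shares a category
```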
For CzrA, this structural similarity score was 54.4%, and for MVK it was 46.4%. So LOOKUP is clearly being influenced by the patterns of density and inserting residues of sizes similar to those in the true structure. One of the reasons that the similarity scores are not closer to 100% may be that the density for many of the side chains of residues on the surface that project out into solvent has been truncated, due to disorder or possibly solvent flattening, making them appear smaller than they really are.

Several close-up examples of the models built by TEXTAL are shown in Figs. 10 and 11. In the first figure, a fragment of a helix and turn in CzrA is shown. TEXTAL has done a nice job of building a model to fit the density (i.e., predicting reasonable coordinates for backbone and side chains) in comparison to the true structure. Notice how well even the carbonyl oxygens in the backbone are modeled; this was accomplished purely by pattern matching (retrieval of regions with similar patterns of density), though the initial placement of carbonyls by this process is improved significantly by real-space refinement in postprocessing. In Fig. 11, a region of a β sheet in the core of MVK is shown. Figures 10 and 11 reflect visually the level of accuracy of protein models constructed by TEXTAL from electron density maps.
Fig. 10. Stereoview of a helix and turn in CzrA, contoured at around 0.7σ. The TEXTAL model is in red; the true (refined) structure is in blue.
Fig. 11. Stereoview of the TEXTAL model (red) superimposed on the manually built model of MVK (blue) for a buried portion of a β sheet.
Discussion
It is important to remember that TEXTAL is based on pattern recognition; this is both its strength and its weakness. The analysis of local patterns of electron density allows TEXTAL to make complex predictions of atomic coordinates in a simple way by analogy (database lookup), without requiring commitment to, or understanding of, a general model of the relationship between scatterers and density. Furthermore, since the features capture such general aspects of the electron density pattern (e.g., symmetry, geometry), TEXTAL is relatively insensitive to resolution, and we have obtained good results on maps with resolution as poor as 3.1 Å. However, in places where noise perturbs the density (e.g., truncated side chains, breaks in the main chain), TEXTAL is often limited because it cannot recognize anything meaningful. While TEXTAL still makes occasional mistakes, many of these are due to poor density, where even a human crystallographer would have a difficult time building in atomic coordinates.

The next major extension of TEXTAL's capabilities is to integrate model building with phase refinement. It is possible that, by building partial models for a few interpretable fragments (or even a polyalanine backbone) in a map with poor density, either standard phase combination (e.g., SigmaA52), reciprocal-space refinement techniques (as in Refmac26 or
CNS5), or more recent statistical density modification methods27 could be used to improve phase estimates, thereby generating a new map with higher-quality density. This process could then be iterated by applying another round of model building, and so on. However, the threshold in terms of phase error at which TEXTAL starts working well (producing reasonable models), and the magnitude of the improvement in the accuracy of phases by this approach, are still under investigation. The model-building capabilities of TEXTAL could also potentially be used to facilitate other methods, such as determining a mask for solvent flattening and automatically detecting noncrystallographic symmetry. All these possibilities can be explored when TEXTAL is integrated as the model-building component into the PHENIX environment,28 which will provide a common interface, scripting mechanism, and data format interconversion routines for running these kinds of experiments.

There are many additional ideas that can be or are being tested to improve TEXTAL's accuracy, such as adding new features or clustering the database. But even in its current state, TEXTAL can be of great time-saving benefit to crystallographers, by automatically building accurate models for electron density maps. Currently, access to TEXTAL is provided through a Web site (http://textal.tamu.edu:12321), where maps are uploaded and processed on our server. For medium-sized electron density maps, TEXTAL currently takes roughly 1–6 h§ to build a model, including 10–30 min for CAPRA to build the backbone Cα chains, about 1–3 h to run LOOKUP (for modeling the side chains), and the rest of the time for postprocessing (sequence alignment, real-space refinement).

Acknowledgments

This work was supported in part by grant PO1-GM63210 from the National Institutes of Health.
52 R. J. Read, Acta Crystallogr. A 42, 140 (1986).
§ These estimates are based on running on one 400-MHz processor of an SGI Origin 2000.
[13] Applications for Macromolecular Map Interpretation: X-AUTOFIT, X-POWERFIT, X-BUILD, X-LIGAND, and X-SOLVATE

By Tom Oldfield

Macromolecular crystallography has opened up a world of incredible complexity: that of protein and nucleic acid structure. The atomic detail within the 50S and 30S ribosomal complexes has recently been elucidated, showing that macromolecules with a very large number of atoms can be determined in enough detail to understand their mode of action. There are many steps along the road of structure determination. Map interpretation and model building represent a major part of this process, and require the most interactive time and expertise from the crystallographer. This chapter discusses and gives some insight into the use of five model-building applications within the program Quanta that allow various levels of automation of map interpretation and model (re)building.

The aim of interpretation of experimental crystallographic data is to place atom positions that both fit the experimental data and obey chemical restraints. During de novo structure determination it is necessary first to identify the overall fold of the molecule. Since protein and nucleic acid structures are unbranched polymers, this involves identifying a continuous pathway, or trace, that satisfies the experimental information. Sequence assignment applies the amino acid sequence of a protein to a fitted Cα trace. Adding sequence information is complicated because it is not usually possible to identify the complete chain structure of a protein from the experimental map. Model building is required because refinement is a nonlinear optimization technique that becomes trapped in minima that are not valid models of the data. Finally, the results require validation, though this is generally becoming integral to the structure determination process. Map interpretation and model building therefore represent the application of various levels of chemical knowledge to unravel information hidden within much experimental noise. This is necessary when the true crystallographic phases cannot be determined directly.

Macromolecular crystallography is likely to become a tool for biochemists, in the same way that small-molecule crystallography is a tool for chemists, although this goal has not yet been attained generally. There is much concurrent pressure to speed up the process of both hypothesis-based structure determinations and molecules solved as part of structural genomics
collaborations. Automation of each step of the crystallographic process is a goal, but it is vulnerable to the risk that less expertise, care, and time will be afforded to the process. It should be remembered that automated programs generate right and wrong answers equally fast; the difficulty is distinguishing between the two. It is therefore imperative that the software make a critical assessment of intermediate results, and hopefully produce information that is at least as good as that available now from human workers.

When a structure is determined using de novo phase information it is necessary to interpret the experimental map. This map represents a true, unbiased form of the data, as it contains no model phase. It should also be recognized that the amount of information that is measured experimentally (the intensities) constitutes roughly 20% of the total information required to define a molecular structure. The remaining information comes from phases that are determined during map interpretation itself. The entire process of macromolecular crystallography can be seen to be highly prone to error, and therefore when tracing an initial map it is critical that this process entail as little error propagation as possible. Automating map tracing is therefore difficult not only because of problems of identifying information, but also because it is absolutely necessary that the algorithm be both reliable and conservative.

Automation of model rebuilding presents different difficulties. The aim is to move the model coordinates out of false minima so that "black box" refinement (usually reciprocal-space gradient refinement using xyzB parameterization with maximum-likelihood targets) can continue to improve the quality of the coordinates. The quality of the model coordinates is usually measured by the R-factor, and this is cross-validated with the free R-factor, calculated from data not used within the structure determination process. There is of course much more to crystallographic structure solution than just the reduction of the R-factors. Refinement within model building should not supersede the "black box" methods, otherwise no advantage would be gained within the overall structure determination. Fortunately, local refinement of the model coordinates in real space, with different targets and parameters, has a very different impact on the model phases from that of reciprocal-space refinement. In fact, the two refinement stages have been found to be complementary in their action. Finally, maps generated as part of any model-building session include the free R set. This may result in the loss of the independence of this validation data, but maps calculated without the free R set are of reduced quality, making interpretation more difficult.
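For readers new to these statistics, the R-factor and the free R-factor are the same quantity computed over disjoint sets of reflections: the working set used in refinement and a held-out "free" set. A minimal sketch of the standard definition, R = Σ| |Fobs| − k|Fcalc| | / Σ|Fobs|, is given below; `f_obs` and `f_calc` are assumed to be arrays of structure-factor amplitudes related by a single linear scale factor k, and `free_flags` marks the cross-validation reflections.

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """Crystallographic R-factor between observed and calculated amplitudes."""
    f_obs = np.abs(np.asarray(f_obs, dtype=float))
    f_calc = np.abs(np.asarray(f_calc, dtype=float))
    k = f_obs.sum() / f_calc.sum()               # simple linear scale factor
    return float(np.sum(np.abs(f_obs - k * f_calc)) / np.sum(f_obs))

def r_work_and_free(f_obs, f_calc, free_flags):
    """R over the working set and over the held-out free set of reflections."""
    f_obs, f_calc = np.asarray(f_obs, dtype=float), np.asarray(f_calc, dtype=float)
    free = np.asarray(free_flags, dtype=bool)
    return r_factor(f_obs[~free], f_calc[~free]), r_factor(f_obs[free], f_calc[free])
```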
QUANTA
This chapter describes the use of the X-ray applications within the program QUANTA, with particular reference to the Q2002 release. The methods described here cannot interpret maps that contain no useful information, but they come close to the abilities of a competent crystallographer and provide various levels of automation, depending on the quality of the experimental data. The map-tracing methods described here were designed for protein crystallographic information within the resolution range 1.5 Å to 4 Å. There are a number of other algorithms available within QUANTA suitable for higher-resolution data, but these are not described.1 The tracing method uses a data-reduction technique similar to that of Greer2 to produce ridge lines (bones) for subsequent pathway analysis. The tracing limitation below 1.5 Å is due to the discrete nature of the density around atomic positions; since the tracing described here is a pathway-analysis method, it cannot work where the data are atomic. Powerful methods already exist for map interpretation of higher-resolution data.3 The tracing limitation at resolutions worse than 4 Å is the result of the loss of detail along a helix axis and across beta sheets: the bones representation of a helix is a single line, and the density for a beta sheet is (at best) a plane, so the bones data reduction is indeterminate. The model-building methods described are suitable for all resolutions better than 4 Å, with additional features that assist at both extremes of the resolution range. In particular, the inclusion of a number of different refinement protocols (grid, tree, gradient, MC) provides the user with a high degree of assistance. The X-ray tools within QUANTA were designed with the goal of requiring less interactive time of the expert crystallographer; with good data they make the process available to the nonexpert.
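The ridge-line (bones) reduction can be pictured with a minimal sketch: keep only those grid points of the density map that lie above a contour threshold and are local maxima of the density along at least two grid directions. This is an illustration of the general idea only, not the algorithm implemented in QUANTA; the grid, threshold, and neighbour test below are assumptions chosen for simplicity.

import numpy as np

def ridge_points(density, threshold):
    """Return indices of grid points above `threshold` that are strict local
    maxima of the density along at least two of the three grid axes."""
    points = []
    nx, ny, nz = density.shape
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                d = density[x, y, z]
                if d < threshold:
                    continue
                axes = 0  # number of axes along which this point is a maximum
                if d > density[x - 1, y, z] and d > density[x + 1, y, z]:
                    axes += 1
                if d > density[x, y - 1, z] and d > density[x, y + 1, z]:
                    axes += 1
                if d > density[x, y, z - 1] and d > density[x, y, z + 1]:
                    axes += 1
                if axes >= 2:
                    points.append((x, y, z))
    return points

# Toy map: a Gaussian tube along x (a crude stand-in for helical density).
# The 8000-point grid reduces to a single line of ridge points along the tube.
grid = np.zeros((20, 20, 20))
for y in range(20):
    for z in range(20):
        grid[:, y, z] = np.exp(-((y - 10) ** 2 + (z - 10) ** 2) / 4.0)
print(len(ridge_points(grid, 0.5)))  # 18 points, one per interior x slice

In practice it is this reduced skeleton, rather than the full map, that is searched for a continuous main-chain pathway, which is what makes the subsequent tracing tractable.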
1. T. J. Oldfield, Acta Cryst. D 58, 963–967 (2000).
2. J. Greer, J. Mol. Biol. 82, 279–284 (1974).
3. A. Perrakis, R. Morris, and V. S. Lamzin, Nature Struct. Biol. 6, 459–463 (1999).
4. T. J. Oldfield, Acta Cryst. D 57, 82–94 (2001).
5. T. J. Oldfield, Acta Cryst. D 57, 696–705 (2001).
6. T. J. Oldfield, Acta Cryst. D 58, 487–493 (2002).
7. T. J. Oldfield, Acta Cryst. D 59, 483–491 (2003).
Design of Program

The functionality available includes solvent-boundary mask generation and editing, bones generation and editing, map tracing, sequence assignment, model building, ligand and solvent fitting, validation, and analysis.4–7 Providing tools that carry out all these functions can result in complex menus and options. A design criterion has always been to simplify the interface as much as possible and to make the complexities of protein crystallography available to non-experts. With this in mind, validation and methods that make algorithms self-critical are fundamental before the graphical user interface (GUI) can become simplified. To some degree this design criterion has been attained, with tools that combine many ideas to provide near automation of tracing and model rebuilding.

Palettes, Tools, Dials, and Dialog Boxes

The following defines various terms used within the text:

Graphical window: The main display window in which the data are displayed.
Text port: The window where all text comments are written.
Tool: A text-labeled button that, when picked, carries out an action.
Palette: A vertical list of tools in a separate window.
Dials: A list of slider bars that control a parameter such as a view rotation angle.
Virtual dial: The ability to attach the mouse movement to a dial-bar action, so that, for example, the view rotation can be controlled with the mouse.
Dial box: A separate box that contains rotating controls affecting rotation, etc.
Dialog box: A window of buttons, number fields, and general options, usually for user-defined parameters.
Graph: A separate window containing a 2D graphical representation of data.
Table: A graphical window that contains tabular information, such as the currently open molecules, their visibility, activity, etc.

Tools found in the sub-palettes of X-BUILD and X-AUTOFIT are referred to by the palette name in upper case followed by the tool name as it appears on that palette. Tools on the main palettes of X-BUILD, X-AUTOFIT, X-SOLVATE, and X-LIGAND are referred to by the tool name alone. An option on a dialog box is shown as DIALOG NAME/